NONLINEAR BIOMEDICAL SIGNAL PROCESSING
Fuzzy Logic, Neural Networks, and New Algorithms
Volume I
IEEE Press Series on Biomedical Engineering
The focus of our series is to introduce current and emerging technologies to biomedical and electrical
engineering practitioners, researchers, and students. This series seeks to foster interdisciplinary
biomedical engineering education to satisfy the needs of the industrial and academic areas. This
requires an innovative approach that overcomes the difficulties associated with the traditional textbook
and edited collections.
Advisory Board
Editorial Board
Eric W. Abel, Dan Adam, Peter Adlassnig, Berj Bardakjian, Erol Basar, Katarzyna Blinowska, Bernadette Bouchon-Meunier, Tom Brotherton, Eugene Bruce, Jean-Louis Coatrieux, Sergio Cerutti, Maurice Cohen, John Collier, Steve Cowin, Jerry Daniels, Jaques Duchene, Walter Greenleaf, Daniel Hammer, Dennis Healy, Gabor Herman, Helene Hoffman, Donna Hudson, Yasemin Kahya, Michael Khoo, Yongmin Kim, Andrew Laine, Rosa Lancini, Swamy Laxminarayan, Richard Leahy, Zhi-Pei Liang, Jennifer Linderman, Richard Magin, Jaakko Malmivuo, Jorge Monzon, Michael Neuman, Banu Onaral, Keith Paulsen, Peter Richardson, Kris Ropella, Joseph Rosen, Christian Roux, Janet Rutledge, Wim L. C. Rutten, Alan Sahakian, Paul S. Schenker, G. W. Schmid-Schönbein, Ernest Stokely, Ahmed Tewfik, Nitish Thakor, Michael Unser, Eugene Veklerov, Al Wald, Bruce Wheeler, Mark Wiederhold, William Williams, Andy Yagle, Yuan-Ting Zhang
Volume I
Edited by
Metin Akay
Dartmouth College
Hanover, NH
IEEE Press
All rights reserved. No part of this book may be reproduced in any form,
nor may it be stored in a retrieval system or transmitted in any form,
without written permission from the publisher.
10 9 8 7 6 5 4 3 2 1
ISBN 0-7803-6011-7
IEEE Order No. PC5861
Technical Reviewers
Eric W. Abel, University of Dundee, United Kingdom
Richard D. Jones, Christchurch Hospital, Christchurch, New Zealand
Suzanne Keilson, Loyola College, MD
Kristina M. Ropella, Marquette University, Milwaukee, WI
Alvin Wald, Columbia University, New York, NY
PREFACE xiii
LIST OF CONTRIBUTORS xv
1. Introduction 1
2. Imperfect Knowledge 1
2.1. Types of Imperfections 1
2.1.1. Uncertainties 1
2.1.2. Imprecisions 2
2.1.3. Incompleteness 2
2.1.4. Causes of Imperfect Knowledge 2
2.2. Choice of a Method 2
3. Fuzzy Set Theory 4
3.1. Introduction to Fuzzy Set Theory 4
3.2. Main Basic Concepts of Fuzzy Set Theory 5
3.2.1. Definitions 5
3.2.2. Operations on Fuzzy Sets 6
3.2.3. The Zadeh Extension Principle 8
3.3. Fuzzy Arithmetic 10
3.4. Fuzzy Relations 11
4. Possibility Theory 12
4.1. Possibility Measures 12
4.2. Possibility Distributions 14
4.3. Necessity Measures 15
4.4. Relative Possibility and Necessity of Fuzzy Sets 17
5. Approximate Reasoning 17
5.1. Linguistic Variables 17
5.2. Fuzzy Propositions 19
5.3. Possibility Distribution Associated with a Fuzzy Proposition 19
5.4. Fuzzy Implications 21
5.5. Fuzzy Inferences 22
6. Examples of Applications of Numerical Methods in Biology 23
7. Conclusion 24
References 25
1. Introduction 27
1.1. Time Series Prediction and System Identification 28
1.2. Fuzzy Clustering 29
1.3. Nonstationary Signal Processing Using Unsupervised Fuzzy Clustering 29
2. Methods 30
2.1. State Recognition and Time Series Prediction Using Unsupervised Fuzzy
Clustering 31
2.2. Features Extraction and Reduction 32
2.2.1. Spectrum Estimation 33
2.2.2. Time-Frequency Analysis 33
2.3. The Hierarchical Unsupervised Fuzzy Clustering (HUFC) Algorithm 34
2.4. The Weighted Unsupervised Optimal Fuzzy Clustering (WUOFC)
Algorithm 36
2.5. The Weighted Fuzzy K-Mean (WFKM) Algorithm 37
2.6. The Fuzzy Hypervolume Cluster Validity Criteria 39
2.7. The Dynamic WUOFC Algorithm 40
3. Results 40
3.1. State Recognition and Events Detection 41
3.2. Time Series Prediction 44
4. Conclusion and Discussion 48
Acknowledgments 51
References 51
1. Introduction 98
2. Sample Stratification 98
3. Stratifying Coefficients 99
3.1. Derivation of a Modified Back-Propagation Algorithm 100
3.2. Approximation of A Posteriori Probabilities 102
4. Bootstrap Stratification 104
4.1. Bootstrap Procedures 104
4.2. Bootstrapping of Rare Events 105
4.3. Subsampling of Common Events 105
4.4. Aggregating of Multiple Neural Networks 105
4.5. The Bootstrap Aggregating Rare Event Neural Networks 105
5. Data Set Used in the Experiments 106
5.1. Genomic Sequence Data 106
5.2. Normally Distributed Data 1, 2 107
5.3. Four-Class Synthetic Data 113
6. Experimental Results 113
6.1. Experiments with Genomic Sequence Data 113
6.2. Experiments with Normally Distributed Data 1 115
6.3. Experiments with Normally Distributed Data 2 118
6.4. Experiments with Four-Class Synthetic Data 118
7. Conclusions 120
References 120
1. Introduction 122
2. Function Approximation Models and RBF Neural Networks 125
3. Reformulating Radial Basis Neural Networks 127
4. Admissible Generator Functions 129
4.1. Linear Generator Functions 129
4.2. Exponential Generator Functions 132
5. Selecting Generator Functions 133
5.1. The Blind Spot 134
5.2. Criteria for Selecting Generator Functions 136
5.3. Evaluation of Linear and Exponential Generator Functions 137
5.3.1. Linear Generator Functions 137
5.3.2. Exponential Generator Functions 138
6. Learning Algorithms Based on Gradient Descent 141
6.1. Batch Learning Algorithms 141
6.2. Sequential Learning Algorithms 143
7. Generator Functions and Gradient Descent Learning 144
8. Experimental Results 146
9. Conclusions 154
References 155
1. Introduction 158
2. Clustering Algorithms 159
2.1. Crisp and Fuzzy Partitions 160
2.2. Crisp c-Means Algorithm 162
2.3. Fuzzy c-Means Algorithm 164
2.4. Entropy-Constrained Fuzzy Clustering 165
3. Reformulating Fuzzy Clustering 168
3.1. Reformulating the Fuzzy c-Means Algorithm 168
3.2. Reformulating ECFC Algorithms 170
4. Generalized Reformulation Function 171
4.1. Update Equations 171
4.2. Admissible Reformulation Functions 173
4.3. Special Cases 173
5. Constructing Reformulation Functions: Generator Functions 174
6. Constructing Admissible Generator Functions 175
6.1. Increasing Generator Functions 176
6.2. Decreasing Generator Functions 176
6.3. Duality of Increasing and Decreasing Generator Functions 177
7. From Generator Functions to LVQ and Clustering Algorithms 178
7.1. Competition and Membership Functions 178
7.2. Special Cases: Fuzzy LVQ and Clustering Algorithms 180
7.2.1. Linear Generator Functions 180
7.2.2. Exponential Generator Functions 181
8. Soft LVQ and Clustering Algorithms Based on Nonlinear Generator
Functions 182
8.1. Implementation of the Algorithms 185
9. Initialization of Soft LVQ and Clustering Algorithms 186
9.1. A Prototype Splitting Procedure 186
9.2. Initialization Schemes 187
10. Magnetic Resonance Image Segmentation 188
1. Introduction 216
2. Methods 217
2.1. Partial Least Squares 217
2.2. Back-Propagation Networks 218
2.3. Radial Basis Function Networks 219
2.4. Spectral Data Collection and Preprocessing 220
3. Results 221
3.1. PLS 221
3.2. BP 221
3.3. RBF 222
4. Discussion 222
Acknowledgments 231
References 231
1. Introduction 233
2. Measurements and Preprocessing of the EGG 234
2.1. Measurements of the EGG 234
2.2. Preprocessing of the EGG Data 235
2.2.1. ARMA Modeling Parameters 235
2.2.2. Running Power Spectra 236
2.2.3. Amplitude (Power) Spectrum 238
INDEX 257
Fuzzy set theory derives from the fact that almost all natural classes and concepts are
fuzzy rather than crisp in nature. According to Lotfi Zadeh, who is the founder of
fuzzy logic, all the reasoning that people use every day is approximate in nature.
People work from approximate data, extract meaningful information from massive
data, and find crisp solutions. Fuzzy logic provides a suitable basis for the ability to summarize information and to extract information from masses of data.
Like fuzzy logic, the concept of neural networks was introduced approximately
four decades ago. But theoretical developments in the last decade have led to numerous
new approaches, including multiple-layer networks, Kohonen networks, and Hopfield
networks. In addition to the various structures, numerous learning algorithms have
been developed, including back-propagation, Bayesian, potential functions, and genetic
algorithms.
In Volume I, the concepts of fuzzy logic are applied, including fuzzy clustering,
uncertainty management, fuzzy set theory, possibility theory, and approximate reason-
ing for biomedical signals and biological systems. In addition, the fundamentals of
neural networks and new learning algorithms with implementations and medical appli-
cations are presented.
Chapter 1 by Bouchon-Meunier is devoted to a review of the concepts of fuzzy
logic, uncertainty management, possibility theories, and their implementations.
Chapter 2 by Geva discusses the fundamentals of fuzzy clustering and nonstation-
ary fuzzy clustering algorithms and their applications to electroencephalography and
heart rate variability signals.
Chapter 3 by Haykin gives a guided tour of neural networks, including supervised
and unsupervised learning, neurodynamic programming, and dynamically driven
recurrent neural networks.
Chapter 4 by Nazeran and Behbehani reviews in depth classical neural network implementations for the analysis of biomedical signals, including electrocardiography, electromyography, electroencephalography, and respiratory signals.
Chapter 5 by Choe et al. discusses rare event detection in genomic sequences using neural networks and sample stratification, which gives each sample in the data sequence equal influence during the learning process.
Chapters 6 and 7 by Karayiannis are devoted to the soft learning vector quantiza-
tion and clustering algorithms based on reformulation and an axiomatic approach to
reformulating radial-basis neural networks.
Metin Akay
Dartmouth College
Hanover, NH
LIST OF CONTRIBUTORS
Bernadette Bouchon-Meunier
1. INTRODUCTION
2. IMPERFECT KNOWLEDGE
2.1.1. Uncertainties
Imperfections are called uncertainties when there is doubt about the validity of a piece of information. This means that we are not certain that a statement is true or false.
2 Chapter 1 Uncertainty Management in Medical Applications
2.1.2. Imprecisions
2.1.3. Incompleteness
Uncertainties:
• Probabilities: probabilistic logic, Bayesian induction, belief networks
• Confidence degrees: propagation of degrees
• Belief and plausibility measures: evidence theory
• Possibility and necessity degrees: possibilistic logic, fuzzy logic

Imprecisions:
• Fuzzy sets: fuzzy set-based techniques
• Error intervals: interval analysis
• Validity frequencies: numerical quantifiers
information will be taken into account together with measured (numerical) levels
obtained in the future.
It is easy to see that fuzzy sets are useful for representing imprecise knowledge with
ill-defined boundaries, such as approximate values or vague characterizations (see
Figure 2). Such a representation is also compatible with the representation of some
kinds of uncertainty by means of possibility theory, which we will develop later.
Given a fuzzy set A of X with membership function f_A, two useful quantities are its height, h(A) = sup_{x∈X} f_A(x), and, for a finite universe, its cardinality, |A| = Σ_{x∈X} f_A(x). Fuzzy sets with a nonempty kernel and a height equal to 1 are called normalized.
We define the inclusion of fuzzy sets of X as a partial order such that A is included in B, denoted A ⊆ B, if and only if

∀x ∈ X, f_A(x) ≤ f_B(x)

with:
• The empty set (∀x ∈ X, f_∅(x) = 0) as smallest element
• The universe itself (∀x ∈ X, f_X(x) = 1) as greatest element
It is then necessary to define operations on fuzzy sets extending the operations on
crisp subsets of X.
The intersection of A and B (Figure 3) is defined as the fuzzy set C = A ∩ B of X with the following membership function:

∀x ∈ X, f_C(x) = min(f_A(x), f_B(x))

The union of A and B (Figure 4) is defined as the fuzzy set D = A ∪ B of X with the following membership function:

∀x ∈ X, f_D(x) = max(f_A(x), f_B(x))
This definition preserves almost all the properties available in classical set theory,
except the following ones:
• A^c ∩ A ≠ ∅
• A^c ∪ A ≠ X
which means that a class and its complement may overlap, in agreement with the basic
idea of partial membership in fuzzy set theory.
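The min/max definitions are easy to check numerically. The following Python sketch uses a small discrete universe with illustrative membership degrees (not taken from the chapter):

```python
# Fuzzy sets over a small discrete universe X, represented as dicts
# mapping each element to its membership degree in [0, 1].
A = {"x1": 0.25, "x2": 0.75, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.5, "x3": 0.0}

def intersection(f, g):
    """f_{A ∩ B}(x) = min(f_A(x), f_B(x))."""
    return {x: min(f[x], g[x]) for x in f}

def union(f, g):
    """f_{A ∪ B}(x) = max(f_A(x), f_B(x))."""
    return {x: max(f[x], g[x]) for x in f}

def complement(f):
    """f_{A^c}(x) = 1 - f_A(x)."""
    return {x: 1.0 - f[x] for x in f}

print(intersection(A, B))  # {'x1': 0.25, 'x2': 0.5, 'x3': 0.0}
print(union(A, B))         # {'x1': 0.5, 'x2': 0.75, 'x3': 1.0}
# A class and its complement may overlap: A ∩ A^c is not empty here.
print(intersection(A, complement(A)))  # {'x1': 0.25, 'x2': 0.25, 'x3': 0.0}
```

The last line illustrates the property A^c ∩ A ≠ ∅ that distinguishes fuzzy sets from crisp ones.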
In some cases, it can be interesting to lose some other properties and to use
definitions of intersection and union with a slightly different behavior.
The most common alternative operators are triangular norms (t-norms) T: [0,1] × [0,1] → [0,1] to define the intersection and triangular conorms (t-conorms) ⊥: [0,1] × [0,1] → [0,1] to define the union. These operators were introduced in probabilistic metric spaces and they are
• Commutative
• Associative
• Monotonous
• Such that T(x, 1) = x and ⊥(x, 0) = x for any x in [0, 1]
It is easy to check that min is a t-norm and max a t-conorm, which are dual in the following sense:
• 1 − T(x, y) = ⊥(1 − x, 1 − y)
• 1 − ⊥(x, y) = T(1 − x, 1 − y)
The other widely used t-norms are the product T(x, y) = xy and the so-called Lukasiewicz t-norm T(x, y) = max(x + y − 1, 0), respectively dual to the t-conorms ⊥(x, y) = x + y − xy and ⊥(x, y) = min(x + y, 1) (Figures 6 and 7).
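These t-norm/t-conorm pairs and their duality can be checked numerically. The following Python sketch is illustrative; the grid of test values is arbitrary:

```python
# Three standard t-norm / t-conorm pairs and the duality
# 1 - T(x, y) = S(1 - x, 1 - y), checked on a grid of values.
def t_min(x, y):  return min(x, y)
def s_max(x, y):  return max(x, y)
def t_prod(x, y): return x * y
def s_prob(x, y): return x + y - x * y          # probabilistic sum
def t_luka(x, y): return max(x + y - 1.0, 0.0)  # Lukasiewicz t-norm
def s_luka(x, y): return min(x + y, 1.0)        # bounded sum

grid = [i / 8 for i in range(9)]
for T, S in [(t_min, s_max), (t_prod, s_prob), (t_luka, s_luka)]:
    for x in grid:
        for y in grid:
            assert abs((1 - T(x, y)) - S(1 - x, 1 - y)) < 1e-12  # duality
        assert T(x, 1.0) == x and S(x, 0.0) == x  # boundary conditions
print("duality and boundary conditions hold on the grid")
```

The boundary conditions T(x, 1) = x and ⊥(x, 0) = x are exactly the ones required above.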
Another important concept of fuzzy set theory is the so-called Zadeh extension
principle, enabling us to extend to fuzzy values the operations or tools used in classical
set theory or mathematics. Let us explain how it works. Fuzzy sets of X are imperfect
information about the elements of X. For instance, instead of observing x precisely, we
can only perceive a fuzzy set of X with a high membership degree attached to x. The
methods that would be available to manage the information regarding X in the case of
precise information need to be adapted to be able to manage fuzzy sets.
We consider a mapping φ from a first universe X to a second one Y, which can be
identical to X. The Zadeh extension principle defines a fuzzy set B of Y from a fuzzy set
A of X, in agreement with the mapping φ, in the following way:
∀y ∈ Y, f_B(y) = sup_{x∈φ*(y)} f_A(x) if φ*(y) ≠ ∅, and f_B(y) = 0 otherwise, with φ*(y) = {x ∈ X | y = φ(x)} if φ: X → Y, and φ*(y) = {x ∈ X | y ∈ φ(x)} if φ: X → P(Y) (i.e., φ is multivalued).
If A is a crisp subset of X reduced to a singleton {a}, the Zadeh extension principle
constructs a fuzzy set B of Y reduced to φ({α}).
If φ is a one-to-one mapping, then

∀y ∈ Y, f_B(y) = f_A(φ⁻¹(y))    (14)
f_B(y) = 0 if y < 1.3, f_B(y) = 1 if y > 1.4
f_C(d) = sup_{(x,y)∈X×Y, φ(x,y)=d} min(f_A(x), f_B(y))
if {(x, y) ∈ X × Y | φ(x, y) = d} ≠ ∅    (15)
f_C(d) = 0 otherwise
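On discrete universes, the extension principle and its binary form used in fuzzy arithmetic can be sketched in a few lines of Python; the fuzzy sets "about 2" and "about 10" below are illustrative:

```python
# Zadeh extension principle on discrete universes.
# Unary form: f_B(y) = sup over {x : phi(x) = y} of f_A(x), 0 if empty.
def extend(phi, A, Y):
    out = {}
    for y in Y:
        pre = [A[x] for x in A if phi(x) == y]
        out[y] = max(pre) if pre else 0.0
    return out

# Binary form (fuzzy arithmetic):
# f_C(d) = sup over {(x, y) : phi(x, y) = d} of min(f_A(x), f_B(y)).
def extend2(phi, A, B, D):
    out = {}
    for d in D:
        vals = [min(A[x], B[y]) for x in A for y in B if phi(x, y) == d]
        out[d] = max(vals) if vals else 0.0
    return out

about2 = {1: 0.5, 2: 1.0, 3: 0.5}
Bsq = extend(lambda x: x * x, about2, Y=[1, 4, 9, 16])
print(Bsq)  # {1: 0.5, 4: 1.0, 9: 0.5, 16: 0.0}

# "about 2" + "about 10" concentrates its membership on 12:
Csum = extend2(lambda x, y: x + y, {1: 0.5, 2: 1.0}, {10: 1.0, 20: 0.5},
               D=[11, 12, 21, 22])
print(Csum)  # {11: 0.5, 12: 1.0, 21: 0.5, 22: 0.5}
```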
L(0) = R(0) = 1
L(1) = 0, or L(x) > 0 ∀x with lim_{x→∞} L(x) = 0
R(1) = 0, or R(x) > 0 ∀x with lim_{x→∞} R(x) = 0    (17)
∀(x, y) ∈ R², f_R(x, y) = min(1, (y − x)/β) if y > x, and f_R(x, y) = 0 otherwise    (21)

for a parameter β > 0 indicating the range of difference between x and y we accept.
If we have three universes X, Y, and Z, it is useful to combine fuzzy relations
between them. The max-min composition of two fuzzy relations R₁ on X × Y and R₂ on Y × Z defines a fuzzy relation R = R₁ ∘ R₂ on X × Z, with membership function:

∀(x, z) ∈ X × Z, f_R(x, z) = sup_{y∈Y} min(f_{R₁}(x, y), f_{R₂}(y, z))
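On small finite universes this composition can be sketched directly; the relation values below are illustrative:

```python
# Max-min composition of fuzzy relations stored as dicts over pairs:
# f_R(x, z) = sup over y of min(f_R1(x, y), f_R2(y, z)).
def compose(R1, R2, X, Y, Z):
    return {(x, z): max(min(R1[(x, y)], R2[(y, z)]) for y in Y)
            for x in X for z in Z}

X, Y, Z = ["x1", "x2"], ["y1", "y2"], ["z1"]
R1 = {("x1", "y1"): 0.8, ("x1", "y2"): 0.3,
      ("x2", "y1"): 0.1, ("x2", "y2"): 1.0}
R2 = {("y1", "z1"): 0.5, ("y2", "z1"): 0.9}
R = compose(R1, R2, X, Y, Z)
print(R)  # {('x1', 'z1'): 0.5, ('x2', 'z1'): 0.9}
```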
4. POSSIBILITY THEORY

4.1. Possibility Measures

Possibility theory relies on two measures defined on the subsets of a universe X, the possibility and the necessity measure. Let P(X) denote the set of subsets of the universe X.
A possibility measure is a mapping Π: P(X) → [0, 1] such that

i. Π(∅) = 0, Π(X) = 1    (23)
ii. ∀A₁ ∈ P(X), A₂ ∈ P(X), ..., Π(∪_{i=1,2,...} A_i) = sup_{i=1,2,...} Π(A_i)    (24)

In the case of a finite universe X, we can reduce ii to ii′, which is a particular case of ii for any X:

ii′. ∀A ∈ P(X), B ∈ P(X), Π(A ∪ B) = max(Π(A), Π(B))    (25)
We can interpret this measure as follows: Π(Α) represents the extent to which it is
possible that the subset (or event) A of X occurs. If Π(Α) = 0, A is impossible; if
Π(Α) = 1, A is absolutely possible.
We remark that the possibility measure of the intersection of two subsets of X is not determined from the possibility measures of these subsets. The only information we obtain from i and ii is the following:

Π(A ∩ B) ≤ min(Π(A), Π(B))

Let us remark that two subsets can be individually possible (Π(A) ≠ 0, Π(B) ≠ 0) but jointly impossible (Π(A ∩ B) = 0).
Let us consider the example of identification of a disease in a universe X = {d₁, d₂, d₃, d₄}. We suppose that it is absolutely possible to be in the presence of disease d₁ or disease d₂, disease d₃ is relatively possible, and disease d₄ is impossible, and we represent this information as follows:

Π({d₁, d₂}) = 1, Π({d₃}) = 0.5, Π({d₄}) = 0

We deduce that it is absolutely possible that the disease is one of {d₁, d₂, d₄}, since

Π({d₁, d₂, d₄}) = max(Π({d₁, d₂}), Π({d₄})) = 1

but the intersection {d₄} of these two subsets {d₁, d₂, d₄} and {d₃, d₄} of X corresponds to a possibility measure equal to 0.
We deduce from conditions i and ii that Π is monotonous with respect to the inclusion of subsets of X:

∀A ∈ P(X), B ∈ P(X), if A ⊆ B then Π(A) ≤ Π(B)

If we consider any subset A of X and its complement A^c, at least one of them is absolutely possible. This means that either an event or its complement is absolutely possible:

max(Π(A), Π(A^c)) = 1
It is easy to see that possibility measures are less restrictive than probability measures, because the possibility degree of an event is not necessarily determined by the possibility degree of its complement.
4.2. Possibility Distributions
A possibility measure Π is completely defined if we assign a coefficient in [0,1] to
any subset of X. In the example of four diseases, we need 16 coefficients to determine Π.
It is easier to define possibility degrees if we restrict ourselves to the elements (and not
to the subsets) of X and we use condition ii to deduce the other coefficients.
A possibility distribution is a mapping π: X → [0, 1] satisfying the normalization condition:

sup_{x∈X} π(x) = 1    (32)
A possibility distribution assigns a coefficient between 0 and 1 to every element of
X, for instance, to each of the four diseases d₁, d₂, d₃, d₄. Furthermore, at least one element of X is absolutely possible; for instance, one disease in {d₁, d₂, d₃, d₄} is absolutely possible. This does not mean that this disease is identified, because several of
them can be absolutely possible and other information is necessary to make a choice
between them.
Possibility measure and distribution can be associated. From a possibility distribution π, assigning a coefficient to any element of X, we construct a possibility measure assigning a coefficient to any subset of X as follows:

∀A ∈ P(X), Π(A) = sup_{x∈A} π(x)

For instance, the possibility distribution π(d₁) = 1, π(d₂) = 1, π(d₃) = 0.5, π(d₄) = 0 is compatible with the preceding possibility measure, which is not given completely as only 3 of the 16 coefficients are indicated.
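The four-disease illustration can be reproduced with a few lines of Python; the degree 0.5 for the "relatively possible" disease d3 is an illustrative choice:

```python
# Possibility measure induced by a distribution pi on X = {d1, ..., d4}:
# Pi(A) = sup over x in A of pi(x).
pi = {"d1": 1.0, "d2": 1.0, "d3": 0.5, "d4": 0.0}
assert max(pi.values()) == 1.0  # normalization: sup_x pi(x) = 1

def possibility(A):
    return max((pi[x] for x in A), default=0.0)  # Pi(empty set) = 0

print(possibility({"d1", "d2", "d4"}))  # 1.0
print(possibility({"d3", "d4"}))        # 0.5
print(possibility({"d4"}))              # 0.0 -> the intersection {d4} is impossible
```

The three printed values reproduce the point made above: two subsets can be individually possible while their intersection is impossible.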
In the case of two universes X and Y, we need to define the extent to which a pair
(x, y) is possible, with x e X and y e Y.
The joint possibility distribution π(χ, y) on the Cartesian product X x Y is defined
for any x € X and y e Y and it expresses the extent to which x and y can occur
simultaneously.
The global knowledge of X × Y through the joint possibility distribution π(x, y) provides marginal information on X and Y by means of the marginal possibility distributions, for instance on Y:

∀y ∈ Y, π_Y(y) = sup_{x∈X} π(x, y)
which satisfy:

∀(x, y) ∈ X × Y, π(x, y) ≤ min(π_X(x), π_Y(y))

When π(x, y) = min(π_X(x), π_Y(y)), this possibility distribution is the greatest among all those compatible with π_X and π_Y. Two variables respectively defined on these universes are then called non-interactive.
The effect of X on Y can also be represented by means of a conditional possibility distribution π_{Y/X} such that

∀(x, y) ∈ X × Y, π(x, y) = min(π_{Y/X}(y/x), π_X(x))
4.3. Necessity Measures

A necessity measure is a mapping N: P(X) → [0, 1] such that

iii. N(∅) = 0, N(X) = 1
iv. ∀A₁ ∈ P(X), A₂ ∈ P(X), ..., N(∩_{i=1,2,...} A_i) = inf_{i=1,2,...} N(A_i)

In the case of a finite universe X, we can reduce iv to iv′, which is a particular case of iv for any X:

iv′. ∀A ∈ P(X), B ∈ P(X), N(A ∩ B) = min(N(A), N(B))    (42)
We can interpret this measure as follows: N(A) represents the extent to which it is certain that the subset (or event) A of X occurs. If N(A) = 0, we have no certainty about the occurrence of the event A; if N(A) = 1, we are absolutely certain that A occurs.
Necessity measures are monotonous with regard to set inclusion:

∀A ∈ P(X), B ∈ P(X), if A ⊆ B then N(A) ≤ N(B)

The necessity degree of the union of subsets of X is not known precisely, but we know a lower bound:

N(A ∪ B) ≥ max(N(A), N(B))

We deduce also from iii and iv a link between the necessity measure of an event A and its complement A^c:

min(N(A), N(A^c)) = 0
We see that the information provided by a possibility measure and that provided
by a necessity measure are complementary and their properties show that they are
linked together. Furthermore, we can point out a duality between possibility and neces-
sity measures, as follows.
For a given universe X and a possibility measure Π on X, the measure N defined by

∀A ∈ P(X), N(A) = 1 − Π(A^c)

is a necessity measure, which means we need only one collection of coefficients between 0 and 1 associated with the elements of the universe X (the values of π(x)) to determine both possibility and necessity measures.
With the previous example, the certainty of the fact that the patient suffers from disease d₁ is measured by N({d₁}), and it can be deduced from the greatest possibility that the patient suffers from one of the three other diseases:

N({d₁}) = 1 − Π({d₂, d₃, d₄})
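The duality N(A) = 1 − Π(A^c) can be checked numerically on the four-disease illustration (the degree 0.5 for d3 is an illustrative choice):

```python
# Necessity by duality with possibility: N(A) = 1 - Pi(A^c).
X = {"d1", "d2", "d3", "d4"}
pi = {"d1": 1.0, "d2": 1.0, "d3": 0.5, "d4": 0.0}

def possibility(A):
    return max((pi[x] for x in A), default=0.0)

def necessity(A):
    return 1.0 - possibility(X - A)

# d1 is fully possible, yet not at all certain: d2 is also fully possible.
print(necessity({"d1"}))              # 0.0
# Excluding only the impossible d4 gives full certainty.
print(necessity({"d1", "d2", "d3"}))  # 1.0
# Pi(A) >= N(A) always holds.
print(necessity({"d1", "d2"}), possibility({"d1", "d2"}))  # 0.5 1.0
```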
The duality between Π and N also appears in the following relations, satisfied
VA e P(X):
• Π(A) ≥ N(A)
• max(Π(A), 1 − N(A)) = 1
• If N(A) ≠ 0, then Π(A) = 1
• If Π(A) ≠ 1, then N(A) = 0
These properties are important if we elicit the possibility and necessity measures
from a physician. For instance, if the physician provides first possibility degrees Π(Α)
for events A, we should not ask the physician to give necessity degrees for events with
possibility degrees strictly smaller than 1, because N(A) = 0 in this case. If the physician
provides first degrees of certainty, corresponding to values of a necessity measure, we
should not ask for possibility degrees for events with necessity degrees different from 0,
as U(A) = 1 in this case.
4.4. Relative Possibility and Necessity of Fuzzy Sets
Possibility and necessity measures have been defined for crisp subsets of X, not for
fuzzy sets. In the case in which fuzzy sets are observed, analogous measures are defined
with a somewhat different purpose, which is to compare an observed fuzzy set F to a
reference fuzzy set A of X.
The possibility of F relative to A is defined as

Π(F; A) = sup_{x∈X} min(f_F(x), f_A(x))

and the necessity of F relative to A as

N(F; A) = inf_{x∈X} max(f_A(x), 1 − f_F(x))

These coefficients are used, among other things, to measure the extent to which F is compatible with A. For example, with the universe X of real numbers, we can evaluate
the compatibility of the glycemia level F of a patient, described as "about 1.4g/l"
(Figure 2), with a reference description of the glycemia level as "abnormal" (Figure
1), by means of Tl(F; A) and N(F; A), and this information will express the extent to
which the glycemia level can be considered abnormal.
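A discretized sketch of this comparison, with illustrative membership values for "about 1.4 g/l" and "abnormal" (not the chapter's figures):

```python
# Possibility and necessity of a fuzzy set F relative to a reference A,
# on a discretized universe of glycemia levels:
#   Pi(F; A) = sup_x min(f_F(x), f_A(x))
#   N(F; A)  = inf_x max(f_A(x), 1 - f_F(x))
levels = [1.0, 1.2, 1.4, 1.6]
F = {1.0: 0.0, 1.2: 0.5, 1.4: 1.0, 1.6: 0.5}    # "about 1.4 g/l"
A = {1.0: 0.0, 1.2: 0.25, 1.4: 0.75, 1.6: 1.0}  # "abnormal"

poss = max(min(F[x], A[x]) for x in levels)
nec = min(max(A[x], 1.0 - F[x]) for x in levels)
print(poss, nec)  # 0.75 0.5
```

Here the glycemia level "about 1.4 g/l" is rather possibly, and to a lesser degree necessarily, abnormal.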
5. APPROXIMATE REASONING

5.1. Linguistic Variables

A linguistic variable is a triple (V, X, T_V), where V is a variable defined on the universe X and T_V is a set of fuzzy characterizations of V. For instance, the glycemia level of a patient can be regarded as a linguistic variable with T_V = {normal, abnormal} (Figure 8). We use the same notation for a linguistic characterization and for its representation by a fuzzy set of X. The set T_V corresponds to basic characterizations of V.
We need to construct more characterizations of V to enable efficient reasoning
from values of V.
A linguistic modifier is an operator m yielding a new characterization m(A) from any characterization A of V, in such a way that f_{m(A)} = t_m(f_A) for a mathematical transformation t_m associated with m.
For a set M of modifiers, M(T_V) denotes the set of fuzzy characterizations deduced from T_V. For example, with M = {almost, very}, we obtain M(T_V) = {very abnormal, almost normal, ...} from T_V = {normal, abnormal} (Figure 9).
Examples of linguistic modifiers are defined by the following mathematical defini-
tions, corresponding to translations or homotheties:
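As an illustration of such a transformation, a translation modifier f_{m(A)}(x) = f_A(x − ε) can be sketched in Python; the triangular shape of "normal" and the value of ε are illustrative, not the chapter's:

```python
# A linguistic modifier as a transformation t_m of the membership function.
def triangular(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def f(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return f

def translate(f, eps):
    """Modifier m with t_m: f_{m(A)}(x) = f_A(x - eps)."""
    return lambda x: f(x - eps)

normal = triangular(0.7, 1.0, 1.3)    # "normal" glycemia level (g/l)
shifted = translate(normal, 0.2)      # a translated characterization
print(normal(1.0), shifted(1.2))      # membership at the two peaks
```

A homothety would instead rescale the argument, e.g. f_A(x/k), stretching or shrinking the support of the characterization.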
Π_{V,A}(D) = sup_{x∈D} π_{V,A}(x)
N_{V,A}(D) = 1 − Π_{V,A}(D^c)
Analogously, a compound fuzzy proposition induces a possibility distribution on the Cartesian product of the universes. For instance, a fuzzy proposition such as "V is A and W is B," with V and W defined on universes X and Y, induces the following possibility distribution:

∀(x, y) ∈ X × Y, π(x, y) = min(f_A(x), f_B(y))
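For discrete universes this joint distribution is a pointwise minimum; the membership values below are illustrative:

```python
# Joint possibility distribution induced by "V is A and W is B":
# pi(x, y) = min(f_A(x), f_B(y)) on the product of the two universes.
fA = {"x1": 1.0, "x2": 0.5}
fB = {"y1": 0.25, "y2": 1.0}
pi = {(x, y): min(fA[x], fB[y]) for x in fA for y in fB}
print(pi[("x1", "y2")])  # 1.0
print(pi[("x2", "y1")])  # 0.25
```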
Figure 10 Possibility distribution of an uncertain fuzzy proposition (glycemia axis from 1.3 to 1.5 g/l).
π_{V,A} deduced from the membership function of "abnormal" given in Figure 1 and the value 0.4 of ε.
5.4. Fuzzy Implications
The use of imprecise and/or uncertain knowledge leads to reasoning in a way close
to human reasoning and different from classical logic. More particularly, we need:
To manipulate truth values intermediate between absolute truth and absolute
falsity
To use soft forms of quantifiers, more gradual than the universal and existential
quantifiers V and 3
To use deduction rules when the available information is imperfectly compatible
with the premise of the rule.
For these reasons, fuzzy logic has been introduced with the following characteristics:
• Propositions are fuzzy propositions constructed from sets L of linguistic variables and M of linguistic modifiers.
• The truth value of a fuzzy proposition belongs to [0, 1] and is given by the membership function of the fuzzy set used in the proposition.
• Fuzzy logic can be considered as an extension of classical logic; it is identical to classical logic when the propositions are based on crisp characterizations of the variables.
Let us consider a fuzzy rule "if V is A then W is B," based on two linguistic variables (V, X, T_V) and (W, Y, T_W).
A fuzzy implication associates with this fuzzy rule the membership function of a fuzzy relation R on X × Y defined as

∀(x, y) ∈ X × Y, f_R(x, y) = F(f_A(x), f_B(y))

for a function F chosen in such a way that, if A and B are singletons, then the fuzzy implication is identical to the classical implication.
There exist many definitions of fuzzy implications. Among the most commonly used are the Lukasiewicz implication f_R(x, y) = min(1, 1 − f_A(x) + f_B(y)) and the Kleene-Dienes implication f_R(x, y) = max(1 − f_A(x), f_B(y)), together with the minimum f_R(x, y) = min(f_A(x), f_B(y)) and the product f_R(x, y) = f_A(x) f_B(y) (*). The last two quantities (*) do not generalize the classical implication, but they are used in fuzzy control to manage fuzzy rules.
Generalized modus ponens is an extension of the scheme of reasoning called modus ponens in classical logic. For two propositions p and q such that p ⇒ q, if p is true, we deduce that q is true. In fuzzy logic, we use fuzzy propositions and, if p′ is true, with p′ approximately identical to p, we want to get a conclusion, even though it is not q itself. Generalized modus ponens (g.m.p.) is based on the following propositions:

Rule: if V is A then W is B
Observed fact: V is A′
Conclusion: W is B′
The membership function f_{B′} of the conclusion is computed from the available information, f_R to represent the rule and f_{A′} to represent the observed fact, by means of the so-called combination-projection rule:

∀y ∈ Y, f_{B′}(y) = sup_{x∈X} T(f_{A′}(x), f_R(x, y))

for a t-norm T, for instance min.
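A minimal sketch of g.m.p. on discrete universes, using the Lukasiewicz implication for the rule and min as the t-norm T (both are common but not the only choices; all numeric values are illustrative):

```python
# Generalized modus ponens with the combination-projection rule
# f_B'(y) = sup_x min(f_A'(x), f_R(x, y)).
X, Y = [0, 1, 2], [0, 1]
fA = {0: 0.0, 1: 1.0, 2: 0.5}    # rule premise "V is A"
fB = {0: 0.25, 1: 1.0}           # rule conclusion "W is B"
fAp = {0: 0.0, 1: 0.75, 2: 0.75} # observed fact "V is A'", close to A

def implication(a, b):
    return min(1.0, 1.0 - a + b)  # Lukasiewicz fuzzy implication

R = {(x, y): implication(fA[x], fB[y]) for x in X for y in Y}
fBp = {y: max(min(fAp[x], R[(x, y)]) for x in X) for y in Y}
print(fBp)  # {0: 0.75, 1: 0.75}
```

Because A′ only approximately matches A, the conclusion B′ is a blurred version of B rather than B itself, as the text describes.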
6. EXAMPLES OF APPLICATIONS OF NUMERICAL METHODS IN BIOLOGY

There exist many knowledge-based systems using fuzzy logic. The treatment of glycemia, for instance, has given rise to several automatic systems supporting diagnosis or helping patients to take care of their glycemia level [13-15]. An example in other domains is a system supporting the prescription of antibiotics [16].
Some general systems, which are expert system engines using fuzzy logic, have been
used to solve medical problems. MILORD is particularly interesting for its module of
expert knowledge elicitation [17] and FLOPS takes into account fuzzy numbers and
fuzzy relations and is used to process medical images in cardiology [18]. Also,
CADIAG-2 provides a general diagnosis support system using fuzzy descriptions and
also fuzzy quantifiers such as "frequently" or "rarely" [19].
The management of temporal knowledge in an imprecise framework can be solved
by using fuzzy temporal constraints, and such an approach has been used for the
management of data in cardiology [20], for instance.
It is also interesting to use fuzzy techniques for diagnosis support systems taking
into account clinical indications that are difficult to describe precisely, such as the
density, compactness, and texture of visual marks. Such systems have been proposed
for the diagnosis of hormone disorders [21] or the analysis of symptoms of patients
admitted to a hospital [22].
In medical image processing, problems of pattern identification are added to the
difficulty in eliciting precise and certain rules from specialists, even though they are able
to make a diagnosis from an image. A system for the analysis of microcalcifications in
mammographic images has been proposed [23], a segmentation method based on fuzzy
logic has been described [24], and the fusion of cranial magnetic resonance images has been explained [25].
Databases can also be explored by means of imprecise queries, and an example of
an approach to this problem using fuzzy concepts has been proposed [26].
In this section, we have listed the main directions in using fuzzy logic in the
construction of automatic systems in medicine on the basis of existing practical appli-
cations. This list is obviously not exhaustive. More applications are discussed elsewhere
[27].
7. CONCLUSION
We have presented the main problems concerning the management of uncertainty and
imprecision in automatic systems, especially in medical applications. We have intro-
duced methodologies that enable us to cope with these imperfections.
We have not developed evidence theory, also called Dempster-Shafer theory,
which concerns the management of degrees of belief assigned to the occurrence of
events. The main interest lies in the combination rule introduced by Dempster that
provides a means of aggregating information obtained from several sources.
Another methodology used in medical applications is the construction of causal networks, generally represented as graphs whose vertices are associated with situations, symptoms, or diseases. The arcs carry probabilities of occurrence of events from one vertex to another and enable us to update probabilities of hypotheses when new information is received, or to point out dependences between elements.
As we focused on methods for dealing with imprecisions, let us point out the
reasons for their importance [1,2]: fuzzy set and possibility theory are of interest
when at least one of the following problems occurs:
• We have to deal with imperfect knowledge.
• Precise modeling of a system is difficult.
• We have to cope with both uncertain and imprecise knowledge.
The number of medical applications developed since the 1970s justifies the devel-
opment we have presented.
REFERENCES
[1] B. Bouchon-Meunier, La logique floue, Que Sais-Je?, 2nd ed., No. 2702. Paris: Presses Universitaires de France, 1994.
[2] B. Bouchon-Meunier, La logique floue et ses applications. Paris: Addison-Wesley, 1995.
[3] B. Bouchon-Meunier and H. T. Nguyen, Les incertitudes dans les systemes intelligents. Paris:
Presses Universitaires de France, 1996.
[4] D. Dubois and H. Prade, Fuzzy Sets and Systems, Theory and Applications. New York:
Academic Press, 1980.
[5] D. Dubois and H. Prade, Theorie des possibilites, applications ά la representation des con-
naissances en informatique, 2nd ed. Paris: Masson, 1987.
[6] D. Dubois, H. Prade, and R. R. Yager, Readings in Fuzzy Sets for Intelligent Systems. San
Mateo, CA: Morgan Kaufmann, 1993.
[7] G. Klir and T. Folger, Fuzzy Sets, Uncertainty and Information. Englewood Cliffs, NJ:
Prentice Hall, 1988.
[8] L. Sombe, Raisonnements sur des informations incompletes en intelligence artificielle: Comparaison de formalismes à partir d'un exemple. Toulouse: Editions Teknea, 1989.
[9] R. E. Neapolitan, Probabilistic Reasoning in Expert Systems. New York: Wiley, 1994.
[10] R. R. Yager and D. P. Filev, Essentials of Fuzzy Modeling and Control. New York: Wiley, 1994.
[11] L. A. Zadeh, Fuzzy sets. Information Control 8: 338-353, 1965.
[12] L. A. Zadeh, Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst. 1: 3-28, 1978.
[13] G. Soula, B. Vialettes, and J. L. San Marco, PROTIS, a fuzzy deduction rule system:
Application to the treatment of diabetes. Proceedings MEDINFO 83, IFIP-IMIA, pp.
177-187, Amsterdam, 1983.
[14] P. Y. Glorennec, H. Pircher, and J. P. Hespel, Fuzzy logic control of blood glucose.
Proceedings International Conference IPMU, pp. 916-920, Paris, 1994.
[15] J. C. Buisson, H. Farreny, and H. Prade, Dealing with imprecision and uncertainty in the
expert system DIABETO-III. Actes CIIAM-86, pp. 705-721, Hermes, 1986.
26 Chapter 1 Uncertainty Management in Medical Applications
[16] G. Palmer and B. Le Blanc, MENTA/MD: Moteur decisionnel dans le cadre de la logique
possibiliste. Actes 3emes Journees Nationales sur les Applications des Ensembles Flous, pp.
61-68, Nimes, 1993.
[17] R. Lopez de Mantaras, J. Agusti, E. Plaza, and C. Sierra, Milord: A fuzzy expert system
shell. In Fuzzy Expert Systems, A. Kandel, ed., pp. 213-223, Boca Raton, FL: CRC Press,
1992.
[18] J. J. Buckley, W. Siler, and D. Tucker, A fuzzy expert system. Fuzzy Sets Syst. 20: 1-16, 1986.
[19] K. P. Adlassnig and G. Kolarz, CADIAG-2: Computer-assisted medical diagnosis using
fuzzy subsets. In Approximate Reasoning in Decision Analysis, M. M. Gupta and E. Sanchez,
eds., pp. 219-247. Amsterdam: North Holland.
[20] S. Barro, A. Bugarin, P. Felix, R. Ruiz, R. Marin, and F. Palacios, Fuzzy logic applications
in cardiology: Study of some cases. Proceedings International Conference IPMU, pp. 885-
891, Paris, 1994.
[21] E. Binaghi, M. L. Cirla, and A. Rampini, A fuzzy logic based system for the quantification
of visual inspection in clinical assessment. Proceedings International Conference IPMU, pp.
892-897, Paris, 1994.
[22] D. L. Hudson and M. E. Cohen, The role of approximate reasoning in a medical expert
system. In Fuzzy Expert Systems, A. Kandel, ed. Boca Raton, FL: CRC Press, 1992.
[23] S. Bothorel, B. Bouchon-Meunier, and S. Muller, A fuzzy logic-based approach for semi-
ological analysis of microcalcifications in mammographic images. Int. J. Intell. Syst., 1997.
[24] P. C. Smits, M. Mari, A. Teschioni, S. Dellepine, and F. Fontana, Application of fuzzy
methods to segmentation of medical images. Proceedings International Conference IPMU,
pp. 910-915, Paris, 1994.
[25] I. Bloch and H. Maitre, Fuzzy mathematical morphology. Ann. Math. Artif. Intell. 9:III-IV,
1993.
[26] M. C. Jaulent and A. Yang, Application of fuzzy pattern matching to the flexible interroga-
tion of a digital angiographies database. Proceedings International Conference IPMU,
904-909, Paris, 1994.
[27] M. Cohen and D. Hudson, eds., Comparative Approaches to Medical Reasoning. Singapore:
World Scientific, 1995.
[28] R. R. Yager, S. Ovchinnikov, R. M. Tong, and H. T. Nguyen, eds., Fuzzy Sets and
Applications, Selected Papers by L. A. Zadeh. New York: Wiley, 1987.
[29] H. J. Zimmermann, Fuzzy Set Theory and Its Applications. Dordrecht: Kluwer, 1985.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.
Chapter 2
APPLICATIONS OF FUZZY CLUSTERING TO BIOMEDICAL SIGNAL PROCESSING

Amir B. Geva
1. INTRODUCTION
State recognition (diagnosis) and event prediction (prognosis) are important tasks in
biomedical signal processing. Examples can be found in tachycardia detection from
electrocardiogram (ECG) signals, epileptic seizure prediction from an electroencepha-
logram (EEG) signal, and prediction of vehicle drivers falling asleep from both signals.
The problem generally treats a set of ordered measurements of the system behavior and
asks for recognition of temporal patterns that may forecast an event or a transition
between two different states of the biological system.
Applying clustering methods to continuously sampled measurements in quasi-sta-
tionary conditions is useful for grouping discontinuous related temporal patterns. Since
the input patterns are time series, a similar series of events that lead to a similar result
would be clustered together. The switches from one stationary state to another, which
are usually vague and not focused on any particular time point, are naturally treated by
means of fuzzy clustering. In such cases, an adaptive selection of the number of clusters
(the number of underlying processes, or states, in the time series) can overcome the
general nonstationary nature of real-life time series.
The method includes the following steps: (0) rearrangement of the time series into
temporal patterns for the clustering procedure, (1) dynamic state recognition and event
detection by unsupervised fuzzy clustering, (2) system modeling using the noncontin-
uous temporal patterns of each cluster, and (3) time series prediction by means of
similar past temporal patterns from the same cluster of the last temporal pattern.
The prediction task can be simplified by decomposing the time series into separate
scales of wavelets and predicting each scale separately. The wavelet transform provides
an interpretation of the series structures and information about the history of the series,
using fewer coefficients than other methods.
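The scale-by-scale treatment can be illustrated with a hand-rolled Haar split (a sketch only; the chapter does not commit to a particular wavelet family, and the function names here are invented):

```python
import numpy as np

def haar_scales(s, levels):
    """Separate a series into per-scale Haar detail signals plus a coarse
    trend, so each scale can be predicted on its own and the forecasts
    recombined.  len(s) must be divisible by 2**levels."""
    a = np.asarray(s, dtype=float)
    details = []
    for _ in range(levels):
        details.append((a[0::2] - a[1::2]) / 2.0)   # fluctuation at this scale
        a = (a[0::2] + a[1::2]) / 2.0               # smoothed (low-pass) part
    return details, a

def haar_reconstruct(details, approx):
    """Invert haar_scales: x_even = a + d and x_odd = a - d at every level."""
    a = approx
    for d in reversed(details):
        up = np.empty(2 * len(a))
        up[0::2] = a + d
        up[1::2] = a - d
        a = up
    return a
```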
The algorithm suggested for the clustering is a recursive algorithm for hierarchical-
fuzzy partitioning. The algorithm benefits from the advantages of hierarchical cluster-
ing while maintaining the rules of fuzzy sets. Each pattern can have a nonzero member-
ship in more than one data subset in the hierarchy. Feature extraction and reduction is
optionally reapplied for each data subset. A "natural" and feasible solution to the
cluster validity problem is suggested by combining hierarchical and fuzzy concepts.
The algorithm is shown to be effective for a variety of data sets with a wide dynamic
range of both covariance matrices and number of members in each class. The new
method is demonstrated for well-known time series benchmarks and is applied to
state recognition of the recovery from exercise by the heart rate signal and to the
forecasting of biomedical events such as generalized epileptic seizures from the EEG
signal.
as the multilayer perceptron, are likely to fail to represent the underlying input-output
relations [17].
means of fuzzy clustering. In such cases, an adaptive selection of the number of clusters
(the number of underlying semistationary processes in the signal) can overcome the
general nonstationary nature of biomedical signals.
The time series prediction task can be treated by combining fuzzy clustering tech-
niques with common methods. The deterministic versus stochastic (DVS) algorithm for
time series prediction, which was successfully demonstrated by Casdagli and Weigend
in the Santa Fe competition [3], is an important and relevant approach. In the DVS
algorithm k sets of samples from a nonuniform time scale are used with the present
window, according to an affinity criterion, as a precursory set for a future element. The
prediction phase of the algorithm can be regarded as a k-nearest-neighbor clustering of
the time series elements to groups that reflect certain "states" of the series. The idea is
that we expect future results of similar states to be similar as well. In the same way that
the k-nearest neighbor was used, unsupervised fuzzy clustering methods can be implemented
so as to provide an alternative method for time series prediction [15]. This
approach is expected to provide superior results in quasi-stationary conditions, where
a relatively small number of stationary distributions control the behavior of the series
and unexpected switches between them are observed. Again, an unsupervised selection
of the number of clusters and of the number of patterns in each cluster (the parameter
k, which is fixed for all the clusters in the DVS algorithm) can overcome the nonsta-
tionarity of the signals and improve the prediction results.
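The nearest-neighbor view of local prediction sketched above can be illustrated with a toy forecaster (this is an illustrative sketch in the spirit of the DVS idea, not the DVS algorithm itself; the function name `knn_predict` is invented, while N, d, and k follow the chapter's notation):

```python
import numpy as np

def knn_predict(series, N, d, k):
    """Predict the sample d steps past the end of `series` by averaging the
    continuations of the k past windows most similar to the last window."""
    s = np.asarray(series, dtype=float)
    L = len(s)
    M = L - N - d + 1                        # windows whose target is observed
    windows = np.stack([s[i:i + N] for i in range(M)])
    targets = np.array([s[i + N - 1 + d] for i in range(M)])
    last = s[L - N:]                         # the current (most recent) window
    dists = np.linalg.norm(windows - last, axis=1)
    nearest = np.argsort(dists)[:k]          # the k most similar past windows
    return targets[nearest].mean()           # average of their continuations

# A noiseless periodic series: similar past windows all continue identically.
series = [0.0, 1.0, 0.0, 1.0] * 10
print(knn_predict(series, N=3, d=1, k=3))
```

Replacing the fixed k by an unsupervised choice of cluster membership is exactly the refinement the text proposes.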
2. METHODS
The general scheme of the hybrid algorithm for state recognition and time series pre-
diction using unsupervised fuzzy clustering is presented in Figure 1. The method
includes the following steps:
0. Rearrangement of the time series into temporal patterns for the clustering
procedure
1. Dynamic state recognition and event detection by unsupervised fuzzy clustering
2. Modeling and system identification of the noncontinuous temporal patterns of
each cluster
3. Time series prediction by means of similar past temporal patterns from the same
cluster of the last temporal pattern
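Step 0 can be sketched as follows (a minimal illustration with 0-based indexing; the helper name `temporal_patterns` is invented for the example):

```python
import numpy as np

def temporal_patterns(s, N, d):
    """Rearrange a time series s_1..s_L into the M = L - N - d + 2 overlapping
    temporal patterns x_i = (s_i, ..., s_{i+N-1}) used as clustering inputs,
    paired with their d-step-ahead targets s_{i+N-1+d} (which exist for the
    first M - 1 patterns only)."""
    s = np.asarray(s, dtype=float)
    L = len(s)
    M = L - N - d + 2                        # number of patterns, as in the text
    X = np.stack([s[i:i + N] for i in range(M)])       # M x N pattern matrix
    y = np.array([s[i + N - 1 + d] for i in range(M - 1)])
    return X, y
```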
The clustering procedure can be applied directly on continuous overlapping win-
dows of the sampled raw data or on some of its derivatives (the phase or state space of
the data). In the clustering phase of the algorithm, we "collect" all the temporal pat-
terns from the past that are similar to the current temporal event by the clustering
procedure. This set of patterns is used in the next stage to predict the following samples
of the time series. Using only similar temporal patterns to predict the time series
simplifies the predictor learning task and, thus, enables better prediction ("From causes
which appear similar, we expect similar effects," Hume [27]). The prediction stage can
be done by one of the common time series prediction methods (e.g., linear prediction
with ARMA models or nonlinear prediction with NNs). The prediction method pre-
sented here combines unsupervised learning in the clustering phase and supervised
learning in the modeling phase. The learning procedure can be dynamic by utilizing
the clustering procedure for each new sample and predicting the next samples by the
adapted clustering results.

Figure 1 The general scheme of the hybrid algorithm for state recognition and time
series prediction: the time series s_n ∈ ℝ, n = 1, ..., L, is rearranged into temporal
patterns x_i = {s_i, ..., s_{i+N-1}} ∈ ℝ^N, i = 1, ..., M, with M = L - N - d + 2;
the patterns pass through the fuzzy clustering, modeling (A_j = {x_i | i ∈ J_j},
j = 1, ..., K, i = 1, ..., M - 1), and predicting stages.
1. Cluster the temporal patterns into an optimal (subject to some cluster validity
criterion) number of fuzzy sets, K; that is, find the degree of membership,
0 ≤ u_{j,i} ≤ 1, of each temporal pattern x_i, i = 1, ..., M, in each cluster,
j = 1, ..., K, such that

   Σ_{j=1}^{K} u_{j,i} = 1,   i = 1, ..., M
2. Fit a prediction model to each fuzzy cluster, j = 1, ..., K, using the set of its
"maximal members," that is, the set of temporal patterns that have the maximal
degree of membership in the jth cluster,
• A_j = {x_i | i ∈ J_j}, and a column vector of the corresponding predictions,
• b_j = {s_{i+N-1+d} | i ∈ J_j}, as a learning set, where
• J_j = {i | u_{j,i} = max_{k=1,...,K} u_{k,i}, i = 1, ..., M - 1} is the set of indices of the tem-
poral patterns that have the maximal degree of membership in the jth cluster.
Note that each temporal pattern is a member in one and only one set of max-
imal members.
Assuming a linear prediction model of order N,
• b_j = c_j · A_j, for each cluster, j = 1, ..., K, one can estimate the N-dimensional
coefficient (row) vectors, c_j, by
• c_j = b_j · pinv(A_j), where "pinv" stands for the Moore-Penrose pseudoinverse
[28] of a matrix.
Note that other prediction models, such as NN techniques, can be learned for each
cluster and that the degree of membership of each pattern in the cluster can be used
as the weight of this pattern in the learning process.
3. Predict the dth sample ahead, s_{L+d}, by a fuzzy mixture of all the prediction models
that were found in the previous step (2), using the degrees of membership of the last
pattern, u_{j,M}, in all the clusters, j = 1, ..., K, for weighting the models.
For the preceding linear model,

   s_{L+d} = Σ_{j=1}^{K} u_{j,M} · c_j · x_M
The procedure can be partially applied and terminated after each of the stages accord-
ing to the user's requirements.
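Steps 2 and 3 above can be sketched directly with the pseudoinverse (a minimal illustration under the linear model; the helper names `fit_cluster_models` and `fuzzy_mixture_predict` are invented):

```python
import numpy as np

def fit_cluster_models(A_list, b_list):
    """Step 2: one linear predictor per fuzzy cluster.  A_list[j] holds that
    cluster's 'maximal member' patterns as columns (N x |J_j|), b_list[j] the
    corresponding targets, so c_j = b_j . pinv(A_j) as in the text."""
    return [b @ np.linalg.pinv(A) for A, b in zip(A_list, b_list)]

def fuzzy_mixture_predict(c_list, u_last, x_last):
    """Step 3: mix the per-cluster predictions c_j . x_M, weighted by the last
    pattern's memberships u_{j,M}."""
    return sum(u * (c @ x_last) for u, c in zip(u_last, c_list))

# Toy check: two 'clusters' that both follow the rule s_next = 2 * s, so the
# fuzzy mixture should predict 2 * 5 = 10 for the last pattern [5.0].
A = np.array([[1.0, 2.0, 3.0]])     # N = 1, three member patterns as columns
b = np.array([2.0, 4.0, 6.0])       # their targets
c_list = fit_cluster_models([A, A], [b, b])
prediction = fuzzy_mixture_predict(c_list, [0.3, 0.7], np.array([5.0]))
```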
2.2. Feature Extraction and Reduction
The clustering can be applied directly on windows of the sampled raw data or on
some of its derivatives (the phase space of the data). It is common practice to use a
transformation of the input instead of the data elements themselves in order to char-
acterize the signal's inherent properties [29]. A considerable amount of experience has
been gained in using several known transformations (such as discrete derivatives, spec-
trum estimation, and wavelet analysis) for feature extraction to reduce the data's
dimension and to modify the data to fit the context of a specific problem. An important
feature of the proposed prediction methods is their ability to perform the clustering
under any such transformation of the input and thus to exploit the related benefits.
The power spectrum (including the high-order spectrum) of each temporal pattern
can be estimated by one of the common spectrum estimation methods (e.g., short-time
fast Fourier transform, AR, ARMA, eigenvector analysis) to construct the features
matrix for the clustering algorithm. The power spectrum is a direct
and robust way to describe the nature of the signal. One of the noteworthy advantages
of the power spectrum is its phase invariance. The main disadvantage of power spec-
trum estimation is the requirement for a stationary signal.
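One way to realize such phase-invariant spectral features is a plain FFT periodogram averaged into a few frequency bands (an assumed illustration; the chapter also names AR, ARMA, and eigenvector estimators, and `spectrum_features` with its band count are invented for this sketch):

```python
import numpy as np

def spectrum_features(patterns, n_bands=4):
    """Log power-spectrum features for clustering, one row per temporal
    pattern, averaged into n_bands coarse frequency bands."""
    X = np.asarray(patterns, dtype=float)          # M x N matrix of patterns
    power = np.abs(np.fft.rfft(X, axis=1)) ** 2    # periodogram per pattern
    bands = np.array_split(power, n_bands, axis=1) # coarse frequency bands
    feats = np.stack([b.mean(axis=1) for b in bands], axis=1)
    return np.log(feats + 1e-12)                   # log scale stabilizes range

# Phase invariance: a phase-shifted sinusoid yields the same features.
t = np.arange(64)
a = np.sin(2 * np.pi * t / 8)
b = np.sin(2 * np.pi * t / 8 + 1.0)     # same frequency, different phase
fa = spectrum_features(a[None, :])
fb = spectrum_features(b[None, :])
print(np.allclose(fa, fb, atol=1e-6))   # → True
```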
forms are chosen in order to best match the signal structures. Matching pursuits are
general procedures to compute adaptive signal representations and feature extraction.
A matching pursuit can isolate the signal structures that are coherent with respect to a
given dictionary and provides an interpretation of the signal structures. At each itera-
tion of the algorithm, a waveform that is best adapted to approximate part of the signal
is chosen. If a signal structure does not correlate well with any particular dictionary
element, namely a noise component, it is subdecomposed into several elements and its
information is diluted. Although matching pursuit is a nonlinear procedure, it does
maintain an energy conservation that guarantees its convergence [35].
This algorithm has been generalized into spatiotemporal matching pursuit
(SToMP) and adapted to multiple source estimation of EEG complexes such as evoked
potentials (EPs), which are known to be summations of simultaneous electrical activ-
ities of deeper generators [36]. By using a physiologically motivated time-frequency
dictionary of waveforms, the number and the temporal activity pattern of the signal
generators may be estimated. A slightly different version of the SToMP algorithm can
be directly applied to the temporal patterns of a multidimensional signal as a feature
extraction procedure for the clustering algorithm [34].
[Figure 2: flowchart of the recursive HUFC algorithm — optional feature selection and
reduction is applied to each weighted data subset, the subset is partitioned, and a
Yes/No branch decides whether each resulting fuzzy cluster is partitioned further.]
different weight in the partitioning. The number of clusters in each stage is determined
by adapted cluster validity criteria, based on the hypervolume measurement [26,38].
In the first call to the procedure all data points have an equal weight (of one) in
the partitioning. In the next level of the recursive process the same two-step procedure
is applied to each of the fuzzy clusters that were found in the previous partitioning.
Each fuzzy cluster is composed of all the data points with nonzero membership values
in it. These memberships are used as the weights of the data points for the next
recursive call to the WUOFC. The final membership values of each level are the
membership values that are found by the WUOFC algorithm multiplied by the
given weights of the data points. This procedure ensures that the final membership
values of each level of the recursive calls are decreasing. The recursive process is
terminated when the optimal number of clusters in the subset is one (which constrains
the chosen cluster validity criterion to be applicable and also sensitive for one cluster)
or when the number of data points in a cluster is smaller than some constant (usually
around 10) multiplied by the number of features [20,24]. Note that, in contrast to
"hard" hierarchical clustering, the final decision about the data point affiliation is
made only when the algorithm terminates, since each data point can have a nonzero
membership in more than one cluster.
The main part of the new algorithm is a recursive procedure HUFC(X, w) whose
inputs are an N × M data matrix, X, composed of M columns of data patterns,
x_j ∈ ℝ^N, j = 1, ..., M, and a column vector, w ∈ ℝ^M, of the M weights of the data
patterns in the partitioning. The final result of the HUFC algorithm is a K_g × M global
matrix, U_g, of the memberships of all M data patterns in all K_g final fuzzy clusters. The
HUFC algorithm is initiated by setting the global matrix U_g to an empty matrix and the
global number of clusters K_g to zero and executed by calling HUFC(X_0, w_0), where X_0
is the matrix of the M_0 original data patterns and w_0 is a column vector of M_0 ones. The
pseudocode of the HUFC procedure includes the following steps:
HUFC(X, w)
1. Extract the optimal features from the columns of the matrix X and reduce the
   number of features from N to F (basically by KLT).
2. If the sum of the patterns' weights, Σ_{j=1}^{M} w_j > Constant × F  ◊ commonly
   Constant ≈ 10
3. then (U, K) = WUOFC(X, w)
   ◊ apply the weighted unsupervised optimal fuzzy clustering algorithm (see Section
   2.4),
   ◊ where K is the chosen number of clusters in the given data and U is a K × M
   matrix of
   ◊ the memberships of the M given patterns in these K clusters.
4. else K = 1
5. If K > 1
6. then for k ← 1 to K
7.   do HUFC(X, w × u_k)  ◊ recursive call to the main procedure
   ◊ where u_k is the vector of the memberships of all M patterns in the kth cluster,
   ◊ and w × u_k denotes a vector whose jth component is w_j × u_{k,j}, j = 1, ..., M.
8. else append the column vector w to the global memberships matrix U_g
9.   K_g ← K_g + 1  ◊ increase the global number of clusters by one.
10. return
When the algorithm has terminated, U_g contains the final memberships of all the data
patterns in all the K_g final clusters.
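The recursion and the membership bookkeeping can be sketched as follows (a skeleton only: `cluster_fn` stands in for WUOFC, the feature-extraction step is omitted, and `min_points` plays the role of Constant × F; all names are invented for the illustration):

```python
import numpy as np

def hufc(X, w, cluster_fn, min_points, _out=None):
    """Skeleton of the recursive HUFC procedure.  cluster_fn(X, w) must return
    (U, K): a K x M membership matrix and the chosen number of clusters
    (K = 1 means 'do not split').  Leaves become rows of the global matrix U_g."""
    if _out is None:
        _out = []
    if w.sum() > min_points:
        U, K = cluster_fn(X, w)
    else:
        K = 1
    if K > 1:
        for k in range(K):
            hufc(X, w * U[k], cluster_fn, min_points, _out)  # recursive call
    else:
        _out.append(w)            # this weighted subset is a final cluster
    return np.array(_out)         # U_g: K_g x M final memberships

# Tiny demonstration with a fixed rule: split the full set once, then stop.
def toy_wuofc(X, w):
    M = X.shape[1]
    if w.sum() == M:              # first call only: split the patterns in half
        U = np.zeros((2, M))
        U[0, : M // 2] = 1.0
        U[1, M // 2 :] = 1.0
        return U, 2
    return np.ones((1, M)), 1     # deeper calls: one cluster, stop recursing

X = np.random.default_rng(0).normal(size=(2, 8))
Ug = hufc(X, np.ones(8), toy_wuofc, min_points=2)
print(Ug.shape)                   # two final clusters over eight patterns
```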
tions. It performs well in a situation of large variability of cluster shapes, densities, and
number of data points in each cluster. The pseudocode of the weighted version of the
UOFC algorithm is iterated for an increasing number of clusters in the data set, calcu-
lating a new partition of the data set, and computing performance measures in each
run, until the optimal number of clusters is obtained:
(U, K) = WUOFC(X, w)
1. Choose a single, K = 1, initial centroid, p_1, at the weighted (by w) mean of all data
   patterns.
2. While K < the maximal feasible number of clusters in the data
3. do Calculate a new partition of the data set in two phases (see Section 2.5):
   3.1 Cluster with the weighted fuzzy K-means with the Euclidean distance function:
       (U, P_K) = WFKM(X, w, K, P_{K-1})
   3.2 Use the final centroids of stage 3.1, P_K, as the initial centroids for the weighted
       fuzzy K-means with the exponential distance function,
       a fuzzy modification of the maximum likelihood estimation (FMLE):
       (U, P_K) = WFKM(X, w, K, P_K)
4. Calculate the cluster validity criteria (see Section 2.6).
5. Add another centroid equally distant (by a large number of standard deviations)
   from all data points (see step 2 in the following modified fuzzy K-means
   algorithm).
6. Use the cluster validity criteria to choose and return the optimal number of clusters,
   K, and the corresponding partition, U.
2.5. The Weighted Fuzzy K-Means (WFKM) Algorithm
The weighted version of the fuzzy K-means algorithm, which is used in stages 3.1
and 3.2 of the WUOFC algorithm, is derived from the minimization, with respect to P
(a set of cluster centers) and U (a membership matrix), of a weighted fuzzy version of
the least-squares function [18]:

   J_q(U, P) = Σ_{i=1}^{M} Σ_{k=1}^{K} (u_{k,i})^q · w_i · d²(p_k, x_i)    (1)

where x_i is the ith pattern (the ith column of the X data matrix), p_k is the center of the
kth cluster, u_{k,i} is the degree of membership of the data pattern x_i in the kth cluster, w_i is
the weight of the ith pattern (as if w_i patterns equal to x_i were included in the
data matrix X), d²(p_k, x_i) is the square of the distance between x_i and p_k, M is the
number of data patterns, and K is the number of clusters in the partition. The para-
meter q (commonly set to 2) is the weighting exponent of u_{k,i} and controls the
"fuzziness" of the resulting clusters [18]. The pseudocode of the weighted fuzzy K-
means clustering algorithm with the modified centroid initialization [15,26,38] includes
the following steps:
(U, P_K) = WFKM(X, w, K, P_{K-1})
1. Use the final centroids (prototypes) of the previous partition, P_{K-1}, as the initial
   centroids for the current partition: in stage 3.1 of the WUOFC algorithm use the
   K - 1 (*) final centroids of its previous stage, and for stage 3.2 use all the K final
   centroids, P_K, of stage 3.1.
2. repeat Calculate the degree of membership u_{k,i} of all data patterns in all clusters:
   for k ← 1 to K (*)
      do for i ← 1 to M
         do
            u_{k,i} = 1 / Σ_{j=1}^{K} [d²(p_k, x_i) / d²(p_j, x_i)]^{1/(q-1)}    (2)
   (*) Only for k = K and in the first iteration of stage 3.1 of the UOFC algorithm use
   the following distance: d²(x_i, p_k) = 10 · Sum(Diagonal(Covariance(X))),
   i = 1, ..., M. Otherwise use the Euclidean distance (Eq. 4) in stage 3.1 or the
   exponential distance (Eq. 5) in stage 3.2 of the WUOFC algorithm.
3. Calculate the new set of cluster centers:
   for k ← 1 to K
      do
         p_k = Σ_{i=1}^{M} (u_{k,i})^q · w_i · x_i / Σ_{i=1}^{M} (u_{k,i})^q · w_i    (3)
4. until max_{k,i} |u_{k,i} - (previous u_{k,i})| < ε
In the first phase 3.1 of the WUOFC algorithm, the weighted fuzzy K-means algorithm
is performed with the Euclidean distance function:

   d²(x_i, p_k) = (x_i - p_k)^T (x_i - p_k)    (4)

The final cluster centers of the first phase 3.1 are used as the initial centroids for the
second phase. In the second phase 3.2, a fuzzy modification of the maximum likelihood
estimation is utilized by using the following exponential distance function in the
weighted fuzzy K-means algorithm:

   d²_e(x_i, p_k) = [det(F_k)]^{1/2} / α_k · exp[(x_i - p_k)^T F_k^{-1} (x_i - p_k) / 2]    (5)
where α_k = Σ_{i=1}^{M} u_{k,i} / Σ_{i=1}^{M} w_i is the sum of memberships within the kth cluster, which
constitutes the a priori probability of selecting the kth cluster, and

   F_k = Σ_{i=1}^{M} u_{k,i} · w_i · (p_k - x_i)(p_k - x_i)^T / Σ_{i=1}^{M} u_{k,i} · w_i    (6)
1. The hypervolume criterion:

   V_HV(K) = Σ_{k=1}^{K} h_k    (7)

where h_k = [det(F_k)]^{1/2} is the hypervolume of the kth cluster.

2. The partition density:

   V_PD(K) = Σ_{k=1}^{K} C_k / Σ_{k=1}^{K} h_k    (8)

where C_k = Σ_{i∈I_k} u_{k,i} · w_i, and I_k is the set of indices of the "central members" in the
kth cluster (a pattern's projections are taken along g_{k,j}, the jth column of
G_k = F_k^{-1}, the inverse of the kth cluster covariance matrix).
Note that a pattern x_i is a "central member" in the kth cluster only if all the
projections of the Mahalanobis distance between the pattern x_i and the kth cen-
troid p_k are smaller than one (and not as in [26,34], where the Mahalanobis dis-
tance itself should be smaller than one).
3. The average partition density using "central members" (APDC):

   V_APD(K) = (1/K) Σ_{k=1}^{K} [C_k / h_k]    (9)

The analogous average partition density using the "maximal members" (Eq. 10)
replaces C_k by the sum of u_{k,i} · w_i over the patterns that have their maximal
membership in the kth cluster.
The UOFC algorithm is terminated when the performance measures for cluster
validity reach their best value. The choice of the criterion or combination of criteria to
be the performance measure is driven by the specific distribution of the data. One of the
main constraints on a validity criterion for the HUFC algorithm is its efficient applic-
ability for one cluster (compared to more than one cluster), remembering that the
recursive procedure is halted when the "partition" to one cluster is the best of all
partitions. This constraint precludes the use of any validity criterion that involves the
distance between clusters (such as the classical Fisher criterion [24], which is based on
the "between and within clusters scatter matrix," or the well established Xie-Beni
criterion [39] and many others).
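A compact sketch of these density-based criteria, built on the fuzzy covariance matrices of Eq. (6) (for brevity this illustration takes "central members" to be patterns with squared Mahalanobis distance below one, rather than the text's stricter per-projection test; `validity_criteria` is an invented name):

```python
import numpy as np

def validity_criteria(X, U, w):
    """Hypervolume (Eq. 7), partition density (Eq. 8) and average partition
    density (Eq. 9).  X is N x M (patterns as columns), U is K x M, w the
    pattern weights."""
    K = U.shape[0]
    h = np.zeros(K)
    C = np.zeros(K)
    for k in range(K):
        g = U[k] * w                                  # u_{k,i} * w_i
        p = (X * g).sum(axis=1) / g.sum()             # weighted fuzzy centroid
        D = X - p[:, None]
        F = (g * D) @ D.T / g.sum()                   # Eq. (6)
        h[k] = np.sqrt(np.linalg.det(F))              # cluster hypervolume h_k
        maha2 = (D * np.linalg.solve(F, D)).sum(axis=0)
        C[k] = (U[k] * w)[maha2 < 1.0].sum()          # central-member mass
    return h.sum(), C.sum() / h.sum(), (C / h).mean() # Eqs. (7), (8), (9)

# Four corner points, one cluster: F = 0.25 * I, so the hypervolume is 0.25,
# and every point sits outside the unit Mahalanobis ellipse (density 0).
X = np.array([[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]])
V_HV, V_PD, V_APD = validity_criteria(X, np.ones((1, 4)), np.ones(4))
```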
2.7. The Dynamic WUOFC Algorithm
In the realization of temporal pattern clustering, the data set X can be dynamic.
For each new sample s_{L+1}, we get another temporal pattern (column) in the matrix X,
and repeat the HUFC with the new data set. The dynamic procedure is started with an
initial X matrix with M_0 columns and is rerun for each new sample. It is possible to save
computation time by initializing the prototypes of each partition with the centroids of the
last partition. We can gradually decrease the weights of old samples in the clustering
procedure by the following definition of w_i:

where M is the number of columns in X and β is a constant (usually set to M_0, the initial
number of patterns in X). In time series clustering, w_i has the meaning of a memory
coefficient. By this dynamic procedure we get a dynamic number of clusters with
dynamic moving centroids. The dynamic parameters that are used to identify the
state of the system are the variable number of clusters, the location of the centroids
of the clusters, and the fuzzy-covariance matrices of the clusters.
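The chapter's closed form for w_i did not survive this reproduction; one plausible choice consistent with the description (weights that decrease for older samples, newest weight one, memory constant β set to M_0) is an exponential decay, used below purely as an assumed illustration:

```python
import numpy as np

def memory_weights(M, beta):
    """Assumed memory-coefficient weights for the dynamic clustering: the
    i-th (1-based) of M patterns gets weight exp((i - M) / beta), so the
    newest pattern has weight 1 and older ones fade out."""
    i = np.arange(1, M + 1)
    return np.exp((i - M) / float(beta))
```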
3. RESULTS
As mentioned before, the first part of the hybrid procedure can be applied for state
recognition and event detection, while the whole procedure should be used only if time
series prediction is needed. In the following two sections we demonstrate these
options.
[Figure 3: the partition density and average partition density validity criteria for rat
number 11's EEG data (eps11) versus the number of clusters.]
Figure 4 The first partition of rat number 11's EEG data into two clusters by the
HUFC algorithm. The upper panel shows the partition in the clustering
space of three (out of eight) energies of the scales of the discrete wavelet
transform of the EEG stretch terminating with a seizure. Each data point is
marked by the number of the cluster in which it has maximal degree of
membership. The number of clusters was determined by the average density
criterion for cluster validity (Figure 3). The lower panel shows the "hard"
affiliation of each successive point in the time series (1 second) to each of
the clusters. The seizure beginning (as located by a human expert) is marked
by a solid vertical line (after 700 seconds). (See insert for color illustrations.)
Figure 5 The final partition of the EEG data with the HUFC algorithm. Clusters 4
and 5 can be used to predict the seizure, which can be identified by clusters
8 and 10. (See insert for color illustrations.)
as during the exercise. We can learn from the small variance of these first clusters that in
this first stage the heart rate variability is very small. Cluster number 5 (which did not
exist in the first stage) indicates the very beginning stage of the recovery. The patterns of
clusters 9 and 10 mark the breathing during recovery and resting stages, respectively,
and so on. This example emphasizes the ability of the proposed algorithm to extract
and to quantify the dynamic states of the subject.
Figure 6 The validity criteria of the first stage of the HUFC algorithm for the heart
rate signal. Again, only the average partition density criterion (bottom
right) suggests the choice of four clusters in this first stage. The other
criteria were found to be useless for these continuous clusters.
The prediction error is measured by the normalized mean squared error,

   NMSE = Σ_{j=1}^{T} (observation_j - prediction_j)² / (T · σ²)

where j = 1, ..., T enumerates the points in the withheld test set, "observation" is the
true value of the time series, S, "prediction" is the output of the prediction algorithm,
Ŝ, and σ² denotes the sample variance of the observed time series in the test set. A value
of NMSE = 1 corresponds to simply predicting the average.
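This error measure is a one-liner (a sketch; `nmse` is an invented name, and the population variance is used so that predicting the test-set mean scores exactly 1):

```python
import numpy as np

def nmse(observed, predicted):
    """Normalized mean squared error: summed squared errors divided by
    T times the variance of the observed test set."""
    s = np.asarray(observed, dtype=float)
    s_hat = np.asarray(predicted, dtype=float)
    return ((s - s_hat) ** 2).sum() / (len(s) * s.var())

obs = np.array([1.0, 2.0, 3.0, 4.0])
print(nmse(obs, np.full(4, obs.mean())))   # → 1.0
```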
Learning the optimal dimension of the temporal patterns N is still an open pro-
blem. The number of clusters in each partition was automatically chosen by the average
partition density V_APD criterion using the "maximal members" (Eq. 10). Note that the
clusters that are created by continuously sampled temporal patterns are different from
other clusters; they are not well separated and do not have any regular shape (Figures
3-11), and it seems that the APD criterion is well adapted to these kinds of clusters
(where other validity criteria give poor results or, most of the time, show monotonic
behavior).

Figure 7 The first partition of the recovery heart rate signals by the HUFC algorithm
into four clusters, as suggested by the average partition density (Figure 6).
The upper panel shows the partition of the 3D temporal patterns {s_i, s_{i+1},
s_{i+2}}, i = 1, ..., L - 2, of the heart rate signal into the four clusters, and in
the lower panel we can see the affiliation of each temporal pattern with its
corresponding cluster marked on the original heart rate signal (the contin-
uous line). (See insert for color illustrations.)
Figure 8 The final partition of the heart rate variability signal into 10 clusters. The
upper panel shows the partition of the 3D temporal patterns of the heart
rate signal into the final 10 clusters, and in the lower panel we can see the
affiliation of each temporal pattern with its corresponding cluster marked
on the original heart rate signal (the continuous line). (See insert for color
illustrations.)
Figure 9 The first partition of the s_n = 4 · s_{n-1} · (1 - s_{n-1}) time series by the HUFC
algorithm. The lower panel shows the partition of the 2D temporal patterns,
{s_i, s_{i+1}}, i = 1, ..., 899, into five clusters, as suggested by the average parti-
tion density criterion in the upper panel. (See insert for color illustrations.)
the two-dimensional state (or phase) space, divided into five clusters as suggested by the
APD criterion in the lower panel. In the final stage of the HUFC algorithm, the temporal
patterns were divided into 101 clusters, and more clusters were chosen in nonlinear
areas of the phase space. Figure 10 shows the final prediction results with
NMSE = 6.116e-7.
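For reference, the benchmark series itself is trivial to generate (a sketch; the seed s0 = 0.3 is an arbitrary choice for this illustration, as the chapter does not state its initial value):

```python
def logistic_series(s0=0.3, n=1000):
    """The chaotic benchmark s_n = 4 * s_{n-1} * (1 - s_{n-1}): fully
    deterministic, so near-perfect one-step prediction is possible once
    similar past states are grouped together."""
    s = [s0]
    for _ in range(n - 1):
        s.append(4.0 * s[-1] * (1.0 - s[-1]))
    return s

series = logistic_series()
train, test = series[:900], series[900:]    # the split used in the text
```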
In the second example we chose the heart rate variability signal in the rest state. Again
the first 900 points were used as a training set and the next 100 points as the test set
Figure 10 The one-sample-ahead (d = 1) prediction results for 200 samples of the s_n =
4 · s_{n-1} · (1 - s_{n-1}) time series. The circle (O) marks the original samples of
the time series and the "x" marks the predictions.
(T = 100). One sample ahead was predicted (d = 1), using 3D temporal patterns,
{s_i, s_{i+1}, s_{i+2}}, i = 1, ..., 898 (N = 3). The top panel of Figure 11 shows the final clustering
into 19 clusters, as suggested by the APD criterion. The lower panel of Figure 11
shows the classification results on the original training set. For these real data the
prediction error was 0.3076, as shown in Figure 12.
4. CONCLUSION AND DISCUSSION

We have described a method for biomedical state recognition and dynamic system
identification by an unsupervised hierarchical-fuzzy clustering. The clustering is useful
in grouping noncontinuous temporal patterns from nonstationary signals and in forming
warning clusters. Moreover, the vague flips from one state to another are naturally
treated by means of fuzzy clustering. In summary, two main problems are tackled by
the unsupervised fuzzy clustering procedure. First, it finds similar events in the "his-
tory" of the time series that are relevant to the prediction and avoids the use of non-
relevant information that can bias the prediction results. Note that noncontinuous time
series can be utilized by the clustering algorithm, so "old" observations of the time
series can be employed for the prediction. Second, using only this minimal required
number of similar temporal patterns improves the robustness and reduces the compu-
tation time of any prediction algorithm that is used. Yet the specific parameters for
temporal pattern classification by the unsupervised fuzzy clustering (the value of the
"fuzziness" of the partition, q, the partition validity criteria, etc.) should be further
investigated. Moreover, the results of the clustering (the membership function, the
centroids and their variances, etc.) could be more efficiently utilized in the prediction
process; for example, it seems promising to use an RBF NN for the prediction of each
cluster, using the cluster centroid and variance matrix as initial values of the neuron
transfer function.

Figure 11 The final partition of the resting heart rate signals by the HUFC algorithm
into 19 clusters, as suggested by the average partition density using the
"maximal members." The upper panel shows the partition of the 3D
temporal patterns {s_i, s_{i+1}, s_{i+2}}, i = 1, ..., L - 2, of the heart rate signal
into 19 clusters, and in the lower panel we can see the affiliation of each
temporal pattern with its corresponding cluster marked on the original
heart rate signal (the continuous line). (See insert for color illustrations.)

Figure 12 The one-sample-ahead (d = 1) prediction results for 100 samples of the
resting heart rate signal. The circle (O) marks the original samples of the
time series and the "x" marks the predictions.
The hierarchical-fuzzy algorithm tries to exploit the advantages of hierarchical
clustering while overcoming its disadvantages by means of fuzzy clustering. The hierarchical
partition helps to fathom the inner structure of the data and thus enables
multiscale compression and reconstruction of the data. Thus, it can be used to analyze
complicated fractal structures and natural signals and pictures. The algorithm can work
for data with a wide dynamic range in both the covariance matrices and the number of
members in each cluster. The number of subclusters in each bifurcation is neither two
nor any other constant (as in some hierarchical clustering algorithms) but is adaptively
determined by the nature of the data. Choosing different optimal features for the
partition of each data subset can help achieve a feasible result when dealing with a mixture
of different types of data. The HUFC algorithm can be naturally realized by means of
parallel computation where each subclustering can be made by a different processor.
Hence it can be faster in practice than other fuzzy clustering algorithms.
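As a concrete illustration of the underlying machinery, the following sketch embeds a heart-rate-like signal into 3D temporal patterns {s_i, s_{i+1}, s_{i+2}} and partitions them with plain (flat) fuzzy c-means. This is not the HUFC algorithm itself; the synthetic signal, the number of clusters, and the fuzziness exponent q are illustrative assumptions.

```python
import numpy as np

def fcm(X, c, q=2.0, n_iter=100, seed=0):
    """Flat fuzzy c-means; q > 1 is the "fuzziness" exponent of the partition."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                      # memberships sum to 1 per pattern
    for _ in range(n_iter):
        W = U ** q
        centroids = (W @ X) / W.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2) + 1e-12
        U = d ** (-2.0 / (q - 1.0))         # standard FCM membership update
        U /= U.sum(axis=0)
    return centroids, U

# Embed a signal into 3D temporal patterns {s_i, s_{i+1}, s_{i+2}}
rng = np.random.default_rng(1)
s = np.sin(np.linspace(0, 20, 200)) + 0.05 * rng.standard_normal(200)
patterns = np.stack([s[:-2], s[1:-1], s[2:]], axis=1)   # shape (L - 2, 3)
centroids, U = fcm(patterns, c=4)
labels = U.argmax(axis=0)                   # hard affiliation of each pattern
```

The hierarchical version would recursively re-cluster each subset, with its own features and an adaptively determined number of subclusters.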
The method was demonstrated by forecasting epileptic seizures from the EEG
signal and by state recognition of the recovery from exercise of the heart rate signal.
Adding more channels to the feature-extracting process and more parameters (derived
from other biomedical signals) to the clustering process should increase the forecasting
power of this method. Other possible applications for using the method as a warning
device could be the prediction of impending psychotic states, detrimental effects of
hypoxia in pilots, and loss of vigilance in drivers as well as extracerebral pathologies
such as a heart attack, based on heart rate variability. The method can also be utilized
for the prediction of nonbiological time series.
The time series prediction results of the method are encouraging. One of the
significant problems of methods for nonlinear time series prediction is the strong dependence
between the quality of the results and the specific characteristics of the time series
[3]. The main advantage of the new method is the unsupervised and adaptive learning of
the number of clusters, or states, in the time series space (the number of the underlying
process in the signal) and of the variable number of patterns in each cluster, which can
overcome the general nonstationary nature of the time series. However, the establish-
ment of reliable validity criteria for temporal pattern classification is an important open
issue. Another open problem of the methods is the adaptive learning of the best value
for the dimension of the temporal patterns, N, that is, the dimension of the state space
of the specific time series analyzed. Trying to choose the "optimal" N by means of the
best results of the training set, by the conventional technique, did not give the best
results in all cases.
ACKNOWLEDGMENTS
The author would like to thank Dr. Dan Kerem from the Israeli Naval Medical
Institute for the EEG data and for his physiological advice, Eran Lumbroso from Tel-Aviv
University for the heart rate signals, and Shai Poliker from the ECE department of
BGU for his helpful comments. This research was supported by The Israel Science
Foundation founded by The Israel Academy of Sciences and Humanities.
REFERENCES
[1] J. D. Hamilton, Time Series Analysis. Princeton: Princeton University Press, 1994.
[2] A. S. Weigend, Predicting the future: A connectionist approach. Int. J. Neural Syst. 1(3):
193-209, 1990.
[3] A. S. Weigend and N. A. Gershenfeld, Time Series Prediction: Forecasting the Future and
Understanding the Past. Reading, MA: Addison-Wesley, 1992.
[4] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal
approximators. Neural Networks 2: 359-366, 1989.
[5] E. Hartman, K. Keeler, and J. K. Kowalski, Layered neural networks with Gaussian hidden
units as universal approximators. Neural Comput. 2: 210, 1990.
[6] S. Haykin, Neural Networks, a Comprehensive Foundation. New York: Macmillan, 1994.
[7] S. G. Mallat and S. Zhong, Characterization of signals from multiscale edges. IEEE Trans.
Pattern Anal. Machine Intell. 14(7): 710-732, 1992.
[8] I. Daubechies, Orthonormal bases of compactly supported wavelets. Commun. Pure Appl.
Math. 41: 909-996, 1988.
[9] S. G. Mallat, A theory for multiresolution signal decomposition: The wavelet representa-
tion. IEEE Trans. Pattern Analy. Machine Intell. 11: 674-693, 1989.
[10] B. R. Bakshi and G. Stephanopoulos, Wave-net: A multiresolution, hierarchical neural
network with localized learning. AIChE J. 39(1): 57-81, 1993.
[11] B. R. Bakshi and G. Stephanopoulos, Reasoning in time: Modeling analysis and pattern
recognition of temporal process trends. Adv. Chem. Eng. 22: 485-547, 1995.
[12] Q. Zhang and A. Benveniste, Wavelet networks. IEEE Trans. Neural Networks 3: 889-898,
1992.
[13] Q. Zhang, Using wavelet networks in nonparametric estimation. IEEE Trans. Neural
Networks 8(2): 227-236, 1997.
52 Chapter 2 Applications of Fuzzy Clustering to Biomedical Signal Processing
[14] B. Delyon, A. Juditsky, and A. Benveniste, Accuracy analysis for wavelet approximations.
IEEE Trans. Neural Networks 6(2): 332-348, 1995.
[15] A. B. Geva, Dynamic unsupervised fuzzy clustering in forecasting events from biomedical
signals. Ministry of Science, International Conference on Fuzzy Logic and Applications, Israel,
May 1997.
[16] T. Poggio and F. Girosi, Networks for approximation and learning. Proc. IEEE
78(9): 1481-1497, 1990.
[17] J. Kohlmorgen, K.-R. Müller, and K. Pawelzik, Segmentation and identification of drifting
dynamical systems. Proceedings of the 1997 IEEE Signal Processing Society Workshop on
Neural Networks for Signal Processing, Florida, September, 1997.
[18] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York:
Plenum, 1981.
[19] J. C. Bezdek and S. K. Pal, Fuzzy Models for Pattern Recognition. New York: IEEE Press,
1992.
[20] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice
Hall, 1988.
[21] I. Gath, C. Feuerstein, and A. B. Geva, Unsupervised classification and adaptive definition
of sleep patterns, Pattern Recogn. Lett. 15: 977-984, 1994.
[22] A. B. Geva and H. Pratt, Unsupervised clustering of evoked potentials by waveform. Med.
Biol. Eng. Comput. 543-550, 1994.
[23] N. R. Pal and J. C. Bezdek, On cluster validity for the fuzzy c-means model. IEEE Trans.
Fuzzy Syst. 3(3): 370-379, 1995.
[24] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: Wiley-
Interscience, 1973.
[25] L. A. Zadeh, Fuzzy sets. Inform. Control 8: 338-353, 1965.
[26] I. Gath and A. B. Geva, Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal.
Machine Intell. 11(7): 773-781, 1989.
[27] D. Hume, An Enquiry Concerning Human Understanding, 1748.
[28] MATLAB Reference Guide. The MathWorks, 1992.
[29] A. Cohen, Biomedical Signal Processing, Boca Raton, FL: CRC Press, 1986.
[30] S. Mallat, A Wavelet Tour of Signal Processing, San Diego: Academic Press, 1998.
[31] A. Aldroubi and M. Unser, Wavelets in Medicine and Biology, Boca Raton, FL: CRC Press,
1996.
[32] M. Akay, Time-Frequency and Wavelets in Biomedical Signal Processing, New York: IEEE
Press, 1998.
[33] A. B. Geva, ScaleNet—MultiScale neural network architecture for time series prediction.
IEEE Trans. Neural Networks 9(6) 1471-1482, 1998.
[34] A. B. Geva and D. H. Kerem, Brain state identification and forecasting of acute pathology
using unsupervised fuzzy clustering of EEG temporal patterns. In Applications of Neuro-
Fuzzy Systems in Medicine and Bio-medical Engineering (BME), H.-N. Teodorescu, L. C.
Jain and A. Kandel, eds. Boca Raton, FL: CRC Press.
[35] S. G. Mallat and S. Zhong, Characterization of signals from multiscale edges. IEEE Trans.
Pattern Anal. Machine Intell. 14(7): 710-732, 1992.
[36] A. B. Geva, H. Pratt, and Y. Y. Zeevi, Multichannel wavelet-type decomposition of evoked
potentials: Model-based recognition of generator activity. Med. Biol. Eng. Comput. 95(1):
40-46, 1997.
[37] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer
Academic Publisher, 1992.
[38] I. Gath and A. B. Geva, Fuzzy clustering for the estimation of the parameters of the
components of mixtures of normal distributions. Pattern Recogn. Lett. 9: 77-86, 1989.
[39] X. L. Xie and G. Beni, A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal.
Machine Intell. 13(8): 841-847, 1991.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.
Simon Haykin
2. SUPERVISED LEARNING
This form of learning assumes the availability of a labeled (i.e., ground-truthed) set of
training data made up of N input-output examples:
T = {(x_i, d_i)}_{i=1}^{N} (1)
54 Chapter 3 Neural Networks: A Guided Tour
end. The hidden neurons play an important role: the extraction of important features
contained in the input data.
The training of an MLP is usually accomplished by using a BP algorithm that
involves two phases [2,3].
• Forward phase. During this phase the free parameters of the network are fixed,
and the input signal is propagated through the network of Figure 1 layer by
layer. The forward phase finishes with the computation of an error signal
e_i = d_i - y_i (3)

where d_i is the desired response and y_i is the actual output produced by the
network in response to the input x_i.
• Backward phase. During this second phase, the error signal e_i is propagated
through the network of Figure 1 in the backward direction, hence the name of
the algorithm. It is during this phase that adjustments are applied to the free
parameters of the network so as to minimize the error e_i in a statistical sense.
Back-propagation learning may be implemented in one of two basic ways, as
summarized here:
1. Sequential mode (also referred to as the pattern mode, on-line mode, or stochastic
mode): In this mode of BP learning, adjustments are made to the free parameters
of the network on an example-by-example basis. The sequential mode is best suited
for pattern classification.
2. Batch mode: In this second mode of BP learning, adjustments are made to the free
parameters of the network on an epoch-by-epoch basis, where each epoch consists
of the entire set of training examples. The batch mode is best suited for nonlinear
regression.
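The two update schedules can be contrasted on a toy problem; a single linear neuron stands in for the MLP here, and the data, learning rate, and epoch counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
d = X @ np.array([1.0, -2.0, 0.5])          # targets from a known linear map
eta = 0.05

# Sequential (on-line) mode: adjust after every example
w_seq = np.zeros(3)
for epoch in range(50):
    for x_i, d_i in zip(X, d):
        e_i = d_i - w_seq @ x_i             # error signal for this example
        w_seq += eta * e_i * x_i

# Batch mode: one adjustment per epoch over the entire training set
w_bat = np.zeros(3)
for epoch in range(400):
    e = d - X @ w_bat
    w_bat += eta * (X.T @ e) / len(X)
```

Both schedules recover the same underlying weights on this noiseless problem; they differ in when the adjustments are applied.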
The back-propagation learning algorithm is simple to implement and computationally
efficient in that its complexity is linear in the synaptic weights of the network. However,
a major limitation of the algorithm is that it can be excruciatingly slow, particularly
when we have to deal with a difficult learning task that requires the use of a large
network.
We may try to make back-propagation learning perform better by invoking the
following list of heuristics:
• Use neurons with antisymmetric activation functions (e.g., the hyperbolic tangent
function) in preference to nonsymmetric activation functions (e.g., the logistic
function). Figure 2 shows examples of these two forms of activation functions.
• Shuffle the training examples after the presentation of each epoch; an epoch
involves the presentation of the entire set of training examples to the network.
• Follow an easy-to-learn example with a difficult one.
• Preprocess the input data so as to remove the mean and decorrelate the data.
• Arrange for the neurons in the different layers to learn at essentially the same
rate. This may be attained by assigning a learning-rate parameter to neurons in
the last layers that is smaller than those at the front end.
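The mean-removal and decorrelation heuristic, for instance, can be sketched as a PCA rotation of the centered inputs; the data and dimensions below are illustrative.

```python
import numpy as np

def preprocess(X):
    """Remove the mean, then decorrelate with the eigenvectors of the
    sample covariance matrix (a PCA rotation)."""
    Xc = X - X.mean(axis=0)                 # zero-mean inputs
    _, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvecs                     # rotated, decorrelated features

rng = np.random.default_rng(0)
raw = rng.standard_normal((500, 2)) @ np.array([[2.0, 1.0], [0.0, 1.0]]) + 5.0
clean = preprocess(raw)
cov_clean = np.cov(clean, rowvar=False)     # off-diagonal entries vanish
```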
Figure 2 Examples of (a) an antisymmetric activation function (hyperbolic tangent,
saturating at ±1.719) and (b) a nonsymmetric activation function (logistic).
N = O(W/ε) (4)

where O denotes "the order of," W denotes the number of free (synaptic) weights in
the network, and ε denotes the fraction of classification errors permitted on test
data. For example, with an error of 10% the number of training examples needed
should be about 10 times the number of synaptic weights in the network.
Supposing that we have chosen a multilayer perceptron to be trained with the
back-propagation algorithm, how do we determine when it is "best" to stop the
training session? How do we select the size of individual hidden layers of the MLP? The
answers to these important questions may be obtained through the use of a statistical
technique known as cross-validation, which proceeds as follows:
• The set of training examples is split into two parts:
• Estimation subset used for training of the model
• Validation subset used for evaluating the model performance
• The network is finally tuned by using the entire set of training examples and
then tested on test data not seen before.
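A minimal sketch of the estimation/validation split, using a linear model in place of the MLP and retaining the weights with the lowest validation error; the data, split size, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

# Split the training examples into estimation and validation subsets
X_est, y_est = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

w = np.zeros(5)
best_w, best_val = w.copy(), np.inf
for epoch in range(500):
    e = y_est - X_est @ w
    w += 0.01 * (X_est.T @ e) / len(X_est)  # gradient step on the estimation subset
    val_mse = np.mean((y_val - X_val @ w) ** 2)
    if val_mse < best_val:                  # keep the weights with best validation score
        best_val, best_w = val_mse, w.copy()
```

Once the stopping point is identified this way, the final step in the text retrains on the entire training set before evaluation on unseen test data.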
Radial-basis function (RBF) networks differ from multilayer perceptrons in the following respects:
• RBF networks have a single hidden layer, whereas multilayer perceptrons can
have any number of hidden layers.
• The output layer of an RBF network is always linear, whereas in a multilayer
perceptron it can be linear or nonlinear.
• The activation function of the hidden layer in an RBF network computes the
Euclidean distance between the input signal vector and a parameter vector of
the network, whereas the activation function of a multilayer perceptron com-
putes the inner product between the input signal vector and the pertinent synap-
tic weight vector.
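The contrast between the two hidden-unit computations can be written out directly; the particular weights, center, and width below are illustrative.

```python
import numpy as np

def mlp_hidden(x, w, b):
    """MLP hidden unit: inner product with the synaptic weight vector."""
    return np.tanh(w @ x + b)

def rbf_hidden(x, center, width):
    """RBF hidden unit: Gaussian of the Euclidean distance to a center."""
    return np.exp(-np.linalg.norm(x - center) ** 2 / (2.0 * width ** 2))

x = np.array([0.5, -1.0])
h_mlp = mlp_hidden(x, np.array([1.0, 2.0]), 0.1)
h_rbf = rbf_hidden(x, center=np.array([0.5, -1.0]), width=1.0)  # x at the center
```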
The use of a linear output layer in an RBF network may be justified in light of
Cover's theorem on the separability of patterns. According to this theorem, provided
that the transformation from the input space to the feature (hidden) space is nonlinear
and the dimensionality of the feature space is high compared to that of the input (data)
space, then there is a high likelihood that a nonseparable pattern classification task in
the input space is transformed into a linearly separable one in the feature space.
Design methods for RBF networks include the following:
1. Random selection of fixed centers [4]
2. Self-organized selection of centers [6]
3. Supervised selection of centers [5]
4. Regularized interpolation exploiting the connection between an RBF network and
the Watson-Nadaraya regression kernel [7]
K(x_i, x_j) = φ(x_i)^T φ(x_j) (5)

where x_i and x_j are input vectors for examples i and j, and φ(x_i) is the vector of hidden-unit
outputs for input x_i. The hidden (feature) space is chosen to be of high dimensionality
so as to transform a nonlinearly separable pattern classification problem into a
linearly separable one. Most important, however, in a pattern-classification task, for
example, the support vectors are selected by the SVM learning algorithm so as to
maximize the margin of separation between classes.
The curse-of-dimensionality problem, which can plague the design of multilayer
perceptrons and RBF networks, is avoided in support vector machines through the use
of quadratic programming. This technique, based directly on the input data, is used to
solve for the linear weights of the output layer [8].
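The inner-product kernel K(x_i, x_j) = φ(x_i)^T φ(x_j) and the effect of a nonlinear feature map can be illustrated on the XOR problem, which is not linearly separable in the input space but becomes so after a simple lift. The map φ and the separating weights below are illustrative hand-picked choices, not the output of a trained SVM.

```python
import numpy as np

# XOR labels: not linearly separable in the 2D input space
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
labels = np.array([-1., 1., 1., -1.])

def phi(x):
    """Nonlinear map into a higher dimensional feature space."""
    return np.array([x[0], x[1], x[0] * x[1]])

# Inner-product kernel K(x_i, x_j) = phi(x_i) . phi(x_j)
K = np.array([[phi(a) @ phi(b) for b in X] for a in X])

# In the feature space a linear separator exists (weights chosen by hand)
w, b = np.array([1.0, 1.0, -2.0]), -0.5
preds = np.sign(np.array([w @ phi(x) + b for x in X]))
```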
3. UNSUPERVISED LEARNING
where the term -η y_j^2 w_{ji} is added to stabilize the learning process. As the number of
iterations approaches infinity, we find the following:
1. The synaptic weight vector of neuron j approaches the eigenvector associated with
the largest eigenvalue λ_max of the correlation matrix of the input vector (assumed
to be of zero mean).
2. The variance of the output of neuron j approaches the largest eigenvalue λ_max.
The generalized Hebbian algorithm (GHA), due to Sanger [10], is a straightforward
generalization of Oja's neuron for the extraction of any desired number of principal
components.
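Oja's single-neuron rule, whose stabilizing term -η y^2 w is quoted above, can be sketched as follows; the correlation matrix, learning rate, and sample count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0], [1.0, 1.0]])      # illustrative correlation matrix
X = rng.standard_normal((5000, 2)) @ np.linalg.cholesky(C).T  # zero-mean inputs

w, eta = rng.standard_normal(2), 0.01
for x in X:
    y = w @ x
    w += eta * (y * x - y * y * w)          # Hebbian term plus stabilizing term

principal = np.linalg.eigh(C)[1][:, -1]     # eigenvector of the largest eigenvalue
cosine = abs(w @ principal) / np.linalg.norm(w)
```

After training, w aligns with the principal eigenvector and its norm settles near unity, in line with properties 1 and 2 above.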
3.2. Self-Organizing Maps
In a self-organizing map (SOM), due to Kohonen (1997), the neurons are placed at
the nodes of a lattice, and they become selectively tuned to various input patterns
(vectors) in the course of a competitive learning process. The process is characterized
by the formation of a topographic map in which the spatial locations (i.e., coordinates)
of the neurons in the lattice correspond to intrinsic features of the input patterns.
Figure 4 illustrates the basic idea of a self-organizing map, assuming the use of a
two-dimensional lattice of neurons as the network structure.
In reality, the SOM belongs to the class of vector coding algorithms [11]. That is, a
fixed number of code words are placed into a higher dimensional input space, thereby
facilitating data compression.
An integral feature of the SOM algorithm is the neighborhood function centered
around a neuron that wins the competitive process. The neighborhood function starts
by enclosing the entire lattice initially and is then allowed to shrink gradually until it
encompasses the winning neuron.
The algorithm exhibits two distinct phases in its operation:
1. Ordering phase, during which the topological ordering of the weight vectors takes
place
2. Convergence phase, during which the computational map is fine tuned
The SOM algorithm exhibits the following properties:
1. Approximation of the continuous input space by the weight vectors of the discrete
lattice.
2. Topological ordering exemplified by the fact that the spatial location of a neuron
in the lattice corresponds to a particular feature of the input pattern.
3. The feature map computed by the algorithm reflects variations in the statistics of
the input distribution.
4. SOM may be viewed as a nonlinear form of principal components analysis.
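A minimal SOM sketch with a one-dimensional lattice, a shrinking Gaussian neighborhood, and a decaying learning rate; all schedule constants and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 2))                # input patterns in the unit square

n_nodes = 10                                # 1D lattice of neurons
W = rng.random((n_nodes, 2))                # code words (weight vectors)
positions = np.arange(n_nodes)

for t in range(2000):
    x = data[rng.integers(len(data))]
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # competition
    sigma = 3.0 * np.exp(-t / 500.0)        # neighborhood shrinks over time
    eta = 0.5 * np.exp(-t / 1000.0)         # learning rate decays over time
    h = np.exp(-(positions - winner) ** 2 / (2.0 * sigma ** 2 + 1e-12))
    W += eta * h[:, None] * (x - W)         # cooperative weight update
```

The early iterations (large sigma) perform the ordering phase; the later ones (small sigma, small eta) perform the convergence phase.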
4. NEURODYNAMIC PROGRAMMING
Supervised learning is a cognitive learning problem performed under the tutelage of a
teacher. It requires the availability of input-output examples representative of the
environment.
Reinforcement learning, on the other hand, is a behavioral learning problem [17].
It is performed through the interaction of a learning system with its environment. The
need for a teacher is eliminated by virtue of this interactive process.
Basically, neurodynamic programming is the modern approach to reinforcement
learning, building on Bellman's classic work on dynamic programming [18]. For a
formal definition of neurodynamic programming, we offer the following:
• Neurodynamic programming enables a system to learn how to make good
decisions by observing its own behavior and to improve its actions by using a
built-in mechanism through reinforcement.
Neurodynamic programming incorporates two primary ingredients:
1. The theoretical foundation provided by dynamic programming
2. The learning capabilities provided by neural networks as function approximators
An important feature of neurodynamic programming is that it solves the credit
assignment problem by assigning credit or blame to each one of a set of interacting
decisions in a principled manner. The credit assignment problem is also referred to as
the loading problem, the problem of loading a given set of training data into the free
parameters of the network.
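The dynamic-programming foundation can be illustrated with tabular value iteration on a toy chain of states. This is the classical Bellman backup only, without the neural function approximator that neurodynamic programming adds; the chain, rewards, and discount factor are illustrative.

```python
import numpy as np

# Toy chain of 4 states; action "right" moves one state toward the terminal
# state 3 (reward 1 on arrival), action "stay" gives no reward.
n_states, gamma = 4, 0.9
V = np.zeros(n_states)
for _ in range(100):
    V_new = np.zeros(n_states)
    for s in range(n_states - 1):
        stay = 0.0 + gamma * V[s]
        right = (1.0 if s + 1 == n_states - 1 else 0.0) + gamma * V[s + 1]
        V_new[s] = max(stay, right)         # Bellman optimality backup
    V = V_new                               # terminal state keeps value 0
```

The values propagate backward from the rewarding transition, discounted by gamma at each step.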
Time is an essential dimension of learning. We may incorporate time into the design of
a neural network implicitly or explicitly. A straightforward method of implicit repre-
sentation of time is to add a short-term memory structure at the input end of a static
neural network (e.g., multilayer perceptron), as illustrated in Figure 5. This configura-
tion is called a focused time-lagged feed-forward network (TLFN). Focused TLFNs are
limited to stationary dynamical processes.
To deal with nonstationary dynamical processes, we may use distributed TLFNs
where the effect of time is distributed at the synaptic level throughout the network. One
way in which this may be accomplished is to use finite-duration impulse response (FIR)
filters to implement the synaptic connections of an MLP; Figure 6 shows an FIR model
of a synapse. The training of a distributed TLFN is naturally a more difficult proposi-
tion than the training of a focused TLFN. Whereas we may use the ordinary back-
propagation algorithm to train a focused TLFN, we have to extend the back-propaga-
tion algorithm to cope with the replacement of a synaptic weight in the ordinary MLP
by a synaptic weight vector. This extension is referred to as the temporal back-propa-
gation algorithm, due to Wan [19].
Figure 5 Focused time-lagged feed-forward network (TLFN); the bias levels have
been omitted for convenience of presentation.
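A focused TLFN can be sketched as a tapped delay line feeding a static stage; a linear least-squares readout stands in here for the multilayer perceptron, and the sinusoidal signal and memory order are illustrative.

```python
import numpy as np

def tapped_delay_patterns(s, p):
    """Short-term memory: order-p tapped delay line at the input of a
    static network; each row holds s(n-p), ..., s(n-1)."""
    X = np.stack([s[i:i + p] for i in range(len(s) - p)])
    return X, s[p:]                         # predict s(n) from the p past values

s = np.sin(0.2 * np.arange(300))            # illustrative stationary signal
X, targets = tapped_delay_patterns(s, p=5)

# Linear least-squares readout in place of the static MLP stage
w, *_ = np.linalg.lstsq(X, targets, rcond=None)
mse = np.mean((X @ w - targets) ** 2)
```

Because the signal is stationary, the fixed delay-line memory suffices; a drifting signal would call for the distributed TLFN described above.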
6. DYNAMICALLY DRIVEN RECURRENT NETWORKS
Another practical way to account for time in a neural network is to employ feedback at
the local or global level. Neural networks so configured are referred to as recurrent
networks.
We may identify two classes of recurrent networks:
1. Autonomous recurrent networks exemplified by the Hopfield network [20] and
brain-state-in-a-box (BSB) model. These networks are well suited for building
associative memories, each with its own domain of applications. Figure 7 shows
an example of a Hopfield network involving the use of four neurons.
2. Dynamically driven recurrent networks, which are well suited for input-output
mapping functions that are temporal in character.
Dynamically driven recurrent network architectures include the following:
1. Input-output recurrent model, commonly referred to as a nonlinear autoregressive
with exogenous inputs (NARX) model. Figure 8 shows an example of this net-
work.
2. State-space model, illustrated in Figure 9.
3. Recurrent multilayer perceptron, illustrated in Figure 10.
4. Second-order network, illustrated in Figure 11.
The first three configurations build on the state-space approach of modern control
theory. Second-order networks use second-order neurons, where the induced local
field (activation potential) of each neuron is defined by
v_k = Σ_i Σ_j w_{kij} x_i u_j (8)

Figure 7 A Hopfield network with four neurons; z^{-1} denotes the unit-delay
operators in the feedback paths.
To design a dynamically driven recurrent network, we may use any one of the
following approaches:
• Back-propagation through time (BPTT), which involves unfolding the temporal
operation of the recurrent network into a layered feedforward network [22].
This unfolding facilitates the application of the ordinary back-propagation
algorithm.
• Real-time recurrent learning, in which adjustments are made (using a gradient
descent method) to the synaptic weights of a fully connected recurrent network
in real time [23].
• Extended Kalman filter (EKF), which builds on classic Kalman filter theory
to compute the synaptic weights of the recurrent network. Two versions of the
algorithm are available [24]:
• Decoupled EKF
• Global EKF
The decoupled EKF algorithm is computationally less demanding but somewhat less
accurate than the global EKF algorithm.
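The unfolding idea behind BPTT can be checked on a toy scalar linear recurrent neuron, comparing the gradient obtained by walking through the unfolded network against a numerical gradient; the sequence and weight below are illustrative.

```python
import numpy as np

def forward(w, u):
    """Scalar linear recurrent neuron y(n) = w*y(n-1) + u(n), with y(-1) = 0."""
    y, ys = 0.0, []
    for u_n in u:
        y = w * y + u_n
        ys.append(y)
    return np.array(ys)

def bptt_grad(w, u, target):
    """Gradient of 0.5*(y(N-1) - target)^2 w.r.t. w, from the network
    unfolded in time: dy(N-1)/dw = sum_n w^(N-1-n) * y(n-1)."""
    ys = forward(w, u)
    delta = ys[-1] - target                 # error at the final time step
    grad = 0.0
    for n in range(len(u)):
        y_prev = 0.0 if n == 0 else ys[n - 1]
        grad += delta * w ** (len(u) - 1 - n) * y_prev
    return grad

u = np.array([1.0, -0.5, 2.0, 0.3])
w, target = 0.8, 1.0
g = bptt_grad(w, u, target)

eps = 1e-6                                  # numerical check of the gradient
num = (0.5 * (forward(w + eps, u)[-1] - target) ** 2
       - 0.5 * (forward(w - eps, u)[-1] - target) ** 2) / (2 * eps)
```

The powers of w in the unfolded gradient also hint at the vanishing-gradients problem discussed next: for |w| < 1, contributions from the distant past decay geometrically.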
A serious problem that can arise in the design of a dynamically driven recurrent
network is the vanishing gradients problem. This problem pertains to the training of a
recurrent network to produce a desired response at the current time that depends on
input data in the distant past [25]. It makes the learning of long-term dependencies
difficult.

Figure 8 Nonlinear autoregressive with exogenous inputs (NARX) model: a multilayer
perceptron fed through banks of unit delays by the past inputs
u(n - 1), ..., u(n - q + 1) and the fed-back outputs y(n), ..., y(n - q + 1),
producing the prediction ŷ(n + 1).

Figure 9 Simple recurrent network (SRN): a multilayer perceptron with a single
hidden layer, context units, and a bank of unit delays.

Figure 10 Recurrent multilayer perceptron with two hidden layers and banks of
unit delays.
7. CONCLUDING REMARKS
REFERENCES
[1] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ:
Prentice Hall, 1999.
[2] P. J. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Ph.D. thesis, Harvard University, 1974.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by
error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of
Cognition, D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, Chap. 8. Cambridge, MA: MIT Press, 1986.
[4] D. S. Broomhead and D. Lowe, Multivariable functional interpolation and adaptive net-
works. Complex Syst. 2: 321-355, 1988.
[5] T. Poggio and F. Girosi, Networks for approximation and learning. Proc. IEEE, 78: 1481-
1497, 1990.
[6] J. E. Moody and C. J. Darken, Fast learning in networks of locally-tuned processing units.
Neural Comput. 1: 281-294, 1989.
[7] P. V. Yee, Regularized radial basis function networks: Theory and applications to prob-
ability estimation, classification, and time series prediction. Ph.D. thesis, McMaster
University, 1998.
Just because the brain looks like a mess of porridge doesn't mean it's a cereal computer.
Michael Arbib
To understand the importance, current state, and future trends for ANNs, it is helpful
to review briefly the basic structure of ANNs and their development history.
70 Chapter 4 Neural Networks in Processing and Analysis of Biomedical Signals
a = Σ_{i=0}^{n-1} W_i X_i (1)
where X_i and W_i reflect the magnitudes of the inputs and their associated weights (for
i = 0, 1, ..., n - 1), respectively. The PE output Y is computed as a function of a and
an offset or threshold value θ as

Y = f(a - θ) (2)
Figure 1 An example of a simple artificial neuron consisting of a single processing
element with synaptic weights and connections.

Y = f(Σ_{i=0}^{n-1} W_i X_i - θ) (3)
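Equations (1)-(3) amount to a weighted sum passed through an offset activation; a minimal sketch follows, where the inputs, weights, and choice of f are illustrative.

```python
import numpy as np

def processing_element(X, W, theta, f=np.tanh):
    """Single PE: weighted sum a = sum_i W_i X_i, output Y = f(a - theta)."""
    a = np.dot(W, X)                        # Eq. (1)
    return f(a - theta)                     # Eqs. (2)-(3)

X = np.array([0.5, 1.0, -0.25])
W = np.array([0.2, -0.4, 0.6])
Y = processing_element(X, W, theta=0.1)
```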
one of the most essential hurdles that grounded the development of the ANNs field. In
1986-1987, many neural network research programs were initiated and the field gained
tremendous momentum.
between layers. By repeatedly presenting the network with training patterns and chan-
ging the weights to minimize the error term, the network is trained to give the desired
output for a given class of input. Weight changes can be carried out after each example
or after each presentation of the complete training set, sometimes called cumulative
back-propagation. By plotting the error term against the number of presentations, a
"learning curve" can be produced, which ideally converges to zero as the number of
presentations increases (for more details on back-propagation please see the Appendix).
An alternative form of network trained in a supervised fashion is ADALINE.
ADALINEs are much less complex than back-propagation trained MLPs but employ
similar strategies. The output of the ADALINE is a linear combination of the weighted
sums of the inputs. No sigmoid transfer function is used. An error term is then calcu-
lated and used to modify the network weights. The weighted sum is fed through a
thresholding unit to give a binary output.
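An ADALINE-style sketch: the weights are adapted with the Widrow-Hoff (LMS) rule on the linear error, and only the final output passes through a thresholding unit. The data, target rule, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.5, -0.5, 2.0])                 # hypothetical target rule
d = np.sign(X @ w_true)                             # desired binary outputs

w, eta = np.zeros(3), 0.02
for epoch in range(50):
    for x_i, d_i in zip(X, d):
        y_lin = w @ x_i                             # linear combination, no sigmoid
        w += eta * (d_i - y_lin) * x_i              # Widrow-Hoff (LMS) update

pred = np.sign(X @ w)                               # thresholding unit -> binary output
accuracy = np.mean(pred == d)
```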
Associative memories are another example of ANN learning in the supervised
mode. These networks consist of two sets of neurons, representing input and output
layers. All neurons in the input layer are connected to all neurons in the output layer.
By adjusting the connection weights between the neurons, it is possible to store
examples of input patterns and their corresponding output classes. Subsequent test
input patterns will produce outputs associated with the closest exemplar class.
Autoassociation memories are trained to produce an output identical to the input, so
when presented with noisy and incomplete test patterns they make an informed guess of
the missing data points and hence effectively achieve filtering (noise reduction).
We may think of the following analogy to compare the supervised with the unsu-
pervised mode of learning. Learning with supervision corresponds to classroom learn-
ing with the teacher's questions answered by the students and the answers corrected, if
needed, by the teacher. Unsupervised learning corresponds to learning the subject from
a videotape lecture covering the material but not including any other teacher's involve-
ment. The teacher lectures on directions and methods but is not available to provide
explanations and answer questions (Zurada, 1992).
In unsupervised learning the desired response or the target vector is unknown;
thus, explicit error information cannot be used to improve network behavior. Since
no information is available about the correctness or incorrectness of responses, learning
must somehow be accomplished based on observations of responses to input patterns
that we do not have much knowledge about. Unsupervised learning algorithms use
patterns that are typically redundant raw data having no labels regarding their class
membership or associations. In this mode of learning, the network must discover for
itself the existence of any possible patterns, regularities, distinguishing properties, and
so on. While discovering these possibilities, the network changes its parameters or
undergoes self-organization (Kohonen, 1990). This is called the self-organizing feature
maps (SOMs) algorithm. In this algorithm, the training process involves the presenta-
tion of pattern vectors from the training set one at a time. A winning node is selected in
a systematic fashion after all input features are presented. A weight adjustment process
takes place by using the concept of a neighborhood that shrinks over time and a
learning coefficient that also decreases with time. After several input patterns are pre-
sented, weights will form cluster centers that sample the input space such that the point
density function of the cluster centers approaches the probability density function of the
input features. The weights will also be organized such that topologically close output
nodes are sensitive to inputs that are physically similar. Thus, the output nodes will be
ordered in a natural way.
The unsupervised network simply consists of input and hidden neurons. The net-
work learns by associating different input pattern types with different clusters of hidden
nodes. When trained, different groups of hidden neurons respond to different classes of
input patterns. Some information about the number of clusters or similarity versus
dissimilarity of patterns can be helpful for this mode of learning.
The data used for training the neural net must be representative of the population
(over the entire feature space) for the network to model adequately the probability
density function of each class. It is also important that the training patterns be pre-
sented randomly. The network must be able to generalize to the entire training set as a
whole, not to the individual classes one at a time. Presenting classes of vectors sequen-
tially can result in poor convergence and unreliable class discrimination. Presenting the
patterns in random order generates a type of noise that can jog the network out of a local
minimum in the feature space. Noise is sometimes added to the training set to assist
convergence.
In a pattern recognition problem, the input to the ANN is the feature vector of the
unknown pattern. The feature vector is presented to each node in the input layer of the
network. Often the feature vector is augmented by an additional element that is chosen
to be unity. This provides an additional weight in the summation that acts as an offset
in the activation function. The unknown pattern is assigned to the class
specified by the output vector. The network, then, accepts a feature vector as input
and generates an output vector indicating a membership value corresponding to the
class to which the unknown pattern belongs.
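The unity-augmentation trick can be verified directly; the vectors and threshold below are illustrative.

```python
import numpy as np

x = np.array([0.5, -1.0, 0.25])
w = np.array([0.2, 0.4, -0.6])
theta = 0.3

a_explicit = w @ x - theta                 # offset handled explicitly

x_aug = np.append(x, 1.0)                  # augment the feature vector with unity
w_aug = np.append(w, -theta)               # the extra weight provides the offset
a_aug = w_aug @ x_aug                      # same activation, offset folded in
```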
In addition to the selection of the processing element (nonlinearity and the activa-
tion function) and network topology, the behavior of the network is determined by the
connection weights. Values for these weights are adjusted during the training of the
network and are fixed when the network is operating in a production mode.
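As an illustrative sketch (the two-feature input, tanh activation, and all weights below are assumptions for demonstration, not taken from any study in this chapter), a forward pass with the unity-augmented feature vector can be written as:

```python
import math

def forward(x, hidden_weights, output_weights):
    """Forward pass of a single-hidden-layer feed-forward network.

    The feature vector is augmented by a constant 1 so that the last
    weight of every node acts as an offset inside the activation function.
    """
    x = x + [1.0]  # augment the feature vector with unity
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in hidden_weights]
    hidden = hidden + [1.0]  # augment the hidden outputs as well
    # each output is a membership value for one class
    return [math.tanh(sum(w * hi for w, hi in zip(row, hidden)))
            for row in output_weights]

# hypothetical 2-feature input, 2 hidden nodes, 1 output class
memberships = forward([0.5, -0.2],
                      [[0.1, 0.2, 0.3], [0.4, -0.5, 0.6]],
                      [[0.7, -0.8, 0.9]])
```

In production mode these weight lists would be fixed; during training they would be adjusted as described in the Appendix.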
2.1. Processing and Analysis of Biomedical Signals
A main pillar of modern medicine is measurement of biomedical signals such as
cardiac electrical activity, heart sounds, respiratory flows and sounds, and brain and
other neural electrical activities. Almost all biomedical signals are contaminated with
noise artifacts due to sensing methods or environmental noise. In addition, extracting
useful information regarding any pathological conditions from measured physiological
signals demands careful analysis of the signal by an expert. Modern analog and digital
signal processing techniques have contributed significantly in making the analysis of
physiological signals easier and more accurate. For instance, these techniques help
remove different types of noise, identify inflection points, and combine multiple signals.
They also detect and classify changes in the signal due to pathological events and
conditions and transform the signal to extract hidden information not available in
the original signal.
2.2. Detection and Classification of Biomedical
Signals Using ANNs
The primary goal of processing biomedical signals is to identify the pathological
condition of the patient and monitor the changes in the condition over a course of
treatment or with a procedure. ANNs have the potential to provide such information as
demonstrated by their application in processing and analysis of biopotentials, medical
images, speech and auditory processing, and so forth. They are particularly useful in
78 Chapter 4 Neural Networks in Processing and Analysis of Biomedical Signals
The occurrence of longer and higher magnitude MUAPs reflects an increase in the
number or density of fibers in motor units or increases in the temporal dispersion of
the activity picked up by the recording electrode.
MYOs are a group of diseases that primarily affect skeletal muscle fibers. They are
divided into two groups: inherited and acquired. Most muscular dystrophies are her-
editary, causing severe degenerative changes in the muscle fibers. Polymyositis is an
example of an acquired myopathy, which is characterized by an acute or subacute onset
with muscle weakness progressing slowly over a matter of weeks. MYOs typically
shorten the MUAP duration and reduce its amplitude compared to normal. These
findings are attributed to fiber loss within the motor unit, with the degree of reduction
of duration and amplitude reflecting the amount of fiber loss.
In routine clinical electromyography, MUAP morphology is subjectively evaluated
by the technician or neurologist. However, a purely descriptive approach is not
sufficient, and exact quantitative measurement of different MUAP parameters is necessary.
In addition, manual analysis is time consuming and the subjective measurement of
MUAP parameters introduces variable sources of error.
During the past two decades, rapid advances in signal processing methods and
computing technology have made automated quantitative EMG analysis feasible.
Computer-aided EMG processing and diagnosis saves time, standardizes the measure-
ments, and enables the extraction of additional features that cannot be easily calculated
by manual methods. To further the development of quantitative EMG techniques, the
need for automated decision making has emerged.
Integrated computer-aided diagnosis of neuromuscular disorders has become pos-
sible by combining quantitative MUAP analysis techniques with the pattern recogni-
tion capabilities of ANNs. Pattichis et al. [11] used a parametric pattern recognition
algorithm that facilitated automatic MUAP feature extraction and combined this algo-
rithm with the classification abilities of ANN models to provide an integrated environ-
ment for the diagnosis of MND and MYOs. They investigated 10 network architectures
with two learning paradigms: supervised and unsupervised. For supervised learning,
they used back-propagation, and for unsupervised learning, they used Kohonen's self-
organizing feature maps algorithm.
In their comprehensive study, Pattichis et al. used a total of 880 MUAPs collected
from the biceps brachii muscle of 44 subjects: 14 normals, 16 with MND, and 14 with
MYOs. The mean and standard deviation of seven features were extracted from
MUAPs, giving a total of 14 features. Duration, spike duration, amplitude, area,
spike area, number of phases, and number of turns were used for classification into
normal, MND, and MYO. Ten different ANN architectures were designed with 14
input nodes, 3 output nodes, and 2 hidden layers with different numbers of nodes
(i.e., 10-5, 40-10, and 100-10).
The training set was formed by randomly selecting MUAP data from 24 (8 subjects
from each group) of the 44 subjects. The data from the remaining 20 subjects were used
for evaluating the performance of the ANN models. These sets were used for cluster
analysis and for training the back-propagation and Kohonen self-organizing feature
maps paradigms.
The diagnostic performance of the neural models investigated was of the order of
80-90% for models trained with the back-propagation algorithm and 80% for models
trained with Kohonen's self-organizing feature map algorithm. K-means cluster analysis
was also used on the same data set and was observed to offer poorer performance.
Section 2 Application of ANNs in Processing Information 83
spikes and seizures of great clinical significance. In addition, the use of ambulatory
monitoring, which produces 24-hour or longer continuous EEG recordings, is becom-
ing more common, creating a higher sense of urgency for the development of auto-
mated methods for detection and classification of EEG spikes. Over the past several
years, substantial progress has been achieved in the analysis of EEG signals and devel-
opment of automatic recognition of epileptiform transients [12]. Such methods usually
decompose the EEG signal into waves or half-waves by determining extrema amplitude.
Each wave is then examined for its fit to a set of predetermined criteria, such as
duration, amplitude, slope, and sharpness.
Accurate detection of spike/spike-wave (SSW) is now possible from artifact-free
signals. However, difficulties arise with EMG and EEG activities resembling spikes.
Such difficulties plague automatic spike detection methods. Neural network classifica-
tion methods can provide opportunities to improve on the performance of more tradi-
tional methods.
In recent years, ANNs have been used for detecting EEG spikes. ANN-based
spike detection systems basically use two different approaches for input representation:
(1) extracted EEG features and (2) raw EEG signal. In the first approach,
features such as slope and sharpness are extracted and presented to the ANN for
training and testing [13]. Success of such methods depends on proper selection of
features, which may not be known completely a priori. In the second approach, the
raw EEG signal is presented to the ANN after proper scaling and windowing [14].
The success of such methods relies heavily on selection of the appropriate window
size. A compromise is usually made between window size, effective EMG filtering,
and detection accuracy.
Kalayci and Ozdamar [15] implemented a family of three-layer MLP feed-forward
neural networks employing the back-propagation learning algorithm to detect EEG
spikes. They used wavelet transform (WT) to extract features from SSW and non-SSW
(any other activity including background EEG and EMG artifact) to train and test their
ANN classifiers. The EEG data were collected from five patients (two males and three
females with an average age of 13.8 years and range of 8-15 years) diagnosed with
epilepsy. In this study, a total of 3614 (761 SSW and 2853 non-SSW) files were gener-
ated. They used 1200 (400 SSW and 800 non-SSW) wavelet-transformed files randomly
selected to form the training set. The data from 2414 (361 SSW and 2053 non-SSW)
wavelet-transformed files were used as the testing set. Two sets of wavelet features,
Daubechies 4 and Daubechies 20, were extracted from SSW and non-SSW data files
containing record lengths of 512 EEG data points. To investigate the effects of the
different WT scales on detection, feature sets with 8 coefficients from resolution scale 1
and 24 coefficients from resolution scale 3 were used.
A family of MLPs was designed for this study. Each MLP had a different number
of input nodes (20, 16, 8), a variable number of hidden layer neurons (from 3 to 8), and
one output neuron. SSW and non-SSW events were represented as 0.8 and -0.8,
respectively, and a hyperbolic tangent function was used as the activation function.
Each network was trained and tested at least twice, and the best classification accuracy
was chosen.
Classification performance of the ANN models was measured using conventional
criteria. SSWs and non-SSWs were considered as positive and negative events, respec-
tively. A true positive (TP) outcome was registered when both the ANN and neurolo-
gists classified an EEG portion as SSW. False positive (FP), true negative (TN), and
false negative (FN) outcomes were similarly described. Sensitivity and specificity were
calculated as %TP/(TP + FN) and %TN/(FP + TN), respectively. The average of
these two percentages was used to calculate the overall classification accuracy.
Proper selection of the WT resolution scale with the variety of ANNs tested in this
study showed more than 90% accuracy in detection performance, as defined by the
average of sensitivity and specificity.
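The conventional criteria defined above translate directly into code; the counts in the usage line are hypothetical, chosen only to exercise the formulas:

```python
def detection_metrics(tp, fp, tn, fn):
    """Sensitivity %TP/(TP + FN), specificity %TN/(FP + TN), and the
    overall accuracy defined in the text as the average of the two."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (fp + tn)
    overall = (sensitivity + specificity) / 2.0
    return sensitivity, specificity, overall

# hypothetical counts from comparing ANN output with neurologists' labels
sens, spec, acc = detection_metrics(tp=350, fp=150, tn=1903, fn=11)
```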
Gastric dysrhythmia has been shown to be associated with motor disorders of the stomach,
whereas the relative amplitude of the EGG is associated with gastric contractility.
Therefore, accurate detection of gastric dysrhythmia as well as amplitude variations
of the EGG is very important in clinical applications. EGG dysrhythmias reflect
motor disorders of the stomach. Gastric dysrhythmia includes tachygastria (slow
wave frequency of 4-9 cpm), bradygastria (slow wave frequency of 0.5-2 cpm), and
arrhythmia (no rhythmic activity). It is known that gastric dysrhythmia is often short-lived.
ANNs have been used for automated diagnosis of delayed gastric emptying from
EGGs with encouraging results [16]. In this work, Lin et al. used five spectral para-
meters of the EGG data as inputs to a back-propagation neural network with three
hidden nodes to achieve a correct diagnosis rate of 80%. Although satisfied with their
encouraging results, these investigators further improved the performance of the
neural network by using genetic algorithms in conjunction with cascade correlation
learning architecture [17]. This algorithm offered the advantage of automatically
growing the architecture of the neural network to give a suitable network size, and
it also reduced the training time and complexity associated with the back-propagation
network. The algorithm enabled the group to conclude that a neural network with
three hidden units seems to be a good choice for this application. They could achieve
the correct diagnosis in 83% of cases with a sensitivity of 84% and a specificity of
82%. The authors stated that although these results were comparable to those
obtained with their back-propagation network, this approach eliminated the guess-
work associated with the size and connectivity pattern of the network in advance and
improved the detection speed.
In this study, the researchers acquired EGG data from 152 patients. Based on
gastric emptying tests, the patients were predefined as two classes: 76 patients with
delayed gastric emptying and 76 patients with normal gastric emptying. The training
set contained 38 patients with delayed gastric emptying and 38 with normal gastric
emptying randomly selected from 152 patients. The remaining 76 patients were used as
the testing set, which also contained 38 patients with delayed gastric emptying and 38 with
normal gastric emptying.
Bright et al. [18] investigated the use of ANN-based detectors to determine whether
a patient's goiter is causing upper airway obstruction (UAO). They explored the pos-
sibility of processing the flow-volume loops from standard forced expiratory vital
capacity (FVC) maneuvers [19] to establish whether goiter has caused UAO. Flow-
volume curves from 155 patients attending a specialized thyroid clinic were obtained
using the recommended three maneuvers [20]. From these traces, the trace with highest
peak flow was selected for processing. Using these loops and other available physiologic
data, it was determined that 46 of the patients had UAO. Flow-volume loops from
these patients and 51 subjects who were thought not to have UAO were selected as set 1
and set 2. Further, flow-volume loops from 50 other patients, not in the pool of the preceding
155 patients, with chronic obstructive pulmonary disease (COPD) were included as set
3. The performance of human experts in detecting the presence of UAO using only
flow-volume loops was established in the following fashion.
The three sets of loops described were mixed randomly. Two expert clinicians
independently examined the traces. They assessed each trace for the presence of
UAO by assigning a score on an ordinal scale from 1 to 4. In this scoring, the assigned
numbers had the following significance: 1, not at all certain; 2, moderately certain; 3,
quite certain; and 4, very certain. Eight weeks later, each observer was asked to repeat the
scoring of the records independently. The inter- and intraobserver agreements between
the first and second scorings were measured by computing a kappa factor [21]. The
interobserver kappa factors for the first and second scorings were 0.58 and 0.68, respec-
tively. The intraobserver kappa was 0.5 for one observer and 0.46 for the other. This
process demonstrated the subjectivity and relative nonrepeatability of human scoring of
the data. Bright et al. used several standard measures obtained from the FVC maneuver
as inputs. These included peak expiratory flow (PEF), the exhaled volume during the
first second of the FVC maneuver (FEV1), and FEV1/FVC. Further, they developed
several novel indices to quantify the shape of the flow-volume loop. Specifically, these
features aimed at reflecting the observed relative flatness of the upper part of the loop
(high-flow region) for patients with UAO as compared with the loops for patients
without UAO. For instance, the exhaled volume range for each patient was divided
into 20 points and the corresponding measured flow at each point was expressed as a
percentage of measured peak flow. Initially, the standard flow-volume loop measures as
well as the devised novel measures, 50 features total, were used as input to an ANN.
The number of input nodes for this network was equal to the total number of features
(i.e., 50). The network also had two hidden layers with the number of nodes in the first
layer equal to approximately twice the number of input nodes and that in the second
layer equal to half the input nodes. The network had two outputs. Approximately one-
half to two-thirds of the flow-volume curves for each patient category from the patient
test record was used to train the network. The remaining records were combined to
form a test set. After completion of training, the total weight from each network input
node to each output node was computed. These sums were compared with each other,
and the five highest weight sums were selected as a reduced input set. The number of
inputs was then reduced to five.
To evaluate the performance of the second ANN with reduced number of inputs, it
was presented with 67 records that included 17 from data set 1 (i.e., patients with
UAO), 28 from set 2 (i.e., patients with goiter only), and 22 from set 3 (i.e., COPD
patients). The network exhibited 88% sensitivity, 94% specificity, and 92% total accu-
racy. To compare the performance of the devised ANN with that of other classification
methods, an analysis of the patient records using logistic regression was performed. In
this study, dependent variables for the logistic regression classifier were the same as the
inputs to the ANN. When tested with the patient records described earlier, the logistic
regression model had a sensitivity of 70.6%, specificity of 92%, and total accuracy of
86.6% (all at 5% significance level). Bright et al. concluded that the performance of the
ANN was superior to those of human and logistic regression classification.
It has been shown by Burk et al. [21] that pharyngeal wall vibration can be used as
an indicator of imminent collapse of the airway during sleep for patients with obstruc-
tive sleep apnea (OSA). They initially devised a classical method to detect pharyngeal
wall vibration (PWV) [22]. They applied this methodology to adjust automatically the
pressure applied by a continuous positive airway pressure (CPAP) machine, commonly
used to treat patients with OSA. They demonstrated that this approach is as effective as
treating OSA patients with conventional CPAP [23]. In their investigation of the
proposed system Behbehani et al. [24] determined that the performance of the phar-
yngeal wall vibration detector was satisfactory. However, they established that the
detection of PWV was not 100%. Lopez et al. [25] set out to develop an ANN-based
detector of pharyngeal wall vibration for patients with OSA. They aimed at improving the
detection rate over the previous classical method.
Lopez et al. designed an ADALINE network with 15 input nodes (plus 1 bias), two
hidden layers each having two nodes, and one output node. A schematic diagram of the
network is shown in Figure 4.
This topology was selected through a competitive comparison of the performance
of six distinct topologies. Specifically, 2853 data vectors containing pharyngeal vibra-
tion episodes combined with 4352 data vectors without any pharyngeal vibration epi-
sodes (a total of 7205) were used to train the competing topologies and select the most
effective one. The efficacy of the networks was judged by the percentage of false positive
and the percentage of false negative detections as well as the final error. The training data were
collected from five volunteer patients (three male and two female). Input signals to the
networks were derived from sensed pressure at the nasal mask. For this purpose, 32-
point ensembles from the nasal mask pressure were formed and their fast Fourier
transforms (FFTs) were obtained. The spectra were normalized for each patient by
computing the average value of the spectra for the last 15 values of the spectra, exclud-
ing the low-frequency value.
Performance of the network was evaluated by comparing its detection of PWVs
with the classical method. The evaluation data came from a group of five volunteer
male patients other than those whose data were used for selecting and training the
network. The comparison was made at three pressure levels: 4, 8, and 13 cm H2O
(i.e., low, medium, and high). Statistical comparison of the results showed that the
ANN-based network had a lower percentage of false negatives at 4 and 13 cm H2O,
while the classical method had a lower percentage of false negatives at 8 cm H2O (with a
significance level of 0.05). Similarly, for 4 and 13 cm H2O, the percentage of false
positives for the ANN-based detector was lower than for the classical method.
However, at 8 cm H2O there was no significant difference between the two methods.
Further, the ANN-based detector had the same percentage of false positives and false
negatives at all pressures. A similar comparison for the classical method showed that
at 4 cm H2O the percentage of false positives was less than the percentage of false
negatives; there was no difference at 8 and 13 cm H2O. Lack of dependence on
operating pressure for the ANN-based detector is an attractive feature, as it allows
one to apply the same detector across all pressures [24,26].
artifact, and baseline wander. Visual inspection should also be included in the assess-
ment of the quality of reconstruction. It is important to detect the time of occurrence
and the type of distortion.
With the increase in ECG signal recording and monitoring in clinical diagnosis of
cardiac disorders, an enormous amount of ECG data is collected. If a patient's ECG is
sampled at 500 samples per second and if each sample takes 12 bits, then a typical 30-
second record of the 12-lead ECG requires 270 kbytes and a 12-hour Holter record
requires 129.6 Mbytes of memory. Therefore, there is a clear need for ECG compression
algorithms to store, archive, and transmit this information. Many different compres-
sion algorithms have been developed to achieve this important goal [28].
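The storage figures quoted above follow from simple arithmetic; the 30-second 12-lead record, for example, works out as:

```python
def ecg_storage_bytes(sample_rate_hz, bits_per_sample, n_leads, seconds):
    """Uncompressed storage required for a multi-lead ECG recording."""
    return sample_rate_hz * bits_per_sample * n_leads * seconds // 8

# 30-second, 12-lead record at 500 samples/s and 12 bits/sample
record_bytes = ecg_storage_bytes(500, 12, 12, 30)  # 270,000 bytes = 270 kbytes
```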
One such method is based on principal component analysis (PCA) [29]. This is also
known as the Karhunen-Loeve (or the eigenvector) transform (KLT). Basically, it
consists of finding a linear combination of the original signal such that the obtained
signals are orthogonal and their variance is maximized. KLT is an optimal transform in
the sense that the least number of orthonormal functions is needed to represent the
original signal for a given root mean square (rms) error. Moreover, the PCA results in
uncorrelated transform coefficients (diagonal covariance matrix) and minimizes the
total entropy compared with any other transform. The main drawback of this technique
is that it requires the computation of the eigenvalues and eigenvectors of the correlation
matrix of the data set, which, in the case of ECG signals, is a very large matrix.
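As a toy illustration of the idea (a two-dimensional stand-in, nowhere near ECG-sized data), the KLT amounts to diagonalizing the data covariance matrix; in 2-D the eigenvalues, i.e., the variances along the principal axes, are available in closed form:

```python
import math

def principal_variances_2d(points):
    """Eigenvalues of the 2x2 covariance matrix of 2-D data: the variances
    along the two orthogonal principal axes, largest first."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    tr = sxx + syy                   # trace of the covariance matrix
    det = sxx * syy - sxy * sxy      # determinant
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return tr / 2.0 + disc, tr / 2.0 - disc

# points lying exactly on the line y = 2x: all variance on the first axis
lam1, lam2 = principal_variances_2d([(0, 0), (1, 2), (2, 4), (-1, -2)])
```

Keeping only the components with large variance is what yields the compression; here the second variance is zero, so one coefficient per point represents the data exactly.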
Neural network implementation of PCA provides a means for unsupervised fea-
ture discovery and dimensionality reduction. Al-Hujazi and Al-Nashash [30] described
a method for the compression of ECG data using the generalized Hebbian algorithm
[31]. This algorithm is based on an interesting observation that a single linear neuron
with a Hebbian-type adaptation rule for its weights can evolve into a filter for the first
principal component of the input distribution. Al-Hujazi et al. extended the algorithm
to produce a multiple Hebbian neural network to reduce the computation time
necessary to calculate an arbitrary number of principal components of the input vector. They
found it necessary to implement a multiple neural network due to large variations
among ECG arrhythmias.
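The single-neuron observation can be sketched with Oja's stabilized form of the Hebbian rule, a close relative of the generalized Hebbian algorithm of [31]; the data, learning rate, and epoch count below are illustrative assumptions:

```python
import random

def first_principal_direction(data, dim, lr=0.01, epochs=200, seed=1):
    """A single linear neuron trained with a Hebbian-type rule (Oja's
    stabilized form); its weight vector converges toward the direction
    of the first principal component of the input distribution."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(epochs):
        for x in data:
            y = sum(wi * xi for wi, xi in zip(w, x))   # neuron output
            # Hebbian growth (lr * y * x) with Oja's decay term (-lr * y^2 * w)
            w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = sum(wi * wi for wi in w) ** 0.5
    return [wi / norm for wi in w]

# zero-mean data whose variance is concentrated along the first axis
samples = [(3.0, 0.1), (-3.0, -0.1), (3.0, -0.1), (-3.0, 0.1)]
direction = first_principal_direction(samples, dim=2)
```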
The method was tested on normal data as well as data representing three different
types of pathologies obtained from the MIT/BIH ECG database, which resulted in a
compression ratio (CR) up to 30 with a percent root mean square error (PRD) of 5%.
However, it is emphasized that for this method to be useful, the training set must
include all expected arrhythmias.
Hamilton et al. developed a compression algorithm for ECG signals based on an
autoassociative neural network [32]. To achieve compression, they first detected the
QRS complexes and then compressed the ECG signal using an autoassociative ANN,
one in which the input and output patterns are the same. A multilayer perceptron with
the back-propagation learning algorithm was employed. This consisted of a hidden
layer with a reduced number of nodes to produce compression. Since the majority of
beats within a given recording segment have the same gross morphology, by storing an
average waveform and compressing the difference only, optimum use was made of the
ANN's compression capability. They used network sizes with 360 inputs, 360 outputs,
and a variable number of hidden nodes: 6, 8, 9, 10, 12, 15, 18, 24, 36, and 72. They
showed that the compression ratio of the network was controlled by the ratio of hidden
layer neurons to input-output layer neurons, with the poorest PRD occurring at higher
compression ratios. They achieved a CR of 10 with a PRD of 7.1% for a network size
Additional Reading and Related Material 91
of 360-36-360. It was also shown that removal and separate storage of the DC offset for
each beat before compression improved the overall system performance. Removal of
the DC component resulted in a CR of 10 with a PRD of 4.6% for a network of size
360-30-360.
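The average-waveform trick that Hamilton et al. exploit can be sketched on its own (the beats here are short made-up vectors; in the actual system the low-amplitude residuals, not the raw beats, would be fed to the autoassociative network):

```python
def split_average_and_residuals(beats):
    """Store one average beat plus low-amplitude per-beat residuals,
    since most beats in a segment share the same gross morphology."""
    n = len(beats[0])
    avg = [sum(beat[i] for beat in beats) / len(beats) for i in range(n)]
    residuals = [[beat[i] - avg[i] for i in range(n)] for beat in beats]
    return avg, residuals

def reconstruct(avg, residuals):
    """Inverse step: add the stored average back to each residual."""
    return [[r[i] + avg[i] for i in range(len(avg))] for r in residuals]

# three made-up "beats" of three samples each
beats = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.2], [-0.1, 1.1, 0.2]]
avg, residuals = split_average_and_residuals(beats)
```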
Another popular method of data compression is vector quantization (VQ).
Basically, it involves the creation of a codebook of vectors. Creating the optimum
codebook will achieve the best possible data compression. The basic vector design
uses an encoder to replace an input vector Xn with a vector from the codebook.
Compression of the data is achieved by using the address of the codebook vector in
place of the original vector. If the codebook has C elements, then the rate of the
quantizer, R, in bits/vector is R = log2 C. The resulting rate in bits/sample
is R/N, where N is the number of samples in each vector. One of the advantages
of VQ is that fractional bit rates per sample are achievable.
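A minimal numeric check of the codebook arithmetic (the codebook size and vector length are illustrative):

```python
import math

def vq_rate_bits_per_sample(codebook_size, samples_per_vector):
    """R = log2(C) bits per transmitted codebook index, spread over
    the N samples that each vector covers."""
    return math.log2(codebook_size) / samples_per_vector

# a 256-entry codebook of 32-sample vectors costs only 0.25 bit/sample,
# a fractional rate impossible with sample-by-sample (scalar) coding
rate = vq_rate_bits_per_sample(256, 32)
```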
Many different techniques are available to create a codebook that best spans the
data of interest [33]. Neural network implementation of VQ provides a means to create
a codebook of vectors that attempt to span the low-frequency components of the ECG
signal. Since these vectors are potentially less informative, less important information
will be distorted. McAuliffe [34] used a Kohonen neural network that adapted the
codebook vectors based on distance measurements and controlled the scope of the
changes based on time. Compression of the signal was achieved by inserting the address
of the codebook vector that best represented the original vector in place of the vector.
The results showed that minimal distortion was introduced into the ECG, with com-
pression ratios ranging from 3 : 1 to 19 : 1, depending on noise content and heart rate.
• H. Demuth and M. Beale, Neural Network Toolbox for Use with MATLAB,
User's Guide, Chap. 5, fifth printing—version 3, Math Works Inc., 1998.
• MATLAB Neural Network Toolbox, nnet M-files.
• http://www.mathworks.com/ftp/nnetsv5.shtml
Due to its widespread use, we briefly present the steps in training an artificial
neural network using the standard back-propagation technique. In this presentation, we
use a network with two hidden layers to illustrate the training process. However, the
method can easily be applied to networks with more than two hidden layers. Consider
the network shown in Figure A1.
It has n input nodes (Xi for i = 1, 2, ..., n), two hidden layers with p nodes in the
first layer (Yh for h = 1, 2, ..., p) and q nodes in the second layer (Zj for j = 1, 2, ..., q),
and m output nodes (Tk for k = 1, 2, ..., m). The activation functions for the hidden and
output nodes are usually selected to have the following properties. They are continuous,
differentiable, and monotonically nondecreasing. One of the most commonly used
activation functions is a bipolar sigmoid function with a range of -1 to 1. This function
can be expressed as
f(g) = 2/(1 + exp(-g)) - 1
The shape of this function is shown in Figure 2. An additional desirable feature of
this function is that its derivative can be expressed in terms of the function itself. That
is,
f'(g) = (1/2)[1 + f(g)][1 - f(g)]
where f'(g) is the derivative of f with respect to g. This feature of the function speeds up
the computation of the activation function derivative that is an essential part of the
back-propagation training method.
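A direct transcription of the bipolar sigmoid and its self-referential derivative:

```python
import math

def f(g):
    """Bipolar sigmoid activation with range (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-g)) - 1.0

def f_prime(g):
    """Derivative written in terms of the function itself,
    f'(g) = (1/2)[1 + f(g)][1 - f(g)], so no extra exp() is needed."""
    fg = f(g)
    return 0.5 * (1.0 + fg) * (1.0 - fg)
```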
The process of training the network starts with randomly selecting values for
connecting weights u11, u12, ..., unp; v11, v12, ..., vpq; and w11, w12, ..., wqm. The
computations that follow the initial selection of these weights can be divided into three
distinct stages, with each stage having a few steps. The first stage is called feed-forward
and it involves three steps. In the first step, each input node receives the signal for
training the network, xi for i = 1, 2, ..., n. The input nodes then pass these inputs
through the connecting branches to the first hidden layer as
y'h = Σ(i=1 to n) xi uih    (A1)
where y'h represents the input signal to the hidden node h = 1, 2, ..., p in the first
hidden layer. In the second step of the feed-forward stage, the outputs of the nodes
in the first hidden layer are computed and applied to the second layer. Specifically,
yh = f(y'h)    (A2)
and
z'j = Σ(h=1 to p) yh vhj    (A3)
where z'j is the input to the hidden node j in the second hidden layer and
zj = f(z'j)    (A4)
is the output from node j. The third and final step in the feed-forward stage is to
compute the network output using the signals generated by the nodes in the second
layer of the network. That is,
T'k = Σ(j=1 to q) zj wjk    (A5)
where T'k is the input to the output node k and the output from the node is given by
Tk = f(T'k)    (A6)
The second stage in this method of training the network is called back-propagation of
the error. In this stage, a three-step process is also followed. First, the error for each
output node is computed by
ek = Rk - Tk    (A7)
where Rk is the desired or target output. The correction factor δk for the weights on the
branches ending at the output node k is computed as
δk = ek f'(T'k)    (A8)
In addition, the increment for weight correction is obtained from
Δwjk = α δk zj    (A9)
where α is the learning factor, normally chosen to be less than unity. The next step is to
back-propagate the δk's through the last layer (in this case the second layer) of hidden
nodes to obtain
δ'j = Σ(k=1 to m) δk wjk    (A10)
The correction factors for the nodes in the second layer are computed in a manner
similar to the computation of δk for the output nodes. That is,
δj = δ'j f'(z'j)    (A11)
and the incremental corrections to the weights ending at node j of the second layer are
computed as
Δvhj = α δj yh    (A12)
Similarly, the correction for the branch weights connecting to the nodes in the
other hidden layer(s) (in this case the first layer) can be computed as
δ'h = Σ(j=1 to q) δj vhj    (A13)
and
δh = δ'h f'(y'h)    (A14)
The correction increment for these weights is obtained from
Δuih = α δh xi    (A15)
Equations A12 and A15 are often called the delta rule.
The third and final stage in the back-propagation algorithm is updating the
weights. In this stage, the weights on the branches connecting to the nodes are updated
as follows:
wjk(new) = wjk(old) + Δwjk    (A16)
for weights connecting to each output node (j = 1, 2, ..., q; k = 1, 2, ..., m),
vhj(new) = vhj(old) + Δvhj    (A17)
for each hidden unit in the second layer (h = 1, 2, ..., p; j = 1, 2, ..., q), and
uih(new) = uih(old) + Δuih    (A18)
for weights connecting to each unit in the first hidden layer (i = 1, 2, ..., n;
h = 1, 2, ..., p).
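The three stages (Eqs. A1-A18) can be collected into one pattern-mode training step. The following is a compact sketch with made-up layer sizes, initial weights, and target, not production code; note that f'(g) is evaluated through the already-computed node outputs:

```python
import math

def f(g):
    """Bipolar sigmoid activation."""
    return 2.0 / (1.0 + math.exp(-g)) - 1.0

def fp(out):
    """f'(g) computed from the already-known output f(g)."""
    return 0.5 * (1.0 + out) * (1.0 - out)

def train_step(x, r, u, v, w, alpha=0.1):
    """Feed-forward (A1-A6), back-propagation of the error (A7-A15),
    and weight update (A16-A18) for one training example."""
    # --- stage 1: feed-forward ---
    y = [f(sum(x[i] * u[i][h] for i in range(len(x)))) for h in range(len(u[0]))]
    z = [f(sum(y[h] * v[h][j] for h in range(len(y)))) for j in range(len(v[0]))]
    t = [f(sum(z[j] * w[j][k] for j in range(len(z)))) for k in range(len(w[0]))]
    # --- stage 2: back-propagate the error ---
    dk = [(r[k] - t[k]) * fp(t[k]) for k in range(len(t))]
    dj = [fp(z[j]) * sum(dk[k] * w[j][k] for k in range(len(dk)))
          for j in range(len(z))]
    dh = [fp(y[h]) * sum(dj[j] * v[h][j] for j in range(len(dj)))
          for h in range(len(y))]
    # --- stage 3: update the weights (delta rule) ---
    for j in range(len(w)):
        for k in range(len(w[j])):
            w[j][k] += alpha * dk[k] * z[j]
    for h in range(len(v)):
        for j in range(len(v[h])):
            v[h][j] += alpha * dj[j] * y[h]
    for i in range(len(u)):
        for h in range(len(u[i])):
            u[i][h] += alpha * dh[h] * x[i]
    return t

# tiny illustrative network: 2 inputs, 2 + 2 hidden nodes, 1 output
u = [[0.1, 0.2], [0.3, -0.1]]
v = [[0.2, -0.3], [0.4, 0.1]]
w = [[0.5], [-0.2]]
x, r = [1.0, -1.0], [0.5]
initial_error = abs(r[0] - train_step(x, r, u, v, w)[0])
for _ in range(500):
    out = train_step(x, r, u, v, w)
final_error = abs(r[0] - out[0])
```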
In practice, the network is trained by presenting it with a number of examples.
Consider the training data to consist of input X(λ) and output R(λ) pairs, where
X(λ) = [x1(λ), x2(λ), ..., xn(λ)]^T    (A19)
and
R(λ) = [R1(λ), R2(λ), ..., Rm(λ)]^T    (A20)
In (A19) and (A20) the argument λ denotes the example number in the training set. For
instance, a training set may have 5000 examples, in which case λ = 1, 2, ..., 5000. In the
pattern mode of training, the network is presented with one example from the training
set at a time, and all three stages of the back-propagation training, including the updating
of the weights (Eqs. A16 through A18) in the third stage, are carried out. The
training process is often repeated by randomizing the order in which the examples
are presented to the network. This can be accomplished either by randomizing all of
the examples in the training set for each pass or by establishing epochs that contain a
subset of the training examples. For instance, if there are 5000 total examples, they may
be grouped into 100 epochs each containing 50 examples. The order in which the epochs
are presented to the network is then randomized.
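The epoch bookkeeping described above might be coded as follows; the 5000-example, 50-per-epoch split mirrors the example in the text:

```python
import random

def make_randomized_epochs(examples, epoch_size, seed=0):
    """Shuffle the training examples, group them into fixed-size epochs,
    and randomize the order in which the epochs will be presented."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    epochs = [shuffled[i:i + epoch_size]
              for i in range(0, len(shuffled), epoch_size)]
    rng.shuffle(epochs)
    return epochs

# 5000 examples grouped into 100 epochs of 50, as in the text
epochs = make_randomized_epochs(range(5000), 50)
```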
There are alternative modes of training such as batch mode, in which the weights
are not updated for each example in an epoch; rather they are updated after all exam-
ples within an epoch are presented to the network. However, description of these
methods is beyond the scope of this chapter and the reader is referred to [1] and [35].
REFERENCES
[1] S. Haykin, Neural Networks: A Comprehensive Foundation. Macmillan and IEEE Computer
Society, 1994.
[2] Y. S. Tsai, B. N. Huang, and S. F. Tung, An experiment on ECG classification using back-propagation neural network. Proceedings, Annual International Conference IEEE/EMBS, pp. 1463-1464, 1990.
[3] J. Bortolan, R. Degani, and J. L. Willems, ECG classification with neural networks and
cluster analysis. Proc. Comput. Cardiol. 177-180, 1991.
96 Chapter 4 Neural Networks in Processing and Analysis of Biomedical Signals
[4] Y. H. Hu, W. J. Tompkins, and Q. Xue. Artificial neural network for ECG arrhythmia
monitoring. In Neural Network for Signal Processing II, S. Y. Kang, F. Fallside, J. A.
Sorenson, and C. A. Kamm, eds., pp. 350-359, Piscataway, NJ: IEEE Press, 1992.
[5] J. Nandal and M. de C. Bossan, Classification of cardiac arrhythmias based on principal
component analysis and feedforward neural networks. Comput. Cardiol. 341-344, 1993.
[6] J. P. Marques de Sa, A. P. Goncalves, F. O. Ferreira, and C. Abreu-Lima, Comparison of
artificial neural network based ECG classifiers using different feature types. Comput.
Cardiol. 545-547, 1994.
[7] Y. H. Hu, W. J. Tompkins, J. L. Urrusti, and V. X. Alfonso, Applications of artificial neural networks for ECG signal detection and classification. J. Electrocardiol. 26(Suppl): 66-73, 1993.
[8] R. Watrous and G. Towell, A patient-adaptive neural network ECG patient monitoring
algorithm. Comput. Cardiol. 229-232, 1995.
[9] S. Barso, M. Fernandez-Delago, J. A. Vial-Sobrino, C. V. Reguerio, and E. Sanchez,
Classifying multichannel ECG patterns with an adaptive neural network. IEEE/EMB
Mag. 17(1): 45-55, 1998.
[10] W. Trojaborg, Motor unit disorders and myopathies. In Textbook of Clinical
Neurophysiology, M. A. Halliday, R. S. Butler, and R. Paul, eds., pp. 417-438. New
York: Wiley, 1987.
[11] C. S. Pattichis, C. N. Schizas, and L. T. Middleton, Neural network models in EMG diagnosis. IEEE Trans. Biomed. Eng. 42(5): 486-496, 1995.
[12] J. D. Frost, Automatic recognition and characterization of epileptiform discharges in human
EEG. J. Clin. Neurophysiol. 2(3): 231-249, 1985.
[13] C. Eberhart and R. W. Dobbins, Neural Networks PC Tools—A Practical Guide, Chap. 10,
Case study I: Detection of electroencephalogram spikes. San Diego: Academic Press, 1990.
[14] O. Ozdamar, G. Zhu, I. Yaylali, and P. Jayakar, Real-time detection of EEG spikes using
neural networks. IEEE EMBS 14th International Conference Proceedings, pp. 1022-1023,
1992.
[15] T. Kalayci, and O. Ozdamar, Wavelet preprocessing for automated neural network detec-
tion of EEG spikes, IEEE EMB Mag. 14(2): 160-166, 1995.
[16] Z. Y. Lin, J. D. Z. Chen, and R. W. McCallum, Noninvasive diagnosis of delayed gastric
emptying from cutaneous electrogastrograms using multi-layer feedforward neural net-
works. Gastroenterology, 112(4):A777, 1997.
[17] Z. Y. Lin, H. Liang, J. D. Z. Chen, and R. W. McCallum, Application of combined genetic
algorithms in conjunction with cascade correlation to diagnosis of delayed gastric emptying
from electrogastrograms. IEEE EMBS 19th International Conference Proceedings, pp. 1355-
1358, 1997.
[18] P. Bright, M. R. Miller, J. A. Franklyn, and M. C. Sheppard, The use of a neural network to
detect upper airway obstruction caused by goiter. Am. J. Respir. Crit. Care Med., 157: 1885-
1891, 1998.
[19] Guidelines for the measurement of respiratory function. Recommendations of the BTS and
ARTP. Respir. Med. 88: 165-194, 1994.
[20] J. Cohen, A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20: 37-46, 1960.
[21] J. R. Burk, E. A. Lucas, J. R. Axe, K. Behbehani, and F. Yen, Auto-CPAP in the treatment
of obstructive sleep apnea: A new approach. 1992 Annual Meeting Abstracts, Association of
Professional Sleep Societies 6th Annual Meeting. Sleep Res. 22: 61, 177, 1992.
[22] K. Behbehani and T. Kang, A microprocessor-based sleep apnea ventilator. Proceedings of
11th Annual International Conference of IEEE Engineering in Medicine and Biology, Seattle,
November 1989.
1. INTRODUCTION
Training neural networks to recognize events that occur with low probability is sig-
nificant in many applications. This is a difficult and challenging problem. When we
investigated the use of neural networks to identify regulatory regions in human genomic
sequences, we realized that there are rare events in the sequences. That is, repetitive
DNA sequences, namely Alu regions, represent small portions of genomic sequences,
which consist mostly of non-Alu regions. This results in a shortage of examples. We
propose two schemes to solve these problems by neural networks and sample stratifica-
tion.
Sample stratification is a technique for making each class in a sample have equal
influence during learning of neural networks [1,2]. It is preferable to use a stratified
sample that includes an equal number of examples from each class in the training
sample for classification with neural networks. However, it is usually not possible to
make a sample stratified because we cannot have enough examples in rare event cases.
The first method presented in this chapter stratifies a sample by adding up the weighted
sum of the derivatives during the backward pass of training. The second method uses
bootstrap aggregating. After training neural networks with multiple sets of boot-
strapped examples of rare event classes and subsampled examples of common event
classes, we do multiple voting for classification. These two schemes make rare event
classes have a better chance of being included in the sample for training and improve
the classification accuracy of neural networks. We demonstrate the performance of the
two schemes with real human genomic sequences for locating regulatory regions
obtained from the National Center for Biotechnology Information (NCBI)
Repository. We also compare the results of the proposed methods with those of
Bayesian classifiers with two-dimensional Gaussian-distributed data.
2. SAMPLE STRATIFICATION
3. STRATIFYING COEFFICIENTS
E_p = (1/2) Σ_j [D_pj − Z_pj(x_p, w)]² c(x_p)   (1)
where D_pj is the jth component of the desired output vector due to the presentation of input vector p. The output of node j of the output layer, which is the Nth layer, is denoted as Z_pj(x_p, w). The stratifying coefficient (SC) c(x_p), evaluated at the present input vector, is determined by the ratio of the probability of the common event to the probability of the rare event. The dependence of Z_pj on the present input vector x_p and the weights, denoted by w, will be suppressed in the following notation.
The output of node j in the mth layer due to the presentation of the input vector p
is defined as
Z_pj^m = f(Y_pj^m),   Y_pj^m = Σ_i w_ji^m Z_pi^(m−1)   (2)

where w^m denotes the weight matrix between the mth and the (m − 1)st layers of the network.
The back-propagation algorithm applies a correction Δw_ji^m to synaptic weight w_ji^m that is proportional to the instantaneous gradient ∂E_p/∂w_ji^m, which may be expanded by the chain rule.
The negative of the gradient vector components of the error E_p with respect to Y_pj^m are given by

δ_pj^m = −∂E_p/∂Y_pj^m   (6)
Applying the chain rule allows this partial derivative to be written as

δ_pj^m = −(∂E_p/∂Z_pj^m)(∂Z_pj^m/∂Y_pj^m)   (7)

The second factor is

∂Z_pj^m/∂Y_pj^m = f'(Y_pj^m)   (8)

which is simply the first derivative of the activation function evaluated at the present input to that particular node.
In order to compute the first term, consider two cases. In the first case, the error
signal is developed at the output layer N. This can be computed from Eq. 1 as
∂E_p/∂Z_pj^N = −[D_pj − Z_pj^N] c(x_p)   (9)
Substituting Eqs. 8 and 9 into Eq. 7 yields

δ_pj^N = [D_pj − Z_pj^N] c(x_p) f'(Y_pj^N)   (10)
For the second case, when computing the error terms for some layer other than the output layer, the δ_pj's can be computed recursively from those associated with the output layer as

δ_pj^m = f'(Y_pj^m) Σ_k δ_pk^(m+1) w_kj^(m+1)   (12)
These results can be summarized in three equations. First, an input vector, x_p, is propagated through the network until an output is computed for each of the output nodes of the output layer. These values are denoted as Z_pj^N. Next, the error terms associated with the output layer are computed from Eq. 10. The error terms associated with each of the other m − 1 layers of the network are computed from Eq. 12. Finally, the weights are updated as

w_ji^m ← w_ji^m + η δ_pj^m Z_pi^(m−1)   (13)

where η represents the learning rate of the network. Usually η is chosen to be some nominal value such as 0.01.
From [7], it is seen that the only change to the back-propagation algorithm is the
inclusion of the stratifying coefficient in Eq. 10. All other steps of the algorithm remain
the same.
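As a concrete illustration of this single change, the sketch below computes the output-layer error term of Eq. 10 with the stratifying coefficient included; the logistic activation and the function name are assumptions made for this example, not mandated by the algorithm:

```python
import math

def sc_output_delta(d, z, y, c):
    """Output-node error term of Eq. 10: delta = (D - Z) * c(x) * f'(Y).
    Here f is assumed logistic, so f'(Y) = f(Y) * (1 - f(Y))."""
    fy = 1.0 / (1.0 + math.exp(-y))
    return (d - z) * c * fy * (1.0 - fy)

# A rare-event example with c(x) = 4 receives a proportionally larger
# correction than the same example with c(x) = 1.
weighted = sc_output_delta(1.0, 0.5, 0.0, 4.0)
unweighted = sc_output_delta(1.0, 0.5, 0.0, 1.0)
```

The coefficient simply scales the usual delta-rule error signal, which is why every other step of back-propagation is untouched.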
E = Σ_{j=1}^{2} ∫ Σ_{i=1}^{2} [Z_i(X) − D_i]² f(X, C_j) dX   (14)
This equation represents a sum of squared errors, with two errors appearing for each input-class pair. For a particular pair of input X and class C_j, each error, Z_i(X) − D_i, is simply the difference between the actual network output Z_i(X) and the corresponding desired output D_i. The two errors are squared, summed, and weighted by the joint probability f(X, C_j) of the particular input-class pair.
Substituting f(X, C_j) = f(C_j|X) f(X) in Eq. 14 yields

E_a = Σ_{j=1}^{2} ∫ Σ_{i=1}^{2} [Z_i(X) − D_i]² f(C_j|X) f*(X) dX   (18)

where f*(X) = c(X) f(X). In this way, the neural network can be forced to form its mean-square error estimates of f(C_j|X) according to the distribution f*(X) rather than the distribution f(X). Expanding the bracketed expression in Eq. 18 yields
E_a = Σ_{j=1}^{2} ∫ Σ_{i=1}^{2} [Z_i²(X) − 2 Z_i(X) D_i + D_i²] f(C_j|X) f*(X) dX   (19)

Exploiting the fact that Z_i(X) is a function only of X and that Σ_{j=1}^{2} f(C_j|X) = 1 allows Eq. 19 to be simplified. For a two-class problem, D_i equals one if input X belongs to class C_i and zero otherwise. Therefore, Σ_{j=1}^{2} D_i f(C_j|X) = f(C_i|X), and

E_a = ∫ Σ_{i=1}^{2} [Z_i²(X) − 2 Z_i(X) f(C_i|X) + f(C_i|X)] f*(X) dX   (21)

Adding and subtracting f²(C_i|X) inside the brackets gives

E_a = ∫ Σ_{i=1}^{2} [Z_i²(X) − 2 Z_i(X) f(C_i|X) + f²(C_i|X) + f(C_i|X) − f²(C_i|X)] f*(X) dX   (22)

E_a = ∫ Σ_{i=1}^{2} {[Z_i(X) − f(C_i|X)]² + [f(C_i|X) − f²(C_i|X)]} f*(X) dX   (23)

E_a = ∫ Σ_{i=1}^{2} [Z_i(X) − f(C_i|X)]² f*(X) dX + ∫ Σ_{i=1}^{2} [f(C_i|X) − f²(C_i|X)] f*(X) dX   (24)
104 Chapter 5 Rare Event Detection in Genomic Sequences by Neural Networks
Because the second term in Eq. 24 is independent of the network outputs, minimization
of ΕΆ is achieved by choosing network parameters to minimize the first term.
Finally, we get

∫ Σ_{i=1}^{2} [Z_i(X) − f(C_i|X)]² f*(X) dX = E_X*{ Σ_{i=1}^{2} [Z_i(X) − f(C_i|X)]² }   (25)

where E_X*[·] is the expected value with respect to the modified distribution f*(X). Equation 25 shows that the neural network outputs approximate the a posteriori probabilities based on the modified distribution.
4. BOOTSTRAP STRATIFICATION
Training Procedure
1. Divide common event data into n subdivisions.
2. Bootstrap rare event data into n data.
3. Train n NNs independently.
4. Combine the output of the individual neural networks by consensus.
The system block diagram is shown in Figure 1.
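The four training steps can be sketched as follows; the function names and the use of equal-size subdivisions are assumptions made for illustration:

```python
import random

def stratified_training_sets(common, rare, n, rng):
    """Steps 1-2: divide the common-event data into n subdivisions and pair
    each with a bootstrap resample (drawn with replacement) of the rare-event
    data, producing n training sets for n independently trained networks."""
    size = len(common) // n
    subdivisions = [common[i * size:(i + 1) * size] for i in range(n)]
    bootstraps = [[rng.choice(rare) for _ in range(size)] for _ in range(n)]
    return [s + b for s, b in zip(subdivisions, bootstraps)]

def consensus(votes):
    """Step 4: combine the n network outputs by majority vote."""
    return max(set(votes), key=votes.count)

rng = random.Random(0)
train_sets = stratified_training_sets(list(range(100)), ["rare"] * 5, 4, rng)
```

Because each training set pairs one subdivision of common events with a bootstrap of the rare events, every set is balanced even though rare examples are scarce.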
For unique sequences, all annotated human unique sequences (41,120 entries) in protein coding regions were extracted from UniGene in the NCBI Repository. From this set, entries were discarded if they did not contain the string "complete cds." Finally, 6120 entries were obtained. Table 4 shows some of the entries, and an example of the sequence data is shown in Table 5.
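The screening step described above amounts to a one-line filter over the annotation entries (the function name is illustrative):

```python
def filter_complete_cds(entries):
    """Keep only annotation entries containing the string 'complete cds',
    as in the screening step described above."""
    return [e for e in entries if "complete cds" in e]
```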
HsalO0581 Length: 338 September 11, 1997 12:40 Type: N Check: 1800 ..
GCGGCGGCTGGCGGCCGGCTTCTCGCTCGGGCAGCGGCGGCGGCGGCGGCGGCGGCTTCC
GGAGTCCCGCTGCGAAGATGCTCAAAGTCACGGTGCCCTCCTGCTCCGCCTCGTCCTGCT
CTTCGGTCACCGCCAGTGCGGCCCCGGGGACCGCGAGCCTCGTCCCGGATTACTGGATCG
ACGGCTCCAACAGGGATGCGCTGAGCGATTTCTTCGAGGTGGAGTCGGAGCTGGGACGGG
GTGCTACATCCATTGTGTACAGATGCAAACAGAAGGGGACCCAGAAGCCTTATGCTCTCA
AAGTGTTAAAGAAAACAGTGGACAAAAAAATCGTAAGAACTGAGATAGGAGTTCTTCTTC
GCCTCTCACATCCAAACATTATAAAACTTAAAGAGATATTTGAAACCCCTACAGAAATCA
GTCTGGTCCTAGAACTCGTCACAGGAGGAGAACTGTTTGATAGGATTGTGGAAAAGGGAT
ATTACAGTGAGCGAGATGCTGCAGATGCCGTTAAACAAATCCTGGAGGCAGTTGCTTATC
TACATGAAAATGGGATTGTCCATCGTGATCTCAAACCAGAGAATCTTCTTTATGCAACTC
CAGCCCCAGATGCACCACTCAAAATCGCTGATTTTGGACTCTCTAAAATTGTGGAACATC
AAGTGCTCATGAAGACAGTATGTGGAACCCCAGGGTACTGCGCACCTGAAATTCTTAGAG
GTTGTGCCTATGGACCTGAGGTGGACATGTGGTCTGTAGGAATAATCACCTACATCTTAC
TTTGTGGATTTGAACCATTCTATGATGAAAGAGGCGATCAGTTCATGTTCAGGAGAATTC
TGAATTGTGAATATTACTTTATCTCCCCCTGGTGGGATGAAGTATCTCTAAATGCCAAGG
ACTTGGTCAGAAAATTAATTGTTTTGGATCCAAAGAAACGGCTGACTACATTTCAAGCTC
TCCAGCATCCGTGGGTCACAGGTAAAGCAGCCAATTTTGTACACATGGATACCGCTCAAA
AGAAGCTCCAAGAATTCAATGCCCGGCGTAAGCTTAAGGCAGCGGTGAAGGCTGTGGTGG
CCTCTTCCCGCCTGGGAAGTGCCAGCAGCAGCCATGGCAGCATCCAGGAGAGCCACAAGG
CTAGCCGAGACCCTTCTCCAATCCAAGATGGCAACGAGGACATGAAAGCTATTCCAGAAG
GAGAGAAAATTCAAGGCGATGGGGCCCAAGCCGCAGTTAAGGGGGCACAGGCTGAGCTGA
TGAAGGTGCAAGCCTTAGAGAAAGTTAAAGGTGCAGATATAAATGCTGAAGAGGCCCCCA
AAATGGTGCCCAAGGCAGTGGAGGATGGGATAAAGGTGGCTGACCTGGAACTAGAGGAGG
GCCTAGCAGAGGAGAAGCTGAAGACTGTGGAGGAGGCAGCAGCTCCCAGAGAAGGGCAAG
GAAGCTCTGCTGTGGGTTTTGAAGTTCCACAGCAAGATGTGATCCTGCCAGAGTACTAAA
CAGCTTCCTTCAGATCTGGAAGCCAAACACCGGCATTTTATGTACTTTGTCCTTCAGCAA
GAAAGGTGTGGAAGCATGATATGTACTATAGTGATTCTGTTTTTGAGGTGCAAAAAACAT
ACATATATACCAGTTGGTAATTCTAACTTCAATGCATGTGACTGCTTTATGAAAATAATA
GTGTCTTCTATGGCATGTAATGGATACCTAATACCGATGAGTTAAATCTTGCAAGTTAAC
f(X|C_i) = (1 / (2π|Σ_i|^{1/2})) exp(−(1/2)(X − μ_i)^T Σ_i^{−1}(X − μ_i)),   for i = 1, 2, 3, 4

where

Σ_i = covariance matrix = [2 0; 0 2] for i = 1, 2, 3, 4
μ_1 = mean vector = [0 0]^T
μ_2 = mean vector = [0 5]^T
μ_3 = mean vector = [5 5]^T
μ_4 = mean vector = [5 0]^T
We use these data to show the robustness of the stratifying coefficient scheme for
multiclass data.
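Because the four classes share a common covariance and (by assumption in this sketch) equal priors, the Bayesian classifier used for comparison reduces to nearest-mean classification over the mean vectors listed above:

```python
import math

MEANS = {1: (0, 0), 2: (0, 5), 3: (5, 5), 4: (5, 0)}  # mu_1 .. mu_4

def bayes_classify(x):
    """With a common covariance matrix and equal priors, maximizing the
    Gaussian class likelihood is equivalent to assigning x to the class
    with the nearest mean vector."""
    return min(MEANS, key=lambda i: math.dist(x, MEANS[i]))
```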
6. EXPERIMENTAL RESULTS
fast training, namely a hybrid neural network that combines competitive learning and
delta rule learning and EBUDS (error back-propagation using direct solution), which
uses back-propagation and the direct least-squares method [23].
In the beginning, we investigated many different neural network architectures in
terms of the number of hidden layers and the number of weights in each layer. The
experimental results presented are obtained using networks with 60 nodes in the input
layers, 32 nodes in the hidden layers, and 2 nodes in the output layers. The networks
were trained by various training methods using 1000, 2000, and 3000 samples. The
learning rate we used in the experiments with BP was 0.01.
In the following tables P_D denotes the detection rate, defined as the percentage of rare events that are correctly classified as rare events. Similarly, FAR denotes the false alarm rate, defined as the percentage of the total number of events falsely declared as rare events.
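These two measures can be computed directly from labels and predictions; the function below is a sketch following the definitions just given, with 1 marking the rare-event class by assumption:

```python
def pd_and_far(y_true, y_pred, rare=1):
    """P_D: fraction of true rare events correctly classified as rare.
    FAR: fraction of the total number of events falsely declared rare."""
    rare_total = sum(1 for t in y_true if t == rare)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == rare and p == rare)
    false_alarms = sum(1 for t, p in zip(y_true, y_pred)
                       if t != rare and p == rare)
    return hits / rare_total, false_alarms / len(y_true)
```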
Table 6 shows the results for the regular neural network. We can see that the
network does not perform well as the training data size increases. For the BP with
LRWF using the importance-sampling concept, the performance of neural networks
depends on the number of stages. In Table 7, we see similar results with the two-stage
BP with LRWF.
Table 8 shows that the first proposed scheme for rare event detection works well.
Actually, the BP with LRWF and the BP with SC have the same weighting scheme, but
the results are different. The reason is as follows. For the BP with LRWF, the assump-
tion made is that the data sample should be representative, which means every example
in the population has an equal chance of being in it when we make our sample. But as
far as classification is concerned, it is preferable that the data sample be stratified, which means examples from small classes have a better chance of being included than those from large classes. We can confirm this fact through Table 7 and Table 8.

TABLE 6 Performance of a Two-Stage Back-Propagation Neural Network

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   P_D    FAR
1000             1000             0.875               0.954                 0.64   0.0295
2000             1000             0.8765              0.973                 0.72   0.0137
Table 9 shows similar results with the hybrid network. We used competitive learn-
ing in the first stage of the hybrid scheme.
Tables 10 and 11 show the performance of the bootstrap stratification. Table 10
shows the results of the BP with bootstrap stratification. This network achieved better
performance than any other rare event neural networks. Table 11 shows the outcome of
the hybrid scheme, with bootstrap stratification having better performance than the
simple hybrid scheme. For fast training, we adopted Verma's scheme [23] to reduce
training time, with the performance slightly changed, as shown in Table 12.
TABLE 10 Performance of a Bootstrap Stratification Neural Network (Two-Stage BP)

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   P_D    FAR
1000             1000             0.999               0.94                  0.81   0.0537
2000             1000             0.998               0.9256                0.80   0.0674
3000             1000             0.998               0.9305                0.80   0.0625
7. CONCLUSIONS
In this chapter, we presented two methods for rare event detection in association with
human genomic sequences as well as synthetically generated data, using neural net-
works and sample stratification. In the first scheme, we used a modification of the
importance-sampling concept, which modifies the probability distribution of the under-
lying random process in order to make rare events occur more frequently. This method
uses a stratifying coefficient multiplying the sum of the derivatives during the backward
pass of training. In the second scheme, we utilized a bootstrap technique. These two
schemes make rare events have a better chance of being included in the sample for
training in order to improve the classification accuracy of neural networks. The results indicate that the proposed schemes have the potential to significantly improve the classification performance in recognizing rare events.
More progress is required on the acceptable minimum data size for rare event detection. We cannot compensate for too small an amount of data; at a certain level of scarcity, we cannot obtain the desired results even with the proposed techniques. Another
research direction would be toward investigating the relationship between the perfor-
mance and the number of bootstrap replicates. We cannot arbitrarily increase the
number of replicates for better performance without considering complexity problems
as well as saturation of improvement of classification accuracy. We want to find a
reasonable bound for the number of replicates considering the performance and the
complexity of the neural networks.
REFERENCES
[1] W. Choe, O. K. Ersoy, and M. Bina, Detection of rare events by neural networks.
Proceedings of the Artificial Neural Networks in Engineering Conference (ANNIE '98), pp.
5-10, 1998.
[2] M. Smith, Neural Networks for Statistical Modelling. New York: Van Nostrand Reinhold,
1993.
[3] P. M. Hahn and M. C. Jeruchim, Developments in the theory and application of importance
sampling. IEEE Trans. Commun. COM-34(7): 715-719, 1986.
[4] M. C. Jeruchim, P. Balaban, and K. S. Shanmugan, Simulation of Communications Systems.
New York: Plenum, 1992.
[5] R. L. Mitchelle, Importance sampling applied to simulation of false alarm statistics. IEEE
Trans. Aerosp. Electron. Syst. AES-17(1): 15-24, 1981.
[6] D. J. Monro, O. K. Ersoy, M. R. Bell, and J. S. Sadowsky, Neural network learning of low-
probability events. IEEE Trans. Aerosp. Electron. Syst. 3(3): 898-910, July 1996.
[7] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed
Processing, Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986.
[8] M. D. Richard and R. P. Lippmann, Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput. 3: 461-483, 1991.
[9] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley, and B. W. Suter, The multilayer
perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans.
Neural Networks 1(4): 296-298, 1990.
[10] E. A. Wan, Neural networks classification: A Bayesian interpretation. IEEE Trans. Neural Networks 1(4): 303-305, 1990.
[11] H. White, Learning in artificial neural networks: A statistical perspective. Neural Comput. 1:
425-464, 1989.
[12] L. Breiman, Bagging predictors. Machine Learning 24: 123-140, 1996.
[13] J. R. Quinlan, Bagging, boosting, and C4.5. Proceedings Fourteenth National Conference on
Artificial Intelligence, pp. 725-730, 1996.
[14] A. M. Zoubir and B. Boashash, The bootstrap and its application in signal processing. IEEE
Signal Process. Mag. January: 56-76, 1998.
[15] N. Nilsson, Learning Machines. New York: McGraw-Hill, 1965.
[16] L. K. Hansen and P. Salamon, Neural network ensembles. IEEE Trans. Pattern Anal.
Machine Intell. 12(10): 993-1001, 1990.
[17] H. Valafar and O. K. Ersoy, Parallel, self-organizing, consensual neural network. Report.
TR-EE 90-56, School of Electrical Engineering, Purdue University, 1990.
[18] D. W. Opitz and J. W. Shavlik, Generating accurate and diverse members of a neural
networks ensemble. In: Advances in Neural Information Processing System, Vol. 8.
Cambridge, MA: MIT Press, 1996.
[19] K. Turner and J. Ghosh, Theoretical foundations of linear and order statistics combiners for
neural pattern classifiers. TR-95-02-98, Computer and Vision Research Center, University
of Texas at Austin, 1995.
[20] S. Haykin, Neural Networks. Englewood Cliffs, NJ: Macmillan, 1994.
[21] J. Kangas, T. Kohonen, and J. Laaksonen, Variants of self-organizing maps. IEEE Trans.
Neural Networks 1(1): 93-99, 1990.
[22] R. P. Lippmann, Pattern classification using neural networks. IEEE Commun. Mag. 27: 47-
64, 1989.
[23] R. Verma, Fast training of multilayer perceptrons. IEEE Trans. Neural Networks 8(6): 1314-
1320, 1997.
[24] J. Jurka, Repbase update. Genetic Information Research Institute, http://www.girinst.org/
~server/repbase.html, 1997.
[25] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. San Diego: Academic
Press, 1990.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.
Nicolaos B. Karayiannis
1. INTRODUCTION
The structure of a radial basis function (RBF) neural network is shown in Figure 1. An RBF neural network is usually trained to map a vector x_k ∈ R^{n_i} into a vector y_k ∈ R^{n_o}, where the pairs (x_k, y_k), 1 ≤ k ≤ M, form the training set. If this mapping is viewed as a function in the input space R^{n_i}, learning can be seen as a function approximation problem. According to this point of view, learning is equivalent to finding a surface in a multidimensional space that provides the best fit to the training data. Generalization is therefore synonymous with interpolation between the data points along the constrained surface generated by the fitting procedure as the optimum approximation to this mapping.
Broomhead and Lowe [1] were the first to explore the use of radial basis functions
in the design of neural networks and to show how RBF networks model nonlinear
relationships and implement interpolation. Micchelli [2] showed that RBF neural net-
works can produce an interpolating surface that exactly passes through all the pairs of
the training set. However, the exact fit is neither useful nor desirable in applications
because it may produce anomalous interpolation surfaces. Poggio and Girosi [3] viewed
the learning process in an RBF network as an ill-posed problem, in the sense that the
information in the training data is not sufficient to reconstruct uniquely the mapping in
regions where data are not available. From this point of view, learning is closely related
to classical approximation techniques, such as generalized splines and regularization
theory. Park and Sandberg [4,5] proved that RBF networks with one layer of radial
basis functions are capable of universal approximation. Under certain mild conditions
on the radial basis functions, RBF networks are capable of approximating arbitrarily
well any function. Similar proofs also exist in the literature for conventional feed-
forward neural models with sigmoidal nonlinearities [6].
The performance of an RBF network depends on the number and positions of the
radial basis functions, their shapes, and the method used for learning the input-output
mapping. The existing learning strategies for RBF neural networks can be classified as
follows: (1) strategies selecting the radial basis function centers randomly from the
training data [1], (2) strategies employing unsupervised procedures for selecting the
radial basis function centers [7-10], and (3) strategies employing supervised procedures
for selecting the radial basis function centers [3,9,11-15].
Broomhead and Lowe [1] suggested that, in the absence of a priori knowledge, the
centers of the radial basis functions can either be distributed uniformly within the
region of the input space for which there is data or chosen to be a subset of the training
points by analogy with strict interpolation. This approach is sensible only if the training
data are distributed in a representative manner for the problem under consideration, an
assumption that is very rarely satisfied in practical applications. Moody and Darken
[10] proposed a hybrid learning process for training RBF networks with Gaussian
radial basis functions, which is widely used in practice. This learning procedure employs
different schemes for updating the output weights, that is, the weights that connect the
radial basis functions with the output units, and the centers of the radial basis func-
tions, that is, the vectors in the input space that represent the prototypes of the input
vectors included in the training set. Moody and Darken used the c-means (or k-means) clustering algorithm [16] and the "P nearest neighbor" heuristic to determine the positions
and widths of the Gaussian radial basis functions, respectively. The output weights are
updated according to this scheme using a supervised least-mean-squares learning rule.
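A minimal sketch of the unsupervised center-selection step of this hybrid scheme is given below; the initialization from the first c points is a simplification made here, and the width heuristic and least-mean-squares output-weight update described above would follow this step:

```python
import math

def c_means(data, c, iters=10):
    """c-means (k-means) clustering: alternate nearest-center assignment and
    cluster-mean updates to position the radial basis function centers."""
    centers = [tuple(x) for x in data[:c]]  # simplistic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(c)]
        for x in data:
            j = min(range(c), key=lambda j: math.dist(x, centers[j]))
            clusters[j].append(x)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers
```

Note that only the input vectors, never the target outputs, influence the centers; this is exactly the property criticized later in the chapter.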
Poggio and Girosi [3] proposed a fully supervised approach for training RBF neural
networks with Gaussian radial basis functions, which updates the radial basis function
centers together with the output weights. Poggio and Girosi used Green's formulas to
obtain an optimal solution with respect to the objective function and employed gradient
descent to approximate the regularized solution. They also proposed that Kohonen's
self-organizing feature map [17,18] can be used for initializing the radial basis function
centers before gradient descent is used to adjust all the free parameters of the network.
Chen et al. [7,8] proposed a learning procedure for RBF neural networks based on
the orthogonal least squares (OLS) method. The OLS method is used as a forward
regression procedure to select a suitable set of radial basis function centers. In fact,
124 Chapter 6 An Axiomatic Approach to Reformulating Radial Basis Neural Networks
this approach selects radial basis function centers one by one until an adequate RBF
network has been constructed. Cha and Kassam [11] proposed a stochastic gradient
training algorithm for RBF networks with Gaussian radial basis functions. This algo-
rithm uses gradient descent to update all free parameters of an RBF network, which
include the radial basis function centers, the widths of the Gaussian radial basis func-
tions, and the output weights. Whitehead and Choate [19] proposed an evolutionary
training algorithm for RBF neural networks. In this approach, the centers of the radial
basis functions are governed by space-filling curves whose parameters evolve geneti-
cally. This encoding causes each group of codetermined basis functions to evolve in
order to fit a region of the input space. Roy et al. [20] proposed a set of learning
principles that led to a training algorithm for a network that contains "truncated"
radial basis functions and other types of hidden units. This algorithm uses random
clustering and linear programming to design and train this network with polynomial
time complexity.
Despite the existence of a variety of learning schemes, RBF neural networks are
frequently trained in practice using variations of the learning scheme proposed by
Moody and Darken [10]. According to these hybrid learning schemes, the prototypes
that represent the radial basis function centers are determined separately according to
some unsupervised clustering or vector quantization algorithm and the output weights
are determined by a supervised procedure to implement the desired input-output map-
ping. These approaches were developed as a natural reaction to the long training times
typically associated with the training of conventional feed-forward neural networks
using gradient descent [21]. In fact, these hybrid learning schemes achieve fast training
of RBF neural networks because of the strategy they employ for learning the desired
input-output mapping. However, the same strategy prevents the training set from
participating in the formation of the radial basis function centers, with a negative
impact on the performance of trained RBF neural networks [9]. This created a
wrong impression about the actual capabilities of an otherwise powerful neural
model. The training of RBF networks using gradient descent offers a solution to the
trade-off between performance and training speed. Moreover, such training can make
RBF neural networks serious competitors to classical feed-forward neural networks.
Learning schemes attempting to train RBF networks by fixing the locations of the
radial basis function centers are very slightly affected by the specific form of the radial
basis functions used. On the other hand, the convergence of gradient descent learning
and the performance of the trained RBF networks are both affected rather strongly by
the choice of radial basis functions. The search for admissible radial basis functions
other than the Gaussian function motivated the development of an axiomatic approach
for constructing reformulated RBF neural networks suitable for gradient descent learn-
ing [12-15].
This chapter begins with a review of function approximation models used for
interpolation and points out their relationship with RBF neural networks. An axio-
matic approach provides the basis for reformulating RBF neural networks, which is
accomplished by searching for radial basis functions other than Gaussian. According to
this approach, the construction of admissible RBF models reduces to the selection of
generator functions that satisfy certain properties. The search for potential generator
functions is facilitated by considering the admissibility in the wide and strict sense of
linear and exponential functions. The selection of specific generator functions is based
on criteria related to their behavior when the training of reformulated RBF networks is
2. FUNCTION APPROXIMATION MODELS AND RBF NEURAL NETWORKS
There are many similarities between RBF neural networks and function approximation
models used to perform interpolation. Such a function approximation model attempts
to determine a surface in a Euclidean space R^{n_i} that provides the best fit to the data (x_k, y_k), 1 ≤ k ≤ M, where x_k ∈ X ⊂ R^{n_i} and y_k ∈ R for all k = 1, 2, ..., M. Micchelli [2] considered the solution of the interpolation problem s(x_k) = y_k, 1 ≤ k ≤ M, by functions s : R^{n_i} → R of the form

s(x) = Σ_{k=1}^{M} w_k g(||x − x_k||²)   (1)
y = w_0 + Σ_{j=1}^{c} w_j g(||x − v_j||²)   (2)
If the function g(·) satisfies certain conditions, the model (2) can be used to implement a desired mapping R^{n_i} → R specified by the training set (x_k, y_k), 1 ≤ k ≤ M. This is usually accomplished by devising a learning procedure for determining its adjustable parameters. In addition to the weights w_j, 0 ≤ j ≤ c, the adjustable parameters of the model (2) also include the vectors v_j ∈ V ⊂ R^{n_i}, 1 ≤ j ≤ c. These vectors are determined during learning as the prototypes of the input vectors x_k, 1 ≤ k ≤ M. The adjustable parameters of the model (2) are frequently updated by minimizing some measure of the discrepancy between the expected output y_k of the model to the corresponding input x_k and its actual response, for all pairs (x_k, y_k), 1 ≤ k ≤ M, included in the training set.
The function approximation model (2) can be extended to implement any mapping R^{n_i} → R^{n_o}, n_o ≥ 1, as

ŷ_i = f(w_{i0} + Σ_{j=1}^{c} w_{ij} φ(||x − v_j||)), 1 ≤ i ≤ n_o   (4)

where f(·) is a differentiable function and φ(·) is a radial basis function. In such a case, the response of the network to the input vector x_k is

ŷ_{i,k} = f(Σ_{j=0}^{c} w_{ij} h_{j,k}), 1 ≤ i ≤ n_o   (6)

where h_{0,k} = 1, ∀k, and h_{j,k} represents the response of the radial basis function located at the jth prototype to the input vector x_k, given as

h_{j,k} = φ(||x_k − v_j||) = g(||x_k − v_j||²), 1 ≤ j ≤ c   (7)
The response (6) of the RBF neural network to the input x_k is actually the output of the upper associative network. When the RBF network is presented with x_k, the input of the upper associative network is formed by the responses (7) of the radial basis functions located at the prototypes v_j, 1 ≤ j ≤ c, as shown in Figure 1.
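The forward pass described by (6) and (7) can be sketched in Python. This is a minimal illustration only; the prototype values, weights, and the particular choice g(x) = (x + γ²)^{−1/2} below are hypothetical:

```python
import numpy as np

def rbf_response(x, V, W, g, f=lambda u: u):
    """Response of an RBF network, eqs. (6)-(7): y_i = f(sum_j w_ij h_j),
    with h_0 = 1 and h_j = g(||x - v_j||^2) for the prototypes v_j (rows of V)."""
    h = g(((x - V) ** 2).sum(axis=1))   # h_j = g(||x - v_j||^2), j = 1..c
    h = np.concatenate(([1.0], h))      # prepend the bias response h_0 = 1
    return f(W @ h)                     # one output per row of W

# Hypothetical example: two prototypes in R^2 and one linear output unit,
# using the inverse multiquadratic g(x) = (x + gamma^2)^(-1/2), gamma^2 = 0.1.
g = lambda x: (x + 0.1) ** -0.5
V = np.array([[0.0, 0.0], [1.0, 1.0]])   # prototypes v_1, v_2
W = np.array([[0.5, 1.0, -1.0]])         # weights [w_10, w_11, w_12]
y = rbf_response(np.array([0.0, 0.0]), V, W, g)
```

With linear output units (f(u) = u, the common case discussed next), the response is simply the weighted sum of the radial basis function outputs plus the bias.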
The models used in practice to implement RBF neural networks usually contain linear output units. An RBF model with linear output units can be seen as the special case of (4) that corresponds to f(x) = x. The choice of a linear function f(·) was mainly motivated by the hybrid learning schemes originally developed for training RBF neural networks. Nevertheless, the learning process is only slightly affected by the form of f(·) if RBF neural networks are trained using learning algorithms based on gradient descent. Moreover, the form of an admissible function f(·) does not affect the function approximation capability of the model (4) or the conditions that must be satisfied by radial basis functions. Finally, the use of a nonlinear sigmoidal function f(·) could make RBF models stronger competitors to conventional feed-forward neural networks in some applications, such as applications involving pattern classification.
3. REFORMULATING RADIAL BASIS NEURAL NETWORKS

Axiom 1: h_{j,k} > 0 for all x_k ∈ X and v_j ∈ V.

Axiom 2: h_{j,k} > h_{j,ℓ} for all x_k, x_ℓ ∈ X and v_j ∈ V such that ||x_k − v_j||² < ||x_ℓ − v_j||².

Axiom 3: If ∇_{x_k}h_{j,k} = ∂h_{j,k}/∂x_k denotes the gradient of h_{j,k} with respect to the corresponding input x_k, then

||∇_{x_k} h_{j,k}||² / ||x_k − v_j||² > ||∇_{x_ℓ} h_{j,ℓ}||² / ||x_ℓ − v_j||²

for all x_k, x_ℓ ∈ X and v_j ∈ V such that ||x_k − v_j||² < ||x_ℓ − v_j||².
These basic axiomatic requirements impose some rather mild mathematical restrictions on the search for admissible radial basis functions. Nevertheless, this search can be further restricted by imposing additional requirements that lead to stronger mathematical conditions. For example, it is reasonable to require that the responses of all radial basis functions to all inputs are bounded, that is, h_{j,k} < ∞, ∀j, k. On the other hand, the third axiomatic requirement can be made stronger by requiring that

||∇_{x_k} h_{j,k}||² > ||∇_{x_ℓ} h_{j,ℓ}||²   (8)

if ||x_k − v_j||² < ||x_ℓ − v_j||². Since ||x_k − v_j||² < ||x_ℓ − v_j||², the condition (8) implies that

||∇_{x_k} h_{j,k}||² / ||x_k − v_j||² > ||∇_{x_ℓ} h_{j,ℓ}||² / ||x_ℓ − v_j||²

and the third axiomatic requirement is satisfied. This implies that the condition (8) is stronger than that imposed by the third axiomatic requirement.
The preceding discussion suggests two complementary axiomatic requirements for radial basis functions [13]:

Axiom 4: h_{j,k} < ∞ for all x_k ∈ X and v_j ∈ V.

Axiom 5: If ∇_{x_k}h_{j,k} = ∂h_{j,k}/∂x_k denotes the gradient of h_{j,k} with respect to the corresponding input x_k, then

||∇_{x_k} h_{j,k}||² > ||∇_{x_ℓ} h_{j,ℓ}||²

for all x_k, x_ℓ ∈ X and v_j ∈ V such that ||x_k − v_j||² < ||x_ℓ − v_j||².
The search for generator functions that lead to admissible radial basis functions can be facilitated by the following theorem [13]:

Theorem 2: Consider the model (4) and let g(x) be defined in terms of the generator function g₀(x), continuous on (0, ∞), as

g(x) = (g₀(x))^{1/(1−m)}, m ≠ 1   (10)
If m > 1, then this model represents an RBF neural network in accordance with all five axiomatic requirements if:

1. g₀(x) > 0, ∀x ∈ (0, ∞).
2. g₀(x) is a monotonically increasing function of x ∈ (0, ∞), that is, g₀′(x) > 0, ∀x ∈ (0, ∞).
3. r₀(x) = (m/(m − 1))(g₀′(x))² − g₀(x)g₀″(x) > 0, ∀x ∈ (0, ∞).
4. lim_{x→0⁺} g₀(x) = L₁ > 0.
5. d₀(x) = g₀(x)g₀′(x) − 2x r₀(x) < 0, ∀x ∈ (0, ∞).

If m < 1, then this model represents an RBF neural network in accordance with all five axiomatic requirements if:

1. g₀(x) > 0, ∀x ∈ (0, ∞).
2. g₀(x) is a monotonically decreasing function of x ∈ (0, ∞), that is, g₀′(x) < 0, ∀x ∈ (0, ∞).
3. r₀(x) = (m/(m − 1))(g₀′(x))² − g₀(x)g₀″(x) < 0, ∀x ∈ (0, ∞).
4. lim_{x→0⁺} g₀(x) = L₁ < ∞.
5. d₀(x) = g₀(x)g₀′(x) − 2x r₀(x) > 0, ∀x ∈ (0, ∞).
Any generator function that satisfies the first three conditions of Theorem 2 leads
to admissible radial basis functions in the wide sense [12,14,15]. Admissible radial basis
functions in the strict sense can be obtained from generator functions that satisfy all five
conditions of Theorem 2 [13].
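For a candidate generator function, the wide-sense admissibility conditions of Theorem 2 can be checked numerically. The sketch below is illustrative only: it covers conditions 1-3 for the case m > 1, approximates g₀′ and g₀″ by central differences, and uses hypothetical parameter values (δ = 2, β = 2, m = 3):

```python
import numpy as np

def wide_sense_admissible(g0, m, xs, eps=1e-5):
    """Check Theorem 2 conditions 1-3 (m > 1) on the sample points xs."""
    g  = g0(xs)
    g1 = (g0(xs + eps) - g0(xs - eps)) / (2 * eps)        # finite-difference g0'
    g2 = (g0(xs + eps) - 2 * g + g0(xs - eps)) / eps**2   # finite-difference g0''
    r0 = (m / (m - 1)) * g1**2 - g * g2                   # condition 3 quantity
    return np.all(g > 0) and np.all(g1 > 0) and np.all(r0 > 0)

xs = np.linspace(0.1, 5.0, 50)
lin_ok = wide_sense_admissible(lambda x: 1.0 + 2.0 * x, m=3, xs=xs)    # g0(x) = 1 + delta*x
exp_ok = wide_sense_admissible(lambda x: np.exp(2.0 * x), m=3, xs=xs)  # g0(x) = exp(beta*x)
```

Both the linear and the exponential generator function pass this wide-sense check, in agreement with the analysis of the next section; a decreasing g₀ would fail condition 2 for m > 1.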
4. ADMISSIBLE GENERATOR FUNCTIONS

This section investigates the admissibility in the wide and strict sense of linear and exponential generator functions.

4.1. Linear Generator Functions

Consider the function g(x) = (g₀(x))^{1/(1−m)}, with the linear generator function g₀(x) = ax + b, a > 0, b ≥ 0. For g₀(x) = ax + b, g₀′(x) = a and g₀″(x) = 0. In this case,

r₀(x) = (m/(m − 1)) a²   (12)
If m > 1, then r₀(x) > 0, ∀x ∈ (0, ∞). Thus, g₀(x) = ax + b is an admissible generator function in the wide sense (i.e., in the sense that it satisfies the three basic axiomatic requirements) for all a > 0 and b > 0. Certainly, all combinations of a > 0 and b = 0 also lead to admissible generator functions in the wide sense.
For g₀(x) = ax + b, the fourth axiomatic requirement is satisfied if b > 0, since

lim_{x→0⁺} g₀(x) = b   (13)

For g₀(x) = ax + b,

d₀(x) = ab − a² ((m + 1)/(m − 1)) x   (14)

If m > 1, the fifth axiomatic requirement is satisfied if d₀(x) < 0, ∀x ∈ (0, ∞). For a > 0, the condition d₀(x) < 0 is satisfied by g₀(x) = ax + b if

x > ((m − 1)/(m + 1)) (b/a)   (15)

Since m > 1, the fifth axiomatic requirement is satisfied for all x ∈ (0, ∞) only if b = 0 or, equivalently, if g₀(x) = ax. However, the value b = 0 violates the fourth axiomatic requirement. Thus, there exists no combination of a > 0 and b ≥ 0 leading to an admissible generator function in the strict sense that has the form g₀(x) = ax + b.
If a = 1 and b = γ², then the linear generator function becomes g₀(x) = x + γ². For this generator function, g(x) = (x + γ²)^{1/(1−m)}; if m = 3, g(x) = (x + γ²)^{−1/2} corresponds to the inverse multiquadratic radial basis function

φ(x) = g(x²) = (x² + γ²)^{−1/2}   (16)

For this generator function, lim_{x→0⁺} g₀(x) = γ² and lim_{x→0⁺} g(x) = γ^{2/(1−m)}. Since m > 1, g(·) is a bounded function if γ takes nonzero values. However, the bound of g(·) increases and approaches infinity as γ decreases and approaches 0. If m > 1, the condition d₀(x) < 0 is satisfied by g₀(x) = x + γ² if

x > ((m − 1)/(m + 1)) γ²   (17)

Clearly, the fifth axiomatic requirement is satisfied only for γ = 0, which leads to an unbounded function g(·) [12,14,15].
Another interesting generator function from a practical point of view can be obtained from g₀(x) = ax + b by selecting b = 1 and a = δ > 0. For g₀(x) = 1 + δx, lim_{x→0⁺} g(x) = lim_{x→0⁺} g₀(x) = 1. For this choice of parameters, the corresponding radial basis function φ(x) = g(x²) is bounded by 1, which is also the bound of the Gaussian radial basis function. If m > 1, the condition d₀(x) < 0 is satisfied by g₀(x) = 1 + δx if

x > ((m − 1)/(m + 1)) (1/δ)   (18)

For a fixed m > 1, the fifth axiomatic requirement is satisfied in the limit δ → ∞. Thus, a reasonable choice for δ in practical situations is δ ≫ 1.
The radial basis function that corresponds to the linear generator function g₀(x) = ax + b and some value of m > 1 can also be obtained from the decreasing function g₀(x) = 1/(ax + b) combined with an appropriate value of m < 1. As an example, for m = 3, g₀(x) = ax + b leads to g(x) = (ax + b)^{−1/2}. For a = 1 and b = γ², this generator function corresponds to the inverse multiquadratic radial basis function φ(x) = g(x²) = (x² + γ²)^{−1/2}. The same radial basis function can also be obtained using the decreasing generator function g₀(x) = 1/(x + γ²) with m = −1. In general, the function g(x) = (g₀(x))^{1/(1−m)} corresponding to the increasing generator function g₀(x) = ax + b and m = m_i > 1 is identical to the function g(x) = (g₀(x))^{1/(1−m)} corresponding to the decreasing function g₀(x) = 1/(ax + b) and m = m_d if

1/(1 − m_i) = 1/(m_d − 1)   (19)

or, equivalently, if

m_i + m_d = 2   (20)
For g₀(x) = 1/(ax + b), g₀′(x) = −a/(ax + b)² and g₀″(x) = 2a²/(ax + b)³. In this case,

r₀(x) = ((2 − m)/(m − 1)) a²/(ax + b)⁴   (21)

For m < 1, r₀(x) < 0, ∀x ∈ (0, ∞), and g₀(x) = 1/(ax + b) is an admissible generator function in the wide sense.

For g₀(x) = 1/(ax + b),

lim_{x→0⁺} g₀(x) = 1/b   (22)

which implies that g₀(x) = 1/(ax + b) satisfies the fourth axiomatic requirement unless b approaches 0, in which case lim_{x→0⁺} g₀(x) = 1/b = ∞. For g₀(x) = 1/(ax + b),

d₀(x) = a ((m − 3)ax − (m − 1)b) / ((m − 1)(ax + b)⁴)   (23)

If m < 1, the fifth axiomatic requirement is satisfied if d₀(x) > 0, ∀x ∈ (0, ∞). Since a > 0, the condition d₀(x) > 0 is satisfied by g₀(x) = 1/(ax + b) if

x > ((m − 1)/(m − 3)) (b/a)   (24)

Once again, the fifth axiomatic requirement is satisfied for all x ∈ (0, ∞) only for b = 0, a value that violates the fourth axiomatic requirement.
4.2. Exponential Generator Functions
Consider the function g(x) = (g₀(x))^{1/(1−m)}, with g₀(x) = exp(βx), β > 0, and m > 1. For any β, g₀(x) = exp(βx) > 0, ∀x ∈ (0, ∞). For all β > 0, g₀(x) = exp(βx) is a monotonically increasing function of x ∈ (0, ∞). For g₀(x) = exp(βx), g₀′(x) = β exp(βx) and g₀″(x) = β² exp(βx). In this case,

r₀(x) = (1/(m − 1)) β² exp(2βx)   (25)

If m > 1, then r₀(x) > 0, ∀x ∈ (0, ∞), and g₀(x) = exp(βx) is an admissible generator function in the wide sense for all β > 0. Moreover,

lim_{x→0⁺} g₀(x) = 1   (26)

which implies that g₀(x) = exp(βx) satisfies the fourth axiomatic requirement. For g₀(x) = exp(βx), β > 0,

d₀(x) = β exp(2βx) (1 − (2β/(m − 1)) x)   (27)

If m > 1, the fifth axiomatic requirement is satisfied if d₀(x) < 0, ∀x ∈ (0, ∞). The condition d₀(x) < 0 is satisfied by g₀(x) = exp(βx) only if

x > (m − 1)/(2β) = σ²/2 > 0   (28)

where σ² = (m − 1)/β. Regardless of the value of β > 0, g₀(x) = exp(βx) is not an admissible generator function in the strict sense.

Consider also the function g(x) = (g₀(x))^{1/(1−m)}, with g₀(x) = exp(−βx), β > 0, and m < 1. For any β, g₀(x) = exp(−βx) > 0, ∀x ∈ (0, ∞). For all β > 0, g₀′(x) = −β exp(−βx) < 0, ∀x ∈ (0, ∞), and g₀(x) = exp(−βx) is a monotonically decreasing function. Since g₀″(x) = β² exp(−βx),

r₀(x) = (1/(m − 1)) β² exp(−2βx)   (29)

If m < 1, then r₀(x) < 0, ∀x ∈ (0, ∞), and g₀(x) = exp(−βx) is an admissible generator function in the wide sense for all β > 0.
Moreover,

lim_{x→0⁺} g₀(x) = 1   (30)

which implies that g₀(x) = exp(−βx) satisfies the fourth axiomatic requirement. For g₀(x) = exp(−βx), β > 0,

d₀(x) = −β exp(−2βx) (1 − (2β/(1 − m)) x)   (31)

For m < 1, the fifth axiomatic requirement is satisfied if d₀(x) > 0, ∀x ∈ (0, ∞). The condition d₀(x) > 0 is satisfied by g₀(x) = exp(−βx) only if

x > (1 − m)/(2β) = σ²/2   (32)

where σ² = (1 − m)/β. Once again, g₀(x) = exp(−βx) is not an admissible generator function in the strict sense regardless of the value of β > 0.
It must be emphasized that both increasing and decreasing exponential generator functions essentially lead to the same radial basis function. If m > 1, the increasing exponential generator function g₀(x) = exp(βx), β > 0, corresponds to the Gaussian radial basis function φ(x) = g(x²) = exp(−x²/σ²), with σ² = (m − 1)/β. If m < 1, the decreasing exponential generator function g₀(x) = exp(−βx), β > 0, also corresponds to the Gaussian radial basis function φ(x) = g(x²) = exp(−x²/σ²), with σ² = (1 − m)/β. In fact, the function g(x) = (g₀(x))^{1/(1−m)} corresponding to the increasing generator function g₀(x) = exp(βx), β > 0, with m = m_i > 1 is identical to the function g(x) = (g₀(x))^{1/(1−m)} corresponding to the decreasing generator function g₀(x) = exp(−βx), β > 0, with m = m_d < 1 if

m_i − 1 = 1 − m_d   (33)

or, equivalently, if

m_i + m_d = 2   (34)
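The equivalence expressed by (34) is easy to verify numerically. The sketch below uses illustrative values β = 2, m_i = 3, m_d = −1 and confirms that the increasing and decreasing exponential generator functions produce the same Gaussian:

```python
import numpy as np

beta, m_i, m_d = 2.0, 3.0, -1.0                     # note m_i + m_d = 2
x = np.linspace(0.0, 4.0, 81)                       # squared-distance axis

g_inc = np.exp(beta * x) ** (1.0 / (1.0 - m_i))     # increasing g0, m = m_i > 1
g_dec = np.exp(-beta * x) ** (1.0 / (1.0 - m_d))    # decreasing g0, m = m_d < 1
gauss = np.exp(-x / ((m_i - 1.0) / beta))           # exp(-x/sigma^2), sigma^2 = (m_i - 1)/beta
```

All three arrays coincide elementwise, as predicted by (33)-(34).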
5. SELECTING GENERATOR FUNCTIONS

All possible generator functions considered in the previous section satisfy the three basic axiomatic requirements, but none of them satisfies all five axiomatic requirements. In particular, the fifth axiomatic requirement is satisfied only by generator functions of the form g₀(x) = ax, which violate the fourth axiomatic requirement. Therefore, it is clear that at least one of the five axiomatic requirements must be compromised in order to select a generator function. Since the response of the radial basis functions must be bounded in some function approximation applications, generator functions can be selected by compromising the fifth axiomatic requirement. Although this requirement is by itself very restrictive, its implications can be used to guide the search for generator functions appropriate for gradient descent learning [13].
5.1. The Blind Spot
Since h_{j,k} = g(||x_k − v_j||²),

∇_{x_k} h_{j,k} = 2 g′(||x_k − v_j||²)(x_k − v_j)   (35)

and

||∇_{x_k} h_{j,k}||² = 4 ||x_k − v_j||² (g′(||x_k − v_j||²))² = 4 b(||x_k − v_j||²)   (36)

where b(x) = x(g′(x))². For b(x) = x(g′(x))²,

b′(x) = g′(x)(g′(x) + 2x g″(x))   (37)
Theorem 1 requires that g(x) be a decreasing function of x ∈ (0, ∞), which implies that g′(x) < 0, ∀x ∈ (0, ∞). Thus, (37) indicates that the fifth axiomatic requirement is satisfied if b′(x) < 0, ∀x ∈ (0, ∞). If this condition is not satisfied, then ||∇_{x_k}h_{j,k}||² is not a monotonically decreasing function of ||x_k − v_j||² in the interval (0, ∞), as required by the fifth axiomatic requirement. Given a function g(·) satisfying the three basic axiomatic requirements, the fifth axiomatic requirement can be relaxed by requiring that ||∇_{x_k}h_{j,k}||² be a monotonically decreasing function of ||x_k − v_j||² in the interval (B, ∞) for some B > 0. According to (36), this is guaranteed if the function b(x) = x(g′(x))² has a maximum at x = B or, equivalently, if there exists a B > 0 such that b′(B) = 0 and b″(B) < 0. If B ∈ (0, ∞) is a solution of b′(x) = 0 and b″(B) < 0, then b′(x) > 0, ∀x ∈ (0, B), and b′(x) < 0, ∀x ∈ (B, ∞). Thus, ||∇_{x_k}h_{j,k}||² is an increasing function of ||x_k − v_j||² for ||x_k − v_j||² ∈ (0, B) and a decreasing function of ||x_k − v_j||² for ||x_k − v_j||² ∈ (B, ∞). For all input vectors x_k that satisfy ||x_k − v_j||² < B, the norm of the gradient ∇_{x_k}h_{j,k} corresponding to the jth radial basis function decreases as x_k approaches the center of that function, which is located at the prototype v_j. This is exactly the opposite of the behavior that would intuitively be expected, given the interpretation of radial basis functions as receptive fields. As far as gradient descent learning is concerned, the hypersphere R_B = {x ∈ X ⊂ R^{n_i}: ||x − v||² ∈ (0, B)} is a "blind spot" for the radial basis function located at the prototype v. The blind spot provides a measure of the sensitivity of radial basis functions to input vectors close to their centers.
The blind spot R_B^{lin} corresponding to the linear generator function g₀(x) = ax + b is determined by

B_lin = ((m − 1)/(m + 1)) (b/a)   (38)
The effect of the parameter m on the size of the blind spot is revealed by the behavior of the ratio (m − 1)/(m + 1) viewed as a function of m. Since (m − 1)/(m + 1) increases as the value of m increases, increasing the value of m expands the blind spot. For a fixed value of m > 1, B_lin = 0 only if b = 0. For b ≠ 0, B_lin decreases and approaches 0 as a increases and approaches infinity. If a = 1 and b = γ², B_lin approaches 0 as γ approaches 0. If a = δ and b = 1, B_lin decreases and approaches 0 as δ increases and approaches infinity.
The blind spot R_B^{exp} corresponding to the exponential generator function g₀(x) = exp(βx) is determined by

B_exp = (m − 1)/(2β)   (39)

For a fixed value of β, the blind spot depends exclusively on the parameter m. Once again, the blind spot corresponding to the exponential generator function expands as the value of m increases. For a fixed value of m > 1, B_exp decreases and approaches 0 as β increases and approaches infinity. For g₀(x) = exp(βx), g(x) = (g₀(x))^{1/(1−m)} = exp(−x/σ²), with σ² = (m − 1)/β. As a result, the blind spot corresponding to the exponential generator function approaches 0 only if the width σ of the Gaussian radial basis function φ(x) = g(x²) = exp(−x²/σ²) approaches 0. Such a range of values of σ would make it difficult for Gaussian radial basis functions to behave as receptive fields that can cover the entire input space.
It is clear from (38) and (39) that the blind spot corresponding to the exponential generator function is much more sensitive to changes of m than that corresponding to the linear generator function. This can be quantified by computing for both generator functions the relative sensitivity of B = B(m) with respect to m, defined as

S_B^m = (m/B)(∂B/∂m)   (40)

For the linear generator function g₀(x) = ax + b, ∂B_lin/∂m = (2/(m + 1)²)(b/a) and

S_{B_lin}^m = 2m/(m² − 1)   (41)

For the exponential generator function g₀(x) = exp(βx), ∂B_exp/∂m = 1/(2β) and

S_{B_exp}^m = m/(m − 1)   (42)

The relative sensitivities (41) and (42) are related by

S_{B_exp}^m = ((m + 1)/2) S_{B_lin}^m   (43)

Since m > 1, S_{B_exp}^m > S_{B_lin}^m. As an example, for m = 3 the sensitivity with respect to m of the blind spot corresponding to the exponential generator function is twice that corresponding to the linear generator function.
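The closed-form blind spots (38) and (39) and the relative sensitivities (41)-(43) can be cross-checked with a few lines of arithmetic. The parameter values below (m = 3, a = 1, b = 0.1, β = 5) are hypothetical, chosen to mirror the figures discussed later:

```python
# Blind spot radii B_lin = ((m-1)/(m+1))(b/a) and B_exp = (m-1)/(2*beta),
# and their relative sensitivities S = (m/B) dB/dm, evaluated at m = 3.
m, a, b, beta = 3.0, 1.0, 0.1, 5.0

B_lin = (m - 1.0) / (m + 1.0) * b / a    # eq. (38)
B_exp = (m - 1.0) / (2.0 * beta)         # eq. (39); equals 1/beta when m = 3

S_lin = 2.0 * m / (m * m - 1.0)          # eq. (41)
S_exp = m / (m - 1.0)                    # eq. (42)
ratio = S_exp / S_lin                    # eq. (43) predicts (m + 1)/2
```

For m = 3 the ratio evaluates to 2, reproducing the factor-of-two statement above.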
The behavior of each radial basis function outside its blind spot can be characterized in terms of:

1. The value h_{j,k} = g(B) of the response h_{j,k} = g(||x_k − v_j||²) of the jth radial basis function at ||x_k − v_j||² = B and the rate at which the response h_{j,k} = g(||x_k − v_j||²) decreases as ||x_k − v_j||² increases above B and approaches infinity, and

2. The maximum value attained by the norm of the gradient ∇_{x_k}h_{j,k} at ||x_k − v_j||² = B and the rate at which ||∇_{x_k}h_{j,k}||² decreases as ||x_k − v_j||² increases above B and approaches infinity.
The criteria that may be used for selecting radial basis functions can be established by considering the following extreme situation. Suppose the response h_{j,k} = g(||x_k − v_j||²) diminishes very quickly and the receptive field located at the prototype v_j does not extend far beyond the blind spot. This can have a negative impact on the function approximation ability of the corresponding RBF model because the region outside the blind spot contains the input vectors that affect the implementation of the input-output mapping, as indicated by the sensitivity measure ||∇_{x_k}h_{j,k}||². Thus, a generator function must be selected in such a way that:

1. The response h_{j,k} and the sensitivity measure ||∇_{x_k}h_{j,k}||² take substantial values outside the blind spot before they approach 0, and

2. The response h_{j,k} is sizable outside the blind spot even after the values of ||∇_{x_k}h_{j,k}||² become negligible.
The rate at which the response h_{j,k} = g(||x_k − v_j||²) decreases is related to the "tails" of the functions g(·) that correspond to different generator functions. The use of a short-tailed function g(·) shrinks the receptive fields of the RBF model, whereas the use of a long-tailed function g(·) increases the overlapping between the receptive fields located at different prototypes. If g(x) = (g₀(x))^{1/(1−m)} and m > 1, the tail of g(x) is determined by how fast the corresponding generator function g₀(x) changes as a function of x. As x increases, the exponential generator function g₀(x) = exp(βx) increases faster than the linear generator function g₀(x) = ax + b. As a result, the response g(x) = (g₀(x))^{1/(1−m)} diminishes very quickly if g₀(·) is exponential and slowly if g₀(·) is linear.
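The difference between the two tails can be made concrete with a small numeric comparison. The values below (m = 3, γ² = 0.1, β = 5) are illustrative, mirroring the parameter choices of the figures discussed in this section:

```python
import numpy as np

m = 3.0
g_lin = lambda x: (x + 0.1) ** (1.0 / (1.0 - m))        # g0(x) = x + gamma^2, gamma^2 = 0.1
g_exp = lambda x: np.exp(5.0 * x) ** (1.0 / (1.0 - m))  # g0(x) = exp(beta*x), beta = 5

far = 4.0                  # a squared distance well outside both blind spots
slow = g_lin(far)          # long-tailed response: decays slowly
fast = g_exp(far)          # short-tailed response: already negligible
```

At this distance the linear-generator response is still close to 0.5, while the exponential-generator (Gaussian) response has fallen below 10⁻⁴, which is exactly the short-tail versus long-tail behavior described above.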
The behavior of the sensitivity measure ||∇_{x_k}h_{j,k}||² also depends on the properties of the function g(·). For h_{j,k} = g(||x_k − v_j||²), ∇_{x_k}h_{j,k} can be obtained from (35) as

∇_{x_k} h_{j,k} = −α_{j,k}(x_k − v_j)   (44)

where

α_{j,k} = −2 g′(||x_k − v_j||²)   (45)

From (44),

||∇_{x_k} h_{j,k}||² = (α_{j,k})² ||x_k − v_j||²   (46)

The selection of a specific function g(·) influences the sensitivity measure ||∇_{x_k}h_{j,k}||² through α_{j,k} = −2g′(||x_k − v_j||²). If g(x) = (g₀(x))^{1/(1−m)}, then

g′(x) = (1/(1 − m)) (g₀(x))^{m/(1−m)} g₀′(x) = (1/(1 − m)) (g(x))^m g₀′(x)   (47)

Since h_{j,k} = g(||x_k − v_j||²), α_{j,k} is given by

α_{j,k} = (2/(m − 1)) (h_{j,k})^m g₀′(||x_k − v_j||²)   (48)

If g₀(x) = ax + b, then g₀′(x) = a and

h_{j,k} = (a||x_k − v_j||² + b)^{1/(1−m)}   (49)

In this case, (48) gives

α_{j,k} = (2a/(m − 1)) (h_{j,k})^m   (50)

and (46) becomes

||∇_{x_k} h_{j,k}||² = (2a/(m − 1))² (h_{j,k})^{2m} ||x_k − v_j||²   (51)

If g₀(x) = exp(βx), then

h_{j,k} = (exp(β||x_k − v_j||²))^{1/(1−m)} = exp(−||x_k − v_j||²/σ²)   (52)

where σ² = (m − 1)/β. For this generator function, g₀′(x) = β exp(βx) = β g₀(x). In this case, g₀′(||x_k − v_j||²) = β (h_{j,k})^{1−m} and (48) gives

α_{j,k} = (2β/(m − 1)) h_{j,k} = (2/σ²) h_{j,k}   (53)

and (46) becomes

||∇_{x_k} h_{j,k}||² = (2/σ²)² (h_{j,k})² ||x_k − v_j||²   (54)
Figures 3a and 3b show the response h_{j,k} = g(||x_k − v_j||²) of the jth radial basis function to the input vector x_k and the sensitivity measure ||∇_{x_k}h_{j,k}||² plotted as functions of ||x_k − v_j||² for g(x) = (g₀(x))^{1/(1−m)}, with g₀(x) = exp(βx), m = 3, for β = 5 and β = 10, respectively. Once again, ||∇_{x_k}h_{j,k}||² increases monotonically as ||x_k − v_j||² increases from 0 to B = 1/β and decreases monotonically as ||x_k − v_j||² increases above B and approaches infinity. Nevertheless, there are some significant differences between the response h_{j,k} and the sensitivity measure ||∇_{x_k}h_{j,k}||² corresponding to linear and exponential generator functions, as indicated by comparing Figures 2 and 3. If g₀(x) = exp(βx), then the response h_{j,k} is substantial for the input vectors inside the
Figure 2 The normalized response (γ²)^{1/(m−1)} h_{j,k} of the jth radial basis function and the normalized norm of the gradient (γ²)^{2/(m−1)} ||∇_{x_k}h_{j,k}||² plotted as functions of ||x_k − v_j||² for g(x) = (g₀(x))^{1/(1−m)}, with g₀(x) = x + γ², m = 3, and (a) γ² = 0.1, (b) γ² = 0.01.

Figure 3 The response h_{j,k} of the jth radial basis function and the norm of the gradient ||∇_{x_k}h_{j,k}||² plotted as functions of ||x_k − v_j||² for g(x) = (g₀(x))^{1/(1−m)}, with g₀(x) = exp(βx), m = 3, and (a) β = 5, (b) β = 10.
blind spot but diminishes very quickly for values of ||x_k − v_j||² above B. In fact, the values of h_{j,k} become negligible even before ||∇_{x_k}h_{j,k}||² approaches asymptotically zero values. This is in direct contrast to the behavior of the same quantities corresponding to linear generator functions, which are shown in Figure 2.
6. LEARNING ALGORITHMS BASED ON GRADIENT DESCENT

The response of a reformulated RBF neural network to the input vector x_k is

ŷ_{i,k} = f(w_i^T h_k) = f(Σ_{j=0}^{c} w_{ij} h_{j,k})   (55)

with h_{0,k} = 1, ∀k, h_{j,k} = g(||x_k − v_j||²), 1 ≤ j ≤ c, h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, and w_i = [w_{i0} w_{i1} ... w_{ic}]^T. Training is typically based on the minimization of the error between the actual outputs of the network ŷ_k, 1 ≤ k ≤ M, and the desired responses y_k, 1 ≤ k ≤ M.
6.1. Batch Learning Algorithms
A reformulated RBF neural network can be trained by minimizing the error

E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_o} (y_{i,k} − ŷ_{i,k})²   (56)

Minimization of (56) using gradient descent implies that all training examples are presented to the RBF network simultaneously. Such a training strategy leads to batch learning algorithms. The update equation for the weight vectors of the upper associative network can be obtained using gradient descent as [15]

Δw_p = −η ∇_{w_p} E = η Σ_{k=1}^{M} ε°_{p,k} h_k   (57)

where η is the learning rate and ε°_{p,k} is the output error, given as

ε°_{p,k} = f′(ŷ_{p,k})(y_{p,k} − ŷ_{p,k})   (58)
Similarly, the update equation for the prototypes can be obtained using gradient descent as [15]

Δv_q = −η ∇_{v_q} E = η Σ_{k=1}^{M} ε^h_{q,k} (x_k − v_q)   (59)

where ε^h_{q,k} is the hidden error, defined as

ε^h_{q,k} = α_{q,k} Σ_{i=1}^{n_o} ε°_{i,k} w_{iq}   (60)

with α_{q,k} = −2g′(||x_k − v_q||²). The selection of a specific function g(·) influences the update of the prototypes through α_{q,k} = −2g′(||x_k − v_q||²), which is involved in the calculation of the corresponding hidden error ε^h_{q,k}. Since h_{q,k} = g(||x_k − v_q||²) and g(x) = (g₀(x))^{1/(1−m)}, α_{q,k} is given by (48) and the hidden error (60) becomes

ε^h_{q,k} = (2/(m − 1)) (h_{q,k})^m g₀′(||x_k − v_q||²) Σ_{i=1}^{n_o} ε°_{i,k} w_{iq}   (61)
Following the computation of the output errors, the current estimate of each weight vector w_p, 1 ≤ p ≤ n_o, is replaced by

w_p + Δw_p = w_p + η Σ_{k=1}^{M} ε°_{p,k} h_k   (62)

Given the learning rate η and the responses h_k of the radial basis functions, these weight vectors are updated according to the output errors ε°_{p,k}, 1 ≤ p ≤ n_o. Following the update of these weight vectors, the current estimate of each prototype v_q, 1 ≤ q ≤ c, is replaced by

v_q + Δv_q = v_q + η Σ_{k=1}^{M} ε^h_{q,k} (x_k − v_q)   (63)

For a given value of the learning rate η, the update of v_q depends on the hidden errors ε^h_{q,k}, 1 ≤ k ≤ M. The hidden error ε^h_{q,k} is influenced by the output errors ε°_{i,k}, 1 ≤ i ≤ n_o, and the weights w_{iq}, 1 ≤ i ≤ n_o, through the term Σ_{i=1}^{n_o} ε°_{i,k} w_{iq}. Thus, the RBF network is trained according to this scheme by propagating back the output error.
This algorithm can be summarized as follows:

1. Select η and ε; initialize {w_{ij}} with zero values; initialize the prototypes v_j, 1 ≤ j ≤ c; set h_{0,k} = 1, ∀k.
2. Compute the initial response:
   • h_{j,k} = (g₀(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
3. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_o} (y_{i,k} − ŷ_{i,k})².
4. Set E_old = E.
5. Update the adjustable parameters:
   • ε°_{i,k} = f′(ŷ_{i,k})(y_{i,k} − ŷ_{i,k}), ∀i, k.
   • w_i ← w_i + η Σ_{k=1}^{M} ε°_{i,k} h_k, ∀i.
   • ε^h_{j,k} = (2/(m − 1)) g₀′(||x_k − v_j||²)(h_{j,k})^m Σ_{i=1}^{n_o} ε°_{i,k} w_{ij}, ∀j, k.
   • v_j ← v_j + η Σ_{k=1}^{M} ε^h_{j,k}(x_k − v_j), ∀j.
6. Compute the current response:
   • h_{j,k} = (g₀(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
7. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_o} (y_{i,k} − ŷ_{i,k})².
8. If (E_old − E)/E_old > ε, then go to 4.
6.2. Sequential Learning Algorithms

Reformulated RBF neural networks can also be trained by a sequential learning algorithm that updates the adjustable parameters after the presentation of each training example so as to reduce the error

E_k = (1/2) Σ_{i=1}^{n_o} (y_{i,k} − ŷ_{i,k})²   (64)

for k = 1, 2, ..., M. The update equation for the weight vectors of the upper associative network can be obtained using gradient descent as [15]

Δw_{p,k} = w_{p,k} − w_{p,k−1} = −η ∇_{w_p} E_k = η ε°_{p,k} h_k   (65)
where w_{p,k−1} and w_{p,k} are the estimates of the weight vector w_p before and after the presentation of the training example (x_k, y_k), η is the learning rate, and ε°_{p,k} is the output error defined in (58). Similarly, the update equation for the prototypes can be obtained using gradient descent as [15]

Δv_{q,k} = v_{q,k} − v_{q,k−1} = −η ∇_{v_q} E_k = η ε^h_{q,k} (x_k − v_q)   (66)

where v_{q,k−1} and v_{q,k} are the estimates of the prototype v_q before and after the presentation of the training example (x_k, y_k), η is the learning rate, and ε^h_{q,k} is the hidden error defined in (61).
When an adaptation cycle begins, the current estimates of the weight vectors w_p and the prototypes v_q are stored in w_{p,0} and v_{q,0}, respectively. After an example (x_k, y_k), 1 ≤ k ≤ M, is presented to the network, each weight vector w_p, 1 ≤ p ≤ n_o, is updated as

w_{p,k} = w_{p,k−1} + η ε°_{p,k} h_k   (67)

Following the update of all the weight vectors w_p, 1 ≤ p ≤ n_o, each prototype v_q, 1 ≤ q ≤ c, is updated according to

v_{q,k} = v_{q,k−1} + η ε^h_{q,k} (x_k − v_q)   (68)

An adaptation cycle is completed in this case after the sequential presentation to the network of all the examples included in the training set. Once again, the RBF network is trained according to this scheme by propagating back the output error.
This algorithm can be summarized as follows:

1. Select η and ε; initialize {w_{ij}} with zero values; initialize the prototypes v_j, 1 ≤ j ≤ c; set h_{0,k} = 1, ∀k.
2. Compute the initial response:
   • h_{j,k} = (g₀(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
3. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_o} (y_{i,k} − ŷ_{i,k})².
4. Set E_old = E.
5. Update the adjustable parameters for all k = 1, 2, ..., M:
   • ε°_{i,k} = f′(ŷ_{i,k})(y_{i,k} − ŷ_{i,k}), ∀i.
   • w_i ← w_i + η ε°_{i,k} h_k, ∀i.
   • ε^h_{j,k} = (2/(m − 1)) g₀′(||x_k − v_j||²)(h_{j,k})^m Σ_{i=1}^{n_o} ε°_{i,k} w_{ij}, ∀j.
   • v_j ← v_j + η ε^h_{j,k} (x_k − v_j), ∀j.
6. Compute the current response:
   • h_{j,k} = (g₀(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
7. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_o} (y_{i,k} − ŷ_{i,k})².
8. If (E_old − E)/E_old > ε, then go to 4.
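The sequential algorithm above can be sketched end to end in Python. This is a minimal illustration, not the chapter's code: it assumes linear output units (f(u) = u), the linear generator g₀(x) = 1 + δx with hypothetical values m = 3 and δ = 10, and a made-up XOR-style data set:

```python
import numpy as np

def train_rbf(X, Y, c, m=3.0, delta=10.0, eta=0.01, epochs=500, seed=0):
    """Sequential gradient descent for a reformulated RBF network with
    g(x) = (1 + delta*x)^(1/(1-m)) and linear output units (f(u) = u)."""
    rng = np.random.default_rng(seed)
    M, _ = X.shape
    n_o = Y.shape[1]
    V = X[rng.choice(M, c, replace=False)].copy()   # prototypes drawn from the data
    W = np.zeros((n_o, c + 1))                      # output weights, incl. bias w_i0

    g0 = lambda x: 1.0 + delta * x                  # generator function
    g = lambda x: g0(x) ** (1.0 / (1.0 - m))        # radial basis function of ||.||^2

    for _ in range(epochs):
        for x, y in zip(X, Y):
            d2 = ((x - V) ** 2).sum(axis=1)         # ||x_k - v_j||^2, j = 1..c
            h = np.concatenate(([1.0], g(d2)))      # responses, with h_0 = 1
            err_o = y - W @ h                       # output error (f'(u) = 1)
            W += eta * np.outer(err_o, h)           # weight update, eq. (65)
            # hidden error, eq. (61): (2/(m-1)) g0'(.) h_j^m sum_i err_i w_ij
            err_h = (2.0 / (m - 1.0)) * delta * h[1:] ** m * (W[:, 1:].T @ err_o)
            V += eta * err_h[:, None] * (x - V)     # prototype update, eq. (66)
    return V, W, g

# Hypothetical usage: fit an XOR-style mapping with c = 4 radial basis functions.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
V, W, g = train_rbf(X, Y, c=4)
```

Note that the prototype update uses the current responses h_{j,k} and the already-updated weights, mirroring the ordering of step 5 above.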
7. GENERATOR FUNCTIONS AND GRADIENT DESCENT LEARNING

The effect of the generator function on gradient descent learning algorithms developed for reformulated RBF neural networks is essentially related to the criteria established in Section 5 for selecting generator functions. These criteria were established on the basis of the response h_{j,k} of the jth radial basis function to an input vector x_k and the norm of the gradient ∇_{x_k}h_{j,k}, which can be used to measure the sensitivity of the radial basis function response h_{j,k} to an input vector x_k. Since ∇_{x_k}h_{j,k} = −∇_{v_j}h_{j,k}, (46) gives

||∇_{v_j} h_{j,k}||² = ||∇_{x_k} h_{j,k}||² = (α_{j,k})² ||x_k − v_j||²   (69)

According to (69), the quantity (α_{j,k})² ||x_k − v_j||² can also be used to measure the sensitivity of the response of the jth radial basis function to changes in the prototype v_j that represents its location in the input space.
The gradient descent learning algorithms presented in Section 6 attempt to train an RBF neural network to implement a desired input-output mapping by producing incremental changes of its adjustable parameters, that is, the output weights and the prototypes. If the responses of the radial basis functions are not substantially affected by incremental changes of the prototypes, then the learning process reduces to incremental changes of the output weights, and the algorithm effectively trains a single-layered neural network. Given the limitations of single-layered neural networks [21], such updates alone are unlikely to implement nontrivial input-output mappings. Thus, the ability of the network to implement a desired input-output mapping depends to a large extent on the sensitivity of the responses of the radial basis functions to incremental changes of their corresponding prototypes. This discussion indicates that the sensitivity measure ||∇_{v_j}h_{j,k}||² is relevant to gradient descent learning algorithms developed for reformulated RBF neural networks. Moreover, the form of this sensitivity measure in (69) underlines the significant role of the generator function, whose selection affects ||∇_{v_j}h_{j,k}||² as indicated by the definition of α_{j,k} in (48). The effect of the generator function on gradient descent learning is revealed by comparing the response h_{j,k} and the sensitivity measure ||∇_{v_j}h_{j,k}||² = ||∇_{x_k}h_{j,k}||² corresponding to the linear and exponential generator functions, which are plotted as functions of ||x_k − v_j||² in Figures 2 and 3, respectively.
According to Figure 2, the response h_{j,k} of the jth radial basis function to the input x_k diminishes very slowly outside the blind spot, that is, as ||x_k − v_j||² increases above B. This implies that the training vector x_k has a nonnegligible effect on the response h_{j,k} of the radial basis function located at this prototype. The behavior of the sensitivity measure ||∇_{v_j}h_{j,k}||² outside the blind spot indicates that the update of the prototype v_j produces significant variations in the input of the upper associative network, which is trained to implement the desired input-output mapping by updating the output weights. Figure 2 also reveals the trade-off involved in the selection of the free parameter γ in practice. As the value of γ decreases, ||∇_{v_j}h_{j,k}||² attains significantly higher values. This implies that the jth radial basis function is more sensitive to updates of the prototype v_j due to input vectors outside its blind spot. The blind spot shrinks as the value of γ decreases, but ||∇_{v_j}h_{j,k}||² approaches 0 quickly outside the blind spot, that is, as the value of ||x_k − v_j||² increases above B. This implies that the receptive fields located at the prototypes shrink, which can have a negative impact on gradient descent learning. Decreasing the value of γ can also affect the number of radial basis functions required for the implementation of the desired input-output mapping. This is due to the fact that more radial basis functions are required to cover the input space. The receptive fields located at the prototypes can be expanded by increasing the value of γ. However, ||∇_{v_j}h_{j,k}||² becomes flat as the value of γ increases. This implies that very large values of γ can decrease the sensitivity of the radial basis functions to the input vectors included in their receptive fields.
According to Figure 3, the response of the jth radial basis function to the input x_k diminishes very quickly outside the blind spot, that is, as ||x_k − v_j||² increases above B. This behavior indicates that if an RBF network is constructed using exponential generator functions, the inputs x_k corresponding to high values of ||∇_{v_j}h_{j,k}||² have no significant effect on the response of the radial basis function located at the prototype v_j. As a result, the update of this prototype due to x_k does not produce significant variations in the input of the upper associative network that implements the desired input-output mapping. Figure 3 also indicates that the blind spot shrinks as the value of β increases while ||∇_{v_j}h_{j,k}||² reaches higher values. Decreasing the value of β expands the blind spot but ||∇_{v_j}h_{j,k}||² reaches lower values. In other words, the selection of the value of β in practice involves a trade-off similar to that associated with the selection of the free parameter γ when the radial basis functions are formed by linear generator functions.
8. EXPERIMENTAL RESULTS
The performance of reformulated RBF neural networks generated by linear and expo-
nential generator functions was evaluated and compared with that of conventional
RBF networks using a set of 2D vowel data formed by computing the first two formats
Fl and F2 from samples of 10 vowels spoken by 67 speakers [22,23]. This data set has
been extensively used to compare different pattern classification approaches because
there is significant overlapping between the points corresponding to different vowels in
the F1-F2 plane [9,22,23]. The available 671 feature vectors were divided into a training
set, containing 338 vectors, and a testing set, containing 333 vectors. The training and
testing sets formed from the 2D vowel data were classified by various RBF neural
networks. The RBF networks tested in these experiments consisted of 2 inputs and
10 linear output units, each representing a vowel. The number of radial basis functions
varied in these experiments from 15 to 50. The RBF networks were trained using a
normalized version of the input data produced by replacing each feature sample x by
x̂ = (x − μ_x)/σ_x, where μ_x and σ_x denote the sample mean and standard deviation of
this feature over the entire data set. The networks were trained to respond with y_{i,k} = 1
and y_{j,k} = 0, ∀ j ≠ i, when presented with an input vector x_k ∈ X representing the ith
vowel. The assignment of input vectors to classes was based on a winner-takes-all
strategy. More specifically, each input vector was assigned to the class represented by
the output unit of the trained RBF network with the maximum response.
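The preprocessing and decision rule described above can be sketched in a few lines; the function and variable names below are illustrative, not code from the chapter.

```python
import numpy as np

def normalize_features(X):
    """Replace each feature sample x by (x - mean) / std, computed per
    feature over the entire data set, as in the chapter's preprocessing."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def winner_takes_all(outputs):
    """Assign each input to the class of the output unit with the maximum
    response. outputs: (M, c) array of linear output-unit responses."""
    return np.argmax(outputs, axis=1)

# Illustrative responses for 3 inputs and 10 output units (one per vowel).
outputs = np.zeros((3, 10))
outputs[0, 4], outputs[1, 1], outputs[2, 9] = 0.9, 0.7, 0.8
print(winner_takes_all(outputs))  # [4 1 9]
```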
The training set formed from the 2D vowel data was used to train conventional
RBF neural networks with Gaussian radial basis functions. The prototypes represent-
ing the locations of the radial basis functions in the feature space were determined by
the c-means algorithm, and the weights connecting the radial basis functions and the
output units were updated to minimize the error at the output using gradient descent
with learning rate η = 10⁻². The widths of the Gaussian radial basis functions were
determined according to the "closest neighbor" heuristic. More specifically, the width σ_j
of the radial basis function located at the prototype v_j was determined as
σ_j = min_{i≠j} {||v_j − v_i||}. Table 1 summarizes the percentage of feature vectors from the training and
testing sets classified incorrectly by RBF neural networks trained in the three different
trials. These experimental results verify the erratic behavior that is often associated with
RBF neural networks. There was a significant variation in the percentage of classifica-
tion errors recorded on the training and testing sets as the number of radial basis
functions varied from 15 to 50. When the number of radial basis functions was rela-
Section 8 Experimental Results 147
TABLE 1 Percentage of Classification Errors Produced in Three Trials on the Training Set (E_train)
and the Testing Set (E_test) Formed from the 2D Vowel Data by Conventional RBF Networks
Containing c Gaussian Radial Basis Functions
tively small (from 15 to 25), the performance of the trained RBF networks on both
training and testing sets was rather poor. Their performance improved as the number of
radial basis functions increased above 30. Even in this case, however, the classification
errors did not consistently decrease as the number of radial basis functions increased, a
behavior that is reasonably expected at least for the training set. The performance of
conventional RBF networks is significantly affected by the initialization of the learning
process, as indicated by the classification errors produced by RBF networks with the
same number of radial basis functions in three trials. Since the output weights were all
set to 0 in the beginning of the learning process, the initialization used in these trials
affected only the partition of the feature vectors produced by the c-means algorithm.
Thus, the performance of RBF neural networks trained using this learning scheme is
mainly affected by the prototypes produced by the unsupervised clustering algorithm,
which determine the locations of the Gaussian radial basis functions in the feature
space.
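The conventional pipeline just described — prototypes fixed by unsupervised clustering, widths from the closest-neighbor heuristic, and gradient descent on the output weights alone — can be sketched as follows. The function names and the exact Gaussian convention exp(−d²/σ_j²) are assumptions for illustration, not the chapter's code.

```python
import numpy as np

def closest_neighbor_widths(V):
    """'Closest neighbor' heuristic: sigma_j = min_{i != j} ||v_j - v_i||."""
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)        # exclude each prototype from its own minimum
    return d.min(axis=1)

def gaussian_layer(X, V, sigma):
    """Gaussian radial basis responses exp(-||x - v_j||^2 / sigma_j^2)
    (the width convention is an assumption)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma[None, :] ** 2)

def train_output_weights(H, Y, eta=1e-2, epochs=500):
    """Gradient descent on the squared output error; the prototypes and
    widths stay fixed, so only the output weights W are updated."""
    W = np.zeros((H.shape[1], Y.shape[1]))
    for _ in range(epochs):
        W -= eta * H.T @ (H @ W - Y)   # batch gradient of ||HW - Y||^2 / 2
    return W
```

Decoupling the prototypes from the error minimization in this way is exactly what the trials above probe: only the c-means initialization differs between trials, yet the errors vary widely.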
A variety of reformulated RBF neural networks were trained on the training set
formed from the 2D vowel data using the gradient descent algorithm presented in
Section 6. The radial basis functions were of the form φ{χ) = gix1) and g(·) was defined
in terms of an increasing generator function g0() asg(x) = (go(x))l/^~m\ with m > 1. In
each case, the learning rate was selected sufficiently small so as to avoid a temporary
increase of the total error especially in the beginning of the learning process.
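The reformulated radial basis functions described above can be sketched directly; the names are illustrative, and the exponent 1/(1−m) follows the form stated in the text.

```python
import numpy as np

M_PARAM = 3.0  # the exponent parameter m > 1 used in the experiments

def g_linear(x, gamma, m=M_PARAM):
    """g(x) = (g0(x))^(1/(1-m)) with the linear generator g0(x) = x + gamma^2."""
    return (x + gamma ** 2) ** (1.0 / (1.0 - m))

def g_exponential(x, beta, m=M_PARAM):
    """g(x) = (g0(x))^(1/(1-m)) with the exponential generator g0(x) = exp(beta*x)."""
    return np.exp(beta * x) ** (1.0 / (1.0 - m))

def rbf_response(x, v, g):
    """phi(x) = g(||x - v||^2): response of the unit with prototype v."""
    return g(np.sum((np.asarray(x, float) - np.asarray(v, float)) ** 2))
```

For m = 3 and γ = 1, the linear case gives φ(x) = (||x − v||² + 1)^(−1/2); the exponential case reduces to a Gaussian, as discussed below.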
Table 2 shows the percentage of classification errors on the training and testing sets
produced by reformulated RBF networks generated by linear generator functions
g0(x) = x + γ² with m = 3 and γ = 0, γ = 0.1, γ = 1. It is clear from Table 2 that the
performance of reformulated RBF networks improved as the value of γ increased from
0 to 1. The improvement was significant when the reformulated RBF networks con-
tained a relatively small number of radial basis functions (15 to 25). This is particularly
important, because the ability of RBF networks to generalize degrades as the number of
radial basis functions increases. The price to be paid for such an improvement in
performance is slower convergence of the learning algorithm. It was experimentally
verified that the gradient descent learning algorithm converges slower as the value of
γ increases from 0 to 1, despite the fact that the maximum allowable learning rate for
γ = 0 was η = 10⁻⁵ whereas the networks with γ = 0.1 and γ = 1 were trained with
η = 10⁻⁴. This is consistent with the sensitivity analysis of the gradient descent learning
TABLE 2 Percentage of Classification Errors Produced on the Training Set (E_train) and the Testing
Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks Containing c Radial Basis
Functions of the Form φ(x) = g(x²), with g(x) = (g0(x))^{1/(1−m)}, g0(x) = x + γ², m = 3, and Various
Values of γ
algorithm presented in Section 7 and the behavior of the sensitivity measure ||∇_{v_j}h_{j,k}||²,
which is plotted in Figure 2. It is clear from Figure 2 that the response of each radial
basis function becomes increasingly sensitive to changes of ||x_k − v_j||² as γ decreases
and approaches 0.
Table 3 shows the percentage of classification errors on the training and testing sets
produced in three trials by reformulated RBF networks generated by the linear gen-
erator function g0(x) = x + γ² with γ = 1 and m = 3. The learning process was initialized
in each trial using a different set of prototypes distributed randomly over the
feature space. In all three trials, the output weights were all set to 0. According to Table
3, the differences in the percentage of classification errors produced in different trials by
reformulated RBF networks of the same size were not significant. Unlike conventional
RBF neural networks, reformulated RBF neural networks are only slightly affected by
the initialization of the learning process. This is not surprising, because the prototypes
of reformulated RBF networks are updated during learning as indicated by the training
set. In contrast, the prototypes of conventional RBF networks remain fixed during
learning. As a result, an unsuccessful set of prototypes can severely affect the implementation
of the input-output mapping that relies only on the update of the output weights.
TABLE 3 Percentage of Classification Errors Produced in Three Trials on the Training Set (E_train)
and the Testing Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks
Containing c Radial Basis Functions of the Form φ(x) = g(x²), with g(x) = (g0(x))^{1/(1−m)},
g0(x) = x + γ², m = 3, and γ = 1
Table 4 summarizes the percentage of classification errors produced on the training
and testing sets formed from the 2D vowel data by reformulated RBF networks gen-
erated by exponential generator functions g0(x) = exp(βx) with m = 3 and β = 0.5,
β = 1, β = 5. For a fixed value of m, the width σ of all Gaussian radial basis functions
is determined as σ² = (m − 1)/β. For m = 3, the values β = 0.5, β = 1, and β = 5
correspond to σ² = 4, σ² = 2, and σ² = 0.4, respectively. The networks were also
trained with β = 0.1 (σ² = 20) and β = 0.2 (σ² = 10) but the learning algorithm did
not converge and the networks classified incorrectly almost half of the feature vectors
from the training and testing sets. Table 4 indicates that even the networks trained with
β = 0.5 did not achieve satisfactory performance. The performance of the trained net-
works on both training and testing sets improved as the value of β increased above 0.5.
The best performance on the training set was achieved for β = 5. However, the best
performance on the testing set was achieved for β = 1. It is remarkable that for β = 1
the percentage of classification errors on the testing set was almost constant as the
number of radial basis functions increased above 20. The gradient descent algorithm
also converged for values of β above 5 but the performance of the trained RBF net-
works on the testing set degraded. This set of experiments is consistent with the sensi-
tivity analysis of gradient descent learning and reveals the trade-off associated with the
selection of the value of β or, equivalently, the width of the resulting Gaussian radial
basis functions. Small values of β reduce the sensitivity of the radial basis functions to
changes of their respective prototypes and the learning algorithm does not converge.
Large values of β lead to radial basis functions that are more sensitive to changes of
their respective prototypes. However, such values of β create sharp radial basis func-
tions, and that has a negative effect on the ability of the trained RBF networks to
generalize. These experiments also indicate that there exists a certain range of values of
β that guarantee convergence of the learning algorithm and satisfactory performance.
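The equivalence behind this trade-off — the exponential generator function yields a Gaussian of width σ² = (m − 1)/β — can be checked numerically. This is a verification sketch, not code from the chapter.

```python
import numpy as np

m = 3.0
d2 = np.linspace(0.0, 5.0, 11)           # squared distances ||x_k - v_j||^2
for beta, sigma2_expected in [(0.5, 4.0), (1.0, 2.0), (5.0, 0.4)]:
    sigma2 = (m - 1.0) / beta            # width implied by m and beta
    # (exp(beta*x))^(1/(1-m)) = exp(-beta*x/(m-1)) = exp(-x/sigma^2)
    g = np.exp(beta * d2) ** (1.0 / (1.0 - m))
    assert np.isclose(sigma2, sigma2_expected)
    assert np.allclose(g, np.exp(-d2 / sigma2))
print("sigma^2 = (m - 1)/beta reproduces the Gaussian widths 4, 2, and 0.4")
```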
Table 5 summarizes the percentage of classification errors produced in three trials
by reformulated RBF networks generated by the exponential generator function
g0(x) = exp(βx) with m = 3 and β = 1. In this case the width of all Gaussian radial basis
functions was σ = √2. The learning process was initialized in each trial by different sets
TABLE 4 Percentage of Classification Errors Produced on the Training Set (E_train) and the Testing
Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks Containing c Radial Basis
Functions of the Form φ(x) = g(x²), with g(x) = (g0(x))^{1/(1−m)}, g0(x) = exp(βx), m = 3, and Various
Values of β

        β = 0.5 (η = 10⁻⁴)       β = 1.0 (η = 10⁻⁴)       β = 5.0 (η = 10⁻⁴)
 c    E_train(%)  E_test(%)    E_train(%)  E_test(%)    E_train(%)  E_test(%)
15       29.0       25.2          22.2       21.9          25.1       27.9
20       29.3       27.6          23.7       20.1          22.5       21.0
25       29.3       27.9          25.4       20.7          21.3       21.6
30       27.2       27.0          23.1       20.7          22.2       21.9
35       28.1       25.5          24.8       20.7          21.6       21.9
40       27.5       25.5          23.4       20.7          21.0       21.0
45       27.8       24.6          24.3       19.5          19.8       21.3
50       26.6       24.0          24.3       19.8          19.8       21.0
TABLE 5 Percentage of Classification Errors Produced in Three Trials on the Training Set (E_train)
and the Testing Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks
Containing c Radial Basis Functions of the Form φ(x) = g(x²), with g(x) = (g0(x))^{1/(1−m)},
g0(x) = exp(βx), m = 3, and β = 1

        Trial #1 (η = 10⁻⁴)      Trial #2 (η = 10⁻⁴)      Trial #3 (η = 10⁻⁴)
 c    E_train(%)  E_test(%)    E_train(%)  E_test(%)    E_train(%)  E_test(%)
of randomly selected prototypes. In fact, the sets of initial prototypes used in these
experiments were identical to those used in the experiments employing conventional
RBF networks and reformulated RBF networks generated by linear generator func-
tions, the results of which are summarized in Tables 1 and 3, respectively. The output
weights were all set to 0. The classification errors observed in each trial on the training
set and the testing set were not significantly affected by the number of radial basis
functions. The classification errors on the training set were very close in different trials.
The same is true for the classification errors recorded in the three trials on the testing
set. Thus, the performance of the trained RBF networks was only slightly affected by
the initialization of the training process. Compared with the reformulated RBF net-
works generated by linear generator functions, the reformulated RBF networks tested
in these experiments produced a higher percentage of classification errors on the train-
ing set and a slightly lower percentage of classification errors on the testing set. In fact,
the RBF networks generated by linear generator functions achieved more balanced
performance on the training and testing sets.
Figures 4, 5, and 6 show the partition of the feature space produced by reformu-
lated RBF networks trained using gradient descent with 30, 40, and 50 radial basis
functions, respectively. The function g(·) was of the form g(x) = (g0(x))^{1/(1−m)}, with
g0(x) = x and m = 3. The networks with 30 and 40 radial basis functions produced
different partitions of the feature space, which indicates that there exist significant
differences in the representations of the feature space by 30 and 40 prototypes. It is
clear from Figures 4a and 5a that both networks attempt to find the best possible
compromise in regions of the input space with extensive overlapping between the
classes. As a result, some of the regions formed differ in terms of both shape and
size. Nevertheless, the partitions produced by both networks for the training set offer
a reasonable compromise for the testing set as indicated by Figures 4b and 5b. Figure
6a shows that the RBF network trained with 50 radial basis functions follows all the
peculiarities of the training set. This is clear from the anomalous surfaces produced by
the network in its attempt to classify correctly feature vectors from the training set that
can be considered outliers with respect to their classes. Moreover, Figure 6b indicates
that the partition of the feature space produced by the network for the training set is
not appropriate for the testing set. This is a clear indication that increasing the number
of radial basis functions above a certain threshold compromises the ability of trained
RBF neural networks to generalize.
9. CONCLUSIONS
The success of a neural network model depends rather strongly on its association with
an attractive learning algorithm. For example, the popularity of conventional feed-
forward neural networks was to a large extent due to the error back-propagation
algorithm. On the other hand, the effectiveness of RBF neural models for function
approximation was hampered by the lack of an effective and reliable learning algorithm
for such models. Despite its disadvantages, gradient descent seems to be the only
method leading to appealing learning algorithms.
According to the axiomatic approach presented in this chapter for reformulating
RBF neural networks, the development of admissible RBF models reduces to the
selection of admissible generator functions that determine the form and properties of
the radial basis functions. The reformulated RBF networks generated by linear and
exponential generator functions can be trained by gradient descent and perform con-
siderably better than conventional RBF networks. Moreover, their training by gradient
descent is not necessarily slow. Consider, for example, reformulated RBF networks
generated by linear generator functions of the form g0(x) = x + γ². Values of γ close
to 0 can speed up the convergence of gradient descent learning and still produce trained
RBF networks performing better than conventional RBF networks. For values of γ
approaching 1, the convergence of gradient descent learning slows down but the per-
formance of the trained RBF networks improves. The convergence of gradient descent
learning is also affected by the value of the parameter m relative to 1. Values of m close
to 1 tend to create sharp radial basis functions. As the value of m increases, the radial
basis functions become wider and more responsive to neighboring input vectors.
However, increasing the value of m reduces the sensitivity of the radial basis functions
to changes of their respective prototypes, and this slows down the convergence of
gradient descent learning. Nevertheless, the selection of the parameter m is not crucial
in practice. Reformulated RBF networks can be trained in practical situations by fixing
the parameter m to a value between 2 and 4.
The experimental evaluation of reformulated RBF networks presented in this
chapter showed that the association of RBF networks with erratic behavior and poor
performance is unfair to this powerful neural architecture. The experimental results also
indicated that the disadvantages often associated with RBF neural networks can only
be attributed to the learning schemes used for their training and not to the models
themselves. If the learning scheme used to train RBF neural networks decouples the
determination of the prototypes and the updates of the output weights, then the pro-
totypes are simply determined to satisfy the optimization criterion behind the unsuper-
vised algorithm employed. Nevertheless, the satisfaction of this criterion does not
necessarily guarantee that the partition of the input space by the prototypes facilitates
the implementation of the desired input-output mapping. The simple reason for this is
that the training set does not participate in the formation of the prototypes. In contrast,
the update of the prototypes during the learning process produces a partition of the
input space that is specifically designed to facilitate the input-output mapping. In effect,
this partition leads to trained reformulated RBF neural networks that are strong competitors
to other popular neural models, including feed-forward neural networks with
sigmoidal hidden units.
There is experimental evidence that the performance of reformulated RBF neural
networks improves when their supervised training by gradient descent is initialized by
using an effective unsupervised procedure to determine the initial set of prototypes from
the input vectors included in the training set. An alternative to employing the c-means
algorithm for determining the initial set of prototypes would be the use of unsupervised
algorithms that are not significantly affected by their initialization. The search for such
codebook design techniques led to soft clustering [16,24-27] and soft learning vector
quantization algorithms [9,14,26-29]. Unlike crisp clustering and vector quantization
techniques, these algorithms form the prototypes on the basis of soft instead of crisp
decisions. As a result, this strategy reduces significantly the effect of the initial set of
prototypes on the partition of the input vectors produced by such algorithms. The use
of soft clustering and LVQ algorithms for initializing the training of reformulated RBF
neural networks is a particularly promising approach currently under investigation.
Such an initialization approach is strongly supported by recent developments in unsu-
pervised competitive learning, which indicated that the same generator functions used
for constructing reformulated RBF neural networks can also be used to generate soft
LVQ and clustering algorithms [14,24,27].
The generator function can be seen as the concept that establishes a direct relation-
ship between reformulated RBF models and soft LVQ algorithms [14]. This relation-
ship makes reformulated RBF models potential targets of the search for architectures
inherently capable of merging neural modeling with fuzzy-theoretic concepts, a problem
that attracted considerable attention recently [30]. In this context, a problem worth
investigating is the ability of reformulated RBF neural networks to detect the presence
of uncertainty in the training set and quantify the existing uncertainty by approximat-
ing any membership profile arbitrarily well from sample data.
REFERENCES
[1] D. S. Broomhead and D. Lowe, Multivariable functional interpolation and adaptive net-
works. Complex Syst. 2: 321-355, 1988.
[2] C. A. Micchelli, Interpolation of scattered data: Distance matrices and conditionally positive
definite functions. Construct. Approx. 2: 11-22, 1986.
[3] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to
multilayer networks. Science 247: 978-982, 1990.
[4] J. Park and I. W. Sandberg, Universal approximation using radial-basis-function networks.
Neural Comput. 3: 246-257, 1991.
[5] J. Park and I. W. Sandberg, Approximation and radial-basis-function networks. Neural
Comput. 5: 305-316, 1993.
[6] G. Cybenko, Approximation by superpositions of a sigmoidal function. Math. Control
Signals Syst. 2: 303-314, 1989.
[7] S. Chen, C. F. N. Cowan, and P. M. Grant, Orthogonal least squares learning algorithm for
radial basis function networks. IEEE Trans. Neural Networks 2(2): 302-309, 1991.
[8] S. Chen, G. J. Gibson, C. F. N. Cowan, and P. M. Grant, Reconstruction of binary signals
using an adaptive radial-basis-function equalizer. Signal Process. 22: 77-93, 1991.
[9] N. B. Karayiannis and W. Mi, Growing radial basis neural networks: Merging supervised
and unsupervised learning with network growth techniques. IEEE Trans. Neural Networks
8(6): 1492-1506, 1997.
[10] J. E. Moody and C. J. Darken, Fast learning in networks of locally-tuned processing units.
Neural Comput. 1: 281-294, 1989.
[11] I. Cha and S. A. Kassam, Interference cancellation using radial basis function networks.
Signal Process. 47: 247-268, 1995.
[12] N. B. Karayiannis, Gradient descent learning of radial basis neural networks. Proceedings of
1997 IEEE International Conference on Neural Networks, pp. 1815-1820, Houston, June 9-
12, 1997.
[13] N. B. Karayiannis, Learning algorithms for reformulated radial basis neural networks.
Proceedings of 1998 IEEE International Joint Conference on Neural Networks, pp. 2230-
2235, Anchorage, AK, May 4-9, 1998.
[14] N. B. Karayiannis, Reformulating learning vector quantization and radial basis neural net-
works. Fundam. Inform. 37: 137-175, 1999.
[15] N. B. Karayiannis, Reformulated radial basis neural networks trained by gradient descent.
IEEE Trans. Neural Networks, 10: 657-671, 1999.
[16] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York:
Plenum, 1981.
[17] T. Kohonen, Self-Organization and Associative Memory, 3rd ed. Berlin: Springer-Verlag,
1989.
[18] T. Kohonen, The self-organizing map. Proc. IEEE 78(9): 1464-1480, 1990.
[19] B. A. Whitehead and T. D. Choate, Evolving space-filling curves to distribute radial basis
functions over an input space. IEEE Trans. Neural Networks 5(1): 15-23, 1994.
[20] A. Roy, S. Govil, and R. Miranda, A neural-network learning theory and a polynomial time
RBF algorithm. IEEE Trans. Neural Networks 8(6): 1301-1313, 1997.
[21] N. B. Karayiannis and A. N. Venetsanopoulos, Artificial Neural Networks: Learning
Algorithms, Performance Evaluation and Applications. Boston: Kluwer Academic, 1993.
[22] R. P. Lippmann, Pattern classification using neural networks. IEEE Commun. Mag. 27: 47-
54, 1989.
[23] K. Ng and R. P. Lippmann, Practical characteristics of neural network and conventional
pattern classifiers. In Advances in Neural Information Processing Systems 3, R. P. Lippmann
et al., eds., pp. 970-976. San Mateo, CA: Morgan Kaufmann, 1991.
[24] N. B. Karayiannis, Fuzzy and possibilistic clustering algorithms based on generalized refor-
mulation. Proceedings of the Fifth IEEE International Conference on Fuzzy Systems, pp.
1393-1399, New Orleans, September 8-11, 1996.
[25] N. B. Karayiannis, Fuzzy partition entropies and entropy constrained clustering algorithms.
J. Intell. Fuzzy Syst. 5(2): 103-111, 1997.
[26] N. B. Karayiannis, Ordered weighted learning vector quantization and clustering algo-
rithms. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1388-1393,
Anchorage, AK, May 4-9, 1998.
[27] N. B. Karayiannis, Soft learning vector quantization and clustering algorithms based on
reformulation. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1441-
1446, Anchorage, AK, May 4-9, 1998.
[28] N. B. Karayiannis and J. C. Bezdek, An integrated approach to fuzzy learning
vector quantization and fuzzy c-means clustering. IEEE Trans. Fuzzy Syst. 5(4): 622-628,
1997.
[29] E. C.-K. Tsao, J. C. Bezdek, and N. R. Pal, Fuzzy Kohonen clustering networks. Pattern
Recogn. 27(5): 757-764, 1994.
[30] G. Purushothaman and N. B. Karayiannis, Quantum neural networks (QNNs): Inherently
fuzzy feedforward neural networks. IEEE Trans. Neural Networks 8(3): 679-693, 1997.
[31] N. B. Karayiannis, Entropy constrained learning vector quantization algorithms and their
application in image compression, SPIE Proceedings, Vol. 3030: Applications of Artificial
Neural Networks in Image Processing II, pp. 2-13, San Jose, CA, 1997.
[32] N. B. Karayiannis, An axiomatic approach to soft learning vector quantization and cluster-
ing. IEEE Trans. Neural Networks, 10: 1153-1165, 1999.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.
Nicolaos B. Karayiannis
1. INTRODUCTION
Consider the set X ⊂ Rⁿ that is formed by M feature vectors from an n-dimensional
Euclidean space, that is, X = {x_1, x_2, ..., x_M}, x_i ∈ Rⁿ, 1 ≤ i ≤ M. Clustering is the
process of partitioning the M feature vectors into c < M clusters, which are represented
by the prototypes v_j ∈ V, 1 ≤ j ≤ c. Vector quantization can be seen as a mapping from
an n-dimensional Euclidean space X into the finite set V = {v_1, v_2, ..., v_c} ⊂ Rⁿ, also
referred to as the codebook.
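This mapping sends each feature vector to its nearest codeword; a minimal sketch with illustrative names:

```python
import numpy as np

def quantize(X, V):
    """Map each feature vector to the index of its nearest prototype
    (codeword). X: (M, n) feature vectors; V: (c, n) codebook."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Hypothetical 2D example with a two-vector codebook.
V = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[1.0, -1.0], [9.0, 11.0], [0.5, 0.5]])
print(quantize(X, V))  # [0 1 0]
```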
Codebook design can be performed by clustering algorithms, which are typically
developed by solving a constrained minimization problem using alternating optimiza-
tion. These clustering techniques include the crisp c-means [1], fuzzy c-means [1], and
generalized fuzzy c-means [2,3]. Alternative approaches to clustering resulted in the
development of entropy-constrained clustering and vector quantization algorithms
that are directly or indirectly related to Gibbs distribution and annealing. Rose et al.
[4] proposed a deterministic annealing algorithm for clustering using concepts from
probability theory and statistical mechanics. The clustering algorithms produced by
entropy-constrained minimization are not inherently invariant under uniform scaling
of the feature vectors. Karayiannis [5,6] proposed a new approach to fuzzy clustering,
which provided the basis for the development of entropy-constrained fuzzy clustering
(ECFC) algorithms. The formulation of the clustering problem considered in this
approach allows the development of fuzzy clustering algorithms invariant under uni-
form scaling of the feature vectors.
Recent developments in neural network architectures resulted in learning vector
quantization (LVQ) algorithms [7-21]. Learning vector quantization is the name used
for unsupervised learning algorithms associated with the competitive neural network
shown in Figure 1. The network consists of an input layer and an output layer. Each
node in the input layer is connected directly to the cells in the output layer. A prototype
vector is associated with each cell in the output layer as shown in Figure 1. Batch fuzzy
learning vector quantization (FLVQ) algorithms were introduced by Tsao et al. [20], and
their connection to probabilistic vector quantization models proposed in [22] was stu-
died in [23]. The update equations for FLVQ involve the membership functions of the
fuzzy c-means (FCM) algorithm, which are used to determine the strength of attraction
Section 2 Clustering Algorithms 159
Figure 1 A competitive neural network; each input vector x_k = [x_1k x_2k ... x_nk]^T is presented to the input layer.
between each prototype and the input vectors. Tsao et al. [20] justified their update
equation by pointing out its close relationship to the fuzzy and crisp c-means algorithms.
Karayiannis and Bezdek [15] developed a broad family of batch LVQ algorithms by
minimizing the average of the generalized means of the Euclidean distances between
each of the feature vectors and the prototypes. The minimization problem considered in
this derivation is actually a reformulation of the problem of determining fuzzy c-parti-
tions that was solved by the FCM algorithm [24]. Under certain conditions, the resulting
batch LVQ scheme can be implemented as the FCM or FLVQ algorithms [15].
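The FCM membership functions referred to above have the closed form u_ij = (Σ_l (d_ij²/d_il²)^{1/(m−1)})⁻¹, where d_ij = ||x_i − v_j||. A sketch with illustrative names follows; the small eps guarding coincident points is an implementation convenience, not part of the algorithm.

```python
import numpy as np

def fcm_memberships(X, V, m=2.0, eps=1e-12):
    """FCM membership functions: u_ij is large when x_i is close to v_j
    relative to the other prototypes; each row sums to 1 (a fuzzy partition)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)
```

These memberships are what determine the strength of attraction between prototypes and inputs in the batch FLVQ updates discussed above.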
This chapter begins with the reformulation of FCM and ECFC clustering algo-
rithms. These results provide the basis for the development of a unified theory for soft
LVQ and clustering. This theory provides an alternative to developing LVQ and clus-
tering algorithms, a task that is almost exclusively based on alternating optimization. In
fact, this theory allows the development of LVQ and clustering algorithms by minimiz-
ing a reformulation function using gradient descent. Further investigation indicates that
the development of LVQ and clustering algorithms reduces to the selection of an
admissible generator function. Existing fuzzy LVQ and clustering algorithms are inter-
preted as the products of linear and exponential generator functions. This chapter also
presents a family of soft LVQ and clustering algorithms produced by nonlinear gen-
erator functions. These algorithms are used to perform segmentation of magnetic reso-
nance images of the brain.
2. CLUSTERING ALGORITHMS
Clustering algorithms can be classified as crisp or fuzzy according to the scheme they
employ for partitioning the feature vectors into clusters. This section introduces crisp
Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms
and fuzzy partitions and also describes the crisp c-means, fuzzy c-means, and entropy-
constrained fuzzy clustering algorithms.
2.1. Crisp and Fuzzy Partitions
A partition π(X) of the set X is a collection of subsets of X that are pairwise
disjoint, all nonempty, and whose union yields the original set. Let X be the finite set
X = {x_1, x_2, ..., x_M}. A family of c ∈ [2, M) subsets A_j, 1 ≤ j ≤ c, of X is a crisp or
hard c-partition of X if

∪_{j=1}^{c} A_j = X    (1)

A_i ∩ A_j = ∅,  1 ≤ i ≠ j ≤ c    (2)

∅ ⊂ A_j ⊂ X,  1 ≤ j ≤ c    (3)

Each subset A_j, 1 ≤ j ≤ c, is assigned a characteristic function or indicator function
u_j(x_i) = u_ij, defined as

u_ij = 1 if x_i ∈ A_j,  u_ij = 0 if x_i ∉ A_j    (4)
According to this definition, each u_ij can take the value of 0 or 1, that is,

u_ij ∈ {0, 1},  1 ≤ i ≤ M, 1 ≤ j ≤ c    (5)

For any x_i ∈ X, there exists a j = j* such that u_{ij*} = 1 and u_ij = 0, ∀ j ≠ j*. Thus,

Σ_{j=1}^{c} u_ij = 1,  1 ≤ i ≤ M    (6)

Since each subset A_j, 1 ≤ j ≤ c, of X is nonempty, Σ_{i=1}^{M} u_ij > 0, 1 ≤ j ≤ c. Since all A_j,
1 ≤ j ≤ c, are proper subsets of X, there exists no single subset of X containing all
elements of X. This implies that Σ_{i=1}^{M} u_ij < M, 1 ≤ j ≤ c. Thus,

0 < Σ_{i=1}^{M} u_ij < M,  1 ≤ j ≤ c    (7)

The set of all crisp c-partitions of X can therefore be written as

U_c = {U ∈ M_{Mc} : u_ij ∈ {0, 1}, ∀ i, j;  Σ_{j=1}^{c} u_ij = 1, ∀ i;  0 < Σ_{i=1}^{M} u_ij < M, ∀ j}    (8)

where M_{Mc} denotes the set of real M × c matrices.
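Conditions (6) and (7), together with binary memberships, fully characterize a crisp c-partition and translate directly into a validity check. The code below is an illustrative sketch, not from the chapter.

```python
import numpy as np

def is_crisp_c_partition(U):
    """True iff U has binary entries, each row sums to 1 (every feature
    vector belongs to exactly one cluster), and each column sum lies
    strictly between 0 and M (every cluster nonempty and proper)."""
    U = np.asarray(U)
    M = U.shape[0]
    cols = U.sum(axis=0)
    return bool(np.isin(U, (0, 1)).all()
                and (U.sum(axis=1) == 1).all()
                and ((cols > 0) & (cols < M)).all())
```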
Crisp c-partitions assign each element of X to a single subset, thus ignoring the
possibility that this element may also belong to other subsets. This is a clear disadvantage
of crisp c-partitions, which affects their ability to partition the elements of X in a
way consistent with human intuition and their physical properties. In fact, crisp
c-partitions fail to represent the uncertainty that is often encountered in practical
applications. The use of crisp c-partitions in formulating real-world problems restricts the
space of possible solutions searched by the resulting algorithmic tools.
The shortcomings of crisp c-partitions were illustrated by Bezdek [1], who considered
all valid 2-partitions of the set X = {peach, nectarine, plum}.
Table 1 shows all valid 2-partitions of the set X. According to the first two partitions,
the nectarine (a hybrid of peach and plum) is assigned to the same subset with either
peach or plum. However, both partitions fail to reveal the close relationship between
the nectarine and both the peach and the plum. The third partition creates a subset
containing only the nectarine, thus separating the only hybrid fruit in the set. Although
this separation seems to be reasonable, the partition fails to indicate the unquestionable
relationship between the peach, the plum, and their hybrid.
Fuzzy c-partitions were introduced in an attempt to overcome the shortcomings of
crisp c-partitions. Fuzzy c-partitions are constructed by partitioning the set X into
fuzzy subsets. In this case, the characteristic or indicator function (4) becomes a membership
function u_ij = u_j(x_i), which represents the degree to which x_i can be considered
a member of the subset A_j. According to the classical definition of fuzzy c-partitions,
the M × c matrix U = [u_ij] is a fuzzy c-partition if and only if its entries satisfy the
following conditions:

u_ij ∈ [0, 1], ∀ i, j   (10)

Σ_{j=1}^{c} u_ij = 1, 1 ≤ i ≤ M   (11)

0 < Σ_{i=1}^{M} u_ij < M, 1 ≤ j ≤ c   (12)

The set of all fuzzy c-partitions of X is thus

U_f = {U ∈ ℝ^{M×c} : u_ij ∈ [0, 1], ∀ i, j; Σ_{j=1}^{c} u_ij = 1, ∀ i; 0 < Σ_{i=1}^{M} u_ij < M, ∀ j}   (13)
Since the conditions (11) and (12) are also satisfied by valid crisp c-partitions, the only
difference between crisp and fuzzy c-partitions is that in the fuzzy case the {u_ij}
are allowed to take values from the interval [0, 1]. Since {0, 1} ⊂ [0, 1], the condition
(10) is also satisfied by crisp c-partitions, which require that u_ij ∈ {0, 1}, ∀ i, j. Thus,
Table 1  All Valid 2-Partitions of the Set X = {peach, nectarine, plum}

            Partition 1    Partition 2    Partition 3
Subset      A1    A2       A1    A2       A1    A2
peach        1     0        1     0        0     1
nectarine    1     0        0     1        1     0
plum         0     1        0     1        0     1
162 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms
the space of crisp c-partitions is a subspace of the space of fuzzy c-partitions. In fact, the
classical definition of fuzzy c-partitions was strongly influenced by the properties of
crisp c-partitions. This influence becomes even more obvious on noting the relative
freedom allowed by fuzzy set theory for selecting membership functions. The condition
(11) was relaxed in some recent approaches [2,3,8-13,16,25-27].
2.2. Crisp c-means Algorithm

Suppose the M × c matrix U = [u_ij] defines a valid crisp c-partition of the set
X = {x_1, x_2, ..., x_M}, that is, u_ij ∈ {0, 1}, ∀ i, j, and Σ_{j=1}^{c} u_ij = 1, ∀ i. Let V =
{v_1, v_2, ..., v_c} be the set of prototypes representing the feature vectors in X. Then
the discrepancy associated with the representation of each x_i by one of the prototypes
v_j, 1 ≤ j ≤ c, can be measured by

||x_i − v_j||²   (14)

The total discrepancy associated with the representation of all feature vectors x_i,
1 ≤ i ≤ M, by the prototypes v_j, 1 ≤ j ≤ c, is given by

J(U, V) = Σ_{i=1}^{M} Σ_{j=1}^{c} u_ij ||x_i − v_j||²   (15)
The c-means algorithm was developed using alternating optimization to solve the
minimization problem [1]

min{J(U, V) : (U, V) ∈ U_c × ℝ^{nc}}   (16)

Define the sets

I_i = {j : 1 ≤ j ≤ c : ||x_i − v_j||² = min_{1≤l≤c} ||x_i − v_l||²}   (17)

Ī_i = {1, 2, ..., c} − I_i   (18)

The cardinality |I_i| of the set I_i represents the number of prototypes whose distance
from x_i is equal to the minimum distance between x_i and all prototypes. The coupled
necessary conditions for solutions (U, V) ∈ U_c × ℝ^{nc} of (16) are [1]:

• If |I_i| > 1, then u_ij = 0, ∀ j ∈ Ī_i, and Σ_{j∈I_i} u_ij = 1, ∀ i.
• If |I_i| = 1, then

u_ij = 1 if ||x_i − v_j||² < ||x_i − v_l||², ∀ l ≠ j, and u_ij = 0 otherwise   (19)

and

v_j = (Σ_{i=1}^{M} u_ij x_i)/(Σ_{i=1}^{M} u_ij), 1 ≤ j ≤ c   (20)
The c-means algorithm begins with the selection of an initial set of prototypes,
which implies the partition of the feature vectors into c clusters. Each cluster is repre-
sented by a prototype, which is computed as the centroid of the feature vectors belong-
ing to that cluster. Each of the feature vectors is assigned to the cluster whose prototype
is its closest neighbor. The new prototypes are computed from the results of the new
partition, and this process is repeated until the changes of the prototypes from one
iteration to the next become negligible.
The c-means algorithm can be summarized as follows:
1. Select c and e; fix N; set v = 0.
2. Generate an initial set of prototypes: Vo = {vii0, V2JO, · · ·. vCio}·
3. Set v = v + 1 (increment iteration counter).
4. Calculate:
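The alternation between nearest-prototype assignment and centroid computation can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the data, seed, and function name are made up:

```python
import random

def c_means(X, c, n_iter=100, eps=1e-6, seed=0):
    """Crisp c-means: alternate the nearest-prototype assignment (19)
    and the centroid update (20) until the prototypes stop moving."""
    rng = random.Random(seed)
    V = [list(x) for x in rng.sample(X, c)]            # initial prototypes
    for _ in range(n_iter):
        clusters = [[] for _ in range(c)]
        for x in X:                                     # assignment step (19)
            j = min(range(c),
                    key=lambda l: sum((a - b) ** 2 for a, b in zip(x, V[l])))
            clusters[j].append(x)
        shift = 0.0
        for j, members in enumerate(clusters):          # centroid step (20)
            if members:
                new = [sum(col) / len(members) for col in zip(*members)]
                shift += sum((a - b) ** 2 for a, b in zip(new, V[j]))
                V[j] = new
        if shift < eps:                                 # prototype change below tolerance
            break
    return V

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print([[round(a, 2) for a in v] for v in sorted(c_means(X, 2))])
# [[0.05, 0.0], [5.05, 5.0]]
```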
Fuzzy clustering algorithms consider each cluster as a fuzzy set. As a result, each feature vector may be
assigned to multiple clusters with some degree of certainty measured by the membership
function.
2.3. Fuzzy c-means Algorithm
The fuzzy c-means (FCM) algorithm is one of the most powerful tools used to
perform clustering in practical applications. The FCM algorithm was developed using
alternating optimization to solve the minimization problem [1]

min{J_m(U, V) = Σ_{i=1}^{M} Σ_{j=1}^{c} (u_ij)^m ||x_i − v_j||² : (U, V) ∈ U_f × ℝ^{nc}}   (21)

where 1 < m < ∞, V = [v_1 v_2 ... v_c] ∈ ℝ^{nc}, u_ij = u_j(x_i) is the membership function that
assigns x_i to the jth cluster, and the M × c matrix U = [u_ij] is a fuzzy c-partition in the
set U_f defined in (13). Define the sets

I_i = {j : 1 ≤ j ≤ c : x_i = v_j}   (22)

Ī_i = {1, 2, ..., c} − I_i   (23)

The cardinality |I_i| of the set I_i represents the number of prototypes that coincide with
x_i. The coupled first-order necessary conditions for solutions (U, V) ∈ U_f × ℝ^{nc} of (21)
are [1]:

• If I_i ≠ ∅, then u_ij = 0, ∀ j ∈ Ī_i, and Σ_{j∈I_i} u_ij = 1, ∀ i.
• If I_i = ∅, then

u_ij = [Σ_{l=1}^{c} (||x_i − v_j||²/||x_i − v_l||²)^{1/(m−1)}]^{−1}   (24)

and

v_j = (Σ_{i=1}^{M} (u_ij)^m x_i)/(Σ_{i=1}^{M} (u_ij)^m), 1 ≤ j ≤ c   (25)
The "fuzziness" of the clustering produced by the FCM is controlled by the para-
meter m, which is greater than 1 [1]. As this parameter approaches 1, the partition of the
feature vector space is a nearly crisp decision-making process. Increasing this parameter
tends to degrade membership toward a maximally fuzzy partition [1].
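The coupled conditions (24) and (25) give one alternating-optimization step, which can be sketched as follows (names and data are illustrative; the sketch assumes no feature vector coincides with a prototype, i.e., I_i = ∅):

```python
def fcm_step(X, V, m):
    """One alternating-optimization step of the FCM algorithm:
    memberships per (24), then prototype update per (25)."""
    c, M = len(V), len(X)
    d = [[sum((a - b) ** 2 for a, b in zip(x, v)) for v in V] for x in X]
    U = [[1.0 / sum((d[i][j] / d[i][l]) ** (1.0 / (m - 1)) for l in range(c))
          for j in range(c)] for i in range(M)]
    V_new = [[sum(U[i][j] ** m * X[i][k] for i in range(M)) /
              sum(U[i][j] ** m for i in range(M))
              for k in range(len(X[0]))] for j in range(c)]
    return U, V_new

X = [(0.0,), (1.0,), (10.0,), (11.0,)]
U, V = fcm_step(X, [(0.5,), (10.5,)], m=2.0)
print(round(U[0][0], 3))  # membership of x_1 in the first cluster: 0.998
```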
The fuzzy c-means algorithm can be summarized as follows:
1. Select c, m, and ε; fix N; set ν = 0.
2. Generate an initial set of prototypes V_0 = {v_{1,0}, v_{2,0}, ..., v_{c,0}}.
3. Set ν = ν + 1.
4. Calculate:
   • u_{ij,ν} = [Σ_{l=1}^{c} (||x_i − v_{j,ν−1}||²/||x_i − v_{l,ν−1}||²)^{1/(m−1)}]^{−1}, 1 ≤ i ≤ M; 1 ≤ j ≤ c.
   • v_{j,ν} = (Σ_{i=1}^{M} (u_{ij,ν})^m x_i)/(Σ_{i=1}^{M} (u_{ij,ν})^m), 1 ≤ j ≤ c.
   • E_ν = Σ_{j=1}^{c} ||v_{j,ν} − v_{j,ν−1}||².
5. If ν < N and E_ν > ε, then go to step 3.

2.4. Entropy-Constrained Fuzzy Clustering

Entropy-constrained fuzzy clustering (ECFC) algorithms combine the average
distortion between the feature vectors and the prototypes with the partition entropy

H(U) = −(1/M) Σ_{i=1}^{M} Σ_{j=1}^{c} u_ij ln u_ij   (26)

The ECFC algorithms were developed by minimizing the objective function [5,6]

J_μ(U, V) = μ G(U) + (1 − μ) σ D(U, V)   (27)

where G(U) is the negative partition entropy

G(U) = −H(U) = (1/M) Σ_{i=1}^{M} Σ_{j=1}^{c} u_ij ln u_ij   (28)
and D(U, V) is the average distortion between the feature vectors x_i ∈ X and the prototypes
v_j ∈ V, defined as

D(U, V) = (1/M) Σ_{i=1}^{M} Σ_{j=1}^{c} u_ij ||x_i − v_j||²   (29)
J_μ(U, V) is defined in terms of the scaling parameter σ and the fuzzification parameter
μ ∈ (0, 1). The scaling parameter σ can be used to develop clustering algorithms invariant
under uniform scaling of the feature vectors x_i ∈ X. The fuzzification parameter μ
determines the relative effect of the entropy and the distortion terms on the objective
function J_μ(U, V). If μ → 1, the entropy term is dominant and minimization of
J_μ(U, V) implies maximization of the partition entropy H(U) = −G(U). As the value of
μ decreases, the effect of the entropy term decreases and the minimization of the
distortion between the feature vectors and the prototypes plays an increasingly dominant
role. If μ → 0, the effect of the entropy term in (27) is eliminated and the clustering
process is almost exclusively based on the minimization of the distortion D(U, V)
between the feature vectors and the prototypes.
The coupled necessary conditions for solutions (U, V) ∈ U_f × ℝ^{nc} of (27) are [5,6]

u_ij = exp(−σδ_μ ||x_i − v_j||²)/(Σ_{l=1}^{c} exp(−σδ_μ ||x_i − v_l||²)), 1 ≤ i ≤ M; 1 ≤ j ≤ c   (30)

and

v_j = (Σ_{i=1}^{M} u_ij x_i)/(Σ_{i=1}^{M} u_ij), 1 ≤ j ≤ c   (31)

where δ_μ = (1 − μ)/μ.
The fuzziness of the c-partition produced by this formulation depends on the value of
μ ∈ (0, 1) [6]. If μ ≈ 1, then δ_μ = (1 − μ)/μ approaches 0 and u_ij ≈ 1/c, ∀ i, j. Such a value
of μ produces a maximally fuzzy partition. As the value of μ decreases, the minimization
of the distortion between the feature vectors and the prototypes plays a more
significant role. In this case, (30) assigns membership values to the feature vectors
according to their relative distance from a certain prototype. As μ approaches 0, the
membership function (30) approaches the indicator function associated with the crisp c-means
algorithm, that is, u_ij → 1 if ||x_i − v_j||² < ||x_i − v_l||², ∀ l ≠ j, and u_ij → 0 otherwise.
The transition from a maximally fuzzy to a nearly crisp partition can also be seen
as an annealing process. The value of δ_μ = (1 − μ)/μ increases as the value of μ
decreases, while δ_μ = (1 − μ)/μ → ∞ as μ → 0. Clearly, δ_μ^{−1} = μ/(1 − μ) can be interpreted
as the system temperature, which decreases from a very high value to 0 during the
annealing process. In this context, the fuzziness of the partition relates to the number of
accessible states for the system.
The distortion component of the objective function (27) is a function of the feature
vectors and the range of its values is unknown. In contrast, the entropy component of
(27) is not a function of the feature vectors and the range of its values is known. The
scaling parameter σ in (27) can be used to balance the relative effect of the entropy and
the distortion components of the objective function. This is necessary for the develop-
ment of clustering algorithms that are invariant under uniform scaling of the feature
vectors. The scaling parameter σ can be computed at each iteration according to the
condition

σ D(U, V) = σ_0 H(U)   (32)

where σ_0 is a bias constant that determines the relative weight assigned to the partition
entropy and scaled average distortion components of the objective function (27). If
σ_0 > 1, the evaluation of the scaling parameter σ according to (32) favors the average
distortion over the partition entropy. Conversely, the evaluation of σ according to (32)
increases the role of the partition entropy in the clustering process if σ_0 < 1. According
to the scheme presented above, σ can be computed at each iteration in terms of the
current estimates of the prototypes and the membership values as

σ = −σ_0 (Σ_{i=1}^{M} Σ_{j=1}^{c} u_ij ln u_ij)/(Σ_{i=1}^{M} Σ_{j=1}^{c} u_ij ||x_i − v_j||²)   (33)
In the beginning of the clustering process, where the membership values are unknown,
it can be assumed that the partition is maximally fuzzy, that is, U = [1/c]. In this case,
the scaling parameter σ can be evaluated by requiring that

σ D̄ = σ_0 H_max   (34)

which is satisfied if

σ = σ_0 H_max/D̄   (35)

According to (35), σ = σ_0 ln c/D̄, where H_max = ln c is the maximum value of the
partition entropy and D̄ is the average of the squared Euclidean distances, defined as
D̄ = (1/Mc) Σ_{i=1}^{M} Σ_{j=1}^{c} ||x_i − v_j||².
The ECFC algorithm can be summarized as follows:
1. Select c, μ, σ_0, and ε; fix N; set ν = 0.
2. Generate an initial set of prototypes V_0 = {v_{1,0}, v_{2,0}, ..., v_{c,0}}.
3. Compute:
   • δ_μ = (1 − μ)/μ.
   • σ = σ_0 (Mc) ln c/(Σ_{i=1}^{M} Σ_{j=1}^{c} ||x_i − v_{j,0}||²).
4. Set ν = ν + 1.
5. Calculate:
   • u_{ij,ν} = exp(−σδ_μ ||x_i − v_{j,ν−1}||²)/(Σ_{l=1}^{c} exp(−σδ_μ ||x_i − v_{l,ν−1}||²)), 1 ≤ i ≤ M; 1 ≤ j ≤ c.
   • v_{j,ν} = (Σ_{i=1}^{M} u_{ij,ν} x_i)/(Σ_{i=1}^{M} u_{ij,ν}), 1 ≤ j ≤ c.
   • σ = −σ_0 (Σ_{i=1}^{M} Σ_{j=1}^{c} u_{ij,ν} ln u_{ij,ν})/(Σ_{i=1}^{M} Σ_{j=1}^{c} u_{ij,ν} ||x_i − v_{j,ν}||²).
   • E_ν = Σ_{j=1}^{c} ||v_{j,ν} − v_{j,ν−1}||².
6. If ν < N and E_ν > ε, then go to step 4.
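The ECFC iteration can be sketched as below. For simplicity this sketch computes σ once from (35) and keeps it fixed, whereas the algorithm above re-estimates σ at every iteration via (33); the function name and data are illustrative:

```python
import math

def ecfc(X, V, mu, sigma0=1.0, n_iter=20):
    """ECFC sketch: memberships per (30), prototypes per (31).
    sigma is computed once from (35) and kept fixed in this sketch."""
    c, M = len(V), len(X)
    delta = (1.0 - mu) / mu                                    # delta_mu
    d2 = lambda x, v: sum((a - b) ** 2 for a, b in zip(x, v))
    dbar = sum(d2(x, v) for x in X for v in V) / (M * c)       # mean squared distance
    sigma = sigma0 * math.log(c) / dbar                        # (35)
    V = [list(v) for v in V]
    for _ in range(n_iter):
        U = []
        for x in X:
            w = [math.exp(-sigma * delta * d2(x, v)) for v in V]
            s = sum(w)
            U.append([wi / s for wi in w])                     # (30)
        for j in range(c):
            s = sum(U[i][j] for i in range(M))
            V[j] = [sum(U[i][j] * X[i][k] for i in range(M)) / s
                    for k in range(len(X[0]))]                 # (31)
    return U, V

X = [(0.0,), (1.0,), (10.0,), (11.0,)]
U, V = ecfc(X, [(2.0,), (9.0,)], mu=0.1)
print([round(v[0], 1) for v in V])  # [0.5, 10.5]
```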
REFORMULATING FUZZY CLUSTERING

The crisp and fuzzy clustering algorithms presented in the previous section were all
developed using alternating optimization. According to this optimization strategy, the
formula for the optimal set of membership functions is obtained by assuming a fixed set
of prototypes. Conversely, the optimal prototypes are obtained by assuming a fixed set
of membership functions. Reformulation is a methodology that allows the development
of clustering algorithms using gradient descent to minimize a functional that is closely
related to that minimized by alternating optimization. The algorithms obtained by
reformulating fuzzy clustering can be implemented on the competitive neural network
shown in Figure 1. Thus, reformulation essentially establishes a link between clustering
and learning vector quantization.
Consider the functional minimized by the FCM algorithm,

J_m = Σ_{i=1}^{M} Σ_{j=1}^{c} (u_ij)^m ||x_i − v_j||²   (36)

Since the membership functions {u_ij} satisfy the condition Σ_{j=1}^{c} u_ij = 1, 1 ≤ i ≤ M, the
functional J_m minimized by the FCM algorithm can be written as [1,24]

J_m = Σ_{i=1}^{M} [Σ_{l=1}^{c} (||x_i − v_l||²)^{1/(1−m)}]^{1−m}   (37)

The functional (37) has a structured form and depends exclusively on the feature
vectors and the set of prototypes V = {v_1, v_2, ..., v_c}.
A recent approach to learning vector quantization revealed that (37) is a meaningful
measure of the discrepancy associated with the representation of the feature
vectors in X = {x_1, x_2, ..., x_M} by the prototypes in V [15]. According to this approach,
a broad family of batch LVQ algorithms can be derived by minimizing [15]

L_p = (1/M) Σ_{i=1}^{M} D_p(x_i, V)   (38)

where D_p(x_i, V) is the generalized mean (or unweighted p-norm) of ||x_i − v_j||², 1 ≤ j ≤ c,
defined as [29]

D_p(x_i, V) = [(1/c) Σ_{j=1}^{c} (||x_i − v_j||²)^p]^{1/p}   (39)
with p ∈ ℝ − {0}. The update equation for each prototype v_j was obtained using gradient
descent as [15]

Δv_j = η_j Σ_{i=1}^{M} α_ij (x_i − v_j)   (40)

where η_j is the learning rate and {α_ij} are the competition functions, defined as

α_ij = (1/c) [(1/c) Σ_{l=1}^{c} (||x_i − v_l||²/||x_i − v_j||²)^p]^{(1−p)/p}   (41)
The minimization problem that resulted in this family of batch LVQ algorithms is
actually a reformulation of the problem of determining fuzzy c-partitions that was
solved by FCM algorithms. For p = 1/(1 − m), (38) takes the form

L_p = (1/M) Σ_{i=1}^{M} [(1/c) Σ_{j=1}^{c} (||x_i − v_j||²)^{1/(1−m)}]^{1−m}   (42)

which indicates that (38) is an alternative functional expression for the reformulation
function (37) that was discussed by Hathaway and Bezdek [24]. Moreover, for
p = 1/(1 − m), the update equation (40) can be written as

Δv_j = η_j Σ_{i=1}^{M} (u_ij)^m (x_i − v_j)   (43)

where {u_ij} are the membership functions (24) of the FCM algorithm and the constant
factor relating {α_ij} to {(u_ij)^m} is absorbed into the learning rate η_j.
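The reduction of (40)-(41) to (43) can be checked numerically: for p = 1/(1 − m), the competition functions (41) are proportional to (u_ij)^m, with the constant (here c^{m−1} = 3 for c = 3, m = 2) absorbed into the learning rate. The squared distances below are made up:

```python
def fcm_membership(d, m):
    """FCM membership (24) for one feature vector, from squared distances d[j]."""
    c = len(d)
    return [1.0 / sum((d[j] / d[l]) ** (1.0 / (m - 1)) for l in range(c))
            for j in range(c)]

def alpha(d, m):
    """Competition functions (41) for p = 1/(1 - m)."""
    c, p = len(d), 1.0 / (1.0 - m)
    return [(1.0 / c) * ((1.0 / c) * sum((d[l] / d[j]) ** p for l in range(c)))
            ** ((1.0 - p) / p) for j in range(c)]

d, m = [0.5, 2.0, 8.0], 2.0      # made-up squared distances, c = 3
u, a = fcm_membership(d, m), alpha(d, m)
ratios = [a[j] / u[j] ** m for j in range(3)]
print(all(abs(r - 3.0) < 1e-9 for r in ratios))  # True: alpha_ij = c^(m-1) u_ij^m
```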
The reformulation function (37) can also be written as J_m = Mc^{1−m} R_m, where

R_m = (1/M) Σ_{i=1}^{M} f(S_i)   (44)

with f(x) = x^{1−m} and

S_i = (1/c) Σ_{j=1}^{c} g(||x_i − v_j||²), where g(x) = x^{1/(1−m)}   (45)
The ECFC algorithms can also be obtained through reformulation. For the
exponential membership functions (30), define

S_i = (1/c) Σ_{j=1}^{c} exp(−σδ_μ ||x_i − v_j||²)   (47)

If {u_ij} are given in (30), then combining (46) and (47) gives

J_μ(U, V) = −(μ/M) Σ_{i=1}^{M} ln(c S_i) = −μ ln c − (μ/M) Σ_{i=1}^{M} ln S_i   (50)

Since the term −μ ln c is independent of the prototypes, the reformulation function
(50) that corresponds to ECFC algorithms can be simplified as J_μ ≐ σ(1 − μ) R_μ, where

R_μ = (1/M) Σ_{i=1}^{M} f(S_i)   (51)

with f(x) = −ln x/(σδ_μ) and S_i defined in (47).
GENERALIZED REFORMULATION FUNCTION

The reformulation functions corresponding to the FCM and ECFC algorithms are
both of the generalized form

R = (1/M) Σ_{i=1}^{M} f(S_i)   (53)

where

S_i = (1/c) Σ_{j=1}^{c} g(||x_i − v_j||²)   (54)

The search for admissible reformulation functions of this form requires the determination
of the conditions that must be satisfied by the functions f(·) and g(·) involved in
their definition, which are assumed to be differentiable everywhere.
4.1. Update Equations
Minimization of admissible reformulation functions of the preceding form using
gradient descent can produce a variety of batch LVQ algorithms whose behavior and
properties depend on the functions f(·) and g(·). The gradient ∇_{v_j} R = ∂R/∂v_j of R with
respect to the prototype v_j can be determined as

∇_{v_j} R = (1/M) Σ_{i=1}^{M} f′(S_i) (∂S_i/∂v_j)
        = (1/M) Σ_{i=1}^{M} f′(S_i) (1/c) g′(||x_i − v_j||²)(−2(x_i − v_j))
        = −(2/Mc) Σ_{i=1}^{M} f′(S_i) g′(||x_i − v_j||²)(x_i − v_j)   (55)

The update equation for the prototypes can be obtained according to the gradient
descent method as

Δv_j = −η°_j ∇_{v_j} R = η_j Σ_{i=1}^{M} α_ij (x_i − v_j)   (56)

where η_j = (2/Mc) η°_j is the learning rate for the prototype v_j and the competition
functions {α_ij} are computed in terms of f(·) and g(·) as

α_ij = f′(S_i) g′(||x_i − v_j||²)   (57)
The LVQ algorithms derived earlier can be implemented iteratively. Let v_{j,ν−1},
1 ≤ j ≤ c, be the set of prototypes obtained after the (ν − 1)th iteration. According to
the update equation (56), a new set of prototypes v_{j,ν}, 1 ≤ j ≤ c, can be obtained
according to

v_{j,ν} = v_{j,ν−1} + η_{j,ν} Σ_{i=1}^{M} α_{ij,ν}(x_i − v_{j,ν−1}), 1 ≤ j ≤ c   (58)

where η_{j,ν} is the learning rate at iterate ν and α_{ij,ν} = f′(S_{i,ν−1}) g′(||x_i − v_{j,ν−1}||²), with
S_{i,ν−1} = (1/c) Σ_{l=1}^{c} g(||x_i − v_{l,ν−1}||²). Under certain conditions, the LVQ algorithms
described by the update equation (58) reduce to iterative clustering algorithms. The
update equation (58) can also be written as

v_{j,ν} = (1 − η_{j,ν} Σ_{i=1}^{M} α_{ij,ν}) v_{j,ν−1} + η_{j,ν} Σ_{i=1}^{M} α_{ij,ν} x_i, 1 ≤ j ≤ c   (59)

The prototypes can be computed as weighted means of the feature vectors if the
learning rates are selected at each iteration as

η_{j,ν} = (Σ_{i=1}^{M} α_{ij,ν})^{−1}, 1 ≤ j ≤ c   (60)

The closed-form formula that can be used to compute the prototypes at each iteration
can be obtained by substituting the learning rates (60) in the update equation (59) as

v_{j,ν} = (Σ_{i=1}^{M} α_{ij,ν} x_i)/(Σ_{i=1}^{M} α_{ij,ν}), 1 ≤ j ≤ c   (61)
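Equations (57)-(61) say: compute the competition functions from f′ and g′, then set each prototype to the α-weighted mean of the feature vectors. A sketch using the FCM pair (g_0(x) = x, m = 2) follows; the data and function name are illustrative:

```python
def batch_lvq_step(X, V, f_prime, g, g_prime):
    """One iteration of (58) with learning rates (60): each prototype becomes
    the alpha-weighted mean (61), where alpha_ij = f'(S_i) g'(||x_i - v_j||^2)
    are the competition functions (57)."""
    c, M = len(V), len(X)
    d2 = lambda x, v: sum((a - b) ** 2 for a, b in zip(x, v))
    A = []
    for x in X:
        d = [d2(x, v) for v in V]
        S = sum(g(dj) for dj in d) / c
        A.append([f_prime(S) * g_prime(dj) for dj in d])
    return [[sum(A[i][j] * X[i][k] for i in range(M)) /
             sum(A[i][j] for i in range(M))
             for k in range(len(X[0]))] for j in range(c)]

# FCM pair for m = 2: g(x) = (g0(x))^{1/(1-m)} with g0(x) = x, f(x) = x^{1-m}.
m = 2.0
g = lambda x: x ** (1.0 / (1.0 - m))
g_prime = lambda x: (1.0 / (1.0 - m)) * x ** (m / (1.0 - m))
f_prime = lambda x: (1.0 - m) * x ** (-m)

X = [(0.0,), (1.0,), (9.0,), (10.0,)]
V = [(2.0,), (8.0,)]
for _ in range(30):
    V = batch_lvq_step(X, V, f_prime, g, g_prime)
print([round(v[0], 2) for v in V])  # [0.5, 9.5]
```

With this choice of f and g the step reproduces the FCM prototype update, since the competition functions are proportional to (u_ij)^m and the constant cancels in the weighted mean (61).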
The competition functions {α_ij} are required to satisfy the following axioms [11,13]:

Axiom 1: If c = 1, then α_{i1} = 1, 1 ≤ i ≤ M.
Axiom 2: α_ij > 0, 1 ≤ i ≤ M; 1 ≤ j ≤ c.
Axiom 3: If ||x_i − v_p||² > ||x_i − v_q||² > 0, then α_ip < α_iq, 1 ≤ p, q ≤ c, and p ≠ q.

Axiom 1 indicates that there is actually no competition in the trivial case where all
feature vectors x_i ∈ X are represented by a single prototype. Thus, the single prototype
is equally attracted by all feature vectors x_i ∈ X. Axiom 2 implies that all feature
vectors x_i ∈ X compete to attract all prototypes v_j, 1 ≤ j ≤ c. Axiom 3 implies that a
prototype v_q that is closer in the Euclidean distance sense to the feature vector x_i than
another prototype v_p is attracted more strongly by this feature vector.
Axioms 1-3 lead to the admissibility conditions for reformulation functions summarized
by the following theorem [11,13]:
Theorem 1: Consider the finite set of feature vectors X = {x_1, x_2, ..., x_M}, which
are represented by the set of c < M prototypes V = {v_1, v_2, ..., v_c}. Then the function
R defined by (53) and (54) is an admissible reformulation function of the first
(second) kind if f(x) and g(x) are differentiable everywhere functions of x ∈ (0, ∞)
satisfying f(g(x)) = x, f(x) and g(x) are both monotonically decreasing (increasing)
functions of x ∈ (0, ∞), and g′(x) is a monotonically increasing (decreasing) function
of x ∈ (0, ∞).
For the pair f(x) = −(σδ_μ)^{−1} ln x and g(x) = exp(−σδ_μ x) corresponding to the ECFC
algorithms, f(x) and g(x) are both monotonically decreasing functions of x ∈ (0, ∞) if
δ_μ = (1 − μ)/μ > 0 or, equivalently, if μ ∈ (0, 1). For μ ∈ (0, 1), g′(x) = −σδ_μ
exp(−σδ_μ x) is a monotonically increasing function of x ∈ (0, ∞) and R is a reformulation
function of the first kind. If μ ∈ (−∞, 0) ∪ (1, ∞), then δ_μ = (1 − μ)/μ < 0 and
both f(x) and g(x) are monotonically increasing functions of x ∈ (0, ∞). However, the
functions f(x) and g(x) fail to form admissible reformulation functions of the second
kind for μ ∈ (−∞, 0) ∪ (1, ∞) because in this case g′(x) is an increasing function of
x ∈ (0, ∞) and, thus, violates one of the admissibility conditions of Theorem 1.
CONSTRUCTING REFORMULATION
FUNCTIONS: GENERATOR FUNCTIONS
The function g(·) that results in the FCM algorithm has the form g(x) = (g_0(x))^{1/(1−m)},
where g_0(x) = x and m ∈ (−∞, 0) ∪ (1, ∞). If m ∈ (1, ∞), the partition produced by the
algorithm approaches asymptotically a crisp c-partition as m → 1+. The partition
becomes increasingly fuzzy as m increases and approaches a maximally fuzzy partition
as m → ∞. If m ∈ (−∞, 0), the resulting partition becomes increasingly fuzzy as m
increases from −∞ and approaches 0 from the left. The function g(·) that results in
the ECFC algorithm can also be written as

g(x) = exp(−σδ_μ x) = (exp(σx))^{(μ−1)/μ}   (62)

For μ = (m − 1)/m, (62) takes the form g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = exp(σx). If
m → 1+ (μ → 0), then δ_μ → ∞ and the partition produced by the resulting algorithm
approaches asymptotically a crisp c-partition. If m → ∞ (μ → 1), then δ_μ → 0 and the
resulting algorithm produces a maximally fuzzy partition.
The FCM and ECFC algorithms are generated by a function g(·) of the form

g(x) = (g_0(x))^{1/(1−m)}   (63)

with g_0(x) = x and g_0(x) = exp(σx), respectively. This is an indication that a broad
variety of reformulation functions can be constructed using functions g(·) of the form
(63), where g_0(x) is called the generator function. For a given generator function g_0(·),
the corresponding function f(·) can be determined from g(x) = (g_0(x))^{1/(1−m)} in such a
way that f(g(x)) = x, while the competition functions {α_ij} can be obtained using (57).
The following theorem summarizes the conditions that must be satisfied by admissible
generator functions [11,13].
Theorem 2: Consider the function R defined by (53) and (54). If g(·) is defined in
terms of the generator function g_0(·) that is continuous on (0, ∞) as
g(x) = (g_0(x))^{1/(1−m)}, m ≠ 1, then f(x) = f_0(x^{1−m}), with f_0(g_0(x)) = x. The generator
function g_0(x) leads to an admissible reformulation function R if:

1. g_0(x) > 0, ∀ x ∈ (0, ∞),
2. g_0(x) is a monotonically increasing function of x ∈ (0, ∞), that is, g_0′(x) > 0,
∀ x ∈ (0, ∞), and
3. r_0(x) = (m/(m − 1))(g_0′(x))² − g_0(x) g_0″(x) > 0, ∀ x ∈ (0, ∞),

or

1. g_0(x) > 0, ∀ x ∈ (0, ∞),
2. g_0(x) is a monotonically decreasing function of x ∈ (0, ∞), that is, g_0′(x) < 0,
∀ x ∈ (0, ∞), and
3. r_0(x) = (m/(m − 1))(g_0′(x))² − g_0(x) g_0″(x) < 0, ∀ x ∈ (0, ∞).

If g_0(x) is an increasing generator function and m > 1 (m < 1), then R is a reformulation
function of the first (second) kind. If g_0(x) is a decreasing generator function and
m > 1 (m < 1), then R is a reformulation function of the second (first) kind.
The generator function g_0(x) = x > 0, ∀ x ∈ (0, ∞), that results in FCM algorithms
is a monotonically increasing function (g_0′(x) = 1). Since g_0″(x) = 0, the third condition
of Theorem 2 requires that m/(m − 1) > 0, which is valid for m > 1 or m < 0. If
m > 1 (m < 0), then R is a reformulation function of the first (second) kind.
Consider also the generator function g_0(x) = exp(σx) > 0, ∀ x ∈ (0, ∞), that results
in ECFC algorithms. If σ > 0, then g_0′(x) = σ exp(σx) > 0, ∀ x ∈ (0, ∞), and g_0(x) =
exp(σx) is an increasing generator function. Since g_0″(x) = σ² exp(σx), the third condition
of Theorem 2 holds if m/(m − 1) > 1 or, equivalently, if m > 1. This implies that R
is an admissible reformulation function of the first kind. If σ < 0, then g_0(x) = exp(σx) is a decreasing
generator function. Since it is required by the third condition of Theorem 2 that m < 1,
the decreasing generator function g_0(x) = exp(σx), σ < 0, also produces reformulation
functions of the first kind.
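Condition 3 of Theorem 2 can be spot-checked numerically for the two generator functions just discussed (a sanity check on a grid, not a proof):

```python
import math

def r0(g0, g0p, g0pp, m, xs):
    """Condition 3 of Theorem 2: r0(x) = (m/(m-1)) g0'(x)^2 - g0(x) g0''(x)."""
    return [(m / (m - 1)) * g0p(x) ** 2 - g0(x) * g0pp(x) for x in xs]

xs = [0.1 * k for k in range(1, 50)]

# Linear generator g0(x) = x (FCM): g0'' = 0, so r0 > 0 whenever m/(m-1) > 0.
assert all(r > 0 for r in r0(lambda x: x, lambda x: 1.0, lambda x: 0.0, 2.0, xs))

# Increasing exponential generator g0(x) = exp(0.5 x) with m = 1.5:
# r0 = (m/(m-1) - 1) * 0.25 * exp(x) > 0 for m > 1.
assert all(r > 0 for r in r0(lambda x: math.exp(0.5 * x),
                             lambda x: 0.5 * math.exp(0.5 * x),
                             lambda x: 0.25 * math.exp(0.5 * x), 1.5, xs))
print("both generator functions satisfy condition 3 on the test grid")
```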
Theorem 2 indicates that the construction of admissible LVQ models reduces to the
search for admissible generator functions. A variety of admissible generator functions
can be determined by a constructive approach that begins from the admissibility con-
ditions of Theorem 2 and determines the form of admissible generator functions by
solving a differential equation.
CONSTRUCTING ADMISSIBLE GENERATOR FUNCTIONS

The construction of admissible generator functions can be attempted by letting
g_0′(x) be a function of g_0(x), that is,

g_0′(x) = p(g_0(x))   (64)

where 1/p(x) is an integrable function. Theorem 2 requires that g_0′(x) > 0, ∀ x ∈ (0, ∞),
for m > 1 and g_0′(x) < 0, ∀ x ∈ (0, ∞), for m < 1. Since it is also required by Theorem 2
that g_0(x) > 0, ∀ x ∈ (0, ∞), the function p(·) must be selected so that p(x) > 0,
∀ x ∈ (0, ∞), if m > 1 and p(x) < 0, ∀ x ∈ (0, ∞), if m < 1. For such functions, the
admissibility conditions of Theorem 2 are satisfied by all solutions g_0(x) > 0,
∀ x ∈ (0, ∞), of the differential equation (64) that satisfy the conditions r_0(x) > 0,
∀ x ∈ (0, ∞), for m > 1 and r_0(x) < 0, ∀ x ∈ (0, ∞), for m < 1.
The function g0(·) can be obtained in this case by solving the differential equation
If m > 1, it is required that r0(x) > 0, Vx e (0, oo), which holds for all m/(m - 1) > n.
For m > 1, m/(m — 1) > 1 and the inequality m/(m — 1) > n holds for all « < 1. For
n = 1, m/(m — 1) - n = l/(m — 1) > 0. Thus, the condition r0(x) > 0, Vx € (0, oo), is
satisfied for all n < 1.
Assume that m > 1 and consider the differential equation (66) for n ≤ 1. For n = 1,
(65) corresponds to p(x) = kx. In this case, the solutions of (66) are

g_0(x) = c exp(kx)   (68)

where c > 0. The remainder of this chapter investigates generator
functions of the form g_0(x) = exp(σx), σ = k > 0, which result from (68) by setting c = 1.
For n < 1, the admissible solutions of (66) are of the form

g_0(x) = (ax + b)^q   (69)

where q = 1/(1 − n) > 0, a = k(1 − n) > 0, and b > 0. For n = 0, p(x) = k and (69) leads
to linear generator functions

g_0(x) = ax + b   (70)

For a = 1 and b = 0, (70) gives the generator function g_0(x) = x, which leads to FCM
and FLVQ algorithms.
6.2. Decreasing Generator Functions

Assume that m < 1 and let

p(x) = −k x^n, k > 0   (71)

This function p(·) satisfies the condition p(x) < 0, ∀ x ∈ (0, ∞). For this function, g_0(·) is
a solution of the differential equation

g_0′(x) = −k (g_0(x))^n   (72)

For m < 1, Theorem 2 requires that r_0(x) < 0, ∀ x ∈ (0, ∞), which holds for all
m/(m − 1) < n. If m < 1, then m/(m − 1) < 1 and the inequality m/(m − 1) < n holds
for all n ≥ 1. For n = 1, m/(m − 1) − n = 1/(m − 1) < 0. Thus, the condition r_0(x) < 0,
∀ x ∈ (0, ∞), is satisfied for all n ≥ 1.

Assume that m < 1 and consider the solutions of the differential equation (72) for
n ≥ 1. For n = 1, p(x) = −kx and the solutions of (72) are

g_0(x) = c exp(−kx)   (74)

where c > 0. For c = 1, (74) leads to decreasing exponential generator functions
of the form g_0(x) = exp(−σx), σ = k > 0. For n > 1, the admissible solutions of (72) are of
the form

g_0(x) = (ax + b)^q   (75)

where q = 1/(1 − n) < 0, a = k(n − 1) > 0, and b > 0. For n = 2, p(x) = −kx² and (75)
leads to the generator functions

g_0(x) = (ax + b)^{−1}   (76)
An increasing generator function g_0(x) and the corresponding decreasing generator
function 1/g_0(x) are related through their defining differential equations. Suppose the
increasing generator function g_0(x) is the solution of

g_0′(x) = p_i(g_0(x))   (77)

The differential equation that has 1/g_0(x) as a solution can be obtained by substituting
g_0(x) by 1/g_0(x) in (77) as

(1/g_0(x))′ = p_i(1/g_0(x))   (78)

or, equivalently,

g_0′(x) = −(g_0(x))² p_i(1/g_0(x))   (79)

Thus, if an increasing generator function g_0(x) is obtained as the solution of

g_0′(x) = p_i(g_0(x))   (80)

then the corresponding decreasing generator function 1/g_0(x) can be obtained as the
solution of

g_0′(x) = p_d(g_0(x))   (81)

where p_d(·) is given in terms of p_i(·) as

p_d(x) = −x² p_i(1/x)   (82)
This section presents the derivation and examines the properties of LVQ and clustering
algorithms produced by admissible generator functions.
7.1. Competition and Membership Functions
Given an admissible generator function g_0(·), the corresponding LVQ and clustering
algorithms can be obtained by gradient descent minimization of the reformulation
function defined by (53) and (54) with g(x) = (g_0(x))^{1/(1−m)}, m ≠ 1. If g(x) =
(g_0(x))^{1/(1−m)}, m ≠ 1, then the competition functions (57) take the form

α_ij = θ_ij (γ_ij)^{m/(m−1)}   (85)

where

θ_ij = g_0′(||x_i − v_j||²) f_0′((S_i)^{1−m})   (86)

and

γ_ij = [(1/c) Σ_{l=1}^{c} (g_0(||x_i − v_l||²)/g_0(||x_i − v_j||²))^{1/(1−m)}]^{1−m}   (87)

Since (α_ij/θ_ij)^{1/m} = (γ_ij)^{1/(m−1)}, it can easily be verified that {α_ij} and {θ_ij} satisfy the
condition

(1/c) Σ_{j=1}^{c} (α_ij/θ_ij)^{1/m} = 1, 1 ≤ i ≤ M   (88)
This condition can be used to determine the constraints imposed by the generator
function on the resulting c-partition by relating the competition functions {α_ij} to
the corresponding membership functions {u_ij}. Fuzzy LVQ and clustering algorithms
can be obtained as special cases of the proposed formulation if the corresponding
generator functions produce fuzzy c-partitions. A generator function produces fuzzy
c-partitions if the membership functions {u_ij} determined in terms of the corresponding
competition functions {α_ij} and {θ_ij} satisfy the condition

Σ_{j=1}^{c} u_ij = 1, 1 ≤ i ≤ M   (89)

If the condition (89) is not satisfied by the membership functions {u_ij} formed in terms
of {α_ij} and {θ_ij}, then the proposed formulation produces soft LVQ and clustering
algorithms.
The update equation (58) involving the competition and membership functions
obtained in this section can produce clustering or batch learning vector quantization
algorithms. The update equation (58) produces a clustering algorithm if m is fixed
during the learning process and the learning rates are computed at each iteration as
η_{j,ν} = (Σ_{i=1}^{M} α_{ij,ν})^{−1}. The update equation (58) produces a learning vector quantization
algorithm if η_{j,ν} = (Σ_{i=1}^{M} α_{ij,ν})^{−1} and the value of m is not fixed during learning. If m
decreases during learning, the update equation (58) produces descending learning vector
quantization algorithms [20]. In such a case, the algorithms produce increasingly
crisp c-partitions as the learning process progresses. In practice, m is often calculated at
iterate ν as

m_ν = m_i − (ν/N)(m_i − m_f)   (90)

where m_i and m_f are the initial and final values of m, respectively, and N is the total
number of iterations [20].
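A linearly decreasing schedule of this form can be sketched as below; the exact schedule used in [20] may differ in detail, so this is only a plausible realization:

```python
def m_schedule(m_init, m_final, N):
    """Linearly decrease m from m_init toward m_final over N iterations,
    returning the value used at iterates v = 1, ..., N (a sketch of (90))."""
    return [m_init - (v / N) * (m_init - m_final) for v in range(1, N + 1)]

ms = m_schedule(5.0, 1.5, 7)
print(round(ms[0], 2), round(ms[-1], 2))  # 4.5 1.5
```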
For the linear generator function g_0(x) = x, (87) gives

γ_ij = [(1/c) Σ_{l=1}^{c} (||x_i − v_l||²/||x_i − v_j||²)^{1/(1−m)}]^{1−m}   (91)

According to (86), θ_ij = 1, ∀ i, j, and α_ij = (γ_ij)^{m/(m−1)}. For g_0(x) = x, the competition
functions {α_ij} can be obtained using (91) as

α_ij = [(1/c) Σ_{l=1}^{c} (||x_i − v_j||²/||x_i − v_l||²)^{1/(m−1)}]^{−m}   (92)

This is the form of the competition functions of the batch LVQ algorithms developed
by minimizing (38) using gradient descent [15]. Since θ_ij = 1, ∀ i, j, the condition (88) becomes

(1/c) Σ_{j=1}^{c} (α_ij)^{1/m} = 1, 1 ≤ i ≤ M   (93)

The condition (93) implies that the linear generator function g_0(x) = x produces fuzzy
c-partitions, with the membership functions {u_ij} obtained from the competition functions
{α_ij} as u_ij = (α_ij)^{1/m}/c. Using (92),

u_ij = [Σ_{l=1}^{c} (||x_i − v_j||²/||x_i − v_l||²)^{1/(m−1)}]^{−1}, 1 ≤ i ≤ M; 1 ≤ j ≤ c   (94)
   • η_{j,ν} = (Σ_{i=1}^{M} α_{ij,ν})^{−1}, 1 ≤ j ≤ c.
   • v_{j,ν} = v_{j,ν−1} + η_{j,ν} Σ_{i=1}^{M} α_{ij,ν}(x_i − v_{j,ν−1}), 1 ≤ j ≤ c.
   • E_ν = Σ_{j=1}^{c} ||v_{j,ν} − v_{j,ν−1}||².
5. If ν < N and E_ν > ε, then go to step 3.
The proposed formulation can also produce fuzzy c-partitions if g_0(x) = exp(σx),
which corresponds to f_0(x) = ln x/σ and f_0′(x) = 1/(σx). For this generator function,
(87) gives

γ_ij = [(1/c) Σ_{l=1}^{c} (exp(σ||x_i − v_l||²)/exp(σ||x_i − v_j||²))^{1/(1−m)}]^{1−m}   (95)

In this case, θ_ij = (γ_ij)^{−1} and α_ij = θ_ij(γ_ij)^{m/(m−1)} = (γ_ij)^{1/(m−1)}. For g_0(x) = exp(σx), the
competition functions {α_ij} can be obtained using (95) as

α_ij = [(1/c) Σ_{l=1}^{c} (exp(σ||x_i − v_j||²)/exp(σ||x_i − v_l||²))^{1/(m−1)}]^{−1}   (96)

In this case, the competition functions satisfy

(1/c) Σ_{j=1}^{c} α_ij = 1, 1 ≤ i ≤ M   (97)

According to (97), the exponential generator function g_0(x) = exp(σx) also produces
fuzzy c-partitions, with the membership functions {u_ij} obtained from the competition
functions {α_ij} as u_ij = α_ij/c. For μ = (m − 1)/m and δ_μ = (1 − μ)/μ, (96) gives

u_ij = exp(−σδ_μ ||x_i − v_j||²)/(Σ_{l=1}^{c} exp(−σδ_μ ||x_i − v_l||²))   (98)
/^χρΜχ,.-ν,^Ι^-'Υ1 .
• «»> = 2_\ -Γ-» ΠΪΓ
,1 <
< '. < ^ 1 <7 < c.
< M;
^\εχρ(σ||χ,-νΑι,_ιΓ)/ y
• ^ =(E£i«j/' i<y<c.
• \JiV = v,- „_, + ??,· „ Σ ^ ι a,y,«(x,· - v/,v-i), 1 < 7 < c.
• σ = - ο · ο ( Σ £ ι Σ ; = Ι "/,> In «*,ν)/(Σ£ι Σ)=ι «fcvll* - v,>|| 2 ).
• £v = I£=illy/.v-7/.v-ill 2 .
6. If v < ./V and Ev > e then go to step 4.
For the generator function g_0(x) = x^q, condition 3 of Theorem 2 gives

r_0(x) = q x^{2(q−1)} (1 − q/(1 − m)) > 0   (100)

For q > 0, the inequality (100) holds if 1 > q/(1 − m). If m > 1 (1 − m < 0), then
the inequality 1 > q/(1 − m) holds for all q > 0, since in this case q/(1 − m) < 0.
If m < 1 (1 − m > 0), then the inequality 1 > q/(1 − m) > 0 holds if
1 − m > q > 0.

If m > 1, then g_0(x) = x^q generates admissible reformulation functions of the first
kind for all q > 0. If m < 1, then g_0(x) = x^q generates admissible reformulation functions
of the second kind only if 0 < q < 1 − m. If q = 1, then g_0(x) = x^q generates
admissible reformulation functions of the first kind if m > 1. If q = 1 and m < 1,
then g_0(x) = x^q generates admissible reformulation functions of the second kind if
1 < 1 − m, that is, if m < 0.
SOFT LVQ AND CLUSTERING ALGORITHMS BASED ON NONLINEAR GENERATOR FUNCTIONS

Consider the nonlinear generator function g_0(x) = x^q, q > 0, which corresponds to
f_0(x) = x^{1/q}. Since f_0′(xy) = q f_0′(x) f_0′(y), the {θ_ij} can be obtained from (86) as

θ_ij = (γ_ij)^{(1−q)/q}   (101)

where

γ_ij = [(1/c) Σ_{l=1}^{c} ((||x_i − v_l||²)^q/(||x_i − v_j||²)^q)^{1/(1−m)}]^{1−m}   (102)

The competition functions {α_ij} can be obtained from (85) as α_ij = θ_ij (γ_ij)^{m/(m−1)} or,
equivalently, as

α_ij = [(1/c) Σ_{l=1}^{c} (||x_i − v_j||²/||x_i − v_l||²)^{q/(m−1)}]^{−(q+m−1)/q}   (103)
Figure 2  Reformulation functions of the first kind (indicated by vertical shading) and
the second kind (indicated by horizontal shading) generated by generator functions of
the form g_0(x) = x^q for different values of q and m. The reformulation functions of the
first and second kind associated with the FCM and FLVQ algorithms are represented
by the line horizontal to the m axis located at q = 1.
For q = 1, (103) gives

α_ij = [(1/c) Σ_{l=1}^{c} (||x_i − v_j||²/||x_i − v_l||²)^{1/(m−1)}]^{−m}

This is the form of the competition functions of the FLVQ algorithm [8,20]. In this case,
α_ij = (c u_ij)^m, where {u_ij} are the membership functions (24) of the FCM algorithm. As
m → 1+, {u_ij} approach the membership functions that implement the nearest prototype
partition of the feature vectors associated with the crisp c-means algorithm, defined as
u_ij = 1 if ||x_i − v_j||² < ||x_i − v_l||², ∀ l ≠ j, and u_ij = 0 otherwise. The behavior of the
competition functions for a fixed value of q > 0 as m spans the interval (1, ∞) follows
from (103): the resulting partition approaches a crisp c-partition as m → 1+ and becomes
increasingly fuzzy as m increases.
I_j = {i : 1 ≤ i ≤ M : ||x_i − v_j||² ≤ ||x_i − v_l||², ∀ l ≠ j}   (113)

Since ∪_{j=1}^{c} I_j = {1, 2, ..., M}, the total quantization error associated with the representation
of the entire set of feature vectors by the prototypes v_j, 1 ≤ j ≤ c, can be calculated
as E = Σ_{j=1}^{c} E_j, where E_j is the quantization error associated with the
representation of the feature vectors {x_i}_{i∈I_j} by the prototype v_j, that is,

E_j = Σ_{i∈I_j} ||x_i − v_j||²   (114)

The largest possible reduction of the total quantization error E can be produced by
splitting the prototype that corresponds to the largest among the quantization errors
defined in (114). More specifically, the prototype v_k is split if E_k > E_j, ∀ j ≠ k.

Suppose v_k is the prototype selected for splitting according to the splitting criterion
just presented. Let x_ℓ be the feature vector that has the maximum distance from v_k
among all {x_i}_{i∈I_k}, that is, ||x_ℓ − v_k||² ≥ ||x_i − v_k||², ∀ i ≠ ℓ. According to (114), x_ℓ has
the largest contribution to the quantization error E_k associated with the representation
of the feature vectors {x_i}_{i∈I_k} by the prototype v_k. Thus, the error E_k can be reduced by
splitting the prototype v_k along the direction v_k − x_ℓ. Splitting of the prototype v_k
produces two new prototypes yk and vv+1. The new prototype vk is obtained by updat-
ing the original prototype v* as
where 8 e (0,1). According to (115), the new prototype v^ is attracted by the feature
vector xt by an amount determined by 8 € (0,1). As 8 increases from 0 to 1, the new
prototype vk moves closer to the feature vector xt. In order to balance the effect of
moving yk toward xt, the new prototype vv+1 is obtained by moving the original pro-
totype yk by the same amount in the opposite direction. More specifically, the new
prototype is obtained by updating the original prototype yk as
where 8 e (0,1). According to (116), the prototype vv+1 is repelled by the feature vector
Xi by an amount determined by 8 e (0,1). As 8 increases from 0 to 1, the new prototype
v„+1 moves away from the feature vector xt. The codebook resulting after sphtting one
prototype can be improved by calculating each prototype as the centroid of all feature
vectors belonging to its corresponding cluster. More specifically, each prototype v,· is
computed as the centroid of the feature vectors {χ,·},€χ., with 2} defined in (113).
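The splitting step described above can be sketched in NumPy. The function below is our own hedged illustration, not the chapter's code; its names are ours, but the logic follows (113)-(116) directly.

```python
import numpy as np

def split_worst_prototype(X, V, delta=0.3):
    """One prototype-splitting step, sketching Eqs. (113)-(116).

    X: (M, n) feature vectors; V: (nu, n) current prototypes;
    delta in (0, 1) controls how far the two offspring move apart.
    Returns a codebook with one additional prototype.
    """
    # Squared distances d[i, j] = ||x_i - v_j||^2
    d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    # Nearest-prototype partition: I_j = {i : j = argmin_l d[i, l]}  (Eq. 113)
    labels = d.argmin(axis=1)
    # Quantization error E_j of each cluster  (Eq. 114)
    E = np.array([d[labels == j, j].sum() for j in range(V.shape[0])])
    k = E.argmax()                           # prototype with the largest error
    members = np.where(labels == k)[0]
    # Feature vector x_l farthest from v_k within its own cluster
    x_l = X[members[d[members, k].argmax()]]
    v_new = V[k] + delta * (x_l - V[k])      # attracted toward x_l   (Eq. 115)
    v_extra = V[k] - delta * (x_l - V[k])    # repelled from x_l      (Eq. 116)
    V = V.copy()
    V[k] = v_new
    return np.vstack([V, v_extra])
```

Repeating this step, each time followed by recomputing every prototype as the centroid of its cluster, grows the codebook one prototype at a time, which is the essence of initialization scheme I2 below.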
• Initialization Scheme 1 (I1): According to this scheme, the initial prototypes are
randomly selected from the feature space, that is, the subspace of R^n containing
all feature vectors that form the inputs to the clustering or LVQ algorithm.
• Initialization Scheme 2 (I2): This initialization scheme is based on the prototype
splitting procedure described in Section 9.1.
• Initialization Scheme 3 (I3): This initialization scheme is based on the same
prototype splitting approach. The only difference is that this scheme applies
the clustering or LVQ algorithm used to produce the final set of prototypes
every time a new prototype is created by splitting one of the existing
prototypes.
188 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms
10. MAGNETIC RESONANCE IMAGE SEGMENTATION

The clinical utility of magnetic resonance (MR) imaging rests on the contrasting image
intensities obtained for different tissue types, both normal and abnormal. For a given
MR image pulse sequence, image intensities depend on local values of the following
relaxation parameters: the spin-lattice relaxation time (T1), the spin-spin relaxation
time (T2), and the FLAIR (fluid-attenuated inversion recovery) contrast. In the context of MR imaging, segmentation usually implies
the creation of a single image with fewer intensity levels that correspond to different
segments. The development of MR image segmentation techniques has been attempted
using classical pattern recognition techniques [30], rule-based systems [31], image ana-
lysis methods and mathematical morphology [32], supervised neural networks [33], and
unsupervised clustering procedures [33].
Segmentation of MR images is formulated to exploit the differences among the local
values of the T1, T2, and FLAIR relaxation parameters. The values of these parameters
represent the intensity levels (pixels) of a set of three images, namely the T1-weighted,
T2-weighted, and FLAIR-weighted images. Let x_T1, x_T2, and x_FLAIR be the pixel values of
the T1-weighted, T2-weighted, and FLAIR-weighted images, respectively, at a certain
location. The relaxation parameter values x_T1, x_T2, and x_FLAIR can be combined to
form the vector

x = [x_T1, x_T2, x_FLAIR]^T
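Assembling these per-pixel values into feature vectors can be sketched as follows; this is our own minimal illustration, assuming the three images are co-registered arrays of identical shape.

```python
import numpy as np

def mr_feature_vectors(t1, t2, flair):
    """Combine co-registered T1-, T2-, and FLAIR-weighted images into
    per-pixel feature vectors x = [x_T1, x_T2, x_FLAIR]^T (a sketch)."""
    assert t1.shape == t2.shape == flair.shape
    # Stack the three intensities along a new last axis, then flatten
    # the spatial grid so each row is one pixel's feature vector.
    return np.stack([t1, t2, flair], axis=-1).reshape(-1, 3).astype(float)
```

The resulting (num_pixels, 3) array is the feature set handed to the clustering or LVQ algorithm; the segment labels it produces are then reshaped back onto the image grid.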
Each segmented image was evaluated by two experts, who assigned grades ranging from 0 to 10, with 0 representing the lowest diagnostic value and 10
representing the highest diagnostic value. The average of the two grades assigned to
each segmented image was used in the evaluation to represent its diagnostic value. The
average is indeed a reliable measure of diagnostic value because the two evaluators
assigned very similar grades to the majority of the segmented images and disagreed only
in their assessment of segmented MR images of low diagnostic value.
In the first set of experiments, the MR image was segmented by the soft clustering
algorithms generated by g0(x) = x^q, which were tested with m = 2 and q = 0.7, q = 1,
q = 1.2, q = 1.5, and q = 2. The algorithms were used to produce eight segments, which
implies that each algorithm generated c = 8 prototypes. Table 2 summarizes the results
of the evaluation of the segmented MR images produced by soft clustering algorithms
initialized by all three initialization schemes. Regardless of the initialization scheme
used, the segmented images produced by the algorithms tested with q = 1, q = 1.2, and
q = 1.5 were hardly distinguishable from each other. On the other hand, the initializa-
tion scheme had a rather significant effect on the diagnostic value of the segmented
images produced by the algorithms tested with q = 0.7 and q = 2.
Initialization    I1      I2      I3
q = 0.7           4.25    6.75    6.25
q = 1.0           6.75    6.75    6.75
q = 1.2           6.75    6.75    6.75
q = 1.5           6.75    6.75    6.75
q = 2.0           6.75    5.50    6.75

The algorithms were tested with m = 2 and N = 50, and the clustering process
was initialized using the schemes I1, I2, and I3.
Figure 4 shows the segmented images produced by the soft clustering algorithms
initialized by the I2 initialization scheme and tested with q = 0.7, q = 1, q = 1.2, and
q = 2. In this set of segmented images, red segments represent tumor, blue segments
represent edema surrounding the tumor, white segments represent CSF or CSF filling
in the surgical cavity, yellow segments represent white matter, and green segments
represent gray matter. There are also some segments of pink color within the area
segmented as tumor (red color). Since labeling these segments was difficult, the exis-
tence of pink segments and their size relative to that of the red segments was con-
sidered to be a liability and had a negative impact on the evaluation of the segmented
images. In addition to the segments already mentioned, there are segments occupying
areas that do not correspond to brain matter, such as air, fat, skin, and background.
The algorithms were capable of identifying the surgical cavity, the tumor, and the
edema. Nevertheless, there was concern that the volume of CSF might be under-
estimated in the segmented images. This is probably the cause of a slight overestima-
tion of gray matter. Finally, there are some green pixels (representing gray matter)
appearing within the blue segment that represents edema. According to Figure 4, the
segmented images produced for q = 0.7, q = 1, and q = 1.2 cannot be distinguished
from each other. However, the segmented image produced for q = 2 differs from all
the rest. Gray matter (represented by green color) is almost completely absent from
this segmented image. Another significant difference between the segmented MR
image produced for q = 2 and the rest is the appearance of two yellow patches
(representing white matter) within the red area (representing remaining tumor). The
same patches were segmented as gray matter (green color) by the algorithms tested
with q = 0.7, q = 1, and q = 1.2.
Section 10 Magnetic Resonance Image Segmentation 191
Initialization    I1    I2    I3

The algorithms were tested with values of m decreasing linearly from m_i = 5 to
m_f = 2 in N = 50 iterations, and the learning process was initialized using the
schemes I1, I2, and I3.
The segmented images produced for q = 0.7 and q = 1 (FLVQ) were of low diagnostic value. This is
clearly indicated by Figure 5, which shows the segmented MR images obtained for
q = 0.7, q = 1, q = 1.2, and q = 2. In Figures 5a and b, red color (representing tumor)
surrounds the edema and the region of the MR image occupied by skin. Red color,
representing tumor, also appeared in the cortex and the edema. The diagnostic value
of the segmented images improved for values of q above 1.2, as indicated by Figures 5c
and d.
Table 3 also indicates that the performance of the soft LVQ algorithms tested in
these experiments was rather strongly affected by the scheme used to produce the initial
set of prototypes. Regardless of the value of q, there were no visible differences among
the segmented images produced by the soft LVQ algorithms initialized by the initialization
schemes I1 and I2. For q = 0.7, q = 1, and q = 1.2, the soft LVQ algorithms
initialized by the initialization scheme I3 produced segmented MR images inferior to
those produced by the same algorithms employing the schemes I1 and I2 for initializa-
tion. Compared with the soft clustering algorithms tested with a fixed value of m, soft
LVQ algorithms were more strongly affected by the value of q. This can be attributed to
the fact that soft LVQ algorithms were implemented with values of m that decreased
linearly from m_i = 5 to m_f = 2 during the learning process. In the initial stages of the
learning process, where the values of m are considerably higher than 1, the soft LVQ
algorithms tested with low values of q tend to produce increasingly soft partitions. This
leads to tissue mixing, which is also observed in the last stages of the learning process,
where the values of m approach the final value m_f = 2. Tissue mixing can be
remedied by increasing the value of q above 1. Values of q in this range lead to parti-
tions that are closer to a crisp partition and, thus, balance the effect of the high values
of m used in the initial stages of the learning process.
The third set of experiments evaluated the effect of the parameter q on soft cluster-
ing and LVQ algorithms by comparing the segmented images produced by such algo-
rithms tested with values of q from the interval [0.7, 4]. Soft clustering algorithms were
tested with afixedvalue of m = 2, whereas soft LVQ algorithms were tested with values
of m decreasing linearly from m,f = 5 to nif — 2 in N = 50 iterations. All algorithms
tested in these experiments were initialized using the initialization scheme 12. Table 4
summarizes the results of the evaluation of the segmented images produced in this set of
experiments. Soft clustering algorithms produced segmented MR images of high diag-
nostic value when q took values between 0.7 and 1.5. For this range of values of q,
there were no visible differences among the segmented images produced by soft clustering
algorithms. The soft clustering algorithms
achieved their best performance for values of q in a neighborhood of q = 1 (that is, the
value of q that corresponds to the FCM algorithm). On the other hand, soft LVQ
algorithms failed for values of q in a neighborhood of q = 1 (that is, the value of q
that corresponds to the FLVQ algorithm). In fact, soft LVQ algorithms produced
segmented MR images comparable or even superior to those resulting from soft clus-
tering algorithms for values of q between 1.5 and 4.
11. CONCLUSIONS
The reformulation of FCM and ECFC algorithms indicated that clustering algo-
rithms can alternatively be derived by minimizing their corresponding reformulation
function using gradient descent. It was also shown in this chapter that a broad
variety of soft LVQ and clustering algorithms can be developed by minimizing an
admissible reformulation function. According to this formulation, the development
of new algorithms reduces to the search for a generator function that satisfies certain
conditions. FCM and ECFC algorithms can be interpreted as special cases of the
proposed formulation corresponding to linear and exponential generator functions,
respectively.
The axiomatic approach presented in this chapter for the development of soft
LVQ and clustering algorithms can eventually replace alternating optimization as a
tool for vector quantization and clustering. The simplicity of the proposed approach
allows the development of algorithms that would not be the product of alternating
optimization techniques [10, 11]. In fact, the fuzzy LVQ and clustering algorithms
produced by the proposed formulation using the linear and exponential generator
functions constitute only a tiny subset of all possible algorithms that can be gener-
ated by this approach. Any admissible generator function leads to soft but not
necessarily fuzzy LVQ and clustering algorithms, in the sense that only the member-
ship functions corresponding to the linear and exponential generator functions
satisfy the condition (89) required for fuzzy partitions. The search for potential
generator functions would essentially involve admissible functions increasing faster
than the linear generator function and slower than the exponential generator func-
tion. Soft clustering and LVQ algorithms were developed by selecting nonlinear
generator functions.
The soft LVQ and clustering algorithms produced by nonlinear generator func-
tions were evaluated and compared with existing algorithms by formulating MR
image segmentation as an unsupervised clustering or vector quantization process.
This is a nontrivial problem and provides a reliable basis for comparing the perfor-
mance of LVQ and clustering algorithms because the diagnostic value of the seg-
mented MR images depends exclusively on the unsupervised algorithm used to
perform segmentation. The soft clustering algorithms generated by g0(x) = x^q and
tested with m = 2 achieved their best performance for values of q in a neighborhood
of 1, that is, the value of q that leads to the FCM algorithm. On the other hand, the
soft LVQ algorithms generated by g0(x) = x^q and tested with values of m decreasing
linearly from m_i = 5 to m_f = 2 in N = 50 iterations achieved their best perfor-
mance for values of q in the interval [1.5, 4]. Note that this interval does not include
the value q = 1, which leads to the FLVQ algorithm. This experimental outcome
reveals that the performance of clustering algorithms was not significantly affected
by replacing the linear generator function g0(x) = x by the nonlinear generator
function g0(x) = x^q, with q ≠ 1. On the other hand, the use of the nonlinear gen-
erator function g0(x) = x^q with q > 1 instead of the linear generator function g0(x) =
x significantly improved the performance of LVQ algorithms. The application of
soft clustering and LVQ algorithms in image segmentation indicated that the for-
mation of the initial set of prototypes has a rather significant effect on their per-
formance. The most robust and consistent behavior was exhibited in this
experimental study by the soft clustering and LVQ algorithms initialized by the
scheme I2, which employs the proposed prototype splitting procedure.
Initialization of the algorithms by the scheme I1 (based on random generation of
the initial set of prototypes) worked well in some cases, but the performance of the
algorithms initialized according to this scheme was not consistent. Similar inferences
can be made about the initialization scheme I3, which involved repeated prototype
splitting followed by the application of the algorithm used to perform MR image
segmentation.
There is experimental evidence that soft vector quantization algorithms are strong
competitors to vector quantizers based on crisp algorithms in applications involving a
large number of feature vectors of high dimensionality [2,7,16]. Thus, soft LVQ and
clustering algorithms can also be used to perform codebook design for lossy image and
video compression, a task that is frequently performed using a variation of the crisp c-
means algorithm known in the engineering literature as the Linde-Buzo-Gray (LBG)
algorithm [28,34]. There is also evidence that soft LVQ and clustering algorithms
produced by the formulation presented in this chapter are inherently capable of identi-
fying the feature vectors that are equidistant from the prototypes. This property can be
exploited to detect outliers in the feature set. This problem is currently under
investigation.
ACKNOWLEDGMENTS
The author would like to thank Lawrence C.-P. Leung for processing the MR
image data. Furthermore, the author thanks Professors W. K. Alfred Yung, M.D.,
and Edward F. Jackson, Ph.D., who provided the MR image data and evaluated the
segmented MR images.
REFERENCES
[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York:
Plenum, 1981.
[2] N. B. Karayiannis, Generalized fuzzy λ-means algorithms and their application in image
compression. SPIE Proceedings vol. 2493: Applications of Fuzzy Logic Technology II, pp.
206-217, Orlando, FL, April 17-21, 1995.
[3] N. B. Karayiannis, Generalized fuzzy c-means algorithms. Proceedings of Fifth International
Conference on Fuzzy Systems, pp. 1036-1042, New Orleans, LA, September 8-11, 1996.
[4] K. Rose, E. Gurewitz, and G. C. Fox, Vector quantization by deterministic annealing. IEEE
Trans. Inform. Theory 38: 1249-1257, 1992.
[5] N. B. Karayiannis, Maximum entropy clustering algorithms and their application in image
compression. Proceedings of 1994 IEEE International Conference on Systems, Man, and
Cybernetics, pp. 337-342, San Antonio, October 2-5, 1994.
[6] N. B. Karayiannis, Fuzzy partition entropies and entropy constrained fuzzy clustering
algorithms. J. Intell. Fuzzy Syst. 5(2): 103-111, 1997.
[7] N. B. Karayiannis, Entropy constrained learning vector quantization algorithms and their
application in image compression. SPIE Proceedings, Vol. 3030: Applications of Artificial
Neural Networks in Image Processing II, pp. 2-13, San Jose, CA, February 12-13, 1997.
[8] N. B. Karayiannis, A methodology for constructing fuzzy algorithms for learning vector
quantization. IEEE Trans. Neural Networks 8(3): 505-518, 1997.
[9] N. B. Karayiannis, Learning vector quantization: A review. Int. J. Smart Eng. Syst. Design
1: 33-58, 1997.
[10] N. B. Karayiannis, Ordered weighted learning vector quantization and clustering algo-
rithms. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1388-1393,
Anchorage, AK, May 4-9, 1998.
[11] N. B. Karayiannis, Soft learning vector quantization and clustering algorithms based on
reformulation. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1441-
1446, Anchorage, AK, May 4-9, 1998.
[12] N. B. Karayiannis, Reformulating learning vector quantization and radial basis neural net-
works. Fundam. Inform. 37: 137-175, 1999.
[13] N. B. Karayiannis, An axiomatic approach to soft learning vector quantization and cluster-
ing. IEEE Trans. Neural Networks, in press.
[14] N. B. Karayiannis, J. C. Bezdek, N. R. Pal, R. J. Hathaway, and P.-I. Pai, Repairs to
GLVQ: A new family of competitive learning schemes. IEEE Trans. Neural Networks
7(5): 1062-1071, 1996.
[15] N. B. Karayiannis and J. C. Bezdek, An integrated approach to fuzzy learning vector
quantization and fuzzy c-means clustering. IEEE Trans. Fuzzy Syst. 5(4): 622-628, 1997.
[16] N. B. Karayiannis and P.-I. Pai, Fuzzy algorithms for learning vector quantization. IEEE
Trans. Neural Networks 7(5): 1196-1211, 1996.
[17] E. Kosmatopoulos and M. Christodoulou, Convergence properties of a class of learning
vector quantization algorithms. IEEE Trans. Image Process. 5(2): 361-368, 1996.
[18] I. Pitas, C. Kotropoulos, N. Nikolaidis, R. Yang, and M. Gabbouj, Order statistics learning
vector quantizer. IEEE Trans. Image Process. 5(6): 1048-1053, 1996.
[19] I. Pitas, C. Kotropoulos, N. Nikolaidis, and A. G. Bors, Robust and adaptive techniques in
self-organizing neural networks. Nonlinear Anal. Theory Methods Appl. 307: 4517-4528,
1997.
[20] E. C.-K. Tsao, J. C. Bezdek, and N. R. Pal, Fuzzy Kohonen clustering networks, Pattern
Recogn. 27(5): 757-764, 1994.
[21] S.-J. Yu and C.-H. Choi, LVQ with a weighted objective function. Proceedings of IEEE
International Conference on Neural Networks, pp. 2763-2768, Perth, Australia, November
27-December 1, 1995.
References 197
[22] E. Yair, K. Zeger, and A. Gersho, Competitive learning and soft competition for vector
quantizer design. IEEE Trans. Signal Process. 40(2): 294-309, 1992.
[23] J. C. Bezdek and N. R. Pal, Two soft relatives of learning vector quantization. Neural
Networks 8(5): 729-743, 1995.
[24] R. J. Hathaway and J. C. Bezdek, Optimization of clustering criteria by reformulation.
IEEE Trans. Fuzzy Syst. 3: 241-246, 1995.
[25] N. B. Karayiannis, Generalized fuzzy c-means algorithms. J. Intell. Fuzzy Syst. 8(1): 68-71,
2000.
[26] N. B. Karayiannis and P.-I. Pai, Fuzzy vector quantization algorithms and their application
in image compression. IEEE Trans. Image Process. 4(9): 1193-1201, 1995.
[27] R. Krishnapuram and J. M. Keller, A possibilistic approach to clustering. IEEE Trans.
Fuzzy Syst. 1:98-110, 1993.
[28] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston: Kluwer
Academic, 1992.
[29] D. Dyckhoff and W. Pedrycz, Generalized means as a model of compensative connectives.
Fuzzy Sets Syst. 14(2): 143-154, 1984.
[30] T. J. Hyman, R. J. Kurland, G. C. Levy, and J. D. Shoop, Characterization of normal brain
tissue using seven calculated MRI parameters and a statistical analysis system. Magn. Reson.
Med. 11:22-34, 1989.
[31] S. P. Raya, Low-level segmentation of 3-D magnetic resonance brain images—A rule-based
system. IEEE Trans. Med. Imaging 9(3): 327-337, 1990.
[32] M. Bomans, K. H. Hohne, U. Tiede, and M. Riemer, 3-D segmentation of MR images of
the head for 3-D display. IEEE Trans. Med. Imaging 9(2): 177-183, 1990.
[33] L. O. Hall, A. M. Bensaid, L. P. Clarke, R. P. Velthuizen, M. S. Silbiger, and J. C. Bezdek, A
comparison of neural network and fuzzy clustering techniques in segmenting magnetic
resonance images of the brain. IEEE Trans. Neural Networks. 3: 672-682, 1992.
[34] N. M. Nasrabadi and R. A. King, Image coding using vector quantization: A review. IEEE
Trans. Commun. 36: 957-971, 1988.
[35] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Upper Saddle
River, NJ: Prentice Hall, 1995.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.
Sleep state is one of the substantial aspects of consciousness. Concerning the functions
of sleep in memory and learning, physiological and psychological evidence has been
accumulated (e.g., [1,2]). Many ideas have been proposed to elucidate the mechanisms
underlying sleep functions (e.g., [3]), but none of these ideas has been established. It is
essential to construct a model that can give insight into mechanisms of sleep and enable
its computational interpretation. Efforts have been made to do so from physiological
and model-based points of view.
In the central nervous system of the cat, the following phenomena concerning the
dynamics of single neuronal activities during the sleep cycle have been found: (1)
During rapid eye movement sleep (REM or dream sleep), neuronal activities showed
slow fluctuations, and their power spectral densities (PSDs) were approximately inver-
sely proportional to frequency in the frequency range 0.01-1.0 Hz (simply abbreviated
as 1/f). (2) During steady-state slow wave sleep (SWS), neurons showed almost flat
spectral profiles in the same frequency range. These phenomena have been found in
various regions of the cat's brain such as the mesencephalic reticular formation [4],
the hippocampus, the thalamus, and the cortex [5-7].
Based on neurophysiological knowledge, the dynamics transition was successfully
simulated by using an interconnected neural network model including a globally
applied inhibitory input and random noise [8-11]. That is, the neuronal dynamics
during SWS and REM sleep were reproduced by the network model under strong
and weak inhibitory inputs, respectively. A monotonous structure of the network
attractor is suggested to exist where the "0" state is highly attractive under strong
inhibition and the metastability of the attractor is dominant under weak inhibition.
Thus, the structural change in the network attractor associated with an increase in the
global inhibitory input could underlie the neuronal dynamics transition. It was also
shown that statistical properties of the noise could differentiate the network state
behavior in the metastable attractor.
In this chapter, the phenomenology of the dynamics transition of single neuronal
activities during sleep is reviewed. Then simulation results for the neuronal dynamics
transition are summarized. Finally, the generation mechanism underlying the
dynamics transition is studied based on the structural analysis of network attractors
Section 1 Dynamics Transition of Neuronal Activities During Sleep 199
under various conditions. On the basis of these studies, we discuss what is happening
in actual neural networks and its possible contribution to brain functions.
The phenomenon of the dynamics transition was first demonstrated in the long-term
spontaneous single neuronal activity of the mesencephalic reticular formation (MRF)
of the cat during sleep (Figure 1) [4]. During slow wave sleep, a random spike density
pattern was often observed and an almost flat PSD was obtained for the low-frequency
range from approximately 0.01 to 1 Hz. On the other hand, for almost the same fre-
quency range during REM sleep, a slowly fluctuating spike density pattern was
observed and the PSD was inversely proportional to frequency. This kind of PSD is
simply called the 1/f PSD. Thus, an important characteristic of the cat's MRF neuronal
activity is the "dynamics transition" between the flat PSD during SWS and the 1/f
PSD during REM sleep. The robustness of the characteristic dynamics transition
between the two sleep states was verified for the 18 MRF neurons [4]. Furthermore,
the consistent characteristic of the transition was verified on a time axis during 24 hours
for a representative MRF neuron. That is, the PSD was calculated for a sample series of
the neuronal activity extracted from all sustained episodes of both SWS and REM. In
all the episodes of the respective states, the MRF neurons displayed the flat and 1/f
PSDs during SWS and REM, respectively.
The 1/f PSD is more generally evaluated by a form of f^(-b) profile, where exponent
b is the slope of the PSD in a double logarithmic plot. When b = 1 the PSD corresponds
to the exact 1/f profile, whereas b = 0 corresponds to the flat PSD. As for the varia-
bility of the value of b during REM sleep among neurons, the mean values calculated
from 19 MRF neuronal activities were distributed from 0.56 to 1.37 with a pooled mean
of 0.96. Within a neuron, variabilities in b for the respective states were small.
Therefore, each neuron was suggested to have its own value, which was hypothesized
to indicate the structural specificity of the reticular network including the neuron under
observation [13].
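The exponent b of an f^(-b) profile is obtained by a straight-line fit to the PSD in double logarithmic coordinates. The sketch below illustrates this with a plain periodogram; the function name, parameter defaults, and frequency band are our own illustrative choices, not the authors' procedure.

```python
import numpy as np

def spectral_exponent(x, fs=1.0, f_lo=0.01, f_hi=1.0):
    """Estimate b in PSD(f) ~ f^(-b) by linear regression of
    log10(PSD) on log10(f) over [f_lo, f_hi] Hz (illustrative sketch)."""
    n = len(x)
    f = np.fft.rfftfreq(n, d=1.0 / fs)
    # Periodogram of the mean-removed series
    psd = np.abs(np.fft.rfft(x - np.mean(x))) ** 2 / (n * fs)
    band = (f >= f_lo) & (f <= f_hi)
    slope, _ = np.polyfit(np.log10(f[band]), np.log10(psd[band]), 1)
    return -slope  # b = 1 -> 1/f profile, b = 0 -> flat PSD
```

Applied to a flat-spectrum series this returns b near 0, and for strongly low-frequency-dominated series it returns larger b, matching the SWS/REM contrast described in the text.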
Figure 2 shows a summary of the dynamics transition of single neuronal activities
between the two sleep states in the five neuronal groups: the mesencephalic reticular
formation, the primary somatosensory cortex, the ventrobasal complex of the thala-
mus, and the hippocampus (the theta and the pyramidal neurons) [14,15]. In addition,
the neurons recorded from the neocortex, areas 6 and 18, from the dorsal lateral
nucleus, and from the thalamic reticular nucleus have shown similar characteristics
[7,16,17]. It should be emphasized that all of these neuronal groups show a similar
Figure 2 Summary of the dynamics transition between SWS and REMS in five
neuronal groups: the mesencephalic reticular formation (M113), the pri-
mary somatosensory cortex (S007), the ventrobasal complex of the thala-
mus (V110), and the hippocampal theta (H013) and pyramidal (H071)
neurons. On the ordinate is the logarithm of PSD in (spikes/250 ms)2/Hz.
(Adapted with permission from Ref. 14.)
Section 3 Neural Network Model 201
dynamics transition, despite the structural and functional differences of the neuronal
groups. Some common modulation mechanisms must operate under the dynamics
transition.
Random noises were applied to activate each neuron, mimicking various rapidly
fluctuating noise sources.
Here, the model structure is reviewed briefly. The neural network model consists of
fully interconnected neuron-like elements (referred to as "neurons"). For the ith neuron,
the state evolution rule is defined as follows:

u_i(t + 1) = Σ_{j=1}^{N} w_{ij} x_j(t) − h + ε_i(t + 1),
x_i(t + 1) = 1 if u_i(t + 1) > 0, and x_i(t + 1) = 0 otherwise    (1)
where N denotes the number of neurons contained in the network; t is a discrete time;
ε_i(t) denotes the noise input, generated by the first-order autoregressive process
ε_i(t + 1) = a ε_i(t) + e_i(t + 1), where the innovations e_i(t) are assumed to be mutually
independent, zero-mean white Gaussian noises with an identical variance σ²(1 − a²),
so that the variance of the autoregressive process is kept constant at σ² regardless of a;
h (> 0) is the inhibitory input, which is fixed here independent of neurons for simplicity;
and a is an autoregressive parameter. In this case, the autocorrelation function ψ(k)
of the noise is given by ψ(k) = σ² a^{|k|}.
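Such autoregressive noise can be generated as follows. This is a minimal sketch under the stated assumption of a first-order process whose innovation variance σ²(1 − a²) keeps the process variance at σ² for any |a| < 1; the function and its defaults are ours.

```python
import numpy as np

def ar1_noise(n, a, sigma, rng=None):
    """First-order autoregressive noise eps(t+1) = a*eps(t) + e(t+1),
    with white Gaussian innovations of variance sigma^2 * (1 - a^2),
    so that Var[eps] = sigma^2 regardless of a (|a| < 1)."""
    rng = np.random.default_rng() if rng is None else rng
    innovations = rng.normal(0.0, sigma * np.sqrt(1.0 - a * a), size=n)
    eps = np.empty(n)
    eps[0] = rng.normal(0.0, sigma)  # start in the stationary distribution
    for t in range(1, n):
        eps[t] = a * eps[t - 1] + innovations[t]
    return eps
```

With a = 0 this reduces to plain white noise; increasing a toward 1 lengthens the autocorrelation ψ(k) = σ² a^|k| while leaving the overall noise power unchanged, which is what allows a to be varied in the simulations without changing the noise level.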
State evolution of the network model is performed in an asynchronous (cyclic)
manner [32]. The memorized patterns and the initial states are given as equiprobable
binary random sequences. Unless otherwise stated, simulations are done for 11,000
Monte Carlo steps (MCS) and the initial 1000 MCS are not analyzed to exclude the
state sequence dependent on the initial pattern. Because, in this study, the PSD of a
state sequence is almost invariant to the temporal translation of the sequence, the
starting time of the analysis scarcely affects the resulting PSD.
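The asynchronous (cyclic) state evolution can be sketched as follows. This is our own hedged illustration: the threshold rule follows Eq. (1), but the Hebb-type weight construction from 0/1 memorized patterns is an assumption on our part, since the chapter's exact connection rule is not reproduced here.

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebb-type weights from 0/1 memorized patterns (an assumption;
    the chapter's exact connection rule is not reproduced here)."""
    P = 2.0 * np.asarray(patterns, dtype=float) - 1.0  # map {0,1} -> {-1,+1}
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)                            # no self-connections
    return W

def evolve(W, x, h, noise, steps):
    """Asynchronous (cyclic) evolution of Eq. (1):
    u_i = sum_j w_ij x_j - h + eps_i;  x_i <- 1 if u_i > 0 else 0.
    noise has shape (steps, N); one cyclic sweep = 1 Monte Carlo step."""
    x = x.copy()
    N = len(x)
    history = np.empty((steps, N))
    for t in range(steps):
        for i in range(N):  # update neurons one at a time, in fixed order
            u = W[i] @ x - h + noise[t, i]
            x[i] = 1.0 if u > 0.0 else 0.0
        history[t] = x
    return history
```

In this sketch, a large inhibitory input h drives the activity toward the silent "0" state, while a small h lets the state persist near a memorized pattern, mirroring the strong/weak inhibition regimes discussed in the text.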
The data length, 10,000 MCS, is selected to estimate the PSD in a frequency
bandwidth of three decades with sufficient statistical reliability. PSDs of actual neuro-
nal activities referred to here were given in a similar frequency bandwidth [4].
Furthermore, the data length of the neuronal spike train analyzed was at most several
hundred seconds. Comparing this actual data length with 10,000 MCS, 1 MCS could
correspond to several tens of milliseconds. This could be regarded as a time unit during
which a neuron's state is active (1) or inactive (0). The neuronal state may be deter-
mined by the number of spikes occurring during this time unit. This time
resolution is presumably sufficient, considering that the firing rates of actual neurons
under study were at most 30-40 spikes/sec and the concerned frequency range is lower
than 1 Hz [4].
Typical PSD profiles of single neuronal activities in the network model are shown in
Figure 3 for various inhibitions and a values. Unless otherwise stated, the number of
neurons N = 100 and the number of memorized patterns M = 30. In this figure, the activ-
ity of a single neuron is picked out from the 100 neurons included in the network. The raster
plot of x_i(t) is shown together with the corresponding PSD, where a dot indicates
x_i(t) = 1. As one can see, for the case of a = 0 (i.e., white noise) the PSD profile changes
from 1/f to flat as the inhibitory input increases. Here, the parameter values h and σ
are selected regardless of the connection type so that most of the neurons in the network
show 1/f PSD profiles under weak inhibition and flat PSD profiles under strong
inhibition. The time series x_i(t) responsible for the 1/f PSD shows larger and slower
variations than that for the flat PSD. Naturally, the activity is reduced as the inhibitory
input increases. As described previously [8], the strong and weak inhibitory inputs are
responsible for SWS and REM, respectively, in the framework used here. Qualitatively,
the PSD profiles and the temporal characteristics of activities are well reproduced in the
simulations. Regardless of the inhibition level, finely fragmented activities tend to be
suppressed as a increases, which is more obvious in the strong inhibition case. Slopes of
PSDs commonly become steeper with an increase in a. Because an increase in a makes
autocorrelation longer lasting, these results can be attributed to a change in the correla-
tion structure of the noise. The neuronal activities seem to more closely follow the
dynamics of the noise as a increases.
204 Chapter 8 Metastable Associative Network Models of Neuronal Dynamics Transition
Figure 3 Simulation results for the dynamics transition of neuronal activities in the
symmetry neural network for various a values under weak and strong
inhibition, h, where M = 30 and σ = 0.26. Raster plots of single activities
and the corresponding PSDs for the picked-up neuron are shown together.
Results for (a) h = 0.40 and (b) h = 0.50. In the PSDs, the frequency axis is
normalized by f_0, which denotes an arbitrary standard frequency. (Adapted
with permission from Ref. 12.)
It has been suggested that the structural change of the network attractor could underlie
the dynamics transition of neuronal activities during the sleep cycle. That is, the meta-
stable properties of the attractor could be key to understanding the physiological
mechanism that controls the dynamics of neuronal activities during the sleep cycle.
Here we show how the correlation of the random noise modifies the network dynamics
in the state space.
Section 5 Dynamics of Neural Network in State Space 205
For the symmetry network in Figure 3, activities of all neurons (network activity)
are briefly presented in Figure 4a under weak inhibition and in Figure 4b under strong
inhibition. In each panel, the autoregressive parameter a differentiates the pattern of
network activities.
As shown in Figure 4a, under weak inhibition, the network activity explicitly
indicates that the regular and irregular patterns appear alternately with varied dura-
tions. In the regular states, several different stripe patterns can be clearly seen. In
contrast, only the irregular state becomes dominant under strong inhibition. It can
be shown that these stripe patterns correspond to the vicinities of equilibrium states
under this condition, whereas the irregular pattern corresponds to the vicinity of the
"0" state, where all neurons are silent. The closest reference equilibrium state to the
current network state is determined every MCS in terms of a direction cosine (DC),
which represents the "agreement" between a current network state x(t) and a certain
reference state x*. For the networks in Figure 3, reference equilibrium states including
the "0" state are reached from 4000 statistically independent initial states with no noise,
that is, σ = 0. Here, the "0" state is denoted by x0 = [0,0,..., 0]. For the networks in
Figure 3a and b, 63 and 2 equilibrium states are found, respectively. For all references,
the closest reference to the current state is determined step by step by comparing the
magnitude of the corresponding DCs.
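The closest-reference bookkeeping described above can be sketched as follows. This is an illustrative Python reconstruction (function and variable names are mine, not from Ref. 12), computing the direction cosine DC = ⟨x, x*⟩/(‖x‖ ‖x*‖) against each reference state and taking the maximum:

```python
import numpy as np

def closest_reference(x, references):
    """Index of the reference state with the largest direction cosine
    DC = <x, x*> / (|x| |x*|) relative to the current network state x.

    references: 2-D array with one reference equilibrium state per row.
    The all-zero "0" state must be treated separately (its norm
    vanishes), so it is assumed to be excluded from `references`.
    """
    x = np.asarray(x, dtype=float)
    refs = np.asarray(references, dtype=float)
    dc = refs @ x / (np.linalg.norm(refs, axis=1) * np.linalg.norm(x))
    return int(np.argmax(dc)), float(np.max(dc))

# Toy usage with two hypothetical reference states:
refs = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
idx, dc = closest_reference([1, 1, 1, 0], refs)   # idx == 0
```

Applying this at every MCS to the current state x(t) yields the per-step label sequence used to decide which metastable state the network is visiting.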
Under weak inhibition, the network state wanders in the vicinities of the equili-
brium states. Here, the equilibrium states are not absolutely stable because intermittent
transitions among them driven by noise are observed. In this sense, they are denoted
here as "metastable equilibrium states" or simply metastable states, following the ter-
minology of statistical physics [33]. In spite of a constant drive by the noise, the network
state is trapped in metastable states for a certain period. In the irregular states, the
Figure 4 Dynamics of network state evolution for the network in Figure 3. This
shows the brief sequences of all neuronal activities (network activity).
The numbers on the left end are neuron numbers, and the number on the
top is the number of steps from the beginning of evolution. (Adapted with
permission from Ref. 12.)
network may wander around the vicinity of the "0" state. While in the vicinity of the
"0" state, each neuronal state is expected to be determined by an instantaneous value of
the noise rather than inputs from the other neurons. This is presumably the reason why
the spatiotemporal activity pattern looks random. Here, the "metastability" represents
the structural properties of the network attractor in which such metastable equilibrium
states dominantly exist. Therefore, the following description is possible concerning the
dynamics transition of single neuronal activities. The globally applied inhibitory input
modifies the structure of the network attractor. In the weakly inhibited case, the meta-
stability of the network attractor becomes dominant so as to realize the 1/f fluctuations
of single neuronal activities, and in the strongly inhibited case, the "0" state becomes
the global attractor, which underlies low and random activities. In other words, it is
suggested that these behaviors reflect the geometrical structure of the network attractor:
the attractor has a "bumpy" structure under weak inhibition and a monotonic structure
under strong inhibition.
For a = 0.5, one may not be able to recognize a difference between the behavior of the
network state and the preceding results for white Gaussian noise, that is,
a = 0. However, finely fragmented patterns such as snow noise become suppressed for
a = 0.9 in both the strongly and weakly inhibited cases. In the weakly inhibited case,
more kinds of regular patterns can be recognized than with the smaller a.
Similarly, in the strongly inhibited case, distinct regular patterns emerge clearly as
a increases.
by repeating the same procedure, changing the flipping order 100 times. The minimum
in the set of the maxima is selected as the potential wall height between sᵢ and sⱼ. In
addition, the transition probability from sᵢ to sⱼ is estimated by 10 runs of 10,000-MCS
network state evolutions, where the transition probability for i = j indicates the staying
probability in state i.
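The transition-probability estimate lends itself to a simple count over the per-step sequence of closest-state labels. The following sketch is my reconstruction of that bookkeeping, not the original code:

```python
import numpy as np

def transition_matrix(labels, n_states):
    """Estimate transition probabilities between metastable states from
    the per-MCS sequence of closest-state labels. Entry [i, j] is the
    probability of finding state j one step after state i; the diagonal
    entries are the staying probabilities.
    """
    counts = np.zeros((n_states, n_states))
    for i, j in zip(labels[:-1], labels[1:]):
        counts[i, j] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0            # avoid 0/0 for unvisited states
    return counts / row_sums

# Toy label sequence over three states:
labels = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]
T = transition_matrix(labels, 3)
```

In practice the counts would be accumulated over the 10 independent runs before normalizing, so that rare transitions are estimated from as many steps as possible.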
The wall height and the transition probability are presented in Figure 6. For the
weakly inhibited case, the wall heights and the transition probabilities from the "0"
state and state 29 to all other metastable states are presented. Note that the wall
heights at the objective states themselves are zero. Characteristically, there are high potential
walls between the "0" state and any other metastable states. In contrast, there are
several low walls around state 29 (e.g., to state 4, state 26, and state 55). This
structural property of the network potential around state 29 is shared by the other
metastable states except for the "0" state. In this sense, the potential landscape
around the "0" state is special. Concerning the transition probability, transition
from the "0" state to the other metastable states is rare in comparison with staying,
which can be understood from the above special potential landscape. On the other
hand, the transition probability from state 29 is high to itself and the other metastable
states with low potential walls. This agreement between the height of the potential
wall and the transition probability demonstrates the validity of the procedure for
deriving the wall height. A larger a is shown to make transition to the other states
less frequent and to increase the staying probability. Although, in the strongly inhib-
ited condition, only a few metastable states could be analyzed, the results in Figure 6c
and d show features similar to those in the weakly inhibited case in Figure 6a and b.
Under this condition, the potential wall from the "0" state to state 1 is much higher
than from state 1 to the "0" state, which is thought to make the staying probability
close to 1. For state 1, the staying probability increases and the transition decreases as
a gets closer to 1.
The escape time distribution is expected to depend on the local landscape of the
network potential around an equilibrium state. A symmetry neural network is known to
be a multidimensional discrete gradient system (e.g., [32]). However, there is no general
theory describing the metastability of such a multidimensional system. On the other
hand, for a one-dimensional continuous gradient system with a two-well potential, the
escape time under a small Gaussian noise obeys an exponential distribution whose
parameter depends on the height of the potential wall between two wells (e.g., [34]).
That is, a staying probability in a shallow potential well has a faster decaying profile
than one in a deep well. Although for the correlated noise, the theoretical results are
derived only under limited conditions even for the one-dimensional potential case, from
some numerical experiments the escape time distribution is expected to depend on the
Figure 6 Estimated height of network potential walls between metastable states and the transition probabil-
ity. (a) Height of network potential walls around the "0" state and the transition probability from the "0"
state, (b) Height of network potential walls around state 29 and the transition probability from state 29. These
cases are for the network under weak inhibition in Figure 3a. (c) Height of network potential walls around the
"0" state and the transition probability from the "0" state, (d) Height of network potential walls around state
1 and the transition probability from state 1. These cases are for the network under strong inhibition in Figure
3b. (Adapted with permission from Ref. 12.)
local geometry of the attractor as well as the correlation structure of the noise [35]. The
result obtained for the neural network qualitatively coincides with those for the one-
dimensional system.
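This one-dimensional picture is easy to check numerically. The sketch below assumes a quartic two-well potential V(x) = x⁴/4 − x²/2 (an illustrative choice, not a potential from the chapter) and integrates the noisy gradient dynamics by the Euler–Maruyama method, recording the time to escape from the left well:

```python
import numpy as np

def escape_times(n_trials=100, dt=1e-2, noise=0.5, seed=1):
    """Escape times from the left well of V(x) = x**4/4 - x**2/2.

    Euler-Maruyama integration of the gradient system
    dx = -V'(x) dt + noise * sqrt(dt) * N(0, 1); each trial starts at
    the left minimum x = -1 and runs until the state has crossed the
    barrier top at x = 0 well into the right well.
    """
    rng = np.random.default_rng(seed)
    s = noise * np.sqrt(dt)
    times = []
    for _ in range(n_trials):
        x, t = -1.0, 0.0
        while x < 0.5:                          # safely past the barrier
            x += (x - x**3) * dt + s * rng.normal()
            t += dt
        times.append(t)
    return np.array(times)

times = escape_times()
print(f"mean escape time {times.mean():.1f}, sd {times.std():.1f}")
```

For small noise the escape times should be approximately exponentially distributed (standard deviation comparable to the mean), with the mean growing with the wall height, in line with the argument above.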
In short, the behavior of network activity in the state space consists of stochastic
transitions among metastable equilibrium states. The stochastic features of transitions
are determined by the height of the potential walls around metastable states and the
correlation structure of the noise. As far as the inhibition-induced dynamics transition
is concerned, global inhibition reduces the height of the potential walls and the number
of metastable states so that a PSD of neuronal activity changes its profile from 1/f to
white.
In the preceding sections, global inhibition has been shown to change the structure of
the network attractor so as to induce the neuronal dynamics transition. From another
point of view, the global inhibition and the random noise can be regarded as external
inputs in contrast to the inputs from the other neurons interconnected (network input).
Here, "external" means that neurons in the network are not involved in its dynamics. In
this context, the previously proposed mechanism for the neuronal dynamics transition
could be reinterpreted as follows. For the weakly inhibited condition, interaction
between the external inputs and the network inputs prevails, whereas for the strongly
inhibited condition, the external inputs dominate the network inputs. That is, the
balance between the external and network inputs is supposed to play an essential
role in inducing the neuronal dynamics transition. In order to realize the same situation
in a different way from the global inhibition, it was investigated how randomly diluting
connections between neurons affected the structure of the network attractor and the
metastable behavior of the network [12].
The noise was white Gaussian, and the values of σ and h were set so that most of
the neurons exhibited 1/f fluctuations. The dilution was done at random and in a
symmetrical manner. With no dilution, the metastable behavior was clearly similar to
that in Figure 4a. As the dilution ratio increased, irregular patterns such as snow
noise became distinct in the network activities, where most of the neurons exhibited
flat PSDs in the frequency range below f ~ 10⁻². In order to understand the
structural change of the network attractor associated with the dilution, the number
of equilibrium states was derived by the procedure described earlier. The number of
metastable equilibrium states monotonically declined with increasing dilution ratio.
Furthermore, similar results were obtained for the other trials of random dilution.
Naturally, full dilution isolates each neuron, which results in purely random neuronal
activities elicited only by noise. Within the numerical range of the inhibition and the
dilution ratio used, this result suggests that similar structural changes are caused by
the dilution and the inhibitory input. Although there are many possible ways to
reduce effective network inputs, the balance between the external inputs and the
network inputs is suggested to be essential for inducing the neuronal dynamics
transition.
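Random symmetric dilution of the kind investigated here amounts to masking pairs of weights together. A minimal sketch (the function and parameter names are my assumptions, not from Ref. 12):

```python
import numpy as np

def dilute_symmetric(W, ratio, rng):
    """Randomly cut a fraction `ratio` of the off-diagonal connections
    of a symmetric weight matrix W, preserving symmetry: whenever
    w_ij is removed, w_ji is removed as well.
    """
    n = W.shape[0]
    keep = rng.random((n, n)) >= ratio       # one Bernoulli draw per pair
    keep = np.triu(keep, k=1)                # the upper triangle decides...
    keep = keep | keep.T                     # ...and is mirrored below
    W_diluted = np.where(keep, W, 0.0)
    np.fill_diagonal(W_diluted, np.diag(W))  # leave self-terms untouched
    return W_diluted
```

With ratio = 1.0 every off-diagonal pair is cut, reproducing the fully diluted case in which each neuron is driven by the noise alone.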
8. DISCUSSION
Single neuronal activities in various regions of the brain were found to exhibit the
distinct dynamics transition from the flat to the 1/f power spectral profile during the
sleep cycle in cats. The pharmacological studies suggested that the neuromodulatory
systems such as aminergic and cholinergic neuron groups were involved in the dynamics
transition. Based on these results, the dynamics transition was successfully simulated by
using the interconnected neural network model including globally applied inhibitory
input and random noise. That is, the neuronal dynamics during SWS and REM was
reproduced by the network model under strong and weak inhibitory inputs, respec-
tively. In addition, the correlation properties of the noise were shown to affect the
behavior of the network activity, which could cooperate with the network attractor
to induce the dynamics transition. Although not shown here, these findings were
qualitatively shared by an asymmetry neural network model whose connections were made
following Dale's rule [12] and by a continuous-state, continuous-time neural network model
(unpublished results).
It has been suggested that the structural change of the network attractor associated
with the global inhibitory input could be the underlying mechanism for the neuronal
dynamics transition that was observed physiologically [10,11]. In particular, the meta-
stable properties of the network attractor were found to play a main role in generating
the 1/f fluctuations of neuronal activities. It was also found that the correlation proper-
ties of the noise could change the dynamics of the network activities, and the staying
time in the metastable state was prolonged with an increase in the correlation time of
the noise [11]. Those results were confirmed in terms of the escape time distribution in
each metastable state, and the metastable structure of the network attractor was
roughly visualized by estimating the wall height of the network potential energy
between metastable states. In addition, diluting connections in the network was
shown to modify the structure of the network attractor so that the dynamics transition
of neuronal activities occurred, and it was similar to that induced by the global
inhibition. This result generalizes the conditions for generating the 1/f fluctuations and the
dynamics transition of neuronal activities. Accordingly, one of the essential factors for
inducing the dynamics transition is suggested to be the balance between the external
inputs, such as the global inhibition and the noise, and the inputs from the other
interconnected neurons (i.e., network inputs). In other words, when the network inputs
and the external biasing inputs are comparable, the metastability of the network
attractor distinctly appears: the 1/f fluctuation of neuronal activities emerges. In contrast,
when the external biasing input exceeds the network input, a specific state such as the
"0" state is highly attractive: the neuronal activities exhibit a flat PSD. Nevertheless, a
large external noise would control the network dynamics with its correlation structure.
According to the simulation results, mechanisms underlying the actual neuronal
dynamics transition might be anticipated as follows. During SWS, neurons receive
stronger inhibitory inputs and/or less input magnitude from interconnected neurons
than in REM sleep. In contrast, during REM sleep, neurons are released from inhibi-
tory inputs and/or receive comparable input magnitude from interconnected neurons
with the inhibition and the noise.
The serotoninergic system is suggested as a possible candidate for the globally
working inhibitory system. This is based on neuropharmacological results [5,6].
ACKNOWLEDGMENTS
REFERENCES
[1] C. Smith, Sleep states and memory processes. Behav. Brain Res. 69: 137-145, 1995.
[2] J. Antrobus, Dream theory 1997: Toward a computational neurocognitive model. Sleep Res.
Soc. Bull. 3: 5-10, 1997.
[3] F. Crick and G. Mitchison, The function of dream sleep. Nature, 304: 111-114, 1983.
[4] M. Yamamoto, H. Nakahama, K. Shima, T. Kodama, and H. Mushiake, Markov-depen-
dency and spectral analyses on spike counts in mesencephalic reticular neurons during sleep
and attentive states. Brain Res. 366: 279-289, 1986.
[5] T. Kodama, H. Mushiake, K. Shima, H. Nakahama, and M. Yamamoto, Slow fluctuations
of single unit activities of hippocampal and thalamic neurons in cats. I. Relation to natural
sleep and alert states. Brain Res. 487: 26-34, 1989.
[6] H. Mushiake, T. Kodama, K. Shima, M. Yamamoto, and H. Nakahama, Fluctuations in
spontaneous discharge of hippocampal theta cells during sleep-waking states and PCPA-
induced insomnia. J. Neurophysiol. 60: 925-939, 1988.
[7] M. Yamamoto, M. Nakao, T. Kodama, A. Hanzawa, K. Nakamura, and N. Katayama,
Dynamics transition of cortical neuronal activity between REM sleep and slow wave sleep.
Abstract 4th IBRO Congress Neuroscience, D11.2, p. 404, 1995.
[8] M. Nakao, T. Takahashi, Y. Mizutani, and M. Yamamoto, Simulation on dynamics transi-
tion in neuronal activity during sleep cycle by using asynchronous and symmetry neural
network model. Biol. Cybern. 63: 243-250, 1990.
[9] M. Nakao, K. Watanabe, T. Takahashi, Y. Mizutani, and M. Yamamoto, Structural prop-
erties of network attractor associated with neuronal dynamics transition. Proceedings
IJCNN, pp. 529-534, Baltimore, 1992.
[10] M. Nakao, K. Watanabe, Y. Mizutani, and M. Yamamoto, Metastability of network
attractor and dream sleep. Proceedings ICANN, pp. 27-30, Amsterdam, 1993.
[11] M. Nakao, I. Honda, M. Musila, and M. Yamamoto, Metastable behavior of neural net-
work under correlated random perturbations. Proceedings ICONIP, pp. 1692-1697, Seoul,
1994.
[12] M. Nakao, I. Honda, M. Musila, and M. Yamamoto, Metastable associative network
models of dream sleep. Neural Networks, 10: 1289-1302, 1997.
[13] M. Yamamoto, 1/f fluctuations observed in single central neurons during REM sleep. In
Physics of the Living State, T. Musha and Y. Sawada, eds. pp. 211-222. Tokyo: Ohmsha,
1994.
[14] M. Yamamoto, H. Nakahama, K. Shima, K. Aya, T. Kodama, H. Mushiake, and M. Inase,
Neuronal activities during paradoxical sleep. Adv. Neurol. Sci. 30: 1010-1022, 1986.
[15] M. Yamamoto, M. Nakao, Y. Mizutani, T. Takahashi, and K. Watanabe, Pharmacological
and model-based interpretation of neuronal dynamics transitions during sleep-waking cycle.
Method Inform. Med. 33: 125-128, 1994.
[16] M. Yamamoto, M. Nakao, Y. Mizutani, and T. Kodama, Dynamic properties in time series
of single neuronal activity during sleep. Adv. Neurol. Sci. 39: 29-40, 1995.
[17] K. Takahashi, Measurement of single neuronal activities in the cat's brain and their
dynamics analysis. Master's thesis, Graduate School of Information Science, Tohoku
University, 1997.
[18] M. Yamamoto, M. Nakao, and T. Kodama, A possible mechanism of dynamics-transition
of central single neuronal activity during sleep. In Sleep and Sleep Disorders: From Molecule
to Behavior. O. Hayaishi and S. Inoue, eds. pp. 81-95, Tokyo: Academic Press, 1997.
[19] D. McGinty and R. M. Harper, Dorsal raphe neurons: Depression of firing during sleep in
cats. Brain Res. 101: 569-575, 1976.
[20] K. Shima, H. Nakahama, and M. Yamamoto, Firing properties of two types of nucleus
raphe dorsalis during the sleep-waking cycle and their responses to sensory stimuli. Brain
Res. 399: 317-326, 1986.
[21] C. M. Portas and R. W. McCarley, Behavioral state-related changes of extracellular ser-
otonin concentration in the dorsal raphe nucleus: A microdialysis study in the freely moving
cat. Brain Res. 648: 306-312, 1994.
[22] H. Iwakiri, K. Matsuyama, and S. Mori, Extracellular levels of serotonin in the medial
pontine reticular formation in relation to sleep-wake cycle in cats: a microdialysis study.
Neurosci. Res. 18: 157-170, 1993.
[23] T. Kodama, Y. Takahashi, and Y. Honda, Enhancement of acetylcholine release during
paradoxical sleep in the dorsal tegmental field of the cat brain stem. Neurosci. Lett. 114:
227-282, 1990.
[24] T. Kodama, H. Mushiake, K. Shima, T. Hayashi, and M. Yamamoto, Slow fluctuations of
single unit activities of hippocampal and thalamic neurons in cat. II. Role of serotonin on
the stability of neuronal activities. Brain. Res. 487: 35-44, 1989.
[25] M. Yamamoto, H. Arai, T. Takahashi, N. Sasaki, M. Nakao, Y. Mizutani, and T. Kodama,
Pharmacological basis of 1/f fluctuations of neuronal activities during REM sleep. Sleep
Res. 22: 458, 1993.
[26] Y. Koyama and Y. Kayama, Properties of neurons participating in regulation of sleep and
wakefulness. Adv. Neurol. Sci. 39: 29-40, 1995.
[27] D. A. McCormick and D. A. Price, Actions of acetylcholine in the guinea-pig and cat medial
and lateral geniculate nuclei, in vitro. J. Physiol. 392: 147-165, 1987.
[28] M. Stewart and S. E. Fox, Do septal neurons pace the hippocampal theta rhythm? Trends
Neurosci. 13: 163-168, 1990.
[29] T. M. McKenna, J. H. Ashe, G. K. Hui, and N. W. Weinberger, Muscarinic agonists
modulate spontaneous and evoked unit discharge in auditory cortex. Synapse 2: 54-68,1988.
[30] H. Sato, Y. Hata, H. Masui, and T. Tsumoto, A functional role of cholinergic innervation to
neurons in the cat visual cortex. J. Neurophysiol. 58: 765-780, 1987.
[31] H. Sei, K. Sakai, M. Yamamoto, and M. Jouvet, Spectral analyses of PGO-on neurons
during paradoxical sleep in freely moving cats. Brain Res. 612: 351-353, 1993.
[32] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Natl. Acad. Sci. USA 79: 2554-2558, 1982.
[33] D. J. Amit, Modeling Brain Function. Cambridge: Cambridge University Press, 1989.
[34] A. R. Bulsara and F. E. Ross, Cooperative stochastic processes in reduced neuron models.
Proceedings International Conference on Noise in Physical Systems and 1/f Fluctuations, pp.
621-627, 1991.
[35] F. Moss and P. V. E. McClintock, eds., Noise in Nonlinear Dynamical Systems, Vols. 1, 2,
and 3. Cambridge: Cambridge University Press, 1989.
[36] B. L. Jacobs, Overview of the activity of brain monoaminergic neurons across the sleep-
wake cycle. In Sleep: Neurotransmitters and Neuromodulators, A. Wauquier, J. M. Monti, J.
M. Gaillard, and M. Radulovacki, eds., pp. 1-14. New York: Raven Press, 1985.
1. INTRODUCTION
For most analytical instrumentation for chemometric applications, spectral signals can
be generated by electromagnetic energy, including X-ray, ultraviolet, visible, infrared,
microwave, electron spin resonance, and nuclear magnetic resonance [1]. Spectral sig-
nals thus represent the perturbation response of the probed system. This provides a
means of identifying the system characteristics and modeling the possible responses. In
theory, it can fingerprint the constituent molecules in both qualitative and quantitative
ways. In order to apply these principles successfully to biomedical and clinical research,
one often needs to employ some spectral signal processing techniques for smoothing,
fitting, and/or extracting the signal for quantitative interpretation [2-4]. This has been
an intensive research field over the past few decades. The name chemometrics usually
refers to using linear algebra calculation methods to make either quantitative or qua-
litative measurements of chemical data, primarily spectra. Classical methods for spec-
tral signal measurement include least-squares regression (LSR), principal component
regression (PCR), and partial least squares (PLS), which can be found in daily usage in
spectral analysis. However, the possible nonlinear effect of multiple substances and
interference still poses technical challenges for performance. Both PLS and PCR are
decomposition techniques for multivariate spectral measurement. PLS uses the concen-
tration information (expected pattern) for decomposition processes. It takes advantage
of the correlation between the spectral data and the constituent concentrations. This
causes spectra containing higher constituent concentrations to be weighted more heav-
ily than those with low concentrations. The resulting spectral vectors are directly related
to the constituents of interest.
Artificial neural networks have been successfully applied to many engineering
fields [5-7]. There are more than 50 different types of network architecture and a
number of different input-output transfer functions in the literature. Neural networks
are often used as system modeling or identification methods to characterize system
properties [8]. The capabilities of nonlinear handling and adaptation for better perfor-
mance have been the advantages of this method. With a given set of input-output data
(often limited in biomedical applications), the model structure needs to approximate the
system characteristics with acceptable accuracy. Among these different structures, the
back-propagation (BP) model is probably the most popular one and claims many
successful applications [9-12]. In general, the procedures can be divided into a training
phase and an evaluating phase. During the training phase, the sum of squared errors
is minimized by using a generalized delta learning rule, which theoretically can
achieve any level of accuracy. The converged weight matrix can thus contain informa-
tion on maximum variations within the input patterns. It can then be used as a feed-
forward network for the evaluating phase. Presumably for the same category, one can
quantify unknown or untrained patterns by using the convergent weight matrix. The
advantage of this method is easy implementation. However, it is also known for its slow
convergence and limited capacity for training patterns. The radial basis function (RBF)
has its origins in techniques used for interpolation in multidimensional space [13]. The
implementation with a special two-layered architecture enables the nonlinear transfor-
mation of input space to a compacted output space. With a linear combiner on this new
space, one can modify the connection weight matrix from hidden to output layers by
using a traditional linear least-squares regression. The optimally chosen center and
width of the hidden units and the symmetrical transfer function can provide a smooth
interpolation of scattered data in arbitrary dimension to the desired accuracy [14]. This
model has been used in various areas such as speech recognition, data classification, and
time series prediction [15-19]. It has also been proposed to play an important role in the
pattern recognition capability of a neuronal signal processing scheme [20].
In this chapter, multivariate analysis methods (BP, RBF, and PLS) are adopted to
measure glucose concentrations from near-infrared spectra, and their performance is
compared. To facilitate the instrumentation development, all three methods were
developed by using MATLAB and then implemented by using LabVIEW 4.0.1 (National
Instruments Inc.). The comparison of performance will be given in the discussion section
according to the simulation results from glucose spectra. Part of this chapter has been
presented in a conference [21].
2. METHODS
Both X and Y are mean centered to enhance the differences between concentration
and sample responses.
During the calibration phase, one needs to decompose and calculate the T, P, and
Q matrices from the known spectra (X) and concentration (Y) matrices by the
least-squares method. Each component of the T and P matrices is calculated by
subsequently removing the contribution of the previous spectral vector. The optimal
number of factors (a) is determined by calculating the prediction residual error sum of
squares (PRESS) with a calibration and validation set and cross-validation method.
In general, the smaller the PRESS value, the better the model can predict the concen-
trations of calibrated constituents. One can then use the convergent matrices P and Q to
calculate the calibration model for the prediction phase. With P orthogonal, the cali-
bration equation can be shown as follows:
Ŷ = XPᵀQ     (3)
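The calibration step can be sketched as the textbook NIPALS algorithm for a single constituent (PLS1). The following is an illustration of the standard algorithm, not the chapter's MATLAB/LabVIEW implementation; its scores, X-loadings, and y-loadings correspond to the T, P, and Q matrices above.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """NIPALS PLS1 calibration: decompose mean-centered spectra X and
    concentrations y factor by factor, then collapse the weights and
    loadings into a single regression vector for prediction.

    Returns (b, x_mean, y_mean) so that y_hat = (X - x_mean) @ b + y_mean.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # mean centering, as above
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)               # weight vector
        t = Xr @ w                           # score
        tt = t @ t
        p = Xr.T @ t / tt                    # X-loading
        q_a = yr @ t / tt                    # y-loading
        Xr = Xr - np.outer(t, p)             # deflate spectra
        yr = yr - q_a * t                    # deflate concentrations
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)      # regression vector
    return b, x_mean, y_mean

# Synthetic check: concentrations linearly encoded in noisy spectra.
rng = np.random.default_rng(0)
conc = rng.uniform(0, 10, 40)
basis = rng.normal(size=100)
X = np.outer(conc, basis) + rng.normal(scale=0.01, size=(40, 100))
b, xm, ym = pls1_fit(X, conc, n_factors=2)
pred = (X - xm) @ b + ym
print(f"rms error {np.sqrt(np.mean((pred - conc) ** 2)):.4f}")
```

In practice the number of factors would be chosen by minimizing PRESS over a validation or cross-validation set, as described above.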
The transfer function (F) can be a linear, sigmoid, or hyperbolic tangent function.
The back-propagation error is calculated according to the generalized delta rule
(GDR). The connection weight matrices are set to random values near zero before
the beginning of the training process. During the training epochs, these weights are
adjusted with respect to the difference between the desired and actual output patterns of
the network. With a predefined threshold, learning rate, hidden node numbers, and
maximum epoch, the network will eventually reach a stable condition, which we assume
convergence. The convergent state corresponds to a local minimum instead of a global
one in terms of the error energy. The convergent weight matrices can reflect the max-
imum variations within the input patterns. It has the form of eigenvector decomposition
in the eigenvalue order for linear neurons [22,23]. An adaptive learning rate and
momentum term can be applied to increase the convergent speed during the training
phase. After convergence, the network can be used as a feed-forward network for the
prediction of untrained patterns. The output value of the output layer is then calculated
by the binary weighting for quantitative results.
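The training loop just described can be sketched as follows, assuming a single hidden layer, sigmoid hidden units, a linear output unit, and full-batch updates with a momentum term; the hyperparameters and names are illustrative choices, not the chapter's settings.

```python
import numpy as np

def train_bp(X, y, n_hidden=10, lr=0.1, momentum=0.5, epochs=30000, seed=0):
    """Generalized-delta-rule training of a one-hidden-layer network:
    sigmoid hidden units, one linear output unit, small random initial
    weights, full-batch gradient descent with a momentum term.
    Returns a predict(X) function built from the converged weights.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.5, n_hidden);               b2 = 0.0
    vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
    vW2 = np.zeros_like(W2); vb2 = 0.0
    n = len(y)
    for _ in range(epochs):
        h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))     # sigmoid hidden layer
        out = h @ W2 + b2                            # linear output unit
        err = out - y                                # output-layer delta
        gW2 = h.T @ err / n; gb2 = err.mean()
        dh = np.outer(err, W2) * h * (1.0 - h)       # back-propagated delta
        gW1 = X.T @ dh / n; gb1 = dh.mean(axis=0)
        vW2 = momentum * vW2 - lr * gW2; W2 += vW2   # momentum updates
        vb2 = momentum * vb2 - lr * gb2; b2 += vb2
        vW1 = momentum * vW1 - lr * gW1; W1 += vW1
        vb1 = momentum * vb1 - lr * gb1; b1 += vb1
    return lambda Xn: 1.0 / (1.0 + np.exp(-(Xn @ W1 + b1))) @ W2 + b2

# Toy usage: learn a smooth nonlinear target on [-1, 1].
X = np.linspace(-1, 1, 40).reshape(-1, 1)
predict = train_bp(X, X[:, 0] ** 2)
```

As noted in the text, such a run converges to a local minimum of the error energy; the momentum term only speeds convergence, it does not change the fixed points.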
Y = Σᵢ₌₁ᵐ K(‖X − cᵢ‖) Wᵢ     (5)
where Y is the output vector, X is the input matrix, cᵢ is the RBF center of the ith node,
m is the total number of centers, Wᵢ is the connection weight between the output node and
the hidden nodes, K(·) is the common radially symmetric kernel function with nonlinearity,
and ‖·‖ denotes the Euclidean distance between X and cᵢ. It is known that the choice
of kernel function is not critical to the performance of RBF networks. Functions often
used include the thin-plate spline function, Gaussian function, multiquadric function,
and the inverse multiquadric function. Gaussian functions (e^(−x²)) were used through-
out this work. The spreading width (σ) is preset to a default value, and the optimal value
is obtained by exhaustive search in the LabVIEW program. The number of centers is
calculated according to the orthogonal least square (OLS) method [24]. In brief, the
RBF can be viewed as a special case of the linear regression model and expressed as
follows:
d(t) = Σᵢ₌₁ᵐ pᵢ(t) θᵢ + e(t)     (6)
The pᵢ(t) are known as the regressors, which can correspond to a radial basis
function with fixed center cᵢ as in Eq. 5, and the θᵢ are the parameters. The OLS method
involves the transformation of the set pᵢ into a set of orthogonal basis vectors, P = ΩA,
where A is an M × M upper triangular matrix with 1's on the diagonal and Ω is an
N × M matrix with orthogonal columns ωᵢ. The often ill-conditioned information
matrix (P) can be decomposed by the modified Gram-Schmidt (MGS) method. By
monitoring the error reduction ratio due to ωᵢ and a chosen tolerance ρ, one can find
a subset of significant regressors in a forward-regression manner [25,26].
[err]ᵢ = gᵢ² ωᵢᵀωᵢ / (dᵀd)     (7)

1 − Σᵢ [err]ᵢ < ρ     (8)
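Eqs. 7 and 8 translate into a short forward-selection loop. The sketch below is an illustrative implementation that uses classical Gram–Schmidt for clarity (the chapter uses the numerically safer modified Gram–Schmidt); all names are mine.

```python
import numpy as np

def ols_select_centers(P, d, rho=0.01):
    """Forward selection of significant regressors by the OLS error
    reduction ratio of Eqs. 7-8. P is the N x M matrix of candidate
    regressors evaluated on the training set; d is the desired output.
    Each step orthogonalizes every remaining candidate against the
    already-selected basis, computes [err]_i = g_i**2 (w.w) / (d.d),
    keeps the best one, and stops once 1 - sum([err]) < rho.
    """
    N, M = P.shape
    d = np.asarray(d, dtype=float)
    dd = d @ d
    selected, basis, err_total = [], [], 0.0
    while err_total < 1.0 - rho and len(selected) < M:
        best, best_err, best_w = None, 0.0, None
        for i in range(M):
            if i in selected:
                continue
            w = P[:, i].astype(float).copy()
            for u in basis:                   # Gram-Schmidt against basis
                w -= (u @ w) / (u @ u) * u
            ww = w @ w
            if ww < 1e-12:                    # numerically dependent column
                continue
            g = (w @ d) / ww
            err = g * g * ww / dd             # error reduction ratio, Eq. 7
            if err > best_err:
                best, best_err, best_w = i, err, w
        if best is None:
            break
        selected.append(best); basis.append(best_w); err_total += best_err
    return selected

# Toy usage: Gaussian regressors centered on the training points.
x = np.linspace(0, 1, 30)
P = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1 ** 2)
d = np.sin(2 * np.pi * x)
centers = ols_select_centers(P, d, rho=0.01)
```

The selected columns then define the RBF centers, and the weights follow from an ordinary linear least-squares fit on those columns, as described in the text.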
Figure 2 The graphical user interface (GUI) of the simulation system for PLS, BP,
and RBF in the LabVIEW environment. The upper panel shows the near-
infrared (NIR) spectra of glucose solutions with different concentrations.
(YSI-1500, Yellow Springs) before the measurement and subsequently used in the
neural network training and evaluating procedures as the standard. The absorption
spectra were calculated by reference to air, deionized water, or an absorption film. The
presented data are taken from the air reference. The acquired spectral data were saved
in ASCII format for file retrieval in off-line analysis. Due to the strong absorption of
water signals within the spectral range, different preprocessing methods were used to
change the scale of spectra for input to both PLS and neural network models. These
include the optical density on a logarithmic scale, a linear scale, and the Euclidean norm.
Simulation results indicated the significant performance improvement of the
Euclidean norm. The total number of spectra was divided into odd and even number
groups for the training and evaluating procedures. In cross-validation, the left-one-out
spectrum was used to evaluate the performance of trained networks.
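In outline, the Euclidean-norm scaling and the odd/even split described above might look like this (an illustrative Python sketch; the array names are assumptions):

```python
import numpy as np

def euclidean_norm_scale(spectra):
    """Scale every spectrum (one row per sample) to unit Euclidean norm,
    the preprocessing that performed best in the simulations."""
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / norms

def odd_even_split(spectra, concentrations):
    """Split the samples into odd- and even-indexed groups for the
    training and evaluation procedures."""
    train = spectra[0::2], concentrations[0::2]
    test = spectra[1::2], concentrations[1::2]
    return train, test
```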
3. RESULTS
The glucose spectra have typical water absorption peaks near 970, 1250, and
1450 nm. The strong absorption band due to the presence of water is about three orders
of magnitude greater than the glucose absorption with the current measurement con-
figuration. To account for the dramatic difference between glucose and water absorp-
tion and to increase performance, we manipulated the original spectral data with
different scaling and normalization procedures. All of the methods mentioned have
been implemented in the LabVIEW 4.0.1 window environment. The graphical user
interface (GUI) of the working system is shown in Figure 2. The system has been tested
with a simulated spectrum formed by linear and nonlinear combinations of two normal
distribution curves with different centers and widths. The resultant root mean square of the
residual error is about 0.11, which indicates that the accuracy of the prediction is
applicable to real data.
3.1. PLS
After activating the PLS, a suitable wavelength range can be interactively
selected as the input spectra. The optimal factor number is determined by the PRESS
value when it first reaches baseline, as shown at the bottom of Figure 3. For the
glucose data set, the average factor number is around 6. The convergence speed is faster
than with BP. As shown in the graph for cross-validation, the performance can be
evaluated by the total sum of squared errors of the prediction versus the true value from
the glucose analyzer. This standard method gives very stable results for the presented
data scheme.
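The PRESS-based choice of factor number can be sketched generically as follows (a minimal NIPALS-style PLS1 and a leave-one-out PRESS loop; this is an illustration, not the LabVIEW routine):

```python
import numpy as np

def pls1_fit(X, y, nfac):
    """Minimal PLS1 (NIPALS): returns regression coefficients b for
    centered data after extracting nfac factors."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    W, T, Pl, q = [], [], [], []
    for _ in range(nfac):
        w = X.T @ y
        w /= np.linalg.norm(w)        # weight vector
        t = X @ w                     # score
        p = X.T @ t / (t @ t)         # loading
        qk = (y @ t) / (t @ t)
        X = X - np.outer(t, p)        # deflate X
        y = y - qk * t                # deflate y
        W.append(w); Pl.append(p); q.append(qk)
    W, Pl, q = np.array(W).T, np.array(Pl).T, np.array(q)
    return W @ np.linalg.inv(Pl.T @ W) @ q

def press(X, y, nfac):
    """Leave-one-out PRESS for a given number of PLS factors."""
    n = len(y)
    s = 0.0
    for i in range(n):
        idx = np.arange(n) != i
        Xc, yc = X[idx], y[idx]
        b = pls1_fit(Xc, yc, nfac)
        yhat = yc.mean() + (X[i] - Xc.mean(axis=0)) @ b
        s += (y[i] - yhat) ** 2
    return s
```

The optimal factor number is then the smallest one at which PRESS levels off, as described in the text.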
3.2. BP
Generally, the system can have better performance with a Euclidean norm applied
to the input spectra. In the training phase, we can modify the learning rate to view the
resultant error graphically. The sum-squared error is monitored and compared to the
threshold value for convergence. The spectra and concentration vectors are presented to
the networks in batch mode. The resultant convergent weight matrices are used to
evaluate the untrained spectra. The prediction value is then plotted against the calibra-
tion value to visualize the difference as shown in Figure 4.
Figure 3 The layout of PLS simulation. The region of interest can be selected with
cursors to forward to the PLS processing routine. The bottom panel shows
the PRESS value versus the number of factors and the predictions versus
expected value. The RMSE value of the corresponding chosen factors is
shown on the graph.
3.3. RBF
In the RBF section, the error reduction ratio is calculated and compared to find the
significant regressors in an orderly manner. The convergence status is monitored by the
residual of one minus the total sum of the error reduction ratios, with the threshold
set by the tolerance ρ. The system performance is quite sensitive to the value of the spreading
factor, which can be controlled through the front panel. The resultant prediction error is
plotted against the calibration value and visualized in Figure 5.
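The role of the spreading factor can be illustrated with a minimal Gaussian RBF fit (a generic sketch; placing a center on every training point and solving by least squares is a simplification of the OLS selection described above):

```python
import numpy as np

def rbf_design(X, centers, spread):
    """Gaussian radial basis regressors p_i(x) = exp(-||x - c_i||^2 / spread^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / spread ** 2)

def rbf_fit_predict(Xtrain, ytrain, Xtest, spread):
    """Least-squares fit of the output weights with all training points as
    centers; the spread controls how local each basis function is."""
    P = rbf_design(Xtrain, Xtrain, spread)
    theta, *_ = np.linalg.lstsq(P, ytrain, rcond=None)
    return rbf_design(Xtest, Xtrain, spread) @ theta
```

Too small a spread yields spiky, poorly generalizing fits; too large a spread makes the design matrix nearly singular, which is the sensitivity noted above.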
The simulation results for PLS, BP, and RBF with the glucose near-infrared spec-
tra are listed in Table 1 for comparison. Due to the large number of samples in cross-
validation, the listed value for BP is the result obtained with a different maximal epoch
number as indicated in a footnote.
4. DISCUSSION
The chemometric measurement of optical spectra has extensive usage in both research
and clinical applications. The effectiveness of this analytical method is still an active
research area. There has been a recent surge in application for biomedical spectral
Figure 4 The layout of BP simulation. The region of interest can be selected with
cursors to forward to the BP processing routine. The bottom panel shows the
PRESS value versus the number of hidden neurons and the prediction
versus expected value. The RMSE value of the corresponding chosen fac-
tors is shown on the graph.
Table 1

PLS
  Origin (OD)       32.53    33.53    25.82
  1 − exp(−OD)      19.15    21.20    42.31
  Euclidean norm    23.01    25.49    65.36
BP†
  Origin (OD)       28.28    25.67    35.54
  1 − exp(−OD)      57.31   179.38    66.02
  Euclidean norm    24.41    25.49    65.36
Figure 5 The layout of RBF simulation. The region of interest can be selected with cursors
to forward to the RBF processing routine. The bottom panel shows the PRESS
value versus the number of hidden nodes and the prediction versus expected value.
The RMSE value of the corresponding chosen factors is shown on the graph.
case. These two methods use Cartesian mapping for the input, and both have the
disadvantage of slow convergence. BP uses a sigmoid function for the nonlinearity
and GDR for the weight modification. It is liable to be trapped in local minima. PLS is
ideal for the linear case and results in the global minimum. RBF provides better results
in our current simulation in both convergence speed and resultant prediction accuracy.
Above all, its performance does not degrade with increasing sample size. This can be
important for practical applications, where the calibration data set will accumulate to
account for a wider range of interest and the group average effect.
The MATLAB files for the PLS, BP, and RBF algorithms are as follows:
Bp.m

% INITFF  - Initializes a feed-forward network.
% TRAINBP - Trains a feed-forward network with back-propagation.
% SIMUFF  - Simulates a feed-forward network.
% FUNCTION APPROXIMATION WITH TANSIG/PURELIN NETWORK:
% Using the above functions a two-layer network is trained
% to respond to specific inputs with target outputs.
% ---------------------------------------------------------------
% Input data
%   spectrum:      spectral data (calibration)
%   concentration: concentration data (calibration)
%   xpre and ypre are the prediction data set.
% ---------------------------------------------------------------
clear
close('all')
load c:\spectrum.txt;
load c:\concentration.txt;
xcal=spectrum;
ycal=concentration;
load c:\spectrum1.txt;
load c:\concentration1.txt;
xpre=spectrum1;
ypre=concentration1;
% Training parameters: display frequency (df), maximum epochs (me),
% error goal (eg), and learning rate (lr) must be set beforehand,
% and w1, b1, w2, b2 initialized with INITFF.
tp = [df me eg lr];
[w1,b1,w2,b2,ep,tr] = trainbp(w1,b1,'tansig',w2,b2,'purelin',xcal,ycal,tp);
ploterr(tr,eg);
% ---------------------------------------------------------------
% Prediction
% ---------------------------------------------------------------
a = simuff(xpre,w1,b1,'tansig',w2,b2,'purelin');
plot(ypre,a);
% The result is fairly close. Training to a lower error
% goal would result in a closer approximation.
echo off
Rb.m

% SOLVERB - Designs a radial basis network.
% SIMURB  - Simulates a radial basis network.
% SUPERFAST FUNCTION APPROXIMATION WITH RADIAL BASIS NETWORKS:
% Using the above functions a radial basis network is trained
% to respond to specific inputs with target outputs.
% ---------------------------------------------------------------
% Input data
%   spectrum:      spectral data (calibration)
%   concentration: concentration data (calibration)
%   xpre and ypre are the prediction data set.
% ---------------------------------------------------------------
clear
close('all')
load c:\spectrum.txt;
load c:\concentration.txt;
xcal=spectrum;
ycal=concentration;
load c:\spectrum1.txt;
load c:\concentration1.txt;
xpre=spectrum1;
ypre=concentration1;
% ---------------------------------------------------------------
% TRAINING THE NETWORK
% ---------------------------------------------------------------
% Design parameters: display frequency (df), error goal (eg), and
% spread constant (sc) must be set beforehand.
tp = [df eg sc];
[w1,b1,w2,b2,nr,tr] = solverb(xcal,ycal,tp);
% SOLVERB has returned weight and bias values, the number
% of neurons required NR, and a record of training errors TR.
% ---------------------------------------------------------------
% PLOTTING THE ERROR CURVE
% ---------------------------------------------------------------
ploterr(tr,eg);
% ---------------------------------------------------------------
% Prediction
% ---------------------------------------------------------------
a = simurb(xpre,w1,b1,w2,b2);
plot(ypre,a);
echo off
PLS.m

% ---------------------------------------------------------------
% Input data
%   spectrum:      spectral data
%   concentration: concentration data
% ---------------------------------------------------------------
clear
close('all')
load c:\spectrum.txt;
load c:\concentration.txt;
xcal=spectrum;
ycal=concentration;
ycplall=[];
yplall=[];
Ecplall=[];
ycpall=[];
ypall=[];
Ecpall=[];
[amax,fn]=size(xcal);
% ---------------------------------------------------------------
% Data arrangement for cross-validation
%   xc: spectra of calibration
%   yc: concentration of calibration
%   xp: spectra of prediction
%   yp: concentration of prediction
% ---------------------------------------------------------------
for turn=1:amax
xp=[];
xc=[];
yp=[];
yc=[];
W=[];
T=[];
if turn==1
xp=xcal(turn,:);
xc=xcal(turn+1:amax,:);
yp=ycal(turn,:);
yc=ycal(turn+1:amax,:);
elseif turn==amax
xp=xcal(turn,:);
xc=xcal(1:turn-1,:);
yp=ycal(turn,:);
yc=ycal(1:turn-1,:);
else
xp=xcal(turn,:);
xc=[xcal(1:turn-1,:);xcal(turn+1:amax,:)];
yp=ycal(turn,:);
yc=[ycal(1:turn-1,:);ycal(turn+1:amax,:)];
end
[n,nn]=size(xc);
nl=min(n,nn);
% ---------------------------------------------------------------
% Center spectra and concentration
% ---------------------------------------------------------------
xmean=ones(n,1)*mean(xc);
ymean=ones(n,1)*mean(yc);
[np2,nnp2]=size(xp);
xmean2=ones(np2,1)*mean(xc);
ymean2=ones(np2,1)*mean(yc);
x=xc-xmean;
yi=yc-ymean;
y=yi;
for a=1:nl
c=1/sqrt(y'*x*x'*y);
w=c*x'*y;
W=[W,w];
t=x*w;
T=[T,t];
Q=inv(T'*T)*T'*yi;
E=x-t*w';
F=yi-T*Q;
x=E;
y=F;
b=W*Q;
b0=ymean2-xmean2*b;
% ---------------------------------------------------------------
% Prediction of y for new spectra
% ---------------------------------------------------------------
ycp=b0+xp*b;
Ecpl=yp-ycp;
ycplall=[ycplall,ycp];
yplall=[yplall,yp];
Ecplall=[Ecplall,Ecpl];
end
ycpall=[ycpall;ycplall];
ycplall=[];
ypall=[ypall;yplall];
yplall=[];
Ecpall=[Ecpall;Ecplall];
Ecplall=[];
end
for i=1:nl
msecp(i)=sqrt(Ecpall(:,i)'*Ecpall(:,i)/amax);
end
[aa,a]=min(msecp)
ypresent=[ypall(:,a),ycpall(:,a)];
[ypsl,yps2]=size(ypresent);
% ---------------------------------------------------------------
% Plot predicted values and RMSECV
% ---------------------------------------------------------------
figure(1)
plot(ypresent(:,1),ypresent(:,2),'o')
title('Glucose prediction')
xlabel('Actual concentration (mg/dL)');
ylabel('Predicted concentration (mg/dL)');
figure(2)
plot(msecp(1:20),'o');
title('Linear RMSECV')
xlabel('Dimension');
ylabel('RMSECV (mg/dL)');
ACKNOWLEDGMENTS
REFERENCES
[1] D. A. Skoog, Principles of Instrumental Analysis, Holt, Rinehart and Winston: CBS College
Publishing, 1995.
[2] E. V. Thomas and D. M. Haaland, Comparison of multivariate calibration methods for
quantitative spectral analysis. Anal. Chem. 62: 1091-1099, 1990.
[3] W. E. Blass and G. W. Halsey, Deconvolution of Absorption Spectra, New York: Academic
Press, 1981.
[4] Y.-Z. Liang, Y.-L. Xie, and R.-Q. Yu, Accuracy criteria and optimal wavelength selection
for multicomponent spectrophotometric determinations. Anal. Chim. Acta 222: 347-357,
1989.
[5] P. J. Gemperline, J. R. Long, and V. G. Gregoriou, Nonlinear multivariate calibration using
principal components regression and artificial neural networks. Anal. Chem. 63: 2313-2323,
1991.
[6] T. B. Blank and S. D. Brown, Data processing using neural networks, Anal. Chim. Acta 277:
273-287, 1993.
[7] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Natl. Acad. Sci. USA 79: 2554-2558, 1982.
[8] S. Grossberg, Nonlinear neural networks: Principle, mechanisms, and architectures. Neural
Networks 1: 17-61, 1988.
[9] P. A. Jansson, Neural networks: An overview. Anal. Chem. 63: 357A-362A, 1991.
[10] B. J. Wythoff, S. P. Levine, and S. A. Tomellini, Spectral peak verification and recognition
using a multilayered neural network. Anal. Chem. 62: 2702-2709, 1990.
[11] C.-W. Lin, J. C. LaManna, and Y. Takefuji, Quantitative measurement of two-component
pH-sensitive colorimetric spectra using multilayer neural networks. Biol. Cyber. 67: 303-308,
1992.
[12] J.-J. Weim, C.-W. Lin, T.-S. Kuo, T. Kao and C.-Y. Wang, A quantitative neural networks
system for glucose concentration measurement. Chin. J. Med. Bio. Eng. 15: 59-72, 1995.
[13] M. J. D. Powell, Radial basis functions for multivariable interpolation: A review. In
Algorithms for the Approximation of Function and Data. J. C. Mason and M. G. Cox,
eds., pp. 143-167. New York: Chapman & Hall, 1990.
[14] J. Park and I. W. Sandberg, Approximation and radial-basis-function networks. Neural
Comput. 5: 305-316, 1993.
[15] M. Casdagli, Nonlinear prediction of chaotic time series. Physica D 35: 335-356, 1989.
[16] J. C. Carr, W. R. Fright, and R. K. Beatson, Surface interpolation with radial basis func-
tions for medical imaging. IEEE Trans. Med. Imaging 16: 96-107, 1997.
[17] N. Donaldson, H. De, K. Gollee, J. Hunt, J. Jarvis, and M. K. Kwende, A radial function
model of muscle stimulated with irregular inter-pulse intervals. Med. Eng. Phys. 17:431-441,
1995.
[18] J. Holzfuss and J. Kadtke, Global nonlinear noise reduction using radial basis function. Int.
J. Bifurcation Chaos 3: 589-596, 1993.
[19] S. Lowes and J. M. Shippen, A diagnostic system for industrial fans. Measurement Control
30: 9-13, 1997.
[20] J. J. Hopfield, Pattern recognition computation using action potential timing for stimulus
representation. Nature 376: 33-36, 1995.
[21] C.-W. Lin, T.-C. Hsiao, M.-T. Zeng, and H.-H. Chiang, Quantitative multivariate analysis
with artificial neural networks. Second International Conference on Bioelectromagnetism, pp.
59-60, Melbourne, Australia, 1998.
[22] E. Oja, A simplified neuron model as a principal component analyzer. J. Math. Biol. 15:
267-273, 1982.
[23] T. D. Sanger, Optimal unsupervised learning in a single-layer linear feedforward neural
network. Neural Networks 2: 459-473, 1989.
[24] S. Chen, S. A. Billings, and W. Luo, Orthogonal least squares methods and their application
to non-linear system identification. Int. J. Control 50: 1873-1896, 1989.
[25] S. Chen, S. A. Billings, C. F. N. Cowan, and P. M. Grant, Non-linear systems identification
using radial basis functions. Int. J. Syst. Sci. 21: 2513-2539, 1990.
[26] S. Chen, C. F. N. Cowan, and P. M. Grant, Orthogonal least squares learning algo-
rithm for radial basis function networks. IEEE Trans. Neural Networks 2: 302-309, 1991.
[27] R. R. Anderson and J. A. Parrish, The optics of human skin. J. Invest. Dermatol. 77: 13-19,
1981.
[28] C.-C. Chiu, D. F. Cook, J. J. Pignatiello, and A. D. Whittaker, Design of a radial basis
function neural network with a radius-modification algorithm using response surface meth-
odology. J. Intell. Manuf. 8: 117-124, 1997.
[29] B. A. Whitehead and T. D. Choate, Cooperative-competitive genetic evolution of radial
basis function centers and widths for time series prediction. IEEE Trans. Neural Networks
7:869-880, 1996.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.
1. INTRODUCTION
234 Chapter 10 Applications of Feed-Forward Neural Networks in the Electrogastrogram
stomach. Two epigastric electrodes were connected to yield a bipolar EGG signal. The
other electrode was used as a reference. The EGG signal was amplified with a frequency
range of 0.03 to 0.25 Hz and simultaneously digitized and stored on the EGG
Digitrapper. The analog-to-digital converter has 8-bit resolution and the sampling
frequency was 1 Hz. All recordings were made in a quiet room. The subjects were in
a supine position and asked not to talk and to remain as still as possible during
recording to avoid motion artifacts. The EGG recording for each subject was made
for 30 minutes in the fasting state and for 1 to 2 hours after a test meal.
s_j = − Σ_{k=1}^{p} a_k s_{j−k} + Σ_{k=1}^{q} c_k n_{j−k} + n_j    (1)

where a_k (1 ≤ k ≤ p) and c_k (1 ≤ k ≤ q) are called the ARMA parameters, and n_j is a
white noise process.
To model an EGG signal, an adaptive ARMA filter was proposed [27] as shown in
Figure 1, where x_j is the input signal at time j. The sets a_kj and c_kj are, respectively, the
feed-forward and feedback weights of the adaptive filter, and y_j is the estimate of the
input signal, expressed as

y_j = Σ_{k=1}^{p} a_kj x_{j−k} + Σ_{k=1}^{q} c_kj e_{j−k}    (2)

e_j = x_j − y_j    (3)
To make the filter output jy an ARMA estimate of the input signal Xj, the filter
weights, which are initially set to zero, are iteratively adjusted in such a way that the
error signal ej becomes a white noise process. After some mathematical manipulations
and simplifications [27], this leads to an adaptation expressed as follows:
a_k,j+1 = a_kj + 2 μ_a e_j x_{j−k}    (4)

c_k,j+1 = c_kj + 2 μ_c e_j e_{j−k}    (5)
where μ_a and μ_c are step sizes controlling the convergence and stability of the algo-
rithm.
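Equations 2 and 3 can be transcribed almost literally (a Python sketch; the update shown is a plain LMS-style gradient step, which may differ in detail from the adaptation derived in [27]):

```python
import numpy as np

def adaptive_arma(x, p=20, q=2, mu_a=0.001, mu_c=0.001):
    """Adaptive ARMA filter: y_j predicts x_j from past inputs and past
    errors (Eq. 2); e_j = x_j - y_j (Eq. 3); the weights start at zero
    and are nudged by an LMS-style rule so that e_j whitens."""
    a = np.zeros(p)            # feed-forward weights a_kj
    c = np.zeros(q)            # feedback weights c_kj
    e = np.zeros_like(x, dtype=float)
    for j in range(len(x)):
        xpast = np.array([x[j - k] if j - k >= 0 else 0.0
                          for k in range(1, p + 1)])
        epast = np.array([e[j - k] if j - k >= 0 else 0.0
                          for k in range(1, q + 1)])
        y = a @ xpast + c @ epast      # Eq. 2
        e[j] = x[j] - y                # Eq. 3
        a += 2 * mu_a * e[j] * xpast   # gradient step on the a_k
        c += 2 * mu_c * e[j] * epast   # gradient step on the c_k
    return a, c, e
```

Once the error sequence has whitened, its variance approaches that of the driving noise rather than of the signal itself.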
The ARMA modeling parameters consist of a_k (k = 1, 2, ..., p) and c_k
(k = 1, 2, ..., q). The p and q were set to 20 and 2, respectively, based on previous
studies [27]. Updated ARMA parameters for each EGG segment were used as the input
of the neural network.
Previous studies [2-4] have shown that the EGG accurately reflects the frequency
of the gastric slow wave. Therefore, spectral data instead of raw EGG data were used as
the input to the ANN in most applications. Running spectral analysis is widely used for
both qualitative and quantitative analyses of the EGG [28]. Two running spectral
analysis methods, adaptive spectral analysis [27] and the exponential distribution
(ED) [29], used in this chapter are briefly introduced as follows.
Adaptive Spectral Analysis. The adaptive spectral analysis method is based on
ARMA parametric modeling of the EGG signal. Once the adaptive filter converges,
the power spectrum of the EGG signal can be computed from the ARMA modeling
parameters, a_k (k = 1, 2, ..., p) and c_k (k = 1, 2, ..., q), according to the following [27]:

P(ω) = σ² |1 + Σ_{k=1}^{q} c̄_k e^{−iωk}|² / |1 + Σ_{k=1}^{p} ā_k e^{−iωk}|²    (7)

where σ² is the variance of the error signal, ā_k and c̄_k are the filter parameters
averaged over the interval from m to j, j is the current time index, and m is the time
index at which the algorithm converges.
At any point in an EGG recording, a power spectrum can be calculated instanta-
neously from the updated parameters of the model. Similarly, the power spectrum of
the signal for any particular time interval can be calculated by averaging the filter
parameters over that time interval. A typical EGG signal and its running spectra,
computed by the adaptive spectral analysis method, are presented in Figure 2. The
top trace shows a 30-minute EGG recording made on one patient, and the lower
panel presents the running spectra of the recording. The power spectra (from the
bottom to the top) were computed every 2 minutes starting at the beginning of the
signal with each curve representing the power spectrum of 2-minute EGG data. These
2-minute analyses were ordered serially without overlap. Comparing the spectra with
the EGG trace, one can observe that the temporal ordering of frequency events in the
EGG signal is accurately reflected in the spectral analysis.
The main advantage of this method is increased spectral and temporal resolution.
Numerous experiments have shown that the adaptive spectral analysis method provides
narrow frequency peaks permitting more precise frequency identification and enhanced
ability in the determination of frequency changes at any time point [27,28]. This method
is especially powerful in detecting dysrhythmic events of brief duration and rhythmic
variations of the gastric slow wave.
where x(n) is the digitized EGG signal, n is the time index, and k is the frequency index.
W_N(τ) is a symmetrical window with a length of N, and W_M(μ) is a rectangular window
with a length of M. After obtaining the summation in the preceding equation, an N-
point fast Fourier transform (FFT) can be used to evaluate RWED(n, k) at each time
instant n.
The performance of the ED method for the analysis of EGG has been thoroughly
investigated by Lin and Chen [31]. The optimal parameters derived in that reference
were used in this chapter.
In practice, the EGG data were divided into segments before being processed for
the network, and thus the cutting place (time) of each segment leads to different wave-
forms in the time domain. The ANNs are quite sensitive to this time shift effect, which
would make detection or identification difficult. In order to remove the time shift effect,
each segment of EGG data was transformed to the frequency domain, and only the
amplitude or power spectrum was used.
The difference between the amplitude spectra of different EGG data (e.g., data
with and without motion artifacts) is mainly at high frequencies, and sometimes it is not
obvious in the linear amplitude spectrum. A logarithmic scale (dB levels) was used to
enlarge the differences. This is denoted by Y and defined as
Y = 10 log |X(k)|    (9)

where |·| denotes the absolute value and X(k) is the discrete Fourier transform of the
EGG data [32].
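Equation 9, together with the earlier observation that only the non-redundant half of the spectrum of a real signal is needed, translates directly into code (an illustrative sketch):

```python
import numpy as np

def log_amplitude_spectrum(segment):
    """dB-scaled amplitude spectrum of one EGG segment (Eq. 9):
    Y = 10*log10(|X(k)|), keeping only the non-redundant half of the
    spectrum since the data are real."""
    X = np.fft.rfft(segment)                 # real-input FFT, N/2+1 bins
    return 10.0 * np.log10(np.abs(X) + 1e-12)  # small floor avoids log(0)
```

For a 16-sample segment this yields exactly 9 spectral values, matching the input dimension used for the networks.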
The amplitude spectrum (or unsmoothed spectrum) provides better performance in
terms of signal detectability. In terms of characterizing the entire spectrum (power), the
smoothed spectral estimate (periodogram) is better [32]. The periodogram method is an
exact implementation based on the definition of the power spectral density. In this
method, EGG data samples are divided into consecutive segments with a certain overlap.
A Fourier transform is performed on each data segment, and the resultant functions of
all segments are averaged. Power can also be presented in linear and decibel (dB) units.
The decibel is the most commonly used unit and is defined as follows:
A = 10 log_10 B    (10)

where A is power in dB and B is power in its linear unit, that is, the squared
magnitude of the Fourier transform of the data.
Windows are often applied in both the amplitude spectrum and the periodogram
method to control the effect of side lobes in spectral estimations [33]. The Hamming
window, which has a low side lobe effect while still maintaining a good main lobe
bandwidth [34], was applied to each segment of the EGG data before computing the
Fourier transform in this chapter.
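The averaged periodogram with a Hamming window, as described above, amounts to a Welch-style estimate (a generic sketch; segment length and overlap are illustrative parameters):

```python
import numpy as np

def averaged_periodogram(x, seg_len, overlap):
    """Smoothed power spectrum: split the data into overlapping segments,
    apply a Hamming window to each, average the squared-magnitude FFTs,
    and return the result in dB (Eq. 10)."""
    step = seg_len - overlap
    win = np.hamming(seg_len)          # low side lobes, good main lobe
    spectra = []
    for start in range(0, len(x) - seg_len + 1, step):
        seg = x[start:start + seg_len] * win
        spectra.append(np.abs(np.fft.rfft(seg)) ** 2)
    B = np.mean(spectra, axis=0)       # linear power
    return 10.0 * np.log10(B + 1e-12)  # A = 10 log10 B, in dB
```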
The EGG data used in this study were collected from 20 volunteers, each data
collection lasting for a total of 1 hour. Special exercises were designed to mimic possible
motions in an EGG study, including reading loudly, raising the legs up, tapping the
electrodes, coughing, sitting up, turning the body, and walking. Four hundred data
segments were selected from all the EGG data by visual examination of the tracings
based on the study protocol. Of the segments, 160 contained motion artifacts and the
remaining 240 were pure data without motion artifacts. These 400 data segments were
divided into two groups, with data for one group of 10 volunteers used as the training
set and data for the other group of 10 used as the testing set. For each group there were
200 segments, 80 of them containing motion artifacts and 120 of them no motion
artifacts. Each data segment consisted of 16 samples.
Because the segmented raw EGG data had a time shift effect that is sensitive to
NNs, it is difficult to detect motion artifacts from raw EGG data using NNs. To
improve the performance of detection, the following features were derived from the
raw data based on the characteristics of the motion artifacts:
1. Amplitude spectrum (AS): Because the EGG data are real, the amplitude spectra
are symmetric about zero frequency. Therefore, half of each spectrum contains all
the required information. As the length of each data segment is 16 samples in this
application, only the first 9 of the 16 spectral data were chosen as the input to the NNs.
2. Maximum derivative (MD): This feature is very effective for detection of pulse-
type motion artifacts, for they vary sharply in a short time period and their max-
imum derivatives are much larger. The maximum derivative for each data segment
can be represented as follows:
MD = max_{1≤i≤N−1} |x_{i+1} − x_i|    (11)

where x_i is the ith sample of the segment, and N is the number of samples in one
segment.
3. Standard deviation (SD): Data with motion artifacts show large variations in
amplitude in the time domain. This can be characterized by the standard deviation
(square root of the variance), which can be expressed as

σ = {E[x(i) − E(x(i))]²}^{1/2}    (12)

4. Relative amplitude (RA): the difference between the maximum and the minimum
sample values of the segment,

RA = max(x_i) − min(x_i)    (13)
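The time-domain features (maximum derivative, standard deviation, relative amplitude) can be computed in a few lines (an illustrative sketch):

```python
import numpy as np

def segment_features(seg):
    """Time-domain features of one EGG segment: maximum derivative (MD),
    standard deviation (SD), and relative amplitude (RA)."""
    md = np.max(np.abs(np.diff(seg)))   # largest sample-to-sample change
    sd = np.std(seg)                    # square root of the variance
    ra = np.max(seg) - np.min(seg)      # peak-to-peak range
    return md, sd, ra
```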
Momentum and an adaptive learning rate were used in the training
process of the three-layer feed-forward networks. After some trials, the parameters were
set as follows: learning rate, 0.01; learning rate increase ratio, 1.05; learning rate decrease
ratio, 0.7; momentum constant, 0.9; and error ratio, 1.04. The error goal was set to 0.01,
which was found to be accurate enough in this application.
Feature selection plays an important role in the detection of motion artifacts. The
accuracy would not be high enough if selection of the features was improper or too few
features were selected. On the other hand, too many features would lead to redundancy
and make the detection time consuming. Therefore, different combinations of the four
different features were used and compared to find an optimal set. Table 1 shows the
Accuracy (%)   94.9   96.2   97.4   96.2   98.7   98.7   98.7   100   100
Note: The four features compared are the maximum derivative (MD), amplitude spectrum (AS),
relative amplitude (RA), and standard deviation (SD).
testing results of detection by networks with three hidden neurons using one, two, three,
or four features. Two-feature detection was better than or equal to one-feature detec-
tion, and three-feature detection was better than or equal to two-feature detection.
Three-feature detection using amplitude spectrum, maximum derivative, and standard
deviation was as accurate as four-feature detection and better than any two-feature
detection. Therefore, amplitude spectrum, maximum derivative, and standard deviation
were considered to be the optimal choice for the training and testing data sets.
A software system running on MATLAB has been developed for the detection and
elimination of motion artifacts in EGG recordings. Figure 3 shows a flowchart of this
system. Figure 4 shows an example of this system's ability to identify motion artifacts in
a real EGG recording. Figure 4a shows an original EGG recording that contains
relatively severe motion artifacts. The data segments with zero values in Figure 4b
show the motion artifacts recognized by the network. The EGG waveform after the
deletion of the data segments with motion artifacts is shown in Figure 4c. The effect of
motion artifacts on the EGG spectrum is shown in Figure 5. The frequency peak at 3
cycles/min (cpm) indicates the electrical activity of the stomach. It can be seen from this
figure that motion artifacts result in not only waveform distortion but also spectral
distortion.
The EGG data were obtained from 10 healthy subjects in the fasting state, each for
2 hours. Gastric contractions were simultaneously monitored with EGG recordings
using an intraluminal antral manometric probe. The manometric signals were used as
Figure 3 Flowchart of the software system for detection and elimination of motion
artifacts.
Figure 4 Example of EGG data before and after elimination of motion artifacts.
(a) Original EGG data with motion artifacts; (b) the motion artifacts (repre-
sented by zero values) detected by the neural network; (c) EGG data with
motion artifacts eliminated.
Figure 5 Power spectra: (a) original EGG (star); (b) after deleting motion artifacts
(solid).
the "gold standard" for the existence of gastric contractions. The EGG recording was
divided into segments, each with 512 samples. Each segment of the EGG data was
labeled as 0 or 1. A segment was labeled 0 if no contractions were seen in the stomach in
the simultaneous manometric recordings. A segment was labeled 1 if one or more
contractions were present in the manometric recording. The power spectrum of each
segment was computed by the exponential distribution method to use as the input to the
network. There was an overlap of 75% between two adjacent EGG segments. Only 64
spectral data were used for each input, which covered 0 to 15 cpm. Since the electrical
activity of the stomach contains no information above 15 cpm, spectral data above
15 cpm were discarded. This substantially simplified the structure of the network. The
EGG in five subjects was used as the training set. The testing set was composed of the
EGG in the other five subjects.
nodes was the same as that of the network with 10 hidden nodes for a fixed
number of iterations of 1000. With a structure 64:10:2 and the optimized values
of the learning rate and momentum factor presented previously, the network recog-
nized the EGG during motor quiescence with an accuracy of 90% and the EGG
during gastric contractions with an accuracy of 94%.
motor disorders and several clinical symptoms, such as nausea, vomiting, motion sick-
ness, and early pregnancy [37-40]. The assessment of the normality or abnormality of
the EGG is, therefore, of great clinical significance. Currently, researchers assess the
normality of the EGG by visual examination of the EGG tracing or its running power
spectra or both. Figure 8 shows typical normal and abnormal EGG signals. The left
panel shows the time signals and the right panel their power spectra. The four rows
correspond to bradygastria, normal, tachygastria, and arrhythmia, respectively. The
normal signal clearly contains a frequency of 3 cpm, and abnormal signals differ from
this pattern in that they contain lower, higher, or irregular frequencies. Compared with
other surface recordings, such as ECG, the quality of the EGG is usually poor. The
gastric signal in the EGG is disturbed by noise, which is composed of respiratory and
motion artifacts, the ECG, and electrical interference of the small intestine. In addition,
an EGG recording usually lasts more than 1 hour. As a result, visual examination of the
EGG not only is time consuming but also requires extensive experience in spectral
" Learning rate = 0.05, momentum factor = 0.9, number of iterations = 1000.
analysis. Therefore, a neural network approach has been proposed for the automated
classification of normal and abnormal EGGs.
To achieve optimal results and reduce the complexity of the network, three types of
input data preprocessing were investigated and compared with each other. These
included raw EGG data, power spectral data, and ARMA modeling parameters of
the EGG. The raw EGG data were divided into segments, each with 60 time samples
(1 minute). The power spectral data and ARMA modeling parameters were obtained
from the raw EGG data using the adaptive running spectral analysis method [27], which
has been shown to provide high frequency resolution and accurate temporal informa-
tion [28]. The ARMA parameters of each EGG segment were composed of 20 feed-
forward and 2 feedback parameters.
The EGG segment was defined as normal if the visual examination of the EGG
traces and its spectrum indicated a dominant frequency in the range 2.4 to 3.8 cpm.
Otherwise, it was defined as abnormal. The training data set was composed of 100
segments of the normal EGG and 100 segments of the abnormal EGG, which were
randomly selected from the EGG recordings of 10 subjects. The test set was composed
of 100 segments of normal EGG and 100 segments of abnormal EGG randomly
selected from another 10 subjects' EGG recordings.
The feed-forward network with one hidden layer was used as a classifier in this
study. The optimal number of neurons in the hidden layer was determined experimen-
tally. The numbers of neurons in the input layer were determined by the dimension or
size of the input vector. For the input types of raw EGG data and spectral data, 60
input neurons were used. The input layer contained 22 neurons when the ARMA
modeling parameters were used as the input. The output layer consisted of one neuron
for classifying the EGG into two classes: normal and abnormal.
Three indexes were used to assess the performance of the neural networks: percent
correct (Pc), sum-squared error (SSE), and complexity per iteration.
The complexity per iteration was defined as the computational time required for
each iteration of the algorithm. The Pc was defined as the percentage of all of the
answers obtained that were judged to be correct according to the gold standard. The
Figure 8 Examples of the EGG signals. The left panels show the time-domain signals and the right panels their corresponding power spectra. The four rows, from top to bottom, represent bradygastria, normal, tachygastria, and arrhythmia, respectively.
248 Chapter 10 Applications of Feed-Forward Neural Networks in the Electrogastrogram
SSE was obtained by computing the difference between the output value that an output neuron was supposed to have, denoted $T_{ip}$, and the value the neuron actually produced in the feed-forward calculation, denoted $z_{ip}$. This difference was squared, and the squares were summed over all output neurons. Finally, the calculation was repeated for each example in the training or testing set, as applicable. Let P be the number of examples in the training or testing set and $N_2$ the number of neurons in the output layer; then the SSE can be expressed as

$$\mathrm{SSE} = \sum_{p=1}^{P} \sum_{i=1}^{N_2} \left( z_{ip} - T_{ip} \right)^2 \qquad (15)$$
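The two performance indexes, SSE from Eq. (15) and percent correct (Pc), reduce to a few lines of numpy. The function names and the example labels below are illustrative assumptions, not the chapter's code.

```python
import numpy as np

def sse(targets, outputs):
    """Sum-squared error over all examples and output neurons,
    as in Eq. (15): SSE = sum_p sum_i (z_ip - T_ip)^2."""
    return float(np.sum((np.asarray(outputs) - np.asarray(targets)) ** 2))

def percent_correct(targets, outputs, threshold=0.5):
    """Pc: percentage of thresholded outputs that match the gold standard."""
    predictions = (np.asarray(outputs) >= threshold).astype(int)
    return 100.0 * np.mean(predictions == np.asarray(targets))

# Hypothetical gold-standard labels and network outputs for four examples.
T = np.array([1, 0, 1, 1])
z = np.array([0.9, 0.2, 0.4, 0.8])

err = sse(T, z)            # 0.01 + 0.04 + 0.36 + 0.04 = 0.45
pc = percent_correct(T, z) # 3 of 4 thresholded outputs correct -> 75.0
```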
                              BP     SCG    QN
Convergence rate              -      +      +
Complexity per iteration      +      +      -
Robustness                    -      +      +

Note: the SCG algorithm is the best compromise.
Figure 9 Cost function (SSE) as a function of the iteration number for time-domain data (left) and spectral data (right) for the three algorithms BP, SCG, and QN (distinguished by line style). The network structure is 60:4:1.
Figure 10 For BP and SCG, the complexity per iteration is a linear function of the network size; in contrast, QN shows a quadratic relationship.
with three different types of input. It can be seen that both the spectral data and the ARMA modeling parameters yielded a classification accuracy of 95%, whereas the percent correct was only 65% when the raw EGG data were used as the input. Although the ARMA modeling parameters gave the same performance as the spectral data, the ANN with the ARMA modeling parameters as the input contained substantially fewer input neurons (22 vs. 60) and was computationally much simpler.
To test the hypothesis that patients with normal and delayed emptying of the stomach can be differentiated from certain EGG parameters using the ANN, a feature-based neural network method has been developed to provide a noninvasive alternative for the detection of delayed gastric emptying.
The EGG data were obtained from 152 patients with suspected gastric motility
disorders who underwent clinical tests for gastric emptying. A 30-minute baseline EGG
recording was made before ingestion of a standard test meal for each patient. Then the
patient consumed a standard test meal within 10 minutes. After the meal, simultaneous
recordings of the EGG and scintigraphic gastric emptying were made continuously for
2 hours. The techniques for recording the EGG and gastric emptying were previously
described [41]. The gastric emptying results were interpreted by the nuclear medicine
physicians.
Previous studies have shown that spectral parameters of the EGG provide useful
information regarding gastrointestinal motility and symptoms [33]. Therefore, all EGG
data were subjected to computerized spectral analysis using the programs previously
developed in our laboratory [42]. The following EGG parameters were extracted from
the spectral domain of the EGG data for each patient and were used as the input to the
neural network: dominant frequencies and their corresponding powers of the prepran-
dial and postprandial EGGs, the EGG peak power ratio between the preprandial and
postprandial EGGs, percentages of 2-4 cpm (normal frequency range) activity, and
tachygastria in the fasting and fed state. The EGG power ratio between the preprandial
and postprandial EGGs is related to the regularity and amplitude of the gastric slow
wave and has been reported to be associated with gastric contractility. The percentage
of the normal 2-4 cpm activity is a quantitative assessment of the regularity of the
gastric slow wave measured from the EGG. It was defined as the percentage of time
during which normal 2-4 cpm slow waves were observed in the EGG. It was calculated
using the running power spectral analysis method [42]. Tachygastria has been shown to
be associated with gastric hypomotility [33]. Therefore, the percentage of tachygastria
was calculated and used as a feature to be input into the neural network. It was defined
as the percentage of time during which 4-9 cpm slow waves were dominant in the EGG
recording and was computed in the same way as the percentage of the normal gastric
slow wave.
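The two percentage features described above share one computation: the fraction of analysis windows whose dominant frequency lies in a given band. A sketch under my own assumptions (per-minute dominant frequencies already extracted by the running spectral analysis; function name and example values are hypothetical):

```python
import numpy as np

def percent_in_band(dominant_freqs, low, high):
    """Percentage of analysis windows whose dominant frequency (cpm)
    falls in [low, high). With (2, 4) this gives the normal slow-wave
    percentage; with (4, 9) the percentage of tachygastria."""
    f = np.asarray(dominant_freqs, dtype=float)
    in_band = (f >= low) & (f < high)
    return 100.0 * np.mean(in_band)

# Hypothetical per-minute dominant frequencies from a running spectrum.
freqs = [3.0, 3.1, 5.2, 2.9, 6.5]
normal_pct = percent_in_band(freqs, 2.0, 4.0)  # 3 of 5 windows -> 60.0
tachy_pct = percent_in_band(freqs, 4.0, 9.0)   # 2 of 5 windows -> 40.0
```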
In order to preclude some features dominating the classification process, the value
of each parameter was normalized to the range of zero to one. Experiments were
performed using all or part of the preceding parameters as the input to the artificial
neural network to derive optimal performance.
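The normalization step above is plain min-max scaling per feature. A minimal sketch, assuming the features are collected in a patients-by-parameters matrix (the function name and the example values are illustrative):

```python
import numpy as np

def minmax_normalize(X):
    """Scale each column (one EGG parameter) to the range [0, 1] so that
    no single feature dominates the classification purely through its units."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span

# Hypothetical feature matrix: rows = patients, columns = EGG parameters
# (e.g., dominant frequency in cpm, percentage of normal slow waves).
X = np.array([[2.9, 40.0],
              [3.1, 85.0],
              [3.3, 95.0]])
Xn = minmax_normalize(X)
```

Note that in a strict train/test protocol the minimum and maximum would be taken from the training set only and reused on the test set.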
The EGG data obtained from the 152 patients were divided into two groups based
on the results of the scintigraphic gastric emptying test: 76 patients with delayed gastric
emptying and 76 patients with normal gastric emptying. The training set was composed
of EGG data of 50% of the patients from each of the groups, and the remaining data
were used as the testing set. The statistical analysis of the EGG parameters between the
two groups of patients revealed that the patients with delayed gastric emptying had a
lower percentage of regular 2-4 cpm slow waves in both the fasting (77.1 ± 2.6% vs. 88.7 ± 1.3%, p < 0.001) and fed (77.8 ± 2.2% vs. 90.0 ± 1.0%, p < 0.001) states. A
significantly higher level of tachygastria was also observed in the fed state in patients
with delayed gastric emptying (13.9 ± 1.8% vs. 4.1 ± 0.6%, p < 0.001). Both groups of
patients showed a postprandial increase in EGG dominant power. This increase was,
however, significantly lower in patients with delayed gastric emptying than in patients
with normal gastric emptying (1.2 ± 0.6 dB vs. 4.6 ± 0.5 dB, p < 0.001). No significant
differences were observed in the dominant frequency of the EGG between the two
groups in the fasting (3.08 ±0.10 cpm vs. 2.94 ±0.03 cpm, p > 0.05) and fed
(3.20 ± 0.10 cpm vs. 3.03 ± 0.03 cpm, p > 0.05) states.
A number of experiments were conducted to optimize the performance of the
network using different numbers of EGG parameters ranging from two to all of the
parameters. Table 7 presents the test results of the network with different neurons in the
input layer and hidden layer. The five EGG parameters were the dominant frequency in the fasting state, the dominant frequency in the fed state, the postprandial increase of the EGG dominant power, the percentage of normal 2-4 cpm slow waves in the fed state, and the percentage of tachygastria in the fed state. These five parameters were determined from a series of experiments using different combinations of all the EGG parameters. It can be seen that the best performance was achieved when these five parameters were used as the input and the hidden layer had three neurons. In this
case, the accuracy in determining the correct diagnosis was 85% with a sensitivity of
82% and a specificity of 89%. It can also be seen that three neurons are the optimal
number for the hidden layer and that exclusion of any one or two of the five input
parameters would cause the performance of the classification to deteriorate.
Structure    CC(%)    SE(%)    SP(%)
5-5-2        80       74       87
5-4-2        80       74       87
5-3-2        85       82       89
5-2-2        80       74       87
4-3-2        80       74       87
3-3-2        72       74       71

Note: CC(%), percentage of correct classification; SE(%), sensitivity; SP(%), specificity.
The reasons for choosing the neural network approaches in clinical applications of
the EGG were as follows: (1) The best candidate problems for ANN analysis are those
that are characterized by fuzzy, imprecise, and imperfect knowledge (data) and/or by
lack of a clearly stated mathematical algorithm for the analysis of data [6]. The problems of the EGG in clinical applications are perfect candidates for ANNs. The EGG
signal is imprecise. However, the measurement of the EGG is noninvasive and well
accepted by patients and physicians. Therefore, ample data can be made available
without any difficulty for the training and testing of the neural network.
(2) Successful applications of the ANN for the classification of other medical data
have been reported in numerous previous studies [7-13,43,44].
The structure of the neural network and the parameters were determined from the
literature and experiment. In comparison with other ANNs, the feed-forward neural
network has the advantages of availability of effective training algorithms, relatively
better system behavior, and a successful track record in PC-based systems
[6,11]. One hidden layer was used on the basis of several previous studies [14,15] that
showed that one hidden layer resulted in the same performance as two or more hidden
layers. Conflicting results were reported in the literature on the number of hidden
neurons [11]. The selection of the number of hidden neurons in this chapter was
based purely on experiments that showed that different hidden nodes were needed in
different applications. The BP learning algorithm was successfully applied in most
cases, and experimental results also showed that the adaptive learning rate and momentum mechanism greatly improved the network performance. Compared with the BP and QN algorithms, the SCG algorithm is more appropriate for classification of normal and abnormal signals: it has moderate computational complexity and shows a superior convergence rate.
REFERENCES
[1] W. C. Alvarez, The electrogastrogram and what it shows. J. Am. Med. Assoc. 78: 1116-1118,
1922.
[2] A. J. P. M. Smout, E. J. van der Schee, and J. L. Grashuis, What is measured in electro-
gastrography? Dig. Dis. Sci. 25: 179-187, 1980.
[3] B. O. Familoni, K. L. Bowes, Y. J. Kingma, and K. R. Cote, Can transcutaneous recordings
detect gastric electrical abnormalities? Gut 32: 141-146, 1991.
[4] J. Chen, B. D. Schirmer, and R. W. McCallum, Serosal and cutaneous recordings of gastric
myoelectrical activity in patients with gastroparesis. Am. J. Physiol. 266: G90-G98, 1994.
[5] R. P. Lippmann, An introduction to computing with neural nets. IEEE ASSP Mag. April:
4-22, 1987.
[6] R. C. Eberhart and R. W. Dobbins, Neural Network PC Tools: A Practical Guide. San
Diego: Academic Press, 1990.
[7] D. R. Hush and B. G. Horne, Progress in supervised neural networks. IEEE Signal Process.
January: 8-39, 1993.
[8] P. A. Karkhanis, J. Y. Cheung, and S. M. Teague, Using a PC based neural network to
estimate the ejection fraction of a human heart. Int. J. Microcomput. Appl. 9: 99, 1990.
[9] T. Pike and R. A. Mustart, Automated recognition of corrupted arterial waveforms using
neural network techniques. Comput. Biol. Med. 22: 173-179, 1992.
[32] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Englewood Cliffs, NJ:
Prentice-Hall, 1975.
[33] J. D. Z. Chen and R. W. McCallum, EGG parameters and their clinical significance. In
Electrogastrography: Principles and Applications, J. D. Z. Chen and R. W. McCallum, eds.,
pp. 45-73, New York: Raven Press, 1994.
[34] A. H. Nuttall, Some windows with very good sidelobe behavior, IEEE Trans. Acoust. Speech
Signal Process, 29: 84-91, 1981.
[35] J. Chen, R. W. McCallum, and R. Richards, Frequency components of the electrogastro-
gram and their correlations with gastrointestinal contractions in humans. Med. Biol. Eng.
Comput. 31: 60-67, 1993.
[36] J. D. Z. Chen, R. Richards, and R. W. McCallum, Identification of gastric contractions
from the cutaneous electrogastrogram. Am. J. Gastroenterol. 89: 79-85, 1994.
[37] J. D. Z. Chen and R. W. McCallum, Clinical applications of electrogastrography. Am. J.
Gastroenterol. 88: 1324-1336, 1993.
[38] C. H. You, K. Y. Lee, W. Y. Chey, and R. Menguy, Electrogastrographic study of patients
with unexplained nausea, bloating and vomiting. Gastroenterology 79: 311-314, 1980.
[39] J. Chen and R. W. McCallum, Gastric slow wave abnormalities in patients with gastropar-
esis. Am. J. Gastroenterol. 87: 477-482, 1992.
[40] R. M. Stern, K. L. Koch, H. W. Leibowitz, I. Lindblad, C. Shupert, and W. R. Stewart,
Tachygastria and motion sickness. Aviat. Space Environ. Med. 56: 1074-1077, 1985.
[41] J. D. Z. Chen, Z. Y. Lin, and R. W. McCallum, Abnormal gastric myoelectrical activity and
delayed gastric emptying in patients with symptoms suggestive of gastroparesis. Dig. Dis.
Sci. 41: 1538-1545, 1996.
[42] J. Chen, A computerized data analysis system for electrogastrogram. Comput. Biol. Med. 22:
45-58, 1992.
[43] M. F. Kelly et al., The application of neural networks to myoelectric signal analysis: A
preliminary study. IEEE Trans. Biomed. Eng. 37: 221-230, 1990.
[44] S. Srinivasan, R. E. Gander, and H. C. Wood, A movement pattern generated model using
artificial neural networks. IEEE Trans. Biomed. Eng. 39: 716-722, 1992.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.
INDEX
A
A posteriori probabilities, 102
ADALINE, 72
Admissible generator function, 129, 175
Approximate reasoning, 17
ARMA, 235
Artificial neural networks, 69
Associative network, 198

B
Back propagation, 218
Back propagation learning, 54
Blind spot, 134
Bootstrap stratification, 104

C
Clustering algorithm, 159
Crisp fuzzy partitions, 160

D
Dynamic stability, 204
Dynamic transition, 210
Dynamically driven recurrent networks,
Dynamics, 204

E
EEG, 83
EGG, 85, 233
Entropy-constrained fuzzy clustering, 165
Exponential generator function, 131, 13!

F
Feature extraction, 32, 249
Function approximation, 125
Fuzzy arithmetic, 10
Fuzzy clustering, 29, 180
Fuzzy c-means algorithm, 163-164
Fuzzy inference, 22
Fuzzy relations, 12
Fuzzy set theory, 5

G
Generalized reformulation function, 171
Generator function, 144, 174
Genomic sequences, 98
Gradient descent learning, 144

I
Information-theoretic models, 60

K
Knowledge, 2

L
Learning algorithm, 248
Linear generator function, 137
LVQ and clustering algorithm, 178

M
Metastability, 206
Modified back-propagation, 100
MRI segmentation, 188
Multilayer perceptrons, 53, 71

N
Necessity, 16
Neural network models, 201
Neurodynamic programming, 61
Neuromodulation, 201
Nonstationary signal processing, 29

P
Partial least squares, 212
Pharyngeal wall, 88
Possibility distribution, 14
Possibility theory, 12
Principal component analysis, 59
Processing elements, 70

R
Radial basis function, 57, 123, 125, 219
Reformulating fuzzy clustering, 168
REM, 199
Respiratory signal, 86
Running power spectrum, 236

S
Sample stratification, 98
Selecting generator function, 133
Self-organizing map, 59
Sequential learning algorithm, 143
Sigmoid function, 71
Signal compression, 89
Sleep, 88, 198, 199
Soft clustering, 182
Spectroscopic signals, 216
Supervised learning, 53, 74
Support vector machines, 58
SWS, 199
System identification, 28

T
Thresholding, 71
Time-frequency analysis, 33
Time series prediction, 28, 44

U
Uncertainty, 1
Unsupervised fuzzy clustering, 34, 36
Unsupervised learning, 59, 75
Upper airway obstruction, 87

W
Weighted fuzzy K-mean algorithm, 37
Figure 2.4 The first partition of rat number 11's EEG data into two clusters by the HUFC algorithm. The upper panel shows the partition in the clustering space of three (out of eight) energies of the scales of the discrete wavelet transform of the EEG stretch terminating with a seizure. Each data point is marked by the number of the cluster in which it has maximal degree of membership. The number of clusters was determined by the average density criterion for cluster validity (Figure 3). The lower panel shows the "hard" affiliation of each successive point in the time series (1 second) to each of the clusters. The seizure beginning (as located by a human expert) is marked by a solid vertical line (after 700 seconds).
Figure 2.5 The final partition of the EEG data with the HUFC algorithm. Clusters 4 and 5 can be used to predict the seizure, which can be identified by clusters 8 and 10.
Figure 2.7 The first partition of the recovery heart rate signals by the HUFC algorithm into four clusters as suggested by the average partition density (Figure 6). The upper panel shows the partition of the 3D temporal patterns {s_i, s_{i+1}, s_{i+2}}, i = 1, ..., L - 2, of the heart rate signal into the four clusters, and in the lower panel we can see the affiliation of each temporal pattern with its corresponding cluster marked on the original heart rate signal (the continuous line).
[Figure panels: partition density and average partition density as functions of the number of clusters, used as cluster-validity criteria.]
Figure 2.9 The first partition of the s_n = 4 · s_{n-1} · (1 - s_{n-1}) time series by the HUFC algorithm. The lower panel shows the partition of the 2D temporal patterns {s_i, s_{i+1}}, i = 1, ..., 899, into five clusters as suggested by the average partition density criterion in the upper panel.
Figure 2.10 The one-sample-ahead (d = 1) prediction results of 200 samples of the s_n = 4 · s_{n-1} · (1 - s_{n-1}) time series. The circle (O) marks the original samples of the time series and the "X" marks the predictions.
Figure 2.11 The final partition of the resting heart rate signals by the HUFC algorithm into 19 clusters as suggested by the average partition density using the "maximal members." The upper panel shows the partition of the 3D temporal patterns {s_i, s_{i+1}, s_{i+2}}, i = 1, ..., L - 2, of the heart rate signal into 19 clusters, and in the lower panel we can see the affiliation of each temporal pattern with its corresponding cluster marked on the original heart rate signal (the continuous line).
Figure 2.12 The one-sample-ahead (d = 1) prediction results of 100 samples of the resting heart rate signal. The circle (O) marks the original samples of the time series and the "X" marks the predictions.