
NONLINEAR BIOMEDICAL

SIGNAL PROCESSING
Volume I
IEEE Press Series on Biomedical Engineering

The focus of our series is to introduce current and emerging technologies to biomedical and electrical
engineering practitioners, researchers, and students. This series seeks to foster interdisciplinary
biomedical engineering education to satisfy the needs of industry and academia. This requires an
innovative approach that overcomes the difficulties associated with traditional textbooks and edited
collections.

Metin Akay, Series Editor


Dartmouth College

Advisory Board

Thomas Budinger Simon Haykin Richard Robb


Ingrid Daubechies Murat Kunt Richard Satava
Andrew Daubenspeck Paul Lauterbur Malvin Teich
Murray Eden Larry McIntire Herbert Voigt
James Greenleaf Robert Plonsey Lotfi Zadeh

Editorial Board
Eric W. Abel Gabor Herman Kris Ropella
Dan Adam Helene Hoffman Joseph Rosen
Peter Adlassing Donna Hudson Christian Roux
Berj Bardakjian Yasemin Kahya Janet Rutledge
Erol Basar Michael Khoo Wim L. C. Rutten
Katarzyna Blinowska Yongmin Kim Alan Sahakian
Bernadette Bouchon-Meunier Andrew Laine Paul S. Schenker
Tom Brotherton Rosa Lancini G. W. Schmid-Schönbein
Eugene Bruce Swamy Laxminarayan Ernest Stokely
Jean-Louis Coatrieux Richard Leahy Ahmed Tewfik
Sergio Cerutti Zhi-Pei Liang Nitish Thakor
Maurice Cohen Jennifer Linderman Michael Unser
John Collier Richard Magin Eugene Veklerov
Steve Cowin Jaakko Malmivuo Al Wald
Jerry Daniels Jorge Monzon Bruce Wheeler
Jaques Duchene Michael Neuman Mark Wiederhold
Walter Greenleaf Banu Onaral William Williams
Daniel Hammer Keith Paulsen Andy Yagle
Dennis Healy Peter Richardson Yuan-Ting Zhang

Books in the IEEE Press Series on Biomedical Engineering


Akay, M., Time Frequency and Wavelets in Biomedical Signal Processing
Hudson, D. L. and M. E. Cohen, Neural Networks and Artificial Intelligence for Biomedical Engineering
Khoo, M. C. K., Physiological Control Systems: Analysis, Simulation, and Estimation
Liang, Z-P. and P. C. Lauterbur, Principles of Magnetic Resonance Imaging: A Signal Processing
Perspective
Akay, M. Nonlinear Biomedical Signal Processing: Volume I, Fuzzy Logic, Neural Networks, and New
Algorithms
Akay, M. Nonlinear Biomedical Signal Processing: Volume II, Dynamic Analysis and Modeling
Ying, H. Fuzzy Control and Modeling: Analytical Foundations and Applications
NONLINEAR BIOMEDICAL
SIGNAL PROCESSING
Fuzzy Logic, Neural Networks,
and New Algorithms

Volume I

Edited by
Metin Akay
Dartmouth College
Hanover, NH

IEEE Engineering in Medicine and Biology Society, Sponsor

IEEE Press Series on Biomedical Engineering

Metin Akay, Series Editor

IEEE Press

The Institute of Electrical and Electronics Engineers, Inc., New York


This book and other books may be purchased at a discount
from the publisher when ordered in bulk quantities. Contact:

IEEE Press Marketing


Attn: Special Sales
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Fax: + 1 732 981 9334

For more information about IEEE Press products,


visit the IEEE Online Catalog & Store: http://www.ieee.org/ieeestore.

© 2000 by the Institute of Electrical and Electronics Engineers, Inc.


3 Park Avenue, 17th Floor, New York, NY 10016-5997.

All rights reserved. No part of this book may be reproduced in any form,
nor may it be stored in a retrieval system or transmitted in any form,
without written permission from the publisher.

Printed in the United States of America.

10 9 8 7 6 5 4 3 2 1

ISBN 0-7803-6011-7
IEEE Order No. PC5861

Library of Congress Cataloging-in-Publication Data


Nonlinear biomedical signal processing/edited by Metin Akay.
v. < > cm. — (IEEE Press series on biomedical engineering)
Includes bibliographical references and index.
Contents: v. 1. Fuzzy logic, neural networks, and new algorithms — v. 2. Dynamic
analysis and modeling.
ISBN 0-7803-6011-7
1. Signal processing. 2. Biomedical engineering. 3. Fuzzy logic. 4. Neural networks. I.
Akay, Metin. II. Series.

R857.S47 N66 2000


610'.285'632—dc21 00-027777
This book is dedicated to the memory of one of the
most influential poets of the 20th Century,
NAZIM HIKMET

"I stand in advancing light,


my hands hungry, the world beautiful.

My eyes can not get enough of the trees—


they are so hopeful and green.

A sunny road runs through the mulberries,


I am at the window of the prison infirmary.

I can not smell medicines—


carnations must be blooming nearby.

It's this way:


Being captured is beside the point,
The point is not to surrender."

Nazim Hikmet, 1948

("It's This Way" reproduced by permission of


Persea Books.)
IEEE Press
445 Hoes Lane, P.O. Box 1331
Piscataway, NJ 08855-1331
IEEE Press Editorial Board
Robert J. Herrick, Editor in Chief

M. Akay M. Eden M. S. Newman


J. B. Anderson M. E. El-Hawary M. Padgett
P. M. Anderson R. F. Hoyt W. D. Reeve
J. E. Brewer S. V. Kartalopoulos G. Zobrist
D. Kirk
Kenneth Moore, Director of IEEE Press
Catherine Faduska, Senior Acquisitions Editor
Linda Matarazzo, Associate Acquisitions Editor
Surendra Bhimani, Production Editor
IEEE Engineering in Medicine and Biology Society, Sponsor
EMB-S Liaison to IEEE Press, Metin Akay
Cover design: William T. Donnelly, WT Design

Technical Reviewers
Eric W. Abel, University of Dundee, United Kingdom
Richard D. Jones, Christchurch Hospital, Christchurch, New Zealand
Suzanne Keilson, Loyola College, MD
Kristina M. Ropella, Marquette University, Milwaukee, WI
Alvin Wald, Columbia University, New York, NY

Books of Related Interest from IEEE Press

NEURAL NETWORKS: A Comprehensive Foundation, Second Edition


Simon Haykin
A Prentice Hall book published in cooperation with IEEE Press
1999 Hardcover 600 pp IEEE Order No. PC5746 ISBN 0-7803-3494-9
RANDOM PROCESSES FOR IMAGE AND SIGNAL PROCESSING
Edward R. Dougherty
An SPIE Press book published in cooperation with IEEE Press
A volume in the SPIE/IEEE Series on Imaging Science & Engineering
1999 Hardcover 616 pp IEEE Order No. PC5747 ISBN 0-7803-3495-7
THE IMAGE PROCESSING HANDBOOK, Third Edition
John C. Russ
A CRC Press handbook published in cooperation with IEEE Press
1998 Hardcover 800 pp IEEE Order No. PC5775 ISBN 0-7803-4729-3

UNDERSTANDING NEURAL NETWORKS AND FUZZY LOGIC: Basic Concepts &


Applications
Stamatios V. Kartalopoulos
A volume in the IEEE Press Understanding & Technology Series
1996 Softcover 232 pp IEEE Order No. PP5591 ISBN 0-7803-1128-0
CONTENTS

PREFACE xiii

LIST OF CONTRIBUTORS xv

CHAPTER 1 UNCERTAINTY MANAGEMENT IN MEDICAL APPLICATIONS 1


Bernadette Bouchon-Meunier

1. Introduction 1
2. Imperfect Knowledge 1
2.1. Types of Imperfections 1
2.1.1. Uncertainties 1
2.1.2. Imprecisions 2
2.1.3. Incompleteness 2
2.1.4. Causes of Imperfect Knowledge 2
2.2. Choice of a Method 2
3. Fuzzy Set Theory 4
3.1. Introduction to Fuzzy Set Theory 4
3.2. Main Basic Concepts of Fuzzy Set Theory 5
3.2.1. Definitions 5
3.2.2. Operations on Fuzzy Sets 6
3.2.3. The Zadeh Extension Principle 8
3.3. Fuzzy Arithmetic 10
3.4. Fuzzy Relations 11
4. Possibility Theory 12
4.1. Possibility Measures 12
4.2. Possibility Distributions 14
4.3. Necessity Measures 15
4.4. Relative Possibility and Necessity of Fuzzy Sets 17
5. Approximate Reasoning 17
5.1. Linguistic Variables 17
5.2. Fuzzy Propositions 19
5.3. Possibility Distribution Associated with a Fuzzy Proposition 19
5.4. Fuzzy Implications 21
5.5. Fuzzy Inferences 22
6. Examples of Applications of Numerical Methods in Biology 23
7. Conclusion 24
References 25


CHAPTER 2 APPLICATIONS OF FUZZY CLUSTERING TO BIOMEDICAL


SIGNAL PROCESSING AND DYNAMIC SYSTEM
IDENTIFICATION 27
Amir B. Geva

1. Introduction 27
1.1. Time Series Prediction and System Identification 28
1.2. Fuzzy Clustering 29
1.3. Nonstationary Signal Processing Using Unsupervised Fuzzy Clustering 29
2. Methods 30
2.1. State Recognition and Time Series Prediction Using Unsupervised Fuzzy
Clustering 31
2.2. Features Extraction and Reduction 32
2.2.1. Spectrum Estimation 33
2.2.2. Time-Frequency Analysis 33
2.3. The Hierarchical Unsupervised Fuzzy Clustering (HUFC) Algorithm 34
2.4. The Weighted Unsupervised Optimal Fuzzy Clustering (WUOFC)
Algorithm 36
2.5. The Weighted Fuzzy K-Mean (WFKM) Algorithm 37
2.6. The Fuzzy Hypervolume Cluster Validity Criteria 39
2.7. The Dynamic WUOFC Algorithm 40
3. Results 40
3.1. State Recognition and Events Detection 41
3.2. Time Series Prediction 44
4. Conclusion and Discussion 48
Acknowledgments 51
References 51

CHAPTER 3 NEURAL NETWORKS: A GUIDED TOUR 53


Simon Haykin

1. Some Basic Definitions 53


2. Supervised Learning 53
2.1. Multilayer Perceptrons and Back-Propagation Learning 54
2.2. Radial Basis Function (RBF) Networks 57
2.3 Support Vector Machines 58
3. Unsupervised Learning 59
3.1. Principal Components Analysis 59
3.2. Self-Organizing Maps 59
3.3. Information-Theoretic Models 60
4. Neurodynamic Programming 61
5. Temporal Processing Using Feed-Forward Networks 62
6. Dynamically Driven Recurrent Networks 63
7. Concluding Remarks 67
References 67

CHAPTER 4 NEURAL NETWORKS IN PROCESSING AND ANALYSIS


OF BIOMEDICAL SIGNALS 69
Homayoun Nazeran and Khosrow Behbehani

1. Overview and History of Artificial Neural Networks 69


1.1. What is an Artificial Neural Network? 70
1.2. How Did ANNs Come About? 71
1.3. Attributes of ANNs 73
1.4. Learning in ANNs 74
1.4.1. Supervised Learning 74
1.4.2. Unsupervised Learning 75
1.5. Hardware and Software Implementation of ANNs 76
2. Application of ANNs in Processing Information 77
2.1. Processing and Analysis of Biomedical Signals 77
2.2. Detection and Classification of Biomedical Signals Using ANNs 77
2.3. Detection and Classification of Electrocardiography Signals 78
2.4. Detection and Classification of Electromyography Signals 81
2.5. Detection and Classification of Electroencephalography Signals 83
2.6. Detection and Classification of Electrogastrography Signals 85
2.7 Detection and Classification of Respiratory Signals 86
2.7.1. Detection of Goiter-Induced Upper Airway Obstruction 86
2.7.2. Detection of Pharyngeal Wall Vibration During Sleep 88
2.8 ANNs in Biomedical Signal Enhancement 89
2.9 ANNs in Biomedical Signal Compression 89
Additional Reading and Related Material 91
Appendix: Back-Propagation Optimization Algorithm 92
References 95

CHAPTER 5 RARE EVENT DETECTION IN GENOMIC SEQUENCES BY


NEURAL NETWORKS AND SAMPLE STRATIFICATION 98
Wooyoung Choe, Okan K. Ersoy, and Minou Bina

1. Introduction 98
2. Sample Stratification 98
3. Stratifying Coefficients 99
3.1. Derivation of a Modified Back-Propagation Algorithm 100
3.2. Approximation of A Posteriori Probabilities 102
4. Bootstrap Stratification 104
4.1. Bootstrap Procedures 104
4.2. Bootstrapping of Rare Events 105
4.3. Subsampling of Common Events 105
4.4. Aggregating of Multiple Neural Networks 105
4.5. The Bootstrap Aggregating Rare Event Neural Networks 105
5. Data Set Used in the Experiments 106
5.1. Genomic Sequence Data 106
5.2. Normally Distributed Data 1, 2 107
5.3. Four-Class Synthetic Data 113
6. Experimental Results 113
6.1. Experiments with Genomic Sequence Data 113
6.2. Experiments with Normally Distributed Data 1 115
6.3. Experiments with Normally Distributed Data 2 118
6.4. Experiments with Four-Class Synthetic Data 118
7. Conclusions 120
References 120

CHAPTER 6 AN AXIOMATIC APPROACH TO REFORMULATING


RADIAL BASIS NEURAL NETWORKS 122
Nicolaos B. Karayiannis

1. Introduction 122
2. Function Approximation Models and RBF Neural Networks 125
3. Reformulating Radial Basis Neural Networks 127
4. Admissible Generator Functions 129
4.1. Linear Generator Functions 129
4.2. Exponential Generator Functions 132
5. Selecting Generator Functions 133
5.1. The Blind Spot 134
5.2. Criteria for Selecting Generator Functions 136
5.3. Evaluation of Linear and Exponential Generator Functions 137
5.3.1. Linear Generator Functions 137
5.3.2. Exponential Generator Functions 138
6. Learning Algorithms Based on Gradient Descent 141
6.1. Batch Learning Algorithms 141
6.2. Sequential Learning Algorithms 143
7. Generator Functions and Gradient Descent Learning 144
8. Experimental Results 146
9. Conclusions 154
References 155

CHAPTER 7 SOFT LEARNING VECTOR QUANTIZATION


AND CLUSTERING ALGORITHMS BASED ON
REFORMULATION 158
Nicolaos B. Karayiannis

1. Introduction 158
2. Clustering Algorithms 159
2.1. Crisp and Fuzzy Partitions 160
2.2. Crisp c-Means Algorithm 162
2.3. Fuzzy c-Means Algorithm 164
2.4. Entropy-Constrained Fuzzy Clustering 165
3. Reformulating Fuzzy Clustering 168
3.1. Reformulating the Fuzzy c-Means Algorithm 168
3.2. Reformulating ECFC Algorithms 170
4. Generalized Reformulation Function 171
4.1. Update Equations 171
4.2. Admissible Reformulation Functions 173
4.3. Special Cases 173
5. Constructing Reformulation Functions: Generator Functions 174
6. Constructing Admissible Generator Functions 175
6.1. Increasing Generator Functions 176
6.2. Decreasing Generator Functions 176
6.3. Duality of Increasing and Decreasing Generator Functions 177
7. From Generator Functions to LVQ and Clustering Algorithms 178
7.1. Competition and Membership Functions 178
7.2. Special Cases: Fuzzy LVQ and Clustering Algorithms 180
7.2.1. Linear Generator Functions 180
7.2.2. Exponential Generator Functions 181
8. Soft LVQ and Clustering Algorithms Based on Nonlinear Generator
Functions 182
8.1. Implementation of the Algorithms 185
9. Initialization of Soft LVQ and Clustering Algorithms 186
9.1. A Prototype Splitting Procedure 186
9.2. Initialization Schemes 187
10. Magnetic Resonance Image Segmentation 188

11. Conclusions 194


Acknowledgments 195
References 196

CHAPTER 8 METASTABLE ASSOCIATIVE NETWORK MODELS OF


NEURONAL DYNAMICS TRANSITION DURING SLEEP 198
Mitsuyuki Nakao and Mitsuaki Yamamoto

1. Dynamics Transition of Neuronal Activities During Sleep 199


2. Physiological Substrate of the Global Neuromodulation 201
3. Neural Network Model 201
4. Spectral Analysis of Neuronal Activities in Neural Network Model 203
5. Dynamics of Neural Network in State Space 204
6. Metastability of the Network Attractor 206
6.1. Escape Time Distributions in Metastable Equilibrium States 206
6.2. Potential Walls Surrounding Metastable States 207
7. Possible Mechanisms of the Neuronal Dynamics Transition 210
8. Discussion 211
Acknowledgments 213
References 213

CHAPTER 9 ARTIFICIAL NEURAL NETWORKS FOR SPECTROSCOPIC


SIGNAL MEASUREMENT 216
Chii-Wann Lin, Tzu-Chien Hsiao, Mang-Ting Zeng, and
Hui-Hua Kenny Chiang

1. Introduction 216
2. Methods 217
2.1. Partial Least Squares 217
2.2. Back-Propagation Networks 218
2.3. Radial Basis Function Networks 219
2.4. Spectral Data Collection and Preprocessing 220
3. Results 221
3.1. PLS 221
3.2. BP 221
3.3. RBF 222
4. Discussion 222
Acknowledgments 231
References 231

CHAPTER 10 APPLICATIONS OF FEED-FORWARD NEURAL


NETWORKS IN THE ELECTROGASTROGRAM 233
Zhiyue Lin and J. D. Z. Chen

1. Introduction 233
2. Measurements and Preprocessing of the EGG 234
2.1. Measurements of the EGG 234
2.2. Preprocessing of the EGG Data 235
2.2.1. ARMA Modeling Parameters 235
2.2.2. Running Power Spectra 236
2.2.3. Amplitude (Power) Spectrum 238

3. Applications in the EGG 239


3.1. Detection and Deletion of Motion Artifacts in EGG Recordings 239
3.1.1. Input Data to the NN 239
3.1.2. Experimental Results 240
3.2. Identification of Gastric Contractions from the EGG 241
3.2.1. Experimental Data 241
3.2.2. Experimental Results 243
3.3. Classification of Normal and Abnormal EGGs 244
3.3.1. Experimental Data 246
3.3.2. Structure of the NN Classifier and Performance Indexes 246
3.3.3. Experimental Results 248
3.4. Feature-Based Detection of Delayed Gastric Emptying from the EGG 249
3.4.1. Experimental Data 250
3.4.2. Experimental Results 251
4. Discussion and Conclusions 252
References 253

INDEX 257

ABOUT THE EDITOR 259


PREFACE

Fuzzy set theory derives from the fact that almost all natural classes and concepts are
fuzzy rather than crisp in nature. According to Lotfi Zadeh, who is the founder of
fuzzy logic, all the reasoning that people use every day is approximate in nature.
People work from approximate data, extract meaningful information from massive
data, and find crisp solutions. Fuzzy logic provides a suitable basis for the ability to
summarize information and to extract information from masses of data.
Like fuzzy logic, the concept of neural networks was introduced approximately
four decades ago. But theoretical developments in the last decade have led to numerous
new approaches, including multiple-layer networks, Kohonen networks, and Hopfield
networks. In addition to the various structures, numerous learning algorithms have
been developed, including back-propagation, Bayesian, potential functions, and genetic
algorithms.
In Volume I, the concepts of fuzzy logic are applied, including fuzzy clustering,
uncertainty management, fuzzy set theory, possibility theory, and approximate reason-
ing for biomedical signals and biological systems. In addition, the fundamentals of
neural networks and new learning algorithms with implementations and medical appli-
cations are presented.
Chapter 1 by Bouchon-Meunier is devoted to a review of the concepts of fuzzy
logic, uncertainty management, possibility theories, and their implementations.
Chapter 2 by Geva discusses the fundamentals of fuzzy clustering and nonstation-
ary fuzzy clustering algorithms and their applications to electroencephalography and
heart rate variability signals.
Chapter 3 by Haykin gives a guided tour of neural networks, including supervised
and unsupervised learning, neurodynamical programming, and dynamically driven
recurrent neural networks.
Chapter 4 by Nazeran and Behbehani reviews in depth the classical neural network
implementations for the analysis of biomedical signals, including electrocardiography,
electromyography, electroencephalography, and respiratory signals.
Chapter 5 by Choe et al. discusses rare event detection in genomic sequences using
neural networks and sample stratification, which makes each sample in the data
sequence have equal influence during the learning process.
Chapters 6 and 7 by Karayiannis are devoted to the soft learning vector quantiza-
tion and clustering algorithms based on reformulation and an axiomatic approach to
reformulating radial-basis neural networks.


Chapter 8 by Nakao and Yamamoto discusses the metastable associative network


models of neural dynamics during sleep to understand the underlying mechanism for
the neural dynamics transition.
Chapter 9 by Lin et al. is devoted to the applications of multivariate analysis
methods including the partial least-squares method and the neural networks based
on back-propagation and radial-basis functions to measure the glucose concentration
from near-infrared spectra.
Chapter 10 by Lin and Chen discusses the applications of feed-forward neural
networks for the analysis of the surface electrogastrogram signals to detect and elim-
inate motion artifacts in electrogastrogram recordings and classify abnormal and
normal electrogastrogram signals.
First, I am grateful to the contributors for their help, support and understanding
throughout the preparation of this volume. I also thank IEEE Press Associate
Acquisitions Editor Linda Matarazzo, Production and Manufacturing Manager
Savoula Amanatidis, and Production Editor Surendra Bhimani for their help and
technical support.
Finally, many thanks to my wife, Dr. Yasemin M. Akay of Dartmouth Medical
School, and my son, Altug R. Akay, for their care and sacrifices.

Metin Akay
Dartmouth College
Hanover, NH
LIST OF CONTRIBUTORS

Khosrow Behbehani
Joint Biomedical Engineering Program
University of Texas at Arlington and University of Texas Southwestern Medical Center at Dallas
P.O. Box 19138
Arlington, TX 76019 USA
E-mail: kb@uta.edu

Minou Bina
Purdue University
Dept. of Chemistry
Brown Bldg. 313ID
West Lafayette, IN 47907 USA
Phone: +1 765 494 5294
Fax: +1 765 494 0239
E-mail: bina@purdue.edu

Bernadette Bouchon-Meunier
Laboratoire d'Informatique de Paris 6
Universite Pierre et Marie Curie - CNRS (UMR 7606)
Case courrier 169
4 Place Jussieu
75252 Paris Cedex 05, FRANCE
Fax: +33 1 44 27 70 00
E-mail: Bernadette.Bouchon-Meunier@lip6.fr

J.D.Z. Chen
University of Texas Medical Branch at Galveston
Galveston, TX 77555-0632 USA
Phone: +1 409 747 3071
E-mail: jianchen@utmb.edu

Hui-Hua Kenny Chiang
Institute of Biomedical Engineering
National Yang-Ming University
No. 155, Sec. 2, Li-Lung St., Pei-Tou
Taipei, TAIWAN, R.O.C.
Phone: +886 2 2826 7027
Fax: +886 2 2821 0847
E-mail: chiang@bme.ym.edu.tw

Wooyoung Choe
Purdue University
School of Electrical and Computer Engineering
Electrical Engineering Building
West Lafayette, IN 47907 USA
E-mail: wooyoung@ecn.purdue.edu

Okan K. Ersoy
Purdue University
School of Electrical and Computer Engineering
Electrical Engineering Building
West Lafayette, IN 47907 USA
Phone: +1 765 494 6162
E-mail: ersoy@ecn.purdue.edu

Amir B. Geva, D.Sc.
Bioelectric Laboratory, Head
Electrical and Computer Engineering Department
Ben-Gurion University of the Negev
P.O.B. 653, Beer-Sheva 84105, ISRAEL
Phone: +972 7 6472408
Fax: +972 7 6472949
E-mail: geva@ee.bgu.ac.il
Home page: http://www.ee.bgu.ac.il/~geva/

Simon Haykin
McMaster University
CRL-201
1280 Main Street West
Hamilton, ON CANADA L8S 4K1
Phone: +1 905 525 9140, Ext. 24291
Fax: +1 905 521 2922
E-mail: haykin@synapse.CRL.McMaster.CA

Tzu-Chien Hsiao
Institute of Biomedical Engineering
National Yang-Ming University
No. 155, Sec. 2, Li-Lung St., Pei-Tou
Taipei, TAIWAN, R.O.C.
Phone: +886 2 2826 7027
Fax: +886 2 2821 0847
E-mail: hsiao@bme.ym.edu.tw

Nicolaos B. Karayiannis, Ph.D.
The University of Houston
Dept. of Electrical and Computer Engineering
4800 Calhoun
Houston, TX 77204-4793 USA
Phone: 713 743 4436
Fax: 713 743 4444
E-mail: Karayiannis@uh.edu

Chii-Wann Lin, Ph.D.
National Taiwan University
Institute of Biomedical Engineering
College of Medicine and College of Engineering
No. 1, Sec. 1, Jen-Ai Road
Taipei, TAIWAN, 100, R.O.C.
Phone: +886 2 23912217
Fax: +886 2 23940049
E-mail: cwlin@cbme.mc.ntu.edu.tw

Zhiyue Lin
University of Kansas Medical Center
Dept. of Medicine
Kansas City, KS 66160 USA
Phone: +1 913 588 7729

Mitsuyuki Nakao
Laboratory of Neurophysiology and Bioinformatics
Graduate School of Information Sciences
Tohoku University, Aoba-yama 05
Sendai 980-8579, JAPAN
Phone: +81 22 217 7178
Fax: +81 22 217 7178
E-mail: nakao@ecei.tohoku.ac.jp

Homayoun Nazeran, Ph.D.
Flinders University of South Australia
School of Informatics and Engineering
GPO Box 2100, Adelaide
SOUTH AUSTRALIA 5001
Phone: +61 8 8201 3604 (Office), +61 8 8201 3606 (Laboratory)
Fax: +61 8 8201 3618
E-mail: Homer.Nazeran@flinders.edu.au

Mitsuaki Yamamoto
Laboratory of Neurophysiology and Bioinformatics
Graduate School of Information Sciences
Tohoku University, Aoba-yama 05
Sendai 980-8579, JAPAN
Phone: +81 22 217 7178
Fax: +81 22 217 7178
E-mail: yamamoto@ecei.tohoku.ac.jp

Man-Ting Zeng
Institute of Biomedical Engineering
National Yang-Ming University
No. 155, Sec. 2, Li-Lung St., Pei-Tou
Taipei, TAIWAN, R.O.C.
Phone: +886 2 2826 7027
Fax: +886 2 2821 0847
E-mail: zeng@bme.ym.edu.tw
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.

CHAPTER 1
UNCERTAINTY MANAGEMENT IN MEDICAL APPLICATIONS

Bernadette Bouchon-Meunier

1. INTRODUCTION

In biology and medicine, as well as in many other domains, imperfect knowledge
cannot be avoided. It is difficult to construct automatic systems to provide classification
or pattern recognition tools or help specialists make a decision. There exist two kinds of
difficulties: (1) those related to the type of imperfection we have to consider (partial
information, uncertainties, inaccuracies) and (2) those due to the type of problem we
have to solve (e.g., images to process, expert rules, databases).
Which mathematical model are we supposed to choose to manage this imperfect
knowledge? What is the best knowledge representation for a given problem? The
answers to such questions are not obvious, and our purpose is to present several frame-
works available to represent and manage imperfect knowledge, particularly in biologi-
cal and medical domains. We indicate the principles, value, and limits of these
frameworks. We give more details about the numerical approaches, which have given rise
to more practical applications, than about the symbolic approaches, which will be
mentioned only briefly.

2. IMPERFECT KNOWLEDGE

2.1. Types of Imperfections


Imperfections may have several forms, which we present briefly.

2.1.1. Uncertainties

Imperfections are called uncertainties when there is doubt about the validity of a
piece of information. This means that we are not certain that a statement is true or false
because of

• The random behavior of a phenomenon (for instance, the factors of transmission
of genetic features) related to probabilistic uncertainty.
• The reliability or limited soundness of an observer of the phenomenon who
expresses the statement, or of the sensor used for a measurement. The uncer-
tainty is then nonprobabilistic.


Uncertainties can be represented either by numbers, such as probabilities or confidence
degrees indicating the extent to which we are certain of the validity of a statement,
or by phrases such as "I believe that..." or "it is possible that..."

2.1.2. Imprecisions

The second type of imperfection is imprecision, when some characteristics of a
phenomenon cannot be described accurately. Imprecisions have two main forms:
approximate values (for instance, the limits of the normal glycemia level at a given
age are not sensitive to a variation of 1‰) or vague descriptions using terms of natural
language (for instance, "a high temperature" or "frequent attacks").

2.1.3. Incompleteness

Incomplete knowledge is the last kind of imperfection, in which there is a lack of
information about some variables or criteria or elements of a given situation. Such
incompleteness can appear because of defaults in knowledge acquisition (for instance,
the age of a patient has not been recorded) or because of general rules or facts that are
usually true but admit a few exceptions, the list of which is impossible to give (for
instance, generally, the medication X does not cause any drowsiness).

2.1.4. Causes of Imperfect Knowledge

These imperfections may have various causes:


• They can be related to conditions of observation that are insufficient to obtain
the necessary accuracy (for instance, in the case of radiographic images).
• They can be inherent in the phenomenon itself. This is often the case in biology
or medicine, because natural factors often have no precise value or precise limit
available for all patients. Conditions or values of criteria vary in a given situa-
tion (e.g., the size and shape of malignant microcalcifications in breast cancer).
It happens that several forms of imperfection cannot be managed independently.
For instance, uncertainties are generally present at the same time as inaccuracies, and
incompleteness entails uncertainties. It is then necessary to find the knowledge repre-
sentation suitable for all the existing imperfections.

2.2. Choice of a Method


The choice of a method to process data is linked to the choice of knowledge
representation, which can be numerical, symbolic, logical, or semantic, and it depends
on the nature of the problem to be solved: classification, automatic diagnosis, or
decision support, for instance. The available knowledge can consist of images or data-
bases containing factual information or expert knowledge provided by specialists in the
domain. They are, in some cases, directly managed by an appropriate tool, such as an
expert system or pattern recognition method if the object to identify on images is not
too variable, for instance. In other cases, learning is necessary as a preliminary step in
the construction of an automatic system. This means that examples of well-known
situations are given and assigned to a class, a diagnosis, a decision, or more generally
a label by a specialist. On the basis of these examples, a general method is constructed
to perform a similar assignment in new situations, for instance, by inductive learning,
case-based reasoning, or neural networks.
It is also possible that explanations are required for the reasons leading the system
to a given diagnosis or choice of a label. This is useful, for instance, if the
automatic system is intended for training purposes. Such problems of human-machine commu-
nication are studied in artificial intelligence.
We indicate briefly in Table 1 the main knowledge representation and management
methods corresponding to the three kinds of imperfection we have mentioned. In the
following, we will focus on the numerical methods listed in the bold frame in Table 1.
There are other kinds of methods that are not directly dedicated to one of the
imperfections we have mentioned but provide numerical approaches to data manage-
ment, such as chaos, fractals, wavelets, neural networks, and genetics-based program-
ming, which are also intensively used, especially in medicine.
All these tools have their own advantages as well as some disadvantages. It is
therefore interesting to use several of them as complementary elements of a general
data processing system, taking advantage of synergy between them such that qualities
of one method compensate for disadvantages of another one. For instance, fuzzy logic
is used for its ability to manage imprecise knowledge, but it can take advantage of the
ability of neural networks to learn coefficients or functions. Such an association of
methods is typical of so-called soft computing, which was initiated by L.A. Zadeh in
the 1990s and provides interesting results in many real-world applications. In the next
sections, we present the fundamentals of the main numerical methods mentioned in
Table 1. For more details, see the books or basic papers indicated at the end of this
chapter [1-10].

TABLE 1 Classification of Methods for the Management of Imperfect Knowledge

Type of imperfection        Representation method             Management method

Uncertainties               Symbolic beliefs                  Modal logic
                                                              Truth maintenance systems
                                                              Autoepistemic logic
                            Probabilities                     Probabilistic logic
                                                              Bayesian induction
                                                              Belief networks
                            Confidence degrees                Propagation of degrees
                            Belief, plausibility measures     Evidence theory
                            Possibility, necessity degrees    Possibilistic logic
                                                              Fuzzy logic

Imprecisions                Fuzzy sets                        Fuzzy set-based techniques
                            Error intervals                   Interval analysis
                            Validity frequencies              Numerical quantifiers

General laws, exceptions    Hypotheses                        Hypothetical reasoning
                            Default rules                     Default reasoning

3. FUZZY SET THEORY

3.1. Introduction to Fuzzy Set Theory


Fuzzy set theory, introduced in 1965 by Zadeh [11], provides knowledge represen-
tation suitable for biological and medical problems because it enables us to work with
imprecise information as well as some type of uncertainty. We present such a repre-
sentation using an example. Let us think of the glycemia level of patients. We can use a
threshold of 1.4 g/l, providing two classes of levels: those at most equal to the threshold,
labeled "normal," and those greater than the threshold, labeled "abnormal."
The transition from one label to the other appears too abrupt because a level of
1.39 g/l is considered normal and a level of 1.41 g/l is considered abnormal. Instead, we
can establish a progressive passage from the class of normal levels to the class of
abnormal ones and consider that the level is normal up to 1.3 g/l; that the greater the
level between 1.3 and 1.5 g/l, the less normal this level; and finally that the level is
considered really abnormal when greater than 1.5 g/l. We then define a fuzzy set A of
the set X of possible values of the glycemia level by means of a membership function f_A,
which associates a coefficient f_A(x) in [0, 1] to every element x of X. This coefficient
indicates the extent to which x belongs to A (see Figure 1).
The main novelty of fuzzy set theory compared with classical set theory is the
concept of partial membership of an element in a class or a category. This corresponds
to the idea that a level can be "somewhat abnormal." The possibility of representing
gradual knowledge stems from this concept, such as "the more the value increases
between given limits, the more abnormal the level," and of allowing progressive passage
from one class (the class of normal levels) to another one (the class of abnormal levels).
This possibility justifies the use of such a knowledge representation for modeling bio-
logical phenomena, in which there is generally no strict boundary between neighboring
situations.
Such a representation is also interesting because it can be adjusted to the environ-
ment. If the observed patients are elderly, the membership function of the class of
abnormal glycemia levels indicated in Figure 1 must be shifted 0.4 g/l to the right.
Another advantage of this approach is that one can set up an interface between numer-
ical values (1.3 g/l) and symbolic ones expressed in natural language (normal level). For
instance, a young patient with a glycemia level of 1.7 g/l (numerical value) is associated
with the symbolic value "abnormal." Conversely, a new patient with no record in a
hospital can indicate that he had an abnormal glycemia level in the past; this symbolic
information will be taken into account together with measured (numerical) levels
obtained in the future.

Figure 1 Fuzzy set A representing the category "abnormal" of the glycemia rate (membership 0 up to 1.3 g/l, rising to 1 at 1.5 g/l).
It is easy to see that fuzzy sets are useful for representing imprecise knowledge with
ill-defined boundaries, such as approximate values or vague characterizations (see
Figure 2). Such a representation is also compatible with the representation of some
kinds of uncertainty by means of possibility theory, which we will develop later.

Figure 2 Fuzzy set B representing the approximate value "about 1.4 g/l" of the glycemia rate.
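To make the glycemia example concrete, here is a minimal Python sketch of the two membership functions discussed above. The breakpoints (1.3 and 1.5 g/l for the class "abnormal") follow the text; the function names and the triangular shape assumed for "about 1.4 g/l" are illustrative choices, not the chapter's own implementation.

```python
def f_abnormal(x):
    """Membership in the fuzzy set A = "abnormal glycemia level" (Figure 1):
    0 up to 1.3 g/l, rising linearly to 1 at 1.5 g/l, then 1."""
    if x <= 1.3:
        return 0.0
    if x >= 1.5:
        return 1.0
    return (x - 1.3) / (1.5 - 1.3)


def f_about_1_4(x):
    """Membership in the fuzzy set B = "about 1.4 g/l" (Figure 2), sketched
    here as a triangle centered at 1.4 g/l with support ]1.3, 1.5[ (an assumption)."""
    if 1.3 < x <= 1.4:
        return (x - 1.3) / 0.1
    if 1.4 < x < 1.5:
        return (1.5 - x) / 0.1
    return 0.0


if __name__ == "__main__":
    for level in (1.25, 1.35, 1.40, 1.45, 1.70):
        print(f"{level:.2f} g/l: abnormal={f_abnormal(level):.2f}, "
              f"about 1.4={f_about_1_4(level):.2f}")
```

Running the sketch shows the gradual transition the text describes: a level of 1.35 g/l is "somewhat abnormal" (degree 0.25), while 1.70 g/l is fully abnormal (degree 1).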

3.2. Main Basic Concepts of Fuzzy Set Theory


3.2.1. Definitions

For a given universe X, a classical subset C is defined by a characteristic function
χ_C : X → {0, 1}, whereas a fuzzy set A is defined by a membership function f_A : X → [0, 1].
Classical (or crisp) subsets of X are thus particular cases of fuzzy sets, corresponding to
membership functions taking only the values 0 and 1.
Some particular elements are of interest in describing a fuzzy set:
Its support:

supp(A) = {x ∈ X : f_A(x) ≠ 0}    (1)

Its height:

h(A) = sup_{x ∈ X} f_A(x)    (2)

Its kernel or core:

ker(A) = {x ∈ X : f_A(x) = 1}    (3)

Its cardinality:

|A| = Σ_{x ∈ X} f_A(x)    (4)

Fuzzy sets with a nonempty kernel and a height equal to 1 are called normalized.

In the example given in Figure 1, with a continuous membership function, we have
ker(A) = [1.5, +∞[, supp(A) = ]1.3, +∞[, h(A) = 1.
Let us remark that a fuzzy set can have several interpretations, depending on the
situation:
• Partial membership (fA(x) is the membership degree of x in the class A)
• Preference (fA(x) is the degree of preference attached to x)
• Typicality (fA(x) is the degree of typicality of x in the class A)
• Possibility (fA(x) is the degree of possibility that x is the value of a variable
defined on X)
Two fuzzy sets A and B of X are equal if and only if

∀x ∈ X, f_A(x) = f_B(x)    (5)

3.2.2. Operations on Fuzzy Sets

We define the inclusion of fuzzy sets of X as a partial order such that A is included
in B, denoted A ⊆ B, if and only if

∀x ∈ X, f_A(x) ≤ f_B(x)    (6)

with:
• The empty set ∅ (∀x ∈ X, f_∅(x) = 0) as smallest element
• The universe X itself (∀x ∈ X, f_X(x) = 1) as greatest element
It is then necessary to define operations on fuzzy sets extending the operations on
crisp subsets of X.
The intersection of A and B (Figure 3) is defined as the fuzzy set C = A Π B of X
with the following membership function:

∀x ∈ X, f_C(x) = min(f_A(x), f_B(x))    (7)

The union of A and B (Figure 4) is defined as the fuzzy set D = A U B of X with the
following membership function:

∀x ∈ X, f_D(x) = max(f_A(x), f_B(x))    (8)

Figure 3 Intersection of fuzzy sets A and B.

Figure 4 Union of fuzzy sets A and B.

Figure 5 Complement of a fuzzy set A.

The properties of intersection and union of crisp subsets of X are preserved by
these definitions: associativity of ∩ and ∪, commutativity of ∩ and ∪, A ∪ ∅ = A,
A ∪ X = X, A ∩ X = A, A ∩ ∅ = ∅, A ∩ B ⊆ A ⊆ A ∪ B, and distributivity of ∩ over
∪ and, conversely, of ∪ over ∩.
We now define the complement of a fuzzy set A of X (Figure 5) as the fuzzy set A^c
of X with the following membership function:

∀x ∈ X, f_{A^c}(x) = 1 − f_A(x)    (9)

This definition preserves almost all the properties available in classical set theory,
except the following ones:
• A^c ∩ A ≠ ∅
• A^c ∪ A ≠ X
which means that a class and its complement may overlap, in agreement with the basic
idea of partial membership in fuzzy set theory.
In some cases, it can be interesting to lose some other properties and to use
definitions of intersection and union with a slightly different behavior.
The most common alternative operators are triangular norms (t-norms) T: [0, 1] ×
[0, 1] → [0, 1] to define the intersection and triangular conorms (t-conorms) ⊥: [0, 1] ×
[0, 1] → [0, 1] to define the union. These operators were introduced in probabilistic
metric spaces and they are
• Commutative
• Associative
• Monotonic
• Such that T(x, 1) = x and ⊥(x, 0) = x for any x in [0, 1]

It is easy to check that min is a t-norm and max a t-conorm, which are dual in the
following sense:
• 1 − T(x, y) = ⊥(1 − x, 1 − y)
• 1 − ⊥(x, y) = T(1 − x, 1 − y)
The other widely used t-norms are the product T(x, y) = xy and the so-called
Lukasiewicz t-norm T(x, y) = max(x + y − 1, 0), respectively dual to the following
t-conorms: ⊥(x, y) = x + y − xy and ⊥(x, y) = min(x + y, 1) (Figures 6 and 7).

Figure 6 Intersection of A and B based on the Lukasiewicz t-norm T(x, y) = max(x + y − 1, 0).

Figure 7 Union of A and B based on the Lukasiewicz t-conorm ⊥(x, y) = min(x + y, 1).
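Because the operations of Eqs. (7)–(9) and the alternative t-norm/t-conorm pairs act pointwise on membership degrees, they are straightforward to sketch in code. The Python fragment below (function names and the sample degrees are illustrative assumptions) contrasts the standard min/max pair with the Lukasiewicz pair:

```python
def t_min(x, y):            # standard intersection, Eq. (7)
    return min(x, y)

def s_max(x, y):            # standard union, Eq. (8)
    return max(x, y)

def t_lukasiewicz(x, y):    # Lukasiewicz t-norm: max(x + y - 1, 0)
    return max(x + y - 1.0, 0.0)

def s_lukasiewicz(x, y):    # Lukasiewicz t-conorm: min(x + y, 1)
    return min(x + y, 1.0)

def complement(x):          # f_{A^c}(x) = 1 - f_A(x), Eq. (9)
    return 1.0 - x

if __name__ == "__main__":
    fa, fb = 0.7, 0.6       # membership degrees of some x in A and B (arbitrary values)
    print(t_min(fa, fb), s_max(fa, fb))                  # 0.6 and 0.7
    print(t_lukasiewicz(fa, fb), s_lukasiewicz(fa, fb))  # ~0.3 and 1.0
    # A class and its complement may overlap (partial membership):
    print(t_min(fa, complement(fa)))                     # ~0.3, not 0
```

The last line illustrates the remark above: with fuzzy sets, A^c ∩ A is generally not empty.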

When several universes are considered simultaneously—for instance, several criteria
to make a decision, attributes to describe an object, variables to control a system—
teria to make a decision, attributes to describe an object, variables to control a system—
it is necessary to define the Cartesian product of fuzzy sets of the various universes. This
situation is very frequent, because a decision, a diagnosis, the recognition of a class, and
so forth are generally based on the use of several factors involved simultaneously.
Let us consider universes X_1, X_2, ..., X_r and their Cartesian product
X = X_1 × X_2 × ··· × X_r, the elements of which are r-tuples (x_1, x_2, ..., x_r), with
x_1 ∈ X_1, ..., x_r ∈ X_r. From fuzzy sets A_1, A_2, ..., A_r, respectively defined on
X_1, X_2, ..., X_r, we construct a fuzzy set of X denoted by A = A_1 × A_2 × ··· × A_r,
considered as their Cartesian product, with membership function

∀x = (x_1, x_2, ..., x_r) ∈ X, f_A(x) = min(f_{A_1}(x_1), ..., f_{A_r}(x_r))    (10)

3.2.3. The Zadeh Extension Principle

Another important concept of fuzzy set theory is the so-called Zadeh extension
principle, enabling us to extend to fuzzy values the operations or tools used in classical
set theory or mathematics. Let us explain how it works. Fuzzy sets of X are imperfect
information about the elements of X. For instance, instead of observing x precisely, we
can only perceive a fuzzy set of X with a high membership degree attached to x. The
methods that would be available to manage the information regarding X in the case of
precise information need to be adapted to be able to manage fuzzy sets.
We consider a mapping φ from a first universe X to a second one Y, which can be
identical to X. The Zadeh extension principle defines a fuzzy set B of Y from a fuzzy set
A of X, in agreement with the mapping φ, in the following way:

∀y ∈ Y, f_B(y) = sup_{x ∈ φ*(y)} f_A(x) if φ*(y) ≠ ∅    (11)

and f_B(y) = 0 otherwise, with φ*(y) = {x ∈ X : y = φ(x)} if φ: X → Y, and φ*(y) = {x ∈
X : y ∈ φ(x)} if φ: X → P(Y) (i.e., φ is multivalued).
If A is a crisp subset of X reduced to a singleton {a}, the Zadeh extension principle
constructs a fuzzy set B of Y reduced to φ({a}).
If φ is a one-to-one mapping, then

∀y ∈ Y, f_B(y) = f_A(φ^{-1}(y))    (12)

If we consider the Cartesian product of universes X = X_1 × X_2 × ··· × X_r and A
the Cartesian product of fuzzy sets of these universes A = A_1 × A_2 × ··· × A_r, the
Zadeh extension principle associates a fuzzy set B of Y with A as follows:

∀y ∈ Y, f_B(y) = sup_{x = (x_1, ..., x_r) ∈ φ*(y)} min(f_{A_1}(x_1), ..., f_{A_r}(x_r)) if φ*(y) ≠ ∅    (13)

and f_B(y) = 0 otherwise


For example, let us consider the fuzzy set A representing "about 1.4" on the
universe X = [0, +∞[, as defined in Figure 2. If we know that the value of variable
W defined on X is greater than the value of variable V and that the value of V is about
1.4, we can characterize the value of W by the fuzzy set B obtained by applying the
extension principle to the order relation on [0, +∞[. We have Y = [0, +∞[ and
φ(x) = {y ∈ Y : y > x}. We get

∀y ∈ Y, f_B(y) = sup_{x < y} f_A(x)
f_B(y) = 0 if y ≤ 1.3,   f_B(y) = 1 if y > 1.4    (14)

which corresponds to a representation of "greater than about 1.4."


Another example of application of application of the extension principle defines a
distance between imprecise locations. Let us consider a set of points Z = {a, b, c, d}. The
distance between any pair of points of Z is defined by a mapping φ: Z x Z -*■ [0, +oo[.
If the points are observed imprecisely, we need to extend the notion of distance to fuzzy
sets.
We use the extension principle with X = Z x Z and Y = [0, +oo[, and we get a
fuzzy set C of [0, +oo[ with a membership function defined for any d e [0, +oo[ by

fc(d) = supl{xy)eX^y<Kxy)=d]mm(fA(x),fB(y))
if {(x, y)eX,xjiy, φ(χ, y) = d] φ 0 (15)
fc(d) = 0 otherwise

This kind of distance can be used, for instance, in image processing.


The Zadeh extension principle is fundamental for extending to fuzzy sets all the
concepts we are familiar with in classical set theory, for instance, in reasoning or
arithmetic.
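On a discretized universe, Eq. (11) reduces to taking, for every image point y, the maximum of f_A over the preimage of y. A minimal Python sketch for a single-valued mapping φ follows; the grid, the membership values, and the unit-conversion mapping used as an example are illustrative assumptions rather than the chapter's own data:

```python
def extend(f_A, phi, Y):
    """Zadeh extension principle (Eq. 11) for a single-valued mapping phi
    on a finite, discretized universe.

    f_A : dict mapping x -> membership degree in A
    phi : function X -> Y
    Y   : iterable of candidate image values y
    """
    f_B = {}
    for y in Y:
        preimage = [f_A[x] for x in f_A if phi(x) == y]
        f_B[y] = max(preimage) if preimage else 0.0
    return f_B


if __name__ == "__main__":
    # Fuzzy set "about 1.4" on a coarse grid of glycemia levels (illustrative values)
    f_A = {1.2: 0.0, 1.3: 0.5, 1.4: 1.0, 1.5: 0.5, 1.6: 0.0}
    # Example mapping: convert g/l to mg/dl, phi(x) = 100 * x
    f_B = extend(f_A, lambda x: round(100 * x), range(110, 171, 10))
    print(f_B)   # fuzzy set "about 140 mg/dl"
```

Points of Y with an empty preimage receive degree 0, as required by the principle.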

3.3. Fuzzy Arithmetic


Arithmetic is precisely one of the domains where fuzzy sets are widely used. Many
applications use the universe ℝ of real numbers, with fuzzy sets representing imprecise
measurements of real-valued variables (e.g., distance, weight). The membership func-
tions are generally chosen as simple as possible, compatible with the intuitive represen-
tation of approximations. This simple form of functions corresponds to convex fuzzy
sets.
A fuzzy set F on X is convex if it satisfies the following condition:

∀(x, y) ∈ ℝ × ℝ, ∀z ∈ [x, y], f_F(z) ≥ min(f_F(x), f_F(y))    (16)

• A fuzzy quantity Q is a normalized fuzzy set of ℝ.
• A modal value of Q is an element m of ℝ in the kernel of Q, such that f_Q(m) = 1.
• A fuzzy interval I is a convex fuzzy quantity. It corresponds to an interval of ℝ
with imprecise boundaries.
• A fuzzy number M is a fuzzy interval with an upper semicontinuous membership
function, a compact support, and a unique modal value. It corresponds to an
imprecisely known real value.
It is often necessary to compute the addition or the product of imprecisely known
real values. For instance, if a patient has lost approximately 5 pounds during the first
week and 3 pounds during the second one, how much has he lost during these two
weeks? Symbolically, we can conclude that he has lost approximately 8 pounds, but we
need to formalize this operation to define automatic operations for more complex
problems. We use the Zadeh extension principle to extend the classical arithmetic
operations to fuzzy quantities. We do not go into detail with the general definition of
fuzzy quantities. We focus on particular forms of membership functions for which the
main operations are easily computable. They are called L-R fuzzy intervals.
An L-R fuzzy interval I is a fuzzy quantity with a membership function f_I defined
by means of four real parameters (m, m′, a, b) with a and b strictly positive, and two
functions L and R, defined on ℝ+, lying in [0, 1], upper semicontinuous, nonincreasing,
such that

L(0) = R(0) = 1
L(1) = 0, or L(x) > 0 ∀x with lim_{x→∞} L(x) = 0
R(1) = 0, or R(x) > 0 ∀x with lim_{x→∞} R(x) = 0    (17)

The membership function of an L-R fuzzy interval defined by m, m′, a, and b is then

f_I(x) = L((m − x)/a) if x < m
f_I(x) = 1 if m ≤ x ≤ m′    (18)
f_I(x) = R((x − m′)/b) if x > m′

We note I = (m, m′, a, b)_LR. It can be interpreted as "approximately between m and
m′."
The particular case of an L-R fuzzy interval (n, n, a, b)_LR is an L-R fuzzy number
denoted by M = (n, a, b)_LR. It can be interpreted as "approximately n."
Fuzzy quantities often have trapezoidal or triangular membership functions. They
are then L-R fuzzy intervals or numbers, with R(x) = L(x) = max(0, 1 − x). It is also
possible to use functions such as max(0, 1 − x²), max(0, 1 − x)², or exp(−x) to define R
and L.
Given two L-R fuzzy intervals defined by the same functions L and R, respectively
denoted by I = (m, m′, a, b)_LR and J = (n, n′, c, d)_LR, the main arithmetic operations
can be computed very simply, as follows:
• For the opposite of I: −I = (−m′, −m, b, a)_LR
• For the addition: I ⊕ J = (m + n, m′ + n′, a + c, b + d)_LR
• For the subtraction: I ⊖ J = (m − n′, m′ − n, a + d, b + c)_LR if L = R
• For the product: I ⊗ J is generally not an L-R fuzzy interval, but it is possible to
approximate it by the following L-R fuzzy interval:

I ⊗ J = (mn, m′n′, mc + na, md + nb)_LR    (19)

These operations satisfy the classical properties of the analogous operations in


classical mathematics except for some of them. For instance, Q ⊕ (−Q) is different
from 0, but it accepts 0 as its modal value; it can be interpreted as "approximately
null."
For example, if I is a triangular fuzzy number with modal value 4 and support
]3, 5[ and J a triangular fuzzy number with modal value 8 and support ]6, 10[, we
represent them as L-R fuzzy numbers I = (4, 4 − 3, 5 − 4)_LR = (4, 1, 1)_LR,
J = (8, 8 − 6, 10 − 8)_LR = (8, 2, 2)_LR.
Then we obtain the following results:
• −I = (−4, 1, 1)_LR is a triangular fuzzy number with modal value −4 and with
support ]−5, −3[.
• I ⊕ J = (12, 3, 3)_LR is a fuzzy number with modal value 12 and support ]9, 15[.
• J ⊖ I = (8 − 4, 2 + 1, 2 + 1)_LR = (4, 3, 3)_LR is a triangular fuzzy number with
modal value 4 and support ]1, 7[.
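These rules translate directly into a few lines of code. The sketch below represents a triangular L-R fuzzy number as a tuple (m, m′, a, b) and reproduces the worked example above; the tuple encoding and the function names are illustrative choices, not the chapter's notation:

```python
def opposite(I):
    m, mp, a, b = I
    return (-mp, -m, b, a)                      # -I = (-m', -m, b, a)

def add(I, J):
    m, mp, a, b = I
    n, np_, c, d = J
    return (m + n, mp + np_, a + c, b + d)      # I (+) J

def sub(I, J):
    m, mp, a, b = I
    n, np_, c, d = J
    return (m - np_, mp - n, a + d, b + c)      # I (-) J, assuming L = R

if __name__ == "__main__":
    I = (4, 4, 1, 1)     # triangular fuzzy number "about 4", support ]3, 5[
    J = (8, 8, 2, 2)     # triangular fuzzy number "about 8", support ]6, 10[
    print(opposite(I))   # (-4, -4, 1, 1): modal value -4, support ]-5, -3[
    print(add(I, J))     # (12, 12, 3, 3): modal value 12, support ]9, 15[
    print(sub(J, I))     # (4, 4, 3, 3):   modal value 4,  support ]1, 7[
```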

3.4. Fuzzy Relations


Because fuzzy set theory represents a generalization of classical set theory, we need
to generalize all the classical tools available to manage crisp data. Fuzzy relations are
among the most important concepts in fuzzy set theory.
A fuzzy relation R between X and Y is defined as a fuzzy set of X × Y.

An example of fuzzy relation can be defined on X = Y = ℝ to represent approx-
imate equality between real values, for instance, with the following membership func-
tion:

∀x ∈ X, ∀y ∈ Y, f_R(x, y) = 1 / (1 + (x − y)²)    (20)
Another example, also defined on X = Y = ℝ, is a representation of the relation
"y is really greater than x" with the following membership function:

∀(x, y) ∈ ℝ², f_R(x, y) = min(1, (y − x)/β) if y > x, and 0 otherwise    (21)

for a parameter β > 0 indicating the range of difference between x and y we accept.
If we have three universes X, Y, and Z, it is useful to combine fuzzy relations
between them. The max-min composition of two fuzzy relations R_1 on X × Y and R_2 on
Y × Z defines a fuzzy relation R = R_1 ∘ R_2 on X × Z, with membership function:

∀(x, z) ∈ X × Z, f_R(x, z) = sup_{y ∈ Y} min(f_{R_1}(x, y), f_{R_2}(y, z))    (22)

The main utilizations of fuzzy relations concern the representation of resemblances


("almost equal") or orders ("really smaller"). We need to define general classes of fuzzy
relations suitable for such representations, based on particular properties of fuzzy
relations: symmetry, reflexivity, transitivity, antisymmetry, extending the analogous
properties of classical binary relations.
A similarity relation is a symmetrical, reflexive and max-min transitive fuzzy rela-
tion. It corresponds to the idea of resemblance and it can be used in classification,
clustering, and analogical reasoning, for instance.
A fuzzy preorder is a reflexive and transitive fuzzy relation R. If R is also anti-
symmetrical, R is a fuzzy order relation. It corresponds to the idea of ordering or
anteriority and it is useful in decision making, for instance, for the analysis of prefer-
ences or for temporal ordering of events.
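For finite universes, the max-min composition of Eq. (22) is just a double loop over the relation tables. A minimal Python sketch follows; the nested-dictionary encoding and the small patient–symptom–disease example are illustrative assumptions:

```python
def max_min_composition(R1, R2):
    """Max-min composition R = R1 o R2 (Eq. 22) for finite universes.

    R1[x][y] and R2[y][z] hold the membership degrees of the pairs (x, y) and (y, z);
    missing pairs are treated as degree 0.
    """
    R = {}
    for x, row in R1.items():
        R[x] = {}
        targets = {z for y in row for z in R2.get(y, {})}
        for z in targets:
            R[x][z] = max(min(row[y], R2[y].get(z, 0.0)) for y in row if y in R2)
    return R


if __name__ == "__main__":
    # Illustrative degrees: a patient-symptom relation composed with a symptom-disease relation
    patient_symptom = {"p1": {"s1": 0.9, "s2": 0.3}}
    symptom_disease = {"s1": {"d1": 0.7, "d2": 0.2}, "s2": {"d1": 0.4, "d2": 1.0}}
    print(max_min_composition(patient_symptom, symptom_disease))
    # p1 -> d1: max(min(0.9, 0.7), min(0.3, 0.4)) = 0.7
    # p1 -> d2: max(min(0.9, 0.2), min(0.3, 1.0)) = 0.3
```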

4. POSSIBILITY THEORY

4.1. Possibility Measures


Fuzzy set theory provides a representation of imprecise knowledge. It does not
present any immediate representation of uncertain knowledge, which is nevertheless
necessary to reason with imprecise knowledge. Let us consider the precise and certain
rule "if the patient is at least 40 years old, then require a mammography." Imprecise
information such as "the patient is approximately 40 years old" leads to an uncertain
conclusion, "we are not certain that the mammography is required." This simple
example shows that imprecision and uncertainty are closely related.
Possibility theory was introduced in 1978 by Zadeh [12] to represent nonprobabil-
istic uncertainty linked with imprecise information in order to enable reasoning on
imperfect knowledge. It is based on two measures defined for any subset of a given
universe X, the possibility and the necessity measure. Let P(X) denote the set of subsets
of the universe X.
A possibility measure is a mapping Π: P(X) → [0, 1], such that
i. Π(∅) = 0, Π(X) = 1,    (23)
ii. ∀A_1 ∈ P(X), A_2 ∈ P(X), ..., Π(∪_{i=1,2,...} A_i) = sup_{i=1,2,...} Π(A_i).    (24)
In the case of a finite universe X, we can reduce ii to ii′, which is a particular case of
ii for any X:
ii′. ∀A ∈ P(X), B ∈ P(X), Π(A ∪ B) = max(Π(A), Π(B))    (25)
We can interpret this measure as follows: Π(Α) represents the extent to which it is
possible that the subset (or event) A of X occurs. If Π(Α) = 0, A is impossible; if
Π(Α) = 1, A is absolutely possible.
We remark that the possibility measure of the intersection of two subsets of X is
not determined from the possibility measure of these subsets. The only information we
obtain from i and ii is the following:

∀A ∈ P(X), B ∈ P(X), Π(A ∩ B) ≤ min(Π(A), Π(B))    (26)

Let us remark that two subsets can be individually possible (Π(A) ≠ 0, Π(B) ≠ 0)
but jointly impossible (Π(A ∩ B) = 0).
Let us consider the example of identification of a disease in a universe
X = {d_1, d_2, d_3, d_4}. We suppose that it is absolutely possible to be in the presence of
disease d_1 or disease d_2, disease d_3 is relatively possible, and disease d_4 is impossible, and
we represent this information as follows:

Π({d_1, d_2}) = 1, Π({d_3}) = 0.8, Π({d_4}) = 0    (27)

We deduce that it is absolutely possible that the disease is one of {d_1, d_2, d_4}, since

Π({d_1, d_2, d_4}) = max(1, 0) = 1    (28)

It is relatively possible that the disease is one of {d_3, d_4} since

Π({d_3, d_4}) = max(0.8, 0) = 0.8    (29)

but the intersection {d_4} of these two subsets {d_1, d_2, d_4} and {d_3, d_4} of X corresponds to
a possibility measure equal to 0.
We deduce from conditions i and ii that
Π is monotonic with respect to the inclusion of subsets of X:

If A ⊇ B, then Π(A) ≥ Π(B)    (30)

If we consider any subset A of X and its complement Ac, at least one of them is
absolutely possible. This means that either an event or its complement is absolutely
possible:

∀A ∈ P(X), max(Π(A), Π(A^c)) = 1
Π(A) + Π(A^c) ≥ 1    (31)

It is easy to see that possibility measures are less restrictive than probability
measures, because the possibility degree of an event is not necessarily determined by
the possibility degree of its complement.
4.2. Possibility Distributions
A possibility measure Π is completely defined if we assign a coefficient in [0,1] to
any subset of X. In the example of four diseases, we need 16 coefficients to determine Π.
It is easier to define possibility degrees if we restrict ourselves to the elements (and not
to the subsets) of X and we use condition ii to deduce the other coefficients.
A possibility distribution is a mapping π: X → [0, 1] satisfying the normalization
condition:

sup_{x ∈ X} π(x) = 1    (32)
A possibility distribution assigns a coefficient between 0 and 1 to every element of
X, for instance, to each of the four diseases d_1, d_2, d_3, d_4. Furthermore, at least one
element of X is absolutely possible, for instance, one disease in {d_1, d_2, d_3, d_4} is abso-
lutely possible. This does not mean that this disease is identified, because several of
them can be absolutely possible and other information is necessary to make a choice
between them.
Possibility measure and distribution can be associated. From a possibility distribu-
tion π, assigning a coefficient to any element of X, we construct a possibility measure
assigning a coefficient to any subset of X as follows:

∀A ∈ P(X), Π(A) = sup_{x ∈ A} π(x)    (33)

Conversely, from any possibility measure Π, we construct a possibility distribution


as follows:

∀x ∈ X, π(x) = Π({x})    (34)

For instance, a possibility distribution such as

π(d_1) = 1, π(d_2) = 0.4, π(d_3) = 0.8, π(d_4) = 0    (35)

is compatible with the preceding possibility measure, which is not given completely as
only 3 of the 16 coefficients are indicated.
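Following Eq. (33), the possibility of any subset is simply the largest value of the distribution over that subset, so the 16 coefficients of the four-disease example never need to be stored explicitly. A short Python sketch using the distribution of Eq. (35); the helper name is an illustrative choice:

```python
pi = {"d1": 1.0, "d2": 0.4, "d3": 0.8, "d4": 0.0}   # possibility distribution of Eq. (35)

def possibility(A, pi):
    """Possibility measure Pi(A) = sup_{x in A} pi(x) (Eq. 33); Pi(empty set) = 0."""
    return max((pi[x] for x in A), default=0.0)

if __name__ == "__main__":
    print(possibility({"d1", "d2"}, pi))        # 1.0, as in Eq. (27)
    print(possibility({"d3"}, pi))              # 0.8, as in Eq. (27)
    print(possibility({"d1", "d2", "d4"}, pi))  # 1.0, as in Eq. (28)
    print(possibility({"d3", "d4"}, pi))        # 0.8, as in Eq. (29)
    print(possibility({"d4"}, pi))              # 0.0: the intersection of the two subsets above
```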
In the case of two universes X and Y, we need to define the extent to which a pair
(x, y) is possible, with x ∈ X and y ∈ Y.
The joint possibility distribution π(x, y) on the Cartesian product X × Y is defined
for any x ∈ X and y ∈ Y and it expresses the extent to which x and y can occur
simultaneously.
The global knowledge of X × Y through the joint possibility distribution π(x, y)
provides marginal information on X and Y by means of the marginal possibility dis-
tributions, for instance on Y:
∀y ∈ Y, π_Y(y) = sup_{x ∈ X} π(x, y)    (36)

which satisfy:

∀x ∈ X, ∀y ∈ Y, π(x, y) ≤ min(π_X(x), π_Y(y))    (37)

We remark that a joint possibility distribution provides uniquely determined mar-


ginal distributions, but the converse is false. Determining a joint possibility distribution
π on X x Y from possibility distributions πχ on X and πγ on Y requires information
about the relationship between events on X and Y. If we have no information, π cannot
be known exactly.
The universes X and Y are noninteractive if

∀x ∈ X, ∀y ∈ Y, π(x, y) = min(π_X(x), π_Y(y))    (38)

This possibility distribution π(x, y) is the greatest among all those compatible with
π_X and π_Y. Two variables respectively defined on these universes are also called non-
interactive.
The effect of X on Y can also be represented by means of a conditional possibility
distribution π_{Y/X} such that

∀x ∈ X, ∀y ∈ Y, π(x, y) = π_{Y/X}(x, y) * π_X(x)    (39)

for a combination operator *, generally the minimum or the product.


For example, if we consider again the universe X = {d_1, d_2, d_3, d_4} of diseases and
we add a universe Y = {s_1, s_2, s_3, s_4, s_5, s_6} of symptoms, π_Y(s_i) is the possibility degree
that a patient presents symptom s_i, and π_X(d_j) is the possibility degree that a patient
suffers from disease d_j. For a disease d_j and a symptom s_i, we define the possibility
degree π(d_j, s_i) that the pair (d_j, s_i) is possible. X clearly has an influence on Y and the
universes are interactive. We can have π_X(d_j) = 1, π_Y(s_i) = 1, but π(d_j, s_i) = 0.05. We
can also define the conditional possibility degree π_{Y/X}(d_j, s_i) that the symptom is s_i
given that the disease is d_j. For instance, if the available information provides the values
π_X(d_3) = 0.8 and π_{Y/X}(d_3, s_i) = 1, then π(d_3, s_i) = 0.8, since π(d_3, s_i) =
π_{Y/X}(d_3, s_i) * π_X(d_3) = 1 * 0.8 = 0.8, when we choose the minimum or the product
for the operator *. This means that if disease d_3 is relatively possible for a given patient
and if symptom s_i is completely possible when disease d_3 is present, then it is relatively
possible that the given patient presents both disease d_3 and symptom s_i.
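Equations (36) and (38) are equally direct to sketch: the noninteractive joint distribution is the pointwise minimum of the marginals, and a marginal is recovered by maximizing the joint over the other universe. The Python fragment below uses small illustrative marginals; the values and helper names are assumptions, not the chapter's data:

```python
def noninteractive_joint(pi_X, pi_Y):
    """Largest joint distribution compatible with the marginals (Eq. 38):
    pi(x, y) = min(pi_X(x), pi_Y(y))."""
    return {(x, y): min(px, py) for x, px in pi_X.items() for y, py in pi_Y.items()}

def marginal_on_Y(pi_joint):
    """Marginal distribution pi_Y(y) = sup_x pi(x, y) (Eq. 36)."""
    out = {}
    for (_, y), degree in pi_joint.items():
        out[y] = max(out.get(y, 0.0), degree)
    return out

if __name__ == "__main__":
    # Illustrative marginals over two diseases and two symptoms
    pi_X = {"d1": 1.0, "d3": 0.8}
    pi_Y = {"s1": 1.0, "s2": 0.5}
    joint = noninteractive_joint(pi_X, pi_Y)
    print(joint[("d3", "s1")])       # 0.8: d3 and s1 jointly possible to degree 0.8
    print(marginal_on_Y(joint))      # recovers {'s1': 1.0, 's2': 0.5}
```

As the text notes, the joint distribution determines the marginals uniquely, whereas recovering a joint distribution from the marginals requires an interaction assumption such as Eq. (38).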
4.3. Necessity Measures
In this example, we see that a possibility measure provides information on the
fact that an event can occur, but it is not sufficient to describe the uncertainty about this
event and to obtain a conclusion from available data. For instance, if Π(A) = 1, the
event A is absolutely possible, but we can have Π(A^c) = 1, which shows that we are
totally uncertain about A. A solution to this problem is to complete the infor-
mation on A by means of a measure of necessity on X.
A necessity measure is a mapping N: P(X) → [0, 1], such that
iii. N(∅) = 0, N(X) = 1,    (40)
iv. ∀A1 ∈ P(X), A2 ∈ P(X), ...    N(∩_{i=1,2,...} A_i) = min_{i=1,2,...} N(A_i)    (41)

In the case of a finite universe X, we can reduce iv to iv', which is a particular case
of iv for any X:
iv'. ∀A ∈ P(X), B ∈ P(X)    N(A ∩ B) = min(N(A), N(B))    (42)
We can interpret this measure as follows: N(A) represents the extent to which it is
certain that the subset (or event) A of X occurs. If N(A) = 0, we have no certainty about
the occurrence of the event A; if N(A) = 1, we are absolutely certain that A occurs.
Necessity measures are monotonic with regard to set inclusion:

if A ⊇ B, then N(A) ≥ N(B)    (43)

The necessity degree of the union of subsets of X is not known precisely, but we
know a lower bound:

∀A ∈ P(X), B ∈ P(X)    N(A ∪ B) ≥ max(N(A), N(B))    (44)

We deduce also from iii and iv a link between the necessity measure of an event A
and its complement Aᶜ:

∀A ∈ P(X)    min(N(A), N(Aᶜ)) = 0
             N(A) + N(Aᶜ) ≤ 1    (45)

We see that the information provided by a possibility measure and that provided
by a necessity measure are complementary and their properties show that they are
linked together. Furthermore, we can point out a duality between possibility and neces-
sity measures, as follows.
For a given universe X and a possibility measure Π on X, the measure defined by

∀A ∈ P(X)    N(A) = 1 − Π(Aᶜ)    (46)

is a necessity measure on X, if Aᶜ denotes the complement of A in X. We are certain that
A occurs (N(A) = 1) if and only if Aᶜ is impossible (Π(Aᶜ) = 0), and then Π(A) = 1.
If Π is defined from a possibility distribution π, we can define its dual necessity
measure by

∀A ∈ P(X)    N(A) = inf_{x∉A} (1 − π(x))    (47)

which means we need only one collection of coefficients between 0 and 1 associated with
the elements of the universe X (the values of π(x)) to determine both possibility and
necessity measures.
With the previous example, the certainty on the fact that the patient suffers from
disease d1 is measured by N({d1}) and it can be deduced from the greatest possibility
that the patient suffers from one of the three other diseases:

N({d1}) = 1 − Π({d2, d3, d4}) = 1 − max(π(d2), π(d3), π(d4))    (48)

The duality between Π and N also appears in the following relations, satisfied
∀A ∈ P(X):

• Π(A) ≥ N(A),
• max(Π(A), 1 − N(A)) = 1,
• If N(A) ≠ 0, then Π(A) = 1,
• If Π(A) ≠ 1, then N(A) = 0.
These properties are important if we elicit the possibility and necessity measures
from a physician. For instance, if the physician provides first possibility degrees Π(Α)
for events A, we should not ask the physician to give necessity degrees for events with
possibility degrees strictly smaller than 1, because N(A) = 0 in this case. If the physician
provides first degrees of certainty, corresponding to values of a necessity measure, we
should not ask for possibility degrees for events with necessity degrees different from 0,
as Π(A) = 1 in this case.
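A brief sketch of the duality in Eqs. (46)-(48), reusing the four-disease distribution of Eq. (35); function names are illustrative only:

pi = {"d1": 1.0, "d2": 0.4, "d3": 0.8, "d4": 0.0}

def possibility(subset):
    return max((pi[x] for x in subset), default=0.0)

def necessity(subset):
    """N(A) = 1 - Pi(complement of A), Eq. 46."""
    complement = set(pi) - set(subset)
    return 1.0 - possibility(complement)

# Certainty that the patient suffers from d1 (Eq. 48):
print(necessity({"d1"}))   # 1 - max(0.4, 0.8, 0.0) = 0.2
# Duality checks: Pi(A) >= N(A), and N(A) > 0 implies Pi(A) = 1.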
4.4. Relative Possibility and Necessity of Fuzzy
Sets
Possibility and necessity measures have been defined for crisp subsets of X, not for
fuzzy sets. In the case in which fuzzy sets are observed, analogous measures are defined
with a somewhat different purpose, which is to compare an observed fuzzy set F to a
reference fuzzy set A of X.
The possibility of F relative to A is defined as

Π(F; A) = sup_{x∈X} min(f_F(x), f_A(x))    (49)

We remark that Π(F; A) = 0 indicates that F ∩ A = ∅, and Π(F; A) = 1 indicates
that F ∩ A ≠ ∅.
The dual quantity defined by N(F; A) = 1 − Π(F̄; A), where F̄ denotes the complement of F, is the necessity of F with
regard to A, defined as

N(F; A) = inf_{x∈X} max(f_F(x), 1 − f_A(x))    (50)

These coefficients are used, among other things, to measure the extent to which F
is compatible with A. For example, with the universe X of real numbers, we can evaluate
the compatibility of the glycemia level F of a patient, described as "about 1.4 g/l"
(Figure 2), with a reference description of the glycemia level as "abnormal" (Figure
1), by means of Π(F; A) and N(F; A), and this information will express the extent to
which the glycemia level can be considered abnormal.
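As a rough sketch of Eqs. (49)-(50), assuming NumPy; the two membership curves below are placeholders standing in for the chapter's Figures 1 and 2, not the actual curves:

import numpy as np

x = np.linspace(0.8, 2.0, 241)                      # glycemia levels in g/l
f_A = np.clip((x - 1.3) / 0.2, 0.0, 1.0)            # "abnormal": ramps from 1.3 to 1.5 g/l
f_F = np.clip(1.0 - abs(x - 1.4) / 0.1, 0.0, 1.0)   # "about 1.4 g/l": triangle around 1.4

poss = np.max(np.minimum(f_F, f_A))                 # Pi(F; A), Eq. 49
nec = np.min(np.maximum(f_F, 1.0 - f_A))            # N(F; A), Eq. 50
print(poss, nec)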

5. APPROXIMATE REASONING

Possibility theory, as presented in Section 4, is restricted to crisp subsets of a universe.


The purpose of its introduction was to evaluate uncertainty related to inaccuracy. We
need to establish a link between both approaches.
5.1. Linguistic Variables
A linguistic variable is a 3-tuple (V, X, T_V), defined from a variable V (e.g., dis-
tance, glycemia level, temperature) defined on a universe X and a set T_V = {A1, A2, ...}
of fuzzy characterizations of V. For instance, with V = glycemia level, we can have

T_V = {normal, abnormal} (Figure 8). We use the same notation for a linguistic char-
acterization and for its representation by a fuzzy set of X. The set T_V corresponds to
basic characterizations of V.
We need to construct more characterizations of V to enable efficient reasoning
from values of V.
A linguistic modifier is an operator m yielding a new characterization m(A) from
any characterization A of V in such a way that f_{m(A)} = t_m(f_A) for a mathematical
transformation t_m associated with m.
For a set M of modifiers, M(T_V) denotes the set of fuzzy characterizations deduced
from T_V. For example, with M = {almost, very}, we obtain M(T_V) = {very abnormal,
almost normal, ...} from T_V = {normal, abnormal} (Figure 9).
Examples of linguistic modifiers are defined by the following mathematical defini-
tions, corresponding to translations or homotheties (a short illustrative sketch follows this list):

• f_{m(A)}(x) = f_A(x)^2 (very), introduced by Zadeh
• f_{m(A)}(x) = f_A(x)^{1/2} (more or less), introduced by Zadeh
• f_{m(A)}(x) = min(1, λ·f_A(x)), for λ > 1 (approximately)
• f_{m(A)}(x) = max(0, v·φ(x) + 1 − v), for a parameter v in [1/2, 1] (about)
• f_{m(A)}(x) = min(1, max(0, φ(x) + β)), with 0 < β < 1 (rather)
• f_{m(A)}(x) = f_A(x + a), for a real parameter a (really or rather, depending on the
sign of a), where φ is the function identical to f_A on its support and extending it
outside the support.
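A minimal sketch of the two Zadeh modifiers applied to a sampled membership function (NumPy assumed; the curve for "abnormal" is a placeholder, not the chapter's Figure 1):

import numpy as np

x = np.linspace(0.8, 2.0, 241)
f_abnormal = np.clip((x - 1.3) / 0.2, 0.0, 1.0)      # placeholder membership function

f_very_abnormal = f_abnormal ** 2                    # "very": concentration
f_more_or_less_abnormal = f_abnormal ** 0.5          # "more or less": dilation

"Very" sharpens the characterization (smaller membership degrees), while "more or less" relaxes it.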

5.2. Fuzzy Propositions


We consider a set L of linguistic variables and a set M of linguistic modifiers.
For a linguistic variable (V, X, T_V) of L, an elementary proposition is defined as
"V is A" ("the glycemia level is abnormal") by means of a normalized fuzzy set A of X
in T_V or in M(T_V).
The more compatible the precise value of V is with A, the truer the proposition "V
is A." The truth value of an elementary fuzzy proposition "V is A" is defined by the
membership function f_A of A.
A compound fuzzy proposition is obtained by combining elementary propositions
"V is A," "W is B"... for noninteractive variables V.
The simplest fuzzy proposition is a conjunction of elementary fuzzy propositions
"V is A and W is B" (for instance, "the glycemia level is abnormal and the cholesterol
level is high"), for two variables V and W respectively defined on universes X and Y. It
is associated with the Cartesian product A x B of fuzzy sets of X and Y, characterizing
the pair (V, W) on X × Y. Its truth value is defined by min(f_A(x), f_B(y)), or more
generally T(f_A(x), f_B(y)) for a t-norm T, at any (x, y) of X × Y. Such a fuzzy proposi-
tion is very common in rules of knowledge-based systems and in fuzzy control.
Analogously, we can combine elementary propositions by a disjunction of the form
"V is A or W is B" (for instance, "the glycemia level is abnormal or the cholesterol
level is high"). The truth value of the fuzzy proposition is defined by max(f_A(x), f_B(y)),
or more generally ⊥(f_A(x), f_B(y)) for a t-conorm ⊥, at any (x, y) of X × Y.
An implication between two elementary fuzzy propositions provides a fuzzy pro-
position of the form "if V is A then W is B" (for instance, "if the glycemia level is
abnormal then the suggestion is sulfonylurea"), and we will study this form of fuzzy
proposition carefully because of its importance in reasoning in a fuzzy framework.
More generally, we can construct fuzzy propositions by conjunction, disjunction,
or implication on already compound fuzzy propositions.
A fuzzy proposition based on an implication between elementary or compound
fuzzy propositions, for instance, of the form "if V is A and W is B then U is C" ("if the
glycemia level is medium and the creatininemia level is smaller than k, then the sugges-
tion is not sulfonylurea") is a fuzzy rule, "V is A and W is B" is its premise, and "U is
C" is its conclusion.

5.3. Possibility Distribution Associated


with a Fuzzy Proposition
The concepts of linguistic variable and fuzzy proposition are useful for the man-
agement of imprecise knowledge when we associate them with possibility distributions
to represent uncertainty.
A fuzzy characterization A such as "abnormal" is prior information and its mem-
bership function f_A indicates to what extent each element x of X belongs to A. A fuzzy
proposition such as "the glycemia level is abnormal" is posterior information, given
after an observation, which describes to what extent it is possible that the exact value of
the glycemia level is any element of X.
An elementary fuzzy proposition induces a possibility distribution π_{V,A} on X,
defined from the membership function of A by

∀x ∈ X    π_{V,A}(x) = f_A(x)    (51)

From this possibility distribution, we define a possibility and a necessity measure


for any crisp subset D of X, given the description of V by A:

Π_{V,A}(D) = sup_{x∈D} π_{V,A}(x)
N_{V,A}(D) = 1 − Π_{V,A}(Dᶜ)    (52)
Analogously, a compound fuzzy proposition induces a possibility distribution on
the Cartesian product of the universes. For instance, a fuzzy proposition such as " V is
A and W is B," with V and W defined on universes X and Y, induces the following
possibility distribution:

∀x ∈ X, ∀y ∈ Y    π_{(V,W),A×B}(x, y) = min(f_A(x), f_B(y))    (53)

Such a connection between membership functions and degrees of possibility, or


equivalently between imprecision and uncertainty, appears clearly if we again use the
example given in Figure 1. We see that a value of the glycemia level equal to 1.4 g/l
belongs to the class of abnormal levels with a degree equal to 0.5. Conversely, if we
know only that a given glycemia level is characterized as "abnormal," we deduce that:

It is impossible that this level is less than 1.3 g/l, which means that the possibility
degrees are equal to zero for the values of the glycemia level smaller than 1.3 g/l.

It is absolutely possible that this level is at least equal to 1.5 g/l, which means
that the possibility distribution assigns a value equal to 1 to levels at least equal
to 1.5 g/l.

It is relatively possible, with a possibility degree between 0 and 1, that the glycemia
level is between 1.3 and 1.5 g/l.
In the case of an uncertain fuzzy proposition such as "V is A, with an uncertainty
ε," for A ∈ T_V, no element of the universe X can be rejected and every element x of X
has a possibility degree at least equal to ε. Such a fuzzy proposition is associated with a
possibility distribution:

π′(x) = max(π_{V,A}(x), ε)    (54)

For instance, a fuzzy proposition weighted by an uncertainty, such as "it is pos-


sible that the glycemia level is abnormal, with an uncertainty 0.4" or, equivalently, "it is
possible that the glycemia level is abnormal, with a certainty 0.6," is represented by a
possibility distribution π' as indicated in Figure 10 by using the possibility distribution

π_{V,A} deduced from the membership function of "abnormal" given in Figure 1 and the
value 0.4 of ε.

Figure 10 Possibility distribution of an uncertain fuzzy proposition.
5.4. Fuzzy Implications
The use of imprecise and/or uncertain knowledge leads to reasoning in a way close
to human reasoning and different from classical logic. More particularly, we need:
To manipulate truth values intermediate between absolute truth and absolute
falsity
To use soft forms of quantifiers, more gradual than the universal and existential
quantifiers V and 3
To use deduction rules when the available information is imperfectly compatible
with the premise of the rule.
For these reasons, fuzzy logic has been introduced with the following character-
istics:
Propositions are fuzzy propositions constructed from sets L of linguistic variables
and M of linguistic modifiers.
The truth value of a fuzzy proposition belongs to [0,1] and is given by the member-
ship function of the fuzzy set used in the proposition.
Fuzzy logic can be considered as an extension of classical logic and it is identical
to classical logic when the propositions are based on crisp characterizations of
the variables.
Let us consider a fuzzy rule "if V is A then W is B," based on two linguistic
variables (V, X, T_V) and (W, Y, T_W).
A fuzzy implication associates with this fuzzy rule the membership function of a
fuzzy relation R on X × Y defined as

∀(x, y) ∈ X × Y    f_R(x, y) = F(f_A(x), f_B(y))    (55)

for a function F chosen in such a way that, if A and B are singletons, then the fuzzy
implication is identical to the classical implication.
There exist many definitions of fuzzy implications. The most commonly used are
the following:

/*(*> y) = 1 -/<(*) + / Λ ( * ) -fe(y) Reichenbach


/*(*> y) = max(l -fA(x), min(fA(x),fB(y)) Willmott
/*(*> y) = max(l -fA(x),fs(y)) Kleene-Dienes
/*(*. y) = min(l -fA(x) +fB(y), 1) Lukasiewicz
fR(x,y) = mint/iOO/^C*:), 1) if fA(x) Φ 0 and 1 otherwise Goguen
fdx^y) = 1 if fn(x) ^fß(y) afl d 0 otherwise Rescher-Gaines
/R(*>JO = 1 if fA{x) 5/BO') and/ A 0) otherwise Brouwer-Gödel
A(x, y) = min(/"/4(x),/Ä0')) Mamdani*
fÄx, y) =fA(x) -feiy) Larsen*

The last two quantities (*) do not generalize the classical implication, but they are
used in fuzzy control to manage fuzzy rules.
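A compact sketch of a few of these implications as Python functions, with u = f_A(x) and v = f_B(y); the function names are chosen here for readability and are not from the chapter:

def kleene_dienes(u, v):
    return max(1 - u, v)

def lukasiewicz(u, v):
    return min(1 - u + v, 1.0)

def goguen(u, v):
    return 1.0 if u == 0 else min(v / u, 1.0)

def mamdani(u, v):          # does not generalize the classical implication
    return min(u, v)

# With crisp truth values the first three reduce to classical implication:
assert lukasiewicz(1, 0) == 0 and lukasiewicz(0, 0) == 1 and lukasiewicz(1, 1) == 1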
Generalized modus ponens is an extension of the scheme of reasoning called modus
ponens in classical logic. For two propositions p and q such that p ⇒ q, if p is true, we
deduce that q is true. In fuzzy logic, we use fuzzy propositions and, if p′ is true, with p′
approximately identical to p, we want to get a conclusion, even though it is not q itself.
Generalized modus ponens (g.m.p.) is based on the following propositions:
Rule if V is A then W is B
Observed fact V is A'
Conclusion W is B'
The membership function fB> of the conclusion is computed from the available
information: fR to represent the rule,fA> to represent the observed fact, by means of the
so-called combination-projection rule:

∀y ∈ Y    f_{B′}(y) = sup_{x∈X} T(f_{A′}(x), f_R(x, y))    (56)

for a t-norm T called a generalized modus ponens operator.


The choice of T is determined by the compatibility of the generalized modus
ponens with the classical modus ponens: if A = A', then B = B'.
The most usual g.m.p. operators compatible with this condition are the following:
The Lukasiewicz t-norm T(u, v) = max(u + v − 1, 0), with any of the fuzzy impli-
cations mentioned above
The product t-norm T(u, v) = u·v, with the five last fuzzy implications of our list
The min t-norm T(u, v) = min(u, v), with the four last ones
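A discretized sketch of the combination-projection rule (Eq. 56), assuming NumPy; A, A′, and B are sampled membership vectors on small finite grids with illustrative values, not the chapter's figures:

import numpy as np

def gmp(f_A_prime, f_R, t_norm=np.minimum):
    """f_B'(y) = sup_x T(f_A'(x), f_R(x, y)) on discretized universes."""
    # f_A_prime: shape (nx,), f_R: shape (nx, ny)
    return np.max(t_norm(f_A_prime[:, None], f_R), axis=0)

f_A = np.array([0.0, 0.5, 1.0, 0.5, 0.0])      # premise "A" on a 5-point grid of X
f_B = np.array([0.0, 1.0, 1.0])                # conclusion "B" on a 3-point grid of Y
f_R = np.minimum(f_A[:, None], f_B[None, :])   # Mamdani relation for the rule
f_A_obs = np.array([0.0, 0.2, 1.0, 0.2, 0.0])  # observed fact A', close to A

print(gmp(f_A_obs, f_R))                       # inferred membership function of B'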
5.5. Fuzzy Inferences
The choice of a fuzzy implication is based on its behavior. Some fuzzy implications
entail an uncertainty about the conclusion (Kleene-Dienes implication, for instance),
whereas others provide imprecise conclusions (Reichenbach, Brouwer-Gödel, or
Goguen implication, for instance). Some of them entail both types of imperfection
(Lukasiewicz implication, for instance).
Let us consider the following example (see Figure 11):
Rule: "if the glycemia level is abnormal then sulfonylurea is suggested," with the
universe of glycemia levels X = ℜ+ and the universe of degrees of suggestion Y = [0, 1].
Observation: the glycemia level is 1.4 g/1.
Conclusion:
• It is relatively certain that sulfonylurea is suggested (with the Kleene-Dienes
implication).
• It is certain that sulfonylurea is rather suggested (with the
Reichenbach, Brouwer-Gödel, or Goguen implication).
• It is relatively certain that sulfonylurea is rather suggested (with the
Lukasiewicz implication).
Fuzzy inferences are used in rule-based systems, when there exist imprecise data,
when we need a flexible system, with representation of the linguistic descriptions
depending on the environment of the system or its conditions of utilization, when we
cope with categories with imprecise boundaries, and when there exist subjective vari-
ables described by human agents.

Figure 11 Example of a generalized modus ponens with various forms of observations A′ and various fuzzy implications.

6. EXAMPLES OF APPLICATIONS OF NUMERICAL


METHODS IN BIOLOGY

There exist many knowledge-based systems using fuzzy logic. The treatment of glyce-
mia, for instance, has given rise to several automatic systems supporting diagnosis or
helping patients to take care of their glycemia level [13-15]. An example in other
domains is a system supporting the prescription of antibiotics [16].
Some general systems, which are expert system engines using fuzzy logic, have been
used to solve medical problems. MILORD is particularly interesting for its module of

expert knowledge elicitation [17] and FLOPS takes into account fuzzy numbers and
fuzzy relations and is used to process medical images in cardiology [18]. Also,
CADIAG-2 provides a general diagnosis support system using fuzzy descriptions and
also fuzzy quantifiers such as "frequently" or "rarely" [19].
The management of temporal knowledge in an imprecise framework can be solved
by using fuzzy temporal constraints, and such an approach has been used for the
management of data in cardiology [20], for instance.
It is also interesting to use fuzzy techniques for diagnosis support systems taking
into account clinical indications that are difficult to describe precisely, such as the
density, compacity, and texture of visual marks. Such systems have been proposed
for the diagnosis of hormone disorders [21] or the analysis of symptoms of patients
admitted to a hospital [22].
In medical image processing, problems of pattern identification are added to the
difficulty in eliciting precise and certain rules from specialists, even though they are able
to make a diagnosis from an image. A system for the analysis of microcalcifications in
mammographic images has been proposed [23], a segmentation method based on fuzzy
logic has been described [24], and the fusion of cranial magnetic resonance images has been
explained [25].
Databases can also be explored by means of imprecise queries, and an example of
an approach to this problem using fuzzy concepts has been proposed [26].
In this section, we have listed the main directions in using fuzzy logic in the
construction of automatic systems in medicine on the basis of existing practical appli-
cations. This list is obviously not exhaustive. More applications are discussed elsewhere
[27].

7. CONCLUSION
We have presented the main problems concerning the management of uncertainty and
imprecision in automatic systems, especially in medical applications. We have intro-
duced methodologies that enable us to cope with these imperfections.
We have not developed evidence theory, also called Dempster-Shafer theory,
which concerns the management of degrees of belief assigned to the occurrence of
events. The main interest lies in the combination rule introduced by Dempster that
provides a means of aggregating information obtained from several sources.
Another methodology used in medical applications is the construction of causal
networks, generally regarded as graphs, the vertices of which are associated with situa-
tions or symptoms or diseases. The arcs carry probabilities of occurrence of events
from one vertex to another and enable us to update probabilities of hypotheses when
new information is received or to point out dependences between elements.
As we focused on methods for dealing with imprecision, let us point out the
reasons for their importance [1,2]: fuzzy set and possibility theory are of interest
when at least one of the following problems occurs:
• We have to deal with imperfect knowledge.
• Precise modeling of a system is difficult.
• We have to cope with both uncertain and imprecise knowledge.

• We have to manage numerical knowledge (numerical values of variables "100


millimeters") and symbolic knowledge (descriptions of variables in natural lan-
guage, "long") in a common framework.
• Human components are involved in the studied system (observers, users of the
system, agents) and bring approximate or vague descriptions of variables, sub-
jectivity (degree of risk, aggressiveness of the other participants), qualitative
rules ("if the level is too high, reduce the level"), and gradual knowledge
("the greater, the more dangerous").
• We have to take into account imprecise classes and ill-defined categories ("pain-
ful position").
• We look for flexible management of knowledge, adaptable to the environment
or to the situation we meet.
• The system is evolutionary, which makes it difficult to describe precisely each of
its states.

The number of medical applications developed since the 1970s justifies the devel-
opment we have presented.

REFERENCES

[1] B. Bouchon-Meunier, La logique floue, Que Sais-Je? 2nd ed. No. 2702. Paris: Presses
Universitaires de France, 1994.
[2] B. Bouchon-Meunier, La logique floue et ses applications. Paris: Addison-Wesley, 1995.
[3] B. Bouchon-Meunier and H. T. Nguyen, Les incertitudes dans les systemes intelligents. Paris:
Presses Universitaires de France, 1996.
[4] D. Dubois and H. Prade, Fuzzy Sets and Systems, Theory and Applications. New York:
Academic Press, 1980.
[5] D. Dubois and H. Prade, Théorie des possibilités, applications à la représentation des con-
naissances en informatique, 2nd ed. Paris: Masson, 1987.
[6] D. Dubois, H. Prade, and R. R. Yager, Readings in Fuzzy Sets for Intelligent Systems. San
Mateo, CA: Morgan Kaufmann, 1993.
[7] G. Klir and T. Folger, Fuzzy Sets, Uncertainty and Information. Englewood Cliffs, NJ:
Prentice Hall, 1988.
[8] L. Sombe, Raisonnements sur des informations incomplètes en intelligence artificielle—
Comparaison de formalismes à partir d'un exemple. Toulouse: Editions Teknea, 1989.
[9] R. E. Neapolitan, Probabilistic Reasoning in Expert Systems. New York: Wiley, 1994.
[10] R. R. Yager and D. P. Filev, Essentials of Fuzzy Modeling and Control. New York: Wiley,
1990.
[11] L. A. Zadeh, Fuzzy sets. Information and Control 8: 338-353, 1965.
[12] L. A. Zadeh, Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst. 1: 3-28, 1978.
[13] G. Soula, B. Vialettes, and J. L. San Marco, PROTIS, a fuzzy deduction rule system:
Application to the treatment of diabetes. Proceedings MEDINFO 83, IFIP-IMIA, pp.
177-187, Amsterdam, 1983.
[14] P. Y. Glorennec, H. Pircher, and J. P. Hespel, Fuzzy logic control of blood glucose.
Proceedings International Conference IPMU, pp. 916-920, Paris, 1994.
[15] J. C. Buisson, H. Farreny, and H. Prade, Dealing with imprecision and uncertainty in the
expert system DIABETO-III. Actes CIIAM-86, pp. 705-721, Hermes, 1986.

[16] G. Palmer and B. Le Blanc, MENTA/MD: Moteur decisionnel dans le cadre de la logique
possibiliste. Actes 3emes Journees Nationales sur les Applications des Ensembles Flous, pp.
61-68, Nimes, 1993.
[17] R. Lopez de Mantaras, J. Agusti, E. Plaza, and C. Sierra, Milord: A fuzzy expert system
shell. In Fuzzy Expert Systems, A. Kandel, ed., pp. 213-223, Boca Raton, FL: CRC Press,
1992.
[18] J. J. Buckley, W. Siler, and D. Tucker, A fuzzy expert system. Fuzzy Sets Syst. 20: 1-16, 1986.
[19] K. P. Adlassnig and G. Kolarz, CADIAG-2: Computer-assisted medical diagnosis using
fuzzy subsets. In Approximate Reasoning in Decision Analysis, M. M. Gupta and E. Sanchez,
eds., pp. 219-247. Amsterdam: North Holland.
[20] S. Barro, A. Bugarin, P. Felix, R. Ruiz, R. Marin, and F. Palacios, Fuzzy logic applications
in cardiology: Study of some cases. Proceedings International Conference IPMU, pp. 885-
891, Paris, 1994.
[21] E. Binaghi, M. L. Cirla, and A. Rampini, A fuzzy logic based system for the quantification
of visual inspection in clinical assessment. Proceedings International Conference IPMU, pp.
892-897, Paris, 1994.
[22] D. L. Hudson and M. E. Cohen, The role of approximate reasoning in a medical expert
system. In Fuzzy Expert Systems, A. Kandel, ed. Boca Raton, FL: CRC Press, 1992.
[23] S. Bothorel, B. Bouchon-Meunier, and S. Muller, A fuzzy logic-based approach for semi-
ological analysis of microcalcifications in mammographic images. Int. J. Intell. Syst., 1997.
[24] P. C. Smits, M. Mari, A. Teschioni, S. Dellepine, and F. Fontana, Application of fuzzy
methods to segmentation of medical images. Proceedings International Conference IPMU,
pp. 910-915, Paris, 1994.
[25] I. Bloch and H. Maitre, Fuzzy mathematical morphology. Ann. Math. Artif. Intell. 9:III-IV,
1993.
[26] M. C. Jaulent and A. Yang, Application of fuzzy pattern matching to the flexible interroga-
tion of a digital angiographies database. Proceedings International Conference IPMU, pp.
904-909, Paris, 1994.
[27] M. Cohen and D. Hudson, eds., Comparative Approaches to Medical Reasoning. Singapore:
World Scientific, 1995.
[28] R. R. Yager, S. Ovchinnikov, R. M. Tong, and H. T. Nguyen, eds., Fuzzy Sets and
Applications, Selected Papers by L. A. Zadeh. New York: Wiley, 1987.
[29] H. J. Zimmermann, Fuzzy Set Theory and Its Applications. Dordrecht: Kluwer, 1985.

Chapter 2

APPLICATIONS OF FUZZY CLUSTERING TO BIOMEDICAL SIGNAL PROCESSING AND DYNAMIC SYSTEM IDENTIFICATION

Amir B. Geva

1. INTRODUCTION
State recognition (diagnosis) and event prediction (prognosis) are important tasks in
biomedical signal processing. Examples can be found in tachycardia detection from
electrocardiogram (ECG) signals, epileptic seizure prediction from an electroencepha-
logram (EEG) signal, and prediction of vehicle drivers falling asleep from both signals.
The problem generally treats a set of ordered measurements of the system behavior and
asks for recognition of temporal patterns that may forecast an event or a transition
between two different states of the biological system.
Applying clustering methods to continuously sampled measurements in quasi-sta-
tionary conditions is useful for grouping discontinuous related temporal patterns. Since
the input patterns are time series, a similar series of events that lead to a similar result
would be clustered together. The switches from one stationary state to another, which
are usually vague and not focused on any particular time point, are naturally treated by
means of fuzzy clustering. In such cases, an adaptive selection of the number of clusters
(the number of underlying processes, or states, in the time series) can overcome the
general nonstationary nature of real-life time series.
The method includes the following steps: (0) rearrangement of the time series into
temporal patterns for the clustering procedure, (1) dynamic state recognition and event
detection by unsupervised fuzzy clustering, (2) system modeling using the noncontin-
uous temporal patterns of each cluster, and (3) time series prediction by means of
similar past temporal patterns from the same cluster of the last temporal pattern.
The prediction task can be simplified by decomposing the time series into separate
scales of wavelets and predicting each scale separately. The wavelet transform provides
an interpretation of the series structures and information about the history of the series,
using fewer coefficients than other methods.
The algorithm suggested for the clustering is a recursive algorithm for hierarchical-
fuzzy partitioning. The algorithm benefits from the advantages of hierarchical cluster-
ing while maintaining the rules of fuzzy sets. Each pattern can have a nonzero member-
ship in more than one data subset in the hierarchy. Feature extraction and reduction is
optionally reapplied for each data subset. A "natural" and feasible solution to the
cluster validity problem is suggested by combining hierarchical and fuzzy concepts.


The algorithm is shown to be effective for a variety of data sets with a wide dynamic
range of both covariance matrices and number of members in each class. The new
method is demonstrated for well-known time series benchmarks and is applied to
state recognition of the recovery from exercise by the heart rate signal and to the
forecasting of biomedical events such as generalized epileptic seizures from the EEG
signal.

1.1. Time Series Prediction and System


Identification
A sequence of L observed data, s1, s2, ..., s_L, usually ordered in time, is called a
time series, although time may be replaced by any other variable. Real-life time series
can be taken from physical science, business, management, social and behavioral
science, economics, and so on. The goal of time series prediction or forecasting is to
find the continuation, sL+x, sL+2,..., of the observed sequence. Time series prediction is
based on the idea that the time series carry within them the potential for predicting their
future behavior. Analyzing observed data produced by a system can give good insight
into the system and knowledge about the laws underlying the data. With the knowledge
gained, good predictions of the system's future behavior can be made. The techniques
for time series analysis and predicting [1] can be classified into roughly two general
categories: (1) if there are known underlying deterministic equations describing the
series, in principle they can be solved to make a forecast, and (2) if the equations are
not known, one must find rules governing the data and information about the under-
lying model of the time series (such as whether it is linear, quadratic, periodic, chaotic).
Linear models (such as MA, AR, and ARMA) have been most frequently used for
time series analysis, although often there is no inherent reason to restrict consideration
to such models. Linear models have two particularly desirable features: they can be
understood in great detail, and they are straightforward to implement. Linear models
can give good prediction results for simple time series but can fail to predict time series
with a wide band spectrum, a stochastic or chaotic time series, in which the power
spectrum is not a useful characterization. The analysis of such a series requires a long
history of the series, which results in a very high order linear model. For example, a
good weather forecasting system demands a long history in order to capture the model
of changing seasons. In practice, the application of such a high-order linear model is
problematic both from the learning point of view and from the computational point of
view.
A number of new nonlinear techniques, such as neural networks (NNs), wavelets,
and chaos analysis, promise insight that traditional linear approaches cannot provide
[2-16]. The use of NNs can produce nonlinear models that embody a much broader
class of functions than linear models [4,5]. Some recent work shows that a feed-forward
NN, trained with back-propagation and a weight elimination algorithm, outperforms
traditional nonlinear statistical approaches in time series prediction [2,3]. The simplest
approach for learning a time series model by means of an NN is to provide its time-
delayed samples to the input layer of the NN. The more complex the series are, the
more information about the past is needed, so the size of the input layer and the
corresponding number of weights are increased. If, however, a system operates in
multiple modes and the dynamics is drifting or switching, standard approaches, such

as the multilayer perceptron, are likely to fail to represent the underlying input-output
relations [17].

1.2. Fuzzy Clustering


Fuzzy clustering algorithms [18,19] are widely used for various applications in
which grouping of overlapping and vague elements is necessary. Some experience has
been accumulated in the medical field in diagnostics and decision-making support tools
where a wide range of measurements were used as the input data space and a decision
result was produced by optimally grouping the symptoms together [20-22]. The algo-
rithms may fail when the data include complex structures with large variability of
cluster shapes, variances, densities, and number of data points in each cluster.
Examples can be found in complex biomedical signals such as EEG signals and in
medical images such as magnetic resonance imaging (MRI) and positron emission
tomography (PET) images that include main objects and fine details with a large
dynamic intensity scale. One of the main problems in these cases is the estimation of
the number of clusters in the data, the so-called cluster validity problem. Cluster validity
is a difficult problem that is crucial for the practical application of clustering [23]. Most
of the common criteria for cluster validity have failed to estimate the correct number of
clusters in complex data with a large number of clusters and a large variety of distribu-
tions within and between clusters. Hierarchical clustering seems to be a natural
approach to solving this problem.
"Hard" hierarchical clustering methods (also referred to as graph clustering algo-
rithms) are very well known methods for recursively partitioning sets to subsets [20,24].
The partition can be bottom up or top down, but in both cases a data pattern that has
been classified to one of the clusters cannot be reclassified to other clusters. This
property of the classical hard hierarchical clustering methods makes them impractical
for real applications. This chapter introduces a method for a natural "soft" top-down
hierarchical partition, which fully follows fuzzy set rules [25] by means of unsupervised
fuzzy clustering. The method is applied for biomedical state recognition and events
forecasting.

1.3. Nonstationary Signal Processing Using


Unsupervised Fuzzy Clustering
Unsupervised fuzzy clustering is one of the common methods used for finding a
structure in given data [18,19,26], in particular finding structure related to time. In the
first part of this chapter we apply dynamic hierarchical unsupervised fuzzy clustering
for forecasting medical events from biomedical signals (EEG, heart rate variability,
etc.). There are two important differences between this problem and the classical task
of time series prediction in which an estimation of a specific future element is requested.
First, exact knowledge about the values in the estimated part of the series is generally
not essential for the event forecasting task. Second, the precursory elements can be
spread with nonuniform and changing weighting factors along the time scale and the
common assumption of stationary distribution in the time series can be rejected.
Applying the clustering methods to continuously sampled measurements in semista-
tionary conditions can be useful for grouping discontinuous related patterns and form-
ing a warning cluster. The switches from one stationary state to another, which are
usually vague and not focused on any particular time point, are naturally treated by

means of fuzzy clustering. In such cases, an adaptive selection of the number of clusters
(the number of underlying semistationary processes in the signal) can overcome the
general nonstationary nature of biomedical signals.
The time series prediction task can be treated by combining fuzzy clustering tech-
niques with common methods. The deterministic versus stochastic (DVS) algorithm for
time series prediction, which was successfully demonstrated by Casdagli and Weigend
in the Santa Fe competition [3], is an important and relevant approach. In the DVS
algorithm k sets of samples from a nonuniform time scale are used with the present
window, according to an affinity criterion, as a precursory set for a future element. The
prediction phase of the algorithm can be regarded as a k-nearest-neighbor clustering of
the time series elements to groups that reflect certain "states" of the series. The idea is
that we expect future results of similar states to be similar as well. In the same way that
the k-nearest neighbor was used, unsupervised fuzzy clustering methods can be imple-
mented so as to provide an alternative method for time series prediction [15]. This
approach is expected to provide superior results in quasi-stationary conditions, where
a relatively small number of stationary distributions control the behavior of the series
and unexpected switches between them are observed. Again, an unsupervised selection
of the number of clusters and of the number of patterns in each cluster (the parameter
k, which is fixed for all the clusters in the DVS algorithm) can overcome the nonsta-
tionarity of the signals and improve the prediction results.

2. METHODS
The general scheme of the hybrid algorithm for state recognition and time series pre-
diction using unsupervised fuzzy clustering is presented in Figure 1. The method
includes the following steps:
0. Rearrangement of the time series into temporal patterns for the clustering
procedure
1. Dynamic state recognition and event detection by unsupervised fuzzy clustering
2. Modeling and system identification of the noncontinuous temporal patterns of
each cluster
3. Time series prediction by means of similar past temporal patterns from the same
cluster of the last temporal pattern
The clustering procedure can be applied directly on continuous overlapping win-
dows of the sampled raw data or on some of its derivatives (the phase or state space of
the data). In the clustering phase of the algorithm, we "collect" all the temporal pat-
terns from the past that are similar to the current temporal event by the clustering
procedure. This set of patterns is used in the next stage to predict the following samples
of the time series. Using only similar temporal patterns to predict the time series
simplifies the predictor learning task and, thus, enables better prediction ("From causes
which appear similar, we expect similar effects," Hume [27]). The prediction stage can
be done by one of the common time series prediction methods (e.g., linear prediction
with ARMA models or nonlinear prediction with NNs). The prediction method pre-
sented here combines unsupervised learning in the clustering phase and supervised
learning in the modeling phase. The learning procedure can be dynamic by utilizing
the clustering procedure for each new sample and predicting the next samples by the
adapted clustering results.

Figure 1 Time series analysis using fuzzy clustering.

2.1. State Recognition and Time Series Prediction


Using Unsupervised Fuzzy Clustering
Given L samples of a time series, s_n ∈ ℜ, n = 1, ..., L, our final aim is to predict its
dth sample ahead, s_{L+d}. The following steps outline the algorithm for state recognition
and time series prediction using fuzzy clustering of its temporal patterns:

0. Construct the N-dimensional data set for clustering from the following M tem-
poral patterns (column vectors):

x_i = {s_i, ..., s_{i+N−1}} ∈ ℜ^N,  i = 1, ..., M,  where M = L − N − d + 2

1. Cluster the temporal patterns into an optimal (subject to some cluster validity
criterion) number of fuzzy sets, K, that is, find the degree of membership,
0 ≤ u_{j,i} ≤ 1, of each temporal pattern x_i, i = 1, ..., M, in each cluster,
j = 1, ..., K, such that

Σ_{j=1}^{K} u_{j,i} = 1,  i = 1, ..., M

2. Fit a prediction model to each fuzzy cluster, j = 1, ..., K, using the set of its
"maximal members," that is, the set of temporal patterns that have the maximal
degree of membership in the jth cluster,
• A_j = {x_i | i ∈ J_j}, and a column vector of the corresponding predictions,
• b_j = {s_{i+N−1+d} | i ∈ J_j}, as a learning set, where
• J_j = {i | u_{j,i} = max_{k=1,...,K}(u_{k,i}), i = 1, ..., M − 1} is the set of indices of the tem-
poral patterns that have the maximal degree of membership in the jth cluster.
Note that each temporal pattern is a member in one and only one set of max-
imal members.
Assuming a linear prediction model of order N,
• b_j = c_j · A_j, for each cluster, j = 1, ..., K, one can estimate the N-dimensional
coefficients (row) vectors, c_j, by
• c_j = b_j · pinv(A_j), where "pinv" stands for the Moore-Penrose pseudoinverse
[28] of a matrix.
Note that other prediction models, such as NN techniques, can be learned for each
cluster and that the degree of membership of each pattern in the cluster can be used
as the weight of this pattern in the learning process.

3. Predict the dth sample ahead, s_{L+d}, by a fuzzy mixture of all the prediction models
that were found in the previous step (2), using the degree of membership of the last
pattern, u_{j,M}, in all the clusters, j = 1, ..., K, for weighting the models.
For the preceding linear model

s_{L+d} = Σ_{j=1}^{K} u_{j,M} · c_j · x_M

The procedure can be partially applied and terminated after each of the stages accord-
ing to the user's requirements.
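A compact sketch of steps 0, 2, and 3 in Python (NumPy assumed). The membership matrix U is taken as given, e.g. produced by any fuzzy clustering routine; variable names mirror the notation above and are illustrative:

import numpy as np

def temporal_patterns(s, N, d):
    """Step 0: arrange the series into M = L - N - d + 2 patterns of length N (columns of X)."""
    L = len(s)
    M = L - N - d + 2
    X = np.column_stack([s[i:i + N] for i in range(M)])
    return X, M

def predict_next(s, U, N, d):
    """Steps 2-3: per-cluster linear models via the pseudoinverse, fuzzy-mixture prediction."""
    X, M = temporal_patterns(s, N, d)
    K = U.shape[0]
    labels = np.argmax(U, axis=0)               # maximal-membership cluster of each pattern
    s_hat = 0.0
    for j in range(K):
        idx = [i for i in range(M - 1) if labels[i] == j]
        if not idx:
            continue
        A_j = X[:, idx]                          # patterns of the jth cluster
        b_j = np.array([s[i + N - 1 + d] for i in idx])
        c_j = b_j @ np.linalg.pinv(A_j)          # c_j = b_j * pinv(A_j)
        s_hat += U[j, M - 1] * (c_j @ X[:, M - 1])
    return s_hat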
2.2. Features Extraction and Reduction
The clustering can be applied directly on windows of the sampled raw data or on
some of its derivatives (the phase space of the data). It is common practice to use a
transformation of the input instead of the data elements themselves in order to char-
acterize the signal's inherent properties [29]. A considerable amount of experience has
been gained in using several known transformations (such as discrete derivatives, spec-
trum estimation, and wavelet analysis) for feature extraction to reduce the data's

dimension and to modify the data to fit the context of a specific problem. An important
feature of the proposed prediction methods is their ability to perform the clustering
under any such transformation of the input and thus to exploit the related benefits.

2.2.1. Spectrum Estimation

The power spectrum (including the high-order spectrum) of each temporal pattern can
be estimated by one of the common spectrum estimation methods (e.g., short-time fast
Fourier transform, AR, ARMA, eigenvector analysis) to construct the features matrix
for the clustering algorithm. The power spectrum is a direct
and robust way to describe the nature of the signal. One of the noteworthy advantages
of the power spectrum is its phase invariance. The main disadvantage of power spec-
trum estimation is the requirement for a stationary signal.

2.2.2. Time-Frequency Analysis

Multiscale Decomposition by the Fast Wavelet Transform. The wavelet transform


[8,9,30] provides an important tool in biomedical signal analysis and features extraction
[31,32]. It produces a good local representation of the signal in both the time domain
and the frequency domain. Unlike the Fourier transform, which is global and provides
a description of the overall regularity of signals, the wavelet transform looks for the
spatial distribution of singularities. The wavelets' coefficients of the different scales
offer a compact representation and more information about the history of the time
series. In this case the prediction task can be simplified by predicting each scale of the
wavelets' coefficients separately and then combining the predictions of all the scales in
order to predict the original series [33]. The wavelets' coefficients or their statistical
values can serve as the extracted features to form the input matrix for the clustering
algorithm.
This transform particularly suits the EEG signal, which, at any instant, is a mixture
of (usually up to five) discrete "rhythms," with occasional embedded single events that
can be viewed as composed of elements of the same rhythms [34]. The simultaneous
multiconvolution of the wavelet transform ensures that in any given epoch, single
events will be captured near the central frequency of one or more band-pass filters,
and that any temporal spectral variation (or switch of dominant rhythm) will differently
influence the product of the several filters on consecutive time segments and will also
activate changes in the feature space [32]. For the EEG signal, the first five to seven
scales (depending on the sampling frequency) of wavelet coefficients obtained by the
fast wavelet transform with a suitable mother wavelet [7] are sufficient to capture
information within the frequency range of interest [32]. Feature reduction is achieved
by computing for each scale the statistical values of the moments of variance (energy),
skewness, and kurtosis of the wavelet coefficients. In addition to these statistical values,
the number of extrema ("zero crossings") per unit time in each scale of the transform,
which contain other, nonlinear, information about the signal, may be obtained and
added to the list of features for the clustering algorithm.
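A rough sketch of this feature extraction, assuming the PyWavelets and SciPy packages are available; the mother wavelet, number of levels, and zero-crossing count are illustrative choices, not the chapter's exact settings:

import numpy as np
import pywt
from scipy.stats import skew, kurtosis

def wavelet_features(segment, wavelet="db4", levels=5):
    """Per-scale variance (energy), skewness, kurtosis, and extrema count of an EEG segment."""
    coeffs = pywt.wavedec(segment, wavelet, level=levels)
    feats = []
    for c in coeffs[1:]:                       # detail coefficients of each scale
        zero_crossings = np.sum(np.diff(np.sign(c)) != 0)
        feats.extend([np.var(c), skew(c), kurtosis(c), zero_crossings])
    return np.array(feats)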
Multichannel Matching Pursuit. Mallat and Zhang [35] have introduced an algo-
rithm, called matching pursuit, that decomposes any signal into a linear expansion of
waveforms that are selected from a time-frequency dictionary of functions. These wave-

forms are chosen in order to best match the signal structures. Matching pursuits are
general procedures to compute adaptive signal representations and feature extraction.
A matching pursuit can isolate the signal structures that are coherent with respect to a
given dictionary and provides an interpretation of the signal structures. At each itera-
tion of the algorithm, a waveform that is best adapted to approximate part of the signal
is chosen. If a signal structure does not correlate well with any particular dictionary
element, namely a noise component, it is subdecomposed into several elements and its
information is diluted. Although matching pursuit is a nonlinear procedure, it does
maintain an energy conservation that guarantees its convergence [35].
This algorithm has been generalized into spatiotemporal matching pursuit
(SToMP) and adapted to multiple source estimation of EEG complexes such as evoked
potentials (EPs), which are known to be summations of simultaneous electrical activ-
ities of deeper generators [36]. By using a physiologically motivated time-frequency
dictionary of waveforms, the number and the temporal activity pattern of the signal
generators may be estimated. A slightly different version of the SToMP algorithm can
be directly applied to the temporal patterns of a multidimensional signal as a feature
extraction procedure for the clustering algorithm [34].

2.3. The Hierarchical Unsupervised Fuzzy


Clustering (HUFC) Algorithm
The basic idea behind the algorithm is a reexamination of each cluster formed by a
process of fuzzy clustering, as a candidate for fuzzy subclassification. The genealogy of
the degree of membership of each data point, i, in the evolving process is preserved by
serial multiplication such that in each classification step, n, the actual degree of mem-
bership of a data point, i, in a daughter cluster, l, will be u_{l,i} × u_{k,i}, the latter being its
degree of membership in the mother cluster, k. At the termination of the algorithm,
data points may share membership in more than one cluster of the final generation as
well as in clusters of previous generations that did not subdivide.
The hierarchical unsupervised fuzzy clustering (HUFC) algorithm includes a recur-
sive call to a two-step procedure (Figure 2). In the optionally first step of each recursive
call, the relevant features of the subset of the data at this level of the recursive process
are extracted and reduced. By this means it is possible to select different optimal
features for different subsets of the data. It should be noted that, if this optionally
first step is recursively applied, the clustering is done using different features in each
level. Consequently, the domains of the membership function change and the final
membership function should be carefully interpreted according to the selected features
in each level. A basic method for feature reduction is the Karhunen-Loève transform
(KLT), which is also commonly referred to as the eigenvector, principal component
(PCA), or Hotelling transform. The reconstruction error is minimized by selecting the
eigenvectors associated with the largest eigenvalues. Thus the KLT is optimal in the
least-square-error sense, or, from the point of view of information theory, the KLT
achieves the lowest overall distortion of any orthogonal linear transform for a fixed
number of coefficients [37]. The KLT can follow any specific appropriate feature extrac-
tion method (e.g., spectrum or wavelet analysis for signal and image processing) or
stand by itself [22]. In the second stage of each recursive call a new weighted version of
the unsupervised optimal fuzzy clustering (WUOFC) algorithm [26,38] is applied to the
relevant subset of the data. In this version of the UOFC algorithm each data point has a
different weight in the partitioning.

Figure 2 Schematic flow diagram of the hierarchical unsupervised fuzzy clustering (HUFC) algorithm for state recognition and events prediction.

The number of clusters in each stage is determined
by adapted cluster validity criteria, based on the hypervolume measurement [26,38].
In the first call to the procedure all data points have an equal weight (of one) in
the partitioning. In the next level of the recursive process the same two-step procedure
is applied to each of the fuzzy clusters that were found in the previous partitioning.
Each fuzzy cluster is composed of all the data points with nonzero membership values
in it. These memberships are used as the weights of the data points for the next
recursive call to the WUOFC. The final membership values of each level are the
membership values that are found by the WUOFC algorithm multiplied by the
given weights of the data points. This procedure ensures that the final membership
values of each level of the recursive calls are decreasing. The recursive process is
terminated when the optimal number of clusters in the subset is one (which constrains
the chosen cluster validity criterion to be applicable and also sensitive for one cluster)
or when the number of data points in a cluster is smaller than some constant (usually
around 10) multiplied by the number of features [20,24]. Note that, in contrast to
"hard" hierarchical clustering, the final decision about the data point affiliation is
made only when the algorithm terminates, since each data point can have a nonzero
membership in more than one cluster.
The main part of the new algorithm is a recursive procedure HUFC(X, w) whose
inputs are an N × M data matrix, X, composed of M columns of data patterns,
x_j ∈ ℜ^N, j = 1, ..., M, and a column vector, w ∈ ℜ^M, of M weights of each data
pattern in the partitioning. The final result of the HUFC algorithm is a K_g × M global
matrix, U_g, of the memberships of all M data patterns in all final K_g fuzzy clusters. The
HUFC algorithm is initiated by setting the global matrix U_g to an empty matrix and the
global number of clusters K_g to zero and executed by calling HUFC(X_0, w_0), where X_0
is the matrix of the M_0 original data patterns and w_0 is a column vector of M_0 ones. The
pseudocode of the HUFC procedure includes the following steps:
HUFC(X, w)
1. Extract the optimal features from the columns of the matrix X and reduce the
number of features from N to F (basically by KLT).
2. If the sum of the patterns' weights Σ_{j=1}^{M} w_j > Constant × F   ◊ commonly Constant ≈ 10
3. then (U, K) = WUOFC(X, w)
   ◊ apply the weighted unsupervised optimal fuzzy clustering algorithm (see Section 2.4),
   ◊ where K is the chosen number of clusters in the given data and U is a K × M matrix of
   ◊ the memberships of the M given patterns in these K clusters.
4. else K = 1
5. If K > 1
6. then for k ← 1 to K
7.    do HUFC(X, w × u_k)   ◊ recursive call to the main procedure
   ◊ where u_k is the vector of the memberships of all M patterns in the kth cluster,
   ◊ and w × u_k denotes a vector whose jth component is w_j × u_{k,j}, j = 1, ..., M.
8. else append the column vector w to the global memberships matrix U_g
9.    K_g ← K_g + 1   ◊ increase the global number of clusters by one.
10. return
When the algorithm has terminated, U_g contains the final memberships of all the data
patterns in all the K_g final clusters.

2.4. The Weighted Unsupervised Optimal Fuzzy


Clustering (WUOFC) Algorithm
The (U, K) = WUOFC(X,w) procedure is recursively applied in step 3 of the
HUFC algorithm. The two inputs of the weighted version of the UOFC algorithm
[26,38] are a data matrix, X, and a column vector, w, of the weights of each column
pattern in the matrix X. No other parameters are given to the procedure. The final
outputs of the procedure are K fuzzy clusters, where K is the estimated number of
clusters in the given data, and U is a K x M matrix of the memberships of the M given
patterns in these K clusters. The data matrix X can be directly used as the input to the
WUOFC procedure or after applying an appropriate feature extraction and reduction
technique. Adapting adequate feature extraction and reduction methods to a given data
set is essential for feasible clustering results. Keeping in mind that the feature extraction
and reduction procedure of the HUFC algorithm is reactivated for each and every fuzzy
cluster (Figure 2) makes its adaptability and efficiency even more significant. The dis-
crete KLT can be optionally applied before the clustering for feature extraction [22] or
for feature reduction after any other feature extraction methods.
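A minimal sketch of the KLT/PCA feature-reduction step mentioned above, assuming NumPy and keeping the F eigenvectors with the largest eigenvalues; this is an illustration, not the chapter's implementation:

import numpy as np

def klt_reduce(X, F):
    """Project the N x M pattern matrix X (columns = patterns) onto its F principal eigenvectors."""
    Xc = X - X.mean(axis=1, keepdims=True)        # remove the mean pattern
    cov = Xc @ Xc.T / X.shape[1]                  # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:F]]
    return top.T @ Xc                             # F x M reduced feature matrix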
The UOFC hybrid algorithm [26,38] partitions the data by a combination of the
modified fuzzy K-means (FKM) algorithm [18] and the fuzzy maximum-likelihood
estimation (FMLE) algorithm [26,38], which is similar to the EM algorithm [24]. The
advantage of the UOFC algorithm is the unsupervised initialization of cluster proto-
types and the criteria for cluster validity using fuzzy hypervolume and density func-

tions. It performs well in a situation of large variability of cluster shapes, densities, and
number of data points in each cluster. The pseudocode of the weighted version of the
UOFC algorithm is iterated for an increasing number of clusters in the data set, calcu-
lating a new partition of the data set, and computing performance measures in each
run, until the optimal number of clusters is obtained:
(U, K) = WUOFC(X, w)
1. Choose a single, K = 1, initial centroid, P_1, at the weighted (by w) mean of all data
patterns.
2. While K < the maximal feasible number of clusters in the data
3. do Calculate a new partition of the data set by two phases (see Section 2.5):
3.1 Cluster with the weighted fuzzy K-means with the Euclidean distance function:
(U, P_K) = WFKM(X, w, K, P_{K-1})
3.2 Use the final centroids of stage 3.1, P_K, as the initial centroids for the weighted
fuzzy K-means with the exponential distance function;
a fuzzy modification of the maximum likelihood estimation (FMLE):
(U, P_K) = WFKM(X, w, K, P_K)
4. Calculate the cluster validity criteria (see Section 2.6).
5. Add another centroid equally distant (with a large number of standard deviations)
from all data points (see step 2 in the following modified fuzzy K-means
algorithm).
6. Use the cluster validity criteria to choose and return the optimal number of clusters,
K, and the corresponding partition, U.
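A minimal Python sketch of this control flow is given below. The helper names wfkm and validity are hypothetical placeholders for the procedures of Sections 2.5 and 2.6, and the way the remote extra centroid is generated here is only illustrative:

import numpy as np

def wuofc(X, w, wfkm, validity, K_max=10):
    # Hypothetical sketch of the WUOFC control flow.
    # X: (n_features, M) data, w: (M,) weights.
    # wfkm(X, w, K, P0, distance) is assumed to return (U, P);
    # validity(U, X, w, P) is assumed to return a score (larger is better).
    P = np.average(X, axis=1, weights=w)[:, None]        # step 1: single weighted-mean centroid
    results = []
    for K in range(1, K_max + 1):                        # step 2
        U, P = wfkm(X, w, K, P, distance='euclidean')    # step 3.1: fuzzy K-means phase
        U, P = wfkm(X, w, K, P, distance='exponential')  # step 3.2: FMLE phase
        results.append((validity(U, X, w, P), K, U))     # step 4: cluster validity criteria
        far = P.mean(axis=1) + 10.0 * X.std()            # step 5: add a remote extra centroid
        P = np.column_stack([P, far])
    score, K_opt, U_opt = max(results, key=lambda r: r[0])  # step 6: best partition wins
    return U_opt, K_opt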
2.5. The Weighted Fuzzy K-Mean (WFKM) Algorithm
The weighted version of the fuzzy K-mean algorithm, which is used in stages 3.1
and 3.2 of the WUOFC algorithm, is derived from the minimization, with respect to P
(a set of cluster centers) and U (a membership matrix), of a weighted fuzzy version of
the least-squares function [18]:

J_q(U, P) = \sum_{i=1}^{M} \sum_{k=1}^{K} w_i \cdot u_{k,i}^{q} \cdot d^2(p_k, x_i)    (1)

where x_i is the ith pattern (the ith column in the X data matrix), p_k is the center of the
kth cluster, u_{k,i} is the degree of membership of the data pattern x_i in the kth cluster, w_i is
the weight of the ith pattern (as if w_i patterns that are equal to x_i were included in the
data matrix X), d^2(p_k, x_i) is the square of the distance between x_i and p_k, M is the
number of data patterns, and K is the number of clusters in the partition. The parameter
q (commonly set to 2) is the weighting exponent for u_{k,i}; q controls the
"fuzziness" of the resulting clusters [18]. The pseudocode of the weighted fuzzy K-
mean clustering algorithm with the modified centroids initialization [15,26,38] includes
the following steps:
(U, P_K) = WFKM(X, w, K, P_{K-1})
1. Use the final centroids (prototypes) of the previous partition, P_{K-1}, as the initial
centroids for the current partition: in stage 3.1 of the WUOFC algorithm use the
K − 1 (*) final centroids of its previous stage, and for phase 3.2 use all the K final
centroids, P_K, of stage 3.1.
2. repeat Calculate the degree of membership u_{k,i} of all data patterns in all clusters:
for k ← 1 to K (*)
do for i ← 1 to M
do

u_{k,i} = \left[ d^2(x_i, p_k) \right]^{1/(1-q)} \Big/ \sum_{j=1}^{K} \left[ d^2(x_i, p_j) \right]^{1/(1-q)}    (2)

(*) Only for k = K and in the first iteration of stage 3.1 of the UOFC algorithm use
the following distance: d^2(x_i, p_k) = 10 · Sum(Diagonal(Covariance(X))),
i = 1,..., M. Otherwise use the Euclidean distance in stage 3.1 (Eq. 4) or the
exponential distance (Eq. 5) in stage 3.2 of the WUOFC algorithm.
3. Calculate the new set of cluster centers:
for k ← 1 to K
do

p_k = \sum_{i=1}^{M} u_{k,i}^{q} \cdot w_i \cdot x_i \Big/ \sum_{i=1}^{M} u_{k,i}^{q} \cdot w_i    (3)

4. until \max_{k,i} \left| u_{k,i} - (\text{previous } u_{k,i}) \right| < \epsilon

In the first phase 3.1 of the WUOFC algorithm, the weighted fuzzy K-mean algorithm
is performed with the Euclidean distance function:

d^2(p_k, x_i) = (p_k - x_i)^T \cdot (p_k - x_i)    (4)

The final cluster centers of the first phase 3.1 are used as the initial centroids for the
second phase. In the second phase 3.2, a fuzzy modification of the maximum likelihood
estimation is utilized by using the following exponential distance function in the
weighted fuzzy K-mean algorithm:

d^2(p_k, x_i) = \frac{[\det(F_k)]^{1/2}}{\alpha_k} \cdot \exp\left[ (p_k - x_i)^T \cdot F_k^{-1} \cdot (p_k - x_i) / 2 \right]    (5)

where \alpha_k = \sum_{i=1}^{M} u_{k,i} \big/ \sum_{i=1}^{M} w_i is the sum of memberships within the kth cluster, which
constitutes the a priori probability of selecting the kth cluster, and

F_k = \sum_{i=1}^{M} u_{k,i} \cdot w_i \cdot (p_k - x_i)(p_k - x_i)^T \Big/ \sum_{i=1}^{M} u_{k,i} \cdot w_i    (6)

is the fuzzy covariance matrix of the kth cluster.


By applying these two phases, the fuzzy K-mean algorithm with the Euclidean
distance function is used to find a feasible initial partition, and the fuzzy modification
of the maximum likelihood estimation is utilized to refine the partition for normally
distributed clusters with large variability of the covariance matrix (shape, size, and
density) and the number of patterns in each cluster. Note that other distance functions
can be used according to the intrinsic characteristics of the data.
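The following Python sketch illustrates one WFKM iteration under these definitions (Eqs. 2-6). It is a simplified illustration, not the authors' implementation, and assumes the patterns are stored as the columns of X:

import numpy as np

def wfkm_step(X, w, U, P, q=2.0, distance='euclidean'):
    # One weighted fuzzy K-mean iteration (an illustrative sketch of Eqs. 2-6).
    # X: (n, M) patterns in columns, w: (M,) weights,
    # U: (K, M) current memberships, P: (n, K) current centroids.
    n, M = X.shape
    K = P.shape[1]
    d2 = np.empty((K, M))
    for k in range(K):
        diff = X - P[:, [k]]                                   # (n, M) differences between p_k and each x_i
        if distance == 'euclidean':                            # Eq. (4)
            d2[k] = np.sum(diff * diff, axis=0)
        else:                                                  # exponential (FMLE) distance
            uw = U[k] * w
            F_k = (uw * diff) @ diff.T / uw.sum()              # fuzzy covariance, Eq. (6)
            alpha_k = U[k].sum() / w.sum()                     # a priori probability of cluster k
            maha = np.einsum('ji,jk,ki->i', diff, np.linalg.inv(F_k), diff)
            d2[k] = np.sqrt(np.linalg.det(F_k)) / alpha_k * np.exp(maha / 2.0)  # Eq. (5)
    t = np.maximum(d2, 1e-12) ** (1.0 / (1.0 - q))             # membership update, Eq. (2)
    U_new = t / t.sum(axis=0, keepdims=True)
    uwq = (U_new ** q) * w                                     # weighted centroid update, Eq. (3)
    P_new = (X @ uwq.T) / uwq.sum(axis=1)
    return U_new, P_new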

2.6. The Fuzzy Hypervolume Cluster Validity Criteria
In step 4 of the WUOFC algorithm the following criteria for cluster validity are
calculated:
1. The fuzzy hypervolume criterion (HPV):

V_{HV}(K) = \sum_{k=1}^{K} h_k    (7)

where the hypervolume of the kth cluster is defined by h_k = [\det(F_k)]^{1/2}.


2. The partition density (PD):

V_{PD}(K) = \sum_{k=1}^{K} C_k \Big/ \sum_{k=1}^{K} h_k    (8)

where C_k = \sum_{i \in I_k} u_{k,i} \cdot w_i, and I_k is the set of indices of the "central members" in the
kth cluster.

I_k = \left\{ i \mid (p_k - x_i)^T \cdot g_{k,j} \cdot (p_{k,j} - x_{i,j}) < 1, \;\; \forall j = 1, \ldots, N, \; i = 1, \ldots, M \right\}

where g_{k,j} is the jth column of G_k = F_k^{-1}, the inverse of the kth cluster covariance
matrix.
Note that a pattern x_i is a "central member" in the kth cluster only if all the
projections of the Mahalanobis distance between the pattern x_i and the kth cen-
troid p_k are smaller than one (and not as in [26,34], where the Mahalanobis dis-
tance itself should be smaller than one).
3. The average partition density using "central members" (APDC):

V_{APD}(K) = \frac{1}{K} \sum_{k=1}^{K} \left[ C_k / h_k \right]    (9)

4. The average partition density using "maximal members" (APDM):

V_{APDM}(K) = \frac{1}{K} \sum_{k=1}^{K} \left[ m_k / h_k \right]    (10)

where m_k = \sum_{i \in J_k} u_{k,i} \cdot w_i and J_k is the set of indices of the "maximal members" in
the kth cluster:

J_k = \left\{ i \mid u_{k,i} = \max_{m} (u_{m,i}), \; i = 1, \ldots, M \right\}

5. The normalized (by K) partition indexes criterion:

V_J(U, P) = K \cdot \sum_{i=1}^{M} \sum_{k=1}^{K} u_{k,i}^{q} \cdot w_i \cdot d^2(p_k, x_i)    (11)

The UOFC algorithm is terminated when the performance measures for cluster
validity reach their best value. The choice of the criterion or combination of criteria to
be the performance measure is driven by the specific distribution of the data. One of the
main constraints on a validity criterion for the HUFC algorithm is its efficient applic-
ability for one cluster (compared to more than one cluster), remembering that the
recursive procedure is halted when the "partition" to one cluster is the best of all
partitions. This constraint precludes the use of any validity criterion that involves the
distance between clusters (such as the classical Fisher criterion [24], which is based on
the "between and within clusters scatter matrix," or the well established Xie-Beni
criterion [39] and many others).
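A simplified Python sketch of the hypervolume and density criteria (Eqs. 7-10) is shown below. For brevity the "central member" test uses the full Mahalanobis distance being smaller than one, rather than the stricter per-projection test described above:

import numpy as np

def validity_criteria(X, w, U, P):
    # Sketch of the hypervolume / partition-density criteria (Eqs. 7-10).
    # X: (n, M) patterns in columns, w: (M,) weights,
    # U: (K, M) memberships, P: (n, K) centroids.
    K = P.shape[1]
    h = np.empty(K)                                # cluster hypervolumes h_k
    c = np.empty(K)                                # central-member membership sums
    m = np.empty(K)                                # maximal-member membership sums
    winners = U.argmax(axis=0)                     # cluster of maximal membership per pattern
    for k in range(K):
        diff = X - P[:, [k]]
        uw = U[k] * w
        F_k = (uw * diff) @ diff.T / uw.sum()      # fuzzy covariance of cluster k
        h[k] = np.sqrt(np.linalg.det(F_k))         # h_k = det(F_k)^(1/2)
        maha = np.einsum('ji,jk,ki->i', diff, np.linalg.inv(F_k), diff)
        c[k] = (U[k] * w)[maha < 1.0].sum()        # "central members" (simplified test)
        m[k] = (U[k] * w)[winners == k].sum()      # "maximal members"
    V_HV = h.sum()                                 # Eq. (7)  fuzzy hypervolume
    V_PD = c.sum() / h.sum()                       # Eq. (8)  partition density
    V_APD = np.mean(c / h)                         # Eq. (9)  average partition density (central)
    V_APDM = np.mean(m / h)                        # Eq. (10) average partition density (maximal)
    return V_HV, V_PD, V_APD, V_APDM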
2.7. The Dynamic WUOFC Algorithm
In the realization of temporal patterns clustering, the data set X can be dynamic.
For each new sample s_{i+1}, we get another temporal pattern (column) in the matrix X,
and repeat the HUFC with the new data set. The dynamic procedure is started with an
initial X matrix with M_0 columns and is rerun for each new sample. It is possible to save
computation time by initializing the prototypes of each partition with the centroids of the
last partition. We can gradually decrease the weights of old samples in the clustering
procedure by the following definition of w_i:

w_i = \left( 1 + \exp(-2 \cdot i \cdot \beta / M) \right)^{-1}    (12)

where M is the number of columns in X and β is a constant (usually set to M_0, the initial
number of patterns in X). In time series clustering, w_i has the meaning of a memory
coefficient. By this dynamic procedure we get a dynamic number of clusters with
dynamic moving centroids. The dynamic parameters that are used to identify the
state of the system are the variable number of clusters, the location of the centroids
of the clusters, and the fuzzy-covariance matrices of the clusters.
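For illustration, the weighting of Eq. (12) can be computed as follows (a minimal sketch; the function name is hypothetical):

import numpy as np

def memory_weights(M, beta):
    # Eq. (12): logistic weights that gradually discount old samples.
    # i = 1 is the oldest pattern, i = M the newest; beta is typically set to M0.
    i = np.arange(1, M + 1)
    return 1.0 / (1.0 + np.exp(-2.0 * i * beta / M))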

3. RESULTS

As mentioned before, the first part of the hybrid procedure can be applied for state
recognition and events detection, while the whole procedure should be used only if time
series prediction is needed. In the following two sections we will demonstrate these
options.

3.1. State Recognition and Events Detection


In the first example the algorithm is used for prediction of epileptic seizures from
an epileptic rat's real EEG data. The energies of the first four scales of the discrete
wavelet transform [35] were used as the features for each second of two-channel EEG
data. Since the average density criterion, using the "central members" (Eq. 9), was
found in practice to give the best results with the EEG data [22,26,38], it is used as
the validity criterion in this case. In the first stage of the HUFC the EEG data were
clustered into two classes (Figures 3 and 4). In this initial stage, a rough partition into a
"normal" state (cluster 1), and a preseizure and seizure state (cluster 2) is performed.
We can clearly see that (after 600 seconds) epochs with isolated spikes in the preseizure
state are classified with the seizure (class 3), but also some "normal" EEG patterns
(probably with K-complexes) are clustered with the seizure and preseizure segments. In
the final stage of the HUFC algorithm (Figure 5) these segments were separated such
that clusters number 4 and 5 can be adapted as a good predictor of the progressing
seizure, which can be identified by clusters 8 and 10. In 16 of 25 animals a preseizure
state ranging in time between 0.7 and 4.5 minutes was uniquely defined by the new
algorithm, either by the appearance of a distinct new cluster or by a new combination of
membership sharing in two or more clusters.
The second example is a heart rate signal of recovery from the exercise state into
the rest state. In this case, each sample of the signal, s_n, n = 1,..., L, is the time
between the current heartbeat and the previous heartbeat (the heart rate variability
signal). The number of samples in each temporal pattern, N, was chosen to be three, so
the three-dimensional (3D) temporal patterns for the clustering algorithm consist of
{s_i, s_{i+1}, s_{i+2}}, i = 1,..., L − 2. In the first stage of the HUFC algorithm, four clusters
(Figure 7) were found by the average partition density criteria (Figure 6). In the final
stage, the data were clustered into 10 fuzzy sets. These 10 clusters help in exploring the
dynamics of the nonstationary heart rate signal. For example, the first two clusters are
related to the short-term phase of the recovery, where the heart rate is almost the same

Figure 3 The validity criteria of the first stage of the HUFC algorithm for the EEG
data. Only the average partition density criterion suggests the choice of two clusters in
this first stage. The other criteria were found to be useless for these continuous clusters.

Figure 4 The first partition of rat number 11's EEG data into two clusters by the
HUFC algorithm. The upper panel shows the partition in the clustering
space of three (out of eight) energies of the scales of the discrete wavelet
transform of the EEG stretch terminating with a seizure. Each data point is
marked by the number of the cluster in which it has maximal degree of
membership. The number of clusters was determined by the average density
criterion for cluster validity (Figure 3). The lower panel shows the "hard"
affiliation of each successive point in the time series (1 second) to each of
the clusters. The seizure beginning (as located by a human expert) is marked
by a solid vertical line (after 700 seconds). (See insert for color illustrations.)

Figure 5 The final partition of the EEG data with the HUFC algorithm. Clusters 4
and 5 can be used to predict the seizure, which can be identified by clusters
8 and 10. (See insert for color illustrations.)

as during the exercise. We can learn from the small variance of these first clusters that in
this first stage the heart rate variability is very small. Cluster number 5 (which did not
exist in the first stage) indicates the very beginning stage of the recovery. The patterns of
clusters 9 and 10 mark the breathing during recovery and resting stages, respectively,
and so on. This example emphasizes the ability of the proposed algorithm to extract
and to quantify the dynamic states of the subject.


Figure 6 The validity criteria of the first stage of the HUFC algorithm for the heart
rate signal. Again, only the average partition density criterion (bottom
right) suggests the choice of four clusters in this first stage. The other
criteria were found to be useless for these continuous clusters.

3.2. Time Series Prediction


Because the functional estimation and prediction problem is ill-posed [10,11,16],
the only available way to check the feasibility of a proposed method is by measuring the
accuracy of the predicted results for available data sets. The new algorithm was vali-
dated by its application to the prediction of well-known benchmarks for nonstationary
and nonlinear time series, such as the sunspots time series and data sets "A" and "D"
from the Santa Fe competition [3], and showed good results when compared with well-
known results from the literature.
The prediction error (or accuracy) was computed from the normalized mean
square error (NMSE):

\mathrm{NMSE}(T) = \frac{\sum_{j=1}^{T} (\text{observation}_j - \text{prediction}_j)^2}{\sum_{j=1}^{T} (\text{observation}_j - \text{mean}(\text{observation}))^2} = \frac{1}{\sigma^2 T} \sum_{j=1}^{T} (\text{observation}_j - \text{prediction}_j)^2    (13)

where j = 1,..., T enumerates the points in the withheld test set, "observation" is the
true value of the time series, s; "prediction" is the output of the prediction algorithm,
ŝ; and σ² denotes the sample variance of the observed time series in the test set. A value
of NMSE = 1 corresponds to simply predicting the average.
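A direct Python transcription of Eq. (13) might look as follows (a sketch; the array names are illustrative):

import numpy as np

def nmse(observation, prediction):
    # Normalized mean square error over the withheld test set (Eq. 13).
    # A value of 1 corresponds to always predicting the test-set mean.
    observation = np.asarray(observation, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return (np.sum((observation - prediction) ** 2)
            / np.sum((observation - observation.mean()) ** 2))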
Learning the optimal dimension of the temporal patterns N is still an open problem.
The number of clusters in each partition was automatically chosen by the average
partition density criterion using the "maximal members" (Eq. 10). Note that the
clusters that are created by continuously sampled temporal patterns are different from

Figure 7 The first partition of the recovery heart rate signals by the HUFC algorithm
into four clusters as suggested by the average partition density (Figure 6).
The upper panel shows the partition of the 3D temporal patterns {s_i, s_{i+1},
s_{i+2}}, i = 1,..., L − 2, of the heart rate signal into the four clusters, and in
the lower panel we can see the affiliation of each temporal pattern with its
corresponding cluster marked on the original heart rate signal (the contin-
uous line). (See insert for color illustrations.)

other clusters; they are not well separated and do not have any regular shape (Figures
3-11), and it seems that the APD criterion is well adapted to these kinds of clusters
(where other validity criteria give poor results or, most of the time, show monotonic
behavior).

Figure 8 The final partition of the heart rate variability signal into 10 clusters. The
upper panel shows the partition of the 3D temporal patterns of the heart
rate signal into the final 10 clusters, and in the lower panel we can see the
affiliation of each temporal pattern with its corresponding cluster marked
on the original heart rate signal (the continuous line). (See insert for color
illustrations.)

For illustrative purposes, we consider the well-known nonlinear time series
s_n = 4 · s_{n−1} · (1 − s_{n−1}). We used the first 900 points as a training set for the cluster-
ing-based prediction algorithm and the next 200 points as the test set (T = 200). We
predicted one sample ahead in each step (d = 1), using 2D temporal patterns (N = 2).
The top panel of Figure 9 shows the 2D temporal patterns, {s_i, s_{i+1}}, i = 1,..., 899, in


Figure 9 The first partition of the s_n = 4 · s_{n−1} · (1 − s_{n−1}) time series by the HUFC
algorithm. The lower panel shows the partition of the 2D temporal patterns,
{s_i, s_{i+1}}, i = 1,..., 899, into five clusters as suggested by the average parti-
tion density criterion in the upper panel. (See insert for color illustrations.)

the two-dimensional state (or phase) space, divided into five clusters as suggested by the
APD criterion in the lower panel. In the final stage of the HUFC algorithm, the temporal
patterns were divided into 101 clusters, and more clusters were chosen in nonlinear
areas of the phase space. Figure 10 shows the final prediction results with
NMSE = 6.116e-7.
In the second example we chose the heart rate variability signal in the rest state. Again
the first 900 points were used as a training set and the next 100 points as the test set

Figure 10 The one sample ahead (d = 1) prediction results of 200 samples of the s_n =
4 · s_{n−1} · (1 − s_{n−1}) time series. The circle (O) marks the original samples of
the time series and the "x" marks the predictions.

(T = 100). One sample ahead was predicted (d = 1), using 3D temporal patterns,
{s_i, s_{i+1}, s_{i+2}}, i = 1,..., 898 (N = 3). The top panel of Figure 11 shows the final clustering
into 19 clusters as suggested by the APD criterion. The lower panel of Figure 11
shows the classification results on the original training set. For these real data the
prediction error was 0.3076, as shown in Figure 12.

4. CONCLUSION AND DISCUSSION

We have described a method for biomedical state recognition and dynamic system
identification by an unsupervised hierarchical-fuzzy clustering. The clustering is useful
in grouping noncontinuous temporal patterns from nonstationary signals and in forming
warning clusters. Moreover, the vague flips from one state to another are naturally
treated by means of fuzzy clustering. In summary, two main problems are tackled by
the unsupervised fuzzy clustering procedure. First, it finds similar events in the "his-
tory" of the time series that are relevant to the prediction and avoids the use of non-
relevant information that can bias the prediction results. Note that noncontinuous time
series can be utilized by the clustering algorithm, so "old" observations of the time
series can be employed for the prediction. Second, using only this minimal required
number of similar temporal patterns improves the robustness and reduces the compu-
tation time of any prediction algorithm that is used. Yet the specific parameters for
temporal pattern classification by the unsupervised fuzzy clustering (the value of the
"fuzziness" of the partition, q, the partition validity criteria, etc.) should be further
investigated. Moreover, the results of the clustering (the membership function, the
centroids and their variances, etc.) could be more efficiently utilized in the prediction
process; for example, it seems promising to use RBF NN for the prediction of each

Figure 11 The final partition of the resting heart rate signals by the HUFC algorithm
into 19 clusters as suggested by the average partition density using the
"maximal members." The upper panel shows the partition of the 3D
temporal patterns {s_i, s_{i+1}, s_{i+2}}, i = 1,..., L − 2, of the heart rate signal
into 19 clusters, and in the lower panel we can see the affiliation of each
temporal pattern with its corresponding cluster marked on the original
heart rate signal (the continuous line). (See insert for color illustrations.)

Figure 12 The one sample ahead (d = 1) prediction results of 100 samples of the
resting heart rate signal. The circle (O) marks the original samples of the
time series and the "x" marks the predictions.

cluster by using the cluster centroid and variance matrix as initial values of the neuron
transfer function.
The hierarchical-fuzzy algorithm tries to exploit the advantages of hierarchical
clustering while overcoming its disadvantages by means of fuzzy clustering. The hier-
archical partition helps to fathom the inner structure of the data and thus enables
multiscale compression and reconstruction of the data. Thus, it can be used to analyze
complicated fractal structures and natural signals and pictures. The algorithm can work
for data with a wide dynamic range in both the covariance matrices and the number of
members in each cluster. The number of subclusters in each bifurcation is neither two
nor any other constant (as in some hierarchical clustering algorithms) but is adaptively
determined by the nature of the data. Choosing different optimal features for the
partition of each data subset can help obtain a feasible result when dealing with a mixture
of different types of data. The HUFC algorithm can be naturally realized by means of
parallel computation where each subclustering can be made by a different processor.
Hence it can be faster in practice than other fuzzy clustering algorithms.
The method was demonstrated by forecasting epileptic seizures from the EEG
signal and by state recognition of the recovery from exercise of the heart rate signal.
Adding more channels to the feature-extracting process and more parameters (derived
from other biomedical signals) to the clustering process should increase the forecasting
power of this method. Other possible applications for using the method as a warning
device could be the prediction of impending psychotic states, detrimental effects of
hypoxia in pilots, and loss of vigilance in drivers as well as extracerebral pathologies
such as a heart attack, based on heart rate variability. The method can also be utilized
for the prediction of nonbiological time series.
The time series prediction results of the method are encouraging. One of the
significant problems of methods for nonlinear time series prediction is the strong depen-

dence between the quality of the results and the specific characteristics of the time series
[3]. The main advantage of the new method is the unsupervised and adaptive learning of
the number of clusters, or states, in the time series space (the number of underlying
processes in the signal) and of the variable number of patterns in each cluster, which can
overcome the general nonstationary nature of the time series. However, the establish-
ment of reliable validity criteria for temporal pattern classification is an important open
issue. Another open problem of the methods is the adaptive learning of the best value
for the dimension of the temporal patterns, N, that is, the dimension of the state space
of the specific time series analyzed. Trying to choose the "optimal" N by means of the
best results of the training set, by the conventional technique, did not give the best
results in all cases.

ACKNOWLEDGMENTS

The author would like to thank Dr. Dan Kerem from the Israeli Naval Medical
Institute for the EEG data and for his physiological advice, Eran Lumbroso from Tel-
Aviv University for the heart rate signals, and Shai Poliker from the ECE department of
BGU for his helpful comments. This research was supported by The Israel Science
Foundation founded by The Israel Academy of Sciences and Humanities.

REFERENCES

[1] J. D. Hamilton, Time Series Analysis. Princeton: Princeton University Press, 1994.
[2] A. S. Weigend, Predicting the future: A connectionist approach. Int. J. Neural Syst. 1(3):
193-209, 1990.
[3] A. S. Weigend and N. A. Gershenfeld, Time Series Prediction: Forecasting the Future and
Understanding the Past. Reading, MA: Addison-Wesley, 1992.
[4] K. Hornik, M. Stinchcombe, and H. White, Multi-layer feedforward networks are universal
approximators. Neural Comput. 2: 359, 1988.
[5] E. Hartman, K. Keeler, and J. K. Kowalski, Layered neural networks with Gaussian hidden
units as universal approximators. Neural Comput. 2: 210, 1990.
[6] S. Haykin, Neural Networks, a Comprehensive Foundation. New York: Macmillan, 1994.
[7] S. G. Mallat and S. Zhong, Characterization of signals from multiscale edges. IEEE Trans.
PAMI 14: 710-732, 1992.
[8] I. Daubechies, Orthonormal bases of compactly supported wavelets. Commun. Pure Appl.
Math. 41: 909-996, 1988.
[9] S. G. Mallat, A theory for multiresolution signal decomposition: The wavelet representa-
tion. IEEE Trans. Pattern Analy. Machine Intell. 11: 674-693, 1989.
[10] B. R. Bakshi and G. Stephanopoulos, Wave-net: A multiresolution, hierarchical neural
network with localized learning. AIChE J. 39(1): 57-81, 1993.
[11] B. R. Bakshi and G. Stephanopoulos, Reasoning in time: Modeling analysis and pattern
recognition of temporal process trends. Adv. Chem. Eng. 22: 485-547, 1995.
[12] Q. Zhang and A. Benveniste, Wavelet networks. IEEE Trans. Neural Networks 3: 889-898,
1992.
[13] Q. Zhang, Using wavelet networks in nonparametric estimation. IEEE Trans. Neural
Networks 8(2): 227-236, 1997.
52 Chapter 2 Applications of Fuzzy Clustering to Biomedical Signal Processing

[14] B. Delyon, A. Juditsky, and A. Benveniste, Accuracy analysis for wavelet approximations.
IEEE Trans. Neural Networks 6(2): 332-348, 1995.
[15] A. B. Geva, Dynamic unsupervised fuzzy clustering in forecasting events from biomedical
signals. Ministry of Science, International Conference on Fuzzy Logic and Applications, Israel,
May 1997.
[16] T. Poggio and F. Girosi, A theory of network for approximation and learning. Proc. IEEE
78(9): 1481-1497, 1990.
[17] J. Kohlmorgen, K.-R. Müller, and K. Pawelzik, Segmentation and identification of drifting
dynamical systems. Proceedings of the 1997 IEEE Signal Processing Society Workshop on
Neural Networks for Signal Processing, Florida, September, 1997.
[18] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York:
Plenum, 1981.
[19] J. C. Bezdek and S. K. Pal, Fuzzy Models for Pattern Recognition. New York: IEEE Press,
1992.
[20] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice
Hall, 1988.
[21] I. Gath, C. Feuerstein, and A. B. Geva, Unsupervised classification and adaptive definition
of sleep patterns, Pattern Recogn. Lett. 15: 977-984, 1994.
[22] A. B. Geva and H. Pratt, Unsupervised clustering of evoked potentials by waveform. Med.
Biol. Eng. Comput. 543-550, 1994.
[23] N. R. Pal and J. C. Bezdek, On cluster validity for the fuzzy c-means model. IEEE Trans.
Fuzzy Syst. 3(3): 370-379, 1995.
[24] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: Wiley-
Interscience, 1973.
[25] L. A. Zadeh, Fuzzy sets. Inform. Control 8: 338-353, 1965.
[26] I. Gath and A. B. Geva, Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal.
Machine Intell. 11: 773-781, 1989.
[27] D. Hume, An Enquiry Concerning Human Understanding, 1748.
[28] MatLab Reference Guide. The Math Works, 1992.
[29] A. Cohen, Biomedical Signal Processing, Boca Raton, FL: CRC Press, 1986.
[30] S. Mallat, A Wavelet Tour of Signal Processing, San Diego: Academic Press, 1998.
[31] A. Aldroubi and M. Unser, Wavelets in Medicine and Biology, Boca Raton, FL: CRC Press,
1996.
[32] M. Akay, Time-Frequency and Wavelets in Biomedical Signal Processing, New York: IEEE
Press, 1998.
[33] A. B. Geva, ScaleNet—MultiScale neural network architecture for time series prediction.
IEEE Trans. Neural Networks 9(6) 1471-1482, 1998.
[34] A. B. Geva and D. H. Kerem, Brain state identification and forecasting of acute pathology
using unsupervised fuzzy clustering of EEG temporal patterns. In Applications of Neuro-
Fuzzy Systems in Medicine and Bio-medical Engineering (BME), H.-N. Teodorescu, L. C.
Jain and A. Kandel, eds. Boca Raton, FL: CRC Press.
[35] S. G. Mallat and S. Zhong, Characterization of signals from multiscale edges. IEEE Trans.
PAMI 14: 710-732, 1992.
[36] A. B. Geva, H. Pratt, and Y. Y. Zeevi, Multichannel wavelet-type decomposition of evoked
potentials: Model-based recognition of generator activity. Med. Biol. Eng. Comput. 35(1):
40-46, 1997.
[37] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer
Academic Publisher, 1992.
[38] I. Gath and A. B. Geva, Fuzzy clustering for the estimation of the parameters of the
components of mixtures of normal distributions. Pattern Recogn. Lett. 9: 77-86, 1989.
[39] X. L. Xie and G. Beni, A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal.
Machine Intell. 13(8): 841-847, 1991.

Chapter 3
NEURAL NETWORKS: A GUIDED TOUR

Simon Haykin

1. SOME BASIC DEFINITIONS


A neural network is a massively parallel distributed processor that has a natural pro-
pensity for storing experiential knowledge and making it available for use. It resembles
the brain in two respects:
1. Knowledge is acquired by the network through a learning process.
2. Interconnection strengths known as synaptic weights are used to store the knowl-
edge.
Basically, learning is a process by which the free parameters (i.e., synaptic weights and
bias levels) of a neural network are adapted through a continuing process of stimulation
by the environment in which the network is embedded. The type of learning is deter-
mined by the manner in which the parameter changes take place. Specifically, learning
machines may be classified as follows:
• Learning with a teacher, also referred to as supervised learning
• Learning without a teacher
This second class of learning machines may also be subdivided into:
• Reinforcement learning
• Unsupervised learning or self-organized learning
In the subsequent sections of this chapter, we will describe the important aspects of
these learning machines and highlight the algorithms involved in their designs. For a
detailed treatment of the subject, see Haykin (1999) [1]: this book has an up-to-date
bibliography comprising 41 pages of references.

2. SUPERVISED LEARNING
This form of learning assumes the availability of a labeled (i.e., ground-truthed) set of
training data made up of N input-output examples:

T = \{ (x_i, d_i) \}_{i=1}^{N}    (1)

where x_i = input vector of the ith example
d_i = desired (target) response of the ith example, assumed to be scalar for
convenience of presentation
N = sample size.
Given the training sample T, the requirement is to compute the free parameters of the
neural network so that the actual output y_i of the neural network due to x_i is close
enough to d_i for all i in a statistical sense. For example, we may use the mean-squared
error

E = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2    (2)

as the index of performance to be minimized.

2.1. Multilayer Perceptrons and Back-Propagation Learning
The back-propagation (BP) algorithm has emerged as the workhorse for the design
of a special class of layered feed-forward networks known as multilayer perceptrons
(MLPs). As shown in Figure 1, a multilayer perceptron has an input layer of source
nodes and an output layer of neurons (i.e., computation nodes); these two layers con-
nect the network to the outside world. In addition to these two layers, the multilayer
perceptron usually has one or more layers of hidden neurons, which are so called
because these neurons are not directly reachable from the input end or from the output

Figure 1 Fully connected feed-forward or acyclic network with one hidden layer and
one output layer (an input layer of source nodes, a layer of hidden neurons, and a layer
of output neurons).

end. The hidden neurons play an important role: the extraction of important features
contained in the input data.
The training of an MLP is usually accomplished by using a BP algorithm that
involves two phases [2,3].
• Forward phase. During this phase the free parameters of the network are fixed,
and the input signal is propagated through the network of Figure 1 layer by
layer. The forward phase finishes with the computation of an error signal

e_i = d_i - y_i    (3)

where d_i is the desired response and y_i is the actual output produced by the
network in response to the input x_i.
• Backward phase. During this second phase, the error signal e_i is propagated
through the network of Figure 1 in the backward direction, hence the name of
the algorithm. It is during this phase that adjustments are applied to the free
parameters of the network so as to minimize the error e_i in a statistical sense.
Back-propagation learning may be implemented in one of two basic ways, as
summarized here:
1. Sequential mode (also referred to as the pattern mode, on-line mode, or stochastic
mode): In this mode of BP learning, adjustments are made to the free parameters
of the network on an example-by-example basis. The sequential mode is best suited
for pattern classification.
2. Batch mode: In this second mode of BP learning, adjustments are made to the free
parameters of the network on an epoch-by-epoch basis, where each epoch consists
of the entire set of training examples. The batch mode is best suited for nonlinear
regression.
The back-propagation learning algorithm is simple to implement and computationally
efficient in that its complexity is linear in the synaptic weights of the network. However,
a major limitation of the algorithm is that it can be excruciatingly slow, particularly
when we have to deal with a difficult learning task that requires the use of a large
network.
We may try to make back-propagation learning perform better by invoking the
following list of heuristics:
• Use neurons with antisymmetric activation functions (e.g., the hyperbolic tangent
function) in preference to nonsymmetric activation functions (e.g., the logistic func-
tion). Figure 2 shows examples of these two forms of activation functions.
• Shuffle the training examples after the presentation of each epoch; an epoch
involves the presentation of the entire set of training examples to the network.
• Follow an easy-to-learn example with a difficult one.
• Preprocess the input data so as to remove the mean and decorrelate the data.
• Arrange for the neurons in the different layers to learn at essentially the same
rate. This may be attained by assigning a learning-rate parameter to neurons in
the last layers that is smaller than those at the front end.
Figure 2 (a) Antisymmetric activation function. (b) Nonsymmetric activation function.

• Incorporate prior information into the network design whenever it is available.


One other heuristic that deserves to be mentioned is related to the size of the
training set, N, for a pattern classification task. Given a multilayer perceptron with a
total number of synaptic weights including bias levels, denoted by W, a rule of thumb
for selecting N is

N = O\!\left( \frac{W}{\epsilon} \right)    (4)

where O denotes "the order of" and \epsilon denotes the fraction of classification errors
permitted on test data. For example, with an error of 10% the number of training
examples needed should be about 10 times the number of synaptic weights in the net-
work.
Supposing that we have chosen a multilayer perceptron to be trained with the
back-propagation algorithm, how do we determine when it is "best" to stop the training
session? How do we select the size of individual hidden layers of the MLP? The
answers to these important questions may be obtained through the use of a statistical
technique known as cross-validation, which proceeds as follows:
• The set of training examples is split into two parts:
• Estimation subset used for training of the model
• Validation subset used for evaluating the model performance
• The network is finally tuned by using the entire set of training examples and
then tested on test data not seen before.

2.2. Radial Basis Function (RBF) Networks


Another popular layered feed-forward network is the radial basis function (RBF)
network, whose structure is shown in Figure 3. RBF networks use memory-based
learning for their design. Specifically, learning is viewed as a curve-fitting problem in
high-dimensional space [4,5].

1. Learning is equivalent to finding a surface in a multidimensional space that pro-


vides a best fit to the training data.
2. Generalization (i.e., response of the network to input data not seen before) is
equivalent to the use of this multidimensional surface to interpolate the test data.

RBF networks differ from multilayer perceptrons in some fundamental respects:

• RBF networks are local approximators, whereas multilayer perceptrons are


global approximators.

Figure 3 Radial basis function network (an input layer, a hidden layer of m_1 radial
basis functions, and an output layer).

• RBF networks have a single hidden layer, whereas multilayer perceptrons can
have any number of hidden layers.
• The output layer of an RBF network is always linear, whereas in a multilayer
perceptron it can be linear or nonlinear.
• The activation function of the hidden layer in an RBF network computes the
Euclidean distance between the input signal vector and a parameter vector of
the network, whereas the activation function of a multilayer perceptron com-
putes the inner product between the input signal vector and the pertinent synap-
tic weight vector.
The use of a linear output layer in an RBF network may be justified in light of
Cover's theorem on the separability of patterns. According to this theorem, provided
that the transformation from the input space to the feature (hidden) space is nonlinear
and the dimensionality of the feature space is high compared to that of the input (data)
space, then there is a high likelihood that a nonseparable pattern classification task in
the input space is transformed into a linearly separable one in the feature space.
Design methods for RBF networks include the following:
1. Random selection of fixed centers [4] (see the sketch after this list)
2. Self-organized selection of centers [6]
3. Supervised selection of centers [5]
4. Regularized interpolation exploiting the connection between an RBF network and
the Watson-Nadaraya regression kernel [7]
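As an illustration of the first design method, the following Python sketch builds a Gaussian RBF network with randomly selected fixed centers and solves for the linear output-layer weights by least squares. The center count, basis width, and function name are arbitrary choices made here for illustration:

import numpy as np

def fit_rbf(X, d, n_centers=10, width=1.0, rng=np.random.default_rng(0)):
    # Gaussian RBF network with fixed, randomly chosen centers (design method 1).
    # X: (N, n_features) training inputs, d: (N,) desired outputs.
    centers = X[rng.choice(len(X), n_centers, replace=False)]     # fixed hidden-layer centers

    def hidden(Xq):
        # hidden units respond to the Euclidean distance from each center
        d2 = ((Xq[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * width ** 2))

    Phi = np.column_stack([hidden(X), np.ones(len(X))])           # hidden outputs plus bias
    wout, *_ = np.linalg.lstsq(Phi, d, rcond=None)                # linear output weights
    return lambda Xq: np.column_stack([hidden(Xq), np.ones(len(Xq))]) @ wout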

2.3. Support Vector Machines


Support vector machines (SVM) theory provides the most principled approach to the
design of neural networks, eliminating the need for domain knowledge [8]. SVM theory
applies to pattern classification, regression, or density estimation using an RBF network
(depicted in Figure 3) or an MLP with a single hidden layer (depicted in Figure 1).
Unlike the case of back-propagation learning, different cost functions are used for
pattern classification and regression. Most important, the use of SVM learning elim-
inates the problem of how to select the size of the hidden layer in an MLP or RBF
network. In the latter case, it also eliminates the problem of how to specify the centers
of the RBF units in the hidden layer.
Simply stated, support vectors are the data points (for the linearly separable case)
that are the most difficult to classify and optimally separated from each other.
In a support vector machine, the selection of basis functions is required to satisfy
Mercer's theorem: that is, each basis function is in the form of a positive definite inner-
product kernel:

k(x_i, x_j) = \varphi^T(x_i) \, \varphi(x_j)    (5)

where x_i and x_j are input vectors for examples i and j, and \varphi(x_i) is the vector of hidden-
unit outputs for input x_i.
sionality so as to transform a nonlinear separable pattern classification problem into a
linearly separable one. Most important, however, in a pattern-classification task, for

example, the support vectors are selected by the SVM learning algorithm so as to
maximize the margin of separation between classes.
The curse-of-dimensionality problem, which can plague the design of multilayer
perceptrons and RBF networks, is avoided in support vector machines through the use
of quadratic programming. This technique, based directly on the input data, is used to
solve for the linear weights of the output layer [8].
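For illustration, one commonly used kernel satisfying Mercer's theorem is the Gaussian kernel. The sketch below computes the Gram matrix of such inner-product kernels for a set of input vectors; the kernel choice and width are assumptions made here, not prescriptions from the text:

import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # A positive definite inner-product kernel k(x_i, x_j) = phi(x_i)^T phi(x_j) (Eq. 5);
    # the Gaussian kernel is one common choice that satisfies Mercer's theorem.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# The Gram matrix K[i, j] = k(x_i, x_j), e.g. gaussian_kernel(X_train, X_train),
# is what the quadratic program operates on when it selects the support vectors
# and solves for the linear weights of the output layer.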

3. UNSUPERVISED LEARNING

Turning next to unsupervised learning, adjustment of synaptic weights may be carried
out through the use of neurobiological principles such as Hebbian learning and competitive
learning or information-theoretic principles. In this section we will describe specific
applications of these three approaches.
3.1. Principal Components Analysis
According to Hebb's postulate of learning, the change in synaptic weight \Delta w_{ji} of a
neural network is defined by

\Delta w_{ji} = \eta \, x_i \, y_j    (6)

where \eta = learning-rate parameter
x_i = input (presynaptic) signal
y_j = output (postsynaptic) signal
Principal component analysis (PCA) networks use a modified form of this self-orga-
nized learning rule. To begin with, consider a linear neuron designed to operate as a
maximum eigenfilter; such a neuron is referred to as Oja's neuron [9]. It is characterized
as follows:

\Delta w_{ji} = \eta \, y_j (x_i - y_j w_{ji})    (7)

where the term -\eta y_j^2 w_{ji} is added to stabilize the learning process. As the number of
iterations approaches infinity, we find the following:
1. The synaptic weight vector of neuron j approaches the eigenvector associated with
the largest eigenvalue \lambda_{max} of the correlation matrix of the input vector (assumed
to be of zero mean).
2. The variance of the output of neuron j approaches the largest eigenvalue \lambda_{max}.
The generalized Hebbian algorithm (GHA), due to Sanger [10], is a straightforward
generalization of Oja's neuron for the extraction of any desired number of principal
components.
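A minimal sketch of Oja's learning rule (Eq. 7) for extracting the first principal component is given below; the learning rate and epoch count are illustrative choices:

import numpy as np

def oja_first_component(X, eta=0.01, epochs=50, rng=np.random.default_rng(0)):
    # Oja's neuron (Eq. 7): the weight vector converges toward the eigenvector of the
    # (zero-mean) input correlation matrix associated with the largest eigenvalue.
    X = X - X.mean(axis=0)                 # enforce zero-mean inputs
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                      # linear neuron output
            w += eta * y * (x - y * w)     # Hebbian term plus the stabilizing -eta*y^2*w term
    return w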
3.2. Self-Organizing Maps
In a self-organizing map (SOM), due to Kohonen (1997), the neurons are placed at
the nodes of a lattice, and they become selectively tuned to various input patterns
(vectors) in the course of a competitive learning process. The process is characterized
by the formation of a topographic map in which the spatial locations (i.e., coordinates)

Figure 4 Illustration of the relationship between the feature map \varphi and the weight
vector w_i of winning neuron i.

of the neurons in the lattice correspond to intrinsic features of the input patterns.
Figure 4 illustrates the basic idea of a self-organizing map, assuming the use of a
two-dimensional lattice of neurons as the network structure.
In reality, the SOM belongs to the class of vector coding algorithms [11]. That is, a
fixed number of code words are placed into a higher dimensional input space, thereby
facilitating data compression.
An integral feature of the SOM algorithm is the neighborhood function centered
around a neuron that wins the competitive process. The neighborhood function starts
by enclosing the entire lattice initially and is then allowed to shrink gradually until it
encompasses the winning neuron.
The algorithm exhibits two distinct phases in its operation:
1. Ordering phase, during which the topological ordering of the weight vectors takes
place
2. Convergence phase, during which the computational map is fine tuned
The SOM algorithm exhibits the following properties:

1. Approximation of the continuous input space by the weight vectors of the discrete
lattice.
2. Topological ordering exemplified by the fact that the spatial location of a neuron
in the lattice corresponds to a particular feature of the input pattern.
3. The feature map computed by the algorithm reflects variations in the statistics of
the input distribution.
4. SOM may be viewed as a nonlinear form of principal components analysis.
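The competitive update with a shrinking neighborhood can be sketched as follows. This is a toy implementation; the exponential decay schedules chosen for the learning rate and neighborhood radius are assumptions, since several schedules are in common use:

import numpy as np

def train_som(X, grid=(8, 8), epochs=20, eta0=0.5, sigma0=3.0,
              rng=np.random.default_rng(0)):
    # Minimal SOM sketch: competitive selection of a winning neuron, followed by a
    # neighborhood-weighted update whose radius and step size shrink over time.
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = rng.normal(size=(rows * cols, X.shape[1]))          # one weight vector per lattice node
    T = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(X):
            winner = np.argmin(((W - x) ** 2).sum(axis=1))   # competitive process
            eta = eta0 * np.exp(-t / T)                      # decaying learning rate
            sigma = sigma0 * np.exp(-t / T)                  # shrinking neighborhood radius
            d2 = ((coords - coords[winner]) ** 2).sum(axis=1)  # lattice distances to winner
            h = np.exp(-d2 / (2.0 * sigma ** 2))             # neighborhood function
            W += eta * h[:, None] * (x - W)                  # move nodes toward the input
            t += 1
    return W.reshape(rows, cols, -1)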

3.3. Information-Theoretic Models


Mutual information, defined in accordance with Shannon's information theory,
provides the basis of a powerful approach for self-organized learning. The theory is

embodied in the maximum mutual information (Infomax) principle, due to Linsker


[12], which may be stated as follows:
The transformation of a random vector X observed in the input layer of a neural network to a random
vector Y produced in the output layer should be so chosen that the activities of the neurons in the
output layer jointly maximize information about the activities in the input layer. The objective function
to be maximized is the mutual information I(Y; X) between X and Y.

The Infomax principle finds applications in the following areas:


• Design of self-organized models and feature maps [12].
• Discovery of properties of a noisy sensory input exhibiting coherence across
both space and time (first variant of Infomax due to Becker and Hinton
[13]).
• Dual image processing designed to maximize the spatial differentiation between
the corresponding regions of two separate images (views) of an environment of
interest as in radar polarimetry (second variant of Infomax due to Ukrainec and
Haykin [14]).
• Independent components analysis (ICA) for blind source separation (due to
Barlow [15]); see also Comon [16]. ICA may be viewed as the third variant of
Infomax [1].

4. NEURODYNAMIC PROGRAMMING
Supervised learning is a cognitive learning problem performed under the tutelage of a
teacher. It requires the availability of input-output examples representative of the
environment.
Reinforcement learning, on the other hand, is a behavioral learning problem [17].
It is performed through the interaction of a learning system with its environment. The
need for a teacher is eliminated by virtue of this interactive process.
Basically, neurodynamic programming is the modern approach to reinforcement
learning, building on Bellman's classic work on dynamic programming [18]. For a
formal definition of neurodynamic programming, we offer the following:
• Neurodynamic programming enables a system to learn how to make good
decisions by observing its own behavior and to improve its actions by using a
built-in mechanism through reinforcement.
Neurodynamic programming incorporates two primary ingredients:
1. The theoretical foundation provided by dynamic programming
2. The learning capabilities provided by neural networks as function approximators
An important feature of neurodynamic programming is that it solves the credit
assignment problem by assigning credit or blame to each one of a set of interacting
decisions in a principled manner. The credit assignment problem is also referred to as
the loading problem, the problem of loading a given set of training data into the free
parameters of the network.

Neurodynamic programming is a natural tool for solving planning tasks. For


optimal planning it is necessary to have an efficient trade-off between immediate and
future costs: How can a system learn to improve long-term performance when this
improvement may require sacrificing short-term performance? In particular, it can
provide an elegant solution to this important problem that arises in highly diverse fields
(e.g., backgammon; dynamic allocation of resources in a mobile communications envir-
onment).

5. TEMPORAL PROCESSING USING FEED-FORWARD NETWORKS

Time is an essential dimension of learning. We may incorporate time into the design of
a neural network implicitly or explicitly. A straightforward method of implicit repre-
sentation of time is to add a short-term memory structure at the input end of a static
neural network (e.g., multilayer perceptron), as illustrated in Figure 5. This configura-
tion is called a focused time-lagged feed-forward network (TLFN). Focused TLFNs are
limited to stationary dynamical processes.
To deal with nonstationary dynamical processes, we may use distributed TLFNs
where the effect of time is distributed at the synaptic level throughout the network. One
way in which this may be accomplished is to use finite-duration impulse response (FIR)
filters to implement the synaptic connections of an MLP; Figure 6 shows an FIR model
of a synapse. The training of a distributed TLFN is naturally a more difficult proposi-
tion than the training of a focused TLFN. Whereas we may use the ordinary back-
propagation algorithm to train a focused TLFN, we have to extend the back-propaga-
tion algorithm to cope with the replacement of a synaptic weight in the ordinary MLP
by a synaptic weight vector. This extension is referred to as the temporal back-propa-
gation algorithm, due to Wan [19].
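The FIR synapse of Figure 6 simply convolves the incoming signal with its synaptic weight vector, as the following sketch illustrates (hypothetical names):

import numpy as np

def fir_synapse(x, w):
    # FIR model of a synapse (Figure 6): the synaptic output at time n is
    # y(n) = sum_k w(k) * x(n - k), a finite-duration convolution of the input
    # signal with the synaptic weight vector (zero initial conditions assumed).
    return np.convolve(x, w)[:len(x)]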

Figure 5 Focused time-lagged feed-forward network (TLFN); the bias levels have
been omitted for convenience of presentation.

Figure 6 Finite-duration impulse response (FIR) filter model of a synapse:
y_j(n) = \sum_{k} w_{ji}(k) \, x_i(n - k).

6. DYNAMICALLY DRIVEN RECURRENT NETWORKS

Another practical way to account for time in a neural network is to employ feedback at
the local or global level. Neural networks so configured are referred to as recurrent
networks.
We may identify two classes of recurrent networks:
1. Autonomous recurrent networks exemplified by the Hopfield network [20] and
brain-state-in-a-box (BSB) model. These networks are well suited for building
associative memories, each with its own domain of applications. Figure 7 shows
an example of a Hopfield network involving the use of four neurons.
2. Dynamically driven recurrent networks, which are well suited for input-output
mapping functions that are temporal in character.
Dynamically driven recurrent network architectures include the following:
1. Input-output recurrent model, commonly referred to as a nonlinear autoregressive
with exogenous inputs (NARX) model. Figure 8 shows an example of this net-
work.
2. State-space model, illustrated in Figure 9.
3. Recurrent multilayer perceptron, illustrated in Figure 10.
4. Second-order network, illustrated in Figure 11.
The first three configurations build on the state-space approach of modern control
theory. Second-order networks use second-order neurons, where the induced local
field (activation potential) of each neuron is defined by

v_k = \sum_{i} \sum_{j} w_{kij} \, x_i \, u_j    (8)

where w_{kij} denotes a weight
x_i denotes a feedback signal derived from neuron i
u_j denotes a source signal.
Second-order networks (due to Giles and collaborators [21]) are well suited for deter-
ministic finite-state automata.

Figure 7 Recurrent network with no self-feedback loops and no hidden neurons; the
z^{-1} blocks are unit-delay operators.

To design a dynamically driven recurrent network, we may use any one of the
following approaches:
• Back-propagation through time (BPTT), which involves unfolding the temporal
operation of the recurrent network into a layered feedforward network [22].
This unfolding facilitates the application of the ordinary back-propagation
algorithm.
• Real-time recurrent learning, in which adjustments are made (using a gradient
descent method) to the synaptic weights of a fully connected recurrent network
in real time [23].
• Extended Kalman filter (EKF), which builds on the classic Kalman filter theory
to compute the synaptic weights of the recurrent network. Two versions of the
algorithm are available [24]:
• Decoupled EKF
• Global EKF
The decoupled EKF algorithm is computationally less demanding but somewhat less
accurate than the global EKF algorithm.
A serious problem that can arise in the design of a dynamically driven recurrent
network is the vanishing gradients problem. This problem pertains to the training of a
recurrent network to produce a desired response at the current time that depends on
input data in the distant past [25]. It makes the learning of long-term dependences in

Figure 8 Nonlinear autoregressive with exogenous inputs (NARX) model.

Figure 9 Simple recurrent network (SRN): a state-space model built around a multilayer
perceptron with a single hidden layer, context units, and a bank of unit delays.
Figure 10 Recurrent multilayer perceptron (a multilayer perceptron with multiple
hidden layers and banks of unit delays).

Figure 11 Second-order recurrent network; bias connections to the neurons are
omitted to simplify the presentation. The network has two inputs and
three state neurons, hence the need for 3 × 2 = 6 multipliers.

gradient-based training algorithms difficult if not impossible in certain cases. To over-


come the problem, we may use the following methods:
1. Extended Kalman filter (encompassing second-order information) for training
2. Elaborate optimization methods such as pseudo-Newton and simulated annealing
[25]
3. Use of long time delays in the network architecture [26]
4. Hierarchical structuring of the network in multiple levels associated with different
time scales [27].
5. Use of gating units to circumvent some of the nonlinearities [28].

7. CONCLUDING REMARKS

Neural networks constitute a multidisciplinary subject rooted in the neurosciences,


psychology, statistical physics, statistics and mathematics, computer science, and
engineering. Neural networks are endowed with the ability to learn from examples
with or without a teacher. Moreover, they can approximate any continuous input-
output mapping function and can be designed to be fault tolerant with respect to
component failures. By virtue of these important properties, neural networks find
applications in such diverse fields as model building, time series analysis, financial
forecasting, signal processing, pattern classification, and control. Through a seamless
integration of neural networks with other complementary smart technologies such as
symbolic processors and fuzzy systems, we have the basis of a powerful hybrid
approach for building intelligent machines that represent the ultimate in information
processing by artificial means. For detailed expositions of the different aspects of
neural networks described in this chapter, the reader is referred to the book by
Haykin [1].

REFERENCES

[1] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ:
Prentice Hall, 1999.
[2] P. J. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Ph.D. thesis, Harvard University, 1974.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by
error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D.
E. Rumelhart and J. L. McClelland, eds., Vol. 1, Chap. 8. Cambridge, MA: MIT Press, 1986.
[4] D. S. Broomhead and D. Lowe, Multivariable functional interpolation and adaptive net-
works. Complex Syst. 2: 321-355, 1988.
[5] T. Poggio and F. Girosi, Networks for approximation and learning. Proc. IEEE, 78: 1481-
1497, 1990.
[6] J. E. Moody and C. J. Darken, Fast learning in networks of locally-tuned processing units.
Neural Comput. 1: 281-294, 1989.
[7] P. V. Yee, Regularized radial basis function networks: Theory and applications to prob-
ability estimation, classification, and time series prediction. Ph.D. thesis, McMaster
University, 1998.
68 Chapter 3 Neural Networks: A Guided Tour

[8] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.


[9] E. Oja, A simplified neuron model as a principal component analyzer. J. Math. Biol. 15:
267-273, 1982.
[10] T. D. Sanger, An optimality principle for unsupervised learning. Adv. Neural Inform.
Process. Syst. 1: 11-19, 1989.
[11] S. P. Luttrell, Self-organization: A derivation from first principle of a class of learning
algorithms. IEEE Conference on Neural Networks, pp. 495-498, Washington, DC, 1989.
[12] R. Linsker, Towards an organizing principle for a layered perceptual network. In Neural
Information Processing Systems, D. Z. Anderson, ed., pp. 485-494. New York: American
Institute of Physics, 1988.
[13] S. Becker and G. E. Hinton, A self-organizing neural network that discovers surfaces in
random-dot stereograms. Nature 355: 161-163, 1992.
[14] A. M. Ukrainec and S. Haykin, A modular neural network for enhancement of cross-polar
radar targets. Neural Networks 9: 143-168, 1996.
[15] H. B. Barlow, Unsupervised learning. Neural Comput. 1: 295-311, 1989.
[16] P. Comon, Independent component analysis: A new concept? Signal Process. 36: 287-314,
1994.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA:
MIT Press, 1998.
[18] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena
Scientific, 1996.
[19] E. A. Wan, Time series prediction by using a connectionist network with internal delay lines.
In Time Series Prediction: Forecasting the Future and Understanding the Past, A. S. Weigend
and N. A. Gershenfield, eds., pp. 195-217. Reading, MA: Addison-Wesley, 1994.
[20] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci. USA 79: 2554-2558, 1982.
[21] C. L. Giles, C. B. Miller, D. Chen, H. H. Chen, G. Z. Sun, and Y. C. Lee, Learning and
extracting finite state automata with second-order recurrent neural networks. Neural
Comput. 4: 393-405, 1992.
[22] P. J. Werbos, Backpropagation through time: What it does and how to do it. Proc. IEEE 78:
1550-1560, 1990.
[23] R. J. Williams and D. Zipser, A learning algorithm for continually running fully recurrent
neural networks. Neural Comput. 1: 270-280, 1989.
[24] L. A. Feldkamp and G. V. Puskorius, A signal processing framework based on dynamic
neural networks with application to problems in adaptation, filtering and classification.
Special issue of Proceedings of the IEEE on Intelligent Signal Processing, vol. 86,
November 1998.
[25] Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient
descent is difficult. IEEE Trans. Neural Networks 5: 157-166, 1994.
[26] C. L. Giles, T. Lin, and B. G. Horne, Remembering the past: The role of embedded memory
in recurrent neural network architectures. Neural Networks for Signal Processing, VII,
Proceedings of the 1997 IEEE Workshop, p. 34. New York: IEEE Press, 1997.
[27] S. El Hihi and Y. Bengio, Hierarchical recurrent neural networks for long-term depen-
dencies. Adv. Neural Inform. Process. Syst. 8: 493-499, 1996.
[28] S. Hochreiter and J. Schmidhuber, LSTM can solve hard long time lag problems. Adv.
Neural Inform. Process. Syst. 9: 473-479, 1997.
[29] J. A. Anderson, Introduction to Neural Networks. Cambridge, MA: MIT Press, 1995.

Chapter 4
NEURAL NETWORKS IN PROCESSING AND ANALYSIS OF BIOMEDICAL SIGNALS

Homayoun Nazeran and Khosrow Behbehani

Just because the brain looks like a mess of porridge doesn't mean it's a cereal computer.
Michael Arbib

Today there is a tremendous amount of interest in and excitement about artificial neural networks (ANNs), also known as connectionist models, parallel distributed
processing models, and neuromorphic systems. Initially inspired by biological nervous
systems, the development of ANNs has been motivated by their applicability to certain
types of problems and their potential for parallel implementation. ANNs have been
studied and developed for many years in the hope of achieving brainlike performance in
signal processing and pattern recognition. The fact that the brain exists, learns, remem-
bers, and thinks is an existence proof that the ultimate objective is achievable.
People and animals outperform most advanced computers when it comes to tasks
such as speech and image recognition. Although computers are much faster and better
than biological neural systems for tasks based on precise and fast arithmetic operations,
ANNs represent the promising new generation of information processing systems.
ANNs can supplement the enormous processing power of digital computers in solving
problems intractable or difficult for traditional computation.
This chapter introduces the elements and attributes of ANNs and their architecture
and gives a brief history of the development of neural nets. It looks at some perfor-
mance and operational issues. It reviews the multilayer perceptron (MLP) topology
with the back-propagation learning algorithm and Kohonen's self-organizing feature
maps algorithm and finally focuses on some of the most recent applications of ANNs in
biomedical signal processing.
The goal of this chapter is not to review the extensive field of ANNs (for excellent
and comprehensive treatments see Haykin [1]). The objective is to offer an overview,
which gives the reader the background necessary to understand the overall basics of
ANNs and their applicability to biomedical signal processing and pattern recognition in
general and patient monitoring in particular.

1. OVERVIEW AND HISTORY OF ARTIFICIAL NEURAL NETWORKS

To understand the importance, current state, and future trends for ANNs, it is helpful
to review briefly the basic structure of ANNs and their development history.

1.1. What Is an Artificial Neural Network?


Artificial neural networks are simply mapping functions that map values of one or
more inputs to single or multiple outputs. In this sense, they are similar to ordinary
transfer functions that map inputs to outputs. However, they are distinguished from
ordinary transfer functions because their structures provide a great degree of flexibility
in mapping. Specifically, one does not need to know the form of the functions govern-
ing the inputs to an ANN to map them to the desired output. In addition, ANNs can
adapt to changes in the input and output. That is, if the inputs change, the elements of
the ANN can be adjusted to continue to map the new inputs to the same output and
vice versa. This feature of ANNs creates a great potential for their application in signal
processing and pattern recognition.
Structurally, ANNs have a layout similar to the human neural system. That is, they
are composed of interconnected computational cells that are connected to other iden-
tical computational cells in a fashion similar to the synaptic connection of neurons to
other neurons. Due to this similarity, the computational cells are often called neurons;
alternatively, they are also referred to as processing elements (PEs). Figure 1 illustrates
a simple PE with n inputs that are connected to it in a manner analogous to biological
synaptic connections. The weights, which are shown as w's, represent adjustable weights
that affect the impact of each input on the PE's output.
Specifically, the PE maps the inputs, $x_0$ through $x_{n-1}$, to the output $y$. This mapping has two steps. In the first step, the inputs to the neuron are summed to obtain $a$ as shown in Eq. 1:

$$a = \sum_{i=0}^{n-1} w_i x_i \qquad (1)$$

where $x_i$ and $w_i$ reflect the magnitude of the inputs and their associated weights (for $i = 0, 1, \ldots, n-1$), respectively. The PE output $y$ is computed as a function of $a$ and an offset or threshold value $\theta$ as

$$y = f(a - \theta) \qquad (2)$$

Figure 1 An example of a simple artificial neuron consisting of a single processing element with synaptic weights and connections.

where f is called an activation function, which is generally nonlinear. Three common forms of function f are shown in Figure 2. The functions shown are the hard limiter, threshold logic element, and sigmoidal function. More complex functions that include temporal integration or other types of time dependences can be used [1].
In general, a PE unit can have n inputs and one output, and a network of identical PEs interconnected in a series of layers, encompassing an input and an output layer, forms an artificial neural network. Figure 3 shows a three-layer network. This network is called a multilayer perceptron (MLP). There are many possible combinations of layers and pathways to interconnect the neurons in each node to other neurons. The equation for the output of the neuron in Figure 1, combining Eqs. 1 and 2, is as follows:

$$y = f\!\left(\sum_{i=0}^{n-1} w_i x_i - \theta\right) \qquad (3)$$

Figure 2 Three examples of commonly used nonlinearities in construction of artificial neural networks: hard limiter, threshold logic, and sigmoid.

Figure 3 A multilayer perceptron.



The network topology or the neuron interconnection scheme is a major design choice. In some networks each PE in one layer receives input from every PE in the
previous layer and sends its outputs to every PE in the subsequent layer. Some network
architectures, however, have interconnections among PEs within a layer, and feedback
architectures even allow connections to PEs in previous layers. It is beyond the scope of
this chapter to cover all common forms of ANNs. See Haykin [1] for excellent coverage.

1.2. How Did ANNs Come About?


ANNs have an interesting history; an abridged outline is presented here.
McCulloch and Pitts (1943) outlined the first formal model of an elementary computing
neuron. Hebb (1949) first proposed a learning scheme for updating neurons' connec-
tions. This is now referred to as the Hebbian learning rule. This rule states that infor-
mation can be stored in connections. Hebb's learning rule made significant initial
contributions to neural networks theory. During the 1950s, the first neurocomputers
were built and tested by Minsky (1954). In Minsky's model, connections adapted
automatically. Perceptron, a neuronlike element, was invented by Rosenblatt (1958).
It was a trainable computing model capable of learning to classify certain patterns by
modifying connections to the threshold elements. Perceptron laid the groundwork for
the development of basic neural net learning algorithms even used today.
A useful network called ADALINE (ADAptive LINEar combiner) was introduced
by Widrow (1960) along with a powerful learning rule called the Widrow-Hoff learning
rule. The rule minimized the summed square error during training involving pattern
classification. Early applications of ADALINE, and of networks of many ADALINEs called MADALINEs, included pattern recognition, weather forecasting, and adaptive control.
The mathematical framework for a new training scheme of layered networks was
discovered by Werbos (1974). This scheme is called the back-propagation algorithm.
This work went largely unnoticed at that time.
The development of ANNs underwent a decline in the late 1960s following the
work of Minsky and Papert (1969), which showed that perceptron class networks
designed at that time were unable to solve relatively trivial problems. During 1965-1984, further work was accomplished by a handful of researchers. In Japan, Amari (1972, 1977) pursued the study of learning in networks of threshold elements and
developed the mathematical theory of neural networks. In addition, Fukushima devel-
oped a class of neural network topologies known as neocognitrons. The neocognitron is
a model for visual pattern recognition. This network emulates the retinal images and
processes them using two-dimensional layers of neurons.
Kohonen in Finland (1977, 1982, 1984, and 1988) and Anderson in the United
States (1977), among others, developed associative memory networks. Kohonen devel-
oped unsupervised learning networks for feature mapping into regular arrays of neu-
rons (1982). Grossberg and Carpenter (1974, 1982) introduced a number of neural
architectures and theories and developed the theory of adaptive resonance networks.
In 1982, Hopfield introduced his recurrent neural network architecture for asso-
ciative memories. He formulated the computational properties of a fully connected
network of processing elements. The discovery of successful extension of neural net-
works knowledge had to wait until 1986, when Rumelhart et al. developed new learning
paradigms. The work of Rumelhart and McClelland (1986) revitalized the field of
ANNs. The new learning rules and other concepts introduced in their work removed
one of the most significant hurdles that had stalled the development of the ANN field. In
1986-1987, many neural network research programs were initiated and the field gained
tremendous momentum.

1.3. Attributes of ANNs


The human brain has billions of neurons, each of which is connected to thousands
of other neurons. Each neuron receives information from other neurons, performs a
simple operation on it, and passes the result on to other neurons. The brain, unlike a
digital computer, is not programmed but instead learns. Although it is still unclear how
learning takes place, there is agreement that it is the result of changes in the properties
of the synaptic connections between the neurons.
As they are inspired by the physioanatomy of the brain, the most important
attribute of ANNs is their ability to learn from examples. This means that appropriate
classifications of the input to the ANN may be possible even for input patterns
not included in the training set. This is predicated on the assumption that the training
set covers a complete representative group of patterns. This ability to learn and general-
ize means that ANNs have the potential to solve pattern recognition problems that
are intractable using rule-based conventional classifiers (i.e., statistical pattern recogni-
tion methods). That is not to say that ANNs are superior to statistical pattern
recognition methods. Instead, ANNs are complementary to many powerful techniques
that conventional classifiers offer.
In statistical pattern recognition methods, assumptions are typically made con-
cerning the probability density function of the input data. ANN models make no
assumptions about the underlying probability density functions of the input data,
thus possibly improving the performance of classifiers, especially when the data depart
significantly from normality. Other attractive features of ANNs are:
• Adaptation or learning
• Pursuing multiple hypotheses in parallel
• Fault tolerance
• Processing degraded or incomplete data
• Performing transformations
However, it is important to note that any classifier, regardless of method of implemen-
tation, acts merely to partition the feature space into regions corresponding to each
class and assigns patterns accordingly. In addition, performance of a classifier is usually
limited by the overlap of the classes in feature space. These combine with the practical
difficulty of obtaining a representative training set and using it to establish optimum
partition surfaces, to set the ground rules for developing a classifier.
The advantages offered by the neural nets approach compared with statistical
pattern recognition methods are:
• Less knowledge about the problem
• More complex partitioning of feature space
• High-performance parallel processing implementation
It is also argued that ANNs have the potential to approach the awesome capabilities of
the human brain in pattern recognition problems. However, neural net implementations
tested to date generally tend to approach the level of performance of well-designed statistical classifiers.
The disadvantages of neural net solutions compared with the statistical approaches
include:
• Extensive amount of training
• Slower operation when simulated on a conventional computer
• Unavailability of a detailed understanding of the decision-making process
A neural net will excel only when it has been trained to carve up the feature space
better than a comparable statistical classifier can do. Even so, its performance is fun-
damentally constrained by overlap of the classes.

1.4. Learning in ANNs


Learning is necessary when the information about inputs/outputs is unknown or
incomplete a priori, so that no design of a network can be performed in advance. The
learning process in the majority of ANNs can be categorized as supervised, unsuper-
vised, and hybrid.

1.4.1. Supervised Learning

In supervised learning, ANNs require a training data set of examples (training patterns or input vectors) and their corresponding outputs (target vectors). The net-
work weights, which store the learned patterns, are initially set to random values,
usually in the range of -0.5 to +0.5. In addition, the input vectors are commonly
scaled so that the minimum and maximum values for any component are 0.1 to 0.9.
During the network training, a pattern is randomly selected from the training set, and
the input vector is propagated through the network. The output generated by the net-
work is compared to the target vector, and the connection weights between the hidden
and output layers are adjusted to make the output vector match the target vector as
closely as possible. Next, the weights for the hidden layer immediately below the output
layer are similarly adjusted.
Perhaps the most common form of supervised learning rule is the back-propaga-
tion algorithm (Rumelhart et al., 1986). The back-propagation algorithm is presented in
the Appendix to this chapter, as many of the applications in biomedical signal proces-
sing reviewed here make use of this learning technique. ANNs using back-propagation
apply a sigmoid transfer function rather than simpler threshold-type functions, thus
providing two important properties. First, because the sigmoid is nonlinear, it allows
the network to perform complex mappings of inputs to output vector spaces, and
second, because it is continuous and differentiable, it provides a closed form for updat-
ing the connection weights. Using the back-propagation learning rule, it is possible to
train multilayer perceptrons and perform complex nonlinear mapping between the
input and output domains. This is in contrast to simpler linear mappings with percep-
tron networks.
When back-propagation is applied, the training pattern is presented to the network
and the network output is calculated using existing network weights. An error term,
based on the difference between the calculated and the desired output, is then propa-
gated back through the network to calculate changes of the interconnection weights
between layers. By repeatedly presenting the network with training patterns and chan-
ging the weights to minimize the error term, the network is trained to give the desired
output for a given class of input. Weight changes can be carried out after each example
or after each presentation of the complete training set, sometimes called cumulative
back-propagation. By plotting the error term against the number of presentations, a
"learning curve" can be produced, which ideally converges to zero as the number of
presentations increases (for more details on back-propagation please see the Appendix).
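The training procedure just described can be condensed into a short sketch. Below is a minimal per-pattern back-propagation loop for a one-hidden-layer MLP with sigmoid units; the network size, learning rate, number of epochs, and the XOR-style toy data are placeholders rather than anything used in the chapter, and bias (threshold) terms are handled by appending a constant input of 1.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_backprop(X, T, n_hidden=4, lr=0.5, epochs=5000, seed=0):
    """Minimal back-propagation for a one-hidden-layer MLP (per-pattern updates)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])        # constant 1 appended as bias input
    n_in, n_out = Xb.shape[1], T.shape[1]
    # Weights initialised to small random values (e.g., -0.5 to +0.5).
    W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
    W2 = rng.uniform(-0.5, 0.5, (n_hidden + 1, n_out))
    learning_curve = []
    for _ in range(epochs):
        sse = 0.0
        for i in rng.permutation(len(Xb)):           # patterns presented in random order
            h = sigmoid(Xb[i] @ W1)                  # hidden-layer activations
            hb = np.append(h, 1.0)                   # bias unit for the output layer
            y = sigmoid(hb @ W2)                     # network output
            e = T[i] - y                             # error term
            sse += float(e @ e)
            # Propagate the error back and adjust the weights.
            delta_out = e * y * (1.0 - y)
            delta_hid = (delta_out @ W2[:-1].T) * h * (1.0 - h)
            W2 += lr * np.outer(hb, delta_out)
            W1 += lr * np.outer(Xb[i], delta_hid)
        learning_curve.append(sse)                   # ideally converges toward zero
    return W1, W2, learning_curve

# Toy example (inputs and targets scaled roughly to the 0.1-0.9 range).
X = np.array([[0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9]])
T = np.array([[0.1], [0.9], [0.9], [0.1]])
W1, W2, curve = train_backprop(X, T)
```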
An alternative form of network trained in a supervised fashion is ADALINE.
ADALINEs are much less complex than back-propagation trained MLPs but employ
similar strategies. The output of the ADALINE is a linear combination of the weighted
sums of the inputs. No sigmoid transfer function is used. An error term is then calcu-
lated and used to modify the network weights. The weighted sum is fed through a
thresholding unit to give a binary output.
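For contrast, here is a comparable sketch of the ADALINE just described: a linear combiner trained with the Widrow-Hoff (LMS) rule, with a thresholding unit producing the final binary output. This is an assumed, minimal formulation rather than the original implementation.

```python
import numpy as np

def train_adaline(X, t, lr=0.01, epochs=100, seed=0):
    """ADALINE: linear weighted sum of the inputs, Widrow-Hoff (LMS) update."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # bias input fixed at 1
    w = rng.uniform(-0.5, 0.5, Xb.shape[1])
    for _ in range(epochs):
        for i in range(len(Xb)):
            a = Xb[i] @ w                        # linear combiner (no sigmoid)
            e = t[i] - a                         # error on the linear output
            w += lr * e * Xb[i]                  # Widrow-Hoff / LMS weight change
    return w

def adaline_predict(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xb @ w >= 0.0, 1, 0)         # thresholding unit -> binary output
```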
Associative memories are another example of ANN learning in the supervised
mode. These networks consist of two sets of neurons, representing input and output
layers. All neurons in the input layer are connected to all neurons in the output layer.
By adjusting the connection weights between the neurons, it is possible to store
examples of input patterns and their corresponding output classes. Subsequent test
input patterns will produce outputs associated with the closest exemplar class.
Autoassociation memories are trained to produce an output identical to the input, so
when presented with noisy and incomplete test patterns they make an informed guess of
the missing data points and hence effectively achieve filtering (noise reduction).

1.4.2. Unsupervised Learning

We may think of the following analogy to compare the supervised with the unsu-
pervised mode of learning. Learning with supervision corresponds to classroom learn-
ing with the teacher's questions answered by the students and the answers corrected, if
needed, by the teacher. Unsupervised learning corresponds to learning the subject from
a videotape lecture covering the material but not including any other teacher's involve-
ment. The teacher lectures on directions and methods but is not available to provide
explanations and answer questions (Zurada, 1992).
In unsupervised learning the desired response or the target vector is unknown;
thus, explicit error information cannot be used to improve network behavior. Since
no information is available about the correctness or incorrectness of responses, learning
must somehow be accomplished based on observations of responses to input patterns
that we do not have much knowledge about. Unsupervised learning algorithms use
patterns that are typically redundant raw data having no labels regarding their class
membership or associations. In this mode of learning, the network must discover for
itself the existence of any possible patterns, regularities, distinguishing properties, and
so on. While discovering these possibilities, the network changes its parameters or
undergoes self-organization (Kohonen, 1990). This is called the self-organizing feature
maps (SOMs) algorithm. In this algorithm, the training process involves the presenta-
tion of pattern vectors from the training set one at a time. A winning node is selected in
a systematic fashion after all input features are presented. A weight adjustment process
takes place by using the concept of a neighborhood that shrinks over time and a
learning coefficient that also decreases with time. After several input patterns are pre-
sented, weights will form cluster centers that sample the input space such that the point
density function of the cluster centers approaches the probability density function of the
input features. The weights will also be organized such that topologically close output
nodes are sensitive to inputs that are physically similar. Thus, the output nodes will be
ordered in a natural way.
The unsupervised network simply consists of input and hidden neurons. The net-
work learns by associating different input pattern types with different clusters of hidden
nodes. When trained, different groups of hidden neurons respond to different classes of
input patterns. Some information about the number of clusters or similarity versus
dissimilarity of patterns can be helpful for this mode of learning.
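A minimal sketch of the self-organizing feature map training loop described above: a winning node is selected for each input vector, and the weights of the winner and its neighbours are adjusted, with both the neighbourhood and the learning coefficient shrinking over time. A one-dimensional map and a Gaussian neighbourhood are assumed here for brevity; they are not details from the chapter.

```python
import numpy as np

def train_som(X, n_nodes=10, epochs=100, seed=0):
    """1-D Kohonen self-organizing feature map (a sketch)."""
    rng = np.random.default_rng(seed)
    W = X[rng.integers(0, len(X), n_nodes)].astype(float)   # initialise from the data
    positions = np.arange(n_nodes)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)                              # learning coefficient decays
        radius = max(1.0, (n_nodes / 2.0) * (1.0 - frac))    # neighbourhood shrinks
        for x in X[rng.permutation(len(X))]:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            # Gaussian neighbourhood centred on the winning node.
            h = np.exp(-((positions - winner) ** 2) / (2.0 * radius ** 2))
            W += lr * h[:, None] * (x - W)
    return W                                                 # rows act as cluster centres

# Example: map 2-D samples onto a 10-node chain of cluster centres.
X = np.random.default_rng(1).normal(size=(500, 2))
centres = train_som(X)
```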

1.5. Hardware and Software Implementation of ANNs

Most current applications of neural nets are implemented by digital simulation.
Either software or hardware (e.g., DSP chips) is used to emulate parallel computation.
Such an implementation generally involves configuring simulation software for the
chosen network and then training the network. A stripped-down version of the simu-
lator is then embedded in the final application for use in a production environment. In
the digital simulation of an ANN, an important design factor is the sequence of updat-
ing the processing elements.
An important consideration in the implementation of neural nets is the training
process. Recall that any pattern recognition system, regardless of method of implemen-
tation, merely partitions the feature space into regions corresponding to the different
classes. In a neural net approach, the decision surfaces that result from extensive
training can be quite complex, particularly if the number of nodes and interconnections
is large (with a Bayes classifier using normal statistics, decision surfaces are second-order hypersurfaces). If the size of the training set is not large, the network could merely
memorize the particular training set, rather than adjusting itself to recognize all mem-
bers of the classes at large. Overtraining can be avoided by using a large training set
that is distinct from the testing set. When the error rate on the test set stops decreasing
and begins to increase, overtraining has started.
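One way to operationalize this overtraining check is early stopping on a held-out set; the sketch below assumes hypothetical helper routines (train_one_epoch, evaluate, snapshot) supplied by the user and is not tied to any particular network.

```python
def train_with_early_stopping(net, train_set, held_out_set, max_epochs=1000, patience=10):
    """Stop when the error on held-out data stops decreasing and begins to rise.
    train_one_epoch, evaluate, and snapshot are assumed (hypothetical) helpers."""
    best_err, best_state, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)          # one pass over the training data
        err = evaluate(net, held_out_set)        # error rate on data not used for training
        if err < best_err:
            best_err, best_state, epochs_since_best = err, snapshot(net), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:    # error has turned upward: overtraining
                break
    return best_state
```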
Another important implementation consideration from both performance and
computational points of view is the size of the network. It has been shown that one
hidden layer is sufficient to approximate the mapping of any continuous function and
that at most two hidden layers are required to approximate any function in general
(Cybenko, 1989 and Hornik, 1989).
The number of PEs in the first layer is application dependent. When a back-
propagation network is used to classify patterns, the number of PEs is equal to the
number of elements in the feature vector. Likewise, the number of PEs in the output
layer is usually the same as the number of classes.
The number of subsequent hidden layers and the number of PEs in each such layer
are design choices. In most applications, the number of PEs in each hidden layer is a
small fraction of the number of units in the input layer. It is usually desirable to keep
this number small to avoid possible overtraining. On the other hand, too few PEs in the
hidden layer may make it difficult for the network to converge to a suitable partitioning
of a complex feature space. Once a network has converged, it can be reduced in size and
retrained, often with improvement in overall performance.

The data used for training the neural net must be representative of the population
(over the entire feature space) for the network to model adequately the probability
density function of each class. It is also important that the training patterns be pre-
sented randomly. The network must be able to generalize to the entire training set as a
whole, not to the individual classes one at a time. Presenting classes of vectors sequen-
tially can result in poor convergence and unreliable class discrimination. Training on
patterns randomly generates a type of noise that can jog the network out of a local
minimum in the feature space. Noise is sometimes added to the training set to assist
convergence.

2. APPLICATION OF ANNs IN PROCESSING INFORMATION

In a pattern recognition problem, the input to the ANN is the feature vector of the
unknown pattern. The feature vector is presented to each node in the input layer of the
network. Often the feature vector is augmented by an additional element that is chosen
to be unity. This provides an additional weight in the summation that acts as an offset
in the activation function. The unknown pattern is assigned to the class
specified by the output vector. The network, then, accepts a feature vector as input
and generates an output vector indicating a membership value corresponding to the
class to which the unknown pattern belongs.
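A small sketch of the input convention described here: the feature vector is augmented with a unity element so that one weight plays the role of the offset, and the pattern is assigned to the class whose output membership value is largest. The winner-take-all assignment and the name `network` (standing for any trained input-output mapping) are assumptions for illustration.

```python
import numpy as np

def classify(feature_vector, network, class_labels):
    """Augment the feature vector with a unity element and pick the class with
    the largest output membership value (a sketch)."""
    augmented = np.append(feature_vector, 1.0)   # extra element fixed at 1 -> offset weight
    outputs = network(augmented)                 # one membership value per class
    return class_labels[int(np.argmax(outputs))]
```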
In addition to the selection of the processing element (nonlinearity and the activa-
tion function) and network topology, the behavior of the network is determined by the
connection weights. Values for these weights are adjusted during the training of the
network and are fixed when the network is operating in a production mode.
2.1. Processing and Analysis of Biomedical Signals
A main pillar of modern medicine is measurement of biomedical signals such as
cardiac electrical activity, heart sounds, respiratory flows and sounds, and brain and
other neural electrical activities. Almost all biomedical signals are contaminated with
noise artifacts due to sensing methods or environmental noise. In addition, extracting
useful information regarding any pathological conditions from measured physiological
signals demands careful analysis of the signal by an expert. Modern analog and digital
signal processing techniques have contributed significantly in making the analysis of
physiological signals easier and more accurate. For instance, these techniques help
remove different types of noise, identify inflection points, and combine multiple signals.
They also detect and classify changes in the signal due to pathological events and
conditions and transform the signal to extract hidden information not available in
the original signal.
2.2. Detection and Classification of Biomedical Signals Using ANNs
The primary goal of processing biomedical signals is to identify the pathological
condition of the patient and monitor the changes in the condition over a course of
treatment or with a procedure. ANNs have the potential to provide such information as
demonstrated by their application in processing and analysis of biopotentials, medical
images, speech and auditory processing, and so forth. They are particularly useful in
cases where the problem to be solved is intractable and development of an algorithmic solution is difficult. In most applications, the efficacy of ANNs is evident when they are
used in conjunction with other signal and image processing techniques. In other words,
although ANNs could be used to extract features and compress data, their major
impact has been in classification of events.
In the following sections, we will attempt to outline some of the latest applications
that have been reported in conferences and journal articles. Naturally, doing justice to
the large number of excellent international research groups and individuals involved in
the field is a very difficult task. Also, presenting comprehensive, balanced, and in-depth coverage of all applications would deserve a book in its own right or at least several review
journal articles. Therefore, the aim of this part of the chapter is to present a cross
section of some of the recent applications of ANNs in the processing and analysis of
biomedical signals.
In general, ANNs achieve two principal functions. The first is pattern association.
Input patterns usually contain distortions in their components due to natural variabil-
ity, noise, or missing information. This attribute allows ANNs to be used as pattern
recognition or association devices. The second important function of ANNs is dimen-
sionality reduction or feature extraction. The transformation of input patterns so that
nonessential data are removed, or more efficient features are created, is one of the more
valuable properties of ANNs and is beneficial in certain types of applications such as
automatic speech recognition.
To be more specific, ANNs have found application in the biomedical field in three
broad categories: signal compression, enhancement, and interpretation. They have been
used to compress, enhance, and interpret data. In the compression capacity, ANNs are
taught to represent all the information present (in a signal or in an image) as the
activation levels of its hidden layer neurons. If there are fewer hidden layer neurons
than inputs, then signal or image compression is achieved. When an ANN is used to
enhance a signal, it produces an output, which is free of noise or accentuates the desired
information contained in the biomedical signal or the medical image, thus aiding inter-
pretation. ANNs are also used to identify or classify events or states, such as normal
and pathological states. In this section, examples of the application of ANNs in proces-
sing and analysis of biomedical signals are presented.

2.3. Detection and Classification of Electrocardiography Signals
The pumping action of the heart is coordinated by an electrical activation sequence
initiated at the natural pacemaker of the heart and conducted through the conductive
pathways leading into the myocardial cells. This activation sequence results in the
production of closed-line action ionic currents that flow in the thoracic volume con-
ductor.
The electrocardiogram (ECG) signal is a manifestation of electrical activity mea-
sured as potentials on the body surface using electrodes. In electrocardiography, the
electrical activity of the heart is represented as a net current dipole generator located at
the cardiac center of the heart (within the anatomical boundaries of the heart). The
thoracic medium is considered as the resistive load of this equivalent cardiac generator.
During the cardiac cycle, the magnitude and orientation of the cardiac dipole change
and its projection onto different cardiographic leads give rise to the ECG signal.

The ECG signal is an important diagnostic tool in assessment of cardiac function.


The normal activity of the heart results in a regular heartbeat that can vary in response
to nervous, humoral, and other factors. Any departure from regularity caused by dis-
turbances in rhythm generation, conduction, or both can produce abnormalities in the
recorded ECG. These abnormalities are known as arrhythmias. Cardiac arrhythmias
may in turn change the normal sequence of atrial and ventricular activation and con-
traction.
Cardiac arrhythmias are a major cause of morbidity and mortality. They have
stimulated enormous research efforts by clinicians, biomedical engineers, physiologists,
and pharmacologists over the past four decades. Considerable progress has been made
on all fronts, and yet improvement in the processing and interpretation techniques
continues. One of the most challenging cardiac dysfunctions manifested in cardiac
arrhythmias, namely, sudden death, is still an unsolved problem.
Processing and analysis of the ECG signal are of paramount importance in the
detection of cardiac arrhythmias. The ECG signal is frequently influenced by a number
of noise components of different origin (biological, electrical, and mechanical).
Movements of muscles generate electromyogram (EMG) (high-frequency) noise;
respiration provokes baseline wander and modulates the amplitude of the QRS com-
plex. Electrical interference from powerlines, electrocautery, and electrode impedance
changes at the tissue-electrolyte interface also contribute to ECG noise. Therefore,
preprocessing ECG signals to remove the variety of noise components is a primary
and important signal processing task that influences all subsequent stages of the
arrhythmia detection process.
Another very critical step in arrhythmia classification is accurate and robust detec-
tion of the QRS complex, which is associated with ventricular depolarization and
precedes ventricular contraction. This component of the ECG signal contains signifi-
cant information that undergoes morphologic alterations in beats of ventricular origin.
Therefore, reliable detection of QRS morphologies constitutes a fundamental step in
the detection and classification of ventricular arrhythmias.
There are numerous approaches to automated ECG arrhythmia detection and
classification. They include signal processing methods such as frequency analysis,
time-frequency analysis, wavelet analysis, template matching classification techniques,
and feature extraction classification techniques. There are certain advantages and dis-
advantages associated with each technique. For example, the advantage associated with
the template matching classification technique is storage of morphological representa-
tions of input patterns, which are of great clinical value with control and follow-up
examination of patients. A disadvantage of this approach is the difficulty associated
with the temporal alignment of the input QRSs with those of the stored templates.
Feature extraction classifiers usually have a decision tree structure or a set of rules
that determines the result of classification by means of comparing the extracted features
with certain thresholds, usually determined empirically. Feature extraction-based tech-
niques have the advantage of reducing template temporal alignment calculations. The
efficiency of the feature extraction methods depends completely on how representative
the selected features are and on the thresholds used in the rules or decision trees.
A number of researchers [2-8] have used ANNs and a variety of features to detect
and classify ECG beats. Watrous and Towell [8] used a multilayer perceptron (MLP)
with the back-propagation learning algorithm to discriminate between normal and
ventricular beats. Marques et al. [6] used a diverse set of features as inputs to different
ANNs to discriminate between normal rhythm, ventricular hypertrophies, and myocardial infarcts. Nandal and Bossan [5] used an MLP for discriminating between nor-
mal, ventricular, and fusion rhythms, based on the first coefficients from the analysis of
the principal components of the ECG signal. Hu et al. [7] reported the development of
an adaptive MLP for classification of ECG beats. They achieved an average recognition
accuracy of 90% in classifying beats into two groups, normal and abnormal. In an
attempt to classify beats into 13 groups according to the MIT/BIH database annota-
tion, they reported an average recognition accuracy rate of 65%. A hierarchical system
of MLP networks, which first classified the beat into normal or abnormal and then
classified it into the specific beat type, was developed, which improved the recognition
accuracy to 84.5%.
An ECG rhythm classifier that performs well for a given training database often
fails miserably when presented with ECG waveforms from a different patient group.
Such an inconsistency in performance is a major hurdle preventing development of
highly reliable fully automated ECG processing systems with wide clinical acceptance.
New approaches to alleviate these problems are being developed by different
groups. Hu et al. at the University of Wisconsin proposed a "mixture-of-experts"
(MOE) approach to customize an ECG beat classifier to further improve the perfor-
mance of ECG processing and to offer individualized arrhythmia detection [7].
They developed a small customized classifier based on brief patient-specific ECG
data. It was then combined with a global classifier, which was tuned to a large ECG
database of many patients, to form an MOE classifier structure. In testing the MOE
approach with the MIT/BIH arrhythmia database, significant performance improve-
ment was observed. The approach of Hu et al. was based on three popular ANN-
related algorithms, namely the self-organizing maps (SOMs) and learning vector quan-
tization (LVQ) algorithms, along with the MOE method. SOM and LVQ together were
used to train the patient-specific classifier, and MOE facilitated the combination of the
two classifiers (original and patient-specific) to realize patient adaptation. In MOE, the
two classifiers are modeled as two experts on ECG beat classification. The original
classifier, called the global expert, knew how to classify ECG beats for many other
patients whose ECG records were part of the in-house, large ECG database. The
patient-specific classifier, called the local expert, was trained specifically with the
ECG record of the patient. A gating function, based on the feature vector presented,
dynamically weighted the classification results of the global and the local experts to
reach a combined decision. This process was analogous to having two human experts, a
specialist and a family doctor, arrive at a consensus based on their expertise.
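Schematically, the gating idea can be written as follows: a gate driven by the feature vector produces mixing coefficients that weight the class scores of the global and local experts. The softmax gate and the names below are illustrative assumptions; the published MOE/SOM/LVQ formulation differs in its details.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def moe_classify(x, global_expert, local_expert, gate_weights):
    """Combine a global and a patient-specific (local) classifier with a
    feature-dependent gate (schematic sketch only)."""
    p_global = global_expert(x)            # class scores from the global expert
    p_local = local_expert(x)              # class scores from the local expert
    g = softmax(gate_weights @ x)          # two mixing coefficients from the features
    combined = g[0] * p_global + g[1] * p_local
    return int(np.argmax(combined))        # combined decision
```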
Barso et al. [9] reported a new ANN aimed at the morphological classification of
heartbeats detected on a multichannel ECG signal. In this work, they emphasized the
adaptive classification behavior of the ANN and its capacity to self-organize
dynamically in response to the characteristics of the ECG signal. This network is based
on adaptive resonance theory (ART) and was called MART for multichannel ART.
They argued that MART had the following capabilities:

• On-line unsupervised learning


• Updating of the morphological templates that match the input beats
• Creation of new morphological classes as they appear and removal of classes
that became obsolete
• Adaptation of the discriminative capacity of each morphological class detected to the variability of the input patterns associated with it
• Selective evaluation of the beat-class differences in each channel as a function of
the signal quality
Their results demonstrated how the adaptive behavior of MART diminished repe-
tition for the same morphological pattern, thus maintaining a high capacity for mor-
phological discrimination. It was also shown that MART selectively evaluates the beat-
class differences in each channel as a function of the signal quality, therefore granting
more credit to the final classification of the channels that, in principle, have greater
signal-to-noise ratios.
2.4. Detection and Classification of Electromyography Signals
Muscle tissue conducts electrical potentials somewhat similarly to the way nerve
axons transmit action potentials. Skeletal muscle is organized functionally on the basis
of the motor unit. A motor unit consists of a single motoneuron and the group of
muscles that it enervates. The motor unit is the smallest unit that can be activated by a
volitional effort, in which case all constituent muscle fibers are activated synchronously. The fibers of a given motor unit are interspersed with fibers of other motor units. Thus,
the muscle fibers associated with the single motor unit constitute a distributed, bio-
electric source unit. This source is located in a volume conductor consisting of all the
other muscle fibers that are both active and inactive. The evolved extracellular field
potential from the active fibers of a single motor unit has a triphasic form of brief
duration (3-15 ms) and amplitude of 0.02 to 2mV, depending on the size of the motor
unit. This special electrical signal generated in the muscle fibers as a result of recruit-
ment of a motor unit is called the motor unit action potential (MUAP).
Electromyography (EMG) is the study of the electrical activity of muscle and is a
valuable aid in the diagnosis of neuromuscular disorders. EMG findings are used to
detect and characterize disease processes affecting the motor units.
Neuromuscular diseases are a group of disorders that may involve the motor
cortex, the brain stem and other subcortical areas, spinal circuitry (anterior horn
cells of the spinal nerves), the neuromuscular junction, and the muscle itself. The
time course and other characteristics of MUAPs are considerably modified by disease.
In peripheral neuropathies, partial denervation of the muscle frequently occurs and is
followed by regeneration. Regenerating nerve fibers conduct more slowly than healthy
axons. In addition, in many forms of peripheral neuropathy, the excitability of neurons
is altered and there is widespread slowing of nerve conduction. One effect of this is that
neural impulses are more difficult to initiate and take longer in transit to the muscle,
generally causing scatter or desynchronization in the EMG pattern.
Neuromuscular diseases cause muscular weakness and/or wasting. There are many
muscular disorders, but two groups are readily amenable to computer-aided diagnosis
because of the consistency of their clinical appearance. These are motor neuron disease
(MND) and myopathies (MYOs) [10].
MND is a disease causing selective degeneration of the upper and lower motor
neuron. This disease affects middle-aged to old people, with progressive widespread loss
of motor neurons usually leading to death within 3 to 5 years. MND typically increases
the duration and the amplitude of MUAPs compared with those in normal subjects.

The occurrence of longer and higher magnitude MUAPs reflects an increase in the
number or density of fibers in motor units or increases in the temporal dispersion of
the activity picked up by the recording electrode.
MYOs are a group of diseases that primarily affect skeletal muscle fibers. They are
divided into two groups: inherited and acquired. Most muscular dystrophies are her-
editary, causing severe degenerative changes in the muscle fibers. Polymyositis is an
example of an acquired myopathy, which is characterized by an acute or subacute onset
with muscle weakness progressing slowly over a matter of weeks. MYOs typically
shorten the MUAP duration and reduce its amplitude compared to normal. These
findings are attributed to fiber loss within the motor unit, with the degree of reduction
of duration and amplitude reflecting the amount of fiber loss.
In routine clinical electromyography, MUAP morphology is subjectively evaluated
by the technician or neurologist. However, a purely descriptive approach is not suffi-
cient and an exact quantitative measurement of different MUAP parameters are neces-
sary. In addition, manual analysis is time consuming and the subjective measurement of
MUAP parameters introduces variable sources of error.
During the past two decades, rapid advances in signal processing methods and
computing technology have made automated quantitative EMG analysis feasible.
Computer-aided EMG processing and diagnosis saves time, standardizes the measure-
ments, and enables the extraction of additional features that cannot be easily calculated
by manual methods. To further the development of quantitative EMG techniques, the
need for automated decision making has emerged.
Integrated computer-aided diagnosis of neuromuscular disorders has become pos-
sible by combining quantitative MUAP analysis techniques with the pattern recogni-
tion capabilities of ANNs. Pattichis et al. [11] used a parametric pattern recognition
algorithm that facilitated automatic MUAP feature extraction and combined this algo-
rithm with the classification abilities of ANN models to provide an integrated environ-
ment for the diagnosis of MND and MYOs. They investigated 10 network architectures
with two learning paradigms: supervised and unsupervised. For supervised learning,
they used back-propagation, and for unsupervised learning, they used Kohonen's self-
organizing feature maps algorithm.
In their comprehensive study, Pattichis et al. used a total of 880 MUAPs collected
from the biceps brachii muscle of 44 subjects: 14 normals, 16 with MND, and 14 with
MYOs. The mean and standard deviation of seven features were extracted from
MUAPs, giving a total of 14 features. Duration, spike duration, amplitude, area,
spike area, number of phases, and number of turns were used for classification into
normal, MND, and MYO. Ten different ANN architectures were designed with 14
input nodes, 3 output nodes, and 2 hidden layers with different numbers of nodes
(i.e., 10-5, 40-10, and 100-10).
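A small sketch of how the 14-element input vector described above could be assembled: the mean and standard deviation of each of the seven MUAP features, computed over one subject's MUAPs. The array layout and the function name are assumptions made for illustration, not the authors' code.

```python
import numpy as np

MUAP_FEATURES = ["duration", "spike_duration", "amplitude", "area",
                 "spike_area", "n_phases", "n_turns"]

def subject_feature_vector(muaps):
    """14-element ANN input for one subject: mean and standard deviation of the
    seven MUAP features (muaps is assumed to have shape [n_muaps, 7])."""
    muaps = np.asarray(muaps, dtype=float)
    return np.concatenate([muaps.mean(axis=0), muaps.std(axis=0)])  # 7 means + 7 SDs
```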
The training set was formed by randomly selecting MUAP data from 24 (8 subjects
from each group) of the 44 subjects. The data from the remaining 20 subjects were used
for evaluating the performance of the ANN models. These sets were used for cluster
analysis and for training the back-propagation and Kohonen self-organizing feature
maps paradigms.
The diagnostic performance of the neural models investigated was of the order of
80-90% for models trained with the back-propagation algorithm and 80% for models
trained with Kohonen's self-organizing feature map algorithm. K-means cluster analy-
sis was also used on the same data set and was observed to offer poorer performance.

2.5. Detection and Classification of Electroencephalography Signals
The summated neuronal activity of the brain recorded as minute electrical poten-
tials from the human scalp is called the electroencephalogram (EEG). Conventionally,
such potentials are recorded with three types of electrodes: scalp, cortical, and depth
electrodes. For scalp recording, the electrodes are typically placed on the scalp in
accordance with some internationally defined geometrical sites (i.e., the 10-20 system,
which is based on measurements made from the nasion, inion, and right and left
preauricular points). Information derived from the depth electrodes and microelec-
trodes has shown that under normal circumstances conducted action potentials in
axons contribute little to the surface EEG because they occur asynchronously in time
in large numbers of axons, which run in many directions relative to the surface. Thus,
their net influence on potential at the surface is negligible. Whether recorded from the
scalp, cortex, or depths of the brain, these biopotentials represent a superposition of the
volume conductor fields produced by a variety of active neuronal current generators.
Unlike the relatively less complex bioelectric sources in the nerve trunks generating the
electroneurogram (ENG), the sources generating the EEG are aggregates of neuronal
elements such as dendrites, cell bodies, and axons.
Electrophysiologists have shown that the EEG signal is a reflection of the net effect
of local postsynaptic potentials of cortical cells. They argue that the EEG recorded at
the surface must be due to the orderly and symmetric arrangement of vertically oriented
pyramidal cells with their long apical dendrites running in parallel. These cells are
located in layers III-V of the cerebral cortex.
The EEG signal contains information regarding changes in the electrical potential
of the brain obtained from a given set of recordings. These data include the character-
istic waveforms with accompanying variations in amplitude, frequency, phase, and so
on as well as brief occurrence of electrical patterns.
The EEG signal patterns are modulated by a wide range of variables, including
biochemical, metabolic, circulatory, humoral, neuroelectric, and behavioral factors.
The EEG is extremely difficult for an untrained observer to interpret, partially because
of the spatial mapping of functions onto different regions of the brain and electrode
placement. The EEG is used routinely in the diagnosis of neurological disorders such as
epilepsy, stroke, and brain damage; it is also used in sleep and drug research and the
investigation of psychiatric disorders. A comprehensive discussion of EEG application
in automatic detection and classification of different neurological disorders and the
potential impact of ANNs in all cases is beyond the scope of this chapter. Therefore,
attention is given only to epilepsy.
One of the more common clinical uses of the EEG is in diagnosing, monitoring, and mana-
ging different types of epilepsy and in localizing the focus in the brain causing the
epilepsy. This disorder is characterized by sudden recurrent and transient disturbances
of mental function and/or movement of the body that result from excessive discharge of
the brain cells. The presence of epileptiform activity in the EEG confirms the diagnosis
of epilepsy, which sometimes can be confused with other disorders producing similar
seizurelike activity.
Visual screening of EEG records for detection and classification of sharp transients
is a complex and arduous process, indeed a time-consuming exercise for highly trained
professionals, who are generally in short supply. This makes the automated detection of
spikes and seizures of great clinical significance. In addition, the use of ambulatory
monitoring, which produces 24-hour or longer continuous EEG recordings, is becom-
ing more common, creating a higher sense of urgency for the development of auto-
mated methods for detection and classification of EEG spikes. Over the past several
years, substantial progress has been achieved in the analysis of EEG signals and devel-
opment of automatic recognition of epileptiform transients [12]. Such methods usually
decompose the EEG signal into waves or half-waves by determining extrema amplitude.
Each wave is then examined for its fit to a set of predetermined criteria, such as
duration, amplitude, slope, and sharpness.
Accurate detection of spike/spike-wave (SSW) is now possible from artifact-free
signals. However, difficulties arise with EMG and EEG activities resembling spikes.
Such difficulties plague automatic spike detection methods. Neural network classifica-
tion methods can provide opportunities to improve on the performance of more tradi-
tional methods.
In most recent years, ANNs have been used for detecting EEG spikes. ANN-
based spike detection systems basically use two different approaches for input repre-
sentation: (1) extracted EEG features and (2) raw EEG signal. In the first approach,
features such as slope and sharpness are extracted and presented to the ANN for
training and testing [13]. Success of such methods depends on proper selection of
features, which may not be known completely a priori. In the second approach, the
raw EEG signal is presented to the ANN after proper scaling and windowing [14].
The success of such methods relies heavily on selection of the appropriate window
size. A compromise is usually made between window size, effective EMG filtering,
and detection accuracy.
Kalayci and Ozdamar [15] implemented a family of three-layer MLP feed-forward
neural networks employing the back-propagation learning algorithm to detect EEG
spikes. They used wavelet transform (WT) to extract features from SSW and non-SSW
(any other activity including background EEG and EMG artifact) to train and test their
ANN classifiers. The EEG data were collected from five patients (two males and three
females with an average age of 13.8 years and range of 8-15 years) diagnosed with
epilepsy. In this study, a total of 3614 (761 SSW and 2853 non-SSW) files were gener-
ated. They used 1200 (400 SSW and 800 non-SSW) wavelet-transformed files randomly
selected to form the training set. The data from 2414 (361 SSW and 2053 non-SSW)
wavelet-transformed files were used as the testing set. Two sets of wavelet features,
Daubechies 4 and Daubechies 20, were extracted from SSW and non-SSW data files
containing record lengths of 512 EEG data points. To investigate the effects of the
different WT scales on detection, feature sets with 8 coefficients from resolution scale 1
and 24 coefficients from resolution scale 3 were used.
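As a rough illustration of this feature extraction step, the sketch below decomposes a 512-point EEG segment with a Daubechies wavelet and returns detail coefficients from one resolution scale as the ANN input vector. It uses the PyWavelets package and an arbitrary scale choice; neither is taken from the study itself.

```python
import numpy as np
import pywt

def wavelet_features(segment, wavelet="db4", level=3):
    """Daubechies wavelet decomposition of a 512-point EEG segment; one set of
    detail coefficients is returned as the ANN input (illustrative choice)."""
    coeffs = pywt.wavedec(np.asarray(segment, dtype=float), wavelet, level=level)
    # coeffs = [coarsest approximation, coarsest detail, ..., finest detail]
    return coeffs[1]                          # detail coefficients at the coarsest scale

segment = np.random.randn(512)                # placeholder for one EEG record
features = wavelet_features(segment)
```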
A family of MLPs was designed for this study. Each MLP had a different number
of input nodes (20, 16, 8), a variable number of hidden layer neurons (from 3 to 8), and
one output neuron. SSW and non-SSW events were represented as 0.8 and -0.8,
respectively, and a hyperbolic tangent function was used as the activation function.
Each network was trained and tested at least twice, and the best classification accuracy
was chosen.
Classification performance of the ANN models was measured using conventional
criteria. SSWs and non-SSWs were considered as positive and negative events, respec-
tively. A true positive (TP) outcome was registered when both the ANN and neurolo-
gists classified an EEG portion as SSW. False positive (FP), true negative (TN), and
false negative (FN) outcomes were similarly described. Sensitivity and specificity were
calculated as %TP/(TP + FN) and %TN/(FP + TN), respectively. The average of
these two percentages was used to calculate the overall classification accuracy.
Proper selection of the WT resolution scale with the variety of ANNs tested in this
study showed more than 90% accuracy in detection performance, as defined by the
average of sensitivity and specificity.
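Written out directly, the performance measures used in the study are as follows; the function name is illustrative.

```python
def detection_performance(tp, fp, tn, fn):
    """Sensitivity, specificity, and their average (the overall classification
    accuracy used above), expressed as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (fp + tn)
    overall = (sensitivity + specificity) / 2.0
    return sensitivity, specificity, overall
```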

2.6. Detection and Classification of Electrogastrography Signals
The electrogastrogram (EGG) is the cutaneous recording of myoelectrical activity
of the stomach measured by placing surface electrodes on the abdominal skin. Four
biopotential electrodes filled with electrolyte gel are used. Three recording electrodes
are placed over the stomach, and the common reference electrode is placed 6-10 cm
superior near the right breast. The EGG signal consists of two rhythmic components:
gastric slow waves and gastric spikes. The gastric slow wave is present all the time. It
controls the frequency and propagation of contractions of the stomach. It originates in
a region near the junction of the proximal one-third and two-thirds of the gastric body
and propagates circumferentially and distally toward the pylorus with increasing velo-
city and amplitude. The normal frequency of the slow wave is 3 cycles/min (cpm) in
humans.
Gastric spikes are present only when gastric contractions occur and are super-
imposed on gastric slow waves. The spike has a random phase and a normal frequency
of 60 cpm. The dominant frequency of the EGG represents the frequency of the slow
wave. Spikes are reflected as relative amplitude changes in the EGG signal.
Unlike other biopotentials, for example, the electrocardiogram, the EGG has a low
signal-to-noise ratio (SNR). It contains respiration noise, ECG noise, and electrode
motion artifacts. Popular signal processing methods to improve the SNR of the EGG
signal include phase lock filtering, autoregressive modeling, adaptive filtering, fast
Fourier transform, and smoothed power spectral analysis.
In clinics, the digestive process of the stomach is commonly assessed by measuring
gastric emptying. The gastric emptying test is performed by asking the subject to eat an
isotope-labeled solid meal and then stay under a gamma camera for 2 hours while
abdominal images are acquired. In this test, a patient is said to have delayed gastric
emptying if more than 70% of the ingested solid meal remains in the stomach 2 hours
after the meal is eaten. Patients with delayed gastric emptying must be treated using
appropriate medications. Patients with severe delayed gastric emptying may have to
undergo abdominal surgery for intestinal tube feeding. The gastric emptying test is
invasive and expensive. Therefore, it is desirable to develop noninvasive, low-cost
methods for diagnosis of delayed gastric emptying.
The frequency content of the EGG signal is an accurate measure of the gastric slow
wave, and the relative amplitude change of the EGG reflects the contractility of the
stomach. The EGG is expected to have relatively higher amplitude when the stomach
contracts and relatively lower amplitude when there are no contractions. The EGG has
been proved reliable and accurate and, as such, it has attractive applications in medical
research and clinical diagnosis of patients with suspected gastric motor disorders.
The EGG accurately represents the frequency of gastric slow wave and provides
important information about motor activities of the stomach. Gastric electrical dys-

rhythmia has been shown to be associated with motor disorders of the stomach,
whereas the relative amplitude of the EGG is associated with gastric contractility.
Therefore, accurate detection of gastric dysrhythmia as well as amplitude variations
of the EGG is very important in clinical applications. EGG dysrhythmias reflect
motor disorders of the stomach. Gastric dysrhythmia includes tachygastria (slow
wave frequency of 4-9 cpm), bradygastria (slow wave frequency of 0.5-2 cpm), and
arrhythmia (no rhythmic activity). It is known that gastric dysrhythmia is often short.
ANNs have been used for automated diagnosis of delayed gastric emptying from
EGGs with encouraging results [16]. In this work, Lin et al. used five spectral parameters
of the EGG data as inputs to a back-propagation neural network with three
hidden nodes, achieving a correct diagnosis rate of 80%. Although satisfied with their
encouraging results, these investigators further improved the performance of the
neural network by using genetic algorithms in conjunction with cascade correlation
learning architecture [17]. This algorithm offered the advantage of automatically
growing the architecture of the neural network to give a suitable network size, and
it also reduced the training time and complexity associated with the back-propagation
network. The algorithm enabled the group to conclude that a neural network with
three hidden units seems to be a good choice for this application. They could achieve
the correct diagnosis in 83% of cases with a sensitivity of 84% and a specificity of
82%. The authors stated that although these results were comparable to those
obtained with their back-propagation network, this approach eliminated the guess-
work associated with the size and connectivity pattern of the network in advance and
improved the detection speed.
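A network of the general shape described above (five spectral inputs, three hidden nodes, one output) can be sketched as follows; this is not the authors' implementation, and the weights, feature values, and decision threshold are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-input, 3-hidden, 1-output feedforward network
W1 = rng.normal(scale=0.5, size=(5, 3))   # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> output weights
b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)              # hidden layer activations
    return np.tanh(h @ W2 + b2)           # scalar output in (-1, 1)

# Five illustrative spectral parameters extracted from one EGG recording
features = np.array([3.1, 0.42, 0.18, 0.07, 0.9])
score = forward(features).item()
print("delayed gastric emptying" if score > 0 else "normal gastric emptying")
```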
In this study, the researchers acquired EGG data from 152 patients. Based on
gastric emptying tests, the patients were predefined as two classes: 76 patients with
delayed gastric emptying and 76 patients with normal gastric emptying. The training
set contained 38 patients with delayed gastric emptying and 38 with normal gastric
emptying, randomly selected from the 152 patients. The remaining 76 patients were used as
the testing set, which also contained 38 patients with delayed gastric emptying and 38 with
normal gastric emptying.

2.7. Detection and Classification of Respiratory Signals
Two examples of the application of ANN-based processing of respiratory signals
are presented in this section. The first example illustrates how ANNs can be used to
assist the diagnosis of upper airway obstruction resulting from goiter. The second
example illustrates the superiority of an ANN-based detector in ascertaining the pre-
sence of pharyngeal wall vibration in subjects with obstructive sleep apnea. Detection of
pharyngeal wall vibration in these subjects is important, as it signals imminent collapse
of the airway and cessation of airflow to the lungs during sleep.

2.7.1. Detection of Goiter-Induced Upper Airway Obstruction

Bright et al. [18] investigated the use of ANN-based detectors to determine whether
a patient's goiter is causing upper airway obstruction (UAO). They explored the pos-
sibility of processing the flow-volume loops from standard forced expiratory vital

capacity (FVC) maneuvers [19] to establish whether goiter has caused UAO. Flow-
volume curves from 155 patients attending a specialized thyroid clinic were obtained
using the recommended three maneuvers [19]. From these traces, the trace with the highest
peak flow was selected for processing. Using these loops and other available physiologic
data, it was determined that 46 of the patients had UAO. Flow-volume loops from
these patients and 51 subjects who were thought not to have UAO were selected as set 1
and set 2. Further,flowvolumes from 50 other patients, not in the pool of the preceding
155 patients, with chronic obstructive pulmonary disease (COPD) were included as set
3. The performance of human experts in detecting the presence of UAO using only
flow-volume loops was established in the following fashion.
The three sets of loops described were mixed randomly. Two expert clinicians
independently examined the traces. They assessed each trace for the presence of
UAO by assigning an ordinal scale from 1 to 4. In this scoring, the assigned numbers
had the following significance: 1, not at all certain; 2, moderately certain; 3, quite
certain; and 4, very certain. Eight weeks later, each observer was asked to repeat the
scoring of the records independently. The inter- and intraobserver agreements between
the first and second scorings were measured by computing a kappa factor [20]. The
interobserver kappa factors for the first and second scorings were 0.58 and 0.68, respec-
tively. The intraobserver kappa was 0.5 for one observer and 0.46 for the other. This
process demonstrated the subjectivity and relative nonrepeatability of human scoring of
the data. Bright et al. used several standard measures obtained from the FVC maneuver
as inputs. These included peak expiratory flow (PEF), the exhaled volume during the
first second of the FVC maneuver (FEV1), and FEV1/FVC. Further, they developed
several novel indices to quantify the shape of the flow-volume loop. Specifically, these
features aimed at reflecting the observed relative flatness of the upper part of the loop
(high-flow region) for patients with UAO as compared with the loops for patients
without UAO. For instance, the exhaled volume range for each patient was divided
into 20 points and the corresponding measured flow at each point was expressed as a
percentage of measured peak flow. Initially, the standard flow-volume loop measures as
well as the devised novel measures, 50 features total, were used as input to an ANN.
The number of input nodes for this network was equal to the total number of features
(i.e., 50). The network also had two hidden layers with the number of nodes in the first
layer equal to approximately twice the number of input nodes and that in the second
layer equal to half the input nodes. The network had two outputs. Approximately two-
thirds to one-half of the flow-volume curves for each patient category from the patient
test record was used to train the network. The remaining records were combined to
form a test set. After completion of training, the total weight from each network input
node to each output node was computed. These sums were compared with each other,
and the five highest weight sums were selected as a reduced input set. The number of
inputs was then reduced to five.
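One possible reading of this weight-sum ranking is sketched below; the layer sizes, the aggregation of absolute path weights through the hidden layers, and the synthetic weights are all assumptions, since the text does not give the exact formula.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical trained weights of a 50-input network with two hidden layers and two outputs
n_in, n_h1, n_h2, n_out = 50, 100, 25, 2
W1 = rng.normal(size=(n_in, n_h1))
W2 = rng.normal(size=(n_h1, n_h2))
W3 = rng.normal(size=(n_h2, n_out))

# Total (absolute) weight from each input node to the output nodes, summed
# over all paths through the hidden layers -- one way to realize the
# "weight sum" criterion described in the text.
path_weight = np.abs(W1) @ np.abs(W2) @ np.abs(W3)   # shape (n_in, n_out)
score_per_input = path_weight.sum(axis=1)

top5 = np.argsort(score_per_input)[-5:][::-1]
print("indices of the five highest-scoring input features:", top5)
```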
To evaluate the performance of the second ANN with reduced number of inputs, it
was presented with 67 records that included 17 from data set 1 (i.e., patients with
UAO), 28 from set 2 (i.e., patients with goiter only), and 22 from set 3 (i.e., COPD
patients). The network exhibited 88% sensitivity, 94% specificity, and 92% total accu-
racy. To compare the performance of the devised ANN with that of other classification
methods, an analysis of the patient records using logistic regression was performed. In
this study, dependent variables for the logistic regression classifier were the same as the
inputs to the ANN. When tested with the patient records described earlier, the logistic

regression model had a sensitivity of 70.6%, specificity of 92%, and total accuracy of
86.6% (all at 5% significance level). Bright et al. concluded that the performance of the
ANN was superior to those of human and logistic regression classification.

2.7.2. Detection of Pharyngeal Wall Vibration During Sleep

It has been shown by Burk et al. [21] that pharyngeal wall vibration can be used as
an indicator of imminent collapse of the airway during sleep for patients with obstruc-
tive sleep apnea (OSA). They initially devised a classical method to detect pharyngeal
wall vibration (PWV) [22]. They applied this methodology to adjust automatically the
pressure applied by a continuous positive airway pressure (CPAP) machine, commonly
used to treat patients with OSA. They demonstrated that this approach is as effective as
treating OSA patients with conventional CPAP [23]. In their investigation of the
proposed system Behbehani et al. [24] determined that the performance of the phar-
yngeal wall vibration detector was satisfactory. However, they established that the
detection of PWV was not 100%. Lopez et al. [25] set out to develop an ANN-based
detector of the pharyngeal wall for patients with OSA. They aimed at improving the
detection rate over the previous classical method.
Lopez et al. designed an ADALINE network with 15 input nodes (plus 1 bias), two
hidden layers each having two nodes, and one output node. A schematic diagram of the
network is shown in Figure 4.
This topology was selected through a competitive comparison of the performance
of six distinct topologies. Specifically, 2853 data vectors containing pharyngeal vibra-
tion episodes combined with 4352 data vectors without any pharyngeal vibration epi-
sodes (a total of 7205) were used to train the competing topologies and select the most
effective one. The efficacy of the networks was judged by the percentages of false positive
and false negative detections as well as the final error. The training data was
collected from five volunteer patients (three male and two female). Input signals to the

Figure 4 Topology of the network for detecting pharyngeal wall vibration [26].

networks were derived from sensed pressure at the nasal mask. For this purpose, 32-
point ensembles from the nasal mask pressure were formed and their fast Fourier
transforms (FFTs) were obtained. The spectra were normalized for each patient using
the average of the last 15 spectral values, excluding the low-frequency value.
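The feature extraction described above might look roughly like the following sketch; the 32-point ensemble length, the use of the last 15 spectral values, and the exclusion of the low-frequency term come from the text, while the synthetic pressure signal and all variable names are illustrative assumptions.

```python
import numpy as np

def mask_pressure_features(segment):
    """Form a 15-value feature vector from a 32-point nasal mask pressure ensemble."""
    spectrum = np.abs(np.fft.rfft(segment))   # 17 magnitude values; index 0 is the DC term
    feats = spectrum[-15:]                    # last 15 spectral values, low-frequency bins excluded
    return feats / feats.mean()               # normalize by their average, as described in the text

# Synthetic 32-point pressure ensemble standing in for real nasal mask data
rng = np.random.default_rng(2)
t = np.arange(32)
segment = 8.0 + 0.3 * np.sin(2 * np.pi * t / 8) + 0.05 * rng.normal(size=32)
print(mask_pressure_features(segment))
```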
Performance of the network was evaluated by comparing its detection of PWVs
with the classical method. The evaluation data came from a group of five volunteer
male patients other than those whose data were used for selecting and training the
network. The comparison was made at three pressure levels, 4, 8, and 13 cm H 2 0
(i.e., low, medium, and high). Statistical comparison of the results showed that the
ANN-based network had a lower percentage of false negative at 4 and 13 cm H 2 0,
while the classical method had a lower percentage of false negative at 8 cm H 2 0 (with a
significance level of 0.05). Similarly, for 4 and 13 cm H 2 0, the percentage of false
positives for the ANN-based detector was lower than for the classical method.
However, at 8 cm H 2 0 there was no significant difference between the two methods.
Further, the ANN-based detector had the same rate for detecting percent false positive
and percent false negatives at all pressures. A similar comparison for the classical
method showed that at 4 cm H 2 0 the percentage of false positives was less than the
percentage of false negatives; there was no difference at 8 and 13 cm H 2 0. Lack of
dependence on operation pressure for the ANN-based detector is an attractive feature,
as it allows one to apply the same detector across all pressures [24,26].

2.8. ANNs in Biomedical Signal Enhancement


Xue et al. [27] developed an ANN whitening filter to model the lower frequency
components of the ECG signal, which are inherently nonlinear and nonstationary. The
filtered signal, which contained mostly higher frequency QRS energy, was then passed
through a linear, matched filter to detect the location of the QRS complex. This ANN
whitening filter was very effective at removing the nonstationary noise characteristics of
ECG. Using this novel approach, the researchers reported a detection rate of 99.5% for
a very noisy patient record in the MIT/BIH arrhythmia database. The results compared
favorably with the 97.5% detection rate obtained using a linear adaptive whitening
filter and the 96.5% rate achieved with band-pass filtering. This work shows a novel
application of ANNs for ECG signal filtering and recognition. This work also shows
the potential of using ANNs for nonlinear modeling of biomedical signals such as ECG
signals.

2.9. ANNs in Biomedical Signal Compression


ECG data compression is a more constrained problem than image or speech
data compression. With image and speech data, the human eye or ear can act as a
smoothing filter, permitting a certain amount of tolerance for distortion. However, with
the ECG, not only must the overall distortion be low but also certain essential areas of
the ECG signal need to be preserved with the highest possible morphologic fidelity for
accurate diagnosis. This is particularly true for the resting ECG, for which the number
of measurements or parameters used for interpretation is very large. In the exercise
ECG, however, fewer parameters are used for interpretation, but still the high mor-
phologic fidelity of the QRS and T segments is of critical importance. During an
exercise test, the signal is contaminated with more noise such as muscle artifact, motion

artifact, and baseline wander. Visual inspection should also be included in the assess-
ment of the quality of reconstruction. It is important to detect the time of occurrence
and the type of distortion.
With the increase in ECG signal recording and monitoring in clinical diagnosis of
cardiac disorders, an enormous amount of ECG data is collected. If a patient's ECG is
sampled at 500 samples per second and if each sample takes 12 bits, then a typical 30-
second record of the 12-lead ECG requires 270 kbytes and a 12-hour Holter record
requires 129.6 Mbytes of memory. Therefore, there is a clear need for ECG compression
algorithms to store, archive, and transmit this information. Many different compres-
sion algorithms have been developed to achieve this important goal [28].
One such method is based on principal component analysis (PCA) [29]. This is also
known as the Karhunen-Loeve (or the eigenvector) transform (KLT). Basically, it
consists of finding a linear combination of the original signal such that the obtained
signals are orthogonal and their variance is maximized. KLT is an optimal transform in
the sense that the least number of orthonormal functions is needed to represent the
original signal for a given root mean square (rms) error. Moreover, the PCA results in
uncorrelated transform coefficients (diagonal covariance matrix) and minimizes the
total entropy compared with any other transform. The main drawback of this technique
is that it requires the computation of the eigenvalues and eigenvectors of the correlation
matrix of the data set, which, in the case of ECG signals, is a very large matrix.
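A brief sketch of transform compression with the KLT/PCA follows; the segment length, the number of retained components, and the synthetic beats are illustrative only and are not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "beats": 200 segments of 128 samples sharing one underlying shape
t = np.linspace(0, 1, 128)
template = np.exp(-((t - 0.5) ** 2) / 0.002)            # crude QRS-like bump
beats = template + 0.05 * rng.normal(size=(200, 128))

# KLT/PCA: eigenvectors of the data covariance matrix
mean = beats.mean(axis=0)
cov = np.cov(beats - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
basis = eigvecs[:, ::-1][:, :8]                          # keep 8 principal components

coeffs = (beats - mean) @ basis                          # compressed representation
reconstructed = coeffs @ basis.T + mean

rms_error_percent = 100 * np.sqrt(np.sum((beats - reconstructed) ** 2) / np.sum(beats ** 2))
print(f"8 coefficients per 128-sample beat, rms error = {rms_error_percent:.2f}%")
```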
Neural network implementation of PCA provides a means for unsupervised fea-
ture discovery and dimensionality reduction. Al-Hujazi and Al-Nashash [30] described
a method for the compression of ECG data using the generalized Hebbian algorithm
[31]. This algorithm is based on an interesting observation that a single linear neuron
with a Hebbian-type adaptation rule for its weights can evolve into a filter for the first
principal component of the input distribution. Al-Hujazi et al. extended the algorithm
to produce a multiple Hebbian neural network to reduce the computation time neces-
sary to calculate the principal components of arbitrary size on the input vector. They
found it necessary to implement a multiple neural network due to large variations
among ECG arrhythmias.
The method was tested on normal data as well as data representing three different
types of pathologies obtained from the MIT/BIH ECG database, which resulted in a
compression ratio (CR) up to 30 with a percent root mean square error (PRD) of 5%.
However, it is emphasized that for this method to be useful, the training set must
include all expected arrhythmias.
Hamilton et al. developed a compression algorithm for ECG signals based on an
autoassociative neural network [32]. To achieve compression, they first detected the
QRS complexes and then compressed the ECG signal using an autoassociative ANN,
one in which the input and output patterns are the same. A multilayer perceptron with
the back-propagation learning algorithm was employed. This consisted of a hidden
layer with a reduced number of nodes to produce compression. Since the majority of
beats within a given recording segment have the same gross morphology, by storing an
average waveform and compressing the difference only, optimum gain was made of the
ANN's compression capability. They used network sizes with 360 inputs, 360 outputs,
and a variable number of hidden nodes: 6, 8, 9, 10, 12, 15, 18, 24, 36, and 72. They
showed that the compression ratio of the network was controlled by the ratio of hidden
layer neurons to input-output layer neurons, with the poorest PRD occurring at higher
compression ratios. They achieved a CR of 10 with a PRD of 7.1% for a network size

of 360-36-360. It was also shown that removal and separate storage of the DC offset for
each beat before compression improved the overall system performance. Removal of
the DC component resulted in a CR of 10 with a PRD of 4.6% for a network of size
360-30-360.
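The idea of autoassociative compression (identical input and output patterns, a narrow hidden layer, and removal of an average beat and per-beat DC offset before compression) can be sketched as follows. This is a simplified illustration using a linear bottleneck trained by gradient descent rather than the authors' 360-36-360 back-propagation network, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic beats: 360 samples each, same gross morphology plus noise and a DC offset
t = np.linspace(0, 1, 360)
shape = np.exp(-((t - 0.45) ** 2) / 0.001)
beats = shape + 0.03 * rng.normal(size=(100, 360)) + rng.uniform(0.0, 0.2, size=(100, 1))

dc = beats.mean(axis=1, keepdims=True)          # store the DC offset of each beat separately
avg = (beats - dc).mean(axis=0)                 # average waveform stored once
residual = beats - dc - avg                     # only the difference is compressed

# Linear autoassociative bottleneck (360 -> 36 -> 360), a stand-in for the
# back-propagation autoencoder described in the text
n_hidden = 36
W_enc = rng.normal(scale=0.01, size=(360, n_hidden))
W_dec = rng.normal(scale=0.01, size=(n_hidden, 360))
lr = 0.05
for _ in range(500):
    code = residual @ W_enc
    err = code @ W_dec - residual
    grad_dec = code.T @ err / len(residual)
    grad_enc = residual.T @ (err @ W_dec.T) / len(residual)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

recon_beats = (residual @ W_enc) @ W_dec + avg + dc
prd = 100 * np.sqrt(np.sum((beats - recon_beats) ** 2) / np.sum(beats ** 2))
print(f"compression ratio ~ {360 / n_hidden:.0f}:1, PRD = {prd:.2f}%")
```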
Another popular method of data compression is vector quantization (VQ).
Basically, it involves the creation of a codebook of vectors. Creating the optimum
codebook will achieve the best possible data compression. The basic vector design
uses an encoder to replace an input vector Xn with a vector from the codebook.
Compression of the data is achieved by using the address of the codebook vector in
place of the original vector. If the codebook has C elements, then the rate of the
quantizer, R, in bits/vector is R = log2 C. The compression ratio, CR, in bits/sample
is CR = R/N, where N is the number of samples in each vector. One of the advantages
of VQ is that fractional compression ratios are achievable.
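A minimal vector quantization sketch is shown below; the codebook here is built with a simple k-means pass rather than the Kohonen network discussed later, and the vector length, codebook size, and toy data are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

N = 16                      # samples per vector
C = 64                      # codebook size
R = np.log2(C)              # rate of the quantizer in bits/vector, R = log2(C)
print(f"rate = {R:.0f} bits/vector, {R / N:.3f} bits/sample")

# Toy training data: smooth low-frequency segments standing in for ECG pieces
data = np.cumsum(0.1 * rng.normal(size=(2000, N)), axis=1)

# Very small k-means codebook design (a stand-in for an optimal codebook)
codebook = data[rng.choice(len(data), C, replace=False)].copy()
for _ in range(10):
    idx = np.argmin(((data[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    for c in range(C):
        members = data[idx == c]
        if len(members):
            codebook[c] = members.mean(axis=0)

# Encoding: replace each vector by the address of its nearest codebook entry
addresses = np.argmin(((data[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
reconstructed = codebook[addresses]
```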
Many different techniques are available to create a codebook that best spans the
data of interest [33]. Neural network implementation of VQ provides a means to create
a codebook of vectors that attempt to span the low-frequency components of the ECG
signal. Since these vectors are potentially less informative, less important information
will be distorted. McAuliffe [34] used a Kohonen neural network that adapted the
codebook vectors based on distance measurements and controlled the scope of the
changes based on time. Compression of the signal was achieved by inserting the address
of the codebook vector that best represented the original vector in place of the vector.
The results showed that minimal distortion was introduced into the ECG, with com-
pression ratios ranging from 3 : 1 to 19 : 1, depending on noise content and heart rate.

ADDITIONAL READING AND RELATED MATERIAL

For an overview of the back-propagation algorithm and related MATLAB files,


please refer to the following resources:

• H. Demuth and M. Beale, Neural Network Toolbox for Use with MATLAB,
User's Guide, Chap. 5, fifth printing—version 3, Math Works Inc., 1998.
• MATLAB Neural Network Toolbox, nnet M-files.
• http://www.mathworks.com/ftp/nnetsv5.shtml

For an overview of self-organizing networks and related MATLAB files and codes


please refer to the following resources:
• H. Demuth and M. Beale, Neural Network Toolbox for Use with MATLAB,
User's Guide, Chap. 7, fifth printing—version 3, Math Works Inc., 1998.
• MATLAB Neural Network Toolbox, nnet M-files.
• T. Kohonen, J. Hynninen, J. Kangas, and L. Laaksonen, The self-organizing
map program package. Tech. Rep. version 3.1, Helsinki University of
Technology, Laboratory of Computer and Information Science,
Rakentajanaukio 2 C, SF-02150, Espoo Finland, 1995.

APPENDIX: BACK-PROPAGATION OPTIMIZATION ALGORITHM

Due to its widespread use, we briefly present the steps in training an artificial
neural network using standard back-propagation technique. In this presentation, we
use a network with two hidden layers to illustrate the training process. However, the
method can easily be applied to networks with more than two hidden layers. Consider
the network shown in Figure Al.
It has $n$ input nodes ($X_i$ for $i = 1, 2, \ldots, n$), two hidden layers with $p$ nodes in the
first layer ($Y_i$ for $i = 1, 2, \ldots, p$) and $q$ nodes in the second layer ($Z_i$ for $i = 1, 2, \ldots, q$),
and $m$ output nodes ($T_i$ for $i = 1, 2, \ldots, m$). The activation functions for the hidden and
output nodes are usually selected to have the following properties. They are continuous,
differentiable, and monotonically nondecreasing. One of the most commonly used
activation functions is a bipolar sigmoid function with a range of -1 to 1. This function
can be expressed as

$$f(g) = \frac{2}{1 + e^{-g}} - 1$$
The shape of this function is shown in Figure 2. An additional desirable feature of
this function is that its derivative can be expressed in terms of the function itself. That
is,

$$f'(g) = \frac{1}{2}[1 + f(g)][1 - f(g)]$$

where $f'(g)$ is the derivative of $f$ with respect to $g$. This feature of the function speeds up
the computation of the activation function derivative that is an essential part of the
back-propagation training method.
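The bipolar sigmoid and its self-referential derivative translate directly into code; the small rendering below follows the two expressions above, with the function and variable names chosen for illustration.

```python
import numpy as np

def f(g):
    """Bipolar sigmoid with range (-1, 1): f(g) = 2 / (1 + exp(-g)) - 1."""
    return 2.0 / (1.0 + np.exp(-g)) - 1.0

def f_prime(g):
    """Derivative expressed through the function itself: 0.5 * (1 + f)(1 - f)."""
    fg = f(g)
    return 0.5 * (1.0 + fg) * (1.0 - fg)

print(f(0.0), f_prime(0.0))   # 0.0 at the origin, maximum slope 0.5
```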

Figure A1 Topology of an artificial neural network with two hidden layers.



The process of training the network starts with randomly selecting values for
connecting weights: $u_{11}, u_{12}, \ldots, u_{np}$, $v_{11}, v_{12}, \ldots, v_{pq}$, and $w_{11}, w_{12}, \ldots, w_{qm}$. The
computations that follow the initial selection of these weights can be divided into three
distinct stages, with each stage having a few steps. The first stage is called feed-forward
and it involves three steps. In the first step, each input node receives the signal for
training the network, $x_i$ for $i = 1, 2, \ldots, n$. The input nodes then pass these inputs
through the connecting branches to the first hidden layer as

$$\bar{Y}_h = \sum_{i=1}^{n} x_i u_{ih} \qquad (A1)$$

where $\bar{Y}_h$ represents the input signal to the hidden node $h = 1, 2, \ldots, p$ in the first
hidden layer. In the second step of the feed-forward stage, the outputs of the nodes
in the first hidden layer are computed and applied to the second layer. Specifically,

$$y_h = f(\bar{Y}_h) \qquad \text{for } h = 1, 2, \ldots, p \qquad (A2)$$

and

$$\bar{Z}_j = \sum_{h=1}^{p} y_h v_{hj} \qquad \text{for } j = 1, 2, \ldots, q \qquad (A3)$$

where $\bar{Z}_j$ is the input to the hidden node $j$ in the second hidden layer and

$$z_j = f(\bar{Z}_j) \qquad \text{for } j = 1, 2, \ldots, q \qquad (A4)$$

is the output from node $j$. The third and final step in the feed-forward stage is to
compute the network output using the signals generated by the nodes in the second
layer of the network. That is,

$$\bar{T}_k = \sum_{j=1}^{q} z_j w_{jk} \qquad \text{for } k = 1, 2, \ldots, m \qquad (A5)$$

where $\bar{T}_k$ is the input to the output node $k$ and the output from the node is given by

$$T_k = f(\bar{T}_k) \qquad (A6)$$

The second stage in this method of training the network is called back-propagation of
the error. In this stage, a three-step process is also followed. First, the error for each
output node is computed by

$$e_k = T_k - R_k \qquad \text{for } k = 1, 2, \ldots, m \qquad (A7)$$

where $R_k$ is the desired or target output. The correction factor $\delta_k$ for the weights on the
branches ending at the output node $k$ is computed as

$$\delta_k = e_k f'(\bar{T}_k) \qquad (A8)$$

In addition, the increment for weight correction is obtained from

$$\Delta w_{jk} = \alpha \delta_k z_j \qquad (A9)$$

where $\alpha$ is the learning factor, normally chosen to be less than unity. The next step is to
back-propagate the $\delta_k$'s through the last layer (in this case the second layer) of hidden
nodes to obtain

$$\bar{\delta}_j = \sum_{k=1}^{m} \delta_k w_{jk} \qquad (A10)$$

The correction factors for the nodes in the second layer are computed in a manner
similar to the computation of $\delta_k$ for the output nodes. That is,

$$\delta_j = \bar{\delta}_j f'(\bar{Z}_j) \qquad (A11)$$

and the incremental corrections to the weights ending at node $j$ of the second layer are
computed as

$$\Delta v_{hj} = \alpha \delta_j y_h \qquad (A12)$$

Similarly, the correction for the branch weights connecting to the nodes in the
other hidden layer(s) (in this case the first layer) can be computed as

$$\bar{\delta}_h = \sum_{j=1}^{q} \delta_j v_{hj} \qquad (A13)$$

and

$$\delta_h = \bar{\delta}_h f'(\bar{Y}_h) \qquad (A14)$$

The correction increment for these weights is obtained from

$$\Delta u_{ih} = \alpha \delta_h x_i \qquad (A15)$$

Equations A12 and A15 are often called the delta rule.
The third and final stage in the back-propagation algorithm is updating the
weights. In this stage, the weights on the branches connecting to the nodes are updated
as follows:

$$w_{jk}^{\text{new}} = w_{jk}^{\text{old}} + \Delta w_{jk} \qquad (A16)$$

for each output unit ($j = 1, 2, \ldots, q$; $k = 1, 2, \ldots, m$),

$$v_{hj}^{\text{new}} = v_{hj}^{\text{old}} + \Delta v_{hj} \qquad (A17)$$

for each hidden unit in the second layer ($h = 1, 2, \ldots, p$; $j = 1, 2, \ldots, q$), and

$$u_{ih}^{\text{new}} = u_{ih}^{\text{old}} + \Delta u_{ih} \qquad (A18)$$

for weights connecting to each unit in the first hidden layer ($i = 1, 2, \ldots, n$;
$h = 1, 2, \ldots, p$).
In practice, the network is trained by presenting it with a number of examples.
Consider the training data to consist of input $X(\lambda)$ and output $R(\lambda)$ pairs where

$$X(\lambda) = \begin{bmatrix} x_1(\lambda) \\ x_2(\lambda) \\ \vdots \\ x_n(\lambda) \end{bmatrix} \qquad (A19)$$

and

$$R(\lambda) = \begin{bmatrix} R_1(\lambda) \\ R_2(\lambda) \\ \vdots \\ R_m(\lambda) \end{bmatrix} \qquad (A20)$$
In (A19) and (A20) the argument λ depicts the example number in the training set. For
instance, a training set may have 5000 examples, in which case λ = 1,2,..., 5000. In a
pattern mode of training the network is presented with one example from the training
set at a time and all three stages of the back-propagation training including the updat-
ing of the weights (Eqs. A16 through A18) in the third stage are carried out. The
training process is often repeated by randomizing the order in which the examples
are presented to the network. This can be accomplished either by randomizing all of
the examples in the training set for each pass or by establishing epochs that contain a
subset of the training examples. For instance, if there are 5000 total examples, they may
be grouped into 100 epochs each containing 50 examples. The order in which the epochs
are presented to the network is then randomized.
There are alternative modes of training such as batch mode, in which the weights
are not updated for each example in an epoch; rather they are updated after all exam-
ples within an epoch are presented to the network. However, description of these
methods is beyond the scope of this chapter and the reader is referred to [1] and [35].
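The three stages above (feed-forward, back-propagation of the error, and weight update) are collected in the short pattern-mode training sketch below. The layer sizes, learning factor, and XOR-style data are illustrative rather than drawn from the chapter, and the error is taken as target minus output so that the additive weight update of Eqs. A16 through A18 decreases the error.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(g):                                   # bipolar sigmoid
    return 2.0 / (1.0 + np.exp(-g)) - 1.0

def f_prime(g):
    fg = f(g)
    return 0.5 * (1.0 + fg) * (1.0 - fg)

n, p, q, m = 2, 4, 3, 1                     # inputs, first hidden, second hidden, outputs
U = rng.normal(scale=0.5, size=(n, p))      # u_ih
V = rng.normal(scale=0.5, size=(p, q))      # v_hj
W = rng.normal(scale=0.5, size=(q, m))      # w_jk
alpha = 0.1                                 # learning factor

# Toy training set: XOR with bipolar targets
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
R = np.array([[-0.8], [0.8], [0.8], [-0.8]])

for epoch in range(2000):
    for x, r in zip(X, R):                  # pattern mode: update after each example
        # Stage 1: feed-forward (A1-A6)
        Ybar = x @ U;  y = f(Ybar)
        Zbar = y @ V;  z = f(Zbar)
        Tbar = z @ W;  T = f(Tbar)
        # Stage 2: back-propagation of the error (A7-A15)
        delta_k = (r - T) * f_prime(Tbar)            # target minus output (descent convention)
        delta_j = (delta_k @ W.T) * f_prime(Zbar)
        delta_h = (delta_j @ V.T) * f_prime(Ybar)
        # Stage 3: weight updates (A16-A18)
        W += alpha * np.outer(z, delta_k)
        V += alpha * np.outer(y, delta_j)
        U += alpha * np.outer(x, delta_h)

print(np.round(f(f(f(X @ U) @ V) @ W), 2))   # outputs after training
```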

REFERENCES

[1] S. Haykin, Neural Networks: A Comprehensive Foundation. Macmillan and IEEE Computer
Society, 1994.
[2] Y. S. Tsai, B. N. Huang, and S. F. Tung, An experiment on ECG classification using back-
propagation neural network. Proceedings, Annual International Conference IEEE/EMBS,
pp. 1463-1464, 1990.
[3] J. Bortolan, R. Degani, and J. L. Willems, ECG classification with neural networks and
cluster analysis. Proc. Comput. Cardiol. 177-180, 1991.

[4] Y. H. Hu, W. J. Tompkins, and Q. Xue. Artificial neural network for ECG arrhythmia
monitoring. In Neural Network for Signal Processing II, S. Y. Kang, F. Fallside, J. A.
Sorenson, and C. A. Kamm, eds., pp. 350-359, Piscataway, NJ: IEEE Press, 1992.
[5] J. Nandal and M. de C. Bossan, Classification of cardiac arrhythmias based on principal
component analysis and feedforward neural networks. Comput. Cardiol. 341-344, 1993.
[6] J. P. Marques de Sa, A. P. Goncalves, F. O. Ferreira, and C. Abreu-Lima, Comparison of
artificial neural network based ECG classifiers using different feature types. Comput.
Cardiol. 545-547, 1994.
[7] Y. H. Hu, W. J. Tompkins, J. L. Urrusti, and V. X. Alfonso, Applications of artificial neural
networks for ECG signal detection and classification. J. Electrocardiol. 26(Suppl): 66-73,
1993.
[8] R. Watrous and G. Towell, A patient-adaptive neural network ECG patient monitoring
algorithm. Comput. Cardiol. 229-232, 1995.
[9] S. Barso, M. Fernandez-Delago, J. A. Vial-Sobrino, C. V. Reguerio, and E. Sanchez,
Classifying multichannel ECG patterns with an adaptive neural network. IEEE/EMB
Mag. 17(1): 45-55, 1998.
[10] W. Trojaborg, Motor unit disorders and myopathies. In Textbook of Clinical
Neurophysiology, M. A. Halliday, R. S. Butler, and R. Paul, eds., pp. 417-438. New
York: Wiley, 1987.
[11] C. S. Pattichis, C. N. Schizas, and L. T. Middleton, Neural network models in EMG
diagnosis. IEEE Trans. Biomed. Eng. 42(5): 486-496, 1995.
[12] J. D. Frost, Automatic recognition and characterization of epileptiform discharges in human
EEG. J. Clin. Neurophysiol. 2(3): 231-249, 1985.
[13] C. Eberhart and R. W. Dobbins, Neural Networks PC Tools—A Practical Guide, Chap. 10,
Case study I: Detection of electroencephalogram spikes. San Diego: Academic Press, 1990.
[14] O. Ozdamar, G. Zhu, I. Yaylali, and P. Jayakar, Real-time detection of EEG spikes using
neural networks. IEEE EMBS 14th International Conference Proceedings, pp. 1022-1023,
1992.
[15] T. Kalayci, and O. Ozdamar, Wavelet preprocessing for automated neural network detec-
tion of EEG spikes, IEEE EMB Mag. 14(2): 160-166, 1995.
[16] Z. Y. Lin, J. D. Z. Chen, and R. W. McCallum, Noninvasive diagnosis of delayed gastric
emptying from cutaneous electrogastrograms using multi-layer feedforward neural net-
works. Gastroenterology, 112(4):A777, 1997.
[17] Z. Y. Lin, H. Liang, J. D. Z. Chen, and R. W. McCallum, Application of combined genetic
algorithms in conjunction with cascade correlation to diagnosis of delayed gastric emptying
from electrogastrograms. IEEE EMBS 19th International Conference Proceedings, pp. 1355-
1358, 1997.
[18] P. Bright, M. R. Miller, J. A. Franklyn, and M. C. Sheppard, The use of a neural network to
detect upper airway obstruction caused by goiter. Am. J. Respir. Crit. Care Med., 157: 1885-
1891, 1998.
[19] Guidelines for the measurement of respiratory function. Recommendations of the BTS and
ARTP. Respir. Med. 88: 165-194, 1994.
[20] J. Cohen, A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20: 37-46,
1960.
[21] J. R. Burk, E. A. Lucas, J. R. Axe, K. Behbehani, and F. Yen, Auto-CPAP in the treatment
of obstructive sleep apnea: A new approach. 1992 Annual Meeting Abstracts, Association of
Professional Sleep Societies 6th Annual Meeting. Sleep Res. 22: 61, 177, 1992.
[22] K. Behbehani and T. Kang, A microprocessor-based sleep apnea ventilator. Proceedings of
11th Annual International Conference of IEEE Engineering in Medicine and Biology, Seattle,
November 1989.

[23] E. A. Lucas, J. R. Burk, J. R. Axe, K. Behbehani, and F. C. Yen, Auto-CPAP in the


treatment of CPAP-using obstructive sleep apnea patients. World Federation of Sleep
Research Societies, September 1995.
[24] K. Behbehani, F. C. Yen, J. R. Burk, E. A. Lucas, and J. R. Axe, Automatic control of
airway pressure for treatment of obstructive sleep apnea. IEEE Trans. Biomed. Eng. 42(10):
1007-1016, 1995.
[25] F. J. Lopez, K. Behbehani, and F. Kamangar, An artificial neural network based snore
detector, Proceedings of the IEEE Engineering in Medicine and Biology Society 16th Annual
International Conference, Baltimore, November 3-6, 1994.
[26] K. Behbehani, F. Lopez, F. C. Yen, E. A. Lucas, J. R. Burk, J. P. Axe, and F. Kamangar,
Pharyngeal wall vibration detection using an artificial neural network. Med. Biol. Eng.
Comput. 35: 193-198, 1997.
[27] Q. Xue, Y. H. Hu, and W. J. Tompkins, Neural-network-based adaptive matched filtering
for QRS detection. IEEE Trans. Biomed. Eng. 39(4): 317-329, 1992.
[28] S. M. S. Jalaleddine, C. G. Hutchens, R. D. Strattan, and W. A. Coberly, ECG data
compression techniques—A unified approach. IEEE Trans. Biomed. Eng. 37(4): 329-343,
1990.
[29] D. F. Elliot and K. R. Rao, Fast Transforms, Algorithms, Analysis and Applications. New
York: Academic Press, 1982.
[30] E. Al-Hujazi and H. Al-Nashash, ECG data compression using Hebbian neural networks. J.
Med. Eng. Technol. 20(6): 211-218, 1996.
[31] E. Oja, A simplified neuron model as a principal component analyzer. J. Math. Biol. 15:
267-273, 1982.
[32] D. J. Hamilton, D. C. Thomson, and W. A. Sandham, ANN compression of morphologi-
cally similar ECG complexes. Med. Biol. Eng. Comput. 33(6): 841-843, 1995.
[33] R. M. Gray, Vector quantization. IEEE ASSP Mag. April 1984.
[34] J. D. McAuliffe, Data compression of the exercise ECG using a Kohonen neural network. J.
Electrocardiol. 26(Suppl): 80-89, 1993.
[35] L. Fausett, Fundamentals of Neural Networks, Architectures, Algorithms and Applications.
Englewood Cliffs, NJ: Prentice Hall, 1994.

Chapter 5

RARE EVENT DETECTION IN GENOMIC SEQUENCES BY NEURAL
NETWORKS AND SAMPLE STRATIFICATION

Wooyoung Choe, Okan K. Ersoy, and Minou Bina

1. INTRODUCTION

Training neural networks to recognize events that occur with low probability is sig-
nificant in many applications. This is a difficult and challenging problem. When we
investigated the use of neural networks to identify regulatory regions in human genomic
sequences, we realized that there are rare events in the sequences. That is, repetitive
DNA sequences, namely Alu regions, represent small portions of genomic sequences,
which consist mostly of non-Alu regions. This results in a shortage of examples. We
propose two schemes to solve these problems by neural networks and sample stratifica-
tion.
Sample stratification is a technique for making each class in a sample have equal
influence during learning of neural networks [1,2]. It is preferable to use a stratified
sample that includes an equal number of examples from each class in the training
sample for classification with neural networks. However, it is usually not possible to
make a sample stratified because we cannot have enough examples in rare event cases.
The first method presented in this chapter stratifies a sample by adding up the weighted
sum of the derivatives during the backward pass of training. The second method uses
bootstrap aggregating. After training neural networks with multiple sets of boot-
strapped examples of rare event classes and subsampled examples of common event
classes, we do multiple voting for classification. These two schemes make rare event
classes have a better chance of being included in the sample for training and improve
the classification accuracy of neural networks. We demonstrate the performance of the
two schemes with real human genomic sequences for locating regulatory regions
obtained from the National Center for Biotechnology Information (NCBI)
Repository. We also compare the results of the proposed methods with those of
Bayesian classifiers with two-dimensional Gaussian-distributed data.

2. SAMPLE STRATIFICATION

If a sample contains examples according to its probability of occurrence, the sample is


termed representative. A representative sample is preferably used in most cases.


However, for classification in which the problem is assigning an example to a class or


category, it is better to use an equal number of examples from each class, although the
probability of occurrence is different from class to class. In that case, the sample is not
representative and is termed stratified. With a stratified sample, examples from small
classes have a better chance of being included than those from large classes.
Consider an extended example in terms of a model to classify human genomic
sequences to see why it is desirable to use a stratified sample for classification. Our data
source is a set of DNA sequences consisting of repetitive and non-repetitive sequences.
Suppose the sample contains 100 examples. If we draw a representative sample, non-
Alu will most likely occur 95 times and Alu will most likely occur 5 times. On the other
hand, we could stratify the sample so as to have 50 examples of each class. Which
sample will produce the best classification accuracy? We anticipate that the stratified
sample will produce the best accuracy. Increasing the number of examples of Alu from
5 to 50 produces a big improvement in accuracy on the Alu's, whereas decreasing the
number of examples of non-Alu from 95 to 50 produces only a small decline in accuracy
on the non-Alu's. The stratified sample is better because the improvement on the rare
events is greater than the loss of accuracy on the common events, even when the test is
made on a representative sample.
Next, suppose that sample size is increased to 1000. We know that increasing
sample size makes the model more accurate. The problem is that we cannot maintain
equal numbers of examples for both events: we would need 500 Alu's, but we have only
50. In this situation, we can just use a sample that includes all of the examples of rare
events and common events of which we have many examples. This produces a sample
with 50 Alu's and 950 non-Alu's. Then there are 19 times as many non-Alu's as Alu's in
our sample. This may cause a problem in training neural networks because the data are
unbalanced. To alleviate this, it is necessary to give the Alu's 19 times as much impact
as the non-Alu's.
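In the situation just described, the stratifying weight is simply the ratio of common-event to rare-event counts. A minimal sketch, using the 950/50 counts from the example above, is given below; variable names are illustrative.

```python
# Counts from the example in the text: 950 non-Alu (common) and 50 Alu (rare) examples
n_common, n_rare = 950, 50

# Stratifying weight applied to each rare-event example during training
c_rare = n_common / n_rare     # 19.0
c_common = 1.0

# Effective influence of each class after weighting
print(n_rare * c_rare, n_common * c_common)   # 950.0 vs 950 -> equal impact
```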

3. STRATIFYING COEFFICIENTS

We can use importance sampling for sample stratification. Importance sampling is a


technique that can be applied to simulations of low-probability events without incur-
ring the computational costs usually associated with such simulations [3-5]. Previous
work in digital communication systems has shown the potential of dramatically low-
ering the computational burden of simulations by utilizing importance sampling. In this
technique, one modifies the probability distribution of the underlying random process
in order to make the rare events occur more frequently. The desired probabilities at the
output of the process are then found by weighting each event by a factor that is a
function of only the state of the input; this factor is independent of the process itself.
Applying the basic idea behind importance sampling technique to neural networks,
Monro et al. [6] used a likelihood ratio weighting function (LRWF) which is used in
neural network learning in the sense of weighted least squares. This weighting function
allows neural networks to be trained utilizing a data set in which the events occur with
high probability, but also successfully classify data in which the events occur with much
lower probability. In this way, they tried to reduce the high computational burden
associated with training neural networks to recognize events that occur with low
probability.

Based on the importance sampling scheme, we can make a sample stratified by


modifying the backward pass through neural networks using backpropagation, where
we accumulate the derivatives of the error with respect to each weight. During the
backward pass, as we accumulate the derivatives, we add up not the sum of the deri-
vatives but their weighted sum. For example, in the previous case, when the example is
an Alu, we add 19 times the derivative, but when the example is a non-Alu, we add just
the derivative. At the end of the epoch, when we change the weights, each Alu will have
19 times as much impact as each non-Alu, and all the Alu's together will have the same
impact as the non-Alu's.
3.1. Derivation of a Modified Back-Propagation Algorithm
We derive a modified form of the back-propagation algorithm that includes a term
to be called stratifying coefficient (SC) and to be denoted by c(x). It is included in the
computation of the error terms associated with the final output layer of the neural
networks. The SC is similar to LRWF, but the basic idea is different. SC adds more
weight to the rare event to make a sample stratified, whereas LRWF gives less weight to
the rare event to make a sample representative.
Consider a specific weighted error, $E_p$, due to the presentation of the input vector $x_p$,
as

$$E_p = \frac{1}{2} \sum_j [D_{pj} - Z_{pj}^N(x_p, w)]^2 c(x_p) \qquad (1)$$

where $D_{pj}$ is the $j$th component of the desired output vector due to the presentation of
input vector $p$. The output of node $j$ of the output layer, which is the $N$th layer, is
denoted as $Z_{pj}^N(x_p, w)$. The SC, $c(x_p)$, evaluated at the present input vector, is decided by
the ratio of the probability of the common event to the probability of the rare event.
The dependence of $Z_{pj}^N$ on the present input vector $x_p$ and the weights, denoted by $w$,
will be suppressed in the following notation.
The output of node $j$ in the $m$th layer due to the presentation of the input vector $p$
is defined as

$$Z_{pj}^m = f(Y_{pj}^m) \qquad (2)$$

where $f(\cdot)$ is a continuously differentiable, nondecreasing, nonlinear activation function
such as a sigmoid. Furthermore, the input to node $j$ of the $m$th layer due to the
presentation of the input vector $p$ is defined as

$$Y_{pj}^m = \sum_i w_{ij}^m Z_{pi}^{m-1} \qquad (3)$$

where $w^m$ denotes the weight matrix between the $m$th and the $(m-1)$st layer of the
networks.
The back-propagation algorithm applies a correction $\Delta w_{ij}^m$ to synaptic weight $w_{ij}^m$,
which is proportional to the instantaneous gradient $\partial E_p / \partial w_{ij}^m$. According to the chain
rule, we may express this gradient as follows:

$$\frac{\partial E_p}{\partial w_{ij}^m} = \frac{\partial E_p}{\partial Y_{pj}^m} \frac{\partial Y_{pj}^m}{\partial w_{ij}^m} \qquad (4)$$

$$\frac{\partial Y_{pj}^m}{\partial w_{ij}^m} = Z_{pi}^{m-1} \qquad (5)$$

The negative of the gradient vector components of the error $E_p$ with respect to $Y_{pj}^m$ are
given by

$$\delta_{pj}^m = -\frac{\partial E_p}{\partial Y_{pj}^m} \qquad (6)$$
Applying the chain rule allows this partial derivative to be written as

$$\delta_{pj}^m = -\frac{\partial E_p}{\partial Y_{pj}^m} = -\frac{\partial E_p}{\partial Z_{pj}^m} \frac{\partial Z_{pj}^m}{\partial Y_{pj}^m} \qquad (7)$$

The second factor can be easily computed from Eq. 2 as

$$\frac{\partial Z_{pj}^m}{\partial Y_{pj}^m} = f'(Y_{pj}^m) \qquad (8)$$

which is simply the first derivative of the activation function evaluated at the present
input to that particular node.
In order to compute the first term, consider two cases. In the first case, the error
signal is developed at the output layer N. This can be computed from Eq. 1 as
$$-\frac{\partial E_p}{\partial Z_{pj}^N} = [D_{pj} - Z_{pj}^N] c(x_p) \qquad (9)$$

Substituting Eqs. 8 and 9 into Eq. 7 yields

$$\delta_{pj}^N = [D_{pj} - Z_{pj}^N] c(x_p) f'(Y_{pj}^N) \qquad (10)$$

For the second case, when computing the error terms for some layer other than the
output layer, the $\delta_{pj}^m$'s can be computed recursively from those associated with the
output layer as

$$-\frac{\partial E_p}{\partial Z_{pj}^m} = -\sum_k \frac{\partial E_p}{\partial Y_{pk}^{m+1}} \frac{\partial Y_{pk}^{m+1}}{\partial Z_{pj}^m} = \sum_k \delta_{pk}^{m+1} w_{jk}^{m+1} \qquad (11)$$
Combining this result with Eq. 6 gives

$$\delta_{pj}^m = f'(Y_{pj}^m) \sum_k \delta_{pk}^{m+1} w_{jk}^{m+1} \qquad (12)$$
These results can be summarized in three equations. First, an input vector, $x_p$, is
propagated through the network until an output is computed for each of the output
nodes of the output layer. These values are denoted as $Z_{pj}^N$. Next, the error terms
associated with the output layer are computed from Eq. 10. The error terms associated
with each of the other $m - 1$ layers of the network are computed from Eq. 12. Finally,
the weights are updated as

$$\Delta_p w_{ij}^m = \eta \delta_{pj}^m Z_{pi}^{m-1} \qquad (13)$$

where η represents the learning rate of the networks. Usually η is chosen to be some
nominal value such as 0.01.
From [7], it is seen that the only change to the back-propagation algorithm is the
inclusion of the stratifying coefficient in Eq. 10. All other steps of the algorithm remain
the same.
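In code, the modification amounts to multiplying the output-layer error of Eq. 10 by $c(x_p)$ before back-propagating. The sketch below shows only that step for a single-hidden-layer network; the layer sizes, learning rate, activation function, and data are hypothetical and are not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

n_in, n_hid, n_out = 8, 6, 1
W1 = rng.normal(scale=0.3, size=(n_in, n_hid))
W2 = rng.normal(scale=0.3, size=(n_hid, n_out))
eta = 0.01

def train_step(x, d, c):
    """One pattern-mode update; c is the stratifying coefficient c(x_p)."""
    global W1, W2
    h = sigmoid(x @ W1)
    z = sigmoid(h @ W2)
    # Output-layer error term (cf. Eq. 10): the SC multiplies the error signal
    delta_out = (d - z) * c * z * (1 - z)
    # Hidden-layer error term (cf. Eq. 12): unchanged by the SC
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    W2 += eta * np.outer(h, delta_out)
    W1 += eta * np.outer(x, delta_hid)

# Rare-event example weighted by the common-to-rare probability ratio (e.g., 19)
x_rare, d_rare = rng.normal(size=n_in), np.array([1.0])
train_step(x_rare, d_rare, c=19.0)
# Common-event example keeps unit weight
x_common, d_common = rng.normal(size=n_in), np.array([0.0])
train_step(x_common, d_common, c=1.0)
```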

3.2. Approximation of A Posteriori Probabilities


Neural networks can provide outputs to approximate a posteriori probabilities,
which can be used for a higher level of decision making [8-11]. We can use conventional
statistical methods such as Parzen density estimation for this purpose, but this requires
many computations and is not reliable with high-dimensional inputs. For example, the
input dimension for human genomic sequences was 256 in our experiments. That is why
approximation of a posteriori probabilities is one important characteristic of neural
networks in human genomic sequence analysis.
We can estimate how the network with the stratifying coefficient scheme approx-
imates a posteriori probabilities in the following way. With a squared-error cost func-
tion, the network parameters are chosen to minimize the following:

$$E = \int \sum_{j=1}^{2} \left\{ \sum_{i=1}^{2} [Z_i(X) - D_i]^2 \right\} f(X, C_j)\, dx \qquad (14)$$

This equation represents a sum of squared errors, with two errors appearing for each
input-class pair. For a particular pair of input $X$ and class $C_j$, each error, $Z_i(X) - D_i$, is
simply the difference of the actual network output $Z_i(X)$ and the corresponding desired
output $D_i$. The two errors are squared, summed, and weighted by the joint probability
$f(X, C_j)$ of the particular input-class pair.
Substituting $f(X, C_j) = f(C_j|X) f(X)$ in Eq. 14 yields

$$E = \int \sum_{j=1}^{2} \left\{ \sum_{i=1}^{2} [Z_i(X) - D_i]^2 f(C_j|X) \right\} f(X)\, dx \qquad (15)$$

The back-propagation algorithm is modified by minimizing the new weighted error
function defined as

$$E_a = \int \sum_{j=1}^{2} \left\{ \sum_{i=1}^{2} [Z_i(X) - D_i]^2 c(X) f(C_j|X) \right\} f(X)\, dx \qquad (16)$$

$$= \int \sum_{j=1}^{2} \left\{ \sum_{i=1}^{2} [Z_i(X) - D_i]^2 f(C_j|X) \right\} c(X) f(X)\, dx \qquad (17)$$

This result can be written as

$$E_a = \int \sum_{j=1}^{2} \left\{ \sum_{i=1}^{2} [Z_i(X) - D_i]^2 f(C_j|X) \right\} f^*(X)\, dx \qquad (18)$$

where $f^*(X) = c(X) f(X)$. In this way, the neural network can be forced to form its
mean-square error estimates of $f(C_j|X)$ according to the distribution $f^*(X)$ rather than
the distribution $f(X)$. Expanding the bracketed expression in Eq. (18) yields

$$E_a = \int \sum_{j=1}^{2} \left\{ \sum_{i=1}^{2} \left[ Z_i^2(X) f(C_j|X) - 2 Z_i(X) D_i f(C_j|X) + D_i^2 f(C_j|X) \right] \right\} f^*(X)\, dx \qquad (19)$$

Exploiting the fact that $Z_i(X)$ is a function only of $X$ and $\sum_{j=1}^{2} f(C_j|X) = 1$ allows Eq.
19 to be expressed as

$$E_a = \int \sum_{i=1}^{2} \left[ Z_i^2(X) - 2 Z_i(X) \sum_{j=1}^{2} D_i f(C_j|X) + \sum_{j=1}^{2} D_i^2 f(C_j|X) \right] f^*(X)\, dx \qquad (20)$$

For a two-class problem, $D_i$ equals one if input $X$ belongs to class $C_i$ and zero other-
wise. Therefore, $\sum_{j=1}^{2} D_i f(C_j|X) = f(C_i|X)$, and

$$E_a = \int \sum_{i=1}^{2} \left[ Z_i^2(X) - 2 Z_i(X) f(C_i|X) + f(C_i|X) \right] f^*(X)\, dx \qquad (21)$$

Adding and subtracting $\sum_{i=1}^{2} f^2(C_i|X)$ in Eq. 21 allows it to be cast in the following form:

$$E_a = \int \sum_{i=1}^{2} \left[ Z_i^2(X) - 2 Z_i(X) f(C_i|X) + f^2(C_i|X) + f(C_i|X) - f^2(C_i|X) \right] f^*(X)\, dx \qquad (22)$$

$$= \int \sum_{i=1}^{2} \left\{ [Z_i(X) - f(C_i|X)]^2 + f(C_i|X) - f^2(C_i|X) \right\} f^*(X)\, dx \qquad (23)$$

$$= \int \sum_{i=1}^{2} [Z_i(X) - f(C_i|X)]^2 f^*(X)\, dx + \int \sum_{i=1}^{2} \left[ f(C_i|X) - f^2(C_i|X) \right] f^*(X)\, dx \qquad (24)$$

Because the second term in Eq. 24 is independent of the network outputs, minimization
of $E_a$ is achieved by choosing network parameters to minimize the first term.
Finally, we get

$$E_a = E_{X^*}\left\{ \sum_{i=1}^{2} [Z_i(X) - f(C_i|X)]^2 \right\} + \Delta \qquad (25)$$

where $E_{X^*}[\cdot]$ is the expected value with respect to the modified distribution $f^*(X)$.
Equation 25 shows that the neural network outputs approximate a posteriori probabilities
based on the modified distribution.

4. BOOTSTRAP STRATIFICATION

We propose an alternative approach to sample stratification: we make smaller sets of


examples from the entire data. Every set includes all the examples of bootstrapped
examples of the smaller class and subsampled examples of the larger class. With
these sets, we train a set of neural networks. Since there is an equal number of occur-
rences of every event, the neural networks spend as much time in learning about the
rare events as about the common events.
This approach comes out of three basic facts: First, bootstrap methods are extre-
mely valuable in situations where data sizes are too small to invoke good results.
Second, a classifier trained with subsampled data does not degrade much compared
with the one trained with complete data. Third, aggregating can improve the perfor-
mance of a classifier.

4.1. Bootstrap Procedures


In general, the bootstrap procedure is a technique for resampling the given data in
order to induce information about the sampling distribution of a classifier [12-14]. This
generates multiple copies of a classifier. Aggregation averages over the copies when
predicting a numerical outcome and does a multiple vote when predicting a class. The
multiple copies are made by building bootstrap replicates of the learning set and using
these as new learning sets. The method can be quite effective, especially, for an
"unstable" learning algorithm for which a small change in the data effects a large
change in the computed hypothesis.
Consider a given set of N examples, each belonging to one of M classes, and a
classifier for a training set of examples. Bootstrap procedures construct multiple classi-
fiers from the examples. The classifier trained on trial $k$ will be denoted by $C_k$, and $C_s$ is
the consensus classifier. $C_k(x)$ and $C_s(x)$ are the classes decided by $C_k$ and $C_s$.
For each trial $k = 1, 2, \ldots, K$, the training set of size $N$ is sampled from the
original examples. This training set is the same size as the original data, but some
examples may not appear in it whereas others appear more than once. A classifier $C_k$
is generated from the sample, and the final classifier $C_s$ is formed by aggregating the $K$
classifiers from these trials. To classify an instance $x$, a vote for class $c$ is recorded by
every classifier for which $C_k(x) = c$, and $C_s(x)$ is then the class with the most votes.
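The resample-train-vote loop just described can be sketched as follows; the base classifier here is a trivial nearest-mean rule standing in for a neural network, and the data, class structure, and number of trials are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic two-class data standing in for feature vectors
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(1.5, 1, (200, 4))])
labels = np.array([0] * 200 + [1] * 200)

def train_nearest_mean(Xtr, ytr):
    """Trivial base classifier C_k: classify to the nearest class mean."""
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    return lambda x: int(np.argmin(((x - means) ** 2).sum(axis=-1)))

K = 25                                      # number of bootstrap trials
N = len(X)
classifiers = []
for k in range(K):
    idx = rng.integers(0, N, size=N)        # sample N cases with replacement
    classifiers.append(train_nearest_mean(X[idx], labels[idx]))

def consensus(x):
    """C_s(x): the class receiving the most votes from the K classifiers."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)

print(consensus(np.array([0.2, 0.1, -0.3, 0.4])), consensus(np.array([1.6, 1.4, 1.7, 1.2])))
```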

4.2. Bootstrapping of Rare Events


The bootstrap technique is suitable for training DNA sequences that are rare
events and have a limited number of examples. Bootstrap procedures can be a way
to overcome the difficulties as follows. Suppose we have DNA sequences S that may
not be enough to train neural networks. Hence, we take bootstrapped samples $\{S_B\}$
from $\{S\}$ and form $\{C(x, S_B)\}$. Finally, let the $\{C(x, S_B)\}$ vote to form $C_B(x)$. The $\{S_B\}$
form replicate data sets, each consisting of $N$ cases, drawn at random, but with replace-
ment, from $S$. Each example may appear repeated times or not at all in any particular
$S_B$. The $\{S_B\}$ are replicate data sets drawn from the bootstrap distribution approx-
imating the distribution from $S$. By doing this, we get multiple versions of learning sets
and gain increased classification accuracy.

4.3. Subsampling of Common Events


We divide common event data into some subsamples to make them balanced with
rare event data. That means sampling without replacement from data to get subsam-
ples. We do not need to use the bootstrap because we have enough examples.
Subsampling can be thought to be even more intuitive than the bootstrap, because
the subsamples are actually samples from the true distribution, whereas the bootstrap
resamples are samples from an estimated distribution.

4.4. Aggregating of Multiple Neural Networks


The earliest attempt at combining multiple networks can be credited to Nilsson
[15] who proposed "committee" machines based on a collection of single-layer net-
works as an attempt to design multilayer neural networks that could classify
complicated data. Hansen and Salamon [16] discussed the application of an ensem-
ble of multilayer neural networks. The parallel self-organizing consensual neural
network was proposed by Valafar and Ersoy [17]. Opitz and Shavlik [18] presented
a technique that searched for a correct and diverse population of neural networks to
be used in ensemble by genetic algorithms. Turner and Ghosh [19] provided an
analytical framework to quantify the improvements in classification results due to
combining.

4.5. The Bootstrap Aggregating Rare Event Neural Networks
The bootstrap aggregating rare event neural network (BARENN) is developed for
the purpose of increasing classification accuracy, avoiding local minima, reducing learn-
ing times, obtaining a high degree of robustness and fault tolerance, and achieving truly
parallel architecture.
The BARENN consists of unit neural networks (UNNs). Each unit is a particular
neural network [20-22]. We can use simple back-propagation (BP) as a UNN, but we
can also use a new type of NN architecture for fast training. Competitive learning and
the least-squares method are known to be faster than other supervised neural networks
for training. So we can use a scheme of competitive neural networks and the least-
squares method. An alternative is EBUDS [23], which combines BP learning and a
direct solution method. Here we can set the number of training epochs smaller than in
ordinary BP for improving the training time.

Figure 1 Bootstrap stratification scheme.

Training Procedure
1. Divide common event data into n subdivisions.
2. Bootstrap rare event data into n data.
3. Train n NNs independently.
4. Combine the output of the individual neural networks by consensus.
The system block diagram is shown in Figure 1.
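The four-step training procedure maps onto code roughly as follows; the unit networks are replaced by a simple placeholder classifier, and the data sizes mirror the Alu/non-Alu example from Section 2. Everything here is an illustrative assumption rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(9)

# Synthetic feature vectors: 950 common (non-Alu) and 50 rare (Alu) examples
X_common = rng.normal(0.0, 1.0, (950, 16))
X_rare = rng.normal(1.0, 1.0, (50, 16))

n_units = 19                                                     # number of unit networks (UNNs)
common_splits = np.array_split(rng.permutation(950), n_units)    # step 1: subdivide common events

unit_models = []
for split in common_splits:
    boot = rng.integers(0, 50, size=len(split))                  # step 2: bootstrap rare events to match
    Xtr = np.vstack([X_common[split], X_rare[boot]])
    ytr = np.array([0] * len(split) + [1] * len(boot))
    # Step 3: train one unit classifier (placeholder nearest-mean rule instead of a UNN)
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    unit_models.append(means)

def barenn_classify(x):
    """Step 4: combine the unit outputs by consensus (majority vote)."""
    votes = [int(((x - means) ** 2).sum(axis=-1).argmin()) for means in unit_models]
    return 1 if sum(votes) > len(votes) / 2 else 0

print(barenn_classify(rng.normal(1.0, 1.0, 16)))   # likely classified as the rare class
```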

5. DATA SET USED IN THE EXPERIMENTS

5.1. Genomic Sequence Data


The first data set used in the experiments consists of genomic sequence data. All
the repetitive DNA sequences (17,073 entries) were taken from Human Repbases in
Jurka's repository [24]. Table 1 shows some of those entries, and Table 2 gives an
example. Table 3 is a list of Alu subclasses and their frequency of occurrence in the
data set.

TABLE 1 Locations of Extracted Repetitive Sequences

HSAL00997. REPBASE; HSAL07247 REPBASE; HSAL13582. REPBASE:


HSAL00998 REPBASE; HSAL07248 REPBASE; HSAL13583 REPBASE;
HSAL00999 REPBASE; HSAL07249. REPBASE; HSAL13584. REPBASE
HSAL01000 REPBASE; HSAL07250. REPBASE; HSAL13585. REPBASE;
HSAL01001. REPBASE; HSAL07251 REPBASE; HSAL13586, REPBASE:
HSAL01002. REPBASE; HSAL07252 REPBASE; HSAL13587 REPBASE;
HSAL01003. REPBASE; HSAL07253. REPBASE; HSAL13588. REPBASE
HSAL01004. REPBASE; HSAL07254. REPBASE: HSAL13589 REPBASE;
HSAL01005. REPBASE; HSAL07255. REPBASE: HSAL13590 REPBASE;
HSAL01007. REPBASE: HSAL07256. REPBASE: HSAL13591 REPBASE;
HSAL01008. REPBASE HSAL07257. REPBASE: HSAL13592. REPBASE;
HSAL01009. REPBASE; HSAL07258. REPBASE; HSAL13593. REPBASE;
HSAL01010. REPBASE; HSAL07259. REPBASE: HSAL13594. REPBASE;
HSAL01011. REPBASE; HSAL07260. REPBASE; HSAL13595. REPBASE;
HSAL01012. REPBASE; HSAL07261. REPBASE; HSAL13596. REPBASE:
HSAL01013. REPBASE; HSAL07262. REPBASE; HSAL13597. REPBASE;
HSAL01014. REPBASE; HSAL07263. REPBASE; HSAL13598. REPBASE:
HSAL01015, REPBASE; HSAL07264. REPBASE; HSAL13599. REPBASE:
HSAL01016. REPBASE; HSAL07265. REPBASE; HSAL13600. REPBASE;
HSAL01017..REPBASE; HSAL07266, REPBASE; HSAL13601. REPBASE
HSAL01018 .REPBASE; HSAL07267, REPBASE; HSAL 13602. REPBASE
HSAL01019 .REPBASE; HSAL07268, REPBASE: HSAL13603. REPBASE;
HSAL01020. REPBASE; HSAL07269. REPBASE; HSAL13604. REPBASE:
HSAL01021. REPBASE; HSAL07270. REPBASE; HSAL13605. REPBASE;
HSAL01022, REPBASE; HSAL07271. REPBASE; HSAL13606. REPBASE;
HSAL01023, REPBASE; HSAL07272. REPBASE: HSAL13607. REPBASE;
HSAL01024, REPBASE; HSAL07273, REPBASE; HSAL13608. REPBASE
HSAL01025 REPBASE; HSAL07274. REPBASE; HSAL 13609. REPBASE
HSAL01026 .REPBASE; HSAL07275. REPBASE: HSAL13610. REPBASE;
HSAL01027,.REPBASE; HSAL07276. REPBASE: HSAL13611. REPBASE;
HSAL01028 REPBASE; HSAL07277. REPBASE; HSAL13612. REPBASE;
HSAL01029 REPBASE; HSAL07278 REPBASE: HSAL13613. REPBASE;
HSAL01030 REPBASE HSAL07279 .REPBASE: HSAL13614. REPBASE:

For unique sequences, in UniGene in the NCBI Repository, all annotated human
unique sequences were extracted (41,120 entries) in protein coding regions. From this
set, entries were discarded if they did not contain the string "complete cds." Finally
6120 entries were obtained. Table 4 shows some of the entries, and an example of the
sequence data is shown in Table 5.

5.2. Normally Distributed Data 1, 2


The second data sets are two-dimensional, Gaussian-distributed patterns labeled 1
and 2. We can express the conditional probability density functions for the two classes
as follows:

$$f(x|C_i) = \frac{1}{2\pi\sigma_i^2} \exp\left(-\frac{1}{2\sigma_i^2}\|x - \mu_i\|^2\right) \qquad \text{for } i = 1, 2$$

TABLE 2 Contents of Repetitive Sequence File

ID HSAL00581 repbase; DNA; PRI; 338 BP.


CC HSAL000581 DNA
XX
AC X67491;
XX
DE Alu repetitive element (Alu-Sp)
XX
KW GLUDP5 gene; glutamate dehydrogenase.
XX
OS Homo sapiens (human)
CC human
OC Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;
OC Theria; Eutheria; Primates; Haplorhini; Catarrhini; Hominidae.
XX
RN [1]
RP 1-338
RC [1] (bases 1 to 2679)
RA Moschonas N.K.;
RT "Direct Submission";
RL Submitted (22-JUL-1992) N.K. Moschonas, Inst. of Molecular Biol. &
RL Biotechnology, Forth Dept. of Biology, Univ. of Crete, P.O. Box
XX
RN [2]
RP 1-338
RC [2] (bases 1 to 2679)
RA Tzimagiorgis G., Leversha M.A., Chroniary K., Goulielmos G.,
RA Sargent C.A., Ferguson-Smith M., Moschonas N.K.;
RT "Structure and expression analysis of a member of the human
RT glutamate dehydrogenase (GLUD) gene family mapped to chromosome";
RL Hum. Genet. 91, 433-438
XX
DR GENBANK; X67491; 85.0.
CC Positions 2135 2472 Accession No X67491 GenBank (rel. 85.0)
XX
FH Key Location/Qualifiers
FH
FT repeat_region 1..338
FT /rpt_family= "Alu-Sp"
XX
SQ Sequence 338 BP; 113 A; 78 C; 88 G; 59 T; 0 other;

Hsal00581 Length: 338 September 11, 1997 12:40 Type: N Check: 1800 ..

1 GGCCAGGCAC GGTAGCTCAT GCCTACAATC CTAGTGCTTT GGGAGGCCAA


51 GGCGGGTGGA TCACCTCAGG TAGGGAGTTT GAGACCAGCC TGACCAACAT
101 GGTGAAACCC CGTCTCTAGT AAAAATACAA AAAATTAGCC GGGCGTGGTG
151 GTGCATGCCT GTAATCCCAG CAACTTGGGA GGCTGAGGCA GGAGAATCAC

TABLE 3 Types and Distributions of Repetitive Alu Sequences


Type Frequency Type Frequency
Sx 962 Sc 871
Jb 1474 Ya8 31
Spqxzg 2210 Sz 924
Scpqxzg 48 FLA 192
Sxzg 1324 Sg 285
J 2345 Sq 451
X 907 Ya5 46
Ya 900 Szg 108
Jo 2186 FLAX 47
S 1175 SbO 24
Sp 550 Spq 13
Cls 939 Total 18012

TABLE 4 Locations of Extracted Unique Sequences

Hs#S222012 Hs#S305515 Hs#S430646 Hs#S553445 Hs#S696622


Hs#S222230 Hs#S305517 Hs#S430922 Hs#S553460 Hs#S697401
Hs#S222333 Hs#S305519 Hs#S431069 Hs#S553465 Hs#S698622
Hs#S222452 Hs#S305520 Hs#S431125 Hs#S553469 Hs#S699401
Hs#S222538 Hs#S305521 Hs#S431235 Hs#S553474 Hs#S700480
Hs#S222677 Hs#S305524 Hs#S431250 Hs#S553475 Hs#S702622
Hs#S223036 Hs#S305525 Hs#S431669 Hs#S553483 Hs#S703776
Hs#S223101 Hs#S305527 Hs#S431709 Hs#S553486 Hs#S704622
Hs#S223216 Hs#S305528 Hs#S432299 Hs#S553489 Hs#S705436
Hs#S223364 Hs#S305529 Hs#S432345 Hs#S553491 Hs#S705459
Hs#S223414 Hs#S305533 Hs#S432389 Hs#S553492 Hs#S705462
Hs#S223604 Hs#S305537 Hs#S432593 Hs#S553495 Hs#S705463
Hs#S223823 Hs#S305540 Hs#S432644 Hs#S553508 Hs#S705467
Hs#S223936 Hs#S305542 Hs#S432862 Hs#S553514 Hs#S705468
Hs#S224051 Hs#S305543 Hs#S432921 Hs#S553516 Hs#S705469
Hs#S224260 Hs#S305544 Hs#S433162 Hs#S553518 Hs#S705470
Hs#S224334 Hs#S305545 Hs#S433310 Hs#S553519 Hs#S705471
Hs#S224420 Hs#S305546 Hs#S433658 Hs#S553520 Hs#S705473
Hs#S224570 Hs#S305552 Hs#S433795 Hs#S553522 Hs#S705474
Hs#S225120 Hs#S305554 Hs#S433954 Hs#S553524 Hs#S705475
Hs#S226034 Hs#S305555 Hs#S434093 Hs#S553527 Hs#S705476
Hs#S226052 Hs#S305610 Hs#S434611 Hs#S553536 Hs#S705477
Hs#S226054 Hs#S305651 Hs#S434763 Hs#S553550 Hs#S705478
Hs#S226058 Hs#S305787 Hs#S434793 Hs#S553556 Hs#S705479
Hs#S226059 Hs#S305999 Hs#S434950 Hs#S553565 Hs#S705480
Hs#S226061 Hs#S306044 Hs#S435293 Hs#S553572 Hs#S705482
Hs#S226062 Hs#S306169 Hs#S435311 Hs#S553583 Hs#S705486
Hs#S226063 Hs#S306396 Hs#S435420 Hs#S553586 Hs#S705487
Hs#S226064 Hs#S306510 Hs#S435893 Hs#S553587 Hs#S705488
Hs#S226066 Hs#S306647 Hs#S436055 Hs#S553590 Hs#S705489
Hs#S226067 Hs#S307427 Hs#S436136 Hs#S553594 Hs#S705490
Hs#S226070 Hs#S307595 Hs#S436350 Hs#S553600 Hs#S705493
Hs#S226072 Hs#S307797 Hs#S436525 Hs#S553602 Hs#S705495

TABLE 5 Contents of a Unique Sequence File

> gnl|UG|Hs#S582962 Calcium/calmodulin-dependent protein kinase IV /gene=CAMK4
/cyto=5q21-q23 /cds=(77,1498) /gb=D30742 /gi=487908 /ug=Hs.351 /len=1740

GCGGCGGCTGGCGGCCGGCTTCTCGCTCGGGCAGCGGCGGCGGCGGCGGCGGCGGCTTCC
GGAGTCCCGCTGCGAAGATGCTCAAAGTCACGGTGCCCTCCTGCTCCGCCTCGTCCTGCT
CTTCGGTCACCGCCAGTGCGGCCCCGGGGACCGCGAGCCTCGTCCCGGATTACTGGATCG
ACGGCTCCAACAGGGATGCGCTGAGCGATTTCTTCGAGGTGGAGTCGGAGCTGGGACGGG
GTGCTACATCCATTGTGTACAGATGCAAACAGAAGGGGACCCAGAAGCCTTATGCTCTCA
AAGTGTTAAAGAAAACAGTGGACAAAAAAATCGTAAGAACTGAGATAGGAGTTCTTCTTC
GCCTCTCACATCCAAACATTATAAAACTTAAAGAGATATTTGAAACCCCTACAGAAATCA
GTCTGGTCCTAGAACTCGTCACAGGAGGAGAACTGTTTGATAGGATTGTGGAAAAGGGAT
ATTACAGTGAGCGAGATGCTGCAGATGCCGTTAAACAAATCCTGGAGGCAGTTGCTTATC
TACATGAAAATGGGATTGTCCATCGTGATCTCAAACCAGAGAATCTTCTTTATGCAACTC
CAGCCCCAGATGCACCACTCAAAATCGCTGATTTTGGACTCTCTAAAATTGTGGAACATC
AAGTGCTCATGAAGACAGTATGTGGAACCCCAGGGTACTGCGCACCTGAAATTCTTAGAG
GTTGTGCCTATGGACCTGAGGTGGACATGTGGTCTGTAGGAATAATCACCTACATCTTAC
TTTGTGGATTTGAACCATTCTATGATGAAAGAGGCGATCAGTTCATGTTCAGGAGAATTC
TGAATTGTGAATATTACTTTATCTCCCCCTGGTGGGATGAAGTATCTCTAAATGCCAAGG
ACTTGGTCAGAAAATTAATTGTTTTGGATCCAAAGAAACGGCTGACTACATTTCAAGCTC
TCCAGCATCCGTGGGTCACAGGTAAAGCAGCCAATTTTGTACACATGGATACCGCTCAAA
AGAAGCTCCAAGAATTCAATGCCCGGCGTAAGCTTAAGGCAGCGGTGAAGGCTGTGGTGG
CCTCTTCCCGCCTGGGAAGTGCCAGCAGCAGCCATGGCAGCATCCAGGAGAGCCACAAGG
CTAGCCGAGACCCTTCTCCAATCCAAGATGGCAACGAGGACATGAAAGCTATTCCAGAAG
GAGAGAAAATTCAAGGCGATGGGGCCCAAGCCGCAGTTAAGGGGGCACAGGCTGAGCTGA
TGAAGGTGCAAGCCTTAGAGAAAGTTAAAGGTGCAGATATAAATGCTGAAGAGGCCCCCA
AAATGGTGCCCAAGGCAGTGGAGGATGGGATAAAGGTGGCTGACCTGGAACTAGAGGAGG
GCCTAGCAGAGGAGAAGCTGAAGACTGTGGAGGAGGCAGCAGCTCCCAGAGAAGGGCAAG
GAAGCTCTGCTGTGGGTTTTGAAGTTCCACAGCAAGATGTGATCCTGCCAGAGTACTAAA
CAGCTTCCTTCAGATCTGGAAGCCAAACACCGGCATTTTATGTACTTTGTCCTTCAGCAA
GAAAGGTGTGGAAGCATGATATGTACTATAGTGATTCTGTTTTTGAGGTGCAAAAAACAT
ACATATATACCAGTTGGTAATTCTAACTTCAATGCATGTGACTGCTTTATGAAAATAATA
GTGTCTTCTATGGCATGTAATGGATACCTAATACCGATGAGTTAAATCTTGCAAGTTAAC

Figures 2 and 3 show joint scatter plots of classes C1 and C2 for training and for
testing, respectively. Figures 4 and 5 show the joint scatter plots for the second data
set, generated in the following way:

f(x | C_i) = (1 / (2π |Σ_i|^{1/2})) exp( -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) ),   for i = 1, 2

where μ_1 = mean vector = [0 0]^T
      Σ_1 = covariance matrix = [1 0; 0 1]
      μ_2 = mean vector = [4 0]^T
      Σ_2 = covariance matrix = [4 0; 0 4]

Figure 2 Normally distributed synthetic data 1 for training.

Figure 3 Normally distributed synthetic data 1 for testing.



Figure 4 Normally distributed synthetic data 2 for training.

Figure 5 Normally distributed synthetic data 2 for testing.



5.3. Four-Class Synthetic Data


Figures 6 and 7 show joint scatter plots for four-class synthetic Gaussian-distributed
data generated in the following way:

f(x | C_i) = (1 / (2π |Σ_i|^{1/2})) exp( -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) ),   for i = 1, 2, 3, 4

where Σ_i = covariance matrix = [2 0; 0 2] for i = 1, 2, 3, 4
      μ_1 = mean vector = [0 0]^T
      μ_2 = mean vector = [0 5]^T
      μ_3 = mean vector = [5 5]^T
      μ_4 = mean vector = [5 0]^T
We use these data to show the robustness of the stratifying coefficient scheme for
multiclass data.
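The synthetic data sets of Sections 5.2 and 5.3 can be reproduced, up to the random seed, by sampling from multivariate normal distributions with the stated means and covariance matrices. The sketch below is a minimal illustration; the class priors (for example, 5% for the rare class) and the sample size are chosen here purely for illustration.

```python
import numpy as np

def sample_gaussian_classes(means, covs, priors, n_samples, seed=0):
    """Draw labeled samples from a Gaussian mixture with the given class priors."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(means), size=n_samples, p=priors)
    samples = np.array([rng.multivariate_normal(means[c], covs[c]) for c in labels])
    return samples, labels

# Two-class data 2 of Section 5.2, with class 2 treated as the rare event (P*[H1] = 0.05).
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[4.0, 0.0], [0.0, 4.0]])]
x, y = sample_gaussian_classes(means, covs, priors=[0.95, 0.05], n_samples=1000)
print(x.shape, np.bincount(y))
```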

6. EXPERIMENTAL RESULTS

6.1. Experiments with Genomic Sequence Data


Experimental results are provided to test the theoretical results discussed in the
previous sections. The classification performances of the new schemes are compared
with those of a back-propagation (BP) neural network and a BP network incorporating
LRWF. We also present experimental results with other types of neural networks for
fast training, namely a hybrid neural network that combines competitive learning and
delta rule learning, and EBUDS (error back-propagation using direct solution), which
combines back-propagation with the direct least-squares method [23].

Figure 6 Four-class synthetic data for training.

Figure 7 Four-class synthetic data for testing.
In the beginning, we investigated many different neural network architectures in
terms of the number of hidden layers and the number of weights in each layer. The
experimental results presented are obtained using networks with 60 nodes in the input
layers, 32 nodes in the hidden layers, and 2 nodes in the output layers. The networks
were trained by various training methods using 1000, 2000, and 3000 samples. The
learning rate we used in the experiments with BP was 0.01.
In the following tables PD means the detection rate and is defined as the percentage
of rare events that are correctly classified as rare events. Similarly, FAR means the false
alarm rate and is defined as the percentage of the total number of events falsely declared
as rare events.
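Under these definitions, PD and FAR can be computed from a set of true and predicted labels as in the sketch below, where the rare-event class is encoded as 1 and the FAR denominator follows the wording above (the total number of events). This is a minimal illustration, not the authors' evaluation code.

```python
import numpy as np

def detection_and_false_alarm_rates(y_true, y_pred, rare_label=1):
    """Return (PD, FAR) for a set of predictions.

    PD  = fraction of true rare events classified as rare events.
    FAR = fraction of all events falsely declared to be rare events.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rare = (y_true == rare_label)
    pd = np.mean(y_pred[rare] == rare_label) if rare.any() else 0.0
    far = np.mean((y_pred == rare_label) & ~rare)
    return pd, far

# Example: 10 events, 2 of them rare.
y_true = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(detection_and_false_alarm_rates(y_true, y_pred))  # (0.5, 0.1)
```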
Table 6 shows the results for the regular neural network. We can see that the
network does not perform well as the training data size increases. For the BP with
LRWF using the importance-sampling concept, the performance of neural networks
depends on the number of stages. In Table 7, we see similar results with the two-stage
BP with LRWF.
Table 8 shows that the first proposed scheme for rare event detection works well.
Actually, the BP with LRWF and the BP with SC have the same weighting scheme, but
the results are different. The reason is as follows. For the BP with LRWF, the assump-
tion made is that the data sample should be representative, which means every example
in the population has an equal chance of being in it when we make our sample. But as
far as classification is concerned, it is preferable that the data sample be stratified,
which means that examples from small classes have a better chance of being included
than those from large classes. We can confirm this through Tables 7 and 8.
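The sketch below illustrates the stratification idea in its simplest form: the rare class is deliberately oversampled (with replacement) so that it occupies a chosen fraction of the training sample instead of appearing at its natural rate. This is a minimal sketch of stratified sampling, not the exact stratifying-coefficient or bootstrap procedure developed earlier in the chapter; the target fraction and function name are hypothetical.

```python
import numpy as np

def stratified_training_sample(x, y, rare_label=1, rare_fraction=0.25, n_total=1000, seed=0):
    """Build a training set in which the rare class makes up rare_fraction of the sample."""
    rng = np.random.default_rng(seed)
    rare_idx = np.flatnonzero(y == rare_label)
    common_idx = np.flatnonzero(y != rare_label)
    n_rare = int(round(rare_fraction * n_total))
    chosen = np.concatenate([
        rng.choice(rare_idx, size=n_rare, replace=True),        # oversample the rare class
        rng.choice(common_idx, size=n_total - n_rare, replace=True),
    ])
    rng.shuffle(chosen)
    return x[chosen], y[chosen]

# Hypothetical usage, with x and y as NumPy arrays:
# x_tr, y_tr = stratified_training_sample(x, y, rare_fraction=0.25, n_total=1000)
```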
TABLE 6 Performance of a Two-Stage Back-Propagation Neural Network

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
1000             1000             0.9985              0.9305                0.78   0.0616
2000             1000             0.7515              0.95                  0      0
3000             1000             0.7407              0.95                  0      0

(Genomic Data, P[H1]: 0.25, P*[H1]: 0.05)

TABLE 7 Performance of a Two-Stage Back-Propagation Neural Network with LRWF

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
1000             1000             0.999               0.9235                0.78   0.0689
2000             1000             0.772               0.95                  0      0
3000             1000             0.752               0.95                  0      0

(Genomic Data, P[H1]: 0.25, P*[H1]: 0.05)

TABLE 8 Performance of a Two-Stage Back-Propagation Neural Network with SC

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
1000             1000             0.999               0.924                 0.78   0.0684
2000             1000             0.9915              0.923                 0.74   0.0674
3000             1000             0.998               0.924                 0.78   0.0680

(Genomic Data, P[H1]: 0.25, P*[H1]: 0.05)

TABLE 9 Performance of a Hybrid Neural Network (Competitive Learning and Pseudoinverse)

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
1000             1000             0.875               0.954                 0.64   0.0295
2000             1000             0.8765              0.973                 0.72   0.0137

(Genomic Data, P[H1]: 0.25, P*[H1]: 0.05)

Table 9 shows similar results with the hybrid network. We used competitive learn-
ing in the first stage of the hybrid scheme.
Tables 10 and 11 show the performance of the bootstrap stratification. Table 10
shows the results of the BP with bootstrap stratification. This network achieved better
performance than any of the other rare event neural networks. Table 11 shows the outcome of
the hybrid scheme, with bootstrap stratification having better performance than the
simple hybrid scheme. For fast training, we adopted Verma's scheme [23] to reduce
training time, with the performance slightly changed, as shown in Table 12.
TABLE 10 Performance of a Bootstrap Stratification Neural Network (Two-Stage BP)

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
1000             1000             0.999               0.94                  0.81   0.0537
2000             1000             0.998               0.9256                0.80   0.0674
3000             1000             0.998               0.9305                0.80   0.0625

(Genomic Data, P[H1]: 0.25, P*[H1]: 0.05)

TABLE 11 Performance of a Bootstrap Stratification Neural Network (Competitive Learning and Pseudoinverse)

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
1000             1000             0.875               0.967                 0.70   0.0214
2000             1000             0.883               0.929                 0.76   0.0621

(Genomic Data, P[H1]: 0.25, P*[H1]: 0.05)

TABLE 12 Performance of a Bootstrap Stratification Neural Network with Fast Training (200 Epochs)

No. of Tr data   No. of Ts data   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
1000             1000             0.855               0.9295                0.78   0.0625
2000             1000             0.856               0.9318                0.78   0.0612
3000             1000             0.852               0.9538                0.68   0.0332

(Genomic Data, P[H1]: 0.25, P*[H1]: 0.05)

6.2. Experiments with Normally Distributed Data 1


The results shown in Tables 13 through 16 indicate that the bootstrap stratification
method works best in this case. Table 17 shows the failure of a simple Bayesian classifier
[25] on rare event data, although such a classifier works well for ordinary Gaussian-distributed
data. Table 18 shows the improvement obtained when the Bayesian classifier takes the rare
events into account through a minimum-cost rather than a minimum-error criterion.

TABLE 13 Performance of a Two-Stage Back-Propagation Neural Network

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.98                0.963                 0.80   0.0284
0.25    0.05     0.979               0.966                 0.80   0.0253
0.25    0.05     0.979               0.956                 0.80   0.0263

(Synthetic Gaussian Data 1, # of Train Data: 1000, # of Test Data: 1000)

TABLE 14 Performance of a Two-Stage Back-Propagation Neural Network with LRWF

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.978               0.965                 0.80   0.0263
0.25    0.05     0.977               0.966                 0.80   0.0253
0.25    0.05     0.978               0.966                 0.80   0.0253

(Synthetic Gaussian Data 1, # of Train Data: 1000, # of Test Data: 1000)

TABLE 15 Performance of a Two-Stage Back-Propagation Neural Network with SC

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.979               0.965                 0.80   0.0263
0.25    0.05     0.978               0.963                 0.80   0.0284
0.25    0.05     0.979               0.966                 0.80   0.0253

(Synthetic Gaussian Data 1, # of Train Data: 1000, # of Test Data: 1000)

TABLE 16 Performance of a Bootstrap Stratification Neural Network (Two-Stage BP)

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.983               0.960                 0.84   0.0337
0.25    0.05     0.985               0.963                 0.82   0.0295
0.25    0.05     0.984               0.962                 0.84   0.0316

(Synthetic Gaussian Data 1, # of Train Data: 1000, # of Test Data: 1000)

TABLE 17 Performance of a Bayes Classifier for Minimum Error

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.929               0.732                 0.58   0.0053

(Synthetic Gaussian Data 1, # of Train Data: 1000, # of Test Data: 1000)

TABLE 18 Performance of a Bayes Classifier for Minimum Cost

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.967               0.72                  0.80   0.0189

(Synthetic Gaussian Data 1, # of Train Data: 1000, # of Test Data: 1000)



6.3. Experiments with Normally Distributed Data 2


Tables 19 through 24 show that the proposed stratification methods perform better
than regular neural networks without stratification.

TABLE 19 Performance of a Two-Stage Back-Propagation Neural Network

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.872               0.916                 0.66   0.0705
0.25    0.05     0.875               0.941                 0.62   0.0421
0.25    0.05     0.874               0.922                 0.64   0.0632

(Synthetic Gaussian Data 2, # of Train Data: 1000, # of Test Data: 1000)

TABLE 20 Performance of a Two-Stage Back-Propagation Neural Network with LRWF

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.878               0.949                 0.62   0.0337
0.25    0.05     0.718               0.95                  0.0    0.0
0.25    0.05     0.879               0.953                 0.62   0.0295

(Synthetic Gaussian Data 2, # of Train Data: 1000, # of Test Data: 1000)

TABLE 21 Performance of a Two-Stage Back-Propagation Neural Network with SC

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.878               0.947                 0.62   0.0358
0.25    0.05     0.87                0.928                 0.66   0.0579
0.25    0.05     0.875               0.909                 0.68   0.0789

(Synthetic Gaussian Data 2, # of Train Data: 1000, # of Test Data: 1000)

TABLE 22 Performance of a Bootstrap Stratification Neural Network (Two-Stage BP)

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.898               0.916                 0.68   0.0821
0.25    0.05     0.903               0.899                 0.68   0.0895
0.25    0.05     0.900               0.951                 0.62   0.0316

(Synthetic Gaussian Data 2, # of Train Data: 1000, # of Test Data: 1000)

TABLE 23 Performance of a Bayes Classifier for Minimum Error

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.831               0.696                 0.58   0.0074

(Synthetic Gaussian Data 2, # of Train Data: 1000, # of Test Data: 1000)

TABLE 24 Performance of a Bayes Classifier for Minimum Cost

P[H1]   P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD     FAR
0.25    0.05     0.873               0.694                 0.62   0.0221

(Synthetic Gaussian Data 2, # of Train Data: 1000, # of Test Data: 1000)

6.4. Experiments with Four-Class Synthetic Data


Tables 25 to 27 demonstrate the robustness of BP with SC on four-class data.
Simple BP and BP with LRWF fail more frequently than BP with SC. These results
indicate that this scheme is preferable for the detection of rare events in
multiclass data.

TABLE 25 Performance of a Two-Stage Back-Propagation Neural Network

P[H1]     P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD       FAR
0.25      0.05     86.6                97.10                 0.98     0.0036
0.25      0.05     42.55               48.1                  0.36     0.0047
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     51.65               68.85                 0        0.0176
0.25      0.05     86.85               96.95                 0        0
0.25      0.05     39.35               49.55                 0        0
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     39.35               48.15                 0        0
Average            64.23               78.97                 0.1340   0.0026

(Four-class Synthetic Data, # of Train Data: 1000, # of Test Data: 1000)

TABLE 26 Performance of a Two-Stage Back-Propagation Neural Network with LRWF

P[H1]     P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD       FAR
0.25      0.05     86.4                97.15                 0.96     0.0021
0.25      0.05     86.5                97.2                  0.96     0.0014
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     86.4                97.15                 0.94     0.0007
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     86.5                97.2                  0.98     0.0013
0.25      0.05     73.95               95.0                  0        0
0.25      0.05     36.75               47.55                 0.02     0.1003
Average            75.23               91.125                0.386    0.0106

(Four-class Synthetic Data, # of Train Data: 1000, # of Test Data: 1000)

TABLE 27 Performance of a Two-Stage Back-Propagation Neural Network with SC

P[H1]     P*[H1]   Accuracy (resub.)   Accuracy (hold-out)   PD       FAR
0.25      0.05     67.75               51.5                  0.98     0.1779
0.25      0.05     86.8                96.9                  0        0
0.25      0.05     62.75               51.55                 0.98     0.2590
0.25      0.05     99.25               98.3                  0.98     0.0071
0.25      0.05     49.65               49.75                 0        0
0.25      0.05     99.2                98.3                  0.98     0.0081
0.25      0.05     99.1                98.65                 1        0.0049
0.25      0.05     62.65               51.9                  1        0.0703
0.25      0.05     99.1                98.75                 1        0.0038
0.25      0.05     99.1                98.65                 1        0.0039
Average            82.035              79.425                0.79     0.0535

(Four-class Synthetic Data, # of Train Data: 1000, # of Test Data: 1000)

7. CONCLUSIONS
In this chapter, we presented two methods for rare event detection in association with
human genomic sequences as well as synthetically generated data, using neural net-
works and sample stratification. In the first scheme, we used a modification of the
importance-sampling concept, which modifies the probability distribution of the under-
lying random process in order to make rare events occur more frequently. This method
uses a stratifying coefficient multiplying the sum of the derivatives during the backward
pass of training. In the second scheme, we utilized a bootstrap technique. These two
schemes make rare events have a better chance of being included in the sample for
training in order to improve the classification accuracy of neural networks. The results
indicate that the proposed schemes have the potential to improve significantly the
classification performance in recognizing rare events.
More progress is required on the acceptable minimum data size for rare event detection.
We cannot compensate for too small an amount of data: at a certain level of scarcity,
we cannot obtain the desired results even with the proposed techniques. Another
research direction would be toward investigating the relationship between the perfor-
mance and the number of bootstrap replicates. We cannot arbitrarily increase the
number of replicates for better performance without considering complexity problems
as well as saturation of improvement of classification accuracy. We want to find a
reasonable bound for the number of replicates considering the performance and the
complexity of the neural networks.

REFERENCES

[1] W. Choe, O. K. Ersoy, and M. Bina, Detection of rare events by neural networks.
Proceedings of the Artificial Neural Networks in Engineering Conference (ANNIE '98), pp.
5-10, 1998.
[2] M. Smith, Neural Networks for Statistical Modelling. New York: Van Nostrand Reinhold,
1993.

[3] P. M. Hahn and M. C. Jeruchim, Developments in the theory and application of importance
sampling. IEEE Trans. Commun. COM-34(7): 715-719, 1986.
[4] M. C. Jeruchim, P. Balaban, and K. S. Shanmugan, Simulation of Communications Systems.
New York: Plenum, 1992.
[5] R. L. Mitchell, Importance sampling applied to simulation of false alarm statistics. IEEE
Trans. Aerosp. Electron. Syst. AES-17(1): 15-24, 1981.
[6] D. J. Monro, O. K. Ersoy, M. R. Bell, and J. S. Sadowsky, Neural network learning of low-
probability events. IEEE Trans. Aerosp. Electron. Syst. 32(3): 898-910, July 1996.
[7] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed
Processing, Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986.
[8] M. D. Richard and R. P. Lippmann, Neural network classifiers estimate Bayesian a posteriori
probabilities. Neural Comput. 3: 461-483, 1991.
[9] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley, and B. W. Suter, The multilayer
perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans.
Neural Networks 1(4): 296-298, 1990.
[10] E. A. Wan, Neural network classification: A Bayesian interpretation. IEEE Trans. Neural
Networks 1(4): 303-305, 1990.
[11] H. White, Learning in artificial neural networks: A statistical perspective. Neural Comput. 1:
425-464, 1989.
[12] L. Breiman, Bagging predictors. Machine Learning 24: 123-140, 1996.
[13] J. R. Quinlan, Bagging, boosting, and C4.5. Proceedings Fourteenth National Conference on
Artificial Intelligence, pp. 725-730, 1996.
[14] A. M. Zoubir and B. Boashash, The bootstrap and its application in signal processing. IEEE
Signal Process. Mag. January: 56-76, 1998.
[15] N. Nilsson, Learning Machines. New York: McGraw-Hill, 1965.
[16] L. K. Hansen and P. Salamon, Neural network ensembles. IEEE Trans. Pattern Anal.
Machine Intell. 12(10): 993-1001, 1990.
[17] H. Valafar and O. K. Ersoy, Parallel, self-organizing, consensual neural network. Report.
TR-EE 90-56, School of Electrical Engineering, Purdue University, 1990.
[18] D. W. Opitz and J. W. Shavlik, Generating accurate and diverse members of a neural
networks ensemble. In: Advances in Neural Information Processing System, Vol. 8.
Cambridge, MA: MIT Press, 1996.
[19] K. Tumer and J. Ghosh, Theoretical foundations of linear and order statistics combiners for
neural pattern classifiers. TR-95-02-98, Computer and Vision Research Center, University
of Texas at Austin, 1995.
[20] S. Haykin, Neural Networks. Englewood Cliffs, NJ: Macmillan, 1994.
[21] J. Kangas, T. Kohonen, and J. Laaksonen, Variants of self-organizing maps. IEEE Trans.
Neural Networks 1(1): 93-99, 1990.
[22] R. P. Lippmann, Pattern classification using neural networks. IEEE Commun. Mag. 27: 47-
64, 1989.
[23] R. Verma, Fast training of multilayer perceptrons. IEEE Trans. Neural Networks 8(6): 1314-
1320, 1997.
[24] J. Jurka, Repbase update. Genetic Information Research Institute, http://www.girinst.org/
~server/repbase.html, 1997.
[25] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. San Diego: Academic
Press, 1990.

Chapter 6

AN AXIOMATIC APPROACH TO REFORMULATING RADIAL BASIS NEURAL NETWORKS

Nicolaos B. Karayiannis

1. INTRODUCTION

The structure of a radial basis function (RBF) neural network is shown in Figure 1. An
RBF neural network is usually trained to map a vector x_k ∈ R^{n_i} into a vector y_k ∈ R^{n_o},
where the pairs (x_k, y_k), 1 ≤ k ≤ M, form the training set. If this mapping is viewed as a
function in the input space R^{n_i}, learning can be seen as a function approximation
problem. According to this point of view, learning is equivalent to finding a surface
in a multidimensional space that provides the best fit to the training data.
Generalization is therefore synonymous with interpolation between the data points
along the constrained surface generated by the fitting procedure as the optimum
approximation to this mapping.
Broomhead and Lowe [1] were the first to explore the use of radial basis functions
in the design of neural networks and to show how RBF networks model nonlinear
relationships and implement interpolation. Micchelli [2] showed that RBF neural net-
works can produce an interpolating surface that exactly passes through all the pairs of
the training set. However, the exact fit is neither useful nor desirable in applications
because it may produce anomalous interpolation surfaces. Poggio and Girosi [3] viewed
the learning process in an RBF network as an ill-posed problem, in the sense that the
information in the training data is not sufficient to reconstruct uniquely the mapping in
regions where data are not available. From this point of view, learning is closely related
to classical approximation techniques, such as generalized splines and regularization
theory. Park and Sandberg [4,5] proved that RBF networks with one layer of radial
basis functions are capable of universal approximation. Under certain mild conditions
on the radial basis functions, RBF networks are capable of approximating arbitrarily
well any function. Similar proofs also exist in the literature for conventional feed-
forward neural models with sigmoidal nonlinearities [6].
The performance of an RBF network depends on the number and positions of the
radial basis functions, their shapes, and the method used for learning the input-output
mapping. The existing learning strategies for RBF neural networks can be classified as
follows: (1) strategies selecting the radial basis function centers randomly from the
training data [1], (2) strategies employing unsupervised procedures for selecting the
radial basis function centers [7-10], and (3) strategies employing supervised procedures
for selecting the radial basis function centers [3,9,11-15].


Figure 1 A radial basis function neural network.

Broomhead and Lowe [1] suggested that, in the absence of a priori knowledge, the
centers of the radial basis functions can either be distributed uniformly within the
region of the input space for which there is data or chosen to be a subset of the training
points by analogy with strict interpolation. This approach is sensible only if the training
data are distributed in a representative manner for the problem under consideration, an
assumption that is very rarely satisfied in practical applications. Moody and Darken
[10] proposed a hybrid learning process for training RBF networks with Gaussian
radial basis functions, which is widely used in practice. This learning procedure employs
different schemes for updating the output weights, that is, the weights that connect the
radial basis functions with the output units, and the centers of the radial basis func-
tions, that is, the vectors in the input space that represent the prototypes of the input
vectors included in the training set. Moody and Darken used the c-means (or k-means)
clustering algorithm [16] and the "P nearest neighbor" heuristic to determine the positions
and widths of the Gaussian radial basis functions, respectively. The output weights are
updated according to this scheme using a supervised least-mean-squares learning rule.
Poggio and Girosi [3] proposed a fully supervised approach for training RBF neural
networks with Gaussian radial basis functions, which updates the radial basis function
centers together with the output weights. Poggio and Girosi used Green's formulas to
obtain an optimal solution with respect to the objective function and employed gradient
descent to approximate the regularized solution. They also proposed that Kohonen's
self-organizing feature map [17,18] can be used for initializing the radial basis function
centers before gradient descent is used to adjust all the free parameters of the network.
Chen et al. [7,8] proposed a learning procedure for RBF neural networks based on
the orthogonal least squares (OLS) method. The OLS method is used as a forward
regression procedure to select a suitable set of radial basis function centers. In fact,

this approach selects radial basis function centers one by one until an adequate RBF
network has been constructed. Cha and Kassam [11] proposed a stochastic gradient
training algorithm for RBF networks with Gaussian radial basis functions. This algo-
rithm uses gradient descent to update all free parameters of an RBF network, which
include the radial basis function centers, the widths of the Gaussian radial basis func-
tions, and the output weights. Whitehead and Choate [19] proposed an evolutionary
training algorithm for RBF neural networks. In this approach, the centers of the radial
basis functions are governed by space-filling curves whose parameters evolve geneti-
cally. This encoding causes each group of codetermined basis functions to evolve in
order to fit a region of the input space. Roy et al. [20] proposed a set of learning
principles that led to a training algorithm for a network that contains "truncated"
radial basis functions and other types of hidden units. This algorithm uses random
clustering and linear programming to design and train this network with polynomial
time complexity.
Despite the existence of a variety of learning schemes, RBF neural networks are
frequently trained in practice using variations of the learning scheme proposed by
Moody and Darken [10]. According to these hybrid learning schemes, the prototypes
that represent the radial basis function centers are determined separately according to
some unsupervised clustering or vector quantization algorithm and the output weights
are determined by a supervised procedure to implement the desired input-output map-
ping. These approaches were developed as a natural reaction to the long training times
typically associated with the training of conventional feed-forward neural networks
using gradient descent [21]. In fact, these hybrid learning schemes achieve fast training
of RBF neural networks because of the strategy they employ for learning the desired
input-output mapping. However, the same strategy prevents the training set from
participating in the formation of the radial basis function centers, with a negative
impact on the performance of trained RBF neural networks [9]. This created a
wrong impression about the actual capabilities of an otherwise powerful neural
model. The training of RBF networks using gradient descent offers a solution to the
trade-off between performance and training speed. Moreover, such training can make
RBF neural networks serious competitors to classical feed-forward neural networks.
Learning schemes attempting to train RBF networks by fixing the locations of the
radial basis function centers are very slightly affected by the specific form of the radial
basis functions used. On the other hand, the convergence of gradient descent learning
and the performance of the trained RBF networks are both affected rather strongly by
the choice of radial basis functions. The search for admissible radial basis functions
other than the Gaussian function motivated the development of an axiomatic approach
for constructing reformulated RBF neural networks suitable for gradient descent learn-
ing [12-15].
This chapter begins with a review of function approximation models used for
interpolation and points out their relationship with RBF neural networks. An axio-
matic approach provides the basis for reformulating RBF neural networks, which is
accomplished by searching for radial basis functions other than Gaussian. According to
this approach, the construction of admissible RBF models reduces to the selection of
generator functions that satisfy certain properties. The search for potential generator
functions is facilitated by considering the admissibility in the wide and strict sense of
linear and exponential functions. The selection of specific generator functions is based
on criteria related to their behavior when the training of reformulated RBF networks is

performed by gradient descent. This analysis is followed by the presentation of batch


and sequential learning algorithms developed for reformulated RBF networks using
gradient descent. These algorithms are used to train reformulated RBF networks to
classify vowel data.

2. FUNCTION APPROXIMATION MODELS AND RBF NEURAL NETWORKS

There are many similarities between RBF neural networks and function approximation
models used to perform interpolation. Such a function approximation model attempts
to determine a surface in a Euclidean space R^{n_i} that provides the best fit to the data
(x_k, y_k), 1 ≤ k ≤ M, where x_k ∈ X ⊂ R^{n_i} and y_k ∈ R for all k = 1, 2, ..., M. Micchelli
[2] considered the solution of the interpolation problem s(x_k) = y_k, 1 ≤ k ≤ M, by
functions s : R^{n_i} → R of the form

s(x) = Σ_{k=1}^{M} w_k g(||x − x_k||²)   (1)

This formulation treats interpolation as a function approximation problem, with the
function s(·) generated by the fitting procedure as the best approximation to this map-
ping. Given the form of the basis function g(·), the function approximation problem
described by s(x_k) = y_k, 1 ≤ k ≤ M, reduces to determining the weights w_k, 1 ≤ k ≤ M,
associated with the model (1).
The model described by (1) is admissible for interpolation if the basis function g(·)
satisfies certain conditions. Micchelli [2] showed that a function g(·) can be used to solve
this interpolation problem if the M × M matrix G = [g_ij] with entries g_ij = g(||x_i − x_j||²)
is positive definite. The matrix G is positive definite if the function g(·) is completely
monotonic on (0, ∞). A function g(·) is called completely monotonic on (0, ∞) if it is
continuous on (0, ∞) and its ℓth-order derivatives g^{(ℓ)}(x) satisfy (−1)^ℓ g^{(ℓ)}(x) ≥ 0,
∀x ∈ (0, ∞), for ℓ = 0, 1, 2, ....
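For a concrete view of the interpolation problem in (1), the sketch below forms the M × M matrix G with entries g(||x_i − x_j||²) for a completely monotonic choice of g (here the Gaussian g(x) = exp(−x/σ²), used purely as an example) and solves G w = y for the weights. This is an illustrative sketch, not code from the chapter; the data and σ² are arbitrary.

```python
import numpy as np

def interpolation_weights(x, y, g):
    """Solve s(x_k) = y_k for the weights of s(x) = sum_k w_k g(||x - x_k||^2)."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)  # squared distances
    G = g(d2)                                                   # M x M interpolation matrix
    return np.linalg.solve(G, y)

# Example with a Gaussian basis, g(x) = exp(-x / sigma^2).
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 2))
y = np.sin(x[:, 0]) + np.cos(x[:, 1])
w = interpolation_weights(x, y, lambda d2: np.exp(-d2 / 0.5))

# The interpolant reproduces the training targets (up to numerical error).
s = np.exp(-np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1) / 0.5) @ w
print(np.allclose(s, y))  # True
```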
RBF neural network models can be viewed as the natural extension of this formalism.
Consider the function approximation model described by

ŷ = w_0 + Σ_{j=1}^{c} w_j g(||x − v_j||²)   (2)

If the function g(·) satisfies certain conditions, the model (2) can be used to implement a
desired mapping R^{n_i} → R specified by the training set (x_k, y_k), 1 ≤ k ≤ M. This is
usually accomplished by devising a learning procedure for determining its adjustable
parameters. In addition to the weights w_j, 0 ≤ j ≤ c, the adjustable parameters of the
model (2) also include the vectors v_j ∈ V ⊂ R^{n_i}, 1 ≤ j ≤ c. These vectors are determined
during learning as the prototypes of the input vectors x_k, 1 ≤ k ≤ M. The adjustable
parameters of the model (2) are frequently updated by minimizing some measure of the
discrepancy between the expected output y_k of the model to the corresponding input
x_k and its actual response

ŷ_k = w_0 + Σ_{j=1}^{c} w_j g(||x_k − v_j||²)   (3)

for all pairs (x_k, y_k), 1 ≤ k ≤ M, included in the training set.
The function approximation model (2) can be extended to implement any mapping
R^{n_i} → R^{n_o}, n_o ≥ 1, as

ŷ_i = f( w_{i0} + Σ_{j=1}^{c} w_{ij} g(||x − v_j||²) ),   1 ≤ i ≤ n_o   (4)

where f(·) is a function that is nondecreasing, continuous, and differentiable everywhere.
The model (4) describes an RBF neural network with inputs from R^{n_i}, c radial
basis function units, and n_o output units if

g(x²) = φ(x)   (5)

and φ(·) is a radial basis function. In such a case, the response of the network to the
input vector x_k is

ŷ_{i,k} = f( Σ_{j=0}^{c} w_{ij} h_{j,k} ),   1 ≤ i ≤ n_o   (6)

where h_{0,k} = 1, ∀k, and h_{j,k} represents the response of the radial basis function located
at the jth prototype to the input vector x_k, given as

h_{j,k} = φ(||x_k − v_j||) = g(||x_k − v_j||²),   1 ≤ j ≤ c   (7)

The response (6) of the RBF neural network to the input x_k is actually the output of the
upper associative network. When the RBF network is presented with x_k, the input of
the upper associative network is formed by the responses (7) of the radial basis func-
tions located at the prototypes v_j, 1 ≤ j ≤ c, as shown in Figure 1.
The models used in practice to implement RBF neural networks usually contain
linear output units. An RBF model with linear output units can be seen as the special
case of (4) that corresponds to f(x) = x. The choice of a linear function f(·) was mainly
motivated by the hybrid learning schemes originally developed for training RBF neural
networks. Nevertheless, the learning process is only slightly affected by the form of f(·)
if RBF neural networks are trained using learning algorithms based on gradient descent.
Moreover, the form of an admissible function f(·) does not affect the function
approximation capability of the model (4) or the conditions that must be satisfied by
radial basis functions. Finally, the use of a nonlinear sigmoidal function f(·) could make
RBF models stronger competitors to conventional feed-forward neural networks in
some applications, such as applications involving pattern classification.
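The sketch below implements the forward pass of the model in (4), (6), and (7) for the Gaussian choice g(x) = exp(−x/σ²), with a linear output function f(x) = x. The function name, the particular basis function, and the parameter values are illustrative assumptions; the reformulated networks discussed in this chapter admit other choices of g and f.

```python
import numpy as np

def rbf_forward(x, prototypes, weights, sigma2=1.0):
    """Forward pass of an RBF network with f(x) = x.

    x          : (N, n_i) input vectors
    prototypes : (c, n_i) radial basis function centers v_j
    weights    : (c + 1, n_o) output weights, row 0 holding the bias w_{i0}
    """
    d2 = np.sum((x[:, None, :] - prototypes[None, :, :]) ** 2, axis=-1)  # ||x_k - v_j||^2
    h = np.exp(-d2 / sigma2)                                             # h_{j,k} = g(||x_k - v_j||^2)
    h = np.hstack([np.ones((x.shape[0], 1)), h])                         # prepend h_{0,k} = 1
    return h @ weights                                                   # eq. (6) with f(x) = x

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
v = rng.normal(size=(4, 3))
w = rng.normal(size=(5, 2))   # (c + 1) x n_o = 5 x 2
print(rbf_forward(x, v, w).shape)  # (5, 2)
```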

3. REFORMULATING RADIAL BASIS NEURAL NETWORKS

An RBF neural network is often interpreted as a composition of localized receptive
fields. The locations of these receptive fields are determined by the prototypes, and their
shapes are determined by the radial basis functions used. The interpretation often
associated with RBF neural networks imposed some silent restrictions on the selection
of radial basis functions. For example, RBF neural networks often employ the decreasing
Gaussian radial basis function despite the fact that there exist both increasing and
decreasing radial basis functions. The "neural" interpretation of the model (4) can be
the basis of a systematic search for radial basis functions that can be used for refor-
mulating RBF neural networks [12-15]. Such a systematic search is based on mathe-
matical restrictions imposed on radial basis functions by their role in the formation of
receptive fields.
The interpretation of an RBF neural network as a composition of receptive fields
requires that the responses of all radial basis functions to all inputs are always positive.
If the prototypes are interpreted as the centers of receptive fields, it is required that the
response of any radial basis function becomes stronger as the input approaches its
corresponding prototype. Finally, it is required that the response of any radial basis
function becomes more sensitive to an input vector as this input vector approaches its
corresponding prototype.
Let h_{j,k} = g(||x_k − v_j||²) be the response of the jth radial basis function of an RBF
neural network to the input x_k. According to the preceding interpretation of RBF
neural networks, any admissible radial basis function φ(x) = g(x²) must satisfy the
following three axiomatic requirements [12-15]:

Axiom 1: h_{j,k} > 0 for all x_k ∈ X and v_j ∈ V.

Axiom 2: h_{j,k} > h_{j,ℓ} for all x_k, x_ℓ ∈ X and v_j ∈ V such that ||x_k − v_j||² < ||x_ℓ − v_j||².

Axiom 3: If ∇_{x_k} h_{j,k} = ∂h_{j,k}/∂x_k denotes the gradient of h_{j,k} with respect to the
corresponding input x_k, then

||∇_{x_k} h_{j,k}||² / ||x_k − v_j||²  >  ||∇_{x_ℓ} h_{j,ℓ}||² / ||x_ℓ − v_j||²

for all x_k, x_ℓ ∈ X and v_j ∈ V such that ||x_k − v_j||² < ||x_ℓ − v_j||².

These basic axiomatic requirements impose some rather mild mathematical restric-
tions on the search for admissible radial basis functions. Nevertheless, this search can
be further restricted by imposing additional requirements that lead to stronger math-
ematical conditions. For example, it is reasonable to require that the responses of all
radial basis functions to all inputs are bounded, that is, h_{j,k} < ∞, ∀ j, k. On the other
hand, the third axiomatic requirement can be made stronger by requiring that

||∇_{x_k} h_{j,k}||² > ||∇_{x_ℓ} h_{j,ℓ}||²   (8)



if ||x_k − v_j||² < ||x_ℓ − v_j||². Since ||x_k − v_j||² < ||x_ℓ − v_j||²,

||∇_{x_k} h_{j,k}||² / ||x_k − v_j||²  >  ||∇_{x_k} h_{j,k}||² / ||x_ℓ − v_j||²   (9)

If ||∇_{x_k} h_{j,k}||² > ||∇_{x_ℓ} h_{j,ℓ}||² and ||x_k − v_j||² < ||x_ℓ − v_j||², then

||∇_{x_k} h_{j,k}||² / ||x_k − v_j||²  >  ||∇_{x_k} h_{j,k}||² / ||x_ℓ − v_j||²  >  ||∇_{x_ℓ} h_{j,ℓ}||² / ||x_ℓ − v_j||²   (10)

and the third axiomatic requirement is satisfied. This implies that the condition (8) is
stronger than that imposed by the third axiomatic requirement.
The preceding discussion suggests two complementary axiomatic requirements for
radial basis functions [13]:

Axiom 4: h_{j,k} < ∞ for all x_k ∈ X and v_j ∈ V.

Axiom 5: If ∇_{x_k} h_{j,k} = ∂h_{j,k}/∂x_k denotes the gradient of h_{j,k} with respect to the
corresponding input x_k, then

||∇_{x_k} h_{j,k}||² > ||∇_{x_ℓ} h_{j,ℓ}||²

for all x_k, x_ℓ ∈ X and v_j ∈ V such that ||x_k − v_j||² < ||x_ℓ − v_j||².

The selection of admissible radial basis functions can be facilitated by the following
theorem [13]:

Theorem 1: The model described by (4) represents an RBF neural network in
accordance with all five axiomatic requirements if and only if g(·) is a continuous
function on (0, ∞) such that:
1. g(x) > 0, ∀x ∈ (0, ∞).
2. g(x) is a monotonically decreasing function of x ∈ (0, ∞), that is, g′(x) < 0,
∀x ∈ (0, ∞).
3. g′(x) is a monotonically increasing function of x ∈ (0, ∞), that is, g″(x) > 0,
∀x ∈ (0, ∞).
4. lim_{x→0+} g(x) = L, where L is a finite number.
5. d(x) = g′(x) + 2x g″(x) > 0, ∀x ∈ (0, ∞).

A radial basis function is said to be admissible in the wide sense if it satisfies the
three basic axiomatic requirements or, equivalently, the first three conditions of
Theorem 1 [12,14,15]. If a radial basis function satisfies all five axiomatic requirements
or, equivalently, all five conditions of Theorem 1, then it is said to be admissible in the
strict sense [13].
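The conditions of Theorem 1 can also be probed numerically for a candidate function g(·). The sketch below evaluates the five conditions on a grid for the inverse multiquadratic choice g(x) = (x + γ²)^{−1/2} (obtained in Section 4.1 from the linear generator function with m = 3); it is a rough numerical check, not a proof, and the grid and tolerances are arbitrary choices.

```python
import numpy as np

def check_theorem1(g, x=np.linspace(1e-4, 50.0, 20000)):
    """Numerically probe the five conditions of Theorem 1 for a candidate g."""
    gx = g(x)
    g1 = np.gradient(gx, x)          # g'(x)
    g2 = np.gradient(g1, x)          # g''(x)
    return {
        "1: g > 0":              bool(np.all(gx > 0)),
        "2: g decreasing":       bool(np.all(g1 < 0)),
        "3: g' increasing":      bool(np.all(g2 > -1e-6)),
        "4: finite limit at 0+": bool(np.isfinite(g(1e-12))),
        "5: g' + 2x g'' >= 0":   bool(np.all(g1 + 2 * x * g2 > -1e-6)),
    }

gamma2 = 1.0
inverse_multiquadratic = lambda x: (x + gamma2) ** -0.5
print(check_theorem1(inverse_multiquadratic))
# Conditions 1-4 hold; condition 5 fails for small x, as discussed in Section 4.1.
```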
A systematic search for admissible radial basis functions can be facilitated by
considering basis functions of the form φ(x) = g(x²), with g(·) defined in terms of a
generator function g_0(·) as g(x) = (g_0(x))^{1/(1−m)}, m ≠ 1 [12-15]. The selection of genera-
tor functions that lead to admissible radial basis functions can be facilitated by the
following theorem [13]:

Theorem 2: Consider the model (4) and let g(x) be defined in terms of the generator
function g_0(x) that is continuous on (0, ∞) as

g(x) = (g_0(x))^{1/(1−m)},   m ≠ 1   (11)

If m > 1, then this model represents an RBF neural network in accordance with all
five axiomatic requirements if:
1. g_0(x) > 0, ∀x ∈ (0, ∞).
2. g_0(x) is a monotonically increasing function of x ∈ (0, ∞), that is, g_0′(x) > 0,
∀x ∈ (0, ∞).
3. r_0(x) = (m/(m − 1))(g_0′(x))² − g_0(x) g_0″(x) > 0, ∀x ∈ (0, ∞).
4. lim_{x→0+} g_0(x) = L_0 > 0.
5. d_0(x) = g_0(x) g_0′(x) − 2x r_0(x) < 0, ∀x ∈ (0, ∞).
If m < 1, then this model represents an RBF neural network in accordance with all
five axiomatic requirements if:
1. g_0(x) > 0, ∀x ∈ (0, ∞).
2. g_0(x) is a monotonically decreasing function of x ∈ (0, ∞), that is, g_0′(x) < 0,
∀x ∈ (0, ∞).
3. r_0(x) = (m/(m − 1))(g_0′(x))² − g_0(x) g_0″(x) < 0, ∀x ∈ (0, ∞).
4. lim_{x→0+} g_0(x) = L_0 < ∞.
5. d_0(x) = g_0(x) g_0′(x) − 2x r_0(x) > 0, ∀x ∈ (0, ∞).

Any generator function that satisfies the first three conditions of Theorem 2 leads
to admissible radial basis functions in the wide sense [12,14,15]. Admissible radial basis
functions in the strict sense can be obtained from generator functions that satisfy all five
conditions of Theorem 2 [13].

4. ADMISSIBLE GENERATOR FUNCTIONS

This section investigates the admissibility in the wide and strict sense of linear and
exponential generator functions.

4.1. Linear Generator Functions


Consider the function g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = ax + b and m > 1.
Clearly, g_0(x) = ax + b > 0, ∀x ∈ (0, ∞), for all a > 0 and b > 0. Moreover, g_0(x) =
ax + b is a monotonically increasing function if g_0′(x) = a > 0. For g_0(x) = ax + b,
g_0′(x) = a and g_0″(x) = 0, and so

r_0(x) = (m/(m − 1)) a²   (12)

If m > 1, then r_0(x) > 0, ∀x ∈ (0, ∞). Thus, g_0(x) = ax + b is an admissible generator
function in the wide sense (i.e., in the sense that it satisfies the three basic axiomatic
requirements) for all a > 0 and b > 0. Certainly, all combinations of a > 0 and b > 0
also lead to admissible generator functions in the wide sense.
For g_0(x) = ax + b, the fourth axiomatic requirement is satisfied if

lim_{x→0+} g_0(x) = b > 0   (13)

For g_0(x) = ax + b,

d_0(x) = (ax + b) a − 2x (m/(m − 1)) a²   (14)

If m > 1, the fifth axiomatic requirement is satisfied if d_0(x) < 0, ∀x ∈ (0, ∞). For
a > 0, the condition d_0(x) < 0 is satisfied by g_0(x) = ax + b if

x > ((m − 1)/(m + 1)) (b/a)   (15)

Since m > 1, the fifth axiomatic requirement is satisfied only if b = 0 or, equivalently, if
g_0(x) = ax. However, the value b = 0 violates the fourth axiomatic requirement. Thus,
there exists no combination of a > 0 and b > 0 leading to an admissible generator
function in the strict sense of the form g_0(x) = ax + b.
If a = 1 and b = γ², the linear generator function becomes g_0(x) = x + γ².
For this generator function, g(x) = (x + γ²)^{1/(1−m)}. If m = 3, g(x) = (x + γ²)^{−1/2} corre-
sponds to the inverse multiquadratic radial basis function

φ(x) = g(x²) = 1 / (x² + γ²)^{1/2}   (16)

For this generator function, lim_{x→0+} g_0(x) = γ² and lim_{x→0+} g(x) = γ^{2/(1−m)}. Since
m > 1, g(·) is a bounded function if γ takes nonzero values. However, the bound of
g(·) increases and approaches infinity as γ decreases and approaches 0. If m > 1,
the condition d_0(x) < 0 is satisfied by g_0(x) = x + γ² if

x > ((m − 1)/(m + 1)) γ²   (17)

Clearly, the fifth axiomatic requirement is satisfied only for γ = 0, which leads to an
unbounded function g(·) [12,14,15].
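The correspondence between the linear generator function and the inverse multiquadratic basis function can be verified directly, as in the brief sketch below (an illustrative check, with γ = 1 and m = 3 chosen arbitrarily).

```python
import numpy as np

gamma, m = 1.0, 3.0
g0 = lambda x: x + gamma ** 2                 # linear generator function, a = 1, b = gamma^2
g = lambda x: g0(x) ** (1.0 / (1.0 - m))      # g(x) = (g0(x))^{1/(1-m)} = (x + gamma^2)^{-1/2}
phi = lambda x: g(x ** 2)                     # radial basis function phi(x) = g(x^2)

x = np.linspace(0.0, 5.0, 6)
print(np.allclose(phi(x), 1.0 / np.sqrt(x ** 2 + gamma ** 2)))  # True: eq. (16)
```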
Another interesting generator function from a practical point of view can be
obtained from g_0(x) = ax + b by selecting b = 1 and a = δ > 0. For g_0(x) = 1 + δx,
lim_{x→0+} g(x) = lim_{x→0+} g_0(x) = 1. For this choice of parameters, the corresponding
radial basis function φ(x) = g(x²) is bounded by 1, which is also the bound of the
Gaussian radial basis function. If m > 1, the condition d_0(x) < 0 is satisfied by
g_0(x) = 1 + δx if

x > ((m − 1)/(m + 1)) (1/δ)   (18)

For a fixed m > 1, the fifth axiomatic requirement is satisfied in the limit δ → ∞. Thus,
a reasonable choice for δ in practical situations is δ ≫ 1.
The radial basis function that corresponds to the linear generator function g_0(x) =
ax + b and some value of m > 1 can also be obtained from the decreasing function
g_0(x) = 1/(ax + b) combined with an appropriate value of m < 1. As an example, for
m = 3, g_0(x) = ax + b leads to g(x) = (ax + b)^{−1/2}. For a = 1 and b = γ², this genera-
tor function corresponds to the inverse multiquadratic radial basis function φ(x) =
g(x²) = (x² + γ²)^{−1/2}. The inverse multiquadratic radial basis function can also be
obtained using the decreasing generator function g_0(x) = 1/(x + γ²) with m = −1. In general,
the function g(x) = (g_0(x))^{1/(1−m)} corresponding to the increasing generator function
g_0(x) = ax + b and m = m_i > 1 is identical to the function g(x) = (g_0(x))^{1/(1−m)}
corresponding to the decreasing function g_0(x) = 1/(ax + b) and m = m_d if

1/(1 − m_i) = 1/(m_d − 1)   (19)

or, equivalently, if

m_i + m_d = 2   (20)

Since m_i > 1, (20) implies that m_d < 1.
The admissibility of the decreasing generator function g_0(x) = 1/(ax + b) can be
verified by using directly the results of Theorem 2. Consider the function
g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = 1/(ax + b) and m < 1. For all a > 0 and b > 0,
g_0(x) = 1/(ax + b) > 0, ∀x ∈ (0, ∞). Since g_0′(x) = −a/(ax + b)² < 0, ∀x ∈ (0, ∞),
g_0(x) = 1/(ax + b) is a monotonically decreasing function for all a > 0. Since
g_0″(x) = 2a²/(ax + b)³,

r_0(x) = ((2 − m)/(m − 1)) a² / (ax + b)⁴   (21)

For m < 1, r_0(x) < 0, ∀x ∈ (0, ∞), and g_0(x) = 1/(ax + b) is an admissible generator
function in the wide sense.
For g_0(x) = 1/(ax + b),

lim_{x→0+} g_0(x) = 1/b   (22)

which implies that g_0(x) = 1/(ax + b) satisfies the fourth axiomatic requirement unless
b approaches 0, in which case lim_{x→0+} g_0(x) = 1/b = ∞. For g_0(x) = 1/(ax + b),

d_0(x) = (a / (ax + b)⁴) ( ((m − 3)/(m − 1)) ax − b )   (23)

If m < 1, the fifth axiomatic requirement is satisfied if d_0(x) > 0, ∀x ∈ (0, ∞). Since
a > 0, the condition d_0(x) > 0 is satisfied by g_0(x) = 1/(ax + b) if

x > ((1 − m)/(3 − m)) (b/a)   (24)

Once again, the fifth axiomatic requirement is satisfied only for b = 0, a value that violates
the fourth axiomatic requirement.
4.2. Exponential Generator Functions
Consider the function g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = exp(βx), β > 0, and
m > 1. For any β, g_0(x) = exp(βx) > 0, ∀x ∈ (0, ∞). For all β > 0, g_0(x) = exp(βx) is
a monotonically increasing function of x ∈ (0, ∞). For g_0(x) = exp(βx), g_0′(x) = β exp(βx)
and g_0″(x) = β² exp(βx). In this case,

r_0(x) = (1/(m − 1)) (β exp(βx))²   (25)

If m > 1, then r_0(x) > 0, ∀x ∈ (0, ∞). Thus, g_0(x) = exp(βx) is an admissible generator
function in the wide sense for all β > 0.
For g_0(x) = exp(βx), β > 0,

lim_{x→0+} g_0(x) = 1 > 0   (26)

which implies that g_0(x) = exp(βx) satisfies the fourth axiomatic requirement. For
g_0(x) = exp(βx), β > 0,

d_0(x) = (β exp(βx))² (1/β − (2/(m − 1)) x)   (27)

For m > 1, the fifth axiomatic requirement is satisfied if d_0(x) < 0, ∀x ∈ (0, ∞). The
condition d_0(x) < 0 is satisfied by g_0(x) = exp(βx) only if

x > (m − 1)/(2β) = σ²/2 > 0   (28)

where σ² = (m − 1)/β. Regardless of the value of β > 0, g_0(x) = exp(βx) is not an
admissible generator function in the strict sense.
Consider also the function g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = exp(−βx), β > 0, and
m < 1. For any β, g_0(x) = exp(−βx) > 0, ∀x ∈ (0, ∞). For all β > 0, g_0′(x) =
−β exp(−βx) < 0, ∀x ∈ (0, ∞), and g_0(x) = exp(−βx) is a monotonically decreasing
function. Since g_0″(x) = β² exp(−βx),

r_0(x) = (1/(m − 1)) (β exp(−βx))²   (29)

If m < 1, then r_0(x) < 0, ∀x ∈ (0, ∞), and g_0(x) = exp(−βx) is an admissible generator
function in the wide sense for all β > 0.
For g_0(x) = exp(−βx), β > 0,

lim_{x→0+} g_0(x) = 1 < ∞   (30)

which implies that g_0(x) = exp(−βx) satisfies the fourth axiomatic requirement. For
g_0(x) = exp(−βx), β > 0,

d_0(x) = (β exp(−βx))² (−1/β + (2/(1 − m)) x)   (31)

For m < 1, the fifth axiomatic requirement is satisfied if d_0(x) > 0, ∀x ∈ (0, ∞). The
condition d_0(x) > 0 is satisfied by g_0(x) = exp(−βx) if

x > (1 − m)/(2β) = σ²/2   (32)

where σ² = (1 − m)/β. Once again, g_0(x) = exp(−βx) is not an admissible generator
function in the strict sense regardless of the value of β > 0.
It must be emphasized that both increasing and decreasing generator functions
essentially lead to the same radial basis function. If m > 1, the increasing exponential
generator function g_0(x) = exp(βx), β > 0, corresponds to the Gaussian radial basis
function φ(x) = g(x²) = exp(−x²/σ²), with σ² = (m − 1)/β. If m < 1, the decreasing
exponential generator function g_0(x) = exp(−βx), β > 0, also corresponds to the
Gaussian radial basis function φ(x) = g(x²) = exp(−x²/σ²), with σ² = (1 − m)/β. In
fact, the function g(x) = (g_0(x))^{1/(1−m)} corresponding to the increasing generator func-
tion g_0(x) = exp(βx), β > 0, with m = m_i > 1 is identical to the function g(x) =
(g_0(x))^{1/(1−m)} corresponding to the decreasing function g_0(x) = exp(−βx), β > 0, with
m = m_d < 1 if

m_i − 1 = 1 − m_d   (33)

or, equivalently, if

m_i + m_d = 2   (34)

5. SELECTING GENERATOR FUNCTIONS

All possible generator functions considered in the previous section satisfy the three
basic axiomatic requirements, but none of them satisfies all five axiomatic requirements.
In particular, the fifth axiomatic requirement is satisfied only by generator functions of
the form g_0(x) = ax, which violate the fourth axiomatic requirement. Therefore, it is
clear that at least one of the five axiomatic requirements must be compromised in order
to select a generator function. Since the response of the radial basis functions must be
bounded in some function approximation applications, generator functions can be
selected by compromising the fifth axiomatic requirement. Although this requirement

is by itself very restrictive, its implications can be used to guide the search for generator
functions appropriate for gradient descent learning [13].
5.1. The Blind Spot
Since hJ<k = g(\\xk - Vj||2),

VxA* = *'(«** - v,||2)VXt(||xfc - v;||2)


= 2 f '(||x jt -v 7 || 2 )(x fc -v 7 )

The norm of the gradient V^A,· *. can be obtained from (35) as

IIVxA/tll2 = 4||x* - τ,ΙΙV(||Xfc - ν,·ΙΙ2))2


(36)
= 4£(||χ*-ν,·||2)
where b(x) = x(g'(x))2. According to Theorem 1, the fifth axiomatic requirement is
satisfied if and only if d{x) = g'(x) + 2xg"{x) > 0, Vx € (0, oo). Since b(x) = x(g'(x))2,

b'(x) = g'(x)(g'(x) + 2xg"(x))


= g'(x)d(x)

Theorem 1 requires that g(x) be a decreasing function of x € (0, oo), which implies that
g'(x) < 0, Vx € (0, oo). Thus, (37) indicates that the fifth axiomatic requirement is
satisfied if b'{x) < 0, Vx e (0, oo). If this condition is not satisfied, then HV^Ay^H2 is
not a monotonically decreasing function of ||x^ — Vy||2 in the interval (0, oo), as required
by the fifth axiomatic requirement. Given a function g(-) satisfying the three basic
axiomatic requirements, the fifth axiomatic requirement can be relaxed by requiring
that IIV^AyfcH2 is a monotonically decreasing function of ||x* —V/||2 in the interval
(B, oo) for some B > 0. Accordingly to (36), this is guaranteed if the function b(x) =
x(g'(x))2 has a maximum at x = B or, equivalently, if there exists a B > 0 such that b'
(B) = 0 and b"(B) < 0. If B e (0, oo) is a solution of b'(x) = 0 and b"(B) < 0, then
b'(x) > 0, Vx € (0, B), and b'(x) <0,Vxe (B, oo). Thus, || V^A^H2 is an increasing
function of |χ& — τ/||2 for ||χ^ — y,||2 € (0,5) and a decreasing function of ||χ^ - yy||2
for |Xfc - ν;·||2 € (B, oo). For all input vectors xk that satisfy \\xk - y,||2 < B, the norm of
the gradient V^A,·^ corresponding to the y'th radial basis function decreases as χ^
approaches its center that is located at the prototype v,. This is exactly the opposite
of the behavior that would intuitively be expected, given the interpretation of radial
basis functions as receptivefields.As far as gradient descent learning is concerned, the
hypersphere KB = {x € X c R"' : ||x - v||2 e (0, B)} is a "blind spot" for the radial
basis function located at the prototype v. The blind spot provides a measure of the
sensitivity of radial basis functions to input vectors close to their centers.
The blind spot ΊΖ^ corresponding to the linear generator function g0(x) = ax + b
is determined by

*-Ξπ§
Section 5 Selecting Generator Functions 135

The effect of the parameter m on the size of the blind spot is revealed by the behavior of
the ratio (m - l)/(m + 1) viewed as a function of m. Since (m — \)/{m + 1) increases as
the value of m increases, increasing the value of m expands the blind spot. For a fixed
value of m > 1, Blin = 0 only if b = 0. For b ^ 0, 5 Un decreases and approaches 0 as a
increases and approaches infinity. If a = 1 and b = yL, Bün approaches 0 as γ
approaches 0. If a = S and b=\, B^n decreases and approaches 0 as S increases and
approaches infinity.
The blind spot 7£Bex corresponding to the exponential generator function go(x) =
exp(ßx) is determined by

w-1
Be*P=^T2ß (39)

For a fixed value of ß, the blind spot depends exclusively on the parameter m. Once
again, the blind spot corresponding to the exponential generator function expands as
the value of m increases. For a fixed value of m > 1, Bexp decreases and approaches 0 as
ß increases and approaches infinity. For g0(x) = exp(/Jx), g(x) = (go(x))l^l~m) =
exp(—χ/σ2) with σ2 = (m — l)/ß. As a result, the blind spot corresponding to the
exponential generator function approaches 0 only if the width of the Gaussian radial
basis function φ(χ) = g(x2) = exp(—χ2/σ2) approaches 0. Such a range of values of σ
would make it difficult for Gaussian radial basis functions to behave as receptive fields
that can cover the entire input space.
It is clear from (38) and (39) that the blind spot corresponding to the exponential
generator function is much more sensitive to changes of m than that corresponding to
the linear generator function. This can be quantified by computing for both generator
functions the relative sensitivity of B = B(m) with respect to m, defined as

    S_B^m = (∂B/∂m)(m/B)                                              (40)

For the linear generator function g_0(x) = ax + b, ∂B_lin/∂m = (2/(m + 1)²)(b/a) and

    S_B^lin = 2m/((m − 1)(m + 1))                                     (41)

For the exponential generator function g_0(x) = exp(βx), ∂B_exp/∂m = 1/(2β) and

    S_B^exp = m/(m − 1)                                               (42)

Combining (41) and (42) gives

    S_B^exp = ((m + 1)/2) S_B^lin                                     (43)

Since m > 1, S_B^exp > S_B^lin. As an example, for m = 3 the sensitivity with respect to m of
the blind spot corresponding to the exponential generator function is twice that corre-
sponding to the linear generator function.
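To make these formulas concrete, the following minimal Python sketch (not part of the original chapter; function names are illustrative) evaluates the blind-spot radii (38) and (39) and the relative sensitivities (41)-(43):

```python
# Blind-spot radius B and relative sensitivity S_B^m for the two generator functions.

def blind_spot_linear(m, a, b):
    """B_lin = ((m - 1)/(m + 1)) * (b/a) for g0(x) = a*x + b, m > 1 (Eq. 38)."""
    return (m - 1.0) / (m + 1.0) * (b / a)

def blind_spot_exponential(m, beta):
    """B_exp = (m - 1)/(2*beta) for g0(x) = exp(beta*x), m > 1 (Eq. 39)."""
    return (m - 1.0) / (2.0 * beta)

def rel_sensitivity_linear(m):
    """S_B^lin = 2m / ((m - 1)(m + 1)), Eq. (41)."""
    return 2.0 * m / ((m - 1.0) * (m + 1.0))

def rel_sensitivity_exponential(m):
    """S_B^exp = m / (m - 1), Eq. (42)."""
    return m / (m - 1.0)

if __name__ == "__main__":
    m = 3.0
    print(blind_spot_linear(m, a=1.0, b=0.01))     # gamma^2 = 0.01 -> B = 0.005
    print(blind_spot_exponential(m, beta=5.0))     # B = 0.2
    # For m = 3 the exponential blind spot is twice as sensitive to m, Eq. (43):
    print(rel_sensitivity_exponential(m) / rel_sensitivity_linear(m))  # 2.0
```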

5.2. Criteria for Selecting Generator Functions


The response of the radial basis function located at the prototype v_j to training
vectors depends on their Euclidean distance from v_j and the shape of the generator
function used. If the generator function does not satisfy the fifth axiomatic requirement,
the response of the radial basis function located at each prototype exhibits the desired
behavior only if the training vectors are located outside its blind spot. This implies that
the training of an RBF model by a learning procedure based on gradient descent
depends mainly on the sensitivity of the radial basis functions to training vectors out-
side their blind spots. This indicates that the criteria used for selecting generator func-
tions should involve both the shapes of the radial basis functions relative to their blind
spots and the sensitivity of the radial basis functions to input vectors outside their blind
spots. The sensitivity of the response h_{j,k} of the jth radial basis function to any input x_k
can be measured by the norm of the gradient ∇_{x_k} h_{j,k}. Thus, the shape and sensitivity of
the radial basis function located at the prototype v_j are mainly affected by:

1. The value h_{j,k} = g(B) of the response h_{j,k} = g(||x_k − v_j||²) of the jth radial basis
   function at ||x_k − v_j||² = B and the rate at which the response h_{j,k} = g(||x_k − v_j||²)
   decreases as ||x_k − v_j||² increases above B and approaches infinity, and
2. The maximum value attained by the norm of the gradient ∇_{x_k} h_{j,k} at ||x_k − v_j||² = B
   and the rate at which ||∇_{x_k} h_{j,k}||² decreases as ||x_k − v_j||² increases above B and
   approaches infinity.

The criteria that may be used for selecting radial basis functions can be established
by considering the following extreme situation. Suppose the response h_{j,k} = g(||x_k −
v_j||²) diminishes very quickly and the receptive field located at the prototype v_j does
not extend far beyond the blind spot. This can have a negative impact on the function
approximation ability of the corresponding RBF model because the region outside the
blind spot contains the input vectors that affect the implementation of the input-output
mapping as indicated by the sensitivity measure ||∇_{x_k} h_{j,k}||². Thus, a generator function
must be selected in such a way that:

1. The response h_{j,k} and the sensitivity measure ||∇_{x_k} h_{j,k}||² take substantial values
   outside the blind spot before they approach 0, and
2. The response h_{j,k} is sizable outside the blind spot even after the values of ||∇_{x_k} h_{j,k}||²
   become negligible.

The rate at which the response h_{j,k} = g(||x_k − v_j||²) decreases is related to the
"tails" of the functions g(·) that correspond to different generator functions. The use
of a short-tailed function g(·) shrinks the receptive field of the RBF model, whereas the
use of a long-tailed function g(·) increases the overlapping between the receptive fields
located at different prototypes. If g(x) = (g_0(x))^{1/(1−m)} and m > 1, the tail of g(x) is
determined by how fast the corresponding generator function g_0(x) changes as a func-
tion of x. As x increases, the exponential generator function g_0(x) = exp(βx) increases
faster than the linear generator function g_0(x) = ax + b. As a result, the response g(x) =
(g_0(x))^{1/(1−m)} diminishes very quickly if g_0(·) is exponential and slowly if g_0(·) is linear.
The behavior of the sensitivity measure ||∇_{x_k} h_{j,k}||² also depends on the properties
of the function g(·). For h_{j,k} = g(||x_k − v_j||²), ∇_{x_k} h_{j,k} can be obtained from (35) as

    ∇_{x_k} h_{j,k} = −α_{j,k}(x_k − v_j)                             (44)

where

    α_{j,k} = −2 g'(||x_k − v_j||²)                                   (45)

From (44),

    ||∇_{x_k} h_{j,k}||² = ||x_k − v_j||² α_{j,k}²                    (46)

The selection of a specific function g(·) influences the sensitivity measure ||∇_{x_k} h_{j,k}||²
through α_{j,k} = −2 g'(||x_k − v_j||²). If g(x) = (g_0(x))^{1/(1−m)}, then

    g'(x) = (1/(1 − m)) (g_0(x))^{m/(1−m)} g_0'(x)
          = (1/(1 − m)) (g(x))^m g_0'(x)                              (47)

Since h_{j,k} = g(||x_k − v_j||²), α_{j,k} is given by

    α_{j,k} = (2/(m − 1)) (h_{j,k})^m g_0'(||x_k − v_j||²)            (48)

5.3. Evaluation of Linear and Exponential Generator Functions

The criteria presented above are used here for evaluating linear and exponential
generator functions.

5.3.1. Linear Generator Functions

If g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = ax + b and m > 1, the response h_{j,k} = g(||x_k −
v_j||²) of the jth radial basis function to x_k is

    h_{j,k} = (1/(a||x_k − v_j||² + b))^{1/(m−1)}                     (49)

For this generator function, g_0'(x) = a and (48) gives

    α_{j,k} = (2a/(m − 1)) (1/(a||x_k − v_j||² + b))^{m/(m−1)}        (50)

Thus, ||∇_{x_k} h_{j,k}||² can be obtained from (46) as

    ||∇_{x_k} h_{j,k}||² = (2a/(m − 1))² ||x_k − v_j||² (1/(a||x_k − v_j||² + b))^{2m/(m−1)}   (51)
Figures 2a and b show the normalized response (γ²)^{1/(m−1)} h_{j,k} of the jth radial basis
function to the input vector x_k and the normalized sensitivity measure (γ²)^{2/(m−1)}
||∇_{x_k} h_{j,k}||² plotted as functions of ||x_k − v_j||² for g(x) = (g_0(x))^{1/(1−m)}, with
g_0(x) = x + γ², m = 3, for γ² = 0.1 and γ² = 0.01, respectively. In accordance with
the analysis, ||∇_{x_k} h_{j,k}||² increases monotonically as ||x_k − v_j||² increases from 0 to B =
γ²/2 and decreases monotonically as ||x_k − v_j||² increases above B and approaches
infinity. Figure 2 indicates that, regardless of the value of γ, the response h_{j,k} of the
radial basis function located at the prototype v_j is sizable outside the blind spot even
after the values of ||∇_{x_k} h_{j,k}||² become negligible. Thus, the radial basis function located
at the prototype v_j is activated by all input vectors that correspond to substantial values
of ||∇_{x_k} h_{j,k}||².

5.3.2. Exponential Generator Functions

If g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = exp(βx) and m > 1, the response h_{j,k} = g(||x_k
− v_j||²) of the jth radial basis function to x_k is

    h_{j,k} = exp(−||x_k − v_j||²/σ²)                                 (52)

where σ² = (m − 1)/β. For this generator function, g_0'(x) = β exp(βx) = β g_0(x). In this
case, g_0'(||x_k − v_j||²) = β (h_{j,k})^{1−m} and (48) gives

    α_{j,k} = (2/σ²) exp(−||x_k − v_j||²/σ²)                          (53)

Thus, ||∇_{x_k} h_{j,k}||² can be obtained from (46) as

    ||∇_{x_k} h_{j,k}||² = (2/σ²)² ||x_k − v_j||² exp(−2||x_k − v_j||²/σ²)   (54)

Figures 3a and b show the response h_{j,k} = g(||x_k − v_j||²) of the jth radial basis function
to the input vector x_k and the sensitivity measure ||∇_{x_k} h_{j,k}||² plotted as functions of
||x_k − v_j||² for g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = exp(βx), m = 3, for β = 5 and β = 10,
respectively. Once again, ||∇_{x_k} h_{j,k}||² increases monotonically as ||x_k − v_j||² increases
from 0 to B = 1/β and decreases monotonically as ||x_k − v_j||² increases above B and
approaches infinity. Nevertheless, there are some significant differences between the
response h_{j,k} and the sensitivity measure ||∇_{x_k} h_{j,k}||² corresponding to linear and expo-
nential generator functions, as indicated by comparing Figures 2 and 3. If
g_0(x) = exp(βx), then the response h_{j,k} is substantial for the input vectors inside the
blind spot but diminishes very quickly for values of ||x_k − v_j||² above B.

Figure 2 The normalized response (γ²)^{1/(m−1)} h_{j,k} of the jth radial basis function and
the normalized norm of the gradient (γ²)^{2/(m−1)} ||∇_{x_k} h_{j,k}||² plotted as functions of
||x_k − v_j||² for g(x) = (g_0(x))^{1/(1−m)}, with g_0(x) = x + γ², m = 3, and (a) γ² = 0.1,
(b) γ² = 0.01.

Figure 3 The response h_{j,k} of the jth radial basis function and the norm of the
gradient ||∇_{x_k} h_{j,k}||² plotted as functions of ||x_k − v_j||² for g(x) = (g_0(x))^{1/(1−m)},
with g_0(x) = exp(βx), m = 3, and (a) β = 5, (b) β = 10.

In fact, the values of h_{j,k} become negligible even before ||∇_{x_k} h_{j,k}||² approaches
asymptotically zero values. This is in direct contrast to the behavior of the same
quantities corresponding to linear generator functions, which are shown in Figure 2.

6. LEARNING ALGORITHMS BASED ON GRADIENT DESCENT

Reformulated RBF neural networks can be trained to map x_k ∈ ℝ^{n_i} into
y_k = [y_{1,k} y_{2,k} ... y_{n_0,k}]^T ∈ ℝ^{n_0}, where the vector pairs (x_k, y_k), 1 ≤ k ≤ M, form the
training set. If x_k ∈ ℝ^{n_i} is the input to a reformulated RBF network, its response is
ŷ_k = [ŷ_{1,k} ŷ_{2,k} ... ŷ_{n_0,k}]^T, where ŷ_{i,k} is the actual response of the ith output unit to x_k
given by

    ŷ_{i,k} = f(ȳ_{i,k}) = f(w_i^T h_k) = f(Σ_{j=0}^{c} w_{ij} h_{j,k})            (55)

with h_{0,k} = 1, h_{j,k} = g(||x_k − v_j||²), 1 ≤ j ≤ c, h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, and
w_i = [w_{i0} w_{i1} ... w_{ic}]^T. Training is typically based on the minimization of the error
between the actual outputs of the network ŷ_k, 1 ≤ k ≤ M, and the desired responses
y_k, 1 ≤ k ≤ M.
6.1. Batch Learning Algorithms
A reformulated RBF neural network can be trained by minimizing the error

    E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_0} (y_{i,k} − ŷ_{i,k})²          (56)

Minimization of (56) using gradient descent implies that all training examples are
presented to the RBF network simultaneously. Such a training strategy leads to batch
learning algorithms. The update equation for the weight vectors of the upper associa-
tive network can be obtained using gradient descent as [15]

    Δw_p = −η ∇_{w_p} E = η Σ_{k=1}^{M} ε^o_{p,k} h_k                 (57)

where η is the learning rate and ε^o_{p,k} is the output error, given as

    ε^o_{p,k} = f'(ȳ_{p,k})(y_{p,k} − ŷ_{p,k})                        (58)

Similarly, the update equation for the prototypes can be obtained using gradient des-
cent as [15]

    Δv_q = −η ∇_{v_q} E = η Σ_{k=1}^{M} ε^h_{q,k}(x_k − v_q)          (59)

where η is the learning rate and ε^h_{q,k} is the hidden error, defined as

    ε^h_{q,k} = α_{q,k} Σ_{i=1}^{n_0} ε^o_{i,k} w_{iq}                (60)

with α_{q,k} = −2 g'(||x_k − v_q||²). The selection of a specific function g(·) influences the
update of the prototypes through α_{q,k} = −2 g'(||x_k − v_q||²), which is involved in the
calculation of the corresponding hidden error ε^h_{q,k}. Since h_{q,k} = g(||x_k − v_q||²) and
g(x) = (g_0(x))^{1/(1−m)}, α_{q,k} is given by (48) and the hidden error (60) becomes

    ε^h_{q,k} = (2/(m − 1)) (h_{q,k})^m g_0'(||x_k − v_q||²) Σ_{i=1}^{n_0} ε^o_{i,k} w_{iq}   (61)


An RBF neural network can be trained according to the algorithm presented
above in a sequence of adaptation cycles, where an adaptation cycle involves the update
of all adjustable parameters of the network. An adaptation cycle begins by replacing the
current estimate of each weight vector w_p, 1 ≤ p ≤ n_0, by its updated version

    w_p + Δw_p = w_p + η Σ_{k=1}^{M} ε^o_{p,k} h_k                    (62)

Given the learning rate η and the responses h_k of the radial basis functions, these weight
vectors are updated according to the output errors ε^o_{p,k}, 1 ≤ p ≤ n_0. Following the
update of these weight vectors, the current estimate of each prototype v_q, 1 ≤ q ≤ c,
is replaced by

    v_q + Δv_q = v_q + η Σ_{k=1}^{M} ε^h_{q,k}(x_k − v_q)             (63)

For a given value of the learning rate η, the update of v_q depends on the hidden errors
ε^h_{q,k}, 1 ≤ k ≤ M. The hidden error ε^h_{q,k} is influenced by the output errors ε^o_{i,k}, 1 ≤ i ≤ n_0,
and the weights w_{iq}, 1 ≤ i ≤ n_0, through the term Σ_{i=1}^{n_0} ε^o_{i,k} w_{iq}. Thus, the RBF network
is trained according to this scheme by propagating back the output error.
This algorithm can be summarized as follows (a brief code sketch is given after the steps):
1. Select η and ε; initialize {w_{ij}} with zero values; initialize the prototypes v_j, 1 ≤ j ≤ c;
   set h_{0,k} = 1, ∀k.
2. Compute the initial response:
   • h_{j,k} = (g_0(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
3. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_0} (y_{i,k} − ŷ_{i,k})².
4. Set E_old = E.
5. Update the adjustable parameters:
   • ε^o_{i,k} = f'(ȳ_{i,k})(y_{i,k} − ŷ_{i,k}), ∀i, k.
   • ε^h_{j,k} = (2/(m − 1)) g_0'(||x_k − v_j||²)(h_{j,k})^m Σ_{i=1}^{n_0} ε^o_{i,k} w_{ij}, ∀j, k.
   • w_i ← w_i + η Σ_{k=1}^{M} ε^o_{i,k} h_k, ∀i.
   • v_j ← v_j + η Σ_{k=1}^{M} ε^h_{j,k}(x_k − v_j), ∀j.
6. Compute the current response:
   • h_{j,k} = (g_0(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
7. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_0} (y_{i,k} − ŷ_{i,k})².
8. If (E_old − E)/E_old > ε, then go to step 4.
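The following self-contained Python sketch illustrates the batch procedure for linear output units (f(x) = x, so f'(x) = 1) and the linear generator g_0(x) = x + γ². It is meant only to illustrate the update rules (56)-(63); the initialization of the prototypes from randomly chosen training vectors is an assumption, not the chapter's protocol.

```python
import numpy as np

def train_batch(X, Y, c, m=3.0, gamma2=1.0, eta=1e-4, eps=1e-5, max_cycles=1000):
    M, n_i = X.shape
    n_o = Y.shape[1]
    rng = np.random.default_rng(0)
    V = X[rng.choice(M, c, replace=False)].copy()   # initial prototypes (assumption)
    W = np.zeros((n_o, c + 1))                      # output weights start at zero

    def respond(V):
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)        # (M, c)
        H = (d2 + gamma2) ** (1.0 / (1.0 - m))                     # h_{j,k}
        return np.hstack([np.ones((M, 1)), H])                     # prepend bias h_{0,k}

    H = respond(V)
    E_old = 0.5 * ((Y - H @ W.T) ** 2).sum()
    for _ in range(max_cycles):
        Err = Y - H @ W.T                                          # output errors, f' = 1
        # hidden errors, Eq. (61) with g0'(.) = 1 for the linear generator
        Eh = 2.0 / (m - 1.0) * H[:, 1:] ** m * (Err @ W[:, 1:])    # (M, c)
        W = W + eta * Err.T @ H                                    # Eq. (62)
        V = V + eta * np.einsum('kj,kji->ji', Eh,
                                X[:, None, :] - V[None, :, :])     # Eq. (63)
        H = respond(V)
        E = 0.5 * ((Y - H @ W.T) ** 2).sum()
        if (E_old - E) / E_old <= eps:                             # step 8
            break
        E_old = E
    return V, W
```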
6.2. Sequential Learning Algorithms
Reformulated RBF neural networks can also be trained "on line" by sequential
learning algorithms. Such algorithms can be developed by using gradient descent to
minimize the errors

    E_k = (1/2) Σ_{i=1}^{n_0} (y_{i,k} − ŷ_{i,k})²                    (64)

for k = 1, 2, ..., M. The update equation for the weight vectors of the upper associative
network can be obtained using gradient descent as [15]

    Δw_{p,k} = w_{p,k} − w_{p,k−1} = −η ∇_{w_p} E_k = η ε^o_{p,k} h_k             (65)

where w_{p,k−1} and w_{p,k} are the estimates of the weight vector w_p before and after the
presentation of the training example (x_k, y_k), η is the learning rate, and ε^o_{p,k} is the output
error defined in (58). Similarly, the update equation for the prototypes can be obtained
using gradient descent as [15]

    Δv_{q,k} = v_{q,k} − v_{q,k−1} = −η ∇_{v_q} E_k = η ε^h_{q,k}(x_k − v_q)      (66)

where v_{q,k−1} and v_{q,k} are the estimates of the prototype v_q before and after the pre-
sentation of the training example (x_k, y_k), η is the learning rate, and ε^h_{q,k} is the hidden
error defined in (61).

When an adaptation cycle begins, the current estimates of the weight vectors w_p
and the prototypes v_q are stored in w_{p,0} and v_{q,0}, respectively. After an example (x_k, y_k),
1 ≤ k ≤ M, is presented to the network, each weight vector w_p, 1 ≤ p ≤ n_0, is updated
as

    w_{p,k} ← w_{p,k−1} + Δw_{p,k} = w_{p,k−1} + η ε^o_{p,k} h_k      (67)

Following the update of all the weight vectors w_p, 1 ≤ p ≤ n_0, each prototype v_q,
1 ≤ q ≤ c, is updated according to

    v_{q,k} ← v_{q,k−1} + Δv_{q,k} = v_{q,k−1} + η ε^h_{q,k}(x_k − v_q)           (68)

An adaptation cycle is completed in this case after the sequential presentation to the
network of all the examples included in the training set. Once again, the RBF network
is trained according to this scheme by propagating back the output error.
This algorithm can be summarized as follows (a brief code sketch is given after the steps):
1. Select η and ε; initialize {w_{ij}} with zero values; initialize the prototypes v_j, 1 ≤ j ≤ c;
   set h_{0,k} = 1, ∀k.
2. Compute the initial response:
   • h_{j,k} = (g_0(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
3. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_0} (y_{i,k} − ŷ_{i,k})².
4. Set E_old = E.
5. Update the adjustable parameters for all k = 1, 2, ..., M:
   • ε^o_{i,k} = f'(ȳ_{i,k})(y_{i,k} − ŷ_{i,k}), ∀i.
   • w_i ← w_i + η ε^o_{i,k} h_k, ∀i.
   • ε^h_{j,k} = (2/(m − 1)) g_0'(||x_k − v_j||²)(h_{j,k})^m Σ_{i=1}^{n_0} ε^o_{i,k} w_{ij}, ∀j.
   • v_j ← v_j + η ε^h_{j,k}(x_k − v_j), ∀j.
6. Compute the current response:
   • h_{j,k} = (g_0(||x_k − v_j||²))^{1/(1−m)}, ∀j, k.
   • h_k = [h_{0,k} h_{1,k} ... h_{c,k}]^T, ∀k.
   • ŷ_{i,k} = f(w_i^T h_k), ∀i, k.
7. Compute E = (1/2) Σ_{k=1}^{M} Σ_{i=1}^{n_0} (y_{i,k} − ŷ_{i,k})².
8. If (E_old − E)/E_old > ε, then go to step 4.
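A compact sketch of the corresponding on-line update for a single presentation (x_k, y_k) is given below, under the same assumptions as before (f(x) = x and g_0(x) = x + γ²); names and the calling convention are illustrative, not the chapter's reference code.

```python
import numpy as np

def sequential_step(x, y, V, W, m=3.0, gamma2=1.0, eta=1e-4):
    """One presentation of (x, y): update W (Eq. 67) and then the prototypes (Eq. 68)."""
    d2 = ((x[None, :] - V) ** 2).sum(-1)              # ||x - v_j||^2, shape (c,)
    h = np.concatenate(([1.0], (d2 + gamma2) ** (1.0 / (1.0 - m))))
    err_o = y - W @ h                                 # output errors, f'(x) = 1
    W = W + eta * np.outer(err_o, h)                  # w_p <- w_p + eta * eps^o_p * h
    # Hidden errors use the freshly updated weights, matching the order described above.
    err_h = 2.0 / (m - 1.0) * h[1:] ** m * (W[:, 1:].T @ err_o)
    V = V + eta * err_h[:, None] * (x[None, :] - V)   # v_q <- v_q + eta * eps^h_q * (x - v_q)
    return V, W

# One adaptation cycle presents all M examples in sequence:
# for x_k, y_k in zip(X, Y): V, W = sequential_step(x_k, y_k, V, W)
```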

7. GENERATOR FUNCTIONS AND GRADIENT DESCENT LEARNING

The effect of the generator function on gradient descent learning algorithms developed
for reformulated RBF neural networks is essentially related to the criteria established in
Section 5 for selecting generator functions. These criteria were established on the basis
of the response h_{j,k} of the jth radial basis function to an input vector x_k and the norm of
the gradient ∇_{x_k} h_{j,k}, which can be used to measure the sensitivity of the radial basis
function response h_{j,k} to an input vector x_k. Since ∇_{x_k} h_{j,k} = −∇_{v_j} h_{j,k}, (46) gives

    ||∇_{v_j} h_{j,k}||² = ||x_k − v_j||² α_{j,k}²                    (69)

According to (69), the quantity ||x_k − v_j||² α_{j,k}² can also be used to measure the sensitivity
of the response of the jth radial basis function to changes in the prototype v_j that
represents its location in the input space.
The gradient descent learning algorithms presented in Section 6 attempt to train an
RBF neural network to implement a desired input-output mapping by producing
incremental changes of its adjustable parameters, that is, the output weights and the
prototypes. If the responses of the radial basis functions are not substantially affected
by incremental changes of the prototypes, then the learning process reduces to incre-
mental changes of the output weights, and eventually the algorithm trains a single-
layered neural network. Given the limitations of single-layered neural networks [21],
such updates alone are unlikely to implement nontrivial input-output mappings. Thus,
the ability of the network to implement a desired input-output mapping depends to a
large extent on the sensitivity of the responses of the radial basis functions to incre-
mental changes of their corresponding prototypes. This discussion indicates that the
sensitivity measure ||∇_{v_j} h_{j,k}||² is relevant to gradient descent learning algorithms devel-
oped for reformulated RBF neural networks. Moreover, the form of this sensitivity
measure in (69) underlines the significant role of the generator function, whose selection
affects ||∇_{v_j} h_{j,k}||² as indicated by the definition of α_{j,k} in (48). The effect of the generator
function on gradient descent learning is revealed by comparing the response h_{j,k} and the
sensitivity measure ||∇_{v_j} h_{j,k}||² = ||∇_{x_k} h_{j,k}||² corresponding to the linear and exponential
generator functions, which are plotted as functions of ||x_k − v_j||² in Figures 2 and 3,
respectively.
According to Figure 2, the response h_{j,k} of the jth radial basis function to the input
x_k diminishes very slowly outside the blind spot, that is, as ||x_k − v_j||² increases above
B. This implies that the training vector x_k has a nonnegligible effect on the response h_{j,k}
of the radial basis function located at this prototype. The behavior of the sensitivity
measure ||∇_{v_j} h_{j,k}||² outside the blind spot indicates that the update of the prototype v_j
produces significant variations in the input of the upper associative network, which is
trained to implement the desired input-output mapping by updating the output
weights. Figure 2 also reveals the trade-off involved in the selection of the free para-
meter γ in practice. As the value of γ decreases, ||∇_{v_j} h_{j,k}||² attains significantly higher
values. This implies that the jth radial basis function is more sensitive to updates of the
prototype v_j due to input vectors outside its blind spot. The blind spot shrinks as the
value of γ decreases, but ||∇_{v_j} h_{j,k}||² approaches 0 quickly outside the blind spot, that is,
as the value of ||x_k − v_j||² increases above B. This implies that the receptive fields
located at the prototypes shrink, which can have a negative impact on gradient descent
learning. Decreasing the value of γ can also affect the number of radial basis functions
required for the implementation of the desired input-output mapping. This is due to
the fact that more radial basis functions are required to cover the input space. The
receptive fields located at the prototypes can be expanded by increasing the value of γ.
However, ||∇_{v_j} h_{j,k}||² becomes flat as the value of γ increases. This implies that very large
values of γ can decrease the sensitivity of the radial basis functions to the input vectors
included in their receptive fields.
According to Figure 3, the response of the jth radial basis function to the input x_k
diminishes very quickly outside the blind spot, that is, as ||x_k − v_j||² increases above B.
This behavior indicates that if an RBF network is constructed using exponential gen-
erator functions, the inputs x_k corresponding to high values of ||∇_{v_j} h_{j,k}||² have no
significant effect on the response of the radial basis function located at the prototype
v_j. As a result, the update of this prototype due to x_k does not produce significant
variations in the input of the upper associative network that implements the desired
input-output mapping. Figure 3 also indicates that the blind spot shrinks as the value
of β increases while ||∇_{v_j} h_{j,k}||² reaches higher values. Decreasing the value of β expands
the blind spot but ||∇_{v_j} h_{j,k}||² reaches lower values. In other words, the selection of the
value of β in practice involves a trade-off similar to that associated with the selection of
the free parameter γ when the radial basis functions are formed by linear generator
functions.

8. EXPERIMENTAL RESULTS

The performance of reformulated RBF neural networks generated by linear and expo-
nential generator functions was evaluated and compared with that of conventional
RBF networks using a set of 2D vowel data formed by computing the first two formants
F1 and F2 from samples of 10 vowels spoken by 67 speakers [22,23]. This data set has
been extensively used to compare different pattern classification approaches because
there is significant overlapping between the points corresponding to different vowels in
the F1-F2 plane [9,22,23]. The available 671 feature vectors were divided into a training
set, containing 338 vectors, and a testing set, containing 333 vectors. The training and
testing sets formed from the 2D vowel data were classified by various RBF neural
networks. The RBF networks tested in these experiments consisted of 2 inputs and
10 linear output units, each representing a vowel. The number of radial basis functions
varied in these experiments from 15 to 50. The RBF networks were trained using a
normalized version of the input data produced by replacing each feature sample x by
x̂ = (x − μ_x)/σ_x, where μ_x and σ_x denote the sample mean and standard deviation of
this feature over the entire data set. The networks were trained to respond with y_{i,k} = 1
and y_{j,k} = 0, ∀j ≠ i, when presented with an input vector x_k ∈ X representing the ith
vowel. The assignment of input vectors to classes was based on a winner-takes-all
strategy. More specifically, each input vector was assigned to the class represented by
the output unit of the trained RBF network with the maximum response.
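The preprocessing and the winner-takes-all assignment described above amount to a few lines of code; the following sketch is illustrative only (the loading of the vowel data itself is not shown):

```python
import numpy as np

def normalize(X):
    """Replace each feature x by (x - mean)/std, computed over the entire data set."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def classify(Y_hat):
    """Assign each input to the class of the output unit with the maximum response."""
    return Y_hat.argmax(axis=1)

def error_rate(Y_hat, labels):
    """Percentage of incorrectly classified feature vectors."""
    return 100.0 * (classify(Y_hat) != labels).mean()
```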
The training set formed from the 2D vowel data was used to train conventional
RBF neural networks with Gaussian radial basis functions. The prototypes represent-
ing the locations of the radial basis functions in the feature space were determined by
the c-means algorithm, and the weights connecting the radial basis functions and the
output units were updated to minimize the error at the output using gradient descent
with learning rate η = 10⁻². The widths of the Gaussian radial basis functions were
determined according to the "closest neighbor" heuristic. More specifically, the width σ_j
of the radial basis function located at the prototype v_j was determined as σ_j =
min_{l≠j}{||v_j − v_l||}. Table 1 summarizes the percentage of feature vectors from the training
and testing sets classified incorrectly by RBF neural networks trained in the three different
trials. These experimental results verify the erratic behavior that is often associated with
RBF neural networks. There was a significant variation in the percentage of classifica-
tion errors recorded on the training and testing sets as the number of radial basis
functions varied from 15 to 50.

TABLE 1 Percentage of Classification Errors Produced in Three Trials on the Training Set (E_train)
and the Testing Set (E_test) Formed from the 2D Vowel Data by Conventional RBF Networks
Containing c Gaussian Radial Basis Functions

          Trial #1                Trial #2                Trial #3
  c   E_train(%)  E_test(%)   E_train(%)  E_test(%)   E_train(%)  E_test(%)
 15      33.7       33.0         31.9       28.2         31.7       27.9
 20      29.6       26.4         28.7       26.4         24.3       19.2
 25      25.4       24.9         26.9       26.4         29.0       26.7
 30      23.7       21.0         25.7       22.5         25.4       23.7
 35      24.8       23.4         24.3       22.5         25.1       22.8
 40      22.8       20.1         27.8       22.8         22.8       20.1
 45      23.7       20.1         21.3       19.5         23.1       19.8
 50      24.3       22.5         24.8       24.6         24.8       24.6

When the number of radial basis functions was relatively small (from 15 to 25), the
performance of the trained RBF networks on both training and testing sets was rather
poor. Their performance improved as the number of radial basis functions increased
above 30. Even in this case, however, the classification errors did not consistently
decrease as the number of radial basis functions increased, a behavior that is reasonably
expected at least for the training set. The performance of conventional RBF networks is
significantly affected by the initialization of the learning process, as indicated by the
classification errors produced by RBF networks with the same number of radial basis
functions in three trials. Since the output weights were all set to 0 in the beginning of the
learning process, the initialization used in these trials affected only the partition of the
feature vectors produced by the c-means algorithm. Thus, the performance of RBF
neural networks trained using this learning scheme is mainly affected by the prototypes
produced by the unsupervised clustering algorithm, which determine the locations of
the Gaussian radial basis functions in the feature space.
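For reference, the "closest neighbor" width heuristic used for these conventional RBF networks can be sketched as follows (an illustrative implementation, not the chapter's code):

```python
import numpy as np

def closest_neighbor_widths(V):
    """V: (c, n) prototypes; returns the c Gaussian widths sigma_j = min_{l != j} ||v_j - v_l||."""
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # exclude v_j itself
    return d.min(axis=1)
```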
A variety of reformulated RBF neural networks were trained on the training set
formed from the 2D vowel data using the gradient descent algorithm presented in
Section 6. The radial basis functions were of the form φ(x) = g(x²) and g(·) was defined
in terms of an increasing generator function g_0(·) as g(x) = (g_0(x))^{1/(1−m)}, with m > 1. In
each case, the learning rate was selected sufficiently small so as to avoid a temporary
increase of the total error, especially in the beginning of the learning process.
Table 2 shows the percentage of classification errors on the training and testing sets
produced by reformulated RBF networks generated by linear generator functions
g_0(x) = x + γ² with m = 3 and γ = 0, γ = 0.1, γ = 1. It is clear from Table 2 that the
performance of reformulated RBF networks improved as the value of γ increased from
0 to 1. The improvement was significant when the reformulated RBF networks con-
tained a relatively small number of radial basis functions (15 to 25). This is particularly
important, because the ability of RBF networks to generalize degrades as the number of
radial basis functions increases. The price to be paid for such an improvement in
performance is slower convergence of the learning algorithm. It was experimentally
verified that the gradient descent learning algorithm converges slower as the value of
γ increases from 0 to 1, despite the fact that the maximum allowable learning rate for
γ = 0 was η = 10⁻⁵ whereas the networks with γ = 0.1 and γ = 1 were trained with
η = 10⁻⁴. This is consistent with the sensitivity analysis of the gradient descent learning
algorithm presented in Section 7 and the behavior of the sensitivity measure ||∇_{v_j} h_{j,k}||²,
which is plotted in Figure 2.
TABLE 2 Percentage of Classification Errors Produced on the Training Set (E_train) and the Testing
Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks Containing c Radial Basis
Functions of the Form φ(x) = g(x²), with g(x) = (g_0(x))^{1/(1−m)}, g_0(x) = x + γ², m = 3, and Various
Values of γ

       γ = 0.0 (η = 10⁻⁵)       γ = 0.1 (η = 10⁻⁴)       γ = 1.0 (η = 10⁻⁴)
  c   E_train(%)  E_test(%)   E_train(%)  E_test(%)   E_train(%)  E_test(%)
 15      37.6       36.0         33.4       30.9         24.0       24.9
 20      31.7       30.3         27.5       30.6         23.4       21.0
 25      25.4       22.2         22.8       22.5         22.5       22.2
 30      22.2       23.7         22.2       21.9         22.5       21.3
 35      27.8       27.0         21.3       24.9         22.8       21.0
 40      22.8       22.5         21.3       21.0         22.5       21.3
 45      24.6       24.0         19.5       21.0         22.2       20.4
 50      18.9       24.6         16.3       22.5         22.8       21.0

It is clear from Figure 2 that the response of each radial basis function located at a
prototype v_j becomes increasingly sensitive to changes of ||x_k − v_j||² as γ decreases
and approaches 0.
Table 3 shows the percentage of classification errors on the training and testing sets
produced in three trials by reformulated RBF networks generated by the linear gen-
erator function g_0(x) = x + γ² with γ = 1 and m = 3. The learning process was initi-
alized in each trial using a different set of prototypes distributed randomly over the
feature space. In all three trials, the output weights were all set to 0. According to Table
3, the differences in the percentage of classification errors produced in different trials by
reformulated RBF networks of the same size were not significant. Unlike conventional
RBF neural networks, reformulated RBF neural networks are only slightly affected by
the initialization of the learning process. This is not surprising, because the prototypes
of reformulated RBF networks are updated during learning as indicated by the training
set. In contrast, the prototypes of conventional RBF networks remain fixed during
learning. As a result, an unsuccessful set of prototypes can severely affect the imple-
mentation of the input-output mapping that relies only on the update of the output
weights.

TABLE 3 Percentage of Classification Errors Produced in Three Trials on the Training Set (E_train)
and the Testing Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks
Containing c Radial Basis Functions of the Form φ(x) = g(x²), with g(x) = (g_0(x))^{1/(1−m)},
g_0(x) = x + γ², m = 3, and γ = 1

       Trial #1 (η = 10⁻⁴)      Trial #2 (η = 10⁻⁴)      Trial #3 (η = 10⁻⁴)
  c   E_train(%)  E_test(%)   E_train(%)  E_test(%)   E_train(%)  E_test(%)
 15      24.0       24.9         22.8       22.5         24.6       24.0
 20      23.4       21.0         22.2       22.2         24.0       22.5
 25      22.5       22.2         21.6       22.5         22.8       22.5
 30      22.5       21.3         22.5       20.7         22.2       20.1
 35      22.8       21.0         22.5       20.4         22.5       19.8
 40      22.5       21.3         22.5       19.5         22.5       19.8
 45      22.2       20.4         22.5       20.7         22.5       20.1
 50      22.8       21.0         22.8       20.7         21.3       20.4

Table 4 summarizes the percentage of classification errors produced on the training
and testing sets formed from the 2D vowel data by reformulated RBF networks gen-
erated by exponential generator functions g_0(x) = exp(βx) with m = 3 and β = 0.5,
β = 1, β = 5. For a fixed value of m, the width σ of all Gaussian radial basis functions
is determined as σ² = (m − 1)/β. For m = 3, the values β = 0.5, β = 1, and β = 5
correspond to σ² = 4, σ² = 2, and σ² = 0.4, respectively. The networks were also
trained with β = 0.1 (σ² = 20) and β = 0.2 (σ² = 10) but the learning algorithm did
not converge and the networks classified incorrectly almost half of the feature vectors
from the training and testing sets. Table 4 indicates that even the networks trained with
β = 0.5 did not achieve satisfactory performance. The performance of the trained net-
works on both training and testing sets improved as the value of β increased above 0.5.
The best performance on the training set was achieved for β = 5. However, the best
performance on the testing set was achieved for β = 1. It is remarkable that for β = 1
the percentage of classification errors on the testing set was almost constant as the
number of radial basis functions increased above 20. The gradient descent algorithm
also converged for values of β above 5 but the performance of the trained RBF net-
works on the testing set degraded. This set of experiments is consistent with the sensi-
tivity analysis of gradient descent learning and reveals the trade-off associated with the
selection of the value of β or, equivalently, the width of the resulting Gaussian radial
basis functions. Small values of β reduce the sensitivity of the radial basis functions to
changes of their respective prototypes and the learning algorithm does not converge.
Large values of β lead to radial basis functions that are more sensitive to changes of
their respective prototypes. However, such values of β create sharp radial basis func-
tions, and that has a negative effect on the ability of the trained RBF networks to
generalize. These experiments also indicate that there exists a certain range of values of
β that guarantees convergence of the learning algorithm and satisfactory performance.
Table 5 summarizes the percentage of classification errors produced in three trials
by reformulated RBF networks generated by the exponential generator function g_0(x)
= exp(βx) with m = 3 and β = 1. In this case the width of all Gaussian radial basis
functions was σ = √2. The learning process was initialized in each trial by different sets
of randomly selected prototypes.

TABLE 4 Percentage of Classification Errors Produced on the Training Set (E_train) and the Testing
Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks Containing c Radial Basis
Functions of the Form φ(x) = g(x²), with g(x) = (g_0(x))^{1/(1−m)}, g_0(x) = exp(βx), m = 3, and Various
Values of β

       β = 0.5 (η = 10⁻⁴)       β = 1.0 (η = 10⁻⁴)       β = 5.0 (η = 10⁻⁴)
  c   E_train(%)  E_test(%)   E_train(%)  E_test(%)   E_train(%)  E_test(%)
 15      29.0       25.2         22.2       21.9         25.1       27.9
 20      29.3       27.6         23.7       20.1         22.5       21.0
 25      29.3       27.9         25.4       20.7         21.3       21.6
 30      27.2       27.0         23.1       20.7         22.2       21.9
 35      28.1       25.5         24.8       20.7         21.6       21.9
 40      27.5       25.5         23.4       20.7         21.0       21.0
 45      27.8       24.6         24.3       19.5         19.8       21.3
 50      26.6       24.0         24.3       19.8         19.8       21.0
TABLE 5 Percentage of Classification Errors Produced in Three Trials on the Training Set (E_train)
and the Testing Set (E_test) Formed from the 2D Vowel Data by Reformulated RBF Networks
Containing c Radial Basis Functions of the Form φ(x) = g(x²), with g(x) = (g_0(x))^{1/(1−m)},
g_0(x) = exp(βx), m = 3, and β = 1

       Trial #1 (η = 10⁻⁴)      Trial #2 (η = 10⁻⁴)      Trial #3 (η = 10⁻⁴)
  c   E_train(%)  E_test(%)   E_train(%)  E_test(%)   E_train(%)  E_test(%)
 15      22.2       21.9         23.1       22.2         25.4       21.0
 20      23.7       20.1         23.7       21.3         24.6       20.7
 25      25.4       20.7         25.1       20.1         24.6       21.6
 30      23.1       20.7         24.3       20.4         23.4       19.8
 35      24.8       20.7         25.1       20.1         24.3       20.4
 40      23.4       20.7         24.8       20.1         24.8       20.1
 45      24.3       19.5         24.3       19.8         24.0       20.7
 50      24.3       19.8         25.4       20.7         24.6       20.4

In fact, the sets of initial prototypes used in these experiments were identical to those
used in the experiments employing conventional RBF networks and reformulated RBF
networks generated by linear generator functions, the results of which are summarized
in Tables 1 and 3, respectively. The output weights were all set to 0. The classification
errors observed in each trial on the training set and the testing set were not significantly
affected by the number of radial basis functions. The classification errors on the training
set were very close in different trials. The same is true for the classification errors
recorded in the three trials on the testing set. Thus, the performance of the trained RBF
networks was only slightly affected by the initialization of the training process.
Compared with the reformulated RBF networks generated by linear generator functions,
the reformulated RBF networks tested in these experiments produced a higher percentage
of classification errors on the training set and a slightly lower percentage of classification
errors on the testing set. In fact, the RBF networks generated by linear generator
functions achieved more balanced performance on the training and testing sets.
Figures 4, 5, and 6 show the partition of the feature space produced by reformu-
lated RBF networks trained using gradient descent with 30, 40, and 50 radial basis
functions, respectively. The function g(·) was of the form g(x) = (g_0(x))^{1/(1−m)}, with
g_0(x) = x and m = 3. The networks with 30 and 40 radial basis functions produced
different partitions of the feature space, which indicates that there exist significant
differences in the representations of the feature space by 30 and 40 prototypes. It is
clear from Figures 4a and 5a that both networks attempt to find the best possible
compromise in regions of the input space with extensive overlapping between the
classes. As a result, some of the regions formed differ in terms of both shape and
size. Nevertheless, the partitions produced by both networks for the training set offer
a reasonable compromise for the testing set, as indicated by Figures 4b and 5b. Figure
6a shows that the RBF network trained with 50 radial basis functions follows all the
peculiarities of the training set. This is clear from the anomalous surfaces produced by
the network in its attempt to classify correctly feature vectors from the training set that
can be considered outliers with respect to their classes. Moreover, Figure 6b indicates
that the partition of the feature space produced by the network for the training set is
not appropriate for the testing set. This is a clear indication that increasing the number
of radial basis functions above a certain threshold compromises the ability of trained
RBF neural networks to generalize.

Figure 4 Classification of the 2D vowel data produced by a reformulated RBF neural
network trained using gradient descent: (a) the training data set and (b) the testing data
set. Generator function: g_0(x) = x with m = 3; number of radial basis functions: c = 30;
learning rate: η = 10⁻⁵.

Figure 5 Classification of the 2D vowel data produced by a reformulated RBF neural
network trained using gradient descent: (a) the training data set and (b) the testing data
set. Generator function: g_0(x) = x with m = 3; number of radial basis functions: c = 40;
learning rate: η = 10⁻⁵.

Figure 6 Classification of the 2D vowel data produced by a reformulated RBF neural
network trained using gradient descent: (a) the training data set and (b) the testing data
set. Generator function: g_0(x) = x with m = 3; number of radial basis functions: c = 50;
learning rate: η = 10⁻⁵.


9. CONCLUSIONS

The success of a neural network model depends rather strongly on its association with
an attractive learning algorithm. For example, the popularity of conventional feed-
forward neural networks was to a large extent due to the error back-propagation
algorithm. On the other hand, the effectiveness of RBF neural models for function
approximation was hampered by the lack of an effective and reliable learning algorithm
for such models. Despite its disadvantages, gradient descent seems to be the only
method leading to appealing learning algorithms.
According to the axiomatic approach presented in this chapter for reformulating
RBF neural networks, the development of admissible RBF models reduces to the
selection of admissible generator functions that determine the form and properties of
the radial basis functions. The reformulated RBF networks generated by linear and
exponential generator functions can be trained by gradient descent and perform con-
siderably better than conventional RBF networks. Moreover, their training by gradient
descent is not necessarily slow. Consider, for example, reformulated RBF networks
generated by linear generator functions of the form g0(x) = χ + γ*. Values of γ close
to 0 can speed up the convergence of gradient descent learning and still produce trained
RBF networks performing better than conventional RBF networks. For values of γ
approaching 1, the convergence of gradient descent learning slows down but the per-
formance of the trained RBF networks improves. The convergence of gradient descent
learning is also affected by the value of the parameter m relative to 1. Values of m close
to 1 tend to create sharp radial basis functions. As the value of m increases, the radial
basis functions become wider and more responsive to neighboring input vectors.
However, increasing the value of m reduces the sensitivity of the radial basis functions
to changes of their respective prototypes, and this slows down the convergence of
gradient descent learning. Nevertheless, the selection of the parameter m is not crucial
in practice. Reformulated RBF networks can be trained in practical situations by fixing
the parameter m to a value between 2 and 4.
The experimental evaluation of reformulated RBF networks presented in this
chapter showed that the association of RBF networks with erratic behavior and poor
performance is unfair to this powerful neural architecture. The experimental results also
indicated that the disadvantages often associated with RBF neural networks can only
be attributed to the learning schemes used for their training and not to the models
themselves. If the learning scheme used to train RBF neural networks decouples the
determination of the prototypes and the updates of the output weights, then the pro-
totypes are simply determined to satisfy the optimization criterion behind the unsuper-
vised algorithm employed. Nevertheless, the satisfaction of this criterion does not
necessarily guarantee that the partition of the input space by the prototypes facilitates
the implementation of the desired input-output mapping. The simple reason for this is
that the training set does not participate in the formation of the prototypes. In contrast,
the update of the prototypes during the learning process produces a partition of the
input space that is specifically designed to facilitate the input-output mapping. In effect,
this partition leads to trained reformulated RBF neural networks that are strong com-

petitors to other popular neural models, including feed-forward neural networks with
sigmoidal hidden units.
There is experimental evidence that the performance of reformulated RBF neural
networks improves when their supervised training by gradient descent is initialized by
using an effective unsupervised procedure to determine the initial set of prototypes from
the input vectors included in the training set. An alternative to employing the c-means
algorithm for determining the initial set of prototypes would be the use of unsupervised
algorithms that are not significantly affected by their initialization. The search for such
codebook design techniques led to soft clustering [16,24-27] and soft learning vector
quantization algorithms [9,14,26-29]. Unlike crisp clustering and vector quantization
techniques, these algorithms form the prototypes on the basis of soft instead of crisp
decisions. As a result, this strategy reduces significantly the effect of the initial set of
prototypes on the partition of the input vectors produced by such algorithms. The use
of soft clustering and LVQ algorithms for initializing the training of reformulated RBF
neural networks is a particularly promising approach currently under investigation.
Such an initialization approach is strongly supported by recent developments in unsu-
pervised competitive learning, which indicated that the same generator functions used
for constructing reformulated RBF neural networks can also be used to generate soft
LVQ and clustering algorithms [14,24,27].
The generator function can be seen as the concept that establishes a direct relation-
ship between reformulated RBF models and soft LVQ algorithms [14]. This relation-
ship makes reformulated RBF models potential targets of the search for architectures
inherently capable of merging neural modeling with fuzzy-theoretic concepts, a problem
that attracted considerable attention recently [30]. In this context, a problem worth
investigating is the ability of reformulated RBF neural networks to detect the presence
of uncertainty in the training set and quantify the existing uncertainty by approximat-
ing any membership profile arbitrarily well from sample data.

REFERENCES

[1] D. S. Broomhead and D. Lowe, Multivariable functional interpolation and adaptive net-
works. Complex Syst. 2: 321-355, 1988.
[2] C. A. Micchelli, Interpolation of scattered data: Distance matrices and conditionally positive
definite functions. Construct. Approx. 2: 11-22, 1986.
[3] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to
multilayer networks. Science 247: 978-982, 1990.
[4] J. Park and I. W. Sandberg, Universal approximation using radial-basis-function networks.
Neural Comput. 3: 246-257, 1991.
[5] J. Park and I. W. Sandberg, Approximation and radial-basis-function networks. Neural
Comput. 5: 305-316, 1993.
[6] G. Cybenko, Approximation by superpositions of a sigmoidal function. Math. Control
Signals Syst. 2: 303-314, 1989.
[7] S. Chen, C. F. N. Cowan, and P. M. Grant, Orthogonal least squares learning algorithm for
radial basis function networks. IEEE Trans. Neural Networks 2(2): 302-309, 1991.
[8] S. Chen, G. J. Gibson, C. F. N. Cowan, and P. M. Grant, Reconstruction of binary signals
using an adaptive radial-basis-function equalizer. Signal Process. 22: 77-93, 1991.

[9] N. B. Karayiannis and W. Mi, Growing radial basis neural networks: Merging supervised
and unsupervised learning with network growth techniques. IEEE Trans. Neural Networks
8(6): 1492-1506, 1997.
[10] J. E. Moody and C. J. Darken, Fast learning in networks of locally-tuned processing units.
Neural Comput. 1: 281-294, 1989.
[11] I. Cha and S. A. Kassam, Interference cancellation using radial basis function networks.
Signal Process. 47: 247-268, 1995.
[12] N. B. Karayiannis, Gradient descent learning of radial basis neural networks. Proceedings of
1997 IEEE International Conference on Neural Networks, pp. 1815-1820, Houston, June 9-
12, 1997.
[13] N. B. Karayiannis, Learning algorithms for reformulated radial basis neural networks.
Proceedings of 1998 IEEE International Joint Conference on Neural Networks, pp. 2230-
2235, Anchorage, AK, May 4-9, 1998.
[14] N. B. Karayiannis, Reformulating learning vector quantization and radial basis neural net-
works. Fundam. Inform, 37: 137-175, 1999.
[15] N. B. Karayiannis, Reformulated radial basis neural networks trained by gradient descent.
IEEE Trans. Neural Networks, 10: 657-671, 1999.
[16] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York:
Plenum, 1981.
[17] T. Kohonen, Self-Organization and Associative Memory, 3rd ed. Berlin: Springer-Verlag,
1989.
[18] T. Kohonen, The self-organizing map. Proc. IEEE 78(9): 1464-1480, 1990.
[19] B. A. Whitehead and T. D. Choate, Evolving space-filling curves to distribute radial basis
functions over an input space. IEEE Trans. Neural Networks 5(1): 15-23, 1994.
[20] A. Roy, S. Govil, and R. Miranda, A neural-network learning theory and a polynomial time
RBF algorithm. IEEE Trans. Neural Networks 8(6): 1301-1313, 1997.
[21] N. B. Karayiannis and A. N. Venetsanopoulos, Artificial Neural Networks: Learning
Algorithms, Performance Evaluation and Applications. Boston: Kluwer Academic, 1993.
[22] R. P. Lippmann, Pattern classification using neural networks. IEEE Commun. Mag. 27: 47-
54, 1989.
[23] K. Ng and R. P. Lippmann, Practical characteristics of neural network and conventional
pattern classifiers. In Advances in Neural Information Processing Systems 3, R. P. Lippmann
et al., eds., pp. 970-976. San Mateo, CA: Morgan Kaufmann, 1991.
[24] N. B. Karayiannis, Fuzzy and possibilistic clustering algorithms based on generalized refor-
mulation. Proceedings of the Fifth IEEE International Conference on Fuzzy Systems, pp.
1393-1399, New Orleans, September 8-11, 1996.
[25] N. B. Karayiannis, Fuzzy partition entropies and entropy constrained clustering algorithms.
J. Intell. Fuzzy Syst. 5(2): 103-111, 1997.
[26] N. B. Karayiannis, Ordered weighted learning vector quantization and clustering algo-
rithms. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1388-1393,
Anchorage, AK, May 4-9, 1998.
[27] N. B. Karayiannis, Soft learning vector quantization and clustering algorithms based on
reformulation. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1441-
1446, Anchorage, AK, May 4-9, 1998.
[28] N. B. Karayiannis and J. C. Bezdek, An integrated approach to fuzzy learning
vector quantization and fuzzy c-means clustering. IEEE Trans. Fuzzy Syst. 5(4): 622-628,
1997.
[29] E. C.-K. Tsao, J. C. Bezdek, and N. R. Pal, Fuzzy Kohonen clustering networks. Pattern
Recogn. 27(5): 757-764, 1994.
[30] G. Purushothaman and N. B. Karayiannis, Quantum neural networks (QNNs): Inherently
fuzzy feedforward neural networks. IEEE Trans. Neural Networks 8(3): 679-693, 1997.

[31] N. B. Karayiannis, Entropy constrained learning vector quantization algorithms and their
application in image compression, SPIE Proceedings, Vol. 3030: Applications of Artificial
Neural Networks in Image Processing II, pp. 2-13, San Jose, CA, 1997.
[32] N. B. Karayiannis, An axiomatic approach to soft learning vector quantization and cluster-
ing. IEEE Trans. Neural Networks, 10: 1153-1165, 1999.
Nonlinear Biomedical Signal Processing: Fuzzy Logic,
Neural Networks, and New Algorithms, Volume I
Edited by Metin Akay
© 2000 The Institute of Electrical and Electronics Engineers, Inc.

Chapter 7

SOFT LEARNING VECTOR QUANTIZATION AND CLUSTERING ALGORITHMS BASED ON REFORMULATION

Nicolaos B. Karayiannis

1. INTRODUCTION
Consider the set X ⊂ ℝⁿ that is formed by M feature vectors from an n-dimensional
Euclidean space, that is, X = {x_1, x_2, ..., x_M}, x_i ∈ ℝⁿ, 1 ≤ i ≤ M. Clustering is the
process of partitioning the M feature vectors into c < M clusters, which are represented
by the prototypes v_j ∈ V, 1 ≤ j ≤ c. Vector quantization can be seen as a mapping from
an n-dimensional Euclidean space X into the finite set V = {v_1, v_2, ..., v_c} ⊂ ℝⁿ, also
referred to as the codebook.
Codebook design can be performed by clustering algorithms, which are typically
developed by solving a constrained minimization problem using alternating optimiza-
tion. These clustering techniques include the crisp c-means [1], fuzzy c-means [1], and
generalized fuzzy c-means [2,3]. Alternative approaches to clustering resulted in the
development of entropy-constrained clustering and vector quantization algorithms
that are directly or indirectly related to Gibbs distribution and annealing. Rose et al.
[4] proposed a deterministic annealing algorithm for clustering using concepts from
probability theory and statistical mechanics. The clustering algorithms produced by
entropy-constrained minimization are not inherently invariant under uniform scaling
of the feature vectors. Karayiannis [5,6] proposed a new approach to fuzzy clustering,
which provided the basis for the development of entropy-constrained fuzzy clustering
(ECFC) algorithms. The formulation of the clustering problem considered in this
approach allows the development of fuzzy clustering algorithms invariant under uni-
form scaling of the feature vectors.
Recent developments in neural network architectures resulted in learning vector
quantization (LVQ) algorithms [7-21]. Learning vector quantization is the name used
for unsupervised learning algorithms associated with the competitive neural network
shown in Figure 1. The network consists of an input layer and an output layer. Each
node in the input layer is connected directly to the cells in the output layer. A prototype
vector is associated with each cell in the output layer as shown in Figure 1. Batch fuzzy
learning vector quantization (FLVQ) algorithms were introduced by Tsao et al. [20], and
their connection to probabilistic vector quantization models proposed in [22] was stu-
died in [23]. The update equations for FLVQ involve the membership functions of the
fuzzy c-means (FCM) algorithm, which are used to determine the strength of attraction


Figure 1 The LVQ competitive network.

between each prototype and the input vectors. Tsao et al. [20] justified their update
equation by pointing out its close relationship to the fuzzy and crisp c-means algorithms.
Karayiannis and Bezdek [15] developed a broad family of batch LVQ algorithms by
minimizing the average of the generalized means of the Euclidean distances between
each of the feature vectors and the prototypes. The minimization problem considered in
this derivation is actually a reformulation of the problem of determining fuzzy c-parti-
tions that was solved by the FCM algorithm [24]. Under certain conditions, the resulting
batch LVQ scheme can be implemented as the FCM or FLVQ algorithms [15].
This chapter begins with the reformulation of FCM and ECFC clustering algo-
rithms. These results provide the basis for the development of a unified theory for soft
LVQ and clustering. This theory provides an alternative to developing LVQ and clus-
tering algorithms, a task that is almost exclusively based on alternating optimization. In
fact, this theory allows the development of LVQ and clustering algorithms by minimiz-
ing a reformulation function using gradient descent. Further investigation indicates that
the development of LVQ and clustering algorithms reduces to the selection of an
admissible generator function. Existing fuzzy LVQ and clustering algorithms are inter-
preted as the products of linear and exponential generator functions. This chapter also
presents a family of soft LVQ and clustering algorithms produced by nonlinear gen-
erator functions. These algorithms are used to perform segmentation of magnetic reso-
nance images of the brain.

2. CLUSTERING ALGORITHMS

Clustering algorithms can be classified as crisp or fuzzy according to the scheme they
employ for partitioning the feature vectors into clusters. This section introduces crisp

and fuzzy partitions and also describes the crisp c-means, fuzzy c-means, and entropy-
constrained fuzzy clustering algorithms.
2.1. Crisp and Fuzzy Partitions
A partition π(X) of the set X is a collection of subsets of X that are pairwise
disjoint, all nonempty, and whose union yields the original set. Let X be the finite set
X = {x_1, x_2, ..., x_M}. A family of c ∈ [2, M) subsets A_j, 1 ≤ j ≤ c, of X is a crisp or
hard c-partition of X if

    ∪_{j=1}^{c} A_j = X                                               (1)
    A_i ∩ A_j = ∅,  1 ≤ i ≠ j ≤ c                                      (2)
    ∅ ⊂ A_j ⊂ X,  1 ≤ j ≤ c                                           (3)

Each subset A_j, 1 ≤ j ≤ c, is assigned a characteristic function or indicator function
u_j(x_i) = u_{ij}, defined as

    u_{ij} = 1 if x_i ∈ A_j;  u_{ij} = 0 if x_i ∉ A_j                 (4)

According to this definition, each u_{ij} can take the value of 0 or 1, that is,

    u_{ij} ∈ {0, 1},  1 ≤ i ≤ M;  1 ≤ j ≤ c                           (5)

For any x_i ∈ X, there exists a j = j* such that u_{ij*} = 1 and u_{ij} = 0, ∀j ≠ j*. Thus,

    Σ_{j=1}^{c} u_{ij} = 1,  1 ≤ i ≤ M                                (6)

Since each subset A_j, 1 ≤ j ≤ c, of X is nonempty, Σ_{i=1}^{M} u_{ij} > 0, 1 ≤ j ≤ c. Since all A_j,
1 ≤ j ≤ c, are proper subsets of X, there exists no single subset of X containing all
elements of X. This implies that Σ_{i=1}^{M} u_{ij} < M, 1 ≤ j ≤ c. Thus,

    0 < Σ_{i=1}^{M} u_{ij} < M,  1 ≤ j ≤ c                            (7)

In summary, the M × c matrix U = [u_{ij}] is a crisp c-partition in the set U_c defined as

    U_c = {U ∈ ℜ^{M×c} : u_{ij} ∈ {0,1}, ∀i,j;  Σ_{j=1}^{c} u_{ij} = 1, ∀i;  0 < Σ_{i=1}^{M} u_{ij} < M, ∀j}   (8)

Crisp c-partitions assign each element of X to a single subset, thus ignoring the
possibility that this element may also belong to other subsets. This is a clear disadvan-
tage of crisp c-partitions, which affects their ability to partition the elements of X in a
way consistent with human intuition and their physical properties. In fact, crisp c-
partitions fail to represent the uncertainty that is often encountered in practical appli-

cations. The use of crisp c-partitions in formulating real-world problems restricts the
space of possible solutions searched by the resulting algorithmic tools.
The shortcomings of crisp c-partitions were illustrated by Bezdek [1], who consid-
ered all valid 2-partitions of the set

X = {peach, nectarine, plum} (9)

Table 1 shows all valid 2-partitions of the set X. According to the first two partitions,
the nectarine (a hybrid of peach and plum) is assigned to the same subset with either
peach or plum. However, both partitions fail to reveal the close relationship between
the nectarine and both the peach and the plum. The third partition creates a subset
containing only the nectarine, thus separating the only hybrid fruit in the set. Although
this separation seems to be reasonable, the partition fails to indicate the unquestionable
relationship between the peach, the plum, and their hybrid.
Fuzzy c-partitions were introduced in an attempt to overcome the shortcomings of
crisp c-partitions. Fuzzy c-partitions are constructed by partitioning the set X into
fuzzy subsets. In this case, the characteristic or indicator function (4) becomes a mem-
bership function u_{ij} = u_j(x_i), which represents the degree to which x_i can be considered
as a member of the subset A_j. According to the classical definition of fuzzy c-partitions,
the M × c matrix U = [u_{ij}] is a fuzzy c-partition if and only if its entries satisfy the
following conditions:

    u_{ij} ∈ [0, 1],  1 ≤ i ≤ M;  1 ≤ j ≤ c                           (10)

    Σ_{j=1}^{c} u_{ij} = 1,  1 ≤ i ≤ M                                (11)

    0 < Σ_{i=1}^{M} u_{ij} < M,  1 ≤ j ≤ c                            (12)

In summary, the M × c matrix U = [u_{ij}] is a fuzzy c-partition in the set U_f defined as

    U_f = {U ∈ ℜ^{M×c} : u_{ij} ∈ [0,1], ∀i,j;  Σ_{j=1}^{c} u_{ij} = 1, ∀i;  0 < Σ_{i=1}^{M} u_{ij} < M, ∀j}   (13)

Since the conditions (11) and (12) are also satisfied by valid crisp c-partitions, the only
difference between crisp and fuzzy c-partitions is that in the case of fuzzy c-partitions
{u_{ij}} are allowed to take values from the interval [0, 1]. Since {0,1} ⊂ [0,1], the condi-
tion (10) is also satisfied by crisp c-partitions, which require that u_{ij} ∈ {0, 1}, ∀i, j. Thus,

TABLE 1 Valid Crisp 2-Partitions of the Set X = {peach, nectarine, plum}

Subset A A2 ΛΛ A2 A A2
peach 1 0 1 0 0 1
nectarine 1 0 0 1 1 0
plum 0 1 0 1 0 1
162 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

the space of crisp c-partitions is a subspace of the space of fuzzy c-partitions. In fact, the
classical definition of fuzzy c-partitions was strongly influenced by the properties of
crisp c-partitions. This influence becomes even more obvious on noting the relative
freedom allowed by fuzzy set theory for selecting membership functions. The condition
(11) was relaxed in some recent approaches [2,3,8-13,16,25-27].
2.2. Crisp omeans Algorithm
Suppose the M x c matrix U = [uy] defines a valid crisp c-partition of the set
X = [xl,x2,...,xM}, that is, ui} e {0,1}, Vi, j , and E/=i My = ^ Vi - L e t V =
[\y, v 2 ,...,?(.} be the set of prototypes representing the feature vectors in X. Then
the discrepancy associated with the representation of each x, by one of the prototypes
Y/> 1 <j<c, can be measured by

·Ή = Σ"#ΙΙ χ '- ¥ ;ΙΙ 2 (14)

The total discrepancy associated with the representation of all feature vectors x,·,
1 < i < M, by the prototypes v;, 1 < j < c, is given by

M c <15>
= ΣΣΜ*/-*/·ΙΙ 2
(=1 ;'=1

The c-means algorithm was developed using alternating optimization to solve the
minimization problem [1]

mi^ I /,(U, V) = f ) Σ Mx« " ^ I (16>


where V = [viv 2 ... vc] e Wc, Uy = «y(x,) is the indicator function that assigns x, to the
y'th cluster, and the M x c matrix U = [uy] is a crisp c-partition in the set Uc defined in
(8). Define the sets

It = {n <j<c: ||x, - v,||2 < ||χ(· - v,|| 2 , V£ φ]) (17)

and their complements

2} = { l , 2 , . . . , c } - X , (18)

The cardinality J , of the set J , represents the number of prototypes whose distance
from x, is equal to the minimum distance between x, and all prototypes. The coupled
necessary conditions for solutions (U, V) € Uc x RMC of (16) are [1]:
• If |J,| > 1, then Uy = 0, V/ € ϊ,, and J2JeX. Uy = 1, Vi
Section 2 Clustering Algorithms 163

If II, | = 1, then

1, if ||x, - v,||2 < ||x, - ▼<||2, V«#y,


1 < i < M; \<j<c (19)
0, otherwise

and

^ =0 ^ . IZJS' (2°)

The c-means algorithm begins with the selection of an initial set of prototypes,
which implies the partition of the feature vectors into c clusters. Each cluster is repre-
sented by a prototype, which is computed as the centroid of the feature vectors belong-
ing to that cluster. Each of the feature vectors is assigned to the cluster whose prototype
is its closest neighbor. The new prototypes are computed from the results of the new
partition, and this process is repeated until the changes of the prototypes from one
iteration to the next become negligible.
The c-means algorithm can be summarized as follows:
1. Select c and e; fix N; set v = 0.
2. Generate an initial set of prototypes: Vo = {vii0, V2JO, · · ·. vCio}·
3. Set v = v + 1 (increment iteration counter).
4. Calculate:

[ 1, if ||χ,·-ν Λ υ || 2 <||χ,·-ν,, ν || 2 , νΐφί, . .


• uiJyV=\ 1 <ι<Μ; 1 <j <c,
I 0, otherwise.

• V = (Σί=ι Η**Ι)/<ΣΖΙ 2 «v.*)* l *J *c-


• £ν=Σ;=ιΐκ>-ν,>-ιΐι .
5. If v < TV and Ev > e, then go to step 3.
Although the c-means algorithm is simple and intuitively appealing, it strongly
depends on the selection of the initial codebook and can easily be trapped in local
minima [26]. The selection of a good initial set of prototypes was attempted by random
code, splitting, and pairwise nearest neighbor techniques [28]. However, there seems to
be no simple solution to the local minima problem. For a given initial set of prototypes,
the algorithm finds the nearest local minimum in the space of all possible partitions.
Another shortcoming of the c-means algorithm is the creation of degraded clusters, that
is, clusters with no feature vectors assigned to them.
The disadvantages of crisp clustering algorithms are mainly due to the nearest
prototype condition they employ for assigning the feature vectors to clusters, which
completely ignores the uncertainty involved in this process. Most of the problems
associated with crisp clustering algorithms can be overcome by fuzzy clustering algo-
rithms, which are capable of quantifying the uncertainty typically associated with the
representation of a set of feature vectors by a set of prototypes. Fuzzy clustering
164 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

algorithms consider each cluster as a fuzzy set. As a result, each feature vector may be
assigned to multiple clusters with some degree of certainty measured by the membership
function.
2.3. Fuzzy c-means Algorithm
The fuzzy c-means (FCM) algorithm is one of the most powerful tools used to
perform clustering in practical applications. The FCM algorithm was developed using
alternating optimization to solve the minimization problems [1]

W f xR' { /m(u)v) = vy"(«öriix,-vyi


M c
(21)

where 1 < m < oo, V = [V]V2 ... vc] e 05ne, Uy = W/(x,·) is the membership function that
assigns x, to the y'th cluster, and the M x c matrix U = [uy] is a fuzzy c-partition in the
set Uf defined in (13). Define the sets

X, = { / l < 7 < c : | | x , - v y | | 2 = 0} (22)

and their complements

ί/ = {1,,2 c}-I, (23)

The cardinality I , of the set I , represents the number of prototypes that coincide with
x(. The coupled first-order necessary conditions for solutions (U, V)eW/X R"c of (21)
are [1]:
• If I,· Φ <Z>, then Uy = 0, V/ e I„ and Σ/€ζ,> uy = 1 > ^ ·
• If I , = 0, then

and

Σ,=ι ("»Γ χ ί j . (25)

The "fuzziness" of the clustering produced by the FCM is controlled by the para-
meter m, which is greater than 1 [1]. As this parameter approaches 1, the partition of the
feature vector space is a nearly crisp decision-making process. Increasing this parameter
tends to degrade membership toward a maximally fuzzy partition [1].
The fuzzy c-means algorithm can be summarized as follows:
1. Select c, m, and e; fix N; set v = 0.
2. Generate an initial set of prototypes Vo = {vio, V2,o, · · ·, vc>o}.
Section 2 Clustering Algorithms 165

3. Set v = v+l.
4. Calculate:

J
\trviix«-v^-iii7 j

5. If v < N and £„ > e, then go to step 3.

2.4. Entropy-Constrained Fuzzy Clustering


The entropy was proposed by Shannon as a measure of uncertainty for statistical
models. Bezdek [1] proposed the partition entropy as a formal analogy for fuzzy
c-partitions. The partition entropy of any fuzzy c-partition U e Uf of X is defined as

i M c
=
^ -ΜΣΣ«»Η (26)
i=l j=\

where Uy In Uy = 0 if «,·,· = 0. The definition of the partition entropy implies that


0 < H(U) < In c. The partition entropy is minimized if and only if the c-partition is
crisp, that is, if and only if each entry uy of U e Uf takes only the value 0 or 1. The
partition entropy attains its maximum value Hmax = In c if and only if U e Uf corre-
sponds to a maximally fuzzy c-partition, that is, if and only if Uy = \/c, Wi, j .
Entropy-constrained fuzzy clustering (ECFC) algorithms were developed using
alternating optimization to solve the minimization problems [5,6]

min {ΐμ(υ,\) = μσ(υ) + (1-μ)σΌ(υ,Υ)} (27)


UfxW

where V = [viv2 · · · vj e R™, the M x c matrix U = [uy] e Uf is a fuzzy c-partition in


the set Uf defined in (13), G(U) is defined in terms of the partition entropy H(JJ) as

G(U) = - # ( U )
1 Α , Α , (28)
=
ΜΣΣ««Η
(=1 .7=1

and Z>(U, V) is the average distortion between the feature vectors x, e X and the pro-
totypes \j e V, defined as

i M. _c
W^MEE1*-^!211
(29)
;=i y=i
166 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

7μ(υ, V) is defined in terms of the scaling parameter a and the fuzzification parameter
μ € (0, 1). The scaling parameter σ can be used to develop clustering algorithms invar-
iant under uniform scaling of the feature vectors x, € X. The fuzzification parameter μ
determines the relative effect of the entropy and the distortion terms on the objective
function /^(U, V). If μ -*■ 1, the entropy term is dominant and minimization of Ιμ
(U, V) implies maximization of the partition entropy H(U) = — G(U). As the value of
μ decreases, the effect of the entropy term decreases and the minimization of the
distortion between the feature vectors and the prototypes plays an increasingly domi-
nant role. If μ -*· 0, the effect of the entropy term in (27) is eliminated and the cluster-
ing process is almost exclusively based on the minimization of the distortion D(V, V)
between the feature vectors and the prototypes.
The coupled necessary conditions for solutions (U, V) eUf x Wc of(27) are [5,6]

uy = —c " —-5-, 1 < 1 < M; \<j<c (30)

where δμ = (1 — μ)/μ, and

Ύ=Σ£*2&., i<j<c ÖD
Σί=1 "y

The fuzziness of the c-partition produced by this formulation depends on the value of
μ e (0,1) [6]. If μ « 1, then δμ = (1 - μ)/μ approaches 0 and Uy « \, Vi,j. Such a value
of μ produces a maximally fuzzy partition. As the value of μ decreases, the minimiza-
tion of the distortion between the feature vectors and the prototypes plays a more
significant role. In this case, (30) assigns membership values to the feature vectors
according to their relative distance from a certain prototype. As μ approaches 0, the
membership function (30) approaches the indicator function associated with the crisp c-
means algorithms, that is, «,·, -*■ 1 if ||x,· - y,-||2 < ||x,· - v^||2, W φ}, and Uy -» 0 other-
wise. The transition from a maximally fuzzy to a nearly crisp partition can also be seen
as an annealing process. The value of δμ = (1 — μ)/μ increases as the value of μ
decreases, while δμ = (1 — μ)/μ -*■ oo as μ -*■ 0. Clearly, δμι = μ/(1 — μ) can be inter-
preted as the system temperature, which decreases from a very high value to 0 during the
annealing process. In this context, the fuzziness of the partition relates to the number of
accessible states for the system.
The distortion component of the objective function (27) is a function of the feature
vectors and the range of its values is unknown. In contrast, the entropy component of
(27) is not a function of the feature vectors and the range of its values is known. The
scaling parameter σ in (27) can be used to balance the relative effect of the entropy and
the distortion components of the objective function. This is necessary for the develop-
ment of clustering algorithms that are invariant under uniform scaling of the feature
vectors. The scaling parameter σ can be computed at each iteration according to the
condition

a 0 #(U) = aD(V, V) (32)


Section 2 Clustering Algorithms 167

where σ0 is a bias constant that determines the relative weight assigned to the partition
entropy and scaled average distortion components of the objective function (27). If
σ0 > 1, the evaluation of the scaling parameter σ according to (32) favors the average
distortion over the partition entropy. Conversely, the evaluation of σ according to (32)
increases the role of the partition entropy in the clustering process if σ0 < 1. According
to the scheme presented above, σ can be computed at each iteration in terms of the
current estimates of the prototypes and the membership values as

'Σ£ΙΣΜΜ*-Τ/ΙΙ2

In the beginning of the clustering process, where the membership values are unknown,
it can be assumed that the partition is maximally fuzzy, that is, U = [1/c]. In this case,
the scaling parameter σ can be evaluated by requiring that

a0H(l\/c]) = aD([l/c],\) (34)

which is satisfied if

σ = σ0 - j (35)
2
i^EjUll^-v,·!!
According to (35), σ = a0./ymax/Ä where // max = lnc is the maximum value of the
partition entropy and D is the average of the squared Euclidean distances, defined as
^ = (l/M C )E£ 1 E;=illx 1 -Vyll 2 ·
The ECFC algorithm can be summarized as follows:
1. Select c, μ, σ0, and e;fixN; set v = 0.
2. Generate an initial set of prototypes V0 = {vio, V2,o> · · ·. vc0}·
3. Compute:
• δμ = (1 - ß)/ß-
• σ = a0(Mc)/(Z?=i EU H* ~ MI 2 )·
4. Set v = v + 1 .
5. Calculate:

&\ρ(-σδμ\\\,- - νΛν_!||2) . .
• Ui j v = —z - r-, 1 < ι < M; 1 < ; < c.
J
' Σ^εχρί-ο^ΙΙχ,-ν^,ΙΙ2)'· " ~
• vj.v = (ΣΖι "ν>Χ/)/(Σ£ι «y.v), 1 <j < c
• o = - σ 0 ( Σ " ι Σ > ι %»In %»)/(Σ£ι Σ£=Ι %»ΙΙ* " Τ/JI2)·
2
• £ν = Σ;=Ι ι ν - * / > - ! ιι ·
6. If ν < Ν and £„ > e, then go to step 4.
168 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

3. REFORMULATING FUZZY CLUSTERING

The crisp and fuzzy clustering algorithms presented in the previous section were all
developed using alternating optimization. According to this optimization strategy, the
formula for the optimal set of membership functions is obtained by assuming afixedset
of prototypes. Conversely, the optimal prototypes are obtained by assuming a fixed set
of membership functions. Reformulation is a methodology that allows the development
of clustering algorithms using gradient descent to minimize a functional that is closely
related to that minimized by alternating optimization. The algorithms obtained by
reformulating fuzzy clustering can be implemented on the competitive neural network
shown in Figure 1. Thus, reformulation essentially establishes a link between clustering
and learning vector quantization.

3.1. Reformulating the Fuzzy omeans Algorithm


The reformulation of the FCM algorithm can be accomplished by determining the
form of the functional (21) minimized by the FCM assuming that the membership
functions are optimal [15,24]. If the membership functions [uy} are optimally given as
in (24), then

M c
/ = χ ν 2 1
'» ΣΣΜ
i=l y=l
.·- ;ΙΙ ("«/·Γ"
1-m
2 \ 1/(1-«)'

1-m
2 1/(1 ,) 1
=ΣΣ^(έ(ι^-^ιι ) -" )
ί=1 j=l V=l /

Since the membership functions {«,·,} satisfy the condition £ / = 1 ui}, = 1, 1 < i < M, the
functional Jm minimized by the FCM algorithm can be written as [1,24]

The functional (37) has a structured form and depends exclusively on the feature
vectors and the set of prototypes V = {vj, v 2 ,..., ve}.
A recent approach to learning vector quantization revealed that (37) is a mean-
ingful measure of the discrepancy associated with the representation of the feature
vectors in X = {\x, x 2 ,..., \M] by the prototypes in V [15]. According to this approach,
a broad family of batch LVQ algorithms can be derived by minimizing [15]

, M
L (38)
P = MT(DP^V)
Section 3 Reformulating Fuzzy Clustering 169

where Dp(\h V) is the generalized mean (or unweighted /»-norm) of ||x,· - y,||2, 1 < j < c,
defined as [29]

QpVPi-vft)
Dp(x,,V)=KV(||x,-v,ry) (39)

with p e R - {0}. The update equation for each prototype \j was obtained using gra-
dient descent as [15]
M
AV
J = 7/ Σ α,^χ'' ~ ^ ( 40)
i=l

where ?j, is the learning rate and {ay} are the competition functions, defined as

■•-(steiiT
The minimization problem that resulted in this family of batch LVQ algorithms is
actually a reformulation of the problem of determining fuzzy c-partitions that was
solved by FCM algorithms. For p = 1/(1 - m), (38) takes the form

^g(;|>-/r-j
l-m

(42)
m-\
=1 J
M ""
which indicates that (38) is an alternative functional expression for the reformulation
function (37) that was discussed by Hathaway and Bezdek [24]. Moreover, for
p = 1/(1 — m), the update equation (40) can be written as
M
Δ^^^Πχ,.-ν,.) (43)
1=1

where {uy} are the membership functions (24) of the FCM algorithm.
The reformulation function (37) can also be written as Jm = Mcl~mRm, where
i M
Κ
>* = ΤΪΣ^ < 44 >
i=l
1 m
with/(x) = x - and

St = S,(x„ V) = - Σ sGlx, - ν,-ΙΙ2) (45)


Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

with g{x) =Γ\χ) = xm-m).


3.2. Reformulating ECFC Algorithms
ECFC algorithms can be reformulated using the same procedure that resulted in
the reformulation of the FCM algorithm. The functional (27) minimized by ECFC
algorithms can be written as
M c
a
^ = I? Σ Σ «ff(ta "</ + σΜ*< " Vyll2) (46)
;=1 j=\

If the membership function {«,·,} are optimally given as in (30), then

—"(έ exp(-CT<5Mi|x,-vJ2)
(47)

V=l
- In (cSd

where

S, = Sfa, V) = -J2 βχρ(-<τ*μ||x, - v,·||2) (48)


c
J=i

If {uy} are given in (30), then combining (46) and (47) gives
M c

i=l j=\ '


(49)

i=l ' 7=1

Since ^ = 1 exp(-a^||x,- - vy||2) = cS„ (49) gives

(=1 (=1 >*

Since the term -μ. In c is independent of the prototypes, the reformulation function
(50) that corresponds to ECFC algorithms can be simplified as Ιμ — σ(1 - μ)Κμ, where

M
,=i

with f(x) = - In χ/(σδμ) and


Section 4 Generalized Reformulation Function 171

S, = S,(x„ V) = - £ > ( l l x , - vyll2) (52)


c
j=\

with g(x) =f~\x) = ^χρ(-σδμχ).

4. GENERALIZED REFORMULATION FUNCTION


The FCM and ECFC algorithms correspond to reformulation functions of the same
basic form. This motivated the search for a generalized reformulation function that can
lead to a variety of soft LVQ and clustering algorithms [11,13]. Consider the family of
functions of the general form

/=i

where

S, = Sfa, V) = -J2 «ill* - M2) (54)

The search for admissible reformulation functions of this form requires the determina-
tion of the conditions that must be satisfied by the functions /(·) and g(·) involved in
their definition, which are assumed to be differentiable everywhere.
4.1. Update Equations
Minimization of admissible reformulation functions of the preceding form using
gradient descent can produce a variety of batch LVQ algorithms whose behavior and
properties depend on the functions/(·) and g(-). The gradient VT R = dR/dvj of R with
respect to the prototype v, can be determined as

j M

ί=1

ν
M dS ' ν,'3ί
U i
M c
1 (\ \
2
=^Σ/'<^^Σ^-^ )) <55>
. M

= ττ; Σ / ' ^ Ό ΐ χ , - - y/iiX(iixi - */ii2)


i=l
M
2
E/'OS/te'Olx.-VyfXXi-V,·)
Mc . ,
1=1
172 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

The update equation for the prototypes can be obtained according to the gradient
descent method as

Δν,- = -ί?;νν.Α

" ( 56 )
= Vj 2_ ay(x< - vj)
(=1

where η} = (2/Mc)??/ is the learning rate for the prototype v; and the competition
functions {ay} are computed in terms of/(·) and g(·) as

<*ij=f'(Si)g'(\\xi-yj\\2) (57)

The LVQ algorithms derived earlier can be implemented iteratively. Let v7>_i,
1 <j < c, be the set of prototypes obtained after the (v — l)th iteration. According to
the update equation (56), a new set of prototypes vy v, 1 <j<c, can be obtained
according to

M
v,· „ = y,-iV_i + nj<v ] Γ α0>(χ,· - ν,-,,,-ι), 1 <j <c (58)
i=l

where ην is the learning rate at iterate v and aijv =/'(S,>_i)g'(||x; — v,>_il|2), with
Si<v_i = (l/c)Y^=lg(\\Xi■ — V/,v_i ||2). Under certain conditions, the LVQ algorithms
described by the update equation (58) reduce to iterative clustering algorithms. The
update equation (58) can also be written as

( 1-
M \ M

%v X! a'7,v Kv-i + %v Σ α'7.Λ·' l


-J - c
(59>
According to (59), ν,·.„ can be evaluated only in terms of the feature vectors x, € X if the
direct effect of v;>_i diminishes. This can be accomplished if 1 - ηί<ν Σ^ίι aij,v = 0 or>
equivalently, if

^=(£«^1 ' 1
^J^C (6°)

The closed-form formula that can be used to compute the prototypes at each iteration
can be obtained by substituting the learning rates at (60) in the update equation (59) as

2_)=1 aij,vXi
V.»- v-M a - 1<7'<C (61)
2^=1 y,v
Section 4 Generalized Reformulation Function 173

4.2. Admissible Reformulation Functions


The search for admissible reformulation functions is based on the properties of the
competition functions ay = a,y(x,-, V), which regulate the competition between the pro-
totypes \j, 1 < j < c, for each feature vector x, e X. The following three axioms
describe the properties of admissible competition functions [10-13]:

Axiom 1: If c = 1, then an = 1, 1 < i < M.

Axiom 2: ay >0,l<i<M;l<j<c.

Axiom 3: If ||x,· - Vp||2 > ||x,· - v^H2 > 0, then aip < aiq, 1 <p, q < c, and p Φ q.
Axiom 1 indicates that there is actually no competition in the trivial case where all
feature vectors x, e X are represented by a single prototype. Thus, the single prototype
is equally attracted by all feature vectors x, e X. Axiom 2 implies that all feature
vectors x, e X compete to attract all prototypes y,·, 1 <j<c. Axiom 3 implies that a
prototype v^, that is closer in the Euclidean distance sense to the feature vector x, than
another prototype yp is attracted more strongly by this feature vector.
Axioms 1-3 lead to the admissibility conditions for reformulation functions sum-
marized by the following theorem [11,13]:

Theorem 1: Consider the finite set of feature vectors X = {xj, x 2 ,..., χ Μ ), which
are represented by the set of c < M prototypes V = {vi, v 2 ,..., vc}. Then the func-
tion R defined by (53) and (54) is an admissible reformulation function of the first
(second) kind iff(x) and g(x) are differentiable everywhere functions of x € (0, oo)
satisfying/(g(x)) = x,f(x) and g(x) are both monotonically decreasing (increasing)
functions of x e (0, oo), and g'(x) is a monotonically increasing (decreasing) func-
tion of x e (0, oo).

4.3. Special Cases


The reformulation function that corresponds to the FCM algorithm can be
obtained as a special case of the general reformulation function defined by (53) and
(54) with/(.x) = xx~m and g(x) =f~\x) = xm'm). In this case,/(x) and g(x) are both
monotonically decreasing functions of x e (0, oo) if 1 — m < 0 or m> 1. For
m e (1, oo), g'{x) and f'(x) are both increasing functions of x € (0, oo) and R is an
admissible reformulation function of the first kind. This range of values of m is con-
sistent with the formulation that resulted in the FCM algorithm [1]. Nevertheless,
Karayiannis and Bezdek [15] observed that fuzzy LVQ and clustering algorithms can
be derived by minimizing Lp in (38) for p — 1/(1 - m) e (—oo, 0) U (0,1) or, equiva-
lent^, for m e (-oo, 0) U (1, oo). This observation is also consistent with Theorem 1.
Clearly, f(x) and g(x) are both monotonically increasing functions of JC € (0, oo) if
1— m > 0 or m < 1. Nevertheless, the interval (0, 1) is excluded because g'(x) is mono-
tonically decreasing function of x e (0, oo) only if m < 0. For m € (-oo, 0),/'(x) is an
increasing function of x e (0, oo) and R is a reformulation function of the second kind.
The reformulation function corresponding to the ECFC algorithm can be obtained
as a special case of the general reformulation function defined by (53) and (54) with
f(x) = — Ιηχ/(σδμ) and g(x) =f~l(x) = exp(—aSßx), where δμ = (1 - μ)/μ and σ > 0.
174 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

In this case, f(x) and g(x) are both monotonically decreasing functions of x e (0, oo) if
δμ = (1 - μ)/μ > 0 or, equivalently, if μ e (0,1). For μ e (0,1), g'(x) = -σδμ
εχρ(-σ<5μΧ) is a monotonically increasing function of x e (0, oo) and R is a reformula-
tion function of the first kind. If μ e (-oo, 0) U (1, oo), then δμ = (1 — μ)/μ < 0 and
both/Xx) and g(x) are monotonically increasing functions of x e (0, oo). However, the
functions f(x) and g(x) fail to form admissible reformulation functions of the second
kind for μ e (-oo, 0) U (1, oo) because in this case g'(x) is an increasing function of
x e (0, oo) and, thus, violates one of the admissibility conditions of Theorem 1.

CONSTRUCTING REFORMULATION
FUNCTIONS: GENERATOR FUNCTIONS

The function g(-) that results in the FCM algorithm has the form g(x) = (goW)1/(1""l),
where g0(x) = xandm e (—oo, 0) U (1, oo). If m e (1, oo), the partition produced by the
algorithm approaches asymptotically a crisp c-partition as m -*■ 1 + . The partition
becomes increasingly fuzzy as m increases and approaches a maximally fuzzy partition
as m ->■ oo. If m e (—oo, 0), the resulting partition becomes increasingly fuzzy as m
increases from —oo and approaches 0 from the left. The function g() that results in
the ECFC algorithm can also be written as

g(x) = expi-aS^x)

= exp/-a -x\ (62)

= (exp(ax)fß-1)/ß

For μ = (m - \)/m, (62) takes the form g(x) = (g0(x)Y/(l~m), with g0(x) = exp(ax). If
m ->· 1 + (μ -> 0), then <5μ -> oo and the partition produced by the resulting algorithm
approaches asymptotically a crisp c-partition. If m -*■ oo (μ ->· 1), then δμ ->· 0 and the
resulting algorithm produces a maximally fuzzy partition.
The FCM and ECFC algorithms are generated by a function g(·) of the form

g(x) - fe0(*))1/(1-m), mφ 1 (63)

with g0(x) = x and g0(x) = exp(ax), respectively. This is an indication that a broad
variety of reformulation functions can be constructed using function g(-) of the form
(63), where g0(x) is called the generator function. For a given generator function g0(-),
the corresponding function/(·) can be determined from g(x) = (goW) 1/il-m) m s u c n a
way that/(g(x)) = x, while the competition functions {ay} can be obtained using (57).
The following theorem summarizes the conditions that must be satisfied by admissible
generator functions [11,13].

Theorem 2: Consider the function R defined by (53) and (54). If g(-) is defined in
terms of the generator function g 0 () t n a t is continuous on (0, oo) as
g(x) = (So(*))1/(1~m\ m^l, then/(x) =/ 0 (* I ~ m ), with/0fe0(*)) = *· The generator
function g0(x) leads to an admissible reformulation function R if:
l.gQ(x)>0,Wxe(0,oo),
Section 6 Constructing Admissible Generator Functions 175

2. g0(x) is a monotonically increasing function of x e (0, oo), that is, go(x) > 0,
Vx e (0, oo), and
3. r0(x) = (m/(m - l))(^(x))2 - g0(x)gö(x) > 0, Vx e (0, oo),
or
l.go(*)>0,Vxe(0,oo),
2. g0(x) is a monotonically decreasing function of x e (0, oo), that is, go(x) < 0,
Vx 6 (0, oo), and
3. r0(x) = (m/(m - l))feo(x))2 - *o(*)*o (*) < 0, Vx e (0, oo).
If go(x) is an increasing generator function and m > 1 (m < 1), then /? is a reformula-
tion function of the first (second) kind. If go(x) is a decreasing generator function and
m > 1 (m < 1), then i? is a reformulation function of the second (first) kind.
The generator function go(x) = x > 0, Vx e (0, oo), that results in FCM algorithms
is a monotonically increasing function (go(x) = 1). Since go(x) = 0, the third condition
of Theorem 2 requires that m/(m — 1) > 0, which is valid for m > 1 or m < 0. If
m > 1 (m < 0), then Ä is a reformulation function of the first (second) kind.
Consider also the generator function g0(x) = exp(ax) > 0, Vx 6 (0, oo), that results
in ECFC algorithms. If σ > 0, then go(x) = σβχρ(σχ) > 0, Vx € (0, oo), and g0(x) =
exp(ax) is an increasing generator function. Since gS(x) = σ2 exp(ax), the third condi-
tion of Theorem 2 holds if m/(m — 1) > 1 or, equivalently, if m > 1. This implies that R
is an admissible function of the first kind. If σ < 0, then go(x) = βχρ(σχ) is a decreasing
generator function. Since it is required by the third condition of Theorem 2 that m < 1,
the decreasing generator function g0(x) = βχρ(σχ), σ < 0, also produces reformulation
functions of the first kind.

6. CONSTRUCTING ADMISSIBLE GENERATOR


FUNCTIONS

Theorem 2 indicates that the construction of admissible LVQ models reduces to the
search for admissible generator functions. A variety of admissible generator functions
can be determined by a constructive approach that begins from the admissibility con-
ditions of Theorem 2 and determines the form of admissible generator functions by
solving a differential equation.
The construction of admissible generator functions can be attempted by letting
go(x) be a function of g0(x), that is,

go(x)=p(go(x)) (64)

where \/p{x) is an integrable function. Theorem 2 requires that g(,(x) > 0, Vx e (0, oo),
for m > 1 and go(x) < 0, Vx e (0, oo), for m < 1. Since it is also required by Theorem 2
that g0(x) > 0, Vx e (0, oo), the function /?(■) must be selected so that p(x) > 0,
Vx e (0, oo), if m > 1 and p(x) < 0, Vx e (0, oo), if m < 1. For such functions, the
admissibility conditions of Theorem 2 are satisfied by all solutions go(x) > 0,
Vx e (0, oo), of the differential equation (64) that satisfy the conditions r0(x) > 0,
Vx e (0, oo), for m > 1 and r0(x) < 0, Vx € (0, oo), for m < 1.
176 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

6.1. Increasing Generator Functions


Assume that m > 1 and let the function /?(·) be of the form

p(x) = kx", k>0 (65)

The function g0(·) can be obtained in this case by solving the differential equation

SO(*) = *feo(*))". £>0 (66)


According to (66), go'(x) = kn(g0(x))n~lgo(x) = k?n(go(x)f"~l. In this case,

roW = feoW)2(^-f-») (67)

If m > 1, it is required that r0(x) > 0, Vx e (0, oo), which holds for all m/(m - 1) > n.
For m > 1, m/(m — 1) > 1 and the inequality m/(m — 1) > n holds for all « < 1. For
n = 1, m/(m — 1) - n = l/(m — 1) > 0. Thus, the condition r0(x) > 0, Vx € (0, oo), is
satisfied for all n < 1.
Assume that m > 1 and consider the differential equation (66) for n < 1. For n = 1,
(65) corresponds to p(x) = kx. In this case, the solutions of (66) are

g0(x) = cexp(ax) (68)

where c > 0 and σ = k/c > 0. The remainder of this chapter investigates generator
functions of the form g0(x) = exp(ax), σ > 0, which result from (68) by setting c = 1.
For n < 1, the admissible solutions of (66) are of the form

g0(x) = (ax + b)q (69)

where q = 1/(1 - «) > 0, a = k(\ - «) > 0 and b > 0. For n = 0, p(x) = k and (69) leads
to linear generator functions

g0(x) = ax + b, a>0, b>0 (70)

For a = 1 and b = 0, (70) gives the generator function g0(x) = x, which leads to FCM
and FLVQ algorithms.
6.2. Decreasing Generator Functions
Assume that m < 1 and let

p(x) = -kx", k>0 (71)

This function />(·) satisfies the condition p{x) < 0, Vx e (0, oo). For this function, g 0 () is
a solution of the differential equation

gfa) = -k(Mo(x)T, k>0 (72)

From (72), gS(x) = -kn(g0(x)f-1 gfo) = k2n(g0(x))2"-1 and


Section 6 Constructing Admissible Generator Functions 177

r0(x) = (go(x))2(^-[-n) (73)

For m < 1, Theorem 2 requires that r0(x) < 0, Wx e (0, oo), which holds for all
m/(m — 1) < n. If m < 1, then m/(w — 1) < 1 and the inequality m/(m — 1) < n holds
for all n > 1. For n = 1, w/(w — 1) — « = l/(m — 1) < 0. Thus, the condition r0(x) < 0,
Vx € (0, oo), is satisfied for all n > 1.
Assume that m < 1 and consider the solutions of the differential equation (72) for
/I > 1. For n = 1, p(x) = —kx and the solutions of (72) are

g0(x) = cexp(-ax) (74)

where c > 0 and σ =fc/c> 0. For c = 1, (74) leads to decreasing exponential functions
of the form g0(x) = exp(-ax), σ > 0. For n > 1, the admissible solutions of (72) are of
the form

g0(x) = (ax + b)< (75)

where q = 1/(1 - n) < 0, a = k(n - 1) > 0 and b > 0. For n = 2, p(x) = -kx2 and (75)
leads to the generator functions

g0(x) = (ax + by1, a > 0, b > 0 (76)

6.3. Duality of Increasing and Decreasing


Generator Functions
For any function />(·) that leads to an increasing generator function go(x), there
exists another function /?(·) that leads to the corresponding decreasing generator func-
tion l/g0(x). Let g0(x) be an increasing generator function obtained by solving the
differential equation

^=Afro<*)) (77)

The differential equation that has l/go(x) as a solution can be obtained by substituting
in (77) g0(x) by l/g0(x) as

i(i^))=jP/(i^)) (78)

or, equivalently,

° W - - f e oWWM
* dx » ,Ä
! - ^>) (79)

If an increasing generator function g0(x) is a solution of the differential equation

go(x)=Pi(go(x)) (80)
178 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

then the corresponding decreasing generator function l/go(x) can be obtained as the
solution of

*o(*)=A*feo(*)) (81)
where pd(·) is given in terms of/>,·(·) as

Pd(x) =
= -xPi{-)
~χ2ρ(χ)
(82)

As an example, consider the family of increasing generator functions g0(x) obtained as


solutions of (80), with/>,(;*:) = kx", with n < 1. The corresponding decreasing generator
functions l/go(x) c a n De obtained by solving (81) with/^(jc) = — x?kx~" = —kx" , where
n' = 2-n> 1.

7. FROM GENERATOR FUNCTIONS TO LVQ AND


CLUSTERING ALGORITHMS

This section presents the derivation and examines the properties of LVQ and clustering
algorithms produced by admissible generator functions.
7.1. Competition and Membership Functions
Given an admissible generator function g0(·), the corresponding LVQ and cluster-
ing algorithms can be obtained by gradient descent minimization of the reformulation
function defined by (53) and (54) with g(x) = (g0(x)Y/(l~m), τηφ\. If g(x) =
fo,(x))1/(1-m), η,φί, then

g'(x) = τ-^*ό(*)(*ο(*)Γ /(Ι -- Μ) (83)


i —m
According to Theorem 1, any pair of admissible functions g(·) and /(·) satisfy the
condition f{g(x)) = x. If g(x) = (g0(x))l/(i~m), ηιφΐ, then the function /(·) has the
foim/(*) =fo(x1~m), where/0fe0(*)) = *· If/(*) =/o(* 1_m ), then

f'{x) = {\-mYi{xx-m)x-m (84)

If g{x) = (goW)1/(1-m)> m Φ 1 > t n e n the competition functions {a,·,·} can be obtained by


combining (83) and (84) with (57) as

«i^W"'0"-" (85)

where

eij = g^i-yJ\\lY^s\-m)
(δθ)

= fodlx,· - vyll Vo'MlIx,- - y/llVj/)


and
Section 7 From Generator Functions to LVQ and Clustering Algorithms 179

:
S0(I|X,--V;|| 2 )

-it
(8?)
2 l/(l-m)\ I"»»
^o(iix,-vj :n

Since (ay/0y)1/m = (yy)1/(m_1), it can easily be verified that {ay} and {%} satisfy the
condition

W-
l/m
1<i<M (88)

This condition can be used to determine the constraints imposed by the generator
function on the resulting c-partition by relating the competition functions {ay} with
the corresponding membership functions {uy}. Fuzzy LVQ and clustering algorithms
can be obtained as special cases of the proposed formulation if the corresponding
generator functions produce fuzzy c-partitions. A generator function produces fuzzy
c-partitions if the membership functions {uy} determined in terms of the corresponding
competition functions {ay} and {#,·,} satisfy the condition

Y^uy = \, \<i<M (89)

If the condition (89) is not satisfied by the membership functions {uy} formed in terms
of {ay} and {#,·,}, then the proposed formulation produces soft LVQ and clustering
algorithms.
The update equation (58) involving the competition and membership functions
obtained in this section can produce clustering or batch learning vector quantization
algorithms. The update equation (58) produces a clustering algorithm if m is fixed
during the learning process and the learning rates are computed at each iteration as
1j,v = (Hi=i %v) -1 · The update equation (58) produces a learning vector quantization
algorithm if η},tV = (Σ/=ι aij,v)~l a n d the value of m is not fixed during learning. If m
decreases during learning, the update equation (58) produces descending learning vec-
tor quantization algorithms [20]. In such a case, the algorithms produce increasingly
crisp c-partitions as the learning process progresses. In practice, m is often calculated at
iterate v as

m — mv = mj + v[(ntf — m^/N] (90)

where m, and nif are the initial and final values of m, respectively, and N is the total
number of iterations [20].
180 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

7.2. Special Cases: Fuzzy LVQ and Clustering


Algorithms
The remainder of this section presents fuzzy LVQ and clustering algorithms pro-
duced by linear and exponential generator functions.

7.2.1. Linear Generator Functions

Consider the generator function go(x) — x, which corresponds to fo(x) = x and


fd(x) = 1. For this generator function, (87) gives

According to (86), 0,·,· = 1, Wi, j , and at] = (Yy)mKm~l)- For g0(x) = x, the competition
functions {ay} can be obtained using (91) as

•-(;§esCT
This is the form of the competition functions of the batch LVQ algorithms developed
by minimizing (38) using gradient descent [15].
Since 0,-,- = 1, Vi, j , the condition (88) becomes.

-i,(piv)l/m = l, \<i<M (93)


° 7=1

The condition (93) implies that the linear generator function g0(x) — x produces fuzzy
c-partitions, with the membership functions {uy} obtained from the competition func-
tions [ay] as uy = {aij)l,mlc. Using (92),

which are the membership functions of the FCM algorithm [1].


The learning vector quantization algorithm resulting from the generator function
g0(x) = x can be implemented as the FLVQ, which can be summarized as follows
[15,20,23]:
1. Select c, mh ntf, and e; fix N; set v = 0.
2. Generate an initial set of prototypes V0 = {vi0, v2,o> · · ·. vCfi)·
3. Setv = v + 1 .
4. Calculate:
• m — mi + v[(mf — m,)/N\.
Section 7 From Generator Functions to LVQ and Clustering Algorithms 181
m
/ c /,, 2x i/(m-i)\ -

• ^ = (E"i«ff.v) _1 . 1<7<C.
V;- v = ν,·,ν_ι + ^> Σ,=1 «y,v(Xi - V/,v-l), 1<j < C
ν 2
• ^ = Σ;=ιΐΐν- ;>-ιΐι ·
5. If v < N and Ev > e, then go to step 3.

7.2.2. Exponential Generator Functions

The proposed formulation can also produce fuzzy c-partitions if go(x) = exp(ax),
which corresponds to/ 0 (x) = In χ/σ and/0'(*) = 1/(σχ). For this generator function,
(87) gives

In this case, θϋ = (^)" 1 and αϋ = %(Ky)m/(m-1) = (7y)1/(m_1). For g0(x) = exp(ax), the
competition functions {or,-,} can be obtained using (95) as
- 1
/ , \ l/(m-l)\
2
/exp(a||x;.-v,|| )\ \
εχρίσΙΙχ,-ν,ΙΙ2)/ J

This is the form of the competition functions of entropy-constrained learning vector


quantization (ECLVQ) algorithms [8].
For this generator function, the condition (88) becomes
1 C

- ] Γ « 0 = 1, \<i<M (97)

According to (97), the exponential generator function go{x) = exp(crx) also produces
fuzzy c-partitions, with the membership functions {«,·,} obtained from the competition
functions {ay} as uy — cty/c. For μ = (ηι— \)/m and δμ = (1 — μ)/μ, (96) gives

βχρ(-σί μ ||χ,-ν ; || 2 )
Σ*=ι βχρί-σδ^ΙΙχ,-ν^ΙΙ2)

which are the membership functions of ECFC algorithms [5,6].


The learning vector quantization algorithm resulting from the generator function
go(x) = εχρ(σχ) is the entropy-constrained LVQ (ECLVQ), which can be summarized as
follows [7]:
1. Select c, mh ntf, and e;fixN; set v = 0.
182 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

2. Generate an initial set of prototypes V0 = {v 10 , v2,o, · · ·, vc>o}.


3. Compute σ = σ0 (Mc In c)/(£Zi Σ£=ι II* - T/.olP)-
4. Set v = v + 1 .
5. Calculate:
• m = mi + v[(my — m,)/7V].

/^χρΜχ,.-ν,^Ι^-'Υ1 .
• «»> = 2_\ -Γ-» ΠΪΓ
,1 <
< '. < ^ 1 <7 < c.
< M;
^\εχρ(σ||χ,-νΑι,_ιΓ)/ y

• ^ =(E£i«j/' i<y<c.
• \JiV = v,- „_, + ??,· „ Σ ^ ι a,y,«(x,· - v/,v-i), 1 < 7 < c.
• σ = - ο · ο ( Σ £ ι Σ ; = Ι "/,> In «*,ν)/(Σ£ι Σ)=ι «fcvll* - v,>|| 2 ).
• £v = I£=illy/.v-7/.v-ill 2 .
6. If v < ./V and Ev > e then go to step 4.

8. SOFT LVQ AND CLUSTERING ALGORITHMS


BASED ON NONLINEAR GENERATOR
FUNCTIONS

The procedure presented in Section 5 for constructing admissible generator functions


indicated that solving the differential equation go(x) = k(g0(x))n with k > 0 and n < 1
produces nonlinear generator functions of the form go(x) = (ax + b)q, q > 0, with a > 0
and b > 0. This section presents the derivation and examines the properties of the soft
LVQ and clustering algorithms produced by g0(x) = (ax + b)q, q > 0, with a = 1 and
b = 0. For q > 0, g0(x) = x9 is an increasing generator function. For q=\, go(x) = x9
reduces to the linear function go(x) = x that produces the F C M and FLVQ algorithms.
For g0(x) = xq, go(x) = qxq~\ g((x) = q(q - \)xq~2, and

r0(x) = q(l-T^yq-l) (99)

The condition r0(x) > 0 is satisfied for all x 6 (0, oo) if

il-T^)>° (100>
For q > 0, the inequality (100) holds if 1 > q/(\ — m) > 0. If m > 1 (1 — m < 0), then
the inequality 1 > q/(\ -m) holds if q > 1 — m, which is true in this case since q > 0
and 1 — m < 0. If m < 1 (1 — m > 0), then the inequality 1 > q/(l —m)>0 holds if
1 — m > q > 0.
If m > 1, then goOO = xq generates admissible reformulation functions of the first
kind for all q > 0. If m < 1, then g0(x) = xq generates admissible reformulation func-
tions of the second kind only if 0 < q < 1 — m. If q = 1, then g0(x) = xq generates
admissible reformulation functions of the first kind if m > 1. If q = 1 and m < 1,
then g0(x) = xq generates admissible reformulation functions of the second kind if
Section 8 Soft LVQ and Clustering Algorithms Based on Nonlinear Generator Functions 183

1 < 1 - m, or, equivalently, if m < 0. In addition to the reformulation functions gen-


erated by the increasing generator function g0(x) = xq, q > 0, there exist admissible
reformulation functions of the second kind that can be produced by the decreasing
generator function g0(x) = xq, q < 0, if m > 1 and q < 1 — m < 0. Figure 2 shows the
combinations of values of q and m that produce admissible reformulation functions of
the first and second kind.
For the generator function g0(x) = xq, q> 0, fo(x) = göl(x) = χ1/ς aQ d fo(x) =
(x - )/q. The condition /„&<>(*)) = x implies that go(||x,-- ν,·||2) /0'(?ο(ΙΙ*ι-
(l 9)/q

ν,-H2)) = 1. Since fö(xy) = qfd(x)fd(y), {θν} can be obtained from (86) as θ9 = ifi/Oy),
or, equivalently, as

0y = iYiji,(!-*)/« (101)

where

\-m

Yii (102)
~v^Mi-^\2)
The competition function {ay} can be obtained from (85) as ay — ey(Yy)mnm ', or,
equivalently, as

2x ql(m-l)\
α
ν ^JlH^L^L
= ΑζΣ θ
l=\ Pf-^l
(103)

where {6y} are defined in (101).


If q — 1, then 0,·, = 1, Vi',7, and

^
Figure 2 Reformulation functions of the first
kind (indicated by vertical shading) and the
second kind (indicated by horizontal shading)
generated by generator functions of the form \
g0(x) = xq for different values of q and m. The
reformulation functions of the first and sec-
X
ond kind associated with the FCM and
FLVQ algorithms are represented by the line
horizontal to the m axis located at q = 1. \
184 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

•-(;έ(κί)"Τ
This is the form of the competition functions of the FLVQ algorithm [8,20]. In this case,
ay = {cuy)m, where

are the membership functions of the FCM algorithm [1].


For q φ 1, the membership functions [uy] can be obtained from the competition
functions {av} as uv = (cty)l/m/c = (ev)Vm(yv)l,<,n~l)/c. Using (103), {uy} can also be
written as
1
Λ f/(m-l)> -
l
— (a..\,m
/">
"r(¥ LUrTi2 (106>
Mil**-**« /
For q φ I, the membership functions (106) resulting from the generator function
g0(x) = y? do not satisfy the constraint (89) that is necessary for fuzzy c-partitions.
The c-partitions obtained by relaxing the constraint (89) are called soft c-partitions and
include fuzzy c-partitions as a special case.
The properties of the LVQ and clustering algorithms corresponding to the gen-
erator function g0(x) = x9, q > 0, are revealed by the limiting behavior of the competi-
tion functions. The behavior of the competition functions for afixedvalue of m > 1 as q
spans the interval (0, oo) can be given as

hm ay(m,q) = -^— (107)


?->o+ ιΐχ,'-v/ir
where ({||x, - Vi\\ }ieMc)G denotes the geometric mean of {||x,· — v^H 2 }/^, defined as
2

({||Xi - Vi\\\essc)G= (f\ llx,· - V/IN (108)

For a finite m > 1,

lim ciy(m, q) = cuy (109)


q-+oo

where {uy} are the membership functions that implement the nearest prototype parti-
tion of the feature vectors associated with crisp c-means algorithm, defined as

uc = ( 1, if ||x( - v;||2 < ||z, - v,||2, V€ φ] {χ 1Q)


ij
\ 0, otherwise
Section 8 Soft LVQ and Clustering Algorithms Based on Nonlinear Generator Functions 185

The behavior of the competition functions for a fixed value of q > 0 as m spans the
interval (1, oo) can be given as

lim ay(m, q) = cu,y (111)

For a finite q > 0,

hm a9(m,g) = p-*— (112)



m-*oo ||X; V/H

In summary, the algorithms generated by g0(x) = xq, q > 0, produce asymptotically


crisp c-partitions for fixed values of m > 1 as q -*■ oo and for fixed values of q > 0 as
m -> 1 + . The partitions produced by the algorithms become increasingly soft for fixed
values of m > 1 as q -> 0 and for fixed values of q > 0 as m -*■ oo.
8.1. Implementation of the Algorithms
The soft clustering algorithms resulting from the proposed formulation with
g0(x) = xq can be summarized as follows:
1. Select c, m, and e; set v = 0.
2. Generate an initial set of prototypes V<> = {vi,o> V2,o> · · ·. vc,o)·
3. Set v = v + 1 .

• %,v = (y^v) 0 "*^, l</<Af; 1<;<C.


17 1 0
• %„ = (^,ν) ™^.») ^" ^, 1 < I' < M; 1 <j < c.
• V = (Σ^ιί^.νΓχΜΣ^ιΚνΓ), 1 < / < c·
• ^ = Σ>ιΙΙν-Τ/.»-ιΙΙ 2 ·
4. If Ev > e, then go to step 3.
The soft learning vector quantization algorithms resulting from the proposed for-
mulation with go(x) = xq can be summarized as follows:
1. Select c, mh mf, fix N; set v = 0.
2. Generate an initial set of prototypes V0 = {vj 0, v2,o. · · ·, vc o).
3. Set v = v + 1 .
4. Calculate m = m, + v[(nif — m^/N].

l m
2χ q/(l-m)\ ~

1 <i<M; 1 <j < c.


186 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

• %,„ = (Yy,vr-q)/(1, 1 < i < M; 1 <j < c.


• «,» = fy>0Wm/(m_1), 1 < I' < M- \<j<C.
• ^> = (Σίι%νΓ1, i</<c.
• y,· „ = v,· v_! + ??y,v Σ,·=ι αΰ>(χι - y/,v-i). 1 < i < c.
• i» = E ? = l B » ; , v - V l l ·
5. If v < JV and Ev > e, then go to step 3.

9. INITIALIZATION OF SOFT LVQ AND


CLUSTERING ALGORITHMS
One of the central issues involved in the application of LVQ and clustering algorithms
is their initialization, that is, the formation of the set of prototypes used as a reference in
their first iteration. There is experimental evidence that soft LVQ and clustering algo-
rithms are not as sensitive to their initialization as crisp LVQ and clustering algorithms
[8,26]. The reason is that, unlike crisp LVQ and clustering algorithms, soft LVQ and
clustering algorithms partition the feature space on the basis of soft instead of crisp
decisions. This strategy allows soft LVQ and clustering algorithms to search for near-
optimum partitions in a subspace of all admissible partitions that is not necessarily
restricted to a small neighborhood centered at the initial partition. Although the initi-
alization of soft LVQ and clustering algorithms has little or no effect on their perfor-
mance on simple data sets [10,11], the application of such algorithms in more
challenging problems indicated that their initialization has a rather significant effect
on their performance. Soft LVQ and clustering algorithms can be initialized using the
prototype splitting procedures described next.

9.1. A Prototype Splitting Procedure


This procedure begins with a single prototype and designs a codebook containing c
prototypes by splitting one prototype at each iteration. The first prototype vj is calcu-
lated as the centroid of the entire set of feature vectors x„ 1 < i < M. In the first
iteration, the prototype Vi is split to create a codebook of size 2 containing a new
prototype v2 and an updated version of the original prototype V]. One of these two
prototypes is subsequently split to produce a codebook of size 3. This procedure is
repeated until a codebook of the desired size is available.
Suppose that after v — 1 iterations the codebook contains the prototypes y,·,
1 < j < v. Let Xj be the set of integers between 1 and M formed by the indices of the
feature vectors represented by the prototype y,·, that is,

Ij = {il < i < M : ||x,· - ν,·||2 < ||x, - v,||2, VI φ]} (113)

Since U/=1I,· = {1,2,..., M], the total quantization error associated with the represen-
tation of the entire set of feature vectors by the prototypes v7, 1 <j<v, can be calcu-
lated as E = Σ]=ί Ej, where Ej is the quantization error associated with the
representation of the feature vectors {x,}iei. by the prototype vy, that is

^=Σΐ' χ '- τ >ιι 2 (114)


ielj
Section 9 Initialization of Soft LVQ and Clustering Algorithms 187

The largest possible reduction of the total quantization error E can be produced by
splitting the prototype that corresponds to the largest among the quantization errors
denned in (114). More specifically, the prototype yk is split if Ek > Ej, V/' ψ k.
Suppose vk is the prototype selected for splitting according to the splitting criterion
just presented. Let xe be the feature vector that has the maximum distance from yk
among all {x,·},·^, that is, ||x* - \kf > ||x,· - ν*||2, Vz φ I. According to (114), x* has
the largest contribution to the quantization error Ek associated with the representation
of the feature vectors {χ,}ί€χ4 by the prototype vk. Thus, the error Ek can be reduced by
splitting the prototype vk along the direction yk — xt. Splitting of the prototype yk
produces two new prototypes yk and vv+1. The new prototype vk is obtained by updat-
ing the original prototype v* as

yk+-(l-8)yk + 8xt = yk + 8(xt-yk) (115)

where 8 e (0,1). According to (115), the new prototype v^ is attracted by the feature
vector xt by an amount determined by 8 € (0,1). As 8 increases from 0 to 1, the new
prototype vk moves closer to the feature vector xt. In order to balance the effect of
moving yk toward xt, the new prototype vv+1 is obtained by moving the original pro-
totype yk by the same amount in the opposite direction. More specifically, the new
prototype is obtained by updating the original prototype yk as

vv+1 <- (1 + 8)yk - 8xt = vk - 8(x( -yk) (116)

where 8 e (0,1). According to (116), the prototype vv+1 is repelled by the feature vector
Xi by an amount determined by 8 e (0,1). As 8 increases from 0 to 1, the new prototype
v„+1 moves away from the feature vector xt. The codebook resulting after sphtting one
prototype can be improved by calculating each prototype as the centroid of all feature
vectors belonging to its corresponding cluster. More specifically, each prototype v,· is
computed as the centroid of the feature vectors {χ,·},€χ., with 2} defined in (113).

9.2. Initialization Schemes


Soft LVQ and clustering algorithms were initialized in the experiments presented in
this chapter by the following three initialization schemes:

• Initialization Scheme 1 (II): According to this scheme, the initial prototypes are
randomly selected from the feature space, that is, the subspace of W containing
all feature vectors that form the inputs to the clustering or LVQ algorithm.
• Initialization Scheme 2 (12): This initialization scheme is based on the prototype
splitting procedure described in Section 9.1.
• Initialization Scheme 3 (13): This initialization scheme is based on the same
prototype splitting approach. The only difference is that this scheme employs
the clustering or LVQ algorithm used to produce the final set of prototypes
every time a new prototype is created by splitting one of the existing
prototypes.
188 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

10. MAGNETIC RESONANCE IMAGE


SEGMENTATION

The clinical utility of magnetic resonance (MR) imaging rests on the contrasting image
intensities obtained for different tissue types, both normal and abnormal. For a given
MR image pulse sequence, image intensities depend on local values of the following
relaxation parameters: the spin-lattice relaxation time (Tl), the spin-spin relaxation
time (T2), and the Flair. In the context of MR imaging, segmentation usually implies
the creation of a single image with fewer intensity levels that correspond to different
segments. The development of MR image segmentation techniques has been attempted
using classical pattern recognition techniques [30], rule-based systems [31], image ana-
lysis methods and mathematical morphology [32], supervised neural networks [33], and
unsupervised clustering procedures [33].
Segmentation of MR images is formulated to exploit the differences among local
value of the Tl, T2, and Flair relaxation parameters. The values of these parameters
represent the intensity levels (pixels) of a set of three images, namely the Tl-weighted,
T2-weighted, and Flair-weighted images. Let xT1 xT2, and xFlair be the pixel values of
the Tl-weighted, T2-weighted, and Flair-weighted images, respectively, at a certain
location. The relaxation parameter values χτχ, χτ2, and χρ^ can be combined to
form the vector

X = — [xri XT2 *Flair] (117)

An MR image composed by the Tl-weighted, T2-weighted, and Flair-weighted


images of size N x N can be represented in the segmentation process by iV2 feature
vectors formed using their pixel values according to (117). Given the representation of
an MR image by a set of feature vectors, MR image segmentation can be formulated as
an unsupervised clustering or vector quantization process. In such a case, the feature
vectors are partitioned into a set of clusters, each represented by a prototype. Following
the completion of the unsupervised clustering or vector quantization process, the seg-
mented MR image is produced by mapping each feature vector into its closest proto-
type. According to this formulation, the number of segments in the segmented MR
image is equal to the number of prototypes produced by the clustering or vector
quantization algorithm. The number of segments must be selected sufficiently large
to accommodate the different tissues in the brain, fat, skin, and background. The
diagnostic value of segmented MR images is often improved by assigning pseudocolor
to different prototypes to produce segmented images containing colored segments.
The soft LVQ and clustering algorithms proposed in this chapter were evaluated
and tested on the MR image of the brain of an individual who had undergone surgery
to remove a brain tumor. Figure 3 shows the Tl, T2, and Flair components of the brain
MR image used in the experiments, which contains a surgical cavity, remaining tumor,
and edema in addition to normal brain tissues, such as white matter, gray matter, and
cerebrospinal fluid (CSF). The segmented MR images were evaluated in terms of their
diagnostic value by a neurologist and an expert in diagnostic imaging. More specifi-
cally, the segmented images produced by various clustering and LVQ algorithms were
presented to the evaluators in random order together with the Tl, T2, and Flair com-
ponents of the MR image. The evaluators assigned to each of the segmented images a
Section 10 Magnetic Resonance Image Segmentation 189

(c)

Figure 3 MR image of an individual who had undergone surgery to remove a brain


tumor: (a) Tl -weighted, (b) T2-weighted, and (c) Flair-weighted images.

grade ranging from 0 to 10, with 0 representing the lowest diagnostic value and 10
representing the highest diagnostic value. The average of the two grades assigned to
each segmented image was used in the evaluation to represent its diagnostic value. The
average is indeed a reliable measure of diagnostic value because the two evaluators
assigned very similar grades to the majority of the segmented images and disagreed only
in their assessment of segmented MR images of low diagnostic value.
190 Chapter 7 Soft Learning Vector Quantization and Clustering Algorithms

In the first set of experiments, the MR image was segmented by the soft clustering
algorithms generated by g0(x) = xq, which were tested with m = 2 and q = 0.7, q=\,
q=\.2,q = \.5, and q = 2. The algorithms were used to produce eight segments, which
implies that each algorithm generated c = 8 prototypes. Table 2 summarizes the results
of the evaluation of the segmented MR images produced by soft clustering algorithms
initialized by all three initialization schemes. Regardless of the initialization scheme
used, the segmented images produced by the algorithms tested with q = 1, q = 1.2, and
q = 1.5 were hardly distinguishable from each other. On the other hand, the initializa-
tion scheme had a rather significant effect on the diagnostic value of the segmented
images produced by the algorithms tested with q = 0.7 and q = 2.

TABLE 2 Evaluation of Soft Clustering Algorithms Generated by 0b(x) =


xq with q = 0.7, q = 1 (FCM), q = 1.2, q = 1.5, and q = 2"

Initialization 11 12 13
9 = 0.7 4.25 6.75 6.25
9=1.0 6.75 6.75 6.75
q=l.2 6.75 6.75 6.75
9=1.5 6.75 6.75 6.75
? = 2.0 6.75 5.50 6.75
" The algorithms were tested with m = 2 and JV = 50 and the clustering process
was initialized using the schemes II, 12, and 13.

Figure 4 shows the segmented images produced by the soft clustering algorithms
initialized by the 12 initialization scheme and tested with q = 0.7, q = 1, q = 1.2, and
q = 2. In this set of segmented images, red segments represent tumor, blue segments
represent edema surrounding the tumor, white segments represents CSF or CSF filling
in the surgical cavity, yellow segments represent white matter, and green segments
represent gray matter. There are also some segments of pink color within the area
segmented as tumor (red color). Since labeling these segments was difficult, the exis-
tence of pink segments and their size relative to that of the red segments was con-
sidered to be a liability and had a negative impact on the evaluation of the segmented
images. In addition to the segments already mentioned, there are segments occupying
areas that do not correspond to brain matter, such as air, fat, skin, and background.
The algorithms were capable of identifying the surgical cavity, the tumor, and the
edema. Nevertheless, there was concern that the volume of CSF might be under-
estimated in the segmented images. This is probably the cause of a slight overestima-
tion of gray matter. Finally, there are some green pixels (representing gray matter)
appearing within the blue segment that represents edema. According to Figure 4, the
segmented images produced for q = 0.7, q = 1, and q = 1.2 cannot be distinguished
from each other. However, the segmented image produced for q = 2 differs from all
the rest. Gray matter (represented by green color) is almost completely absent from
this segmented image. Another significant difference between the segmented MR
image produced for q = 2 and the rest is the appearance of two yellow patches
(representing white matter) within the red area (representing remaining tumor). The
same patches were segmented as gray matter (green color) by the algorithms tested
with q = 0.7, q = 1, and q = 1.2.

Figure 4 Segmented MR images produced by soft clustering algorithms generated by g0(x) = x^q with (a) q = 0.7, (b) q = 1, (c) q = 1.2, and (d) q = 2. The algorithms were tested with m = 2 and the clustering process was initialized using the scheme I2.

In the second set of experiments, MR image segmentation was performed by soft
LVQ algorithms generated by g0(x) = x^q and tested with different values of q. The
value of m decreased linearly during the learning process from m_i = 5 to m_f = 2 in
N = 50 iterations. Table 3 summarizes the results of the evaluation of the segmented
MR images produced by soft LVQ algorithms initialized by all three initialization
schemes and tested with q = 0.7, q = 1, q = 1.2, q = 1.5, and q = 2. The segmented

TABLE 3 Evaluation of Soft LVQ Algorithms Generated by g0(x) = x^q with q = 0.7, q = 1 (FLVQ), q = 1.2, q = 1.5, and q = 2^a

Initialization    I1      I2      I3
q = 0.7          3.50    3.50    2.25
q = 1.0          3.75    3.75    2.50
q = 1.2          6.25    6.25    2.75
q = 1.5          7.00    7.00    7.00
q = 2.0          7.00    7.00    7.00

^a The algorithms were tested with values of m decreasing linearly from m_i = 5 to m_f = 2 in N = 50 iterations and the learning process was initialized using the schemes I1, I2, and I3.

images produced for q = 0.7 and q = 1 (FLVQ) were of low diagnostic value. This is
clearly indicated by Figure 5, which shows the segmented MR images obtained for
q = 0.7, q = 1, q = 1.2, and q = 2. In Figures 5a and b, red color (representing tumor)
surrounds the edema and the region of the MR image occupied by skin. Red color,
representing tumor, also appeared in the cortex and the edema. The diagnostic value
of the segmented images improved for values of q above 1.2, as indicated by Figures 5c
and d.
Table 3 also indicates that the performance of the soft LVQ algorithms tested in
these experiments was rather strongly affected by the scheme used to produce the initial
set of prototypes. Regardless of the value of q, there were no visible differences among
the segmented images produced by the soft LVQ algorithms initialized by the initializa-
tion schemes I1 and I2. For q = 0.7, q = 1, and q = 1.2, the soft LVQ algorithms
initialized by the initialization scheme I3 produced segmented MR images inferior to
those produced by the same algorithm employing the schemes I1 and I2 for initializa-
tion. Compared with the soft clustering algorithms tested with a fixed value of m, soft
LVQ algorithms were more strongly affected by the value of q. This can be attributed to
the fact that soft LVQ algorithms were implemented with values of m that decreased
linearly from m_i = 5 to m_f = 2 during the learning process. In the initial stages of the
learning process, where the values of m are considerably higher than 1, the soft LVQ
algorithms tested with low values of q tend to produce increasingly soft partitions. This
leads to tissue mixing, which is also observed in the last stages of the learning process,
where the value of m is close to its final value of m_f = 2. Tissue mixing can be
remedied by increasing the value of q above 1. Values of q in this range lead to parti-
tions that are closer to a crisp partition and, thus, balance the effect of the high values
of m used in the initial stages of the learning process.
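
As a concrete illustration of the schedule just described, the sketch below implements an FLVQ-style batch update (the q = 1 case) with m decreasing linearly from m_i = 5 to m_f = 2 over N = 50 iterations. The learning-rate choice a_ik = (u_ik)^m and the random initialization are assumptions of this sketch, not necessarily the chapter's exact algorithm.

    import numpy as np

    def flvq(X, c=8, m_i=5.0, m_f=2.0, n_iter=50, seed=0):
        """FLVQ-style prototype learning on X of shape (n_samples, n_features)."""
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=c, replace=False)].astype(float)
        for t in range(n_iter):
            m = m_i + (m_f - m_i) * t / (n_iter - 1)       # linear schedule from m_i to m_f
            d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1) + 1e-12
            u = d2 ** (-1.0 / (m - 1.0))
            u /= u.sum(axis=1, keepdims=True)
            a_lr = u ** m                                  # learning rates from the memberships
            V = (a_lr.T @ X) / a_lr.sum(axis=0)[:, None]   # batch prototype update
        return V

Early iterations, where m is close to 5, spread the influence of every sample over all prototypes; this is exactly the soft regime that produces tissue mixing for small q, and raising q sharpens the memberships and counteracts it, as discussed above.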
The third set of experiments evaluated the effect of the parameter q on soft cluster-
ing and LVQ algorithms by comparing the segmented images produced by such algo-
rithms tested with values of q from the interval [0.7, 4]. Soft clustering algorithms were
tested with a fixed value of m = 2, whereas soft LVQ algorithms were tested with values
of m decreasing linearly from m_i = 5 to m_f = 2 in N = 50 iterations. All algorithms
tested in these experiments were initialized using the initialization scheme I2. Table 4
summarizes the results of the evaluation of the segmented images produced in this set of
experiments. Soft clustering algorithms produced segmented MR images of high diag-
nostic value when q was taking values between 0.7 and 1.5. For this range of values of q,
there were no visible differences among the segmented images produced by soft cluster-

Figure 5 Segmented MR images produced by soft LVQ algorithms generated by g0(x) = x^q with (a) q = 0.7, (b) q = 1, (c) q = 1.2, and (d) q = 2. The algorithms were tested with values of m decreasing linearly from m_i = 5 to m_f = 2 in N = 50 iterations and the learning process was initialized using the scheme I2.

ing algorithms. However, there was a noticeable degradation in the segmented MR
images as the value of q increased above 1.5. Soft LVQ algorithms produced segmented
MR images of high diagnostic value for values of q above 1.5. In fact, the best seg-
mented image was produced in this set of experiments by the soft LVQ algorithm tested
with q = 4. The outcome of these experiments indicated that soft clustering algorithms

TABLE 4 Evaluation of Soft Clustering and LVQ Algorithms Generated by g0(x) = x^q with Different Values of q^a

Algorithm    Soft Clustering    Soft LVQ
q = 0.7           6.75            3.50
q = 1.0           6.75            3.75
q = 1.2           6.75            5.75
q = 1.5           6.75            7.00
q = 2.0           5.50            7.00
q = 3.0           5.50            7.00
q = 4.0           5.50            7.50

^a Clustering algorithms were tested with m = 2 and LVQ algorithms were tested with values of m decreasing linearly from m_i = 5 to m_f = 2 in N = 50 iterations. All algorithms were initialized by the scheme I2.

achieved their best performance for values of q in a neighborhood of q = 1 (that is, the
value of q that corresponds to the FCM algorithm). On the other hand, soft LVQ
algorithms failed for values of q in a neighborhood of q = 1 (that is, the value of q
that corresponds to the FLVQ algorithm). In fact, soft LVQ algorithms produced
segmented MR images comparable or even superior to those resulting from soft clus-
tering algorithms for values of q between 1.5 and 4.

11. CONCLUSIONS

The reformulation of FCM and ECFC algorithms indicated that clustering algo-
rithms can alternatively be derived by minimizing their corresponding reformulation
function using gradient descent. It was also shown in this chapter that a broad
variety of soft LVQ and clustering algorithms can be developed by minimizing an
admissible reformulation function. According to this formulation, the development
of new algorithms reduces to the search for a generator function that satisfies certain
conditions. FCM and ECFC algorithms can be interpreted as special cases of the
proposed formulation corresponding to linear and exponential generator functions,
respectively.
The axiomatic approach presented in this chapter for the development of soft
LVQ and clustering algorithms can eventually replace alternating optimization as a
tool for vector quantization and clustering. The simplicity of the proposed approach
allows the development of algorithms that would not be the product of alternating
optimization techniques [10, 11]. In fact, the fuzzy LVQ and clustering algorithms
produced by the proposed formulation using the linear and exponential generator
functions constitute only a tiny subset of all possible algorithms that can be gener-
ated by this approach. Any admissible generator function leads to soft but not
necessarily fuzzy LVQ and clustering algorithms, in the sense that only the member-
ship functions corresponding to the linear and exponential generator functions
satisfy the condition (89) required for fuzzy partitions. The search for potential
generator functions would essentially involve admissible functions increasing faster
than the linear generator function and slower than the exponential generator func-
tion. Soft clustering and LVQ algorithms were developed by selecting nonlinear
generator functions.

The soft LVQ and clustering algorithms produced by nonlinear generator func-
tions were evaluated and compared with existing algorithms by formulating MR
image segmentation as an unsupervised clustering or vector quantization process.
This is a nontrivial problem and provides a reliable basis for comparing the perfor-
mance of LVQ and clustering algorithms because the diagnostic value of the seg-
mented MR images depends exclusively on the unsupervised algorithm used to
perform segmentation. The soft clustering algorithms generated by g0(x) = x^q and
tested with m = 2 achieved their best performance for values of q in a neighborhood
of 1, that is, the value of q that leads to the FCM algorithm. On the other hand, the
soft LVQ algorithms generated by g0(x) = x^q and tested with values of m decreasing
linearly from m_i = 5 to m_f = 2 in N = 50 iterations achieved their best perfor-
mance for values of q in the interval [1.5, 4]. Note that this interval does not include
the value q = 1, which leads to the FLVQ algorithm. This experimental outcome
reveals that the performance of clustering algorithms was not significantly affected
by replacing the linear generator function g0(x) = x by the nonlinear generator
function g0(x) = x^q with q not equal to 1. On the other hand, the use of the nonlinear
generator function g0(x) = x^q with q > 1 instead of the linear generator function
g0(x) = x significantly improved the performance of LVQ algorithms. The application of
soft clustering and LVQ algorithms in image segmentation indicated that the for-
mation of the initial set of prototypes has a rather significant effect on their per-
formance. The most robust and consistent behavior was exhibited in this
experimental study by the soft clustering and LVQ algorithms initialized by the
scheme I2, which employs the proposed prototype splitting procedure.
Initialization of the algorithms by the scheme I1 (based on random generation of
the initial set of prototypes) worked well in some cases, but the performance of the
algorithms initialized according to this scheme was not consistent. Similar inferences
can be made about the initialization scheme I3, which involved repeated prototype
splitting followed by the application of the algorithm used to perform MR image
segmentation.
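
For concreteness, a generic binary-splitting initializer in the spirit of the LBG codebook design is sketched below; the perturbation factor and the stopping rule are assumptions of this sketch, and the chapter's own prototype splitting procedure used in scheme I2 may differ in its details.

    import numpy as np

    def split_initialization(X, c=8, eps=1e-2):
        """Grow an initial set of prototypes by repeated binary splitting."""
        V = [X.mean(axis=0)]                      # start from the global mean of the data
        while len(V) < c:
            # split every prototype into a slightly perturbed pair
            V = [w for v in V for w in (v * (1.0 + eps), v * (1.0 - eps))]
            V = V[:c]                             # keep at most c prototypes
        return np.asarray(V)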
There is experimental evidence that soft vector quantization algorithms are strong
competitors to vector quantizers based on crisp algorithms in applications involving a
large number of feature vectors of high dimensionality [2,7,16]. Thus, soft LVQ and
clustering algorithms can also be used to perform codebook design for lossy image and
video compression, a task that is frequently performed using a variation of the crisp c-
means algorithm known in the engineering literature as the Linde-Buzo-Gray (LBG)
algorithm [28,34]. There is also evidence that soft LVQ and clustering algorithms
produced by the formulation presented in this chapter are inherently capable of identi-
fying the feature vectors that are equidistant from the prototypes. This property can be
exploited to detect outliers in the feature set. This problem is currently under
investigation.

ACKNOWLEDGMENTS

The author would like to thank Lawrence C.-P. Leung for processing the MR
image data. Furthermore, the author thanks Professors W. K. Alfred Yung, M.D.,
and Edward F. Jackson, Ph.D., who provided the MR image data and evaluated the
segmented MR images.

REFERENCES

[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York:
Plenum, 1981.
[2] N. B. Karayiannis, Generalized fuzzy λ-means algorithms and their application in image
compression. SPIE Proceedings vol. 2493: Applications of Fuzzy Logic Technology II, pp.
206-217, Orlando, FL, April 17-21, 1995.
[3] N. B. Karayiannis, Generalized fuzzy c-means algorithms. Proceedings of Fifth International
Conference on Fuzzy Systems, pp. 1036-1042, New Orleans, LA, September 8-11, 1996.
[4] K. Rose, E. Gurewitz, and G. C. Fox, Vector quantization by deterministic annealing. IEEE
Trans. Inform. Theory 38: 1249-1257, 1992.
[5] N. B. Karayiannis, Maximum entropy clustering algorithms and their application in image
compression. Proceedings of 1994 IEEE International Conference on Systems, Man, and
Cybernetics, pp. 337-342, San Antonio, October 2-5, 1994.
[6] N. B. Karayiannis, Fuzzy partition entropies and entropy constrained fuzzy clustering
algorithms. J. Intell. Fuzzy Syst. 5(2): 103-111, 1997.
[7] N. B. Karayiannis, Entropy constrained learning vector quantization algorithms and their
application in image compression. SPIE Proceedings, Vol. 3030: Applications of Artificial
Neural Networks in Image Processing II, pp. 2-13, San Jose, CA, February 12-13, 1997.
[8] N. B. Karayiannis, A methodology for constructing fuzzy algorithms for learning vector
quantization. IEEE Trans. Neural Networks 8(3): 505-518, 1997.
[9] N. B. Karayiannis, Learning vector quantization: A review. Int. J. Smart Eng. Syst. Design
1: 33-58, 1997.
[10] N. B. Karayiannis, Ordered weighted learning vector quantization and clustering algo-
rithms. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1388-1393,
Anchorage, AK, May 4-9, 1998.
[11] N. B. Karayiannis, Soft learning vector quantization and clustering algorithms based on
reformulation. Proceedings of 1998 International Conference on Fuzzy Systems, pp. 1441-
1446, Anchorage, AK, May 4-9, 1998.
[12] N. B. Karayiannis, Reformulating learning vector quantization and radial basis neural net-
works. Fundam. Inform. 37: 137-175, 1999.
[13] N. B. Karayiannis, An axiomatic approach to soft learning vector quantization and cluster-
ing. IEEE Trans. Neural Networks, in press.
[14] N. B. Karayiannis, J. C. Bezdek, N. R. Pal, R. J. Hathaway, and P.-I. Pai, Repairs to
GLVQ: A new family of competitive learning schemes. IEEE Trans. Neural Networks
7(5): 1062-1071, 1996.
[15] N. B. Karayiannis and J. C. Bezdek, An integrated approach to fuzzy learning vector
quantization and fuzzy c-means clustering. IEEE Trans. Fuzzy Syst. 5(4): 622-628, 1997.
[16] N. B. Karayiannis and P.-I. Pai, Fuzzy algorithms for learning vector quantization. IEEE
Trans. Neural Networks 7(5): 1196-1211, 1996.
[17] E. Kosmatopoulos and M. Christodoulou, Convergence properties of a class of learning
vector quantization algorithms. IEEE Trans. Image Process. 5(2): 361-368, 1996.
[18] I. Pitas, C. Kotropoulos, N. Nikolaidis, R. Yang, and M. Gabbouj, Order statistics learning
vector quantizer. IEEE Trans. Image Process. 5(6): 1048-1053, 1996.
[19] I. Pitas, C. Kotropoulos, N. Nikolaidis, and A. G. Bors, Robust and adaptive techniques in
self-organizing neural networks. Nonlinear Anal. Theory Methods Appl. 307: 4517-4528,
1997.
[20] E. C.-K. Tsao, J. C. Bezdek, and N. R. Pal, Fuzzy Kohonen clustering networks, Pattern
Recogn. 27(5): 757-764, 1994.
[21] S.-J. Yu and C.-H. Choi, LVQ with a weighted objective function. Proceedings of IEEE
International Conference on Neural Networks, pp. 2763-2768, Perth, Australia, November
27-December 1, 1995.

[22] E. Yair, K. Zeger, and A. Gersho, Competitive learning and soft competition for vector
quantizer design. IEEE Trans. Signal Process. 40(2): 294-309, 1992.
[23] J. C. Bezdek and N. R. Pal, Two soft relatives of learning vector quantization. Neural
Networks 8(5): 729-743, 1995.
[24] R. J. Hathaway and J. C. Bezdek, Optimization of clustering criteria by reformulation.
IEEE Trans. Fuzzy Syst. 3: 241-246, 1995.
[25] N. B. Karayiannis, Generalized fuzzy c-means algorithms. J. Intell. Fuzzy Syst. 8(1): 68-71,
2000.
[26] N. B. Karayiannis and P.-I. Pai, Fuzzy vector quantization algorithms and their application
in image compression. IEEE Trans. Image Process. 4(9): 1193-1201, 1995.
[27] R. Krishnapuram and J. M. Keller, A possibilistic approach to clustering. IEEE Trans.
Fuzzy Syst. 1:98-110, 1993.
[28] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston: Kluwer
Academic, 1992.
[29] D. Dyckhoff and W. Pedrycz, Generalized means as a model of compensative connectives.
Fuzzy Sets Syst. 14(2): 143-154, 1984.
[30] T. J. Hyman, R. J. Kurland, G. C. Levy, and J. D. Shoop, Characterization of normal brain
tissue using seven calculated MRI parameters and a statistical analysis system. Magn. Reson.
Med. 11:22-34, 1989.
[31] S. P. Raya, Low-level segmentation of 3-D magnetic resonance brain images—A rule-based
system. IEEE Trans. Med. Imaging 9(3): 327-337, 1990.
[32] M. Bomans, K. H. Hohne, U. Tiede, and M. Riemer, 3-D segmentation of MR images of
the head for 3-D display. IEEE Trans. Med. Imaging 9(2): 177-183, 1990.
[33] L. O. Hall, A. M. Bensaid, L. P. Clarke, R. P. Velthuizen, M. S. Silbiger, and J. C. Bezdek, A
comparison of neural network and fuzzy clustering techniques in segmenting magnetic
resonance images of the brain. IEEE Trans. Neural Networks. 3: 672-682, 1992.
[34] N. M. Nasrabadi and R. A. King, Image coding using vector quantization: A review. IEEE
Trans. Commun. 36: 957-971, 1988.
[35] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Upper Saddle
River, NJ: Prentice Hall, 1995.

Chapter 8

METASTABLE ASSOCIATIVE NETWORK MODELS OF NEURONAL DYNAMICS TRANSITION DURING SLEEP

Mitsuyuki Nakao and Mitsuaki Yamamoto

Sleep state is one of the substantial aspects of consciousness. Concerning the functions
of sleep in memory and learning, physiological and psychological evidence has been
accumulated (e.g., [1,2]). Many ideas have been proposed to elucidate the mechanisms
underlying sleep functions (e.g., [3]), but none of these ideas has been established. It is
essential to construct a model that can give insight into mechanisms of sleep and enable
its computational interpretation. Efforts have been made to do so from physiological
and model-based points of view.
In the central nervous system of the cat, the following phenomena concerning the
dynamics of single neuronal activities during the sleep cycle have been found: (1)
During rapid eye movement sleep (REM or dream sleep), neuronal activities showed
slow fluctuations, and their power spectral densities (PSDs) were approximately inver-
sely proportional to frequency in the frequency range 0.01-1.0 Hz (simply abbreviated
as l/f). (2) During steady-state slow wave sleep (SWS), neurons showed almost flat
spectral profiles in the same frequency range. These phenomena have been found in
various regions of the cat's brain such as the mesencephalic reticular formation [4],
the hippocampus, the thalamus, and the cortex [5-7].
Based on neurophysiological knowledge, the dynamics transition was successfully
simulated by using an interconnected neural network model including a globally
applied inhibitory input and random noise [8-11]. That is, the neuronal dynamics
during SWS and REM sleep were reproduced by the network model under strong
and weak inhibitory inputs, respectively. A monotonous structure of the network
attractor is suggested to exist where the "0" state is highly attractive under strong
inhibition and the metastability of the attractor is dominant under weak inhibition.
Thus, the structural change in the network attractor associated with an increase in the
global inhibitory input could underlie the neuronal dynamics transition. It was also
shown that statistical properties of the noise could differentiate the network state
behavior in the metastable attractor.
In this chapter, the phenomenology of the dynamics transition of single neuronal
activities during sleep is reviewed. Then simulation results for the neuronal dynamics
transition are summarized. Finally, the generation mechanism underlying the
dynamics transition is studied based on the structural analysis of network attractors


under various conditions. On the basis of these studies, we discuss what is happening
in actual neural networks and its possible contribution to brain functions.

1. DYNAMICS TRANSITION OF NEURONAL ACTIVITIES DURING SLEEP

The phenomenon of the dynamics transition was first demonstrated in the long-term
spontaneous single neuronal activity of the mesencephalic reticular formation (MRF)
of the cat during sleep (Figure 1) [4]. During slow wave sleep, a random spike density
pattern was often observed and an almost flat PSD was obtained for the low-frequency
range from approximately 0.01 to 1 Hz. On the other hand, for almost the same fre-
quency range during REM sleep, a slowly fluctuating spike density pattern was
observed and the PSD was inversely proportional to frequency. This kind of PSD is
simply called the 1/f PSD. Thus, an important characteristic of the cat's MRF neuro-
nal activity is the "dynamics transition" between the flat PSD during SWS and the 1/f
PSD during REM sleep. The robustness of the characteristic dynamics transition
between the two sleep states was verified for the 18 MRF neurons [4]. Furthermore,

Figure 1 Example of long-term firing profiles of the MRF neurons during slow wave sleep (SWS) and rapid eye movement sleep (REM). The second row indicates the corresponding time series of spike counts (TSC, window time = 250 ms) and the third the power spectral density of the corresponding TSC. (Adapted with permission from Ref. 4.)

the consistent characteristic of the transition was verified on a time axis during 24 hours
for a representative MRF neuron. That is, the PSD was calculated for a sample series of
the neuronal activity extracted from all sustained episodes of both SWS and REM. In
all the episodes of the respective states, the MRF neurons displayed the flat and 1/f
PSDs during SWS and REM, respectively.
The 1/f PSD is more generally evaluated in the form of an f^(-b) profile, where the
exponent b is the slope of the PSD in a double-logarithmic plot. When b = 1 the PSD
corresponds to the exact 1/f profile, whereas b = 0 corresponds to the flat PSD. As for the varia-
bility of the value of b during REM sleep among neurons, the mean values calculated
from 19 MRF neuronal activities were distributed from 0.56 to 1.37 with a pooled mean
of 0.96. Within a neuron, variabilities in b for the respective states were small.
Therefore, each neuron was suggested to have its own value, which was hypothesized
to indicate the structural specificity of the reticular network including the neuron under
observation [13].
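
A common way to estimate b is a straight-line fit to the PSD in double-logarithmic coordinates over the 0.01-1.0 Hz band. The short sketch below assumes 250-ms spike-count bins (a 4 Hz sampling rate, as in Figure 1) and a Welch PSD estimate; both are choices of this sketch rather than details reported in the text.

    import numpy as np
    from scipy.signal import welch

    def spectral_exponent(tsc, fs=4.0):
        """Estimate b in PSD ~ f^(-b) from a time series of spike counts."""
        f, pxx = welch(tsc, fs=fs, nperseg=min(len(tsc), 1024))
        band = (f >= 0.01) & (f <= 1.0)            # the band discussed in the text
        slope, _ = np.polyfit(np.log10(f[band]), np.log10(pxx[band]), 1)
        return -slope                              # b = 0 is flat, b = 1 is an exact 1/f profile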
Figure 2 shows a summary of the dynamics transition of single neuronal activities
between the two sleep states in the five neuronal groups: the mesencephalic reticular
formation, the primary somatosensory cortex, the ventrobasal complex of the thala-
mus, and the hippocampus (the theta and the pyramidal neurons) [14,15]. In addition,
the neurons recorded from the neocortex, areas 6 and 18, from the dorsal lateral
nucleus, and from the thalamic reticular nucleus have shown similar characteristics
[7,16,17]. It should be emphasized that all of these neuronal groups show a similar

Figure 2 Summary of the dynamics transition between SWS and REMS in five neuronal groups: the mesencephalic reticular formation (M113), the primary somatosensory cortex (S007), the ventrobasal complex of the thalamus (V110), and the hippocampal theta (H013) and pyramidal (H071) neurons. On the ordinate is the logarithm of PSD in (spikes/250 ms)^2/Hz. (Adapted with permission from Ref. 14.)

dynamics transition, despite the structural and functional differences of the neuronal
groups. Some common modulation mechanisms must operate under the dynamics
transition.

2. PHYSIOLOGICAL SUBSTRATE OF THE GLOBAL NEUROMODULATION

A contribution of the aminergic or cholinergic system might be responsible for the
dynamics transition prevailing in extended areas of the brain. Aminergic-cholinergic
systems are considered to modulate neuronal activities in widespread areas of the brain,
with either excitatory or inhibitory influences depending on their receptor mechanisms
[18]. Both serotoninergic and noradrenergic neurons show a depression or cessation of
firing during REM [19,20]. The extracellular concentration of serotonin during REM
becomes lowest both in the dorsal raphe nucleus [21] and in the region of the medial
pontine reticular formation [22]. Concerning the cholinergic system, the extracellular
concentration of acetylcholine becomes minimal during slow wave sleep and increases
during REM, not only in the pontine reticular formation but also in forebrain struc-
tures such as the caudate nucleus [23].
In relation to these experimental findings, the following pharmacological experi-
ments were performed on the hippocampal theta and pyramidal neurons and the
thalamic ventrobasal complex neurons [6,24]. After administration of p-chloropheny-
lalanine (PCPA), an inhibitor of serotonin synthesis, the activity of the hippocampal
theta and pyramidal neurons as well as thalamic neurons clearly displayed the 1/f
dynamics [6,24], which was accompanied by similar phenomena characterizing REM,
such as rapid eye movements, cortical desynchronization, hippocampal theta waves,
and PGO activities [6]. Under this condition, administration of 5-methoxy-N,N-
dimethyltryptamine (5MeODMT), a nonselective serotoninergic agonist, abolished
both the polygraphically REM-like state and the 1/f neuronal dynamics, and the
white dynamics was obtained [6,24]. In addition, when atropine sulfate was adminis-
tered under the condition of the PCPA pretreatment, the 1/f dynamics was also abol-
ished, indicating that a muscarinic receptor action is necessary for generation of the 1/f
dynamics. Concomitantly, it has preliminarily been shown that the nicotinic receptor
action is not required for the 1/f dynamics of the MRF neuronal activity, because the
nicotinic antagonist mecamylamine tended to fail to abolish the 1/f neuronal
dynamics [25].

3. NEURAL NETWORK MODEL

It is interesting to investigate whether the dynamics transition could originate from a
neuronal network. Simulation studies were performed using an interconnected neural
network model [8-12]. In this model, each neuron was hypothesized to have an inhi-
bitory postsynaptic input common to all neurons of the network to simulate the overall
biasing from the global neuromodulation system, such as the serotoninergic, noradre-
nergic, and chohnergic system. The effect of overall biasing from the global neuromo-
dulation system was assumed to be condensed into this common value of the
postsynaptic inhibition. Furthermore, mutually independent random perturbations

were applied to activate each neuron, mimicking various rapidly fluctuating noise
sources.
Here, the model structure is reviewed briefly. The neural network model consists of
fully interconnected neuron-like elements (referred to as "neuron"). For the ith neuron,
the state evolution rule is defined as follows:

u_i(t + 1) = Σ_j w_ij x_j(t) - h + ξ_i(t + 1)    (1)

x_i(t + 1) = g(u_i(t + 1))    (2)

g(x) = 1 for x > 0,  g(x) = 0 for x < 0    (3)

ξ_i(t + 1) = a ξ_i(t) + ε_i(t),  i = 1, 2, ..., N    (4)

where N denotes the number of neurons contained in the network; t is a discrete time;
ε_i(t) denotes random noises, which are assumed to be mutually independent, zero-mean
white Gaussian noises with an identical variance σ^2(1 - a^2), so that the variance of the
autoregressive process ξ_i(t) is kept constant regardless of a; h (> 0) is the inhibitory input,
which is fixed here independent of neurons for simplicity; and a is an autoregressive
parameter. In this case, the autocorrelation function r(k) of ξ_i(t) is given by

r(k) = σ^2 a^|k|,  k = ..., -2, -1, 0, 1, 2, ...    (5)

As described in the introduction, an acetylcholine-containing neuron, which is called a


cholinergic neuron, could be one of the possible sources for this noise. Although their
state-dependent activities are heterogeneous, most cholinergic neurons have increased
activities during REM and reduced activities during SWS [23,26]. Furthermore, a mus-
carinic receptor in the cat brain is known to respond on the order of several
seconds [27], and facilitatory effects are exerted on most of the pyramidal neurons in the
hippocampus [28] and cortex [29,30] and a thalamic relay neuron [27]. Considering the
responsiveness of a muscarinic receptor and temporal variabilities in the discharge
pattern of cholinergic neurons during REM [26,31], the cholinergic input is supposed
to be a band-limited noise with a DC component in the frequency range from 0.01 to
1.0 Hz that is of concern here. Taking this supposition into account, the correlated noise is
assumed to come from various sources such as the cholinergic-noncholinergic input
from the external environment and intrinsic membrane noise.
Synaptic weights w_ij are defined following the case of an associative memory:

w_ij = (1/N) Σ_{m=1}^{M} (2x_i^(m) - 1)(2x_j^(m) - 1)  for i ≠ j,  and  w_ij = 0  for i = j    (6)

where x_i^(m) indicates the ith neuron's state of the mth memorized pattern, and M is the
number of memorized patterns. This possibly enables parametric control of the funda-
mental structure of the network attractor. In this case, the symmetry condition
w_ij = w_ji    (7)

is satisfied.
State evolution of the network model is performed in an asynchronous (cyclic)
manner [32]. The memorized patterns and the initial states are given as equiprobable
binary random sequences. Unless otherwise stated, simulations are done for 11,000
Monte Carlo steps (MCS) and the initial 1000 MCS are not analyzed to exclude the
state sequence dependent on the initial pattern. Because, in this study, the PSD of a
state sequence is almost invariant to the temporal translation of the sequence, the
starting time of the analysis scarcely affects the resulting PSD.
The data length, 10,000 MCS, is selected to estimate the PSD in a frequency
bandwidth of three decades with sufficient statistical reliability. PSDs of actual neuro-
nal activities referred to here were given in a similar frequency bandwidth [4].
Furthermore, the data length of the neuronal spike train analyzed was at most several
hundred seconds. Comparing this actual data length with 10,000 MCS, 1 MCS could
correspond to several tens of milliseconds. This could be regarded as a time unit during
which a neuron's state is active (1) or inactive (0). The neuronal state may be deter-
mined to be responsible for the number of spikes during this time unit. This time
resolution is presumably sufficient, considering that the firing rates of actual neurons
under study were at most 30-40 spikes/sec and the concerned frequency range is lower
than 1 Hz [4].
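
The model of Eqs. (1)-(7) is compact enough to state as code. The sketch below assumes one noise update and one asynchronous cyclic sweep over the neurons per Monte Carlo step; these bookkeeping choices and the function name are assumptions of this sketch, while N, M, h, σ, a, and the 11,000-step run with the first 1000 steps discarded follow the values quoted in the text.

    import numpy as np

    def simulate_network(N=100, M=30, h=0.40, sigma=0.26, a=0.0, n_mcs=11000, seed=0):
        rng = np.random.default_rng(seed)
        # Eq. (6): Hebbian weights from M equiprobable binary random patterns, zero diagonal
        P = rng.integers(0, 2, size=(M, N))
        W = ((2 * P - 1).T @ (2 * P - 1)).astype(float) / N
        np.fill_diagonal(W, 0.0)
        x = rng.integers(0, 2, size=N).astype(float)    # random initial state
        xi = np.zeros(N)                                # autoregressive noise, Eq. (4)
        eps_std = sigma * np.sqrt(1.0 - a ** 2)         # keeps Var(xi) = sigma^2 for any a
        history = np.empty((n_mcs, N), dtype=np.int8)
        for t in range(n_mcs):
            xi = a * xi + rng.normal(0.0, eps_std, size=N)
            for i in range(N):                          # one MCS = one cyclic sweep, Eqs. (1)-(3)
                x[i] = 1.0 if (W[i] @ x - h + xi[i]) > 0.0 else 0.0
            history[t] = x
        return history[1000:]                           # discard the first 1000 MCS

Running the sketch with h = 0.40 corresponds to the weak-inhibition case and h = 0.50 to the strong-inhibition case of Figure 3.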

4. SPECTRAL ANALYSIS OF NEURONAL ACTIVITIES IN NEURAL NETWORK MODEL

Typical PSD profiles of single neuronal activities in the network model are shown in
Figure 3 for various inhibitions and a values. Unless otherwise stated, the number of
neurons N = 100 and the number of memorized patterns M = 30. In this figure, activ-
ity of a single neuron is picked up from 100 neurons included in the network. The raster
plot of x_i(t) is shown together with the corresponding PSD, where a dot indicates
x_i(t) = 1. As one can see for the case of a = 0 (i.e., white noise), the PSD profile changes
from 1/f to flat as the inhibitory input increases. Here, the parameter values, h and σ,
are selected regardless of the connection type so that most of the neurons in the network
show 1/f PSD profiles under the weak inhibition and flat PSD profiles under strong
inhibition. The time series x_i(t) responsible for the 1/f PSD shows larger and slower
variations than that for the flat PSD. Naturally, the activity is reduced as the inhibitory
input increases. As described previously [8], the strong and weak inhibitory inputs are
responsible for SWS and REM, respectively, in the framework used here. Qualitatively,
the PSD profiles and the temporal characteristics of activities are well reproduced in the
simulations. Regardless of the inhibition level, finely fragmented activities tend to be
suppressed as a increases, which is more obvious in the strong inhibition case. Slopes of
PSDs commonly become steeper with an increase in a. Because an increase in a makes
autocorrelation longer lasting, these results can be attributed to a change in the correla-
tion structure of the noise. The neuronal activities seem to more closely follow the
dynamics of the noise as a increases.
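
Under the same assumptions, the PSD of a single neuron's state sequence can be estimated directly. The snippet below reuses the simulate_network() sketch given in Section 3 together with a simple periodogram, leaving the frequency axis in the normalized units used in Figure 3.

    import numpy as np
    from scipy.signal import periodogram

    acts = simulate_network(h=0.40, a=0.0, seed=1)          # weak inhibition, white noise
    f, pxx = periodogram(acts[:, 0].astype(float), fs=1.0)  # one sample per MCS
    # A log-log plot of pxx against f (excluding f = 0) should be close to 1/f for most
    # neurons under weak inhibition and close to flat under strong inhibition (h = 0.50).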

Figure 3 Simulation results for the dynamics transition of neuronal activities in the symmetry neural network for various a values under weak and strong inhibition h, where M = 30 and σ = 0.26. Raster plots of single activities and the corresponding PSDs for the picked-up neuron are shown together. Results for (a) h = 0.40 and (b) h = 0.50. In PSD, the frequency axis is normalized by f0, which denotes an arbitrary standard frequency. (Adapted with permission from Ref. 12.)

5. DYNAMICS OF NEURAL NETWORK IN STATE SPACE

It has been suggested that the structural change of the network attractor could underlie
the dynamics transition of neuronal activities during the sleep cycle. That is, the meta-
stable properties of the attractor could be key to understanding the physiological
mechanism that controls the dynamics of neuronal activities during the sleep cycle.
Here we show how the correlation of the random noise modifies the network dynamics
in the state space.

For the symmetry network in Figure 3, activities of all neurons (network activity)
are briefly presented in Figure 4a under weak inhibition and in Figure 4b under strong
inhibition. In each panel, the autoregressive parameter a differentiates the pattern of
network activities.
As shown in Figure 4a, under weak inhibition, the network activity explicitly
indicates that the regular and irregular patterns appear alternately with varied dura-
tions. In the regular states, several different stripe patterns can be clearly seen. In
contrast, only the irregular state becomes dominant under strong inhibition. It can
be shown that these stripe patterns correspond to the vicinities of equilibrium states
under this condition, whereas the irregular pattern corresponds to the vicinity of the
"0" state, where all neurons are silent. The closest reference equilibrium state to the
current network state is determined every MCS in terms of a direction cosine (DC),
which represents the "agreement" between a current network state x(t) and a certain
reference state x*. For the networks in Figure 3, reference equilibrium states including
the "0" state are reached from 4000 statistically independent initial states with no noise,
that is, σ = 0. Here, the "0" state is denoted by x0 = [0,0,..., 0]. For the networks in
Figure 3a and b, 63 and 2 equilibrium states are found, respectively. For all references,
the closest reference to the current state is determined step by step by comparing the
magnitude of the corresponding DCs.
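
The direction-cosine bookkeeping can be written compactly; in the sketch below the DC is the usual cosine between the current state and each reference state, and the handling of the all-zero "0" reference, whose norm vanishes, is an assumption of this sketch rather than a detail given in the text.

    import numpy as np

    def closest_reference(x, refs):
        """x: (N,) binary state; refs: (R, N) reference equilibrium states."""
        nx = np.linalg.norm(x)
        nr = np.linalg.norm(refs, axis=1)
        denom = np.where(nx * nr > 0.0, nx * nr, 1.0)
        dc = (refs @ x) / denom                         # direction cosines to every reference
        dc = np.where(nr == 0.0, float(nx == 0.0), dc)  # DC to the "0" state: 1 only if x is all zero
        return int(np.argmax(dc)), dc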
Under weak inhibition, the network state wanders in the vicinities of the equili-
brium states. Here, the equilibrium states are not absolutely stable because intermittent
transitions among them driven by noise are observed. In this sense, they are denoted
here as "metastable equilibrium states" or simply metastable states, following the ter-
minology of statistical physics [33]. In spite of a constant drive by the noise, the network
state is trapped in metastable states for a certain period. In the irregular states, the

Figure 4 Dynamics of network state evolution for the network in Figure 3. This
shows the brief sequences of all neuronal activities (network activity).
The numbers on the left end are neuron numbers, and the number on the
top is the number of steps from the beginning of evolution. (Adapted with
permission from Ref. 12.)

network may wander around the vicinity of the "0" state. While in the vicinity of the
"0" state, each neuronal state is expected to be determined by an instantaneous value of
the noise rather than inputs from the other neurons. This is presumably the reason why
the spatiotemporal activity pattern looks random. Here, the "metastability" represents
the structural properties of the network attractor in which such metastable equilibrium
states dominantly exist. Therefore, the following description is possible concerning the
dynamics transition of single neuronal activities. The globally applied inhibitory input
modifies the structure of the network attractor. In the weakly inhibited case, the meta-
stability of the network attractor becomes dominant so as to realize the 1/f fluctuations
of single neuronal activities, and in the strongly inhibited case, the "0" state becomes
the global attractor, which underlies low and random activities. In other words, it is
suggested that these behaviors reflect the geometrical structure of the network attractor:
the attractor has a "bumpy" structure under weak inhibition and a monotonic structure
under strong inhibition.
For a = 0.5, one may not be able to recognize any difference between the behavior of the
network state and the preceding results in the case of white Gaussian noise, that is,
a = 0. However, finely fragmented patterns such as snow noise become suppressed for
a = 0.9 in both the strongly and weakly inhibited cases. For the weakly inhibited case,
more kinds of regular patterns could be recognized than in the case with the smaller a.
Similarly, in the strongly inhibited case, the distinct regular patterns are clearly raised as
a increases.

6. METASTABILITY OF THE NETWORK ATTRACTOR

6.1. Escape Time Distributions in Metastable Equilibrium States
As a increases, it has been observed that the irregular patterns are suppressed in
the weakly inhibited case and the regular patterns are raised even under strong inhibi-
tion. These observations suggest that the correlation properties of the noise could
control an elapsed time during which the network is trapped in a metastable state.
In order to confirm these inferences more quantitatively, the distribution of the
escape time in the respective metastable state was obtained for varied a values by a
Monte Carlo simulation with 10,000 trials. Here, the escape time is defined as the time
required for the network state, which is initially located in a metastable equilibrium
state, to cross a boundary distant from the corresponding equilibrium state by 0.2 in
terms of the DC under the correlated noise with the parameter a. Figure 5 shows the
escape time distributions for the network model. The escape time is distributed in a
monophasic manner peaking in the short range. As shown in the distributions of states
29 and 1 in Figure 5, the escape time for a metastable state except for the "0" state tends
to be prolonged with an increase in a. For the "0" state a consistent relationship
between the distribution and the value of a could not be found; in some cases, it is
rather shortened. This might be attributed to the peculiar landscape of potential energy
around the "0" state described in the next section. Nevertheless, since the prolongation
of the escape time is observed commonly in all other metastable states, these results
qualitatively support the preceding conclusions.
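
The escape-time measurement can be sketched along the same lines, again assuming the asynchronous cyclic update of the simulation sketch above and reading "a boundary distant by 0.2 in terms of the DC" as the DC to the starting state dropping below 0.8; both the update bookkeeping and that reading of the boundary are assumptions of this sketch.

    import numpy as np

    def escape_time(W, x_star, h, sigma, a, max_mcs=10000, seed=0):
        """MCS until the DC between the network state and x_star drops below 0.8."""
        rng = np.random.default_rng(seed)
        N = len(x_star)
        x = x_star.astype(float).copy()
        xi = np.zeros(N)
        eps_std = sigma * np.sqrt(1.0 - a ** 2)
        n_star = np.linalg.norm(x_star)        # the "0" state would need separate handling here
        for t in range(1, max_mcs + 1):
            xi = a * xi + rng.normal(0.0, eps_std, size=N)
            for i in range(N):
                x[i] = 1.0 if (W[i] @ x - h + xi[i]) > 0.0 else 0.0
            dc = (x @ x_star) / (np.linalg.norm(x) * n_star + 1e-12)
            if dc < 0.8:
                return t                       # crossed the 0.2 boundary in DC terms
        return max_mcs

Collecting escape_time over many random seeds yields Monte Carlo distributions of the kind plotted in Figure 5.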

Figure 5 Escape time distributions in metastable states. (a) Semilogarithmic plots of


the escape time distribution in metastable states, state 0 and state 29, for the
network in Figure 3a, where h = 0.40 and σ = 0.26. (b) Semilogarithmic
plots of the escape time distribution in metastable states, state 0 and state 1,
for the network in Figure 3b, where h = 0.50 and σ = 0.26. Here, the
abscissa denotes the escape time in Monte Carlo steps, the ordinate the
value of a, and the vertical axis the frequency. (Adapted with permission from
Ref. 12.)

6.2. Potential Walls Surrounding Metastable States
Stochastic properties of network activity are roughly characterized by a staying
probability in a metastable state and a transition probability from one metastable state
to another, where a higher order Markovian nature is assumed to be negligible. Both
probabilities may reflect the height of the potential wall between metastable states.
However, since the network potential function is multidimensional, it is difficult to
estimate the height of the potential wall. Here, the following computational procedure
is employed to estimate the height of the potential wall between metastable states. Now,
it is shown how to estimate the height of the wall between equilibrium states s_i and s_j.
First, the maximum network potential is obtained during the process s_i to s_j by flipping
the different bits (neuronal states) one by one. Then the potential maxima are collected

Figure 6 Panels (a)-(d) plot the wall heights around, and the transition probabilities from, the "0" state, state 29, and state 1 against the state number for a = 0.0, 0.5, and 0.9 (full caption below).

by repeating the same procedure, changing the flipping order 100 times. The minimum
in the set of the maxima is selected as a potential wall height between s_i and s_j. In
addition, the transition probability from s_i to s_j is estimated by 10 runs of 10,000 MCS
network state evolutions, where the transition probability for i = j indicates the staying
probability in state i.
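
The bit-flipping estimate of the wall height can be sketched as follows. The network potential is taken here to be the usual Hopfield-type energy E(x) = -(1/2) Σ_ij w_ij x_i x_j + h Σ_i x_i, which is consistent with the symmetric weights and the common inhibitory input but is not restated in this chapter, so it should be read as an assumption; the height is reported relative to the energy of the starting state s_i.

    import numpy as np

    def energy(x, W, h):
        return -0.5 * x @ W @ x + h * x.sum()

    def wall_height(s_i, s_j, W, h, n_orders=100, seed=0):
        rng = np.random.default_rng(seed)
        diff = np.flatnonzero(s_i != s_j)        # bits that must be flipped on the way s_i -> s_j
        e0 = energy(s_i.astype(float), W, h)
        best_max = np.inf
        for _ in range(n_orders):                # repeat with 100 random flipping orders
            x = s_i.astype(float).copy()
            path_max = e0
            for k in rng.permutation(diff):      # flip the differing bits one by one
                x[k] = s_j[k]
                path_max = max(path_max, energy(x, W, h))
            best_max = min(best_max, path_max)   # minimum over orders of the path maxima
        return best_max - e0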
The wall height and the transition probability are presented in Figure 6. For the
weakly inhibited case, the wall heights and the transition probabilities from the "0"
state and state 29 to all other metastable states are presented. Note that the wall
heights in the objective states are "0." Characteristically, there are high potential
walls between the "0" state and any other metastable states. In contrast, there are
several low walls around state 29 (e.g., to state 4, state 26, and state 55). This
structural property of the network potential around state 29 is shared by the other
metastable states except for the "0" state. In this sense, the potential landscape
around the "0" state is special. Concerning the transition probability, transition
from the "0" state to the other metastable states is rare in comparison with staying,
which can be understood from the above special potential landscape. On the other
hand, the transition probability from state 29 is high to itself and the other metastable
states with low potential walls. This agreement between the height of the potential
wall and the transition probability demonstrates the validity of the procedure for
deriving the wall height. A larger a is shown to make transition to the other states
less frequent and to increase the staying probability. Although, in the strongly inhib-
ited condition, only a few metastable states could be analyzed, the results in Figure 6c
and d show features similar to those in the weakly inhibited case in Figure 6a and b.
Under this condition, the potential wall from the "0" state to state 1 is much higher
than from state 1 to the "0" state, which is thought to make the staying probability
close to 1. For state 1, the staying probability increases and the transition decreases as
a gets closer to 1.
The escape time distribution is expected to depend on the local landscape of the
network potential around an equilibrium state. A symmetry neural network is known to
be a multidimensional discrete gradient system (e.g., [32]). However, there is no general
theory describing the metastability of such a multidimensional system. On the other
hand, for a one-dimensional continuous gradient system with a two-well potential, the
escape time under a small Gaussian noise obeys an exponential distribution whose
parameter depends on the height of the potential wall between two wells (e.g., [34]).
That is, a staying probability in a shallow potential well has a faster decaying profile
than one in a deep well. Although for the correlated noise, the theoretical results are
derived only under limited conditions even for the one-dimensional potential case, from
some numerical experiments the escape time distribution is expected to depend on the

Figure 6 Estimated height of network potential walls between metastable states and the transition probabil-
ity. (a) Height of network potential walls around the "0" state and the transition probability from the "0"
state. (b) Height of network potential walls around state 29 and the transition probability from state 29. These
cases are for the network under weak inhibition in Figure 3a. (c) Height of network potential walls around the
"0" state and the transition probability from the "0" state. (d) Height of network potential walls around state
1 and the transition probability from state 1. These cases are for the network under strong inhibition in Figure
3b. (Adapted with permission from Ref. 12.)

local geometry of the attractor as well as the correlation structure of the noise [35]. The
result obtained for the neural network qualitatively coincides with those for the one-
dimensional system.
In short, the behavior of network activity in the state space consists of stochastic
transitions among metastable equilibrium states. The stochastic features of transitions
are determined by the height of the potential walls around metastable states and the
correlation structure of the noise. As far as the inhibition-induced dynamics transition
is concerned, global inhibition reduces the height of the potential walls and the number
of metastable states so that a PSD of neuronal activity changes its profile from 1/f to
white.

7. POSSIBLE MECHANISMS OF THE NEURONAL DYNAMICS TRANSITION

In the preceding sections, global inhibition has been shown to change the structure of
the network attractor so as to induce the neuronal dynamics transition. From another
point of view, the global inhibition and the random noise can be regarded as external
inputs in contrast to the inputs from the other neurons interconnected (network input).
Here, "external" means that neurons in the network are not involved in its dynamics. In
this context, the previously proposed mechanism for the neuronal dynamics transition
could be reinterpreted as follows. For the weakly inhibited condition, interaction
between the external inputs and the network inputs prevails, whereas for the strongly
inhibited condition, the external inputs dominate the network inputs. That is, the
balance between the external and network inputs is supposed to play an essential
role in inducing the neuronal dynamics transition. In order to realize the same situation
in a different way from the global inhibition, it was investigated how randomly diluting
connections between neurons affected the structure of the network attractor and the
metastable behavior of the network [12].
The noise was white Gaussian, and the values of σ and h were set so that most of
the neurons exhibited 1/f fluctuations. The dilution was done at random and in a
symmetrical manner. With no dilution, the metastable behavior was clearly similar to
that in Figure 4a. As the dilution ratio increased, irregular patterns such as snow
noise became distinct in the network activities, where most of the neurons exhibited
flat PSDs in the frequency range less than f ~ 10^-2. In order to understand the
structural change of the network attractor associated with the dilution, the number
of equilibrium states was derived by the procedure described earlier. The number of
metastable equilibrium states monotonically declined with increasing dilution ratio.
Furthermore, similar results were obtained for the other trials of random dilution.
Naturally, full dilution isolates each neuron, which results in purely random neuronal
activities elicited only by noise. Within the numerical range of the inhibition and the
dilution ratio used, this result suggests that similar structural changes are caused by
the dilution and the inhibitory input. Although there are many possible ways to
reduce effective network inputs, the balance between the external inputs and the
network inputs is suggested to be essential for inducing the neuronal dynamics
transition.
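
Random symmetric dilution of the weight matrix, as used in these experiments, amounts to zeroing a fraction of the off-diagonal connection pairs while preserving w_ij = w_ji; a small sketch with illustrative parameter names is given below.

    import numpy as np

    def dilute(W, ratio, seed=0):
        """Zero a fraction `ratio` of the symmetric off-diagonal connection pairs of W."""
        rng = np.random.default_rng(seed)
        Wd = W.copy()
        iu, ju = np.triu_indices(W.shape[0], k=1)    # each symmetric pair appears once
        cut = rng.random(iu.size) < ratio            # pairs selected at random for removal
        Wd[iu[cut], ju[cut]] = 0.0
        Wd[ju[cut], iu[cut]] = 0.0                   # keep the diluted matrix symmetric
        return Wd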

8. DISCUSSION
Single neuronal activities in various regions of the brain were found to exhibit the
distinct dynamics transition from the flat to the 1/f power spectral profile during the
sleep cycle in cats. The pharmacological studies suggested that the neuromodulatory
systems such as aminergic and cholinergic neuron groups were involved in the dynamics
transition. Based on these results, the dynamics transition was successfully simulated by
using the interconnected neural network model including globally applied inhibitory
input and random noise. That is, the neuronal dynamics during SWS and REM was
reproduced by the network model under strong and weak inhibitory inputs, respec-
tively. In addition, the correlation properties of the noise were shown to affect the
behavior of the network activity, which could cooperate with the network attractor
to induce the dynamics transition. Although not shown here, these findings were qua-
litatively shared by an asymmetry neural network model whose connections were made
following Dale's rule [12] and a continuous-state and time neural network model
(unpublished results).
It has been suggested that the structural change of the network attractor associated
with the global inhibitory input could be the underlying mechanism for the neuronal
dynamics transition that was observed physiologically [10,11]. In particular, the meta-
stable properties of the network attractor were found to play a main role in generating
the \/f fluctuations of neuronal activities. It was also found that the correlation proper-
ties of the noise could change the dynamics of the network activities, and the staying
time in the metastable state was prolonged with an increase in the correlation time of
the noise [11]. Those results were confirmed in terms of the escape time distribution in
each metastable state, and the metastable structure of the network attractor was
roughly visualized by estimating the wall height of the network potential energy
between metastable states. In addition, diluting connections in the network was
shown to modify the structure of the network attractor so that the dynamics transition
of neuronal activities occurred, and it was similar to that induced by the global inhibi-
tion. This result generalizes the conditions for generating the 1/f fluctuations and the
dynamics transition of neuronal activities. Accordingly, one of the essential factors for
inducing the dynamics transition is suggested to be the balance between the external
inputs, such as the global inhibition and the noise, and the inputs from the other
interconnected neurons (i.e., network inputs). In other words, when the network inputs
and the external biasing inputs are comparable, the metastability of the network attrac-
tor distinctly appears: the 1/f fluctuation of neuronal activities is observed. In contrast,
when the external biasing input exceeds the network input, a specific state such as the
"0" state is highly attractive: the neuronal activities exhibit a flat PSD. Nevertheless, a
large external noise would control the network dynamics with its correlation structure.
According to the simulation results, mechanisms underlying the actual neuronal
dynamics transition might be anticipated as follows. During SWS, neurons receive
stronger inhibitory inputs and/or less input magnitude from interconnected neurons
than in REM sleep. In contrast, during REM sleep, neurons are released from inhibi-
tory inputs and/or receive comparable input magnitude from interconnected neurons
with the inhibition and the noise.
The serotoninergic system is suggested as a possible candidate for the globally
working inhibitory system. This is based on neuropharmacological results [5,6].

However, the noradrenergic system, which is known to have a state-dependent
activity similar to that of the serotoninergic system [36], is qualified as another
substrate exerting a biasing effect. Furthermore, the cholinergic input, mediated
mainly by the muscarinic pathway, possibly has a DC component in addition to
the temporally variable component as a source of the noise. Furthermore, there
seems to exist a negative feedback loop between serotoninergic (noradrenergic)
and cholinergic neurons [37]. Therefore, to be physiologically precise, the common
inhibitory postsynaptic input, h, represents overall influences of "neuromodulatory
systems" [38] such as the serotoninergic, noradrenergic, and cholinergic systems. So,
from the physiological point of view, the results obtained here are meaningful in
terms of the possible function of the neuromodulatory systems deeply involved in
controlling and maintaining consciousness. In order to confirm the hypothesis and
to clarify the contribution of the respective systems, it is planned to set up an
experiment in which the relationship between extracellular concentrations of amine
or acetylcholine and dynamics of neuronal activities will be investigated. For this
purpose, a special technique combining microdialysis and unit recording has been
developed [39].
Rapid eye movement sleep, in which the single neuronal activity shows the 1/f
fluctuation, is well known as dream sleep. Crick and Mitchison [3] postulated that
dream sleep functions to remove certain undesirable modes of interaction in neural
networks of the cerebral cortex by the nonspecific random excitation (PGO waves),
which are known to be generated in the brain stem and to be delivered to the cerebral
cortex during dream sleep. Their idea was concurrent with the "unlearning" algorithm
[40]. However, its physiological reality is not yet known. Concerning the relationship
between the 1/f fluctuation of a single neuronal activity and PGO waves, there is
preliminary physiological evidence suggesting that there is no correlation between
them [20]. Nevertheless, the metastable behavior of the network activity appears to
be suitable for unlearning, because the depth of a potential well where a metastable
state is located is reflected in the corresponding staying time. Appropriate unlearning
could be performed by reducing synaptic weights every time the network stays in the
vicinity of a metastable state.
Dreaming might be regarded as a random process of recalling memorized pat-
terns without logical context. In this respect, the metastable behavior of the artificial
neural network could be analogous to dreaming. Therefore, the model-based
approach used here could provide novel information for investigating the functions
of dreaming and REM through the 1/f fluctuations of neuronal activities. From the
cognitive point of view, computational network models on dreaming have been pro-
posed [2,41]. In addition, many physiological and psychological studies have sug-
gested an important role of dream sleep in memory and learning processes
[1,42,43]. The relationship between these ideas and the 1/f fluctuations deserves
further study.
Some researchers have reported dreaming during non-REM sleep (e.g., [44]).
Currently, there is no appropriate explanation for this phenomenon. According to
the simulation results, metastable states with shallow potential wells and peculiar per-
iodic events during the non-REM sleep such as spindling and slow wave activity might
be a possible cue. This will be a future subject.
ACKNOWLEDGMENTS

This study was partly supported by Grant-in-Aid for Scientific Research
(07459003) from the Ministry of Education, Science, Sports, and Culture, Japan
(1995-1997); by Special Coordination Funds for Promoting Science and Technology
from the Science and Technology Agency, Japan (1996-1998); and by the Cooperative
Research Project Program of the Research Institute of Electrical Communication,
Tohoku University (1996-1998).

REFERENCES

[1] C. Smith, Sleep states and memory processes. Behav. Brain Res. 69: 137-145, 1995.
[2] J. Antrobus, Dream theory 1997: Toward a computational neurocognitive model. Sleep Res.
Soc. Bull. 3: 5-10, 1997.
[3] F. Crick and G. Mitchison, The function of dream sleep. Nature, 304: 111-114, 1983.
[4] M. Yamamoto, H. Nakahama, K. Shima, T. Kodama, and H. Mushiake, Markov-depen-
dency and spectral analyses on spike counts in mesencephalic reticular neurons during sleep
and attentive states. Brain Res. 366: 279-289, 1986.
[5] T. Kodama, H. Mushiake, K. Shima, H. Nakahama, and M. Yamamoto, Slow fluctuations
of single unit activities of hippocampal and thalamic neurons in cats. I. Relation to natural
sleep and alert states. Brain Res. 487: 26-34, 1989.
[6] H. Mushiake, T. Kodama, K. Shima, M. Yamamoto, and H. Nakahama, Fluctuations in
spontaneous discharge of hippocampal theta cells during sleep-waking states and PCPA-
induced insomnia. J. Neurophysiol. 60: 925-939, 1988.
[7] M. Yamamoto, M. Nakao, T. Kodama, A. Hanzawa, K. Nakamura, and N. Katayama,
Dynamics transition of cortical neuronal activity between REM sleep and slow wave sleep.
Abstract 4th IBRO Congress Neuroscience, D11.2, p. 404, 1995.
[8] M. Nakao, T. Takahashi, Y. Mizutani, and M. Yamamoto, Simulation on dynamics transi-
tion in neuronal activity during sleep cycle by using asynchronous and symmetry neural
network model. Biol. Cybern. 63: 243-250, 1990.
[9] M. Nakao, K. Watanabe, T. Takahashi, Y. Mizutani, and M. Yamamoto, Structural prop-
erties of network attractor associated with neuronal dynamics transition. Proceedings
IJCNN, pp. 529-534, Baltimore, 1992.
[10] M. Nakao, K. Watanabe, Y. Mizutani, and M. Yamamoto, Metastability of network
attractor and dream sleep. Proceedings ICANN, pp. 27-30, Amsterdam, 1993.
[11] M. Nakao, I. Honda, M. Musila, and M. Yamamoto, Metastable behavior of neural net-
work under correlated random perturbations. Proceedings ICONIP, pp. 1692-1697, Seoul,
1994.
[12] M. Nakao, I. Honda, M. Musila, and M. Yamamoto, Metastable associative network
models of dream sleep. Neural Networks, 10: 1289-1302, 1997.
[13] M. Yamamoto, 1/f fluctuations observed in single central neurons during REM sleep. In
Physics of the Living State, T. Musha and Y. Sawada, eds. pp. 211-222. Tokyo: Ohmsha,
1994.
[14] M. Yamamoto, H. Nakahama, K. Shima, K. Aya, T. Kodama, H. Mushiake, and M. Inase,
Neuronal activities during paradoxical sleep. Adv. Neurol. Sci. 30: 1010-1022, 1986.
[15] M. Yamamoto, M. Nakao, Y. Mizutani, T. Takahashi, and K. Watanabe, Pharmacological
and model-based interpretation of neuronal dynamics transitions during sleep-waking cycle.
Method Inform. Med. 33: 125-128, 1994.
[16] M. Yamamoto, M. Nakao, Y. Mizutani, and T. Kodama, Dynamic properties in time series
of single neuronal activity during sleep. Adv. Neurol. Sci. 39: 29-40, 1995.
[17] K. Takahashi, Measurement of single neuronal activities in the cat's brain and their
dynamics analysis. Master's thesis, Graduate School of Information Science, Tohoku
University, 1997.
[18] M. Yamamoto, M. Nakao, and T. Kodama, A possible mechanism of dynamics-transition
of central single neuronal activity during sleep. In Sleep and Sleep Disorders: From Molecule
to Behavior. O. Hayaishi and S. Inoue, eds. pp. 81-95, Tokyo: Academic Press, 1997.
[19] D. McGinty and R. M. Harper, Dorsal raphe neurons: Depression of firing during sleep in
cats. Brain Res. 101: 569-575, 1976.
[20] K. Shima, H. Nakahama, and M. Yamamoto, Firing properties of two types of nucleus
raphe dorsalis during the sleep-waking cycle and their responses to sensory stimuli. Brain
Res. 399: 317-326, 1986.
[21] C. M. Portas and R. W. McCarley, Behavioral state-related changes of extracellular ser-
otonin concentration in the dorsal raphe nucleus: A microdialysis study in the freely moving
cat. Brain Res. 648: 306-312, 1994.
[22] H. Iwakiri, K. Matsuyama, and S. Mori, Extracellular levels of serotonin in the medial
pontine reticular formation in relation to sleep-wake cycle in cats: a microdialysis study.
Neurosci. Res. 18: 157-170, 1993.
[23] T. Kodama, Y. Takahashi, and Y. Honda, Enhancement of acetylcholine release during
paradoxical sleep in the dorsal tegmental field of the cat brain stem. Neurosci. Lett. 114: 227-
282, 1990.
[24] T. Kodama, H. Mushiake, K. Shima, T. Hayashi, and M. Yamamoto, Slow fluctuations of
single unit activities of hippocampal and thalamic neurons in cat. II. Role of serotonin on
the stability of neuronal activities. Brain. Res. 487: 35-44, 1989.
[25] M. Yamamoto, H. Arai, T. Takahashi, N. Sasaki, M. Nakao, Y. Mizutani, and T. Kodama,
Pharmacological basis of 1/f fluctuations of neuronal activities during REM sleep. Sleep
Res. 22: 458, 1993.
[26] Y. Koyama and Y. Kayama, Properties of neurons participating in regulation of sleep and
wakefulness. Adv. Neurol. Sci. 39: 29-40, 1995.
[27] D. A. McCormick and D. A. Price, Actions of acetylcholine in the guinea-pig and cat medial
and lateral geniculate nuclei, in vitro. J. Physiol. 392: 147-165, 1987.
[28] M. Stewart and S. E. Fox, Do septal neurons pace the hippocampal theta rhythm? Trends
Neurosci. 13: 163-168, 1990.
[29] T. M. McKenna, J. H. Ashe, G. K. Hui, and N. W. Weinberger, Muscarinic agonists
modulate spontaneous and evoked unit discharge in auditory cortex. Synapse 2: 54-68,1988.
[30] H. Sato, Y. Hata, H. Masui, and T. Tsumoto, A functional role of cholinergic innervation to
neurons in the cat visual cortex. J. Neurophysiol. 58: 765-780, 1987.
[31] H. Sei, K. Sakai, M. Yamamoto, and M. Jouvet, Spectral analyses of PGO-on neurons
during paradoxical sleep in freely moving cats. Brain Res. 612: 351-353, 1993.
[32] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Natl. Acad. Sci. USA 79: 2254-2258, 1982.
[33] D. J. Amit, Modeling Brain Function. Cambridge: Cambridge University Press, 1989.
[34] A. R. Bulsara and F. E. Ross, Cooperative stochastic processes in reduced neuron models.
Proceedings International Conference on Noise in Physical Systems and 1/f Fluctuations, pp.
621-627, 1991.
[35] F. Moss and P. V. E. McClintock, eds., Noise in Nonlinear Dynamical Systems, Vols. 1, 2,
and 3. Cambridge: Cambridge University Press, 1989.
[36] B. L. Jacobs, Overview of the activity of brain monoaminergic neurons across the sleep-
wake cycle. In Sleep: Neurotransmitters and Neuromodulators, A. Wauquier, J. M. Monti, J.
M. Gaillard, and M. Radulovacki, eds., pp. 1-14. New York: Raven Press, 1985.
[37] R. W. McCarley, R. W. Greene, D. Rainnie, and C. M. Portas, Brainstem neuromodulation
and REM sleep. Neuroscience 7: 341-354, 1995.
[38] T. Maeda, Neural mechanisms of sleep. Adv. Neurol. Sci. 39: 11-19, 1995.
[39] K. Nakamura, Development of neuropharmacological technique and its application to
neuronal unit recording. Master's thesis, Graduate School of Information Sciences,
Tohoku University, 1996.
[40] J. J. Hopfield, D. I. Feinstein, and R. G. Palmer, 'Unlearning' has a stabilizing effect in
collective memories. Nature 304: 158-159, 1983.
[41] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, The 'wake-sleep' algorithm for
unsupervised neural networks. Science 268: 1158-1161, 1995.
[42] M. A. Wilson and B. L. McNaughton, Reactivation of hippocampal ensemble memories
during sleep. Science 265: 676-679, 1994.
[43] A. Karni, D. Tanne, B. S. Rubenstein, J. J. M. Askenasy, and D. Sagi, Dependence on REM
sleep of overnight improvement of a perceptual skill. Science 265: 679-682, 1994.
[44] G. W. Vogel, B. Barrowclough, and D. D. Giesler, Limited discriminability of REM and
sleep onset reports and its psychiatric implications. Arch. Gen. Psychiatry 26: 449-455, 1972.

Chapter 9 ARTIFICIAL NEURAL NETWORKS FOR SPECTROSCOPIC SIGNAL MEASUREMENT

Chii-Wann Lin, Tzu-Chien Hsiao, Mang-Ting Zeng, and Hui-Hua Kenny Chiang

1. INTRODUCTION

For most analytical instrumentation for chemometric applications, spectral signals can
be generated by electromagnetic energy, including X-ray, ultraviolet, visible, infrared,
microwave, electron spin resonance, and nuclear magnetic resonance [1]. Spectral sig-
nals thus represent the perturbation response of the probed system. This provides a
means of identifying the system characteristics and modeling the possible responses. In
theory, it can fingerprint the molecular composition in both qualitative and quantitative
ways. In order to apply these principles successfully to biomedical and clinical research,
one often needs to employ some spectral signal processing techniques for smoothing,
fitting, and/or extracting the signal for quantitative interpretation [2-4]. This has been
an intensive research field over the past few decades. The name chemometrics usually
refers to using linear algebra calculation methods to make either quantitative or qua-
litative measurements of chemical data, primarily spectra. Classical methods for spec-
tral signal measurement include least-squares regression (LSR), principal component
regression (PCR), and partial least squares (PLS), which can be found in daily usage in
spectral analysis. However, the possible nonlinear effect of multiple substances and
interference still poses technical challenges for performance. Both PLS and PCR are
decomposition techniques for multivariate spectral measurement. PLS uses the concen-
tration information (expected pattern) for decomposition processes. It takes advantage
of the correlation between the spectral data and the constituent concentrations. This
causes spectra containing higher constituent concentrations to be weighted more heav-
ily than those with low concentrations. The resulting spectral vectors are directly related
to the constituents of interest.
Artificial neural networks have been successfully applied to many engineering
fields [5-7]. There are more than 50 different types of network architecture and a
number of different input-output transfer functions in the literature. Neural networks
are often used as system modeling or identification methods to characterize system
properties [8]. The capabilities of nonlinear handling and adaptation for better perfor-
mance have been the advantages of this method. With a given set of input-output data
(often limited in biomedical applications), the model structure needs to approximate the
system characteristics with acceptable accuracy. Among these different structures, the
back-propagation (BP) model is probably the most popular one and claims many
successful applications [9-12]. In general, the procedures can be divided into a training
phase and an evaluating phase. During the training phase, the sum of the square error
was minimized by using a generalized delta learning rule, which theoretically can
achieve any level of accuracy. The converged weight matrix can thus contain informa-
tion on maximum variations within the input patterns. It can then be used as a feed-
forward network for the evaluating phase. Presumably for the same category, one can
quantify unknown or untrained patterns by using the convergent weight matrix. The
advantage of this method is easy implementation. However, it is also known for its slow
convergence and limited capacity for training patterns. The radial basis function (RBF)
has its origins in techniques used for interpolation in multidimensional space [13]. The
implementation with a special two-layered architecture enables the nonlinear transfor-
mation of input space to a compacted output space. With a linear combiner on this new
space, one can modify the connection weight matrix from hidden to output layers by
using a traditional linear least-squares regression. The optimally chosen center and
width of the hidden units and the symmetrical transfer function can provide a smooth
interpolation of scattered data in arbitrary dimension to the desired accuracy [14]. This
model has been used in various areas such as speech recognition, data classification, and
time series prediction [15-19]. It has also been proposed to play an important role in the
pattern recognition capability of a neuronal signal processing scheme [20].
In this chapter, multivariate analysis methods (BP, RBF and PLS) are adopted to
measure glucose concentrations from near-infrared spectra, and their performance is
compared. To facilitate the instrumentation development, all three methods were devel-
oped by using MATLAB and then implemented by using LabVIEW 4.0.1 (National
Instrument Inc.). The comparison of performance will be given in the discussion section
according to the simulation results from glucose spectra. Part of this chapter has been
presented in a conference [21].

2. METHODS

2.1. Partial Least Squares


PLS uses the concentration information in the calculations during spectral decom-
position. This results in two sets of eigenvectors: a set of spectral "loadings" (P), which
represent the common variations in the spectral data, and a set of spectral "weights"
(Q), which represent the changes in the spectra that correspond to the regression con-
stituents. Correspondingly, there are two sets of scores: one for the spectral data (X)
and another for the concentration data (Y).
The model can be written as follows:

X = T P + E = T_1 P_1 + T_2 P_2 + ... + T_a P_a + E        (1)

Y = T Q + F = T_1 Q_1 + T_2 Q_2 + ... + T_a Q_a + F        (2)

Both X and Y are mean centered to enhance the differences between concentration
and sample responses.
During the calibration phase, one needs to decompose and calculate the T, P, and
Q matrices from the known spectra value (X) and concentration (Y) matrices by the
least-squares method. Each component of the T and P matrices is calculated by sub-
sequently removing the contribution of the previous spectral vector. The optimal num-
ber of factors (a) is determined by calculating the prediction residual error sum of
squares (PRESS) with a calibration and validation set and cross-validation method.
In general, the smaller the PRESS value, the better the model can predict the concen-
trations of calibrated constituents. One can then use the convergent matrices P and Q to
calculate the calibration model for the prediction phase. With P orthogonal, the cali-
bration equation can be shown as follows:

Ŷ = X P^T Q        (3)
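
As a minimal illustration of the prediction step in Eq. 3, a MATLAB sketch might look as follows; the variable names (Xnew for the new spectra, and P, Q, xmean, ymean for the calibration results) are illustrative only and are not taken from the chapter's own listings.

% Minimal sketch of the PLS prediction step (Eq. 3), assuming the calibration
% produced the loading matrix P (a x wavelengths), the concentration loadings
% Q (a x 1), and the calibration means xmean (1 x wavelengths) and ymean.
Xc   = Xnew - ones(size(Xnew,1),1)*xmean;   % mean center the new spectra
Yhat = Xc*P'*Q + ymean;                     % Eq. 3 plus the removed mean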

2.2. Back-Propagation Networks


BP also uses the concentration information in the calculation during calibration,
which shares the same basic idea with PLS. The architecture of the BP model is shown
in Figure 1 for a typical three-layer network. In brief, all the spectra are subjected to
Euclidean norm normalization for subsequent training and predicting processes. The
concentration values are binary coded as the output pattern for training processes. The
output of neuron j, O_j, is calculated from the weighted (W_ij) sum of the previous-layer
outputs (O_i) and a bias (B_j) according to the following equation:

O_j = F( Σ_i W_ij O_i + B_j )        (4)

The transfer function (F) can be a linear, sigmoid, or hyperbolic tangent function.
The back-propagation error is calculated according to the generalized delta rule
(GDR). The connection weight matrices are set to random values near zero before
the beginning of the training process. During the training epochs, these weights are
adjusted with respect to the difference between the desired and actual output patterns of

Figure 1 The general architecture of the neural networks (input normalization module, Gaussian or sigmoid hidden units, linear output units); the activation function of the hidden layer is the major difference. We normally use a sigmoid function for BP and a Gaussian function for RBF.
the network. With a predefined threshold, learning rate, hidden node numbers, and
maximum epoch, the network will eventually reach a stable condition, which we regard as
convergence. The convergent state corresponds to a local minimum instead of a global
one in terms of the error energy. The convergent weight matrices can reflect the max-
imum variations within the input patterns. It has the form of eigenvector decomposition
in the eigenvalue order for linear neurons [22,23]. An adaptive learning rate and
momentum term can be applied to increase the convergent speed during the training
phase. After convergence, the network can be used as a feed-forward network for the
prediction of untrained patterns. The output value of the output layer is then calculated
by the binary weighting for quantitative results.
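
To make Eq. 4 and the GDR update concrete, a minimal MATLAB sketch of the forward pass and the output-layer weight update is given below; W1, W2, B1, B2, x, d, and lr (weights, biases, one normalized input spectrum, the desired binary-coded output, and the learning rate) are illustrative names and do not correspond to the chapter's own listings.

% Minimal sketch of one BP step: forward pass (Eq. 4) with sigmoid units
% and a generalized-delta-rule update of the output layer; the weights
% are assumed to have been initialized to random values near zero.
Oh   = 1./(1 + exp(-(W1*x  + B1)));   % hidden-layer outputs O_j = F(sum W_ij O_i + B_j)
Oo   = 1./(1 + exp(-(W2*Oh + B2)));   % output-layer outputs
err  = d - Oo;                        % desired minus actual output pattern
dOut = err.*Oo.*(1 - Oo);             % output-layer delta (GDR)
W2   = W2 + lr*dOut*Oh';              % weight update with learning rate lr
B2   = B2 + lr*dOut;                  % bias update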

2.3. Radial Basis Function Networks


RBF shares the same architecture as general multilayer neural networks. The
hidden layer performs a nonlinear transformation, which maps the input space to a
new space. The output layer acts as a linear combiner on this new space. The weight
matrix between hidden and output layers is consequently modified according to the
least-squares method. The mapping function of RBF is as follows:

Y = Σ_{i=1}^{m} K(||X - c_i||) W_i        (5)

where Y is the output vector, X is the input matrix, c_i is the RBF center of the ith node,
m is the total number of centers, W_i is the connection weight between the output node and
the ith hidden node, K(·) is the common radially symmetric kernel function with nonlinearity,
and ||·|| denotes the Euclidean distance between X and c_i. It is known that the choice
of kernel function is not critical to the performance of RBF networks. Functions often
used include the thin-plate spline function, Gaussian function, multiquadric function,
and the inverse multiquadric function. Gaussian functions (exp(-x²/σ²)) were used through-
out this work. The spreading width (σ) is preset to a default value, and the optimal value is
obtained by exhaustive search in the LabVIEW program. The number of centers is
calculated according to the orthogonal least square (OLS) method [24]. In brief, the
RBF can be viewed as a special case of the linear regression model and expressed as
follows:

d(t) = Σ_{i=1}^{m} p_i(t) θ_i + e(t)        (6)

The p_i(t) are known as the regressors, which can correspond to a radial basis
function with fixed center c_i as in Eq. 5, and the θ_i are the parameters. The OLS method
involves the transformation of the set p_i into a set of orthogonal basis vectors, P = ΩA,
where A is an M × M upper triangular matrix with 1's on the diagonal and Ω is an
N × M matrix with orthogonal columns ω_i. The often ill-conditioned information
matrix (P) can be decomposed by the modified Gram-Schmidt (MGS) method. By
monitoring the error reduction ratio due to ω_i and a chosen tolerance ρ, one can find
a subset of significant regressors in a forward-regression manner [25,26]:
[err]_i = g_i² ω_i^T ω_i / (d^T d)        (7)

1 - Σ_i [err]_i < ρ        (8)
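
As a minimal illustration of Eq. 5, the following MATLAB sketch evaluates a trained Gaussian RBF network for one input spectrum; the names C (centers), sigma (spreading width), Wout (output weights), and x (one input spectrum stored as a row vector) are illustrative only and assume the centers have already been selected, for example by the OLS procedure just described.

% Minimal sketch of evaluating an RBF network with Gaussian kernels (Eq. 5).
% C is m x wavelengths, Wout is m x 1, x is 1 x wavelengths.
m   = size(C,1);
phi = zeros(m,1);
for i = 1:m
    phi(i) = exp(-sum((x - C(i,:)).^2)/sigma^2);   % Gaussian K(||x - c_i||)
end
y = phi'*Wout;                                     % linear combiner output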

2.4. Spectral Data Collection and Preprocessing


All the spectral data were collected with a near-infrared spectrophotometer
(Control Development Inc.) with 100-times averaging to increase the signal-to-noise
ratio. The spectral wavelength spans from 900 to 1500 nm to cover the "biological
optical window," which is known to provide better penetration depth for possible in
vivo medical applications [27]. Various concentrations of glucose solution were pre-
pared by series dilution of stock solution and equilibrated overnight at room tempera-
ture. The solution samples were preincubated in a 37°C constant-temperature water
bath before measurement of the absorption spectrum. A set of typical spectra data is
shown in the upper panel of Figure 2 for the concentration range 80 to 300 mg/dL. The
actual reading of the glucose concentration was reconfirmed by a glucose analyzer

Figure 2 The graphical user interface (GUI) of the simulation system for PLS, BP,
and RBF in the LabVIEW environment. The upper panel shows the near-
infrared (NIR) spectra of glucose solutions with different concentrations.
(YSI-1500, Yellow Springs) before the measurement and subsequently used in the
neural network training and evaluating procedures as the standard. The absorption
spectra were calculated by reference to air, deionized water, or an absorption film. The
presented data are taken from the air reference. The acquired spectral data were saved
in ASCII format for file retrieval in off-line analysis. Due to the strong absorption of
water signals within the spectral range, different preprocessing methods were used to
change the scale of spectra for input to both PLS and neural network models. These
include the optical density on a logarithmic scale, a linear scale, and the Euclidean norm.
Simulation results indicated the significant performance improvement of the
Euclidean norm. The total number of spectra was divided into odd and even number
groups for the training and evaluating procedures. In cross-validation, the leave-one-out
spectrum was used to evaluate the performance of the trained networks.
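
A minimal MATLAB sketch of this preprocessing is given below; S (the spectra matrix with one spectrum per row) and conc (the matching concentration vector) are illustrative names, not part of the acquisition software described above.

% Minimal sketch of the preprocessing: Euclidean-norm normalization of each
% spectrum and the odd/even split into calibration and prediction groups.
Sn   = S./(sqrt(sum(S.^2,2))*ones(1,size(S,2)));   % scale each row to unit Euclidean norm
xcal = Sn(1:2:end,:);  ycal = conc(1:2:end);       % odd-numbered spectra: calibration
xpre = Sn(2:2:end,:);  ypre = conc(2:2:end);       % even-numbered spectra: prediction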

3. RESULTS
The glucose spectra have typical water absorption peaks in the range of 970, 1250, and
1450 nm. The strong absorption band due to the presence of water is about three orders
of magnitude greater than the glucose absorption with the current measurement con-
figuration. To account for the dramatic difference between glucose and water absorp-
tion and increase the performance, we manipulate the original spectral data for
different scale and normalization procedures. All of the methods mentioned have
been implemented in the LabVIEW 4.0.1 window environment. The graphical user
interface (GUI) of the working system is shown in Figure 2. The system has been tested
with a simulated spectrum by linear and nonlinear combination of two normal distri-
bution curves with different centers and widths. The resultant root mean square of the
residual error is about 0.11, which indicates that the accuracy of the prediction is
applicable to real data.

3.1. PLS
After activating the PLS, the suitable wavelength range can be interactively
selected as input spectra. The optimal factor number is determined by the PRESS
value when it first comes to baseline as shown at the bottom in Figure 3. For the
glucose data set, the average factor number is around 6. The convergent speed is faster
than with BP. As shown in the graph for cross-validation, the performance can be
evaluated by the total sum of squared errors of prediction versus the true value from
the glucose analyzer. This standard method gives very stable results for the presented
data scheme.

3.2. BP
Generally, the system can have better performance with a Euclidean norm applied
to the input spectra. In the training phase, we can modify the learning rate to view the
resultant error graphically. The sum square error is monitored and compared to the
threshold value for convergence. The spectra and concentration vectors are presented to
the networks in batch mode. The resultant convergent weight matrices are used to
evaluate the untrained spectra. The prediction value is then plotted against the calibra-
tion value to visualize the difference as shown in Figure 4.
Figure 3 The layout of PLS simulation. The region of interest can be selected with
cursors to forward to the PLS processing routine. The bottom panel shows
the PRESS value versus the number of factors and the predictions versus
expected value. The RMSE value of the corresponding chosen factors is
shown on the graph.

3.3. RBF
In the RBF section, the error reduction ratio is calculated and compared to find the
significant regressors in an orderly manner. The convergent status is monitored by the
residual of one minus the total summation of the error reduction ratio with threshold
setting by p value. The system performance is quite sensitive to the value of spreading
factor, which we can control through the front panel. The resultant prediction error is
plotted against the calibration value and visualized in Figure 5.
The simulation results for PLS, BP, and RBF with the glucose near-infrared spec-
tra are listed in Table 1 for comparison. Due to the large number of samples in cross-
validation, the listed value for BP is the result obtained with a different maximal epoch
number as indicated in a footnote.

4. DISCUSSION

The chemometric measurement of optical spectra has extensive usage in both research
and clinical applications. The effectiveness of this analytical method is still an active
research area. There has been a recent surge in application for biomedical spectral
Figure 4 The layout of BP simulation. The region of interest can be selected with
cursors to forward the BP processing routine. The bottom panel shows the
PRESS value versus the number of hidden neurons and the prediction
versus expected value. The RMSE value of the corresponding chosen fac-
tors is shown on the graph.

TABLE 1 Comparison of Different Methods in Terms of MSE

Method / Preprocessing    Odd/even(a)    Even/odd(b)    Cross-validation
PLS
  Origin (OD)                32.53          33.53            25.82
  1-exp(-OD)                 19.15          21.2             42.31
  Euclidean norm             23.01          25.49            65.36
BP(c)
  Origin (OD)                28.28          25.67            35.54
  1-exp(-OD)                 57.31         179.38            66.02
  Euclidean norm             24.41          25.49            65.36
RBF
  Origin (OD)                32.37          28.20            26.28
  1-exp(-OD)                 28.27          21.38            25.72
  Euclidean norm             27.51          24.49            20.08

(a) Odd/even: the odd group for calibration and the even group for prediction.
(b) Even/odd: the even group for calibration and the odd group for prediction.
(c) For BP, the maximum epoch number was set to 100,000 for odd/even and even/odd and to 30,000 for cross-validation.
Figure 5 The layout of RBF simulation. The region of interest can be selected with cursors
to forward to the RBF processing routine. The bottom panel shows the PRESS
value versus the number of hidden nodes and the prediction versus expected value.
The RMSE value of the corresponding chosen factors is shown on the graph.
measurement, which includes noninvasive diagnosis of many diseases and levels of
biochemical markers. The multivariate approach provides much better numerical sta-
bility and needed accuracy for a wider range of prediction.
PLS has the advantage of resultant spectral vectors (eigenvectors) that are directly
related to the constituents of interest rather than the largest common spectral variations
(principal components). It can thus be viewed as a feature extractor in a series manner.
It can be used for very complex mixtures and sometimes for the prediction of samples
with contaminants not present in the original calibration mixtures. However, it does
require a large number of samples for accurate calibration, which increases the com-
putation loading for real-time applications. Unavoidably, there are collinear constitu-
ent concentrations within the biological samples that degrade the performance of the
PLS model. This situation can be clearly observed from the experimental results.
BP uses the generalized delta rule (GDR) for weight modification. It has the same
fundamental approach as PLS, which uses the expected patterns to correlate with the
input patterns for system identification. Can BP be treated as a parallel implementation
of PLS? Will the series decomposition of PLS lead to the finding of the exact order and
location of eigenvectors within the weight matrix? The major difference between these
two methods may be due to the GDR, which can handle the collinear case in BP better
than PLS does. Theoretically, BP can approximate any function to any level of accu-
racy. However, for practical applications, it will take extremely long times for the
convergence of training processes. Even though the adaptive learning rate and momen-
tum modification of the original scheme can significantly shorten the training time, it
remains an obstacle to the real-time application of BP. The other well-known problem
with BP is the number of hidden nodes for the necessary learning capacity of training
samples. This can degrade the performance of BP in terms of both convergent speed
and accuracy. As in the case of cross-validation, an increased number of calibration
samples often leads to an impractical convergent time for the same convergent thresh-
old as in the other two groups. The way the data are presented to the network can
significantly affect its performance. As the results indicate, normalization of input
spectra by the Euclidean norm can significantly improve the performance of BP. The
two presentations of absorption on linear and logarithm scales for the network can
make a big difference in their prediction error.
The OLS learning rule has improved the performance of RBF by selecting the
optimal subset centers from a large number of input data points. In fact, the OLS
algorithm can be applied to any model that has a linear-in-the-parameters structure.
The whole calibration process can be done in a few iterations with very efficient com-
putation. It has the advantages that the global minimum is always found and "learn-
ing" is relatively quick. The optimal width of the RBF function is determined by
exhaustive search in the current simulation. New approaches have been proposed to
find the optimal spreading width more efficiently (e.g., back-propagation and genetic
algorithm) [28,29]. The distance between samples allows fast identification of the con-
tribution due to each individual constituent. This receptive field concept can be very
important for biological neural signal processing.
In conclusion, the comparison of three multivariate analysis methods, PLS, BP,
and RBF, has provided a background for the continuous development of biospectral
chemometric applications. The LabVIEW platform allows quick development in the
graphical user interface environment. BP and PLS have comparable results for small
calibration data sets and both degrade when the size increases, as in the cross-validation
case. These two methods use Cartesian mapping for the input and both have the
disadvantage of slow convergence. BP uses a sigmoid function for the nonlinearity
and GDR for the weight modification. It is prone to being trapped in a local minimum. PLS is
ideal for the linear case and results in the global minimum. RBF provides better results
in our current simulation in both convergent speed and resultant prediction accuracy.
Most of all, its performance does not degrade with increasing sample size. This can be
important for practical applications where the calibration data set will accumulate to
account for the wider range of interest and group average effect.
The MATLAB files for the PLS, BP, and RBF algorithms are as follows:
Bp.m
% INITFF - Initializes a feed-forward network.
% TRAINBP - Trains a feed-forward network with back propagation.
% SIMUFF - Simulates a feed-forward network.
% FUNCTION APPROXIMATION WITH TANSIG/PURELIN NETWORK:
% Using the above functions two-layer network is trained
% to respond to specific inputs with target outputs.
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
% input data
% spectrum: spectra data
% concentration: concentration data
% xpre and ypre are the prediction data set.
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

clear
close('all')

load c:\spectrum.txt;
load c:\concentration.txt;
xcal=spectrum;
ycal=concentration;

load c:\spectrum1.txt;
load c:\concentration1.txt;
xpre=spectrum1;
ypre=concentration1;

% DESIGN THE NETWORK


% __^=_===_^==
% A two-layer TANSIG/PURELIN network will be trained.
% The number of hidden TANSIG neurons should reflect the
% complexity of the problem.
SI = 5;
[wl,bl,w2,b2]=initff(xcal,Sl,'tansig',ycal, 'purelin');

% TRAINING THE NETWORK

% TRAINBP uses backpropagation to train feed-forward networks.


Section 4 Discussion 227

df = 10; % Frequency of progress displays (in epochs)


me = 8000 % Maximum number of epochs to train.
eg = 0.02 % Sum-squared error goal.
lr = 0.01 % Learning rate.

tp = [df me eg lr];
[wl,bl,w2,b2,ep,tr] =trainbp(wl,bl,'tansig',w2,b2, 'purelin', xcal,ycal,tp);

% PLOTTING THE ERROR CURVE


%

ploterr(tr,eg);

% ___________
% Prediction
% _________
a = simuff(xpre,wl,bl,'tansig',w2,b2,'purelin');
plot (ypre, a ) ;
% The result is fairly close. Training to a lower error
% goal would result in a closer approximation.
echo off

Rb.m
% SOLVERB - Designs a radial basis network.
% SIMURB - Simulates a radial basis network.
% SUPERFAST FUNCTION APPROXIMATION WITH RADIAL BASIS NETWORKS:
% Using the above functions a radial basis network is trained
% to respond to specific inputs with target outputs.

% input data
% spectrum: spectra data
% concentration: concentration data
% xpre and ypre are the prediction data set.
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

clear
close('all')
load c:\spectrum.txt;
load c:\concentration.txt;
xcal=spectrum;
ycal=concentration;

load c:\spectrum1.txt;
load c:\concentration1.txt;
xpre=spectrum1;
ypre=concentration1;

% SOLVERB finds a two-layer radial basis network with enough


% neurons to fit a function to within the error goal.

df = 10; % Frequency of progress displays (in neurons).


me = 100; % Maximum number of neurons.
eg = 0.02; % Sum-squared error goal.
sc = 1; % Spread constant radial basis functions.
tp = [df me eg sc];

% ______________
% TRAINING THE NETWORK
% _____________

[wl,bl,w2,b2,nr,tr] = solverb(xcal,ycal,tp);
% TRAINRB has returned weight and bias values, the number
% of neurons required NR, and a record of training errors TR.

% ________________
% PLOTTING THE ERROR CURVE
% ________=_______=

ploterr(tr,eg);

% _________
% Prediction
% _______

a = simurb(xpre,wl,bl,w2,b2)

plot (ypre, a ) ;

% The result is fairly close. Training to a lower error


% goal would result in a closer approximation.

echo off

PLS.m
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
% input data
% spectrum: spectra data
% concentration: concentration data
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

clear
close('all')
load c:\spectrum.txt;
load c:\concentration.txt;
xcal=spectrum;
ycal=concentration;
ycplall=[];
yplall=[]
Ecplall=[];
ycpall=[];
ypall=[];
Ecpall=[]
[amax,fn]=size(xcal);
% ___ _ _ _ _ _ _ _ _
% data arrangement for cross-validation
% xc: spectra of calibration
% yc: concentration of calibration
% xp: spectra of prediction
% yp: concentration of prediction
% _ _ _ _ _ _ _ _ _ _ _ _ _

for turn=1:amax
xp=[];
xc=[];
yp=[];
yc=[];
W=[];
T=[];
if turn==1
xp=xcal(turn,:);
xc=xcal(turn+1:amax,:);
yp=ycal(turn,:);
yc=ycal(turn+1:amax,:);

elseif turn==amax
xp=xcal(turn,:);
xc=xcal(1:turn-1,:);
yp=ycal(turn,:);
yc=ycal(1:turn-1,:);

else

xp=xcal(turn,:);
xc=[xcal(1:turn-1,:);xcal(turn+1:amax,:)];
yp=ycal(turn,:);
yc=[ycal(1:turn-1,:);ycal(turn+1:amax,:)];
end
[n,nn]=size(xc);
nl=min(n,nn);
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
% center spectra and concentration
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

xmean=ones(n,1)*mean(xc);
ymean=ones(n,1)*mean(yc);
[np2,nnp2]=size(xp);
xmean2=ones(np2,1)*mean(xc);
ymean2=ones(np2,1)*mean(yc);
x=xc-xmean;
yi=yc-ymean;
y=yi;

% linear PLS algorithm

for a=1:nl
c=1/sqrt(y'*x*x'*y);
w=c*x'*y;
W=[W,w];
t=x*w;
T=[T,t];
Q=inv(T'*T)*T'*yi;
E=x-t*w';
F=yi-T*Q;
x=E;
y=F;
b=W*Q;
b0=ymean2-xmean2*b;

% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
% prediction of y for new spectra
% _ __ __ __

ycp=b0+xp*b;
Ecpl=yp-ycp;
ycp;
ycplall=[ycplall,ycp];
yplall=[yplall,yp];
Ecplall=[Ecplall,Ecpl];
end
ycpall=[ycpall;ycplall];
ycplall=[];
ypall=[ypall;yplall];
yplall=[];
Ecpall=[Ecpall;Ecplall];
Ecplall=[];
end
for i=1:nl
msecp(i)=sqrt(Ecpall(:,i)'*Ecpall(:,i)/amax);
end
[aa,a]=min(msecp)
ypresent=[ypall(:,a),ycpall(:,a)];
[ypsl,yps2]=size(ypresent);
% ____ ____
% plot predicted values and RMSECV
% _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

figure(1)
plot(ypresent(:,1),ypresent(:,2),'o')
title('Glucose prediction ')
xlabel('Actual concentration (mg/dL)');
ylabel('Predicted concentration (mg/dL)');

figure(2)
plot(msecp(1:20),'o');
title('Linear RMSECV')
xlabel('DIMENSION');
ylabel('RMSECV (mg/dL)');

ACKNOWLEDGMENTS

This work is supported by the Industrial Technology Research Institute, Ministry
of Economic Affairs (ITRI, MOEA, R.O.C.) to develop the technology for noninvasive
glucose monitoring with artificial neural networks and partial least-squares methods.

REFERENCES

[1] D. A. Skoog, Principles of Instrumental Analysis, Holt, Rinehart and Winston: CBS College
Publishing, 1995.
[2] E. V. Thomas and D. M. Haaland, Comparison of multivariate calibration methods for
quantitative spectral analysis. Anal. Chem. 62: 1091-1099, 1990.
[3] W. E. Blass and G. W. Halsey, Deconvolution of Absorption Spectra, New York: Academic
Press, 1981.
[4] Y.-Z. Liang, Y.-L. Xie, and R.-Q. Yu, Accuracy criteria and optimal wavelength selection
for multicomponent spectrophotometric determinations. Anal. Chim. Ada 222: 347-357,
1989.
[5] P. J. Gemperline, J. R. Long, and V. G. Gregoriou, Nonlinear multivariate calibration using
principal components regression and artificial neural networks. Anal. Chem. 63: 2313-2323,
1991.
[6] T. B. Blank and S. D. Brown, Data processing using neural networks, Anal. Chim. Acta 277:
273-287, 1993.
[7] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Natl. Acad. Sci. USA 79: 2554-2558, 1982.
[8] S. Grossberg, Nonlinear neural networks: Principle, mechanisms, and architectures. Neural
Networks 1: 17-61, 1988.
[9] P. A. Jansson, Neural network: An overview. Anal. Chem. 63: 367A-362A, 1991.
[10] B. J. Wythoff, S. P. Levine, and S. A. Tomellini, Spectral peak verification and recognition
using a multilayered neural network. Anal. Chem. 62: 2702-2709, 1990.
[11] C.-W. Lin, J. C. LaManna, and Y. Takefuji, Quantitative measurement of two-component
pH-sensitive colorimetric spectra using multilayer neural networks. Biol. Cyber. 67: 303-308,
1992.
[12] J.-J. Weim, C.-W. Lin, T.-S. Kuo, T. Kao and C.-Y. Wang, A quantitative neural networks
system for glucose concentration measurement. Chin. J. Med. Bio. Eng. 15: 59-72, 1995.
[13] M. J. D. Powell, Radial basis functions for multivariable interpolation: A review. In
Algorithms for the Approximation of Function and Data. J. C. Mason and M. G. Cox,
eds., pp. 143-167. New York: Chapman & Hall, 1990.
[14] J. Park and I. W. Sandberg, Approximation and radial-basis-function networks. Neural
Comput. 5: 305-316, 1993.
[15] M. Casdagli, Nonlinear prediction of chaotic time series. Physica D 35: 335-356, 1989.
[16] J. C. Carr, W. R. Fright, and R. K. Beatson, Surface interpolation with radial basis func-
tions for medical imaging. IEEE Trans. Med. Imaging 16: 96-107, 1997.
[17] N. Donaldson, H. De, K. Gollee, J. Hunt, J. Jarvis, and M. K. Kwende, A radial function
model of muscle stimulated with irregular inter-pulse intervals. Med. Eng. Phys. 17:431-441,
1995.
[18] J. Holzfuss and J. Kadtke, Global nonlinear noise reduction using radial basis function. Int.
J. Bifurcation Chaos 3: 589-596, 1993.
[19] S. Lowes and J. M. Shippen, A diagnostic system for industrial fans. Measurement Control
30: 9-13, 1997.
[20] J. J. Hopfield, Pattern recognition computation using action potential timing for stimulus
representation. Nature 276: 33-36, 1995.
[21] C.-W. Lin, T.-C. Hsiao, M.-T. Zeng, and H.-H. Chiang, Quantitative multivariate analysis
with artificial neural networks. Second International Conference on Bioelectromagnetism, pp.
59-60, Melbourne, Australia, 1998.
[22] E. Oja, A simplified neuron model as a principal component analyzer. J. Math. Biol. 15:
267-273, 1982.
[23] T. D. Sänger, Optimal unsupervised learning in a single-layer linear feedforward neural
network. Neural Network 2: 495-473, 1989.
[24] S. Chen, S. A. Billings, and W. Luo, Orthogonal least square methods and their application
to non-linear system identification. Int. J. Control 50:1873-1896, 1989.
[25] S. Chen, S. A. Billing, C. F. N. Cowan, and P. M. Grant, Non-linear systems identification
using radial basis functions. Int. J. Syst. Sci. 21: 2513-2539, 1990.
[26] S. S. A. Chen, C. F. N. Cowan, and P. M. Grant, Orthogonal least squares learning algo-
rithm for radial basis function networks. IEEE Trans. Neural Networks 2: 302-309, 1991.
[27] R. R. Anderson and J. A. Parrish, The optics of human skin. J. Invest. Dermatol. 77: 13-19,
1981.
[28] C.-C. Chiu, D. F. Cook, J. J. Pignatiello, and A. D. Whittaker, Design of a radial basis
function neural network with a radius-modification algorithm using response surface meth-
odology. J. Intell. Manuf. 8: 117-124, 1997.
[29] B. A. Whitehead and T. D. Choate, Cooperative-competitive genetic evolution of radial
basis function centers and widths for time series prediction. IEEE Trans. Neural Networks
7:869-880, 1996.

Chapter 10 APPLICATIONS OF FEED-FORWARD NEURAL NETWORKS IN THE ELECTROGASTROGRAM

Zhiyue Lin and J. D. Z. Chen

1. INTRODUCTION

The surface electrogastrogram (EGG) is a noninvasive measurement of the electrical
activity of the stomach obtained by placing surface electrodes on the abdomen over the
stomach [1]. The electrical activity of the stomach controls gastric contractions and
plays an important role in the digestive process of the stomach. The EGG is attractive
because it is noninvasive and does not disturb ongoing activity of the stomach.
Numerous studies have shown that the EGG is an accurate measure of the gastric
slow wave [2-4]. Therefore, the EGG has great potential in clinical applications.
Compared with the development of other electrophysiological measures, such as the
electroencephalogram (EEG) and the electrocardiogram (ECG), that of EGG in clinical
applications has been very slow. One of the main problems is that the EGG is imprecise.
No established parameters of the EGG and mathematical algorithms have been avail-
able for clinical diagnosis using the EGG. Therefore, an efficient solution to exploring
the clinical applications of the EGG is required.
An artificial neural network (ANN) is a computational tool simulating the human
nervous system. ANNs have been widely employed in classification and recognition due
to their great potential for high performance, flexibility, robust fault tolerance, cost-
effective functionality, and capability for real-time applications [5,6]. A number of
successful applications of ANNs to biomedical signal detection and classification
have been reported, including estimation of the ejection fraction of a human heart,
diagnosis of heart arrhythmia, and identification of a corrupted arterial pressure signal
[7-11]. The main attraction of ANNs is their ability to learn the functionality in cases in
which it is possible to specify the inputs and outputs but difficult to define the relation-
ship between them. Physiological systems, in particular, such as the digestive system in
the human body, are not easily described by mathematical relationships.
Although no mathematical algorithms can be developed for the problems involved
in the assessment of the EGG, a large amount of EGG data can easily be made avail-
able because of the noninvasive nature of the electrogastrographic technique. Using the
neural network technique, there is no problem associated with the number or complex-
ity of rules because the relationship is learned by the ANN itself. The ANN is also
tolerant of noise in the input data. These attributes of the ANN are suitable for the
assessment of the EGG.

The ANNs we consider here are three-layer feed-forward networks. In comparison
with other ANNs, the feed-forward neural network has the advantages of the avail-
ability of effective training algorithms, relatively better system behavior, and a success-
ful track record in solving personal computer (PC)-based systems [6,11]. The feed-
forward network consists of neurons (processing elements) located in different layers
and the adjustable connecting weights between the layers of neurons. The configuration
is such that from the input layer and hidden layers to the output layer, there are only
feed-forward connections, and there are no connections among neurons within the
same layer. The processing function exists within the hidden neurons and output neu-
rons, and this function is typically a nonlinear sigmoid function that can be described in
several ways. One of the most important properties of feed-forward networks is their
so-called universal approximation ability; that is, with a sufficient but finite number of
hidden neurons (in one or more layers), there always exists a set of weights for the
network that can approximate a given continuous nonlinear function to the desired
accuracy [12,13]. It is this property that forms the foundation for pattern classification
and signal processing with the feed-forward network. Previous studies [14,15] have
shown that for a classification problem where the output neuron with the greatest
activation determines the category of the input pattern, one hidden layer is most likely
sufficient. Therefore, three-layer feed-forward networks with one hidden layer were
used in EGG applications. The optimal number of neurons in the hidden layer is
determined experimentally. The optimal set of weights for a particular problem is
obtained through the learning process. Several adaptive learning algorithms for feed-
forward networks have been proposed in the literature. The most commonly used
learning method is the error back-propagation (BP) algorithm, originally invented by
Werbos (1974) [16] and popularized by Rumelhart et al. (1986) [17]. The problem with
the BP algorithm is that the convergence of the weight matrix could be very slow, or it
could even get stuck in a local minimum. Many faster versions of the BP algorithm have
since been developed [18,19]. Other new adaptive learning algorithms based on con-
jugate gradient methods have been proposed to overcome this problem, including the
quasi-Newton (QN) algorithm [20] and the scaled conjugate gradient (SCG) algorithm
[21]. Details of these three learning algorithms for training the feed-forward networks
can be found elsewhere [21,22].
In this chapter, we will review some applications of feed-forward NNs in electro-
gastrography, including detection of motion artifacts in EGG recordings [23], identifi-
cation of gastric contractions from the EGG [24], classification of normal and abnormal
EGGs [22], and prediction of delayed gastric emptying in patients through the EGG.

2. MEASUREMENTS AND PREPROCESSING OF THE EGG

2.1. Measurements of the EGG


The EGG data were obtained from both healthy subjects and patients using an
EGG Digitrapper (Synectics Med. Inc., Irving, TX). Prior to the attachment of the
electrodes, the abdominal surface where electrodes were to be positioned was shaved, if
hairy, and cleaned with sandy skin prepping paste (OMNI Prep, Weaver & Co.,
Aurora, CO) to achieve better conduction and to reduce skin-electrode motion arti-
facts. Three silver/silver chloride electrodes were placed on the abdominal skin over the
stomach. Two epigastric electrodes were connected to yield a bipolar EGG signal. The
other electrode was used as a reference. The EGG signal was amplified with a frequency
range of 0.03 to 0.25 Hz and simultaneously digitized and stored on the EGG
Digitrapper. The analog-to-digital converter has 8-bit resolution and the sampling
frequency was 1 Hz. All recordings were made in a quiet room. The subjects were in
a supine position and asked not to talk and to remain as still as possible during
recording to avoid motion artifacts. The EGG recording for each subject was made
for 30 minutes in the fasting state and for 1 to 2 hours after a test meal.

2.2. Preprocessing of the EGG Data


It is known that the performance of the neural network is dependent on the
representation of the input data. Raw EGG data are normally not good candidates
for the input to ANNs due to the time shift effect [23]. In addition, EGG features are
better reflected in other representations, such as a running power spectrum [25].
Therefore, signal processing techniques are often utilized to extract useful parameters,
which can improve the performance of the ANN when these parameters are fed as input
of the network. The related techniques applied in this chapter are introduced in this
section, including autoregressive moving average (ARMA) modeling parameters, run-
ning power spectra, and amplitude (power) spectra.

2.2.1. ARMA Modeling Parameters

Theoretically, a time series s_j (j = time instant), such as a digitized EGG signal,
can be modeled as an ARMA process as follows [26]:

s_j = -Σ_{k=1}^{p} a_k s_{j-k} + Σ_{k=1}^{q} c_k n_{j-k} + n_j        (1)

where a_k (1 ≤ k ≤ p) and c_k (1 ≤ k ≤ q) are called the ARMA parameters, and n_j is a
white noise process.
To model an EGG signal, an adaptive ARMA filter was proposed [27] as shown in
Figure 1, where Xj is the input signal at time j . The sets ay and c^ are, respectively, the
feed-forward and feedback weights of the adaptive filter, and j , is the estimate of the
input signal, expressed as

y_j = Σ_{k=1}^{p} a_kj x_{j-k} + Σ_{k=1}^{q} c_kj e_{j-k}        (2)

where e_j is the estimation error

e_j = x_j - y_j        (3)

To make the filter output y_j an ARMA estimate of the input signal x_j, the filter
weights, which are initially set to zero, are iteratively adjusted in such a way that the
error signal e_j becomes a white noise process. After some mathematical manipulations
and simplifications [27], this leads to an adaptation expressed as follows:
Figure 1 Structure of the adaptive ARMA filter.

a_{k,j+1} = a_{kj} + 2μ_a e_j x_{j-k},    k = 1, 2, ..., p        (4)

c_{k,j+1} = c_{kj} + 2μ_c e_j e_{j-k},    k = 1, 2, ..., q        (5)

where μ_a and μ_c are step sizes controlling the convergence and stability of the
algorithm.
The ARMA modeling parameters consist of a_k (k = 1, 2, ..., p) and c_k
(k = 1, 2, ..., q). The p and q were set to 20 and 2, respectively, based on previous
studies [27]. Updated ARMA parameters for each EGG segment were used as the input
of the neural network.
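
A minimal MATLAB sketch of this adaptive ARMA filter (Eqs. 2-5) is given below; x is one EGG segment stored as a column vector, and mu_a and mu_c are small step sizes. The names and step-size values are illustrative only and are not taken from the original implementation in [27].

% Minimal sketch of the adaptive ARMA filter: Eq. 2 (estimate), Eq. 3
% (error), and Eqs. 4-5 (weight updates). Orders p = 20 and q = 2 follow
% the text; the filter weights start at zero.
p = 20; q = 2; mu_a = 1e-3; mu_c = 1e-3;
a = zeros(p,1); c = zeros(q,1); e = zeros(size(x));
for j = p+1:length(x)
    y    = a'*x(j-1:-1:j-p) + c'*e(j-1:-1:j-q);   % Eq. 2
    e(j) = x(j) - y;                              % Eq. 3
    a    = a + 2*mu_a*e(j)*x(j-1:-1:j-p);         % Eq. 4
    c    = c + 2*mu_c*e(j)*e(j-1:-1:j-q);         % Eq. 5
end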

2.2.2. Running Power Spectra

Previous studies [2-4] have shown that the EGG accurately reflects the frequency
of the gastric slow wave. Therefore, spectral data instead of raw EGG data were used as
the input to the ANN in most applications. Running spectral analysis is widely used for
both qualitative and quantitative analyses of the EGG [28]. Two running spectral
analysis methods, adaptive spectral analysis [27] and the exponential distribution
(ED) [29], used in this chapter are briefly introduced as follows.
Adaptive Spectral Analysis. The adaptive spectral analysis method is based on
ARMA parametric modeling of the EGG signal. Once the adaptive filter converges,
the power spectrum of the EGG signal can be computed from the ARMA modeling
parameters, a_k (k = 1, 2, ..., p) and c_k (k = 1, 2, ..., q), according to the following [27]:

P_j(ω) = σ² |1 + Σ_{k=1}^{q} c_kj exp(-iωk)|² / |1 + Σ_{k=1}^{p} (-a_kj) exp(-iωk)|²        (6)

where σ² is calculated as follows:

σ² = (1/(j - m)) Σ_{k=m}^{j} e_k²        (7)

where j is the current time index and m is the time index at which the algorithm
converges.
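
A minimal MATLAB sketch of evaluating Eq. 6 on a grid of digital frequencies is given below; a and c are the converged filter weight vectors (p x 1 and q x 1) and s2 is the variance from Eq. 7. The names are illustrative only.

% Minimal sketch of the ARMA power spectrum of Eq. 6.
w  = linspace(0, pi, 256);                 % digital frequency grid (0 to pi)
Pw = zeros(size(w));
for k = 1:length(w)
    num   = abs(1 + c.'*exp(-1i*w(k)*(1:q)'))^2;    % MA (numerator) term
    den   = abs(1 + (-a).'*exp(-1i*w(k)*(1:p)'))^2; % AR (denominator) term
    Pw(k) = s2*num/den;                             % Eq. 6
end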
At any point in an EGG recording, a power spectrum can be calculated instanta-
neously from the updated parameters of the model. Similarly, the power spectrum of
the signal for any particular time interval can be calculated by averaging the filter
parameters over that time interval. A typical EGG signal and its running spectra,
computed by the adaptive spectral analysis method, are presented in Figure 2. The
top trace shows a 30-minute EGG recording made on one patient, and the lower
panel presents the running spectra of the recording. The power spectra (from the
bottom to the top) were computed every 2 minutes starting at the beginning of the
signal with each curve representing the power spectrum of 2-minute EGG data. These
2-minute analyses were ordered serially without overlap. Comparing the spectra with
the EGG trace, one can observe that the temporal ordering of frequency events in the
EGG signal is accurately reflected in the spectral analysis.
The main advantage of this method is increased spectral and temporal resolution.
Numerous experiments have shown that the adaptive spectral analysis method provides
narrow frequency peaks permitting more precise frequency identification and enhanced
ability in the determination of frequency changes at any time point [27,28]. This method
is especially powerful in detecting dysrhythmic events of brief duration and rhythmic
variations of the gastric slow wave.

Figure 2 A 30-minute EGG signal (top) and its running power spectra (bottom) calculated by the adaptive analysis method (frequency axis in cpm).
Exponential Distribution (ED). The ED method was introduced by Choi and
Williams [29] to overcome the drawback of the cross-terms in the Wigner distribution
[30]. It provides high resolution in time and frequency while suppressing cross-terms.
The ability to suppress the cross-terms comes by way of controlling the single para-
meter σ. Cross-term suppression is achieved because cross-terms oscillate more rapidly
than signal autocomponents.
For computational purpose, the running windowed exponential (RWE) distribu-
tion was applied for the time-frequency representation of the EGG signal [29,31].
RWE_f(n, k) = 2 Σ_{τ=-∞}^{∞} W_N(τ) e^{-j2πkτ/N} [ Σ_{μ=-∞}^{∞} W_M(μ) √(σ/(4πτ²)) exp(-σμ²/(4τ²)) x(n + μ + τ) x*(n + μ - τ) ]        (8)

where x(n) is the digitized EGG signal, n is the time index, and k is the frequency index.
W_N(τ) is a symmetrical window with a length of N, and W_M(μ) is a rectangular window
with a length of M. After obtaining the summation in the preceding equation, an N-
point fast Fourier transform (FFT) can be used to evaluate RWE_f(n, k) at each time
instant n.
The performance of the ED method for the analysis of EGG has been thoroughly
investigated by Lin and Chen [31]. The optimal parameters derived in that reference
were used in this chapter.

2.2.3. Amplitude (Power) Spectrum

In practice, the EGG data were divided into segments before being processed for
the network, and thus the cutting place (time) of each segment leads to different wave-
forms in the time domain. The ANNs are quite sensitive to this time shift effect, which
would make detection or identification difficult. In order to remove the time shift effect,
each segment of EGG data was transformed to the frequency domain, and only the
amplitude or power spectrum was used.
The difference between the amplitude spectra of different EGG data (e.g., data
with and without motion artifacts) is mainly at high frequencies, and sometimes it is not
obvious in the linear amplitude spectrum. A logarithmic scale (dB levels) was used to
enlarge the differences. This is denoted by Y and defined as

Y = 10 log|X(k)|        (9)

where |·| denotes the absolute value and X(k) is the discrete Fourier transform of the
EGG data [32].
The amplitude spectrum (or unsmoothed spectrum) provides better performance in
terms of signal detectability. In terms of characterizing the entire spectrum (power), the
smoothed spectral estimate (periodogram) is better [32]. The periodogram method is an
exact implementation based on the definition of the power spectral density. In this
method, EGG data samples are divided into consequent segments with certain overlap.
A Fourier transform is performed on each data segment, and the resultant functions of

all segments are averaged. Power can also be presented in linear and decibel (dB) units.
The decibel is the most commonly used unit and is defined as follows:
A = 10 log₁₀ B    (10)

where A is power in dB and B is power in its linear unit, that is, taking the square
magnitude of the Fourier transform of the data.
Windows are often applied in both the amplitude spectrum and the periodogram
method to control the effect of side lobes in spectral estimations [33]. The Hamming
window, which has a low side lobe effect while still maintaining a good main lobe
bandwidth [34], was applied to each segment of the EGG data before computing the
Fourier transform in this chapter.
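The smoothed estimate described above (averaging Hamming-windowed periodograms over overlapping segments) and the dB conversion of Eq. (10) can be sketched as follows; the 256-sample segment length and the 50% overlap are illustrative assumptions rather than the settings used in the chapter:

```python
import numpy as np

def averaged_periodogram_db(x, seg_len=256, overlap=0.5):
    """Averaged periodogram with a Hamming window, reported in dB per Eq. (10)."""
    step = int(seg_len * (1 - overlap))
    window = np.hamming(seg_len)             # low side lobes, good main lobe bandwidth [34]
    spectra = []
    for start in range(0, len(x) - seg_len + 1, step):
        seg = x[start:start + seg_len] * window
        spectra.append(np.abs(np.fft.rfft(seg)) ** 2)   # squared magnitude = linear power B
    B = np.mean(spectra, axis=0)             # average over all overlapping segments
    return 10 * np.log10(B + 1e-12)          # A = 10 log10(B), Eq. (10)
```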

3. APPLICATIONS IN THE EGG

3.1. Detection and Deletion of Motion Artifacts in


EGG Recordings
Unlike other electrophysiological recordings, the EGG is very vulnerable to
motion artifacts. The EGG recording usually lasts several hours, and motion artifacts
due to moving, coughing, or speaking often distort it. Motion artifacts are very annoy-
ing in the EGG because (1) they are strong and may completely obscure the electrical
signal of the stomach; (2) they have a broadband spectrum whose frequencies overlap
those of the gastric electrical activity, so they cannot be eliminated by conventional
filtering techniques and they jeopardize any kind of quantitative analysis of the EGG data;
and (3) they cannot be canceled without affecting the gastric signal. At present, the
deletion of EGG data with severe motion artifacts is performed by visual inspection.
However, scanning hours of waveforms is tedious and time consuming. Therefore, a
method using feature analysis and a back-propagation NN has been developed to automatically
detect and eliminate motion artifacts in EGG recordings [23].

3.1.1. Input Data to the NN

The EGG data used in this study were collected from 20 volunteers, each data
collection lasting for a total of 1 hour. Special exercises were designed to mimic possible
motions in an EGG study, including reading loudly, raising the legs up, tapping the
electrodes, coughing, sitting up, turning the body, and walking. Four hundred data
segments were selected from all the EGG data by visual examination of the tracings
based on the study protocol. Of the segments, 160 contained motion artifacts and the
remaining 240 were pure data without motion artifacts. These 400 data segments were
divided into two groups, with data for one group of 10 volunteers used as the training
set and data for the other group of 10 used as the testing set. For each group there were
200 segments, 80 of them containing motion artifacts and 120 of them no motion
artifacts. Each data segment consisted of 16 samples.
Because the segmented raw EGG data had a time shift effect to which NNs are sensitive,
it is difficult to detect motion artifacts from raw EGG data using NNs. To
improve the performance of detection, the following features were derived from the
raw data based on the characteristics of the motion artifacts:

1. Amplitude spectrum (AS): Because the EGG data are real, the amplitude spectra
are symmetric about zero frequency. Therefore, half of each spectrum contains all
required information. As the length of each data segment is 16 samples in this
application, only the first 9 of 16 spectral data were chosen as the input to the NNs.
2. Maximum derivative (MD): This feature is very effective for detection of pulse-
type motion artifacts, for they vary sharply in a short time period and their max-
imum derivatives are much larger. The maximum derivative for each data segment
can be represented as follows:

Δ_max = max|x_i − x_{i−1}|,  i = 1, 2, ..., N − 1    (11)

where x_i is the ith sample of the segment, and N is the number of samples in one
segment.
3. Standard deviation (SD): Data with motion artifacts show large variations in
amplitude in the time domain. This can be characterized by the standard deviation
(square root of the variance), which can be expressed as

σ = {E[x(i) − E(x(i))]²}^{1/2}    (12)

where E denotes the mean value,

E(x) = (1/N) Σ_{i=0}^{N−1} x(i)    (13)

4. Relative amplitude (RA): In this application, RA is defined as

RA = [max(x_i) − min(x_i)] / σ    (14)

where σ is the standard deviation.


These four features were computed from each data segment and then combined
into a one-dimensional vector as the input of the NN.
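For illustration, the four features can be assembled into a single input vector as sketched below (Python/NumPy); the 16-sample segment length follows the text, while the dB floor and the exact ordering of the features within the vector are assumptions:

```python
import numpy as np

def motion_artifact_features(segment):
    """Feature vector for one 16-sample EGG segment: amplitude spectrum
    (first 9 bins, Eq. (9)), maximum derivative (Eq. (11)), standard
    deviation (Eq. (12)), and relative amplitude (Eq. (14))."""
    seg = np.asarray(segment, dtype=float)
    amp_spec = 10 * np.log10(np.abs(np.fft.rfft(seg)) + 1e-12)   # 9 bins for N = 16
    max_deriv = np.max(np.abs(np.diff(seg)))                     # Eq. (11)
    sd = np.std(seg)                                             # Eq. (12)
    rel_amp = (np.max(seg) - np.min(seg)) / sd                   # Eq. (14): peak-to-peak / sigma
    return np.concatenate([amp_spec, [max_deriv, sd, rel_amp]])  # 12-dimensional NN input
```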

3.1.2. Experimental Results

The momentum and adaptive learning rate mechanism were used in the training
process of the three-layer feed-forward networks. After some trials, the parameters were
set as the following: learning rate, 0.01; learning increase ratio, 1.05; learning decrease
ratio, 0.7; momentum constant, 0.9; and error ratio, 1.04. The error goal was set to 0.01,
which was found to be accurate enough in this application.
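The training procedure just described (back-propagation with a momentum term and an adaptive learning rate) follows the common heuristic of accepting a step and increasing the rate when the error does not grow by more than the error ratio, and otherwise rejecting the step and decreasing the rate. The following Python/NumPy fragment is a sketch of that loop using the constants quoted above; `loss_and_grad`, which returns the network error and its gradient (i.e., the back-propagation pass), is a placeholder supplied by the caller.

```python
import numpy as np

def train_adaptive_bp(w, loss_and_grad, epochs=1000, lr=0.01,
                      lr_inc=1.05, lr_dec=0.7, momentum=0.9,
                      err_ratio=1.04, goal=0.01):
    """Gradient descent on the weight vector `w` (a NumPy array) with momentum
    and an adaptive learning rate; the constants are those quoted in the text."""
    velocity = np.zeros_like(w)
    for _ in range(epochs):
        err, grad = loss_and_grad(w)
        if err <= goal:                       # error goal reached (0.01 in the text)
            break
        velocity = momentum * velocity - lr * grad
        w_trial = w + velocity
        trial_err, _ = loss_and_grad(w_trial)
        if trial_err <= err * err_ratio:      # accept the step and speed up
            w = w_trial
            lr *= lr_inc
        else:                                 # reject the step and slow down
            lr *= lr_dec
            velocity[:] = 0.0
    return w
```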
Feature selection plays an important role in the detection of motion artifacts. The
accuracy would not be high enough if selection of the features was improper or too few
features were selected. On the other hand, too many features would lead to redundancy
and make the detection time consuming. Therefore, different combinations of the four
different features were used and compared to find an optimal set. Table 1 shows the

TABLE 1 Comparison of Different Combinations of the Featuresᵃ

Features        SD     AS     MD     AS+SD   AS+MD   MD+SD   AS+MD+RA   AS+MD+SD   AS+MD+SD+RA
Accuracy (%)    94.9   96.2   97.4   96.2    98.7    98.7    98.7       100        100

ᵃ The four features compared are the maximum derivative (MD), amplitude spectrum (AS), relative amplitude (RA), and standard deviation (SD).

testing results of detection by networks with three hidden neurons using one, two, three,
or four features. Two-feature detection was better than or equal to one-feature detec-
tion, and three-feature detection was better than or equal to two-feature detection.
Three-feature detection using amplitude spectrum, maximum derivative, and standard
deviation was as accurate as four-feature detection and better than any two-feature
detection. Therefore, amplitude spectrum, maximum derivative, and standard deviation
were considered to be the optimal choice for the training and testing data sets.
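The feature-subset comparison summarized in Table 1 amounts to training and testing one small network per combination of features. The loop below is only a schematic Python sketch of that search; `train_network` and `test_accuracy` stand in for the three-hidden-neuron feed-forward network of this section, and the column indices in `FEATURE_COLUMNS` are a hypothetical layout of the input vector, not the authors' code.

```python
from itertools import combinations

# Hypothetical layout of the feature blocks within the input vector.
FEATURE_COLUMNS = {"AS": list(range(0, 9)), "MD": [9], "SD": [10], "RA": [11]}

def compare_feature_subsets(train_X, train_y, test_X, test_y,
                            train_network, test_accuracy):
    """Train and test one network per non-empty feature combination."""
    results = {}
    for r in range(1, len(FEATURE_COLUMNS) + 1):
        for subset in combinations(FEATURE_COLUMNS, r):
            cols = [c for name in subset for c in FEATURE_COLUMNS[name]]
            net = train_network(train_X[:, cols], train_y)
            results[subset] = test_accuracy(net, test_X[:, cols], test_y)
    return results
```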
A software system running on MATLAB has been developed for the detection and
elimination of motion artifacts in EGG recordings. Figure 3 shows a flowchart of this
system. Figure 4 shows an example of this system's ability to identify motion artifacts in
a real EGG recording. Figure 4a shows an original EGG recording that contains
relatively severe motion artifacts. The data segments with zero values in Figure 4b
show the motion artifacts recognized by the network. The EGG waveform after the
deletion of the data segments with motion artifacts is shown in Figure 4c. The effect of
motion artifacts on the EGG spectrum is shown in Figure 5. The frequency peak at 3
cycles/min (cpm) indicates the electrical activity of the stomach. It can be seen from this
figure that motion artifacts result in not only waveform distortion but also spectral
distortion.

3.2. Identification of Gastric Contractions from the


EGG
Gastric contractions play an important role in the digestive process of the stomach.
The established method for the measurement of gastric contractions, which involves
intubating a manometric probe into the stomach, is invasive. Because of the correlation
between gastric myoelectrical activity and gastric contractions, the EGG measured
during the period of motor quiescence has characteristics different from those of the
EGG measured during the period of gastric contractions [35]. It has been observed that
the EGG during gastric contractions has a higher amplitude and more low-frequency
components than the EGG during motor quiescence [36]. Therefore, a neural network
approach has been developed for noninvasive identification of gastric contractions from
the EGG [24].

3.2.1. Experimental Data

The EGG data were obtained from 10 healthy subjects in the fasting state, each for
2 hours. Gastric contractions were simultaneously monitored with EGG recordings
using an intraluminal antral manometric probe. The manometric signals were used as

Figure 3 Flowchart of the software system for detection and elimination of motion artifacts: data collection, computation of the amplitude spectrum, maximum derivative, and standard deviation, the neural network, and elimination of the motion artifacts (output).

Figure 4 Example of EGG data before and after elimination of motion artifacts.
(a) Original EGG data with motion artifacts; (b) the motion artifacts (repre-
sented by zero values) detected by the neural network; (c) EGG data with
motion artifacts eliminated.
Figure 5 The power spectra of the EGG in Figure 4 before (curve a, original EGG, star) and after (curve b, solid) the elimination of motion artifacts, plotted against frequency (cpm).

the "gold standard" for the existence of gastric contractions. The EGG recording was
divided into segments, each with 512 samples. Each segment of the EGG data was
labeled as 0 or 1. A segment was labeled 0 if no contractions were seen in the stomach in
the simultaneous manometric recordings. A segment was labeled 1 if one or more
contractions were present in the manometric recording. The power spectrum of each
segment was computed by the exponential distribution method to use as the input to the
network. There was an overlap of 75% between two adjacent EGG segments. Only 64
spectral data were used for each input, which covered 0 to 15 cpm. Since the electrical
activity of the stomach contains no information above 15 cpm, spectral data above
15 cpm were discarded. This substantially simplified the structure of the network. The
EGG in five subjects was used as the training set. The testing set was composed of the
EGG in the other five subjects.
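The segmentation and labeling just described can be sketched as follows (Python/NumPy); `ed_power_spectrum` is a placeholder for the exponential-distribution spectral estimate of Section 2.2.2, and `manometry_has_contraction` is a placeholder for the manometric gold standard.

```python
import numpy as np

def build_contraction_dataset(egg, manometry_has_contraction, ed_power_spectrum,
                              seg_len=512, overlap=0.75, n_bins=64):
    """Segment the EGG with 75% overlap, label each segment from the manometric
    recording (1 = contractions present, 0 = motor quiescence), and keep the
    first 64 spectral points (0-15 cpm) as the NN input."""
    step = int(seg_len * (1 - overlap))
    inputs, labels = [], []
    for start in range(0, len(egg) - seg_len + 1, step):
        seg = egg[start:start + seg_len]
        inputs.append(ed_power_spectrum(seg)[:n_bins])   # spectral data above 15 cpm discarded
        labels.append(1 if manometry_has_contraction(start, start + seg_len) else 0)
    return np.array(inputs), np.array(labels)
```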

3.2.2. Experimental Results

The back-propagation learning algorithm with momentum factor was used to


train three-layer networks. Several parameters were optimized in this study to
achieve better performance. These included the number of hidden nodes in the
hidden layer, learning rate, and momentum factor. Figure 6 and Table 2 show
the effect of the learning rate on the performance (average squared error) of the
network with a structure of 64:10:2 (input:hidden:output nodes). The best perfor-
mance was observed when the rate was chosen between 0.05 and 0.1, with which an
accuracy of 100% was obtained for the training set and 92% for the testing set.
The effect of the momentum factor on the performance of the same network is
illustrated in Figure 7 and Table 3. It was observed that a momentum factor of 0.9
produced better performance than the other values for a fixed number of iterations
of 1000. Finally, the effect of the numbers of hidden nodes is presented in Table 4.
It is seen that with a fixed number of iterations, 10 hidden nodes resulted in better
performance than 5 hidden nodes, whereas no further improvement was observed
with more than 10 hidden nodes. The performance of the network with 20 hidden

Figure 6 Effects of the learning rate on the network performance: learning profiles (average squared error versus number of iterations) for a 64:10:2 network with different learning rates (momentum factor = 0.9, number of iterations = 1000).

TABLE 2 Effects of the Learning Rate on the Performance of a 64:10:2 Networkᵃ

Learning rate Average squared error Accuracy (train) (%) Accuracy (test) (%)

0.01 0.008 100 89


0.05 0.002 100 92
0.1 0.0002 100 92
0.5 0.005 100 50

ᵃ Momentum factor = 0.9, number of iterations = 1000.

nodes was the same as that of the network with 10 hidden nodes for a fixed
number of iterations of 1000. With a structure 64:10:2 and the optimized values
of the learning rate and momentum factor presented previously, the network recog-
nized the EGG during motor quiescence with an accuracy of 90% and the EGG
during gastric contractions with an accuracy of 94%.
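The parameter optimization reported in Tables 2-4 can be viewed as a small search over the learning rate, momentum factor, and number of hidden nodes (the chapter varies them one at a time, but the idea is the same). The schematic loop below is only an illustration; `train_and_test`, which would train a 64:h:2 network for 1000 iterations and return its test accuracy, is a placeholder.

```python
from itertools import product

def search_network_parameters(train_and_test,
                              learning_rates=(0.01, 0.05, 0.1, 0.5),
                              momenta=(0.05, 0.5, 0.7, 0.9),
                              hidden_sizes=(5, 10, 20, 30)):
    """Return (accuracy, learning rate, momentum, hidden nodes) of the best setting."""
    best = None
    for lr, mom, h in product(learning_rates, momenta, hidden_sizes):
        acc = train_and_test(hidden_nodes=h, learning_rate=lr, momentum=mom)
        if best is None or acc > best[0]:
            best = (acc, lr, mom, h)
    return best
```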

3.3. Classification of Normal and Abnormal EGGs


The normal frequency of the human gastric slow wave is about 3 cpm. Like the
ECG, an EGG recording may consist of not only normal rhythms but also dysrhyth-
mias, which have frequently been observed in patients with gastric motor disorders.
Gastric dysrhythmias include tachygastria (3.8 cpm < slow wave frequency < 9 cpm),
bradygastria (0.5 cpm < slow wave frequency < 2.4cpm), and arrhythmia (no domi-
nant frequencies). Gastric dysrhythmias are believed to be associated with gastric

Figure 7 Effects of the momentum factor on the network performance: learning profiles (average squared error versus number of iterations) for a 64:10:2 network with different momentum factors (learning rate = 0.05, number of iterations = 1000).

TABLE 3 Effects of the Momentum Factor on the Performance of a 64:10:2 Networkᵃ

Momentum factor Average squared error Accuracy (train) (%) Accuracy (test) (%)

0.05 0.01 100 89


0.5 0.008 100 92
0.7 0.007 100 92
0.9 0.002 100 92

ᵃ Learning rate = 0.05, number of iterations = 1000.

motor disorders and several clinical symptoms, such as nausea, vomiting, motion sick-
ness, and early pregnancy [37-40]. The assessment of the normality or abnormality of
the EGG is, therefore, of great clinical significance. Currently, researchers assess the
normality of the EGG by visual examination of the EGG tracing or its running power
spectra or both. Figure 8 shows typical normal and abnormal EGG signals. The left
panel shows the time signals and the right panel their power spectra. The four rows
correspond to bradygastria, normal, tachygastria, and arrhythmia, respectively. The
normal signal clearly contains a frequency of 3 cpm, and abnormal signals differ from
this pattern in that they contain lower, higher, or irregular frequencies. Compared with
other surface recordings, such as ECG, the quality of the EGG is usually poor. The
gastric signal in the EGG is disturbed by noise, which is composed of respiratory and
motion artifacts, the ECG, and electrical interference of the small intestine. In addition,
an EGG recording usually lasts more than 1 hour. As a result, visual examination of the
EGG not only is time consuming but also requires extensive experience in spectral

TABLE 4 Effects of the Number of Hidden Nodes on the Performance of the Network (Testing Results)ᵃ

Hidden nodes    Accuracy (quiescence) (%)    Accuracy (contractions) (%)
5               84                           91
10              90                           94
20              90                           94
30              90                           94

ᵃ Learning rate = 0.05, momentum factor = 0.9, number of iterations = 1000.

analysis. Therefore, a neural network approach has been proposed for the automated
classification of normal and abnormal EGGs.

3.3.1. Experimental Data

To achieve optimal results and reduce the complexity of the network, three types of
input data preprocessing were investigated and compared with each other. These
included raw EGG data, power spectral data, and ARMA modeling parameters of
the EGG. The raw EGG data were divided into segments, each with 60 time samples
(1 minute). The power spectral data and ARMA modeling parameters were obtained
from the raw EGG data using the adaptive running spectral analysis method [27], which
has been shown to provide high frequency resolution and accurate temporal informa-
tion [28]. The ARMA parameters of each EGG segment were composed of 20 feed-
forward and 2 feedback parameters.
The EGG segment was defined as normal if the visual examination of the EGG
traces and its spectrum indicated a dominant frequency in the range 2.4 to 3.8 cpm.
Otherwise, it was defined as abnormal. The training data set was composed of 100
segments of the normal EGG and 100 segments of the abnormal EGG, which were
randomly selected from the EGG recordings of 10 subjects. The test set was composed
of 100 segments of normal EGG and 100 segments of abnormal EGG randomly
selected from another 10 subjects' EGG recordings.
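The labeling rule just described can be written directly as a check on the location of the spectral peak. The sketch below is an illustration in Python/NumPy; the frequency axis in cpm is assumed to be supplied with the spectrum, and restricting the peak search to 0.5-9 cpm is an assumption rather than a rule stated in the chapter.

```python
import numpy as np

def label_egg_segment(power_spectrum, freqs_cpm,
                      normal_band=(2.4, 3.8), search_band=(0.5, 9.0)):
    """Return 1 (normal) if the dominant frequency lies in 2.4-3.8 cpm, else 0."""
    in_range = (freqs_cpm >= search_band[0]) & (freqs_cpm <= search_band[1])
    dominant = freqs_cpm[in_range][np.argmax(power_spectrum[in_range])]
    return int(normal_band[0] <= dominant <= normal_band[1])
```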

3.3.2. Structure of the NN Classifier and


Performance Indexes

The feed-forward network with one hidden layer was used as a classifier in this
study. The optimal number of neurons in the hidden layer was determined experimen-
tally. The numbers of neurons in the input layer were determined by the dimension or
size of the input vector. For the input types of raw EGG data and spectral data, 60
input neurons were used. The input layer contained 22 neurons when the ARMA
modeling parameters were used as the input. The output layer consisted of one neuron
for classifying the EGG into two classes: normal and abnormal.
Three indexes were used to assess the performance of the neural networks: percent
correct (Pc), sum-squared error (SSE), and complexity per iteration.
The complexity per iteration was defined as the computational time required for
each iteration of the algorithm. The Pc was defined as the percentage of all of the
answers obtained that were judged to be correct according to the gold standard. The
Figure 8 Examples of the EGG signals. The left panels show the time domain signals (amplitude versus time in seconds) and the right panels the corresponding power spectra (power versus frequency in cpm). The four rows from top to bottom represent bradygastria, normal, tachygastria, and arrhythmia, respectively.

SSE was obtained by computing the difference between the output value that an output
neuron was supposed to have, called T_ip, and the value the neuron actually had as a
result of the feed-forward calculations, called z_ip. This difference was squared, and then
the sum of the squares was taken over all output neurons. Finally, the calculation was
repeated for each example in the training or testing set, as applicable. Let P be the
number of examples in the training set and testing set and N2 the number of neurons in
the output layer; then the SSE can be expressed as

SSE = Σ_{p=1}^{P} Σ_{i=1}^{N_2} (z_ip − T_ip)²    (15)
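Both numerical indexes can be computed directly from the network outputs and the gold-standard targets; the sketch below (Python/NumPy) implements Eq. (15) and the percent-correct index, where thresholding an output neuron at 0.5 to obtain a class decision is an assumption.

```python
import numpy as np

def sse(outputs, targets):
    """Eq. (15): squared differences summed over all output neurons and all examples."""
    return float(np.sum((np.asarray(outputs) - np.asarray(targets)) ** 2))

def percent_correct(outputs, targets, threshold=0.5):
    """Pc: percentage of examples whose thresholded outputs match the gold standard."""
    decisions = (np.asarray(outputs) >= threshold).astype(int)
    return 100.0 * np.mean(np.all(decisions == np.asarray(targets), axis=-1))
```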

3.3.3. Experimental Results

A series of experiments were conducted to investigate the performance of different


learning algorithms and the effect of different input types on the neural network per-
formance to optimize the neural network configuration.
Performance Comparison of the Learning Algorithms. The performances of the BP,
QN, and SCG algorithms for the classification of the normal and abnormal EGG were
investigated and compared with each other. The feed-forward networks were composed
of 60 input neurons, 4 hidden neurons, and 1 output neuron. Two types of input were
tested: raw EGG data and spectral data. The results are presented in Figures 9 and 10
and Table 5. Figure 9 shows the SSE as a function of iteration for each of the three
algorithms. With the spectral data as the input (right), the QN algorithm converged 4
times faster than the SCG algorithm and 10 times faster than the BP algorithm. With
the raw EGG data as the input (left), the QN algorithm was 10 times faster than the
SCG algorithm and 100 times faster than the BP algorithm.
The complexity per iteration for the three learning algorithms is presented in
Figure 10. For the BP and SCG algorithms, the complexity per iteration is linear as
a function of the network size. In contrast, the QN algorithm shows a quadratic rela-
tion. Table 5 shows a general overview of the applied indexes for the different algo-
rithms. We can see that the SCG algorithm is a better compromise for this specific
application.
Effects of Different Input Types. The effect of the type of the data input to the
network was studied and is presented in Figure 11. The spectral data resulted in a much
better generalization than the raw EGG data. Table 6 shows the performance results for
the ANN using the BP algorithm for the classification of normal and abnormal EGGs

TABLE 5 General Overview of the Three Criteria for Three Algorithmsᵃ

BP SCG QN

Convergence rate — + +
Complexity per iteration + + -
Robustness — + +
ᵃ The SCG algorithm is the best compromise.

Figure 9 Cost function (or SSE) displayed as a function of the number of iterations for time domain data (left) and spectral data (right) for the three algorithms BP, SCG, and QN. The network structure is 60:4:1.

Figure 10 Complexity per iteration versus the number of hidden neurons. For BP and SCG, the complexity per iteration is a linear function of the network size; in contrast, QN shows a quadratic relationship.

with three different types of input. It can be seen that both the spectral data and the
ARMA modeling parameters yielded an accurate classification of 95%, whereas the
percent correct was only 65% when the raw EGG data were used as the input.
Although ARMA modeling parameters generated the same performance as the spectral
data, the ANN with the ARMA modeling parameters as the input contained substantially
fewer input neurons (22 vs. 60) and was computationally much simpler.

3.4. Feature-Based Detection of Delayed Gastric


Emptying from the EGG
Currently, the scintigraphic gastric emptying test is the method commonly used to
assess the digestive process of the stomach. In this method, a patient is instructed to
ingest a meal containing radioactive materials and then to stay under a gamma camera
so that abdominal images can be acquired for several hours. The application of this
technique is invasive and expensive. Because gastric myoelectrical activity is the most
fundamental activity of the stomach and it modulates gastric contractions that are
directly associated with gastric emptying, abnormal myoelectrical activity of the sto-
mach may lead to impaired gastric motility and delayed gastric emptying. In a previous
study, significant differences were found in some EGG parameters between patients
with delayed gastric emptying and those with normal gastric emptying [41]. To test the

Figure 11 Comparison of the time domain input with the spectral input: learning curves (cost function) and generalization curves for both input types, plotted against the number of iterations.

TABLE 6 The Performance of the ANN with Optimal Network Structure


Based on Three Types of Inputs
Types of input Structure of the ANN Percent correct (test set)
Raw EGG data 60:25:2 65
Spectral data 60:25:2 95
ARMA parameters 22:25:2 95

hypothesis that patients with normal and delayed emptying of the stomach can be
differentiated from certain EGG parameters using the ANN, a feature-based neural
network method has been developed to provide a noninvasive alternative for detection
of delayed gastric emptying.

3.4.1. Experimental Data

The EGG data were obtained from 152 patients with suspected gastric motility
disorders who underwent clinical tests for gastric emptying. A 30-minute baseline EGG
recording was made before ingestion of a standard test meal for each patient. Then the
patient consumed a standard test meal within 10 minutes. After the meal, simultaneous
recordings of the EGG and scintigraphic gastric emptying were made continuously for
2 hours. The techniques for recording the EGG and gastric emptying were previously
described [41]. The gastric emptying results were interpreted by the nuclear medicine
physicians.
Previous studies have shown that spectral parameters of the EGG provide useful
information regarding gastrointestinal motility and symptoms [33]. Therefore, all EGG
data were subjected to computerized spectral analysis using the programs previously
developed in our laboratory [42]. The following EGG parameters were extracted from
the spectral domain of the EGG data for each patient and were used as the input to the
neural network: dominant frequencies and their corresponding powers of the prepran-
dial and postprandial EGGs, the EGG peak power ratio between the preprandial and
postprandial EGGs, percentages of 2-4 cpm (normal frequency range) activity, and
tachygastria in the fasting and fed state. The EGG power ratio between the preprandial
and postprandial EGGs is related to the regularity and amplitude of the gastric slow

wave and has been reported to be associated with gastric contractility. The percentage
of the normal 2-4 cpm activity is a quantitative assessment of the regularity of the
gastric slow wave measured from the EGG. It was defined as the percentage of time
during which normal 2-4 cpm slow waves were observed in the EGG. It was calculated
using the running power spectral analysis method [42]. Tachygastria has been shown to
be associated with gastric hypomotility [33]. Therefore, the percentage of tachygastria
was calculated and used as a feature to be input into the neural network. It was defined
as the percentage of time during which 4-9 cpm slow waves were dominant in the EGG
recording and was computed in the same way as the percentage of the normal gastric
slow wave.
In order to prevent some features from dominating the classification process, the value
of each parameter was normalized to the range of zero to one. Experiments were
performed using all or part of the preceding parameters as the input to the artificial
neural network to derive optimal performance.
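Two of these parameters, the percentage of normal 2-4 cpm activity and the percentage of tachygastria, are percentages of time during which the dominant frequency of the running spectrum falls in the corresponding band, and every parameter is then rescaled to [0, 1]. The sketch below assumes the running spectral analysis has already produced one dominant frequency (in cpm) per time step; it is an illustration, not the laboratory programs cited in [42].

```python
import numpy as np

def percent_time_in_band(dominant_freqs_cpm, band):
    """Percentage of time steps whose dominant frequency lies inside `band` (cpm)."""
    f = np.asarray(dominant_freqs_cpm, dtype=float)
    return 100.0 * np.mean((f >= band[0]) & (f <= band[1]))

def normalize_features(feature_matrix):
    """Min-max normalization of each EGG parameter to [0, 1] across patients."""
    X = np.asarray(feature_matrix, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

# Example: percentage of normal 2-4 cpm activity and of tachygastria (4-9 cpm).
# pct_normal = percent_time_in_band(dominant_freqs_cpm, (2.0, 4.0))
# pct_tachy  = percent_time_in_band(dominant_freqs_cpm, (4.0, 9.0))
```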

3.4.2. Experimental Results

The EGG data obtained from the 152 patients were divided into two groups based
on the results of the scintigraphic gastric emptying test: 76 patients with delayed gastric
emptying and 76 patients with normal gastric emptying. The training set was composed
of EGG data of 50% of the patients from each of the groups, and the remaining data
were used as the testing set. The statistical analysis of the EGG parameters between the
two groups of patients revealed that the patients with delayed gastric emptying had a
lower percentage of regular 2-4 cpm slow waves in both fasting (77.1 ±2.6% vs
88.7 ± 1.3%, p < 0.001) and fed (77.8 ± 2.2% vs. 90.0 ± 1.0%, p < 0.001) states. A
significantly higher level of tachygastria was also observed in the fed state in patients
with delayed gastric emptying (13.9 ± 1.8% vs. 4.1 ± 0.6%, p < 0.001). Both groups of
patients showed a postprandial increase in EGG dominant power. This increase was,
however, significantly lower in patients with delayed gastric emptying than in patients
with normal gastric emptying (1.2 ± 0.6 dB vs. 4.6 ± 0.5 dB, p < 0.001). No significant
differences were observed in the dominant frequency of the EGG between the two
groups in the fasting (3.08 ±0.10 cpm vs. 2.94 ±0.03 cpm, p > 0.05) and fed
(3.20 ± 0.10 cpm vs. 3.03 ± 0.03 cpm, p > 0.05) states.
A number of experiments were conducted to optimize the performance of the
network using different numbers of EGG parameters ranging from two to all of the
parameters. Table 7 presents the test results of the network with different neurons in the
input layer and hidden layer. The five EGG parameters were the dominant frequency in
the fasting state, the dominant frequency in the fed state, the postprandial increase of
the EGG dominant power, the percentage of normal 2-4 cpm slow waves in the fed
state, and the percentage of tachygastria in the fed state. These five parameters were
determined based on a series of experiments using different combination of all EGG
parameters. It can be seen that the best performance was achieved when these five
parameters were used as the input and the hidden layer had five neurons. In this
case, the accuracy in determining the correct diagnosis was 85% with a sensitivity of
82% and a specificity of 89%. It can also be seen that three neurons are the optimal
number for the hidden layer and that exclusion of any one or two of the five input
parameters would cause the performance of the classification to deteriorate.

TABLE 7 Testing Results of the ANNᵃ

No. of neurons (input-hidden-output)    CC (%)    SE (%)    SP (%)
5-5-2                                   80        74        87
5-4-2                                   80        74        87
5-3-2                                   85        82        89
5-2-2                                   80        74        87
4-3-2                                   80        74        87
3-3-2                                   72        74        71

ᵃ Accuracy for testing data. CC(%), percentage of correct classification; SE(%), sensitivity; SP(%), specificity.

4. DISCUSSION AND CONCLUSIONS


We have reviewed four applications of feed-forward neural network techniques in the
EGG:

1. Detection and elimination of motion artifacts in EGG recordings


2. Identification of gastric contractions from the EGG in the fasting state
3. Classification of the normal and abnormal EGGs
4. Prediction of delayed gastric emptying in patients from the EGG

The reasons for choosing the neural network approaches in clinical applications of
the EGG were as follows: (1) The best candidate problems for ANN analysis are those
that are characterized by fuzzy, imprecise, and imperfect knowledge (data) and/or by
lack of a clearly stated mathematical algorithm for the analysis of data [6]. The pro-
blems of the EGG in clinical applications are perfect candidates for ANNs. The EGG
signal is imprecise. However, the measurement of the EGG is noninvasive and well
accepted by patients and physicians. Therefore, ample data can be made available
without any difficulty for the training and testing of the neural network.
(2) Successful applications of the ANN for the classification of other medical data
have been reported in numerous previous studies [7-13,43,44].
The structure of the neural network and the parameters were determined from the
literature and experiment. In comparison with other ANNs, the feed-forward neural
network has the advantages of effective and widely available training algorithms, relatively
good system behavior, and a successful track record in PC-based implementations
[6,11]. One hidden layer was used on the basis of several previous studies [14,15] that
showed that one hidden layer resulted in the same performance as two or more hidden
layers. Conflicting results were reported in the literature on the number of hidden
neurons [11]. The selection of the number of hidden neurons in this chapter was
based purely on experiments that showed that different hidden nodes were needed in
different applications. The BP learning algorithm was successfully applied in most
cases, and experimental results also showed that the adaptive learning rate and momen-
tum mechanism greatly improved the network performance. Compared with the BP
and QN algorithms, the SCG algorithm is more appropriate for classification of normal
and abnormal signals. It has moderate computational complexity and shows a super-

linear convergence rate so that different network configurations can be analyzed


through a large amount of experimentation in an acceptable period of time.
Since the performance of neural networks depends on the representation of the
input data, feature selection plays an important role in practical applications. Raw
EGG data are not a good representation as input to the network in contrast to other
parameters such as power spectrum and ARMA modeling parameters. The optimal
selection of those features depends on the applications. One of the reasons is that there
is a time shift effect in raw EGG data, and ANNs are sensitive to the time shift effect.
The amplitude or power spectrum removes the time shift effect while keeping useful
information. In addition, some intrinsic attributes of the EGG data are better reflected
in the frequency domain than in the time domain.
One may have noted that the accuracy of the method proposed in this chapter for
the prediction of gastric emptying is moderate and not very high. This is associated with
the characteristic of gastric emptying and its association with gastric myoelectrical
activity. Although gastric motor function is usually the major player in gastric empty-
ing, any abnormalities in the pylorus, such as pyloric stenosis, or in the small bowel,
such as intestinal pseudoobstruction, could lead to delayed emptying of the stomach. It
is well known that gastric myoelectrical activity is associated only with gastric motor
function and has nothing to do with the pylorus and the small intestine. It is therefore
expected that the accuracy of the proposed method would not be very high. In parti-
cular, the sensitivity is lower and the specificity is higher. This is because normal gastric
myoelectrical activity cannot always guarantee normal gastric emptying and abnormal
gastric myoelectrical activity usually results in delayed gastric emptying.
In summary, this chapter has shown that high accuracy can be achieved by applying
feed-forward neural networks, combined with appropriate signal processing techniques, to the
EGG.

REFERENCES

[1] W. C. Alvarez, The electrogastrogram and what it shows. J. Am. Med. Assoc. 78: 1116-1118,
1922.
[2] A. J. P. M. Smout, E. J. van der Schee, and J. L. Grashuis, What is measured in electro-
gastrography? Dig. Dis. Sci. 25: 179-187, 1980.
[3] B. O. Familoni, K. L. Bowes, Y. J. Kingma, and K. R. Cote, Can transcutaneous recordings
detect gastric electrical abnormalities? Gut 32: 141-146, 1991.
[4] J. Chen, B. D. Schirmer, and R. W. McCallum, Serosal and cutaneous recordings of gastric
myoelectrical activity in patients with gastroparesis. Am. J. Physiol. 266: G90-G98, 1994.
[5] R. P. Lippmann, An introduction to computing with neural nets. IEEE ASSP Mag. April:
4-22, 1987.
[6] R. C. Eberhart and R. W. Dobbins, Neural Network PC Tools: A Practical Guide. San
Diego: Academic Press, 1990.
[7] D. R. Hush and B. G. Horne, Progress in supervised neural networks. IEEE Signal Process. Mag.
January: 8-39, 1993.
[8] P. A. Karkhanis, J. Y. Cheung, and S. M. Teague, Using a PC based neural network to
estimate the ejection fraction of a human heart. Int. J. Microcomput. Appl. 9: 99, 1990.
[9] T. Pike and R. A. Mustart, Automated recognition of corrupted arterial waveforms using
neural network techniques. Comput. Biol. Med. 22: 173-179, 1992.

[10] I. N. Bankman, V. G. Sigillito, R. A. Wise, and P. L. Smith, Feature-based detection of the


K-complex wave in the human electroencephalogram using neural networks. IEEE Trans.
Biomed. Eng. 39: 1305-1309, 1992.
[11] J. M. Zurada, Introduction to Artificial Neural Systems. St. Paul, MN: West Publishing,
1992.
[12] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal
approximators. Neural Networks. 2: 359-366, 1989.
[13] K. I. Funahashi, On the approximate realization of continuous mappings by neural net-
works. Neural Networks 2: 183-192, 1989.
[14] A. J. Maren, D. Jones, and S. Frankin, Configuring and optimizing the back-propagation
network. In Handbook of Neural Computing Applications. A. Maren, A. C. Harston, and R.
Pap, eds. Vol 5, pp. 233-250. San Diego: Academic Press, 1990.
[15] J. de Villiers and E. Barnard, Backpropagation neural nets with one and two hidden layers.
IEEE Trans. Neural Networks 4: 136-141, 1992.
[16] P. J. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Doctoral dissertation, Applied Mathematics, Harvard University, 1974.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representation by
error propagation. In Parallel Distributed Processing, Vol. 1, Foundations, pp. 318-362.
Cambridge, MA: MIT Press, 1986.
[18] R. Battiti, Accelerated backpropagation learning: Two optimization methods. Complex
Syst. 3: 331-342, 1989.
[19] G. E. Hinton, Connectionist learning procedures. Artif. Intell. 40: 185-234, 1989.
[20] R. Battiti and F. Masulli, BFGS optimization for faster and automated supervised learning.
INCC 90 Paris, International Neural Network Conference, Vol. 2, pp. 757-760, 1990.
[21] M. F. Møller, A scaled conjugate gradient algorithm for fast supervised learning. Neural
Networks 6: 525-533, 1993.
[22] Z. Y. Lin, J. Maris, L. Hermans, J. Vandewalle, and J. D. Z. Chen, Classification of normal
and abnormal electrogastrograms using multi-layer feedforward neural networks. Med. Biol.
Eng. Comput. 35: 199-206, 1997.
[23] J. Liang, J. Y. Cheung, and J. D. Z. Chen, Detection and elimination of motion artifacts in
electrogastrogram using feature analysis and neural networks. Ann. Biomed. Eng. 25: 850-
857, 1997.
[24] J. D. Z. Chen, Z. Y. Lin, and R. W. McCallum, Noninvasive identification of gastric
contractions from surface electrogastrogram using back-propagation neural networks,
Med. Eng. Phys. 17(3): 219-225, 1995.
[25] Z. Y. Lin, and J. D. Z. Chen, Time-frequency analysis of the electrogastrogram. In Time-
Frequency and Wavelets in Biomedical Engineering, M. Akay, ed., pp. 147-181, Piscataway,
NJ: IEEE Press, 1996.
[26] S. M. Kay and S. L. Marple, Jr., Spectral analysis: A modern perspective, Proc. IEEE, 69:
1380-1419, 1981.
[27] J. Chen, W. R. Stewart, and R. W. McCallum, Spectral analysis of episodic rhythmic
variations in the cutaneous electrogastrogram. IEEE Trans. Biomed. Eng. 40:128-134,1993.
[28] Z. Y. Lin and J. D. Z. Chen, Comparison of three running spectral analysis methods. In
Electrogastrography: Principles and Applications, J. D. Z. Chen and R. W. McCallum, eds., pp.
75-99. New York: Raven Press, 1994.
[29] H.-I. Choi and W. J. Williams, Improved time-frequency representation of multicomponent
signals using exponential kernels, IEEE Trans. Acoust. Speech Signal Process. 37: 862-871,
1989.
[30] E. P. Wigner, On the quantum correction for thermodynamic equilibrium, Phys. Rev. 40:
749-759, 1932.
[31] Z. Y. Lin and J. D. Z. Chen, Time-frequency representation of the electrogastrogram—
Application of the exponential distribution. IEEE Trans. Biomed. Eng. 41: 267-275, 1994.

[32] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Englewood Cliffs, NJ:
Prentice-Hall, 1975.
[33] J. D. Z. Chen and R. W. McCallum, EGG parameters and their clinical significance. In
Electrogastrography: Principles and Applications, J. D. Z. Chen and R. W. McCallum, eds.,
pp. 45-73, New York: Raven Press, 1994.
[34] A. H. Nuttall, Some windows with very good sidelobe behavior, IEEE Trans. Acoust. Speech
Signal Process, 29: 84-91, 1981.
[35] J. Chen, R. W. McCallum, and R. Richards, Frequency components of the electrogastro-
gram and their correlations with gastrointestinal contractions in humans. Med. Biol. Eng.
Comput. 31: 60-67, 1993.
[36] J. D. Z. Chen, R. Richards, and R. W. McCallum, Identification of gastric contractions
from the cutaneous electrogastrogram. Am. J. Gastroenterol. 89: 79-85, 1994.
[37] J. D. Z. Chen and R. W. McCallum, Clinical applications of electrogastrography. Am. J.
Gastroenterol. 88: 1324-1336, 1993.
[38] C. H. You, K. Y. Lee, W. Y. Chey, and R. Menguy, Electrogastrographic study of patients
with unexplained nausea, bloating and vomiting. Gastroenterology 79: 311-314, 1980.
[39] J. Chen and R. W. McCallum, Gastric slow wave abnormalities in patients with gastropar-
esis. Am. J. Gastroenterol. 87: 477-482, 1992.
[40] R. M. Stern, K. L. Koch, H. W. Leibowitz, I. Lindblad, C. Shupert, and W. R. Stewart,
Tachygastria and motion sickness. Aviat. Space Environ. Med. 56: 1074-1077, 1985.
[41] J. D. Z. Chen, Z. Y. Lin, and R. W. McCallum, Abnormal gastric myoelectrical activity and
delayed gastric emptying in patients with symptoms suggestive of gastroparesis. Dig. Dis.
Sci. 41: 1538-1545, 1996.
[42] J. Chen, A computerized data analysis system for electrogastrogram. Comput. Biol. Med. 22:
45-58, 1992.
[43] M. F. Kelly et al., The application of neural networks to myoelectric signal analysis: A
preliminary study. IEEE Trans. Biomed. Eng. 37: 221-230, 1990.
[44] S. Srinivasan, R. E. Gander, and H. C. Wood, A movement pattern generated model using
artificial neural networks. IEEE Trans. Biomed. Eng. 39: 716-722, 1992.

INDEX

A
A posteriori probabilities, 102
ADALINE, 72
Admissible generator function, 129, 175
Approximate reasoning, 17
ARMA, 235
Artificial neural networks, 69
Associative network, 198

B
Back propagation, 218
Back propagation learning, 54
Blind spot, 134
Bootstrap stratification, 104

C
Clustering algorithm, 159
Crisp fuzzy partitions, 160

D
Dynamic stability, 204
Dynamic transition, 210
Dynamically driven recurrent networks
Dynamics, 204

E
EEG, 83
EGG, 85, 233
Entropy-constrained fuzzy clustering, 165
Exponential generator function, 131

F
Feature extraction, 32, 249
Function approximation, 125
Fuzzy arithmetic, 10
Fuzzy clustering, 29, 180
Fuzzy c-means algorithm, 163-164
Fuzzy inference, 22
Fuzzy relations, 12
Fuzzy set theory, 5

G
Generalized reformulation function, 171
Generator function, 144, 174
Genomic sequences, 98
Gradient descent learning, 144

I
Information-theoretic models, 60

K
Knowledge, 2

L
Learning algorithm, 248
Linear generator function, 137
LVQ and clustering algorithm, 178

M
Metastability, 206
Modified back-propagation, 100
MRI segmentation, 188
Multilayer perceptrons, 53, 71

N
Necessity, 16
Neural network models, 201
Neurodynamic programming, 61
Neuromodulation, 201
Nonstationary signal processing, 29

P
Partial least squares, 212
Pharyngeal wall, 88
Possibility distribution, 14
Possibility theory, 12
Principal component analysis, 59
Processing elements, 70

R
Radial basis function, 57, 123, 125, 219
Reformulating fuzzy clustering, 168
REM, 199
Respiratory signal, 86
Running power spectrum, 236

S
Sample stratification, 98
Selecting generator function, 133
Self-organizing map, 59
Sequential learning algorithm, 143
Sigmoid function, 71
Signal compression, 89
Sleep, 88, 198, 199
Soft clustering, 182
Spectroscopic signals, 216
Supervised learning, 53, 74
Support vector machines, 58
SWS, 199
System identification, 28

T
Thresholding, 71
Time-frequency analysis, 33
Time series prediction, 28, 44

U
Uncertainty, 1
Unsupervised fuzzy clustering, 34, 36
Unsupervised learning, 59, 75
Upper airway obstruction, 87

W
Weighted fuzzy K-mean algorithm, 37

ABOUT THE EDITOR

Metin Akay is currently assistant professor of engineering at Dartmouth College,


Hanover, New Hampshire. He received the B.S. and M.S. in electrical engineering
from the Bogazici University, Istanbul, Turkey, in 1981 and 1984, respectively, and
the Ph.D. from Rutgers University, NJ, in 1990.
He is author or coauthor of 10 other books, including Theory and Design of Bio-
medical Instruments (Academic Press, 1991), Biomedical Signal Processing, (Academic
Press, 1994), Detection Estimation of Biomedical Signals (Academic Press, 1996), and
Time Frequency and Wavelets in Biomedical Signal Processing (IEEE Press, 1997). In
addition to 11 books, Dr. Akay has published more than 57 journal papers, 55 con-
ference papers, 10 abstracts, and 12 book chapters. He holds two U.S. patents. He is the
editor of the IEEE Press Series on Biomedical Engineering.
Dr. Akay is a recipient of the IEEE Third Millennium Medal in Bioengineering.
He received the EMBS Early Career Achievement Award 1997 "for outstanding con-
tributions in the detection of coronary artery disease, in understanding of early human
development, and leadership and contributions in biomedical engineering education."
He also received the 1998 and 2000 Young Investigator Award of the Sigma Xi Society,
Northeast Region, for his outstanding research activity and the ability to communicate
the importance of his research to the general public.
His research areas of interest are fuzzy neural networks, neural signal processing,
wavelet transform, detection and estimation theory, and application to biomedical
signals. His biomedical research areas include maturation, breathing control and
respiratory-related evoked response, noninvasive detection of coronary artery disease,
noninvasive estimation of cardiac output, and understanding of the autonomic nervous
system.
Dr. Akay is a senior member of the IEEE and a member of Eta Kappa Nu, Sigma Xi,
Tau Beta Pi, the American Heart Association, and the New York Academy of Sciences.
He serves on the advisory boards of several international journals and organizations,
on National Institutes of Health study sections, and on several NSF review panels.

Color Plates

Figure 2.4 The first partition of rat number 11's EEG data into two clusters by the HUFC algorithm. The upper panel shows the partition in the clustering space of three (out of eight) energies of the scales of the discrete wavelet transform of the EEG stretch terminating with a seizure. Each data point is marked by the number of the cluster in which it has maximal degree of membership. The number of clusters was determined by the average density criterion for cluster validity (Figure 3). The lower panel shows the "hard" affiliation of each successive point in the time series (1 second) to each of the clusters. The seizure beginning (as located by a human expert) is marked by a solid vertical line (after 700 seconds).

Figure 2.5 The final partition of the EEG data with the HUFC algorithm. Clusters 4 and 5 can be used to predict the seizure, which can be identified by clusters 8 and 10.

Figure 2.7 The first partition of the recovery heart rate signals by the HUFC algorithm into four clusters as suggested by the average partition density (Figure 6). The upper panel shows the partition of the 3D temporal patterns {s_i, s_{i+1}, s_{i+2}}, i = 1, ..., L − 2, of the heart rate signal into the four clusters, and in the lower panel we can see the affiliation of each temporal pattern with its corresponding cluster marked on the original heart rate signal (the continuous line).

Figure 2.8 The final partition of the heart rate variability signal into 10 clusters. The upper panel shows the partition of the 3D temporal patterns of the heart rate signal into the final 10 clusters, and in the lower panel we can see the affiliation of each temporal pattern with its corresponding cluster marked on the original heart rate signal (the continuous line).

Figure 2.9 The first partition of the s_n = 4·s_{n−1}·(1 − s_{n−1}) time series by the HUFC algorithm. The lower panel shows the partition of the 2D temporal patterns {s_i, s_{i+1}}, i = 1, ..., 899, into five clusters as suggested by the average partition density criterion in the upper panel.

Figure 2.10 The one-sample-ahead (d = 1) prediction results of 200 samples of the s_n = 4·s_{n−1}·(1 − s_{n−1}) time series. The circle (O) marks the original samples of the time series and the "X" marks the predictions.

Figure 2.11 The final partition of the resting heart rate signals by the HUFC algorithm into 19 clusters as suggested by the average partition density using the "maximal members." The upper panel shows the partition of the 3D temporal patterns {s_i, s_{i+1}, s_{i+2}}, i = 1, ..., L − 2, of the heart rate signal into 19 clusters, and in the lower panel we can see the affiliation of each temporal pattern with its corresponding cluster marked on the original heart rate signal (the continuous line).

Figure 2.12 The one-sample-ahead (d = 1) prediction results of 100 samples of the resting heart rate signal. The circle (O) marks the original samples of the time series and the "X" marks the predictions.
