TAMPERE UNIVERSITY OF TECHNOLOGY

Department of Information Technology


Institute of Signal Processing
German Gomez Herrero
Mismatch Negativity Detection in EEG recordings using Wavelets
Master of Science Thesis
Subject approved in the department committee meeting on
9th April 2003
Reviewers: Prof. Karen Egiazarian
Docent Alpo Varri
Preface
First of all, I would like to express my deepest gratitude to my thesis advisor and reviewer,
Prof. Karen Egiazarian, for his help and support during my stay at the Institute of Signal
Processing. Thanks also to my reviewer, Docent Alpo Varri, and my advisor at the University
of Zaragoza, Prof. Salvador Olmos, whose corrections helped me produce a more coherent
and clear text.
During the last year working for this Institute I have enjoyed a great international atmosphere
and I have learnt a lot about my field of study and about life. I have made new friends
from so many different countries that I will never have time enough to visit all of them.
Furthermore, Finland has given me the opportunity to live unforgettable experiences, like
dipping in a frozen lake or having days without sunset. I truly recommend that everybody
visit this amazing country.
Finally, I want to thank my family in Spain, especially my parents, for their love and support
during my long stay so far away from home.
Tampere, June 2003
German Gomez Herrero
TAMPERE UNIVERSITY OF TECHNOLOGY
Institute of Signal Processing
Department of Information Technology
Gomez Herrero, German: Mismatch Negativity Detection in EEG recordings using
Wavelets
Master of Science Thesis, 113 pages, 11 enclosure pages
Funding: Institute of Signal Processing
June 2003
Reviewers:
Prof. Karen Egiazarian
Docent Alpo Varri
Abstract
The mismatch negativity (MMN) is an event-related potential elicited by changes in repetitive
auditory stimuli, irrespective of the direction of the subject's attention. It is mainly generated
in the auditory cortex, suggesting the ability of the brain to automatically perform complex
comparisons between new sounds and the immediate auditory past. This pre-perceptual
processing in the auditory cortex tends to trigger frontal cortex activity, probably representing
the initiation of an attention switch to sound change. The MMN is, nowadays, the only objective
measure of the accuracy of central auditory processing. Furthermore, it has shown its utility
in measuring higher-level brain processing tasks related to the attention switch to changes
in the acoustic environment. One of the most promising fields of application of the MMN
involves newborns and young infants, since it has been shown that the MMN is elicited even
in prematurely born newborns.
A fundamental problem found in MMN research is the very low signal-to-noise ratio that is
obtained using classical EEG recording techniques. The MMN component, whose amplitude
is typically in the range of 1-3 µV, is embedded in the ongoing EEG activity, which can reach
peak variations of several hundreds of µV. In recent years, several studies have
shown the utility of wavelets in biomedical signal processing. In this Thesis we present two
approaches based on wavelets to detect the presence of the MMN deflection in single ERP trials.
The first approach is based on classification, using a neural network, of the wavelet features
representing the time-frequency window of the MMN component. Results obtained with
this approach have shown its utility in selecting the single ERP trials with the
strongest MMN deflection in order to generate selective averages with improved signal-to-noise
ratios. However, the low classification rates suggested that wavelet features are not enough to
effectively characterize the MMN, due to its overlap with many other EEG components in the
time-frequency plane. The second approach tries to overcome this limitation by combining
wavelet features with Independent Component Analysis (ICA). In many cases, ICA was able
to separate the MMN from the background activity, making the wavelet representation of the
MMN much more accurate and therefore yielding better classification rates for single trials
with the MMN deflection.
Contents
1 Introduction 1
1.1 The MMN in the modern neuroscience research 2
1.2 Tasks formulation 3
1.3 Organization of the Thesis 4
2 Encephalographic data and mismatch negativity 5
2.1 Electroencephalogram (EEG) 5
2.2 Event Related Potentials (ERP) 9
2.3 The Mismatch Negativity 11
2.3.1 MMN elicitation paradigms 12
2.3.2 The process underlying the MMN component 13
2.3.3 Perspectives of application and advantages 14
3 Wavelets and multiresolution analysis 16
3.1 Time representation and frequency representation 17
3.2 Time frequency analysis 17
3.2.1 The Short Time Fourier Transform 18
3.2.2 The Continuous Wavelet Transform (CWT) 20
3.3 Multiresolution Analysis 22
3.3.1 The Discrete Wavelet Transform (DWT) 22
3.3.2 The filter bank approach for the DWT 24
3.4 Wavelets in Biomedical Applications 28
3.4.1 Wavelet properties in the context of biomedical applications 28
3.4.2 Electroencephalography applications 30
4 Independent Component Analysis 32
4.1 Blind Source Separation problem 33
4.2 Independent Component Analysis theory 34
4.2.1 Identifiability and ambiguities of the ICA model 34
4.3 Objective functions for ICA 36
4.3.1 Likelihood and infomax 36
4.3.2 Mutual information 37
4.3.3 Negentropy 38
4.3.4 High order approximations 38
4.4 ICA algorithms 39
4.4.1 Infomax 40
4.4.2 FastICA 40
4.4.3 JADE 41
4.5 Application of ICA in encephalography 42
5 Artificial Neural Networks 44
5.1 Neuron model 44
5.2 Network Architectures 46
5.3 Learning rules and algorithms 46
5.4 Feed-Forward multilayer networks 49
5.5 Backpropagation 51
5.6 Network Generalization 53
5.6.1 Regularization 54
5.6.2 Early stopping 55
5.6.3 Neural Network ensembles 56
5.7 Applications in neurophysiology 57
6 Results 59
6.1 Data 59
6.1.1 Subjects and experimental paradigm 59
6.1.2 Data acquisition and equipment 60
6.1.3 Artifact correction 61
6.2 Averaging 62
6.3 Time-Frequency localization of the MMN 66
6.4 Wavelet denoising 69
6.5 First method: DWT+ISODATA+NN 72
6.5.1 Wavelet based feature extraction 73
6.5.2 Training data preprocessing: ISODATA clustering 73
6.5.3 Neural Classifier 80
6.5.4 Discussion 84
6.6 Second method: DWT+ICA+NN 89
6.6.1 Wavelet based feature extraction and filtering 89
6.6.2 ICA based feature reduction 91
6.6.3 Neural classifier 94
6.6.4 Discussion 95
7 Conclusions 101
8 References 103
A K-means and ISODATA 111
A.1 K-means 111
A.2 ISODATA 112
List of Figures
1.1 The mismatch negativity [81]. 2
2.1 A single plane projection of the head showing the standard electrode sites of the international 10-20 system. 8
2.2 Different views of the head indicating the electrode positions for the 10-20 international system. 8
2.3 An illustration of the averaging process involved in AEP simulation. In this example, the stimulus presented to the subject is a simple tone [25]. 10
2.4 ERP typical components [72]. 11
2.5 Three typical scalp distributions of the MMN component. 12
3.1 Three time-frequency STFT atoms corresponding to a Gaussian window. 19
3.2 STFT of a signal with linear frequency modulation and Gaussian amplitude modulation. Perfect time resolution but no frequency resolution when the window is chosen as a Dirac impulse (a). Perfect frequency resolution but no time resolution when the window is chosen as a constant (b). 20
3.3 Three CWT time-scale atoms. Note that their time duration is inversely proportional to the central frequency. 21
3.4 Sampling of the time-frequency plane. Horizontal axis represents time, vertical axis frequency. Different forms of sampling: time representation, Fourier, Gabor, wavelet. 22
3.5 Scalogram (squared modulus of the CWT) of a signal with linear frequency modulation and Gaussian amplitude modulation. 23
3.6 Splitting of MRA subspaces. 25
3.7 2-Channel analysis filter bank. 26
3.8 2-Channel synthesis filter bank. 27
3.9 General multilevel wavelet decomposition scheme. 27
4.1 Basic BSS model. Unobserved signals: s, observations: x, estimated source signals: y. 33
5.1 Neuron model. 45
5.2 Different types of activation functions [78]. 45
5.3 Feed-forward network. Input patterns are r-dimensional, the input layer has n units, the first hidden layer l units, the second hidden layer k, and the output layer has m neurons. 50
5.4 Illustration of the generalization problem. On the left, the network size was too small to fit the function h(x). On the right, the network memorized the samples but did not solve the general problem. 54
6.1 Stimulus sequence for eliciting the MMN. 60
6.2 Channel locations. 61
6.3 Grand averages for the subjects ADHD1171, ADHD1160, ADHD1167. 64
6.4 Grand averages for the subjects ADHD1165, ADHD1168, ADHD1176. 65
6.5 ERP images using different moving window averages for subject ADHD1171dev2. 67
6.6 Scalograms obtained for ADHD1171dev2. 68
6.7 Wavelet decomposition of a single ERP trial into 5 octaves. 70
6.8 Pulse characteristics and spectrum of the Daubechies 12 Quadrature Mirror Filters. 70
6.9 Daubechies 12 scaling and wavelet function. 71
6.10 ERP image for subject ADHD1171dev2 before and after filtering. 71
6.11 Average DWT coefficients for the standard waves. 74
6.12 Average DWT coefficients for the deviant waves. 75
6.13 DWT features images for standard and deviant waves. 76
6.14 Average of the target waves in the best MMN cluster. 81
6.15 Averages of target and non-target waves when using deviant responses as targets. 85
6.16 Averages of target and non-target waves when using deviant responses as targets. 86
6.17 Averages of target and non-target waves when using difference waves as targets. 87
6.18 Averages of target and non-target waves when using difference waves as targets. 88
6.19 Scheme of the DWT+ICA+NN system. 90
6.20 Ten ERP trials and their decomposition into independent components. 92
6.21 Template used in the matching procedure for the selection of an MMN-like component. 93
6.22 Projections back to the electrodes of one independent component that was automatically labeled as MMN-correlated. Note the clear positive deflection in the frontal and central electrodes and the negative peak in the mastoids A1 and A2. 93
6.23 Averages of target and non-target waves for method DWT+ICA+NN. 97
6.24 Averages of target and non-target waves for method DWT+ICA+NN. 98
6.25 Effect of the size of the subdatasets used for the moving window ICA. 99
List of Tables
2.1 Features of the main EEG components [81]. 7
2.2 MMN: reasons for wide applicability [58]. 15
5.1 Batch Perceptron algorithm [22]. 48
5.2 Well-known learning algorithms [37]. 49
5.3 Standard Backpropagation algorithm [5]. 52
5.4 The Bagging algorithm [88]. 56
5.5 The Adaboost algorithm [88]. 57
6.1 Number of trials that remained for each subject after rejecting those trials with artifacts. 62
6.2 Sample misclassification matrix when using the deviant waves as targets. 78
6.3 Sample misclassification matrix when using the difference waves as targets. 79
6.4 Number of times that an MMN cluster was found when using the deviant responses as targets. 50 clustering results were obtained by setting different initial cluster centers. 79
6.5 Number of times that an MMN cluster was found when using the difference responses as targets. 50 clustering results were obtained by setting different initial cluster centers. 82
6.6 Best MMN clusters when using deviant waves as targets. 82
6.7 Best MMN clusters when using difference waves as targets. 83
6.8 Classification results when using the deviant waves as targets. 83
6.9 Classification results when using the difference waves as targets. 83
6.10 Results for the single Neural Network using deviant responses as targets. 96
6.11 Results for the ensemble using deviant responses as targets. 96
Chapter 1
Introduction
Processing of sensory stimulus features is essential for humans in determining their responses
and actions. If behaviorally relevant aspects of the environment are not correctly represented
in the brain, then the organism's behavior cannot be appropriate. Without these representations,
our ability to understand spoken language, for example, would be seriously impaired.
Our everyday environment consists of a very complex mixture of simultaneously active sound
sources with overlapping temporal and spectral acoustic properties. Nevertheless, we perceive
it as an orderly auditory scene that is organized according to sources and auditory events,
allowing us to select messages easily, recognize familiar sound patterns, and distinguish
deviant or novel ones. Extracting useful information from such an ever-changing environment
requires the creation and maintenance of neural models that absorb the major part of the
incoming sensory data while letting the potentially important signals pass through. Recent data
suggest that these perceptual achievements are mainly based on processes of a cognitive
nature (sensory intelligence) in the auditory cortex.
Cognitive neuroscience has consequently emphasized the importance of understanding brain
mechanisms of sensory information processing, that is, the sensory prerequisites of cognition.
Unfortunately, most of the data obtained do not allow the objective measurement of the
accuracy of these stimulus representations. In audition, however, recent cognitive neuroscience
seems to have succeeded in extracting such a measure: the mismatch negativity
(MMN), a component of the event-related potential (ERP), first reported by Naatanen et al.
(1978).
Figure 1.1: The mismatch negativity [81].
1.1 The MMN in the modern neuroscience research
MMN is elicited by any discriminable change in a repetitive background of auditory stimulation.
It is typically recorded from sequences of auditory stimuli in which low-probability
deviant sounds are interspersed amongst frequent standard sounds while the subject's attention
is directed elsewhere, for example by requiring them to read a book. Stimulation
deviance can be defined by physical features such as tone frequency, duration or intensity.
The elicited response in the ERP has the shape of a negative deflection in the difference ERP
response¹ at latencies from 100 ms to 200 ms (see Figure 1.1).
There are many reasons for the wide applicability of the MMN in modern neuroscience
research. One of its most important advantages, which makes it very suitable for cognitive
studies in infants and other special clinical populations, is the fact that it can be measured in
the absence of attention and without any task requirements. For decades behavioral methods,
such as the head-turning or sucking paradigms, have been the primary methods to investigate
auditory discrimination, learning and the function of sensory memory in infancy and early
childhood. With the MMN a new method for investigating these issues has emerged [14].
But the applications of the MMN go further than just the measurement of preattentive
cognitive processes. Some studies give importance to the MMN also in the context of sensory
inference and passive attentional processes. Thus the MMN can serve as a joint description
of preattentive and attentional processes in the human brain [85].
¹ The difference response is obtained by subtracting the responses to the standard stimuli from the responses to the deviant sounds.
In clinical medicine
the MMN has been documented to disclose many neurological changes [81] and diseases,
including:
- Early diagnosis of hearing problems and Central Nervous System maturation in newborns [14].
- Diagnosis of aphasic patients [18].
- Alzheimer's disease [62].
- Parkinson's disease [63].
- Schizophrenia [56].
- Dyslexia [49].
- Alcoholism [31].
But the field of application of the MMN is still growing and is of great importance in
modern neuropathology. Nevertheless, one of the main drawbacks of the MMN is its usually
very small Signal to Noise Ratio (SNR), which very often makes it difficult to detect by simple
visual inspection of the ERP.
1.2 Tasks formulation
The goal of this Master Thesis is to present the modern signal processing techniques that are
being used to identify and classify the occurrence of the brain processes reflected in the MMN,
and to apply them to a real ERP dataset with MMN.
In order to achieve these goals a number of tasks must be solved:
- Artifact correction of EEG data. The ERP dataset that we have used to perform our
tests had already been artifact-corrected by my colleagues D. Rusanovskyy and A.
Bazhyna as a preliminary step in their Masters Theses [72, 5]. Nevertheless, we will
perform more intensive checks for artifact detection and elimination.
- Denoising and removal of the baseline and low-frequency trends. These tasks will be
done by means of simple wavelet filtering.
- Extraction of characteristic features of the MMN component. Wavelets will be used
in this Thesis not only for denoising but also for extracting characteristic features of
the MMN. We will show that Independent Component Analysis can be combined with
wavelet analysis to extract the features of the MMN process more efficiently.
- Classification of ERP trials, or small averages of some trials, which are likely to possess
the MMN deflection. After extracting the MMN features from the ERP trials we will try
to automatically detect the presence of the MMN in single/averaged trials using neural
networks.
The final objective is to construct a system for the automatic detection of the MMN in ERP
trials.
1.3 Organization of the Thesis
The Thesis is organized into seven chapters. Chapter 2 gives a basic overview of Event
Related Potentials (ERP) and the mismatch negativity (MMN). Since the main topic of this
Thesis is the signal processing techniques and not the medical aspects of the ERP and the
MMN, we will not explain these issues in detail. Some references are given for the reader
interested in the medical aspects of the ERP and the MMN. Chapter 3 deals with multiresolution
analysis and wavelets and their applications in biomedical signal processing. Chapter 4
presents Independent Component Analysis (ICA) and its applications in the biomedical field,
specifically in neurological signals research. In Chapter 5, the theory of Neural Networks
(NNs) and their most important architectures and learning algorithms are introduced.
Chapter 6 shows the results obtained when applying Wavelets, Neural Networks and ICA to
an ERP dataset provided by the Department of Psychology of the University of Jyvaskyla,
Finland. The conclusions are drawn in Chapter 7.
Chapter 2
Encephalographic data and
mismatch negativity
Electroencephalography (EEG) is a technique for measuring the electrical activity of the
brain that is caused by the currents generated within neurons. It was originally developed
in 1924 by Hans Berger, who showed that it was possible to record on the scalp the feeble
electric currents generated in the brain, and to depict them graphically on a strip of paper.
Due to its non-invasive nature and relatively low cost, the EEG is one of the most popular
clinical tools used for monitoring brain activity. There are two basic neuroelectric examinations:
(1) EEG studies that involve inspection of spontaneous brain activity and (2) Event-Related
Potential studies (ERPs) that use signal averaging and other processing techniques to
extract neural activity that is time-locked to specific sensory, motor or cognitive events. In this
chapter we are going to overview the EEG and ERP signals. We will also explain in more
detail the Auditory Event Related Potentials (AERPs) and the mismatch negativity as one
of their more important components.
2.1 Electroencephalogram (EEG)
The human brain is an extremely complicated network that contains about $10^{10}$ interconnected
neurons. Each neuron consists of a central portion containing the nucleus, known as
the cell body, and one or more structures referred to as axons and dendrites. The dendrites
are rather short extensions of the cell body and are involved in the reception of stimuli. The
axon, by contrast, is usually a single elongated extension. Rapid signaling within the nervous
system occurs by two primary mechanisms:
- Within nerves and neurons by way of action potentials. These action potentials consist
of a rapid swing of the polarity of the neuron transmembrane voltage from negative
to positive and back. These voltage changes result from changes in the permeability of
the membrane to specific ions, the internal and external concentrations of which are in
imbalance.
- Between neurons by way of neurotransmitter diffusion across synapses. The release
of a chemical neurotransmitter is triggered by the arrival of a nerve impulse (action
potential). Receptors on the opposite side of the synaptic gap bind neurotransmitter
molecules and respond by opening nearby ion channels in the post-synaptic cell membrane,
causing ions to rush in or out and changing the local transmembrane potential
of the cell, thus propagating the nervous impulse.

Both synaptic and action potentials result in potential differences across the neuron membrane.
These potentials, and the resting potentials of the glial cells, are the main contributors
to the EEG signal.
The human spontaneous EEG has the appearance of a noisy signal devoid of any dominant
frequency. Nevertheless, the frequency decomposition of this signal manifests a rhythmicity
occasionally interrupted by transient discharges. These rhythms and discharges are classified
according to their location, frequency, amplitude, morphology, periodicity, and behavioral and
functional correlates. Table 2.1 depicts the main EEG rhythms and some pathophysiologies
related to alterations in those rhythms.
EEG signals are recorded as potential differences between surface electrodes on the scalp. The
recorded signal therefore depends on the positions of the individual electrodes and how they
are paired. Most laboratories performing routine clinical evaluations use the International
10-20 System of electrode placement [38], [25]. This system is based on placing the electrodes
at 10 % to 20 % of the distances between anatomical landmarks on the skull and head, with
the idea of providing some standardization across head sizes. Each electrode site has a letter
identifying its sub-cranial lobe (i.e. FP - frontopolar or prefrontal, F - frontal, T - temporal,
C - central, P - parietal, O - occipital) and a number, or another letter, identifying its
hemispherical location. The subscript Z (denoting line zero) ensuing any lobe abbreviation
refers to an electrode placed along the cerebrum's midline. An even number (2, 4, 6 or 8)
represents the right hemisphere, with odd numbers (1, 3, 5 or 7) referring to the left
hemisphere. The numbers rise with increasing distance from the midline of the head. Figure 2.1
depicts the positions of the electrodes in the international 10-20 system. Figure 2.2 shows a
lateral, frontal and superior view of the head indicating the same electrode positions.
Alpha (8-13 Hz): The most prominent rhythm in the normal adult brain. Most prominent at occipital and parietal electrodes. About 25 % stronger over the right hemisphere. Fully present when a subject is mentally inactive, alert, with eyes closed. Disrupted by visual attentiveness; almost totally disappears when the eyes are opened. Slowing is considered a nonspecific abnormality in metabolic, toxic, and infectious conditions. Asymmetries: unilateral lesions. Loss of reactivity: a lesion in the temporal lobe. Loss of alpha: brainstem lesion.

Mu (7-11 Hz): Mostly active at central electrodes. It does not react to opening of the eyes, but shows blocking before movement of the contralateral hand. It does not have any clinical application.

Beta (18-30 Hz): There are three basic types: (1) frontal beta (blocked by movement), (2) widespread beta (often unreactive) and (3) posterior beta (shows reactivity to eye opening).

Theta (4-7 Hz): Found in the drowsy normal adult, in frontal and temporal regions. Rare in the EEG of awake adults; focal or lateralized theta indicates focal pathology, diffuse in temporal regions.

Delta (< 4 Hz): Dominant rhythmic activity in infants and in deep stages of adult sleep. Polymorphic delta indicates acute or ongoing injury to cortical neurons. Rhythmic discharge is characteristic of psychophysiological dysfunction.

Table 2.1: Features of the main EEG components [81].
Figure 2.1: A single plane projection of the head showing the standard electrode sites of the international 10-20 system.
Apart from the electrodes, any EEG recording system requires at least these elements:

- Amplifiers. Since the amplitude of the signals recorded on the scalp is typically in the
range of 10-100 µV, high input impedance differential amplifiers are required.
- Filters. Usually the EEG signal is filtered before recording it. High-pass, low-pass and
notch filters are used in this context.
- Recording unit. Used to keep a permanent record of the EEG signal. Originally the
EEG was recorded on paper, but digital recording is the norm nowadays.

Figure 2.2: Lateral, frontal and superior views of the head indicating the electrode positions for the 10-20 international system.
Almost every evaluation of the recorded EEG is preceded by an artifact correction step.
Usually this task is performed by a human expert with the help of some automatic processing
tools that select the intervals of the EEG signal most likely to contain artifacts. Major types
of physiological artifacts include: EOG (i.e. ocular) artifacts, muscular activity, respiration,
and head and body movements.
2.2 Event Related Potentials (ERP)
The EEG recorded in response to a specific sensory stimulus is termed an Evoked Potential
or Event Related Potential (ERP). Alone, the low amplitude of the ERP makes it difficult to
distinguish from the background brain activity (EEG). However, since it is assumed that
repetitive applications of the stimulus will activate similar pathways in the brain, several ERP
responses can be averaged to improve the SNR. If we assume that the background
EEG (the noise in this case) and the ERP are statistically independent, we have for a large
number of repetitions $M$:

$$\mathrm{SNR}_{M\ \mathrm{repetitions}} = \sqrt{M}\cdot\mathrm{SNR}_{1\ \mathrm{repetition}}$$
The waveform resulting from averaging a series of individual EEG epochs time-locked to the
evoking stimulus is commonly referred to as an averaged evoked potential (AEP) [7]. The
AEP reflects only that activity which is consistently associated with the stimulus processing
in a time-locked way. The AEP thus reflects, with high temporal resolution, the change
of neuronal activity resulting from synchronized response patterns of thousands of neurons
evoked by the stimulus. Figure 2.3 shows a typical AEP waveform for different numbers of
averaged epochs.
ERPs can be recorded from all of the primary sensory modalities (visual, auditory, somatosensory
and gustatory) and from motor events (e.g., a button press). Moreover, they can be
recorded from multiple locations on the scalp. While there are formidable challenges to
determining the location within the brain from which ERPs emanate, recording from multiple
sites does afford some information on the locus of the underlying relevant brain systems.
By convention, ERP researchers break down ERP waveforms into several basic parts or
components (see Figure 2.4). Components are the positive- and negative-going fluctuations that
can be seen in any ERP waveform. Viewed on different time scales, one can see that the
ERP is a rich source of temporal information. It is common to consider that the components
occurring prior to 100 ms reflect information processing in the early sensory pathway.
Figure 2.3: An illustration of the averaging process involved in AEP simulation. In this example, the stimulus presented to the subject is a simple tone [25].

Figure 2.4: ERP typical components [72].
For example, the auditory brain stem ERP arises from neural impulses traveling from the
cochlea through auditory brain stem centers. The middle-latency components, in turn,
are thought to reflect activity in the thalamus and possibly the earliest cortical processing.
Cognitive scientists have been most interested in the so-called long-latency ERP components,
which include the P1, P2, N1, N2, N400 and P3 components. These components are named
by their polarity (N for negative) and either their ordinal position after stimulus onset (P1
is the first positive peak) or their latency after stimulus onset (the N400 is a negative-going
component peaking at 400 ms). Commonly, the long-latency components occurring prior to
200 ms are thought to reflect late sensory and early perceptual processes. The MMN can be
included in this group. Those components after 250 ms or so are considered to reflect higher-level
cognitive processes (e.g., memory and language) [71].
2.3 The Mismatch Negativity
The mismatch negativity (MMN) is a component of the auditory event related potential
(ERP) which is elicited task-independently by an infrequent change in a repetitive sound.
The MMN can be recorded in response to any discriminable change in the stimulus stream
[81]. This change can be in any of the physical features of the stimuli, such as frequency,
duration, intensity or location. It appears as a negative peak of the difference wave, obtained by
subtracting the ERPs elicited by the standard tones from those elicited by the deviant tones,
at a latency between 100 and 200 ms [23]. The MMN has a frontocentral scalp distribution,
with polarity reversal at electrode locations below the Sylvian fissure¹, suggesting generator
sources located bilaterally in the supratemporal auditory cortex [29]. This auditory
cortex location has been confirmed by a range of cognitive neuroscience methods, including
intracranial recordings in animals [17, 47] and humans [48], source modeling of magnetoencephalographic
(MEG) signals in humans [2, 19], analysis of the scalp current density (SCD) of
deviant-related negativities [29, 70], functional magnetic resonance imaging (fMRI) [13] and
positron emission tomography (PET) [76]. Figure 2.5 shows typical scalp distributions of
the MMN component.
¹ The Sylvian fissure is an important surface landmark located between the frontal and temporal lobes of the brain.
Figure 2.5: Three typical scalp distributions of the MMN component (independent components IC 2 from subdataset 23, IC 2 from subdataset 38, and IC 6 from subdataset 45).
There are two basic facts that differentiate the MMN from other ERP components:

- It is elicited only by changes in the auditory stimulus and not by any stimulus per se [60].
- Its elicitation is independent of whether or not the subject's task is related to the
auditory stimuli (e.g., counting the deviants vs. reading a book) [74].
2.3.1 MMN elicitation paradigms
As we already mentioned, the basic paradigm for MMN recordings involves presenting a
repetitive auditory stimulus (termed standard) infrequently substituted by a different sound
(deviant). In this classical paradigm, the standard stimuli typically evoke an N1-P2 complex
(see Figures 1.1 and 2.4). In recent years a number of different paradigms for eliciting the
MMN have been developed [64].
Most typically, the MMN is measured from the difference wave obtained by subtracting the
standard response from the deviant one in order to cancel the N1-P2 complex and the background
EEG activity. However, if a paradigm with mostly flat responses to standards is available, the
responses to deviant stimuli can be used without subtraction [77]. Increasing the interstimulus
time interval (ISI) makes the N2 amplitude decrease. Alternatively, changing the duration
of the deviant tone allows us to manipulate the latency of the MMN component, making it
appear outside of the N1 occurrence time range [65]. For more details about the optimization
of stimulus streams we refer to [77].
Pihko and her colleagues [65] used a continuous sound stream, which consisted of two
alternating 100 ms tones of 600 and 800 Hz. The tone changed to the other pitch without a
pause and with constant tone intensity. 10 % of the tones were randomly replaced by
shorter-duration pitch elements of 30 ms and 70 ms. The results obtained with this stimulus stream
show a negative deflection 125-225 ms after the beginning of the deviant stimulus that can
be interpreted as the MMN unconfined by the N1. This experimental paradigm is the one
used in this Thesis for eliciting the MMN data.
2.3.2 The process underlying the MMN component
We can find in the literature several different interpretations of the process underlying the
MMN component. Amongst them, the most important ones are:

- The trace presence hypothesis. Naatanen [57] suggested that the MMN reflects the
outcome of a neural mismatch process between the deviant stimulus and a memory
trace of the standard. Most cognitive studies of the MMN agree that the existence of a
trace in auditory sensory memory is a necessary precondition for eliciting the MMN. The
trace presence hypothesis assumes that this condition is also sufficient, i.e. the second
of two different auditory stimuli should always elicit an MMN. However, some results,
like those obtained by Cowan et al. [16] and Winkler et al. [84], refute this hypothesis
and point to the conclusion that the MMN process has determinants in addition to the
presence of the standard-stimulus trace.

- The relative trace strength hypothesis. An extension of the trace presence hypothesis.
It considers that the MMN is only elicited by stimuli whose representation is weaker
than the representation of a different stimulus available in transient auditory memory.
This can explain some situations where the traditional trace presence hypothesis fails,
but it cannot explain some of the findings of Winkler et al. in [85].

- The model adjustment hypothesis. This hypothesis suggests that the MMN reflects
modifications to existing parts of the preattentive model of the acoustic environment, caused
by the incorporation of a new auditory event that mismatches the actual inferences of
the model [85]. According to this theory, a model of the acoustic background is maintained
even if the auditory stimuli are not in the focus of attention. In such a situation
the model absorbs those sounds that fit it and reacts, updating the model, only to
those stimuli that violate the current model.
Depending on the preceding sequence of sounds, the model can attain a high or low
sensory inference value. A quasi-deterministic sequence of stimuli usually results in a
model of high inference value, whereas the model has low inference value during random
acoustic stimulation. Of course, the model has a limited inferencing capability, limited
mainly by the temporal constraints of the underlying transient memory and by the
ability of the model to detect the rules governing the auditory stimulus sequence.
The model adjustment hypothesis is able to explain the situations in which the trace
presence and relative trace strength hypotheses are valid. It is also coherent with the results
obtained in [85]. Thus we can consider this theory as the one that gives the best
description of the process underlying the elicitation of the auditory MMN component.
If we consider the model adjustment hypothesis valid, the MMN widens its functional
relevance to the context of sensory inference and passive attentional processes, making
MMN analysis even more attractive to researchers.
2.3.3 Perspectives of application and advantages
We are going to explain very briefly the main applications of the MMN in modern cognitive
neuroscience. This section is based on [58]. For a detailed description we refer to the original
text of Naatanen [58].
1. MMN as an automatic change-detection response in audition. The MMN is
currently the only valid objective measure of the accuracy of central auditory processing
in the human brain. The fact that MMN elicitation is attention-independent shows that
the human brain is able to automatically perform very complex stimulus comparisons.

2. MMN as an index of sensory memory in audition. The MMN elicitation is
dependent on the presence of a memory trace formed by the standard stimuli in the
auditory sensory memory. Several studies show that the MMN cannot be attributed to
new or fresh afferent elements activated by the deviant but not by the standard.

3. MMN as an index of attention switch to sound change. The frontal MMN source
has been suggested to underlie a neural signal triggering the attention-switching
response. This is supported by studies showing deterioration of task performance at
the occurrence of unexpected task-irrelevant deviant sounds eliciting the MMN [23].

4. MMN shows the presence of language-specific speech-sound traces. Recent
studies have shown that the MMN can also be used to probe permanent auditory
memory traces, such as those representing the phonemes of our mother tongue (see,
for example, Naatanen et al. [59]).
The MMN has certain advantages over other neuroelectric measures of central auditory
function. The main advantages and reasons for the wide applicability of the MMN are listed
in Table 2.2, which has been borrowed from [58].

Table 2.2: MMN: reasons for wide applicability [58].
- Inexpensive
- Easy
- Attention-independent elicitation (no task is needed); may also be elicited in sleep and coma
- The only objective measure for: the accuracy of central auditory processing (correlates with perceptual accuracy); the duration of echoic memory; the permanent auditory memory traces (e.g. speech-sound memory traces)
- An objective measure of the temporal window of integration in auditory perception
- An objective index of general brain degeneration
- An objective index of the gross functional state of the brain (e.g. coma-outcome prediction, drug effects)
- A measure of left temporal-cortex loss of gray matter (schizophrenia)
- Ontologically the first cognitive ERP component (can be elicited even in newborns)
- Generators and their functional significance relatively well known
Chapter 3
Wavelets and multiresolution
analysis
This chapter introduces the concepts of time-frequency and multiresolution analysis. After
a brief review of time-domain and frequency-domain representations, we introduce some
concepts that constitute the basic background necessary to understand wavelet theory.
Wavelets provide a unified framework for a number of techniques which had been developed
independently for various signal processing applications. For example, multiresolution signal
processing, mainly used in computer vision, subband coding, developed for speech and image
compression, and wavelet series expansions, developed in applied mathematics, have
recently been recognized as different views of a single theory.
We will also present an overview of the various uses of the wavelet transform in the processing
of biomedical signals, and more specifically in the analysis of 1-D physiological signals
such as the ones obtained in electrocardiography (ECG) and electroencephalography (EEG),
including evoked response potentials (ERP).
It is our aim to give a general perspective that yields insight into the wavelet method as a
tool for signal processing and analysis, and not to explain the mathematical details of those
methods. Furthermore, we will focus on the standard dyadic wavelet decomposition, skipping
the theory of wavelet packet analysis (WPA) and wavelet packet synthesis (WPS) since they
are not used in this Thesis. A detailed explanation of wavelets can be found in [53, 30, 28].
3.1 Time representation and frequency representation
The time representation is usually the first (and the most natural) description of a signal,
since almost all physical signals are obtained by recording variations with time.
The frequency representation, obtained by the well-known Fourier transform

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j2\pi f t}\, dt$$

and its inverse

$$x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j2\pi f t}\, df$$

is also a very powerful way to describe a signal, mainly because the relevance of the concept
of frequency is shared by many domains (physics, astronomy, economics, biology, etc.) in
which periodic events occur.
If we look more carefully at the spectrum X(f), it can be viewed as the coefficient function
obtained by expanding the signal x(t) into the family of infinite waves $e^{j2\pi f t}$, which are
totally unlocalized in time. Thus, the frequency spectrum tells us which frequencies are
contained in a signal, as well as their corresponding amplitudes and phases, but tells nothing
about the times at which these frequencies occur. This is why the Fourier transform is not
suitable if the signal has a time-varying frequency spectrum, i.e. if the signal is non-stationary.
This type of signal is of special relevance in the biomedical field, since a large
amount of the information carried by physiological signals like the EEG and the ECG is
found in transient and short-duration changes in the ongoing background activity.
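The lack of time localization is easy to verify numerically. In the hedged sketch below (plain NumPy, with an assumed 500 Hz sampling rate and a toy 10 Hz burst), two signals that differ only in when the oscillation occurs have numerically identical magnitude spectra:

```python
import numpy as np

fs = 500                                   # sampling rate in Hz (assumed)
t = np.arange(0, 1.0, 1 / fs)

# two different signals: a 10 Hz burst in the first vs. the second half-second
x_early = np.where(t < 0.5, np.sin(2 * np.pi * 10 * t), 0.0)
x_late = np.where(t >= 0.5, np.sin(2 * np.pi * 10 * t), 0.0)

# their magnitude spectra coincide: the Fourier transform tells which
# frequencies are present, but not when they occur
S_early = np.abs(np.fft.rfft(x_early))
S_late = np.abs(np.fft.rfft(x_late))
print(np.allclose(S_early, S_late))        # True
```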
3.2 Time frequency analysis
As we have seen in the previous section, the Fourier transform is not well adapted to the
analysis of non-stationary signals, since it projects the signal on infinite waves (sinusoids)
which are completely delocalized in time. Thus, one has to consider the use of bidimensional
functions (functions of the variables time and frequency).
A first class of such time-frequency representations is given by the atomic decompositions (also
known as the linear time-frequency representations), the best-known cases being the short-time
Fourier transform (STFT) and the wavelet transform (WT).
An alternative class of solutions is given by the energy distributions, among which the Cohen
[15] class is of special relevance, but we will not discuss them since they are too wide a subject
and are not used in this Thesis.
3.2.1 The Short Time Fourier Transform
In order to introduce time dependency into the Fourier transform, a simple and intuitive
solution consists in pre-windowing the signal x(t) around a particular time t, calculating its
Fourier transform, and doing that for each particular time t. The resulting transform, called
the short-time Fourier transform (STFT), is

$$\mathrm{STFT}_x(t, f; w) = \int x(\tau)\, w^*(\tau - t)\, e^{-j2\pi f \tau}\, d\tau$$

where w(t) is the short-time analysis window, localized around t = 0 and f = 0. Because
multiplication by the relatively short window $w^*(\tau - t)$ effectively suppresses the signal outside
a neighborhood around the analysis time point $\tau = t$, the STFT is a local spectrum of the
signal $x(\tau)$ around t. The duration of the analysis window should be such that the signal can
be assumed to be stationary during its time span.
Provided that the short-time window is of finite energy, the STFT is invertible according to

$$x(t) = \frac{1}{E_w} \iint \mathrm{STFT}_x(\tau, \nu; w)\, w(t - \tau)\, e^{j2\pi \nu t}\, d\tau\, d\nu$$

with $E_w = \int |w(t)|^2\, dt$. This relation expresses that the total signal can be decomposed
into a weighted sum of elementary waveforms

$$w_{t,f}(\tau) = w(\tau - t)\, e^{j2\pi f \tau}$$

which can be interpreted as building blocks or atoms. Each atom is obtained from the
window w(t) by a translation in time and a translation in frequency (modulation). Figure 3.1
shows three atoms corresponding to a Gaussian-type window.
From the frequency point of view, the STFT may also be expressed in terms of the signal and
window spectra:

$$\mathrm{STFT}_x(t, f; w) = \int X(\nu)\, W^*(\nu - f)\, e^{j2\pi(\nu - f)t}\, d\nu$$

where X and W are, respectively, the Fourier transforms of the signal x and the window w.
Thus the STFT can be considered as the result of passing the signal through a band-pass
filter whose frequency response is deduced from a mother filter $W(\nu)$ by a translation of f.
So the STFT is similar to a bank of band-pass filters with constant bandwidth.
Figure 3.1: Three time-frequency STFT atoms corresponding to a Gaussian window.

The STFT gives us a time-frequency representation of a signal, but it has some resolution
problems whose roots go back to the Heisenberg uncertainty principle. This principle states
that one cannot know the exact time-frequency representation of a signal; we can only know
the time intervals in which a certain band of frequencies exists [66].
We can easily obtain the time resolution of the STFT by taking for the signal x(t) a Dirac
impulse:

$$x(t) = \delta(t - t_0) \;\Rightarrow\; \mathrm{STFT}_x(t, f; w) = e^{-j2\pi t_0 f}\, w(t - t_0)$$

Similarly, if we consider a complex sinusoid (a Dirac impulse in the frequency domain), we
obtain the frequency resolution of the STFT:

$$x(t) = e^{j2\pi f_0 t} \;\Rightarrow\; \mathrm{STFT}_x(t, f; w) = e^{j2\pi t f_0}\, W(f - f_0)$$

So we have a trade-off between time and frequency resolution: on one hand, a good time
resolution requires a window with short time support. On the other hand, a good frequency
resolution requires a narrow-band filter, i.e. a long window w(t). This fixed time-frequency
resolution of the STFT poses a serious constraint in many applications (see Figure 3.2). The
wavelet transform (WT) solves this problem to a certain extent, as we will see in
the next section.
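The window-length trade-off is easy to reproduce with standard tools. The sketch below is an illustration using scipy.signal (not code from this Thesis); it analyzes a linear chirp, similar in spirit to the signal of Figure 3.2, with a short and a long window, and the assumed sampling rate and window lengths are arbitrary:

```python
import numpy as np
from scipy import signal

fs = 1000                                    # sampling rate in Hz (assumed)
t = np.arange(0, 2.0, 1 / fs)
x = signal.chirp(t, f0=10, t1=2.0, f1=200)   # linear frequency sweep, 10 -> 200 Hz

# Short window: good time resolution, poor frequency resolution.
f_s, t_s, Z_s = signal.stft(x, fs=fs, nperseg=32)
# Long window: good frequency resolution, poor time resolution.
f_l, t_l, Z_l = signal.stft(x, fs=fs, nperseg=512)

df_short = f_s[1] - f_s[0]                   # frequency bin width ~ fs / nperseg
df_long = f_l[1] - f_l[0]
print(f"short window: {df_short:.2f} Hz bins, {len(t_s)} time frames")
print(f"long window:  {df_long:.2f} Hz bins, {len(t_l)} time frames")
```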
Figure 3.2: STFT of a signal with linear frequency modulation and Gaussian amplitude modulation. Perfect time resolution but no frequency resolution when the window w is chosen as a Dirac impulse (a). Perfect frequency resolution but no time resolution when w is chosen as a constant (b).
3.2.2 The Continuous Wavelet Transform (CWT)
As we just mentioned, the time and frequency widths of the window function of the STFT
do not depend upon the location in the time-frequency plane. However, in practical applications,
high-frequency components are usually present for short durations while low-frequency
components stay for long durations. The CWT is designed to take advantage of this fact
by using good time resolution and poor frequency resolution at high frequencies, and good
frequency resolution and poor time resolution at low frequencies.
The continuous wavelet transform (CWT) of a signal x with respect to some analyzing
wavelet $\psi$ is defined by

$$\mathrm{CWT}_x(\tau, s; \psi) = \int x(t)\, \psi^*_{\tau,s}(t)\, dt$$

where

$$\psi_{\tau,s}(t) = \frac{1}{\sqrt{|s|}}\, \psi\!\left(\frac{t - \tau}{s}\right)$$

As we can see, the transformed signal is a function of two variables, $\tau$ and $s$. The variable $s$
is called the scale parameter, since taking $|s| > 1$ dilates the mother wavelet and taking
$|s| < 1$ compresses it. The variable $\tau$ is commonly referred to as the translation parameter,
in the sense that it shifts the position of the mother wavelet along the time axis.
Figure 3.3: Three CWT time-scale atoms. Note that their time duration is inversely proportional to the central frequency.
By definition, the wavelet transform cannot be considered a time-frequency representation
of a signal, but rather a time-scale representation. However, for wavelets that are well
localized around a non-zero frequency $f_0$ at scale $s = 1$, a time-frequency interpretation is
possible thanks to the formal identification $f = f_0/s$.
As we can easily guess, the basic difference between the wavelet transform and the short-time
Fourier transform is the new scale parameter s. When this scale factor is changed, the
duration and the bandwidth of the wavelet change while its shape remains the same.
Since large scales ($|s| > 1$) are related to low frequencies and small scales ($|s| < 1$) to high
frequencies, we can see that the CWT uses short windows at high frequencies and long ones
at low frequencies. In contrast, the STFT uses a single analysis window for all frequencies.
Figure 3.5 shows the squared modulus of the CWT of the same signal whose STFT can be
seen in Figure 3.2. We can check how the CWT offers a good time-frequency representation
of the given signal without requiring the specification of a window length. So the CWT
partially overcomes the resolution limitation of the STFT: the bandwidth B of the analysis
window is proportional to f, or

$$\frac{B}{f} = Q = \mathrm{constant}$$

The CWT can thus be considered a constant-Q analysis. The CWT can also be seen as
Figure 3.4: Sampling of the time-frequency plane. The horizontal axis represents time, the vertical axis frequency. Different forms of sampling: (a) time representation, (b) Fourier, (c) Gabor (STFT), (d) wavelets.
a filter bank analysis composed of band-pass filters with a constant relative bandwidth, as we
will see in Section 3.3.2.
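As a hedged illustration of computing a scalogram like the one in Figure 3.5, the sketch below uses the PyWavelets package (pywt) with a Morlet wavelet on a toy signal; the sampling rate, scales and test signal are assumptions for the example only:

```python
import numpy as np
import pywt  # PyWavelets

fs = 250                                   # sampling rate in Hz (assumed)
t = np.arange(0, 1.0, 1 / fs)
# toy signal: low-frequency background plus a short high-frequency burst
x = np.sin(2 * np.pi * 5 * t)
x[100:125] += np.sin(2 * np.pi * 40 * t[100:125])

scales = np.arange(1, 64)
coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1 / fs)

scalogram = np.abs(coeffs) ** 2            # squared modulus, as in Figure 3.5
# each row of `scalogram` is the energy at one scale; `freqs` gives the
# corresponding pseudo-frequency via the identification f = f0 / s
print(scalogram.shape, freqs[:3])
```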
3.3 Multiresolution Analysis
As the name suggests, in multiresolution analysis (MRA) a function is viewed at various levels
of approximation or resolution. The idea was developed by Meyer [55] and Mallat [52]. By
applying MRA we can divide a complicated function into several simpler ones and study
them separately.
We are not going to go through the mathematical details of multiresolution analysis, since
they are not our main concern. Instead, we are just going to show the relation between MRA
and the filter bank approach to the Discrete Wavelet Transform (DWT). For a complete
review of MRA theory we refer to [52].
3.3.1 The Discrete Wavelet Transform (DWT)
We have seen that the CWT allows us to perform a multiresolution analysis of a given
signal. However, the CWT has some important drawbacks that make it impractical for real
signal processing applications [83]:
Figure 3.5: Scalogram (squared modulus of the CWT) of a signal with linear frequency modulation and Gaussian amplitude modulation.
- The CWT is highly redundant as far as the reconstruction of the signal is concerned.
This is due to the fact that a continuously scalable set of wavelet functions is nowhere
near an orthonormal basis¹.
- The CWT of a signal is a continuous function, and therefore it is composed of an
infinite number of terms.
- For most functions the CWT has no analytical solution, and it can be calculated only
numerically.

¹ The CWT is a quasi-orthogonal transform, i.e. it allows perfect reconstruction even though the wavelet basis is not orthogonal.
In order to solve the former problems the Discrete Wavelet Transform (DWT) was introduced.
The DWT works as the CWT but choosing only certain scales and positions. The natural
way to sample the time-scale plane is to take samples on the non-uniform grid dened by
(t, s) = (nt
0
s
m
0
, s
m
0
) t
0
> 0, s
0
> 0 m, n Z
The discrete wavelet transform (DWT) is dened as
DWT
x
(m, n; ) =

x(u)

n,m
(u)du
¹ The CWT is a quasi-orthogonal transform, i.e. it allows perfect reconstruction even though the wavelet
basis is not orthogonal.
where ψ_{n,m}(u) = s_0^{-m/2} ψ(s_0^{-m} u − n t_0). The most natural choice (s_0 = 2, t_0 = 1) corresponds
to the dyadic sampling of the time-scale plane (see Figure 3.4(d)). Using such a sampling we
obtain that the family of atoms {ψ_{n,m}(u); m, n ∈ Z} forms an orthonormal basis.
3.3.2 The filter bank approach for the DWT
In 1986, Mallat [52] showed that the difference of information between the approximations of a
signal at the resolutions 2^{j+1} and 2^j can be extracted by decomposing this signal on a wavelet
orthonormal basis of L²(Rⁿ). In L²(R) an orthogonal dyadic multiresolution representation
is a chain of closed subspaces indexed by all integers

... ⊂ V_{−2} ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ V_2 ⊂ ...
subject to the following three conditions:

- Completeness: lim_{n→∞} V_n = L²(R) and lim_{n→−∞} V_n = {0}.
- Scale Similarity: f(t) ∈ V_n ⟺ f(2t) ∈ V_{n+1}.
- Translation Seed: V_0 has an orthonormal basis consisting of all integer translates of a
single scaling function φ(t): {φ(t − k) : k ∈ Z}. Similarly, if W_0 denotes the orthogonal
complement of V_0 in V_1, W_0 is also orthogonally spanned by the integer translates
of a single translation seed ψ(t). This function ψ(t) is the wavelet function of our
decomposition scheme.
Since W_0 is the orthogonal complement of V_0 in V_1 (see Figure 3.6),

ψ(t) ∈ W_0 ⊂ V_1 (3.1)
φ(t) ∈ V_0 ⊂ V_1 (3.2)

we should be able to write φ(t) and ψ(t) in terms of the bases that generate V_1. In other
words, there exist two sequences {g_0[k]}, {g_1[k]} such that

φ(t) = Σ_k g_0[k] φ(2t − k) (3.3)
ψ(t) = Σ_k g_1[k] φ(2t − k) (3.4)
In general, for any j ∈ Z the relationships of V_j and W_j with V_{j+1} are governed by
Figure 3.6: Splitting of MRA subspaces.
φ(2^j t) = Σ_k g_0[k] φ(2^{j+1} t − k) (3.5)
ψ(2^j t) = Σ_k g_1[k] φ(2^{j+1} t − k) (3.6)
Equations (3.5) and (3.6) are referred to as two-scale relations and define the relation between
the scaling functions and the wavelets at a given scale and the scaling function at the next
higher scale. By taking their Fourier transform, we have

Φ(ω) = G_0(z) Φ(ω/2) (3.7)
Ψ(ω) = G_1(z) Φ(ω/2) (3.8)
where

G_0(z) := (1/√2) Σ_k g_0[k] z^{−k} (3.9)
G_1(z) := (1/√2) Σ_k g_1[k] z^{−k} (3.10)
with z = e^{jω/2}. Expansions of (3.5) and (3.6) lead to

Φ(ω) = ∏_{l=1}^{∞} G_0(e^{jω/2^l}) (3.11)
Ψ(ω) = G_1(e^{jω/2}) ∏_{l=2}^{∞} G_0(e^{jω/2^l}) (3.12)
From expressions (3.11) and (3.12) the design of an efficient algorithm for calculating the
wavelet decomposition of a signal is straightforward, based on a filter bank of successive pairs
of highpass (from the wavelet function) and lowpass (from the scaling function) quadrature
mirror filters (QMFs). The scheme for a single decomposition step (for 1-D signals) is depicted
in Figures 3.7 and 3.8.
An orthogonal Mallat-Meyer MRA corresponds to an orthogonal filter bank with the analysis
filters

G_0 = {g_0[k] : k ∈ Z},  G_1 = {g_1[k] : k ∈ Z} (3.13)

where {g_0[k]}_{k∈Z} and {g_1[k]}_{k∈Z} are the two-scale connection coefficients previously stated in
equations (3.5) and (3.6). The synthesis filters G'_0, G'_1 are just the time reversals of the
analysis filters.
The general multiresolution decomposition of a signal x(t) corresponds to the iteration of the
2-channel analysis bank with the coefficients of x as the input data. In Figure 3.9 a general
multilevel wavelet decomposition scheme is depicted.
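A minimal sketch of one step of this scheme is given below, again assuming the PyWavelets package (the wavelet 'db4' is an illustrative choice, not one prescribed by the thesis). pywt.dwt implements the analysis bank of Figure 3.7 and pywt.idwt the synthesis bank of Figure 3.8.

    import numpy as np
    import pywt

    x = np.random.randn(256)          # stand-in for the input signal s_n
    s, d = pywt.dwt(x, 'db4')         # lowpass (G0) and highpass (G1) branches,
                                      # each downsampled by 2: s_{n-1} and d_{n-1}
    x_rec = pywt.idwt(s, d, 'db4')    # synthesis bank reassembles the signal
    print(np.allclose(x, x_rec))      # perfect reconstruction -> True

Iterating pywt.dwt on the lowpass output s reproduces the multilevel scheme of Figure 3.9 (pywt.wavedec does this in one call).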
Figure 3.7: 2-channel analysis filter bank. The signal s_n is filtered by G_0 (lowpass) and G_1
(highpass), and each branch is downsampled by 2, producing s_{n−1} and d_{n−1}.
An orthonormal basis is an efficient and straightforward way to represent a signal. In some
applications, however, the orthonormal basis function may lack certain desirable signal processing
properties, e.g. symmetry, causing inconvenience in processing. Biorthogonal representation
is an alternative to overcome the constraint of orthogonality while producing a good
Figure 3.8: 2-channel synthesis filter bank. The downsampled lowpass and highpass signals s_{n−1}
and d_{n−1} are upsampled by 2, filtered by G'_0 and G'_1, and added to give the reconstructed signal s_n.
Figure 3.9: General multilevel wavelet decomposition scheme. The lowpass/highpass pair (G0, G1)
is iterated on the approximation signal, splitting each s_j into an approximation s_{j−1} and a
detail d_{j−1}.
approximation to a given function. Let {φ_k(t)}_{k∈Z} ⊂ L² be a biorthogonal basis function set. If
there exists another basis function set {φ̃_k(t)}_{k∈Z} ⊂ L² such that

∫ φ_k(t) φ̃_l(t) dt = δ_{l,k}

the set {φ̃_k(t)}_{k∈Z} ⊂ L² is called the dual basis of {φ_k(t)}_{k∈Z} ⊂ L².
In multiresolution analysis the dual wavelet ψ̃ can be used to analyze a function x by computing
its integral wavelet transform at a desired time-scale location, while the original wavelet ψ
can be used to obtain its function representation at any scale. Therefore, we call ψ̃ an analyzing
wavelet, while ψ is called a synthesis wavelet [30]. A biorthogonal Mallat-Meyer MRA can
be constructed using these synthesis and analyzing wavelets and their corresponding scaling
functions.
3.4 Wavelets in Biomedical Applications
During the past few years the wavelet transform has been found to be of great relevance in
biomedical engineering. The main difficulty in dealing with biomedical signals is their extreme
variability and that, very often, one does not know a priori what is pertinent information
and/or at which scale it is located. Another important aspect of biomedical signals is that the
information of interest is often a combination of features that are well localized temporally
or spatially (e.g., spikes and transients in the EEG) and others that are more diffuse (e.g.,
EEG rhythms). This requires the use of analysis methods versatile enough to handle events
that can be at opposite extremes in terms of their time-frequency localization. Thus, the
spectrum of applications of the wavelet transform and its multiresolution analysis has been
extremely large. For a complete review of the main biomedical applications of wavelets we
refer to [82].
3.4.1 Wavelet properties in the context of biomedical applications
The main properties of the wavelet transform have already been described, but wavelets can
be seen from different points of view depending on the application we plan to give them.
The main features of wavelets used in the biomedical field are the following:
- Wavelets as a Filter Bank. As we have seen in section 3.3.2, the wavelet transform
can be viewed as a special kind of spectral analyzer. The simplest global features
that can be extracted from this type of system are energy estimates in the various
frequency bands. Spectral features of this type have been used recently to discriminate
between various physiological states. We should note, however, that this type of global
feature extraction is only justified when the underlying signal can be assumed to be
stationary, and that similar results can also be obtained using more conventional Fourier
techniques. Another common application of the wavelet multiresolution filterbank is
the implementation of noise reduction by selective shrinkage of coefficients from certain
frequency bands.
- Wavelets as a Multiscale Matched Filter. In essence, the CWT performs a correlation
analysis, so we can expect its output to be maximum when the input signal
most resembles the analysis template ψ_{s,τ}. This principle is the basis for the matched
filter, which is the optimum detector of a deterministic signal in the presence of additive
noise. This property has been exploited, for example, in the detection of certain EEG
signal waveforms. However, wavelet bases are not well adapted to represent functions
whose Fourier transforms have a good high frequency support. Hence, wavelet bases
are not often used alone but are usually included as members of a larger ensemble
of bases (dictionary) in the so-called Matching Pursuit procedure [34].
- Wavelets and Time-Frequency Localization. As we have already mentioned, most
biomedical signals of interest include a combination of impulse-like events (spikes and
transients) and more diffuse oscillations (murmurs and EEG waveforms) which may all
convey important information for the clinician. The STFT and other conventional time-frequency
methods perform well in analyzing the latter type of events but are much less
suited for the analysis of short-duration pulses. Additionally, the fixed resolution of
the STFT does not allow searching for both types of events simultaneously with a good
resolution in time and frequency. By contrast, the wavelet transform offers a resolution
compromise that has been shown to be appropriate for the characterization of heart
beat sounds [46], the analysis of ECG signals including the detection of late ventricular
potentials [45], the analysis of EEGs [75], as well as a variety of other physiological
signals. However, we must say that the constant-Q property of neuroelectric waveforms
is only approximate, suggesting that a DWT analysis may not adequately partition
some neuroelectric waveforms into functionally distinct scales. In those special cases a
greater flexibility in defining the frequency bands of a decomposition can be obtained by
using a generalization of wavelets known as wavelet packets [53, 30]. A wavelet packet
decomposition is generated by a filtering scheme similar to that used in a conventional
DWT. The difference between the two is that a wavelet packet decomposition permits
the detail function to be further split into two or more subbands (a short sketch of such
a decomposition follows this list).
- Wavelet Bases. Possibly the most remarkable aspect of the wavelet theory is the
possibility to construct wavelet bases of L². Hence, wavelets provide a one-to-one
representation of the signal in terms of its coefficients (a reversible linear transformation).
Data compression as well as denoising can be achieved by quantization and shrinkage
in the wavelet domain, or by simply discarding the coefficients that are insignificant.
Furthermore, wavelets are the core of modern lossless compression and embedded
coding techniques for biomedical images.
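As announced in the time-frequency localization item above, the following is a brief sketch of a wavelet packet decomposition, again assuming PyWavelets; the wavelet, depth and signal are illustrative choices.

    import numpy as np
    import pywt

    x = np.random.randn(512)
    wp = pywt.WaveletPacket(data=x, wavelet='db4', mode='symmetric', maxlevel=3)
    # Unlike the DWT, detail branches are split further: 2**3 = 8 subbands at level 3.
    leaves = [node.path for node in wp.get_level(3, order='freq')]
    print(leaves)                     # ['aaa', 'aad', ..., 'ddd']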
3.4.2 Electroencephalography applications
Electroencephalographic waveforms such as EEG and event-related potential (ERP) recordings
from multiple electrodes vary their frequency content over their time courses and across
recording sites on the scalp. Accordingly, EEG and ERP data sets are nonstationary in both
time and space. Furthermore, the specific components and events that interest neuroscientists
and clinicians in these data sets tend to be transient (localized in time), prominent over
certain scalp regions (localized in space), and restricted to certain ranges of temporal and
spatial frequencies (localized in scale). Because of these characteristics, wavelets are suited
for the analysis of EEG and ERP signals. Wavelet-based techniques can nowadays be
found in many processing areas of neuroelectric waveforms, such as:
- Noise filtering. After applying the wavelet transform to an EEG or ERP waveform,
precise noise filtering is possible simply by zeroing out or attenuating any wavelet
coefficients associated primarily with noise and then reconstructing the neuroelectric
signal using the inverse wavelet transform (a minimal sketch of this idea follows this
list). In chapter 6 we will apply this technique to filter an ERP dataset with MMN
deflection in order to eliminate any component outside the MMN time-frequency range.
- Preprocessing neuroelectric data for input to neural networks. Several studies
([80], [43]) suggest that wavelet decompositions of neuroelectric waveforms may have
important processing applications in intelligent detection systems for use in clinical and
human performance settings.
- Neuroelectric waveform compression. Wavelet compression techniques have been
shown to improve neuroelectric data compression ratios with little loss of signal
information when compared with classical compression techniques. Furthermore, there
are very efficient algorithms available for the calculation of the wavelet transform that
make it very attractive from the computational requirements point of view.
- Spike and transient detection. As we already know, the wavelet representation has
the property that its time or space resolution improves as the scale of a neuroelectric
event decreases. This variable resolution property makes wavelets ideally suited to
detect the time of occurrence and the location of small-scale transient events such as
focal epileptogenic spikes [43].
- Component and event detection. Wavelet methods, such as wavelet packets,
offer precise control over the frequency selectivity of the decomposition, resulting in
precise component identification, even when the components substantially overlap in
time and frequency. Furthermore, wavelet shapes can be designed to match the shapes
of components embedded in ERPs. Such wavelets are excellent templates to detect and
separate those components and events from the background EEG.
- Time-scale analysis of EEG waveforms. Time-scale and space-scale representations
permit the user to search for functionally significant events at specific scales, or to
observe time and spatial relationships across scales. A sample time-scale representation
of an ERP trial with MMN deflection can be observed in Figure 6.7.
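A minimal sketch of the noise-filtering idea from the first item of this list is shown below: transform, shrink the detail coefficients, and invert. The soft threshold used here (the universal threshold with a median-based noise estimate) is a common choice but an assumption of this sketch, not the procedure of chapter 6.

    import numpy as np
    import pywt

    t = np.arange(0, 1, 1 / 512)
    x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(512)  # noisy test signal
    coeffs = pywt.wavedec(x, 'db4', level=5)     # [approximation, details level 5..1]
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745              # noise level estimate
    thr = sigma * np.sqrt(2 * np.log(len(x)))                   # universal threshold
    coeffs[1:] = [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]
    x_den = pywt.waverec(coeffs, 'db4')          # reconstructed, denoised signal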
Chapter 4
Independent Component Analysis
As we have seen in the previous chapters, wavelets and time-frequency methods can effectively
analyze brain processes only when those processes do not overlap in the time-frequency plane.
However, the EEG components very often overlap in both time and frequency due to their
stochastic nature. To overcome those limitations totally different approaches, based on the
analysis of the statistical properties of the signals, are needed.

Blind source separation (BSS) and Independent Component Analysis (ICA) are emerging
techniques aiming at recovering unobserved signals or sources from observed mixtures,
exploiting only the assumption of mutual independence between the signals. The adjective
blind stresses the fact that i) the source signals are not observed and ii) no information
is available about the mixture. This lack of a priori knowledge about the mixture is
compensated by a statistically strong but often physically plausible assumption of independence
between the source signals. The weakness of the assumptions makes it a powerful approach
for processing biomedical signals like ECG and EEG.

In this chapter we will briefly introduce the blind separation problem and the theory
underlying Independent Component Analysis. Later we will shortly describe three of the most
important ICA algorithms: Infomax, FastICA and JADE. These have been the ICA
algorithms used in the methods proposed in chapter 6 for the automatic detection of MMN
deflection in ERP trials. Finally we describe the most common applications of ICA, especially
in the field of biomedicine and encephalography.
Figure 4.1: Basic BSS model. Unobserved signals s are mixed by A into the observations x; the
separating matrix B produces the estimated source signals y.
4.1 Blind Source Separation problem
The simplest BSS model assumes the existence of n independent signals s_1(t), ..., s_n(t) and
the observation of at least as many mixtures x_1(t), ..., x_m(t), these mixtures being linear
and instantaneous, i.e. x_i(t) = Σ_{j=1}^{n} a_{ij} s_j(t) for each i = 1, ..., m. This can be compactly
represented by the mixing equation

x(t) = A s(t) (4.1)

where s(t) = [s_1(t), ..., s_n(t)]^T collects the source signals, x(t) the m observed signals and
the m × n mixing matrix A contains the mn mixture coefficients. The BSS problem
consists in recovering the source vector s(t) using only the observed data x(t), the assumption
of independence between the entries of the input vector s(t) and possibly some a priori
information about the probability distribution of the inputs. It can be formulated as the
computation of an n × m separating matrix B whose output

y(t) = B x(t) (4.2)

is an estimate of the source signals.
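The following toy example illustrates the model (4.1)-(4.2) with numpy; since this is only an illustration, the separating matrix is taken as the true inverse of A, whereas in real BSS A is unknown and B must be estimated from x alone (the seed and source shapes are arbitrary).

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(1000)
    s = np.vstack([np.sign(np.sin(0.05 * t)),   # s1: square-like deterministic source
                   rng.laplace(size=t.size)])   # s2: super-Gaussian random source
    A = rng.normal(size=(2, 2))                 # mixing matrix (unknown in practice)
    x = A @ s                                   # observations x(t) = A s(t), eq. (4.1)
    B = np.linalg.inv(A)                        # ideal separating matrix
    y = B @ x                                   # y(t) = B x(t), eq. (4.2)
    print(np.allclose(y, s))                    # True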
The basic BSS model can be extended considering, for example, more sensors than sources
and noisy mixtures. These extensions, although of practical importance, will not be
considered since the basic BSS model is enough for understanding the essence of the BSS
problem.
Several principles have been developed in statistics, neural computing, and signal processing
to solve the BSS problem. The classical methods can be divided into two main categories [36]:
- Second-order methods. These methods try to find the separating matrix using
only the information contained in the covariance matrix of the data vector x. The
best-known methods in this category are Principal Component Analysis (PCA) [39]
and Factor Analysis (FA) [33]. Roughly, the purpose of the second-order methods is to
find a faithful representation of the data, in the sense of reconstruction (mean-square)
error, under the classical assumption of Gaussianity of the sources.
- Higher-order methods. In this case, information about the sources' probability distribution
that is not contained in the covariance matrix is also considered. The distribution
of x must not be assumed to be Gaussian since, if that were the case, all the information
about x would be contained in the covariance matrix¹ and using higher-order statistics
would be useless. Examples of this kind of methods are projection pursuit [27] and
redundancy reduction [4].
4.2 Independent Component Analysis theory
Independent Component Analysis is an emerging method for solving the BSS problem
enunciated in equation (4.2). ICA of a random vector x consists of finding a linear transform
y = Bx so that the components y_i are as independent as possible, in the sense of maximizing
some function F(y_1, ..., y_m) that measures independence. In this definition we have assumed
the basic BSS model in the absence of noise. Considering the general noisy case complicates
the estimation problem significantly and thus the majority of ICA research neglects the noise
term in the BSS problem.
To estimate the data model of ICA the common procedure is to formulate an objective
function F and then minimize or maximize it. Thus, an ICA algorithm can be decomposed
into two parts: the objective or contrast function F and the optimization algorithm used to
maximize/minimize F. The properties of an ICA method depend on these two parts. The
statistical properties (e.g. consistency, asymptotic variance, robustness) depend on the choice
of the objective function, whereas the algorithmic properties (e.g. convergence speed, memory
requirements) depend on the optimization algorithm. Typical ICA contrast functions
and optimization algorithms are described in sections 4.3 and 4.4.
4.2.1 Identifiability and ambiguities of the ICA model
ICA exploits primarily spatial diversity, that is, the fact that different sensors receive different
mixtures of the sources. Thus, the ICA approach for source separation looks for structure
across the sensors, not across time. A consequence of ignoring the time structure of the
observed signals is that the information contained in the data is thoroughly represented by
the sample distribution of the observed vector x. Then, BSS becomes the problem
of identifying the probability distribution of the observations x = As given a sample
distribution for the sources. Thus, the ICA statistical model has two components: the mixing
matrix A and the probability distribution of the source vector s. For the identifiability of
the noise-free ICA model to be possible some restrictions must be considered:

¹ This is true only for zero-mean Gaussian variables. We assume that the variable x has been centered, i.e.
it has been transformed by x = x_0 − E{x_0}, where x_0 is the original non-centered variable.
1. Mutual independence of the sources. If each source i = 1, ..., n is assumed to have a
probability density function (pdf) denoted q_i(·), the independence assumption has a
simple mathematical expression: the joint pdf q(s) of the source vector s is

q(s) = q_1(s_1) ··· q_n(s_n) = ∏_{i=1}^{n} q_i(s_i) (4.3)

i.e. it is the product of the (marginal) densities of all the sources.
2. All the independent components s_i, with the possible exception of one component, must
be non-Gaussian. For Gaussian random variables mere uncorrelatedness implies independence,
and thus any decorrelating representation would give independent components.
Nevertheless, if more than one of the components s_i is Gaussian, it is still
possible to identify the non-Gaussian independent components, as well as the corresponding
columns of the mixing matrix.
3. The number of observed linear mixtures m must be at least as large as the number of
independent components n, i.e. m ≥ n. This restriction is not completely necessary
and can be overcome by using ICA with overcomplete bases [8]. However, we will focus
on the classical m ≥ n case.

4. The matrix A must be of full column rank, i.e. its columns are linearly independent so
that it is invertible.
If x and s are interpreted as stochastic processes, additional restrictions arise. At least the
processes must be stationary in the strict sense. Some other restrictions of ergodicity are also
required. If the process is i.i.d. over time those requirements are fulfilled and we can consider
the stochastic process as a random variable.
We can easily see that there is some indeterminacy in the ICA model:
1. It is not possible to determine the variances (energies) of the independent components.
The reason is that the effect of any constant multiplying an independent component
could be canceled by dividing the corresponding column of the matrix A by the same
constant.
2. Due to the previous indeterminacy it is not possible to order the independent components.
However, we can use the norms of the columns of the mixing matrix, which give
the contributions of the independent components to the variances of the observations,
to order the s_i according to descending norm of the corresponding columns of A.
4.3 Objective functions for ICA
We can differentiate between two types of objective functions depending on how the independent
components are estimated. Multi-unit contrast functions estimate all the ICs
at the same time, while one-unit contrast functions enable the estimation of single independent
components (a procedure that can be iterated to find several components). Amongst the
latter we can find negentropy and higher-order cumulants such as the kurtosis. Amongst the
former, some examples are the likelihood and infomax, mutual information, and higher-order
cumulants.
4.3.1 Likelihood and infomax
The maximum likelihood (ML) principle leads to several contrasts which are expressed via the
Kullback-Leibler divergence, defined for two probability distributions f(s) and g(s) as

K(f|g) ≜ ∫ f(s) log( f(s) / g(s) ) ds (4.4)

which can be understood as a statistical way of quantifying the closeness of two distributions.
The log likelihood contrast function in the noise-free ICA model can be formulated as

φ_ML[y] = (1/T) Σ_{t=1}^{T} log p(x(t) | A, q) (4.5)

where x(t) is the observation vector at realization t, A is the mixing matrix and q is
the distribution of the source vector s. Simple calculus shows that [11]

φ_ML[y] → −K[y|s] + constant as T → ∞ (4.6)

and thus the ML principle has an associated contrast function φ_ML = K[y|s], i.e. ML tries
to find a matrix A such that the distribution of A^{−1}x is as close as possible (in the Kullback
divergence sense) to the hypothesized distribution of the sources.
The infomax principle suggests a contrast function that maximizes the entropy of the independent
components [6]:

φ_IM[y] ≜ H[g(y)] (4.7)

where H[·] denotes the Shannon entropy². In fact, it can be shown that φ_IM[y] = φ_ML[y],
and infomax is equivalent to the maximum likelihood criterion [10].
4.3.2 Mutual information
The simple likelihood approach described above is based on a fixed hypothesis about the
distribution of the sources. The ML results are expected to be good only if the hypothesized
source distributions do not differ too much from the true ones. To overcome that problem
we should minimize the divergence K[y|s] not only with respect to A (via the distribution
of y = A^{−1}x) but also with respect to the distribution of s. If we denote by ỹ a random vector
with independent entries, each entry distributed as the corresponding entry of y, then

K[y|s] = K[y|ỹ] + K[ỹ|s] (4.8)

for any vector s with independent entries [11]. The minimization task can then be
accomplished by minimizing the terms on the right-hand side of equation (4.8). The first term is
independent of s, so the minimization in s amounts to minimizing K[ỹ|s], which is
done by simply setting s = ỹ, for which K[ỹ|s] = 0. Hence min_s K[y|s] = K[y|ỹ], having finally
that min_{(s,y)} K[y|s] = min_y K[y|ỹ], i.e. we must minimize the contrast function

φ_MI[y] ≜ K[y|ỹ] (4.9)
The Kullback divergence K[y|ỹ] between a distribution and the closest distribution with
independent entries is traditionally called the mutual information and can also be expressed
as [35]:

MI(y_1, ..., y_n) = Σ_{i=1}^{n} H[y_i] − H[y] (4.10)

where H[·] is the Shannon entropy². It is easy to see that the mutual information satisfies
φ_MI[y] ≥ 0, with equality if and only if y is distributed as ỹ, i.e. if the entries of y are
independent. Thus, the mutual information can be understood as a quantitative measure of
independence associated with the maximum likelihood principle. The main problem with mutual
information is that it is difficult to estimate because it is based on entropy, which requires
estimating the density functions of the observations y_i.
² For a random vector u with density f(u), the Shannon entropy is defined as H[u] = −∫ f(u) log f(u) du,
with the convention 0 log 0 = 0.
4.3.3 Negentropy
Negentropy is a one-unit contrast defined as [35]

J(y) = H(y_gauss) − H(y) (4.11)

where y_gauss is a Gaussian random vector with the same covariance matrix as the observations
y. By definition we have that J(y) ≥ 0, with J(y) = 0 if and only if y has a Gaussian
distribution. It can be shown that if the mixtures y_i are uncorrelated, the mutual information
can be expressed as [35]:

MI(y_1, ..., y_n) = J(y) − Σ_{i=1}^{n} J(y_i) (4.12)
We can see that finding maximum negentropy directions, i.e. directions where the elements
of the sum Σ J(y_i) are maximized, is equivalent to finding a representation in which
the mutual information is minimized. Unfortunately, the reservations made with respect to
mutual information are also valid here: the estimation of the negentropy is difficult and thus
it is not a very practical contrast function.
4.3.4 High order approximations
The main drawback of contrast functions derived from the ML approach is that they require
the estimation of probability distributions. A possible solution to this problem is the use of
higher-order statistics to define contrast functions which are simple approximations of those
derived from the ML criterion. The easiest way to express higher-order information is by using
cumulants. For zero-mean random observations y_i, y_j, y_k, y_l, the 2nd-order cumulant can be
expressed as [11]

C_ij[y] = E[y_i y_j] (4.13)

and the 4th-order cumulant as

C_ijkl[y] = E[y_i y_j y_k y_l] − E[y_i y_j] E[y_k y_l] − E[y_i y_k] E[y_j y_l] − E[y_i y_l] E[y_j y_k] (4.14)
An approximate measure of mismatch between the output distribution and the source
distribution can be defined from the quadratic mismatch of the cumulants:

φ_2[y] ≜ Σ_{ij} (C_ij[y] − C_ij[s])²

φ_4[y] ≜ Σ_{ijkl} (C_ijkl[y] − C_ijkl[s])²
Using φ_2 and φ_4, if s and y are symmetrically distributed with distributions close enough
to normal, then we can approximate the Kullback divergence by

K[y|s] ≈ φ_{2,4}[y] ≜ (1/48) (12 φ_2[y] + φ_4[y]) (4.15)
The kurtosis is also used in some ICA algorithms as a measure of non-Gaussianity of the IC
estimates y_i. It can be defined using cumulants as

k_i ≜ C_iiii = E[y_i^4] − 3 (E[y_i^2])² (4.16)
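As a quick numerical check of definition (4.16), the sketch below estimates the kurtosis of a Gaussian sample with numpy (the sample size and seed are arbitrary):

    import numpy as np

    y = np.random.default_rng(3).normal(size=100_000)   # zero-mean Gaussian data
    k = np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2      # kurtosis, eq. (4.16)
    print(k)   # close to 0 for Gaussian data; positive for super-Gaussian sources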
Also derived from cumulants is the contrast function of the JADE algorithm, which is based
on a subset of cross-cumulants:

φ_JADE[y] ≜ Σ_{ijkl≠ijkk} C_ijkl[y]² (4.17)

A survey of the most important higher-order contrasts for Independent Component Analysis
can be found in [9].
4.4 ICA algorithms
After choosing an appropriate contrast function one needs a practical method or algorithm
for its implementation. In this section, we are going to briefly explain three well-known
algorithms: Infomax, FastICA and JADE.
There are some preprocessing steps that are common to most of the ICA algorithms:
- Centering. The mean of the data is subtracted from the actual data to make it zero-mean,
i.e. x_c = x − E[x]. After the estimation of the mixing matrix A, the mean is
added back to the data.
- Whitening or sphering. A linear transform is applied to the data x so that the covariance
matrix of the transformed data x_w equals unity: E[x_w x_w^T] = I. This transformation is
always possible, for example by using the eigenvalue decomposition of the covariance
matrix E[xx^T] = EDE^T to transform the observed data according to

x_w = E D^{−1/2} E^T x (4.18)

where E is the orthogonal matrix of eigenvectors of the covariance matrix of the data,
D is the diagonal matrix of associated eigenvalues D = diag(d_1, ..., d_m), and D^{−1/2} =
diag(d_1^{−1/2}, ..., d_m^{−1/2}) (a small numerical sketch is given after this list).
- Dimensionality reduction. When sphering, we can at the same time reduce the dimensionality
of the data by discarding those eigenvalues of the covariance matrix that are
too small, as is done in Principal Component Analysis (PCA). Reducing the dimensions
of the data can help in suppressing noise and preventing overlearning of the ICA
algorithm.
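A small numpy sketch of the whitening transform (4.18) follows, as announced in the list above; the data are synthetic, and the commented line indicates where the PCA-style dimensionality reduction of the last item would take place.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=(3, 5000))               # toy observations (3 channels)
    x = x - x.mean(axis=1, keepdims=True)        # centering
    C = np.cov(x)                                # covariance matrix E[x x^T]
    d, E = np.linalg.eigh(C)                     # eigendecomposition C = E D E^T
    # d, E = d[-k:], E[:, -k:]                   # optional: keep the k largest eigenvalues
    x_w = E @ np.diag(d ** -0.5) @ E.T @ x       # x_w = E D^(-1/2) E^T x, eq. (4.18)
    print(np.allclose(np.cov(x_w), np.eye(3), atol=0.1))   # covariance ~ identity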
4.4.1 Infomax
One of the first developed algorithms for ICA is the so-called Infomax algorithm, based on
the maximization of the network entropy, which is, under some conditions, equivalent to the
maximum likelihood approach. Usually these algorithms are based on gradient ascent of the
objective function. The original Infomax algorithm of Bell and Sejnowski uses a stochastic
gradient that yields the following update formula for the separating matrix:

ΔB ∝ [B^T]^{−1} − 2 tanh(Bx) x^T (4.19)

This rule works for the estimation of most super-Gaussian independent components, but for
sub-Gaussian components other functions must be used. The main drawback of the stochastic
gradient methods is their very slow convergence.
To improve the convergence speed and simplify the algorithm, the natural (or relative) gradient
method introduced by Amari [3] can be used. This yields an algorithm of the form

ΔB ∝ (I − 2 tanh(Bx)(Bx)^T) B (4.20)

After this modification, the algorithm does not need sphering.
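A minimal sketch of the natural-gradient update (4.20) is given below for centered data and super-Gaussian sources; the function name infomax, the learning rate and the iteration count are illustrative choices, not values from the thesis.

    import numpy as np

    def infomax(x, lr=0.01, n_iter=200):
        # x: (n, T) centered observations; returns the separating matrix B.
        n, T = x.shape
        B = np.eye(n)                                # initial separating matrix
        for _ in range(n_iter):
            y = B @ x
            # sample-averaged natural-gradient step: (I - 2 tanh(y) y^T / T) B
            B += lr * (np.eye(n) - 2 * np.tanh(y) @ y.T / T) @ B
        return B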
4.4.2 FastICA
Adaptive algorithms like Infomax may be problematic when used in an environment where
no adaptation is needed. The convergence of those algorithms is usually slow and depends
crucially on the choice of the learning rate at each step of the ICA learning process. As a
remedy for this problem, one can use batch (block) algorithms based on fixed-point iteration.
The FastICA algorithm is one of those fixed-point algorithms. It was originally introduced
using kurtosis and was later generalized [35] to arbitrary contrast functions. For sphered data,
the one-unit FastICA algorithm has the following form:
a(k) = E[x g(a(k−1)^T x)] − E[g'(a(k−1)^T x)] a(k−1) (4.21)

where the weight vector a is normalized to unit norm after every iteration, and the function
g is the derivative of the function G used in the general contrast function given by

J_G(y) = |E[G(y)] − E_ν[G(ν)]|^p (4.22)

where ν is a standardized Gaussian variable, y is assumed to be normalized to unit variance,
and the exponent is typically p = 1, 2.
The convergence speed of FastICA and related fixed-point ICA algorithms is clearly superior
to that of adaptive algorithms such as Infomax. Speed-up factors of 10 to 100 are usually
observed [35]. Another advantage is that FastICA can estimate without problems both sub-Gaussian
and super-Gaussian independent components. Moreover, it is a general algorithm
that can be used to optimize both one-unit and multi-unit contrast functions.
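The sketch below implements the one-unit iteration (4.21) for the kurtosis-based contrast, i.e. g(y) = y^3 and g'(y) = 3y^2 (this particular choice of g, the helper name and the seed are assumptions for illustration). The data are assumed centered and whitened, as in the preprocessing steps above.

    import numpy as np

    def fastica_one_unit(x_w, n_iter=100, tol=1e-8):
        # x_w: (n, T) centered and whitened data; returns a unit-norm weight vector.
        rng = np.random.default_rng(2)
        a = rng.normal(size=x_w.shape[0])
        a /= np.linalg.norm(a)
        for _ in range(n_iter):
            ax = a @ x_w                             # projections a^T x
            # fixed-point step (4.21) with g(y) = y^3, g'(y) = 3 y^2
            a_new = (x_w * ax ** 3).mean(axis=1) - 3 * (ax ** 2).mean() * a
            a_new /= np.linalg.norm(a_new)           # renormalize to unit norm
            if abs(abs(a_new @ a) - 1.0) < tol:      # converged (up to sign)
                return a_new
            a = a_new
        return a

Further components can be obtained by repeating the iteration while orthogonalizing against the vectors already found.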
4.4.3 JADE
JADE [12] is based on the Jacobi optimization [9] of an orthonormal contrast function, as
opposed to optimization by gradient-like algorithms. JADE is also a statistic-based algorithm
and can be summarized in the following steps [9]:
1. Initialization. Estimate a whitening matrix Ŵ and set Z = ŴX.

2. Form statistics. Estimate a maximal set {Q̂_i^Z} of cumulant matrices. Given a random
n × 1 vector X and any n × n matrix M, we define the associated cumulant matrix
Q_X(M) as the n × n matrix defined component-wise by

[Q_X(M)]_ij ≜ Σ_{k,l=1}^{n} C_ijkl M_kl (4.23)
3. Optimize an orthogonal contrast. Find the rotation matrix V̂ such that the cumulant
matrices are as diagonal as possible, that is, solve

V̂ = arg min_V Σ_i O(V^T Q̂_i^Z V)

where O(F) denotes the sum of the squares of the non-diagonal elements of a matrix F,
i.e. O(F) ≜ Σ_{i≠j} (f_ij)². It is at this step that the Jacobi technique is used.
4. Separate. Estimate A as Â = Ŵ^{−1} V̂ and/or estimate the components as
Ŝ = Â^{−1} X = V̂^T Z.
JADE has been shown to perform very efficiently in small dimensions. However, in large
dimensions the memory requirements may become prohibitive, because the cumulant matrices must
be stored in memory, which requires O(m^4) units of memory. Another disadvantage is that
JADE, like all Jacobi algorithms, tends to be quite complicated to program, requiring
sophisticated matrix manipulations.
4.5 Application of ICA in encephalography
ICA can be used for the analysis of encephalographic signals like the EEG and ERP only if
certain conditions are satisfied, at least approximately:

- Statistical independence of the brain sources involved in the generation of the EEG
signal. This independence criterion considers solely the statistical relations between
the amplitude distributions of the signals involved, and not the morphology or the
physiology of neural structures.
- Instantaneous mixing at the electrodes. Because most of the energy in EEG signals lies
below 1 kHz, the so-called quasistatic approximation of the Maxwell equations holds,
and each time instant can be considered separately. Therefore, the propagation of the
signals is immediate, there is no need for introducing time delays, and the instantaneous
mixing assumption is valid.
- Linear mixing. Because the volume conduction through the cerebrospinal fluid, skull,
and scalp is thought to be linear, the EEG and the ERP are assumed to be a linear
mixture of the potentials associated with the synchronous activation of neuropil in each
stimulated area [51].
- Stationarity of the mixing and the independent components. Stationarity is generally
assumed in the analysis of the EEG and related signals.
The first application of ICA in the encephalography field is the artifact correction of EEG or
ERP datasets. It turns out that the artifacts are quite independent from the potentials derived
from the brain's neuroelectric activity and therefore they can be separated using independent
component analysis. The great advantage of ICA over classical artifact correction methods
is that ICA does not need an accurate model of the process that generated the artifacts.
Furthermore, ICA has been shown to be superior to PCA, for example in the correction of eye
artifacts [40], especially when the artifact amplitudes are comparable to the amplitude of the
EEG signal. However, ICA has the drawback that visual inspection is needed for selecting
which independent components correspond to artifacts.
Another common application of ICA is the separation of ERP components. Several studies
(e.g. [51]) have shown that ICA is able to obtain a blind decomposition of the ERP without
imposing any a priori structure on the measurements. Those studies concluded that ICA can
successfully detect ERP components in a single-trial paradigm, which is very difficult to achieve
using traditional methods. Thus, ICA does not require averaging a large number of ERP
trials to obtain results, allowing the study of the brain dynamics arising from intermittent
changes in the subject's state and/or from complex interactions between task events.
Other applications of ICA, not related to ERP, can be found in the theory of feature extraction
and redundancy reduction, denoising in natural image processing, processing of communi-
cation signals, monitoring, or as an alternative to techniques such as principal component
analysis (PCA) and factor analysis (FA).
Chapter 5
Artificial Neural Networks
We can find in the literature many definitions of what a neural network is. According to the
DARPA Neural Network Study [50], a neural network is "a system composed of many simple
processing elements operating in parallel whose function is determined by network structure,
connection strengths, and the processing performed at computing elements or nodes".
Inspired by biological neural networks, ANNs are massively parallel computing systems
consisting of a number of simple processors with many interconnections. ANN models attempt
to use some organizational principles believed to be used in the human brain. Neural
networks have been trained to perform complex functions in various fields of application,
including pattern recognition, identification, classification, speech, vision and control systems.
In this chapter we will first explain the basic theory of neural networks and their architectures
and learning algorithms. After that, we will focus on the feed-forward backpropagation
multilayer networks since they are the most commonly used in practical applications.
They are also the type of networks used in this Thesis for classifying ERP trials with MMN
deflection.
5.1 Neuron model
McCulloch and Pitts [54] proposed a binary threshold unit as a computational model for
an artificial neuron. This mathematical neuron computes a weighted sum of its R input
signals x_i, i = 1, 2, ..., R, and generates an output of 1 if this sum is above a certain threshold
u. Otherwise, an output of 0 results. Mathematically,

a = sign( Σ_{i=1}^{R} w_i x_i − u )
Figure 5.1: Neuron model. The R inputs x_1, ..., x_R are weighted by w_1, ..., w_R, combined with
the bias b, and passed through the activation function f to produce the output a = f(y).
Figure 5.2: Different types of activation functions [78]: (a) threshold, (b) piece-wise linear, (c) sigmoid.
where w_i is the weight associated with the ith input. For simplicity of notation, we often
consider the threshold u as another weight (usually called the bias) w_0 = b = u. This model
can be generalized to any kind of activation function f as

a = f( Σ_{i=1}^{R} w_i x_i − b )

Such a neuron model is depicted in Figure 5.1.
The sigmoid activation function is by far the most frequently used in ANNs. It is a strictly
increasing function that exhibits smoothness and has the desired asymptotic properties. The
standard sigmoid function is defined by

g(y) = 1 / (1 + e^{−λy})

where λ is the slope parameter. Other common activation functions are the sign function,
the linear or piece-wise linear functions and the Gaussian function (see Figure 5.2).
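A direct transcription of this neuron into numpy is sketched below; the weights, inputs and slope parameter are arbitrary illustrative values, and the helper names are ours.

    import numpy as np

    def sigmoid(y, lam=1.0):
        return 1.0 / (1.0 + np.exp(-lam * y))    # standard sigmoid with slope lam

    def neuron(x, w, b):
        # weighted sum of the R inputs minus the bias, through the activation f
        return sigmoid(w @ x - b)

    x = np.array([0.5, -1.2, 0.3])               # R = 3 inputs
    w = np.array([0.4, 0.1, -0.7])               # synaptic weights
    print(neuron(x, w, b=0.2))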
But the McCulloch-Pitts neuron did not have a mechanism for learning and, for the neuron
model to be useful, it must be able to learn from its input as happens in real biological
networks. In section 5.3 we give an overview of the most common learning algorithms that
can be used to train the neurons of a neural network.
5.2 Network Architectures
NNs can be viewed as weighted directed graphs in which artificial neurons are nodes and
directed edges (with weights) are connections between neuron outputs and neuron inputs.
Based on the connection pattern (architecture), neural networks can be grouped into two
categories:

- Feed-forward networks, in which the graphs have no loops.
- Recurrent (or feedback) networks, in which loops occur because of the feedback connections.
In the most common family of feed-forward networks, called the multilayer perceptron, neurons
are organized into layers that have unidirectional connections between them.

Different connectivities yield different network behaviors. Generally speaking, feed-forward
networks are static, i.e. they produce only one set of output values rather than a sequence
of values for a given input. Feed-forward networks are memoryless in the sense that their
response to an input is independent of the previous network state. Recurrent, or feedback,
networks, on the other hand, are dynamic systems. When a new input pattern is presented,
the neuron outputs are computed. Because of the feedback paths, the inputs to each neuron
are then modified, which leads the network to enter a new state. Different network architectures
require appropriate learning algorithms. The next section provides an overview of
learning processes.
5.3 Learning rules and algorithms
Although a precise definition of learning is difficult to formulate, a learning process in
the ANN context can be viewed as the problem of updating the network architecture and
connection weights so that the network can efficiently perform a specific task. The network
usually must learn the connection weights from available training patterns. Performance
is improved over time by iteratively updating the weights in the network, i.e. w_i^new =
w_i^old + Δw_i. The neural networks' ability to automatically learn from examples makes them
attractive for complex processing tasks. Instead of following a set of rules specified by human
experts, NNs learn underlying rules (like input-output relationships) from the given collection
of representative examples. This is one of the major advantages of neural networks over
traditional expert systems.
To understand or design a learning process, one must first have a model of the environment
in which a neural network operates, that is, one must know what information is available to
the network. We refer to this model as a learning paradigm. Second, one must understand
how network weights are updated, that is, which learning rules govern the updating process.
A learning algorithm refers to a procedure in which learning rules are used for adjusting the
weights.
There are three main learning paradigms:

- Supervised learning. The network is provided with a correct output for every input
pattern. Weights are determined to produce outputs as close as possible to the known
correct outputs.
- Unsupervised learning. This paradigm does not require labeled training data. It explores
the underlying structure of the data, or correlations between patterns in the data, and
organizes patterns into categories from these correlations.
- Hybrid learning. Combines supervised and unsupervised learning. Part of the weights
are usually determined through supervised learning, while the others are obtained
through unsupervised learning.
We can differentiate four basic types of learning rules:
- Error-correction rules. During the learning process, the actual output a generated by
the network may not equal the desired output d. The basic principle of error-correction
learning rules is to use the error signal (d − a) to modify the connection weights so as to
gradually reduce this error.

The perceptron learning rule is based on this error-correction principle. A perceptron
consists of a single neuron, as depicted in Figure 5.1, using a sign (or threshold)
activation and a learning algorithm based on the steps described in Table 5.1, where
w = (w_1, ..., w_R)^T is the weight vector, b = (b_0, ..., b_R)^T is the bias vector, k is the
iteration index, η(k) is the learning rate at iteration k and A_k is the set of samples misclassified
by w(k).

The perceptron algorithm converges to a proper solution if and only if the classes are
linearly separable. Otherwise some of the samples will always be misclassified.
1. begin: initialize w, b, η(·), criterion θ, k ← 0
2. do: k ← k + 1
3.   w ← w + η(k) Σ_{a∈A_k} a
4. until: |η(k) Σ_{a∈A_k} a| < θ
5. return: w
6. end

Table 5.1: Batch perceptron algorithm [22].
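A runnable version of Table 5.1 is sketched below; following a common convention (an assumption of this sketch), each sample is augmented with a constant 1 so that the bias is carried inside the weight vector, and samples are pre-multiplied by their labels so that correct classification means a positive inner product.

    import numpy as np

    def batch_perceptron(X, labels, eta=1.0, theta=1e-6, max_iter=1000):
        # X: (N, R) samples; labels in {-1, +1}. Returns augmented weights.
        A = np.hstack([np.ones((X.shape[0], 1)), X]) * labels[:, None]
        w = np.zeros(A.shape[1])
        for _ in range(max_iter):
            miscl = A[A @ w <= 0]                # the set A_k of misclassified samples
            if miscl.shape[0] == 0:
                break                            # all samples correctly classified
            step = eta * miscl.sum(axis=0)       # w <- w + eta(k) * sum over A_k
            w += step
            if np.linalg.norm(step) < theta:     # stopping criterion of Table 5.1
                break
        return w

As the text notes, this loop terminates with a separating solution only when the classes are linearly separable.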
- Boltzmann learning. Boltzmann machines are symmetric recurrent networks consisting
of binary units. By symmetric, we mean that the weight on the connection from unit i
to unit j is equal to the weight on the connection from unit j to unit i, i.e. w_ij = w_ji.
The subset of neurons that interact with the environment are called visible; the rest are
called hidden neurons. The objective of Boltzmann learning is to adjust the connection
weights so that the states of the visible units satisfy a particular desired probability
distribution. According to the Boltzmann learning rule, the change in the connection
weight w_ij is given by

Δw_ij = η (ρ̄_ij − ρ_ij)

where η is the learning rate, and ρ̄_ij and ρ_ij are the correlations between the states
of units i and j when the network operates respectively in the clamped mode (visible
neurons are clamped onto specific states determined by the environment) and in the
free-running mode (both visible and hidden neurons are allowed to operate freely).
- Hebbian learning. Mathematically, the Hebbian rule can be described as

w_ij(k + 1) = w_ij(k) + η a_i(k) a_j(k)

where a_i and a_j are, respectively, the outputs of neurons i and j, which are connected
by the synapse w_ij, and η is the learning rate.
- Competitive learning rules. Without going into details, we will say that in this kind
of rule output units compete among themselves for activation. As a result, only one
output unit is active at any given time. This phenomenon is known as winner-take-all.
Competitive learning often clusters or categorizes the input data. Similar patterns
are grouped by the network and represented by a single unit. This grouping is done
automatically based on data correlations.
Paradigm | Learning rule | Architecture | Learning algorithm | Task
Supervised | Error-correction | Single- or multilayer perceptron | Perceptron, back-propagation, Adaline/Madaline | Pattern classification, function approximation, prediction, control
Supervised | Boltzmann | Recurrent | Boltzmann | Pattern classification
Supervised | Hebbian | Multilayer feed-forward | Linear discriminant analysis | Data analysis, pattern classification
Unsupervised | Error-correction | Multilayer feed-forward | Sammon's projection | Data analysis
Unsupervised | Hebbian | Feed-forward or competitive | Principal Component Analysis (PCA) | Data analysis, data compression
Unsupervised | Competitive | Competitive | Vector quantization | Categorization, data compression
Unsupervised | Competitive | Kohonen's SOM | Kohonen's SOM | Categorization, data analysis
Hybrid | Error-correction and competitive | RBF network | RBF learning algorithm | Pattern classification, function approximation

Table 5.2: Well-known learning algorithms [37].
5.4 Feed-Forward multilayer networks
The most popular class of multilayer feed-forward networks is the multilayer perceptron, in which
each computational unit employs an arbitrary activation function (typically the thresholding
or the sigmoid function), with the only limitation of being a smooth, bounded and monotonically
increasing function. Multilayer perceptrons, unlike single perceptrons, can form arbitrarily
complex decision boundaries and represent any Boolean function. The development of the
backpropagation learning algorithm for determining the weights of a multilayer perceptron has
made these networks the most popular among researchers and users of neural networks. A
sample feed-forward network for r-dimensional patterns is shown in Figure 5.3.
In any feed-forward network there are some parameters that have to be fixed. Firstly we
have to decide the network size, i.e. the number of layers and the number of neurons in each
Figure 5.3: Feed-forward network. Input patterns are r-dimensional, the input layer has n units,
the first hidden layer l units, the second hidden layer k units, and the output layer has m neurons.
layer. The most important techniques for choosing the network size can be divided into two
groups:
- Network growing techniques. In general this class of algorithms starts with a small
network and adds units or connections until an adequate performance level is obtained.
Algorithms such as the cascade correlation algorithm [24] help to find the optimal
number of hidden units to use in a network. Traditional feature selection algorithms
such as the Sequential Floating Forward Selection [68] and the Sequential Forward
Selection [68] algorithms can be seen as network growing algorithms, as they selectively
add feature inputs to the network based on a defined criterion function.
- Network pruning techniques. This class of algorithms starts with a fully trained large
network and then attempts to remove some of the redundant weights and/or units.
Hopefully, this is done in such a way that the error of the network is not significantly
increased and the generalisation improves.
Apart from these algorithms there are several guidelines that can help us in choosing the
network size. For fully connected multilayer perceptron networks no more than three layers
are typically used, and in many cases only two. Numerous bounds exist on the number of
neurons in the hidden layers. It has been shown that the upper bound on the number of
hidden neurons for classifying the training data correctly is on the order of the number of
training samples. Actually, the number of hidden neurons should be much less than the
number of training samples to avoid the memorization of the training samples resulting in
very poor generalization of the network.
5.5 Backpropagation
For training a neural network we must first define the objective or criterion function to be
minimized by the network. The most common choice is the mean square error, defined for
an R-dimensional pattern vector and a two-layer network as

J(w) ≜ (1/2) Σ_{k=1}^{R} (t_k − a_k)² = (1/2) ||t − a||² (5.1)
where t and a are the target and the network output vectors.
The backpropagation learning rule is based on gradient descent. The weights are initialized
with random values, and then they are changed in a direction that will reduce the error (the
negative of the gradient):

Δw = −η ∂J/∂w (5.2)

where η is the learning rate. The iterative algorithm requires taking a weight vector at
iteration k and updating it as

w(k + 1) = w(k) + Δw(k) (5.3)
When dealing with multiple-layer networks we must take into account the relations between
the different layers when calculating Δw(k). For a network of L layers, the output of layer i is

y_i = f(W_i^T y_{i−1} + b_i) = f(net_i), i = 1, ..., L (5.4)

where W_i is the weight matrix and b_i is the bias vector of layer i. For the first layer (i = 1),
y_{i−1} corresponds to the input data x, and for the last layer (i = L), y_i corresponds to the
network output a.
The backpropagation algorithm for a network of L layers is presented in table 5.3.
The selection of an appropriate learning rate is very important, as for any kind of gradient
search algorithm. If the learning rate is large, the learning process will be fast but it may lead to
instability. Choosing a very small learning rate results in very slow convergence.

There are many extensions and variations of the basic backpropagation algorithm. Some of
the best known variants are the following:
1. Initialize weights, biases and other network parameters.
2. Propagate activity forward through every layer 1, ..., L according to equation (5.4).
3. Calculate the error in the output layer according to equation (5.1).
4. Backpropagate the error and update weights and biases:
   for l = L−1, ..., 1
     e_l = (W_{l+1}^T e_{l+1}) f'(net_l)
     W(k + 1) = W(k) + ΔW = W(k) + η e_l y_{l−1}^T
     b(k + 1) = b(k) + Δb = b(k) + η e_l
   end

Table 5.3: Standard backpropagation algorithm [5].
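The following is a compact numpy sketch of Table 5.3 for a two-layer network (sigmoid hidden layer, linear output) trained on the squared error (5.1); the helper names, shapes and learning rate are illustrative assumptions.

    import numpy as np

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))

    def backprop_step(x, t, W1, b1, W2, b2, eta=0.05):
        # Steps 1-2: propagate activity forward, equation (5.4).
        net1 = W1 @ x + b1
        y1 = sigmoid(net1)
        a = W2 @ y1 + b2                         # network output
        # Step 3: error in the output layer, from equation (5.1).
        e2 = t - a
        # Step 4: backpropagate the error and update weights and biases.
        e1 = (W2.T @ e2) * y1 * (1 - y1)         # f'(net) for the sigmoid
        W2 += eta * np.outer(e2, y1); b2 += eta * e2
        W1 += eta * np.outer(e1, x);  b1 += eta * e1
        return W1, b1, W2, b2

Calling backprop_step repeatedly over the training patterns implements the standard (online) backpropagation loop.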
- Backpropagation using gradient descent with momentum. This variant provides faster
convergence by making the network respond not only to the local gradient, but also
to recent trends in the error surface. Acting like a low-pass filter, momentum allows
the network to ignore small features in the error surface.
- Variable learning rate. In standard steepest descent, the performance of the algorithm
is very sensitive to the proper choice of the learning rate. The performance of the
algorithm can be improved by using an adaptive learning rate that attempts to
keep the learning step size as large as possible while keeping learning stable.
- Resilient backpropagation. The sigmoid transfer functions in the hidden layers of a network
compress an infinite input range into a finite output range. This has the undesired
consequence of making the gradient take very small values, therefore
causing small changes in the weights and biases even when the weights and biases are
far from their optimal values. Resilient backpropagation eliminates this harmful effect
by considering only the sign of the gradient when determining the
direction of the weight update and setting the magnitude of the update to a factor
independent of the magnitude of the gradient.
- Conjugate gradient algorithms. The negative of the gradient is the direction in which
the performance function decreases most rapidly, but this does not necessarily produce
the fastest convergence. In the conjugate gradient algorithms a search is performed
along conjugate directions, which generally produces faster convergence than steepest
descent directions.
- Quasi-Newton algorithms. These algorithms are based on Newton's method but use
approximations of the Hessian matrix that do not require the calculation of second-order
derivatives. They require more computation and storage capacity than traditional
algorithms but generally converge in fewer iterations.
- Levenberg-Marquardt algorithm. Like the quasi-Newton methods, this algorithm was
designed to approach second-order training speed without requiring the
calculation of the actual Hessian matrix. When the performance function has the form
of a sum of squares (as is typical in training feed-forward networks) the Hessian matrix
can be approximated by

H = J^T J (5.5)

and the gradient can be computed as

g = J^T e (5.6)

where J is the Jacobian matrix, which contains the first derivatives of the network
errors with respect to the weights and biases, and e is a vector of network errors. The
calculation of the Jacobian matrix is computationally much less expensive than the
calculation of the Hessian matrix, and thus the Levenberg-Marquardt algorithm uses
this approximation in the following update formula [78]:

w(k + 1) = w(k) − [J^T J + μI]^{−1} J^T e (5.7)

When μ = 0 this formula is just Newton's method using the approximate
Hessian matrix. When μ is large, the algorithm is equivalent to gradient descent with a
small step size. Due to the better performance of Newton's method, μ is decreased
after each successful step (reduction of the performance function) and increased only when
a tentative step would increase the performance function. This algorithm
appears to be the fastest method for training moderate-sized feed-forward networks.
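A single update (5.7) can be sketched in a few lines of numpy; J and e would in practice come from the network's error derivatives and are left as inputs here, and the function name lm_step is ours for illustration.

    import numpy as np

    def lm_step(w, J, e, mu):
        # w: weight vector, J: Jacobian of the errors w.r.t. w, e: error vector
        H = J.T @ J + mu * np.eye(w.size)        # approximate Hessian (5.5) plus damping
        return w - np.linalg.solve(H, J.T @ e)   # equation (5.7)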
5.6 Network Generalization
One of the problems that may arise during neural network training is called overfitting.
The network achieves a very small error on the training set but the performance is very poor
when new data are presented to the already trained network. The network has memorized the
training samples, but it has not learned to generalize to new situations. These generalization
problems are more likely to appear when the number of training samples is small (or they do
Figure 5.4: Illustration of the generalization problem. On the left, the network size was too small
to fit the function h(x). On the right, the network memorized the samples but did not solve the
general problem.
not represent well the general problem to solve) or when the network size is not optimal.
If the size is too small, the network is not able to provide an adequate fitting, and if it is
too big, the network tends to memorize the training samples. The generalization problem is
illustrated in Figure 5.4.
One method for improving network generalization is to use a network that is just large enough
to provide an adequate fit to the training samples. However, it is difficult to know a priori
the required size for a specific application. Among other well-known methods to improve the
generalization of the network we can find regularization and early stopping. Another very
popular approach is the use of neural network ensembles.
5.6.1 Regularization
Regularization involves modifying the performance function, which is normally chosen as the
sum of squares of the network errors on the training set:

mse = (1/N) Σ_{i=1}^{N} e_i^2 = (1/N) Σ_{i=1}^{N} (t_i − a_i)^2   (5.8)

where N is the number of available training samples, t_i is the desired or target output for
training sample i, and a_i is the actual output of the network for that sample.
A regularized version of the mean squared error performance function is the following [78]:

msereg = γ mse + (1 − γ) msw   (5.9)

where γ is the performance ratio, and

msw = (1/n) Σ_{j=1}^{n} w_j^2   (5.10)
Using this alternative performance function causes the network to have smaller weights
and biases, forcing the network response to be smoother and less likely to overfit. However,
there is an obvious problem with this approach: the selection of the performance ratio parameter.
Choosing too large a value for γ may produce overfitting, while keeping it too small will make
the network fail to adequately fit the training data. Thus, it is desirable to determine the
optimal regularization parameters in an automated fashion. A very popular approach is to
use the Bayesian framework, in which the weights and biases of the network are assumed to
be random variables with specified distributions. The regularization parameters are related
to the unknown variances associated with these distributions, and we can then estimate these
parameters using statistical techniques. A detailed discussion of Bayesian regularization
combined with Levenberg-Marquardt training can be found in [26].
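As a minimal sketch of equations (5.8)-(5.10), assuming t and a are vectors of targets and network outputs, w collects all weights and biases, and the value of γ is arbitrary here:

    % Regularized performance index, equations (5.8)-(5.10).
    gamma  = 0.8;                             % performance ratio (assumed value)
    mse    = mean((t - a).^2);                % mean squared error, equation (5.8)
    msw    = mean(w.^2);                      % mean squared weights, equation (5.10)
    msereg = gamma * mse + (1 - gamma) * msw; % regularized index, equation (5.9)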
5.6.2 Early stopping
Early stopping is one of the most widely used techniques to avoid overfitting. In this technique
we divide the available data into three subsets:

Training set. Used for computing the gradient and updating the network weights.

Validation set. Used for monitoring the error during the training process. The error on
this set will normally decrease during the training phase. However, when the network
starts to overfit the training data, the error on the validation set will typically begin to
rise. At that moment the training must be stopped.

Test set. This set is not used during training, but it is used to compare different
models. It is also useful to plot the test set error during the training process. If the
error on the test set reaches a minimum at a significantly different iteration number
than the validation set error, this may indicate a poor division of the available data set.

Early stopping has several advantages. It is very fast and can be applied successfully to
networks in which the number of weights far exceeds the sample size. Furthermore, it requires
only one major decision by the user: what proportion of the available data to assign to each
of the training, validation and test sets.
The main disadvantage of early stopping is that it is a split-sample technique, i.e. neither
training nor validation makes use of the entire sample. This considerably reduces the data
available for the network to learn the problem.
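The early stopping logic can be sketched as follows; train_one_epoch and validation_error are hypothetical helpers, and stopping at the first rise of the validation error is a simplification of the monitoring described above.

    % Early stopping sketch: stop when the validation error begins to rise.
    best_err = Inf;
    for epoch = 1:max_epochs
        net     = train_one_epoch(net, train_set);  % gradient step on the training set
        val_err = validation_error(net, val_set);   % monitor the validation set
        if val_err < best_err
            best_err = val_err;
            best_net = net;                         % keep the best network so far
        else
            break;                                  % validation error rising: stop training
        end
    end
    % The test set is used only afterwards, to compare different models.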
5.6.3 Neural Network ensembles
It has been shown [32] that the generalization ability of a neural network can be significantly
improved by ensembling neural networks, i.e. training several neural networks to solve
the same problem and combining their results in some way. The simplest ensemble can
be formed by training all the ensemble members with the same training set and randomly
initializing the weights and biases of each member with different values. The general
output of the ensemble can then be obtained by simple majority voting, i.e. the decision taken
by the overall ensemble corresponds to the decision taken by the majority of the members.
There are more elaborate techniques for ensembling classifiers, for example:
Bagging. Bagging employs bootstrap sampling to generate several training sets from
the original training set, and then trains an individual network from each generated
training set [32]. The individual predictions are often combined via majority voting.
The Bagging algorithm is shown in Table 5.4, where T bootstrap samples S_1, ..., S_T are
generated from the original training set S, an individual neural network N_t is trained
from each S_t, and an ensemble N* is built from N_1, ..., N_T whose output is the class label
receiving the most votes.

1. for t = 1 to T {
2.   S_t = bootstrap sample from S
3.   N_t = L(S_t)
4. }
5. N*(x) = arg max_{y ∈ Y} Σ_{t : N_t(x) = y} 1

Table 5.4: The Bagging algorithm [88].
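A minimal MATLAB sketch of Table 5.4, where L is the learner training one network, S holds one training instance per row, and predict_label is a hypothetical routine returning the class label assigned by a trained network:

    % Bagging sketch (Table 5.4): T bootstrap replicates, majority voting.
    m = size(S, 1);                          % number of training instances
    N = cell(T, 1);
    for t = 1:T
        St   = S(randi(m, m, 1), :);         % bootstrap sample, drawn with replacement
        N{t} = L(St);                        % train one ensemble member
    end
    votes = zeros(T, 1);
    for t = 1:T
        votes(t) = predict_label(N{t}, x);   % label given to pattern x by member t
    end
    y_hat = mode(votes);                     % class label receiving the most votes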
Adaboost (Adaptive Boosting). Adaboost sequentially generates a series of individual
neural networks, where the training instances that are wrongly classified by the previous
individual networks play a more important role in the training of later networks.
The individual predictions are combined via weighted voting, where the weights are
determined by the algorithm itself. The Adaboost algorithm is shown in Table 5.5,
where T is the number of trials, S_1, ..., S_T are sequentially generated training sets,
an individual neural network N_t is trained for each S_t, and ε_t denotes the weighted error
of N_t on S_t. An ensemble N* is built from N_1, ..., N_T whose output is the class label
receiving the most votes.

1. all the instance weights are set to 1
2. for t = 1 to T {
3.   normalize the weights so that the total weight is m
4.   S_t = sample from S with the normalized instance weights
5.   N_t = L(S_t)
6.   ε_t = (1/m) Σ_{x_i ∈ S_t : N_t(x_i) ≠ y_i} weight(x_i)
7.   β_t = ε_t / (1 − ε_t)
8.   for each x_i ∈ S_t {
9.     if N_t(x_i) = y_i {
10.      weight(x_i) = weight(x_i) · β_t
11.    }
12.  }
13. }

Table 5.5: The Adaboost algorithm [88].
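The instance re-weighting at the core of Table 5.5 can be sketched as below; L and predict_label are the same hypothetical helpers as before, y is the vector of true labels, and randsample (weighted resampling) belongs to the Statistics Toolbox.

    % Adaboost sketch (Table 5.5): weighted error and instance re-weighting.
    weight = ones(m, 1);                           % step 1: all weights set to 1
    for t = 1:T
        weight = weight * (m / sum(weight));       % step 3: total weight equals m
        idx    = randsample(m, m, true, weight);   % step 4: weighted resampling
        N{t}   = L(S(idx, :));                     % step 5: train member t
        wrong  = predict_label(N{t}, S) ~= y;      % misclassified instances
        eps_t  = sum(weight(wrong)) / m;           % step 6: weighted error
        beta_t = eps_t / (1 - eps_t);              % step 7
        weight(~wrong) = weight(~wrong) * beta_t;  % steps 8-12: shrink correct ones
    end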
Other techniques. There are many other methods for building neural network ensembles.
Some of them use complex procedures to select an optimal subset of ensemble members
in order to calculate the weighted output of the ensemble. One example
of such techniques that has shown very good performance is GASEN [87], a selective
ensemble method based on the genetic algorithm.
5.7 Applications in neurophysiology
Artificial neural networks have been widely used in biomedical applications in recent
years. ANNs are especially useful for classification and function approximation when expert
rules cannot be applied or are very difficult to obtain. This is very often the case when
analyzing and classifying neuroelectric signals such as the EEG.

For example, Kelly et al. [44] applied neural networks to myoelectric signal (MES) analysis
tasks. They used a two-layer perceptron for classifying a single-site MES on the basis of
two features: the first time-series parameter of a moving average MES model and the signal
power.
Özdamar and Kalayci [61] used neural networks to detect spikes in EEG recordings,
demonstrating the efficiency of such an approach for the classification of raw EEG data.
Another example is the study by R. Polikar et al. [67] on the classification of ERP recordings
of patients with Alzheimer's disease. The classification was made using the raw time-domain
signal and its wavelet features; better results were obtained when using the wavelet features.

Most of the time, neural networks are found in biomedical applications performing pattern
classification tasks. The difficulty of extracting expert rules that can be applied to real
physiological signals has made neural networks a very popular approach in the
analysis of neuroelectric signals.
Chapter 6
Results
This chapter describes the results obtained with two wavelet-based methods for detecting
the presence of the MMN deflection in single trials or small averages of trials. First, the
characteristics of the available test data are presented. After that, an analysis of the grand average
ERP of the subjects under study is carried out to study the latency and strength of the MMN
component in each of those subjects. Before proceeding to the actual tests, we localize the
MMN in the time-frequency plane, showing that a wavelet multiresolution analysis is suited
to the tasks of ERP denoising and MMN feature extraction.

The first method presented in this chapter is based on a DWT feature extraction step and
classification using a neural classifier. To obtain the training data for the neural network, a
preprocessing step based on ISODATA clustering is used. The second method is also based
on wavelet features, but an ICA-based feature reduction step is now used before the neural
classifier. We will see that the results obtained using this last method are quite promising.
6.1 Data
6.1.1 Subjects and experimental paradigm
The data used in this thesis for testing the previously mentioned algorithms was recorded
using the DSAMP software complex. This system was developed for the Department of Psychology
of the University of Jyväskylä (Finland) especially for ERP investigation. In our particular
experiment the standard signal consisted of a train of continuous tones of frequencies 600 Hz
and 800 Hz alternating every 100 ms (see Figure 6.1). Approximately 20% of the 600 Hz
tones were deviant tones with a duration of 70 ms (deviant type I) or 30 ms (deviant type II).
The deviant tones should elicit the MMN as a brain response. Subjects were reading during
the experiment.

Figure 6.1: Stimulus sequence for eliciting the MMN.

With this experimental paradigm other components of event-related potentials,
like N1, should not be elicited, making the detection of the MMN deflection easier. The
recording of EEG started 300 ms before the deviant sound and stopped 350 ms after it. The
sampling rate was 200 Hz and thus each epoch was composed of 130 samples. The last two
samples were discarded to obtain epochs of dyadic length. Each epoch comprises a standard
response (before the rare stimulus occurs) and a deviant response (starting after the stimulus
takes place). Thus, we obtained three different waves for each epoch, as sketched below:

Standard wave. The first 320 ms (64 samples) of each trial.

Deviant wave. The 320 ms (64 samples) of recording after the deviant tone takes place.

Difference wave. Obtained by subtracting the event-related potentials in response to the
frequent auditory stimulus (standard response) from those to the rare deviant stimuli. The
length of this wave is also 320 ms or 64 samples.
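In MATLAB terms, assuming epoch is a 9-channel × 128-sample matrix (the last two samples already discarded), the three waves are obtained as:

    % Split one epoch (9 channels x 128 samples at 200 Hz) into the three waves.
    standard   = epoch(:, 1:64);      % 320 ms recorded before the deviant tone
    deviant    = epoch(:, 65:128);    % 320 ms recorded after the deviant tone
    difference = deviant - standard;  % difference wave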
In our study we used data obtained from 6 different subjects: ADHD1171, ADHD1160,
ADHD1167, ADHD1165, ADHD1168 and ADHD1176. There were in total 350 epochs from
each subject for each type of deviant, but some of those trials were discarded during the
artifact correction stage.
6.1.2 Data acquisition and equipment
The ERP response of the brain was recorded using a subset of the 10-20 international electrode
placement system [38]. This subset was composed of 9 electrodes: Fz, F3, F4, Cz, C3, C4, Pz
and the mastoids A1 and A2. The positions of the electrodes are presented in Figure 6.2.
Ag-AgCl disc electrodes with impedances of less than 5 kΩ were used. The signal captured
by the electrodes was band-pass filtered using a pass band of 0.1-30 Hz and sampled at a rate
of 200 Hz. The DSAMP [73] hardware-software complex was used for data acquisition and
recording.

Figure 6.2: Channel locations.
The developed algorithms were implemented and tested using MATLAB 6.5 for Windows, running
on a PC with an Intel Pentium III 733 MHz processor, 128 MB of RAM
and Windows 2000 as the operating system. MRA and wavelet processing algorithms were
implemented using Wavelab 8.02 [21]. For ICA, the MATLAB packages JADE [12] and runica [20]
were used. For plotting scalp distributions, maintaining the ERP data in order
and rejecting artifacts, we used the EEGLAB toolbox [20]. Finally, the algorithms involving
neural classifiers were developed using the Neural Network Toolbox for MATLAB [78].
6.1.3 Artifact correction
We used semi-automated rejection coupled with visual inspection for rejecting artifacts. This
was done according to several criteria:

Rejecting extreme values. Any trial having values exceeding ±50 μV in any of the 9
channels at any time within the epoch was marked for rejection.
Rejecting abnormal trends. Artifactual currents may cause a linear drift to occur at some
electrodes. To detect such drifts, the data is fitted to a straight line. If the slope
exceeded 15 μV over the whole epoch, the trial was marked for rejection.

Subject      Deviant tone   Number of trials
ADHD1171     Type I         282
             Type II        250
ADHD1160     Type I         248
             Type II        269
ADHD1167     Type I         148
             Type II        183
ADHD1165     Type I         245
             Type II        274
ADHD1168     Type I         281
             Type II        280
ADHD1176     Type I         258
             Type II        241

Table 6.1: Number of trials that remained for each subject after rejecting those trials with artifacts.
Rejecting improbable data. EEGLAB [20] provides a routine for determining the probability
distribution of values across the data epochs, from which the probability of each trial can
be calculated. Trials containing artifacts are (hopefully) improbable
events and thus may be detected using a function that measures the probability of
occurrence of trials. Any trial with a probability falling outside a range of 3 standard
deviations around the mean was marked for rejection.
All the trials that were automatically marked for rejection were checked visually, and those
suspected of containing artifacts were deleted from the dataset. Table 6.1 summarizes the number
of available trials for each subject after the artifact rejection step. A sketch of the first two
criteria is given below.
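One possible reading of the first two criteria, for an epoch x given as a 9-channel × 130-sample matrix in μV (interpreting the 15 μV slope limit as the total drift accumulated over the epoch is our assumption):

    % Mark an epoch for rejection: extreme values or an abnormal linear trend.
    reject = any(abs(x(:)) > 50);             % any sample beyond +/-50 uV
    n = size(x, 2);
    for ch = 1:size(x, 1)
        p = polyfit(1:n, x(ch, :), 1);        % straight-line fit for this channel
        if abs(p(1)) * n > 15                 % total drift over the epoch above 15 uV
            reject = true;
        end
    end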
6.2 Averaging
As we already mentioned in chapter 3, the single EEG epochs are very noisy and, apart
from the desired event-related potential, also contain the non-event-related ongoing brain
activity. To visualize the ERP we must use techniques that eliminate the spontaneous activity.
The most widely applied procedure nowadays is averaging. Averaging may not be
suitable for investigating brain dynamics arising from intermittent changes in subject state
or from complex interactions between task events. Nevertheless, it is a very easy way to study
the general behavior of the MMN component. In our particular case we will use averaging
to check the latency and strength of the MMN component in the subjects under study.
Figures 6.3 and 6.4 show the grand average ERPs for those subjects.
Observing Figures 6.3 and 6.4 we can easily understand the complexity of our task. In many
subjects the MMN is hardly distinguishable from the grand averages. Subject ADHD1176
shows an almost flat MMN response for both types of deviant sounds. However, subject
ADHD1171 shows a quite clear MMN deflection, especially for deviant type II. In fact, the
experiment with deviant II seemed to elicit a clearer MMN component in the majority of the
subjects. Thus, in the following we will only study the ERP data for the experiments using
that deviant. In general, from these grand averages we can conclude that:

The MMN deflection typically appears in the time window ranging from 125 to 200 ms
after the deviant stimulus. This latency seems to be quite stable for all subjects.

The MMN is usually characterized by a positive spike on the frontal Fz, F3, F4 and central
Cz, C3, C4 channels, low activity on the central parietal Pz channel and a negative
deflection in the mastoids A1 and A2.

As we have said, in some subjects we cannot observe any significant deflection in the MMN
time range of the grand averages. Nevertheless, this does not mean that the MMN has not
been elicited in those subjects. The brain response activity varies widely between trials in
both time course and scalp distribution. This fact may cause the MMN deflection to be
canceled when making the grand averages, even though the MMN could be observed in some
individual trials of the experiment. One of our goals is to automatically select the trials that
show the MMN deflection, average them, and obtain an improvement in the signal-to-noise ratio
when compared with the average over all trials.
In Figure 6.5 the ERP images at electrodes A1 and Fz are shown for subject ADHD1171
for different sizes of the averaging window. This kind of graphical representation gives
us a coarse idea of the importance of the ERP dynamics involved in the MMN elicitation
process. We can see that when the window size is 1 epoch, i.e. when we do not average at
all, it is impossible to distinguish the MMN even though subject ADHD1171 shows a quite
strong MMN deflection in the grand average. For a window of 30 epochs we can clearly
distinguish the positive activity in the frontal Fz and the negative deflection in the mastoid
A1 at the MMN latencies. We can also see that the MMN-like deflection is not equally clear
during the whole experiment. There are intervals of trials when the MMN is stronger and periods
Figure 6.3: Grand averages for the subjects ADHD1171, ADHD1160 and ADHD1167 (deviants I and II). The vertical dashed line marks the appearance of the deviant sound. The two solid vertical lines denote the approximate latency of the MMN deflection.
Figure 6.4: Grand averages for the subjects ADHD1165, ADHD1168 and ADHD1176 (deviants I and II). The vertical dashed line marks the appearance of the deviant sound. The two solid vertical lines denote the approximate latency of the MMN deflection.
when it is weaker. These variations of the MMN amplitude could be of interest to
physicians, but classical averaging techniques do not allow us to study the MMN dynamics
during the experiment.
6.3 Time-Frequency localization of the MMN
After localizing the MMN component in time, we tried to localize it in frequency by performing
a time-frequency analysis of the EEG trials. We used different types of time-frequency
representations (TFRs), including the short-time Fourier transform, the continuous wavelet
transform (CWT), Wigner distributions, etc.

Analysis of single trials turned out to be almost impossible due to the great variability of their TFRs,
so we used small averages of some epochs to perform the analysis. We found, as expected,
that the MMN component is approximately located in the time-frequency window defined by
f ∈ (4 Hz, 15 Hz) and t ∈ (150 ms, 210 ms), setting t = 0 at the stimulus occurrence time.
These results correlate with [64], [77].

Figure 6.6 shows the TFRs obtained from averages of 50 epochs for subject ADHD1171 and
deviant type II. We can see in those figures the presence of a clear MMN component in the
deviant part of the EEG, located in the time-frequency window already mentioned.
Even though the presence of the MMN is evident for this subject when inspecting the time-frequency
representation of its partial averages, it is not so clear for all subjects. For example,
in the case of subject ADHD1176 (for both types of deviant) it was impossible to find in
the time-frequency plane any component similar to the MMN. This suggests that a time-frequency
analysis, while suitable for some subjects, may not be enough for characterizing
the MMN in others. This is the reason for the low effectiveness in detecting the
MMN of methods using a feature extraction stage based on wavelets or any other kind of
time-frequency features. We will see later how Independent Component Analysis (ICA), when
combined with wavelet methods, can improve these detection ratios.

Another point to note from the TFRs is the presence, for all subjects, of a low-frequency
trend and a high-frequency noise. Both should be eliminated before proceeding to the MMN
classification tasks.
Figure 6.5: ERP images using different moving-window averages for subject ADHD1171dev2. Panels show electrodes A1 and Fz for window sizes of 1, 10, 30 and 70 epochs.
Figure 6.6: Scalograms obtained for ADHD1171dev2 (averages of trials 1-50, 51-100, 101-150 and 151-200).
6.4 Wavelet denoising
Traditionally, signal denoising is achieved by linear processing methods such as Wiener filtering.
Recently, alternative nonlinear wavelet-based methods have been developed, most of
them based on wavelet coefficient thresholding. For our particular case we have found that
our signal of interest is located within the 3-15 Hz frequency band. Any component outside
that range is not of interest and can be considered as noise. Thus, we have used hard
thresholding to completely suppress the DWT coefficients corresponding to those undesired
components.

We have used a 5-level MRA filterbank to decompose the raw ERP trials into six octaves
corresponding to the frequency bands 50-100 Hz, 25-50 Hz, 12.5-25 Hz, 6.2-12.5 Hz, 3.1-6.2 Hz,
and 0-3.1 Hz. These bands approximately match the delta (0.5-3.5 Hz), theta (3.5-7 Hz), alpha
(8-12 Hz), beta (13-30 Hz), gamma (30-60 Hz), and high gamma (>60 Hz) rhythms of the EEG.
Figure 6.7 illustrates this process for a sample ERP trial. If L is the length of the signal
being analyzed, we have L/2 DWT coefficients at the first octave, L/4 at the second, L/8
at the third and so on. In our case L = 128 for a complete ERP trial and L = 64 for the
standard, deviant and difference waves. By zeroing out the DWT coefficients that do not
belong to the 3.1-6.2 Hz and 6.2-12.5 Hz octaves, we can suppress the low-frequency trend and
the high-frequency noise while keeping the shape of the MMN deflection.
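A sketch of this filtering step using the MATLAB Wavelet Toolbox (the thesis implementation used Wavelab, whose function names differ); 'db6' denotes the 12-tap Daubechies filter pair and x is one 64-sample wave:

    % Wavelet hard filtering: keep only the octaves covering 3.1-12.5 Hz.
    dwtmode('per');                  % periodized DWT, matching the orthogonal FWT used here
    [c, l] = wavedec(x, 5, 'db6');   % c = [cA5 cD5 cD4 cD3 cD2 cD1]
    cf = zeros(size(c));
    cf(3:4) = c(3:4);                % keep cD5: 2 coefficients, 3.1-6.2 Hz octave
    cf(5:8) = c(5:8);                % keep cD4: 4 coefficients, 6.2-12.5 Hz octave
    xf = waverec(cf, l, 'db6');      % filtered wave via the inverse DWT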
The actual wavelet decomposition scheme was a filterbank based on the Daubechies 12 pair of
quadrature mirror filters (QMF). The time and frequency representations of those filters can
be seen in Figure 6.8. More specifically, a periodized and orthogonal version of the FWT was
used. Since the wavelet decomposition can be understood as a correlation process between
the signal being analyzed and the analyzing wavelet at different scales, we tried to choose a
wavelet basis that resembles the typical MMN shape. The Daubechies 12 wavelet (see Figure 6.9) was
selected for the decomposition because its shape is similar to that typically observed for the
MMN and because it yielded good results when filtering the ERP trials.
We must say that the filtering process might be improved by using more sophisticated filters.
However, wavelet filtering is a technically simple, understandable procedure that yields
acceptable filtering results and allows us to integrate the filtering and feature
extraction stages in the MMN detection methods explained in sections 6.5 and 6.6. An
extension of our basic wavelet filtering scheme is the theory of nonlinear wavelet domain
filtering [69, 86], based on identifying the significant features of a noisy signal through the
correlation between the scales of its nonorthogonal subband decomposition. This could be a
topic for further research.
Figure 6.7: Wavelet decomposition of a single ERP trial into 5 octaves. Note the MMN deflection in the 6-12 Hz octave.
Figure 6.8: Pulse characteristics and spectra of the Daubechies 12 quadrature mirror filters (decomposition low-pass and high-pass filters).
Figure 6.9: Daubechies 12 scaling function φ(t) and wavelet function ψ(t).
Figure 6.10: ERP image for subject ADHD1171dev2 (electrode A1) before and after filtering.
As an example, the results obtained with the wavelet filtering process for channel A1 of subject
ADHD1171 are shown in Figure 6.10. For that subject and channel, the SNR improvement in
the grand average due to the filtering process was 3.12 dB. The SNR, in dB, of an ERP
epoch with MMN deflection is given by:
SNR = 10 log_10 (P^2 / σ^2)   (6.1)

where P is the voltage peak of the MMN deflection and σ^2 is an estimate of the background
noise variance. The value of σ^2 is obtained from the variance of the standard response. The
improvement in SNR is given by:

ΔSNR = SNR_f − SNR_o   (6.2)

where SNR_o and SNR_f denote, respectively, the SNR before and after the filtering process.
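In MATLAB terms, taking the peak of the deviant wave as an approximation of the MMN peak P, the improvement can be computed as:

    % SNR improvement due to filtering, equations (6.1)-(6.2).
    snr_of = @(dev, stw) 10 * log10(max(abs(dev))^2 / var(stw));
    SNRo = snr_of(deviant,   standard);     % before filtering
    SNRf = snr_of(deviant_f, standard_f);   % after filtering
    dSNR = SNRf - SNRo;                     % improvement in dB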
6.5 First method: DWT+ISODATA+NN
The first method that we have used is based on a wavelet feature extraction stage and a
neural classifier. The data used for training the classifier is preprocessed using ISODATA
clustering. We will call this method the DWT+ISODATA+NN method. In this method the
standard waves are considered non-targets (and the system should output a 0 for them) and
either the deviant or the difference waves are considered targets (and the output for them
should be 1). Very briefly, the procedure can be divided into two phases: the training phase
and the actual classification phase. In the training phase the following steps are followed:

1. Perform the DWT of the individual target and non-target responses.

2. Use ISODATA to cluster the dataset formed by the DWT coefficients in the MMN
time-frequency window of both target and non-target waves.

3. Select the cluster most correlated with the MMN. We will denote this cluster as the best
MMN cluster.

4. Train a neural network using the following training sequence: the trials
presented as targets to the network were 50% of the target waves in the best MMN
cluster; the trials presented as non-targets were 50% of the non-target waves
in the cluster complementary to the best MMN cluster.

Once the network has been trained, the classification of individual ERP waves is made
following these steps:

1. Obtain the DWT coefficients of the individual response that we want to classify.

2. Use the subset of the DWT coefficients corresponding to the time-frequency window of
the MMN as the pattern vector for that response.

3. Input that pattern vector to the trained neural network. If the output of the network is
1 the trial is classified as target (or MMN correlated); otherwise the trial is classified
as non-target (or MMN non-correlated).

For testing the system we used all the individual waves that were not used for training the
network. The steps of the procedure are described in more detail in the following sections.
6.5.1 Wavelet based feature extraction
In this thesis we have used the wavelet transform not only for filtering but also for extracting
characteristic features of the MMN component. We use the DWT coefficients from the
bands 3-6 Hz and 6-12 Hz as features for MMN characterization. Figures 6.11 and 6.12
show the wavelet coefficients for the grand averages of the standard and difference waves,
respectively. After analyzing the wavelet coefficients obtained for the 6 subjects under study
we have concluded that:

Wavelet coefficients of the deviant waves with a significant MMN component are more
regular than those obtained from the standard responses.

Coefficients in the band 6-12 Hz seem to be more correlated with the MMN component
than the ones in the 3-6 Hz frequency band.

The 2 DWT coefficients from the 6-12 Hz band representing t ∈ (80 ms, 240 ms) seem
to be the ones most correlated with the MMN deflection.

In general, the polarity of the coefficients in the channels A1, A2 is opposite to the
polarity of the coefficients for the other channels. This correlates with the polarity of the
deflection we found in section 4.

Sometimes the wavelet coefficients obtained using small averages of 30 or 40 epochs
show more regularity than those obtained from the grand averages.

We have observed that in some cases the grand averages seem to be less correlated with the MMN
component than some of the small averages, i.e. there are some groups of epochs where the DWT
coefficients show much more regularity than in others (see for example Figure 6.13). This
is due to the fact that the experimental conditions are variable and some of the observations
reduce the signal-to-noise ratio of the grand average instead of improving it.
6.5.2 Training data preprocessing: ISODATA clustering
Once we have extracted the wavelet features, one possibility is to train the neural classifier
using those features without further processing. However, the variability of the wavelet
features is too high and the classes are not separated enough to obtain a well-trained classifier.
We have therefore tried to segment the original heterogeneous data set into smaller, more homogeneous
subsets that can be more easily analysed. To that end, we have defined a dissimilarity measure
between ERP trials, given in equation (6.3) below.
Figure 6.11: Average DWT coefficients in the bands 3-6 Hz (left column) and 6-12 Hz (right column) for the standard waves from subjects ADHD1171 (blue), ADHD1160 (green), ADHD1167 (red), ADHD1165 (cyan), ADHD1168 (black) and ADHD1176 (magenta).
Figure 6.12: Average DWT coefficients in the bands 3-6 Hz (left column) and 6-12 Hz (right column) for the deviant waves from subjects ADHD1171 (blue), ADHD1160 (green), ADHD1167 (red), ADHD1165 (cyan), ADHD1168 (black) and ADHD1176 (magenta).
Figure 6.13: DWT coefficient images for subjects ADHD1171dev2 and ADHD1167dev2 (channel Fz, standard and deviant waves), using a moving window of 20 epochs. The vertical axis shows ERP trials and the horizontal axis the indexes of the DWT coefficients representing the bands 3-6 Hz (indexes 1 and 2) and 6-12 Hz (indexes 3 to 6). At the bottom of each panel the grand average vector of DWT coefficients is shown.
The dissimilarity between two ERP trials t_1 and t_2 is defined as:

d(t_1, t_2) = Σ_{j=1}^{L} a_j Σ_{i=1}^{M} b_i (c_{j,i}^{(1)} − c_{j,i}^{(2)})^2   (6.3)
where L is the number of available channels (L = 9), M is the number of coefficients per
channel (M = 6), and c_{j,i}^{(k)} is the i-th coefficient of the j-th channel for trial k. The
a_j are the weights assigned to each channel and the b_i the weights for each DWT coefficient.
We have used a_j = 1.5 for j ∈ {1, 2}, a_j = 0 for j = 3 and a_j = 1 for j ∈ {4, ..., 9}, i.e. we
discard the information coming from channel Pz and we emphasize the contribution of the
mastoids. We have also set b_i = 0 for i ∈ {1, 2, 3, 6} and b_i = 1 for i ∈ {4, 5}, i.e. we do not
take into account the DWT coefficients corresponding to the 3-6 Hz band nor the DWT
coefficients from the 6-12 Hz band representing the first and last 80 ms of the response (where the
MMN deflection should not be present). We decided to discard the information coming
from the central parietal electrode since its coefficients showed very little regularity
between different subjects. The coefficients associated with the 3-6 Hz band were not considered
in the dissimilarity measure definition for the same reason. The mastoids were emphasized
to balance the importance of the negative spike in the mastoids against the importance of the
positive deflection in the central and frontal electrodes.
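Measure (6.3) with the weights above reduces to one line of MATLAB, assuming C1 and C2 are the 9 × 6 DWT coefficient matrices of the two trials with channels ordered A1, A2, Pz, and then the frontal/central electrodes (this ordering is our assumption):

    % Dissimilarity between two trials, equation (6.3).
    a = [1.5 1.5 0 1 1 1 1 1 1];    % channel weights (mastoids emphasized, Pz discarded)
    b = [0 0 0 1 1 0];              % coefficient weights (only t in (80,240) ms, 6-12 Hz)
    d = a * ((C1 - C2).^2) * b';    % weighted sum over channels and coefficients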
Using the measure (6.3) we have performed a clustering analysis of the original response
datasets. The datasets to be analyzed were formed either by the subject's standard responses
and the deviant ones, or by the standard waves and the difference waves. The standard waves
were in both cases non-targets and the deviant (or difference) responses were targets. Each
subject was treated separately. We have set the desired number of cluster centers to k = 2.
For the selection of the initial cluster centers z_1, z_2 we have used three different methods:

RANDOM selection: z_1 is a randomly selected target response (deviant or
difference) in the test set and z_2 is randomly selected from the non-target responses,
i.e. from the standard waves.

PSEUDO-RANDOM type 1: z_1 is assigned the average of the target responses and
z_2 is a randomly selected non-target wave.

PSEUDO-RANDOM type 2: we manually select for z_1 a target response well correlated
with the MMN component. For z_2 we randomly select one of the non-target waves.
For each of these three selection methods we obtained 50 different clustering results by using
different initial centers. The algorithm used for the clustering was ISODATA, which can
be understood as a generalization of the k-means clustering algorithm. Both k-means and
ISODATA are explained in Appendix A. For each clustering result we used a misclassification
matrix to analyse the correspondence between clusters and responses of one type. Two sample
misclassification matrices are shown in Tables 6.2 and 6.3.

Averaged trials             Cluster 1   Cluster 2
Standard Responses     1    103         147
(NON-TARGETS)          10   11          14
                       25   6           4
                       50   1           4
Deviant Responses      1    178         72
(TARGETS)              10   16          9
                       25   7           3
                       50   4           1

Table 6.2: Sample misclassification matrix for subject ADHD1171dev2 when using the deviant waves as targets. The total number of trials was 250 and ISODATA found 2 clusters. The selection method used was PSEUDO-RANDOM type 1.
In the case of single ERP trials we considered a cluster to be MMN if it satisfied these
conditions:

1. Less than 40% of the samples in the cluster are from non-target (standard) responses.

2. The cluster contains at least 40% of the total number of samples in the dataset.

This criterion is not very strict and thus, in many of the clustering results, a cluster labeled
as MMN was found. Tables 6.4 and 6.5 show in how many clustering results a MMN cluster was
found for the three different initial selection methods mentioned before. We can see that the
RANDOM selection method performs worse than the other two and that the performance
of PSEUDO-RANDOM I and PSEUDO-RANDOM II is quite similar. However, when using
PSEUDO-RANDOM II we need to manually select one of the cluster centers. It is desirable
for our classification system to be automatic; therefore, in the following we will assume that
PSEUDO-RANDOM I was used.
Averaged trials              Cluster 1   Cluster 2
Standard Responses      1    93          157
(NON-TARGETS)           10   10          15
                        25   4           6
                        50   2           3
Difference Responses    1    142         108
(TARGETS)               10   13          12
                        25   7           3
                        50   3           2

Table 6.3: Sample misclassification matrix for subject ADHD1171dev2 when using the difference waves as targets. The total number of trials was 250 and ISODATA found 2 clusters. The selection method used was PSEUDO-RANDOM type 1.
                RANDOM   PSEUDO-RANDOM I   PSEUDO-RANDOM II
ADHD1171dev2    29       43                40
ADHD1160dev2    27       28                24
ADHD1167dev2    33       39                42
ADHD1165dev2    25       35                33
ADHD1168dev2    26       30                28
ADHD1176dev2    21       25                27

Table 6.4: Number of times that a MMN cluster was found when using the deviant responses as targets. 50 clustering results were obtained by setting different initial cluster centers.
6.5.3 Neural Classifier
After performing the ISODATA preprocessing of the DWT features, we have tested the performance
of a neural network classifier in detecting the presence of the MMN component in single
ERP trials. The network used was a multilayer perceptron with an input layer formed by 4
neurons with linear activation, a hidden layer with 6 tangent sigmoid neurons and an output
layer of 2 linear neurons. The training algorithm was Levenberg-Marquardt backpropagation,
and Bayesian regularization and early stopping were used to avoid overfitting.
To reduce the dimensionality of the pattern vectors we have defined two virtual channels as
described in [72]:

Virtual channel V1. Obtained by averaging the channels from the frontal and central parts
of the scalp, i.e. the channels Fz, F3, F4, Cz, C3 and C4.

Virtual channel V2. Obtained by averaging the channels A1 and A2 (mastoids).

Using only these two channels to describe an ERP trial, we consider 4-dimensional pattern
vectors: the 2 DWT coefficients belonging to channel V1 and representing the time-frequency
window t ∈ (80 ms, 240 ms), f ∈ (6 Hz, 12 Hz), and the 2 DWT coefficients from channel
V2 representing the same window in the time-frequency plane, as sketched below.
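Assuming the same 9 × 6 coefficient matrix C and channel ordering as before, the 4-dimensional pattern vector is formed as:

    % Build the 4-dimensional pattern vector from the two virtual channels.
    V1 = mean(C(4:9, :), 1);    % virtual channel V1: frontal and central electrodes
    V2 = mean(C(1:2, :), 1);    % virtual channel V2: mastoids A1 and A2
    p  = [V1(4:5), V2(4:5)];    % coefficients for t in (80 ms, 240 ms), 6-12 Hz band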
A basic approach for training the network would be to present the DWT pattern vectors from
the standard responses as non-targets and the patterns from the deviant (or difference) waves
as targets. However, analysis of single trials shows that the majority of them do not present the MMN
deflection; moreover, they are often contradictory to each other. Hence, the results obtained
with such a training sequence are not satisfactory. As an example, when using difference
waves as targets, correct classification ratios are below 60% and false positives (incorrect
detection of the MMN deflection in standard waves) are over 30% for all subjects [72].
In the DWT+ISODATA+NN method proposed in this Thesis, the training dataset is formed
from the clusters obtained in the ISODATA preprocessing step. From all the clusters labeled
as MMN we choose the cluster with minimum intraset distance; we will refer to this cluster as
the best MMN cluster. The intraset distance in a set of pattern points a^{(i)}, i = 1, 2, ..., K,
is defined as:
D^2({a^{(j)}}, {a^{(i)}}) = (1/K) Σ_{j=1}^{K} D^2(a^{(j)}, {a^{(i)}})   (6.4)

where

D^2(a^{(j)}, {a^{(i)}}) = (1/(K−1)) Σ_{i=1}^{K} D^2(a^{(j)}, a^{(i)})   (6.5)

and D is the dissimilarity measure defined in equation (6.3).

Figure 6.14: Sample average deviant waves (subject ADHD1165dev2, electrodes A1 and Fz) obtained by discarding in the averaging process any wave not belonging to the best MMN cluster. Red: average of the target waves in the best MMN cluster; blue: average of the target waves not belonging to the cluster; black: grand average using all the target trials.
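A direct sketch of equations (6.4)-(6.5), where A is a cell array holding the K pattern points of a cluster and D2 is a function handle implementing the squared dissimilarity of equation (6.3):

    % Intraset distance of a cluster, equations (6.4)-(6.5).
    K = numel(A);
    Dset = 0;
    for j = 1:K
        Dj = 0;
        for i = 1:K
            Dj = Dj + D2(A{j}, A{i});    % equation (6.5): distance of point j to the set
        end
        Dset = Dset + Dj / (K - 1);
    end
    Dset = Dset / K;                     % equation (6.4)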
The characteristics of the best MMN clusters for the 6 subjects under study are presented
in Tables 6.6 and 6.7. Visual inspection of the samples in the best MMN clusters confirmed
that they correlate well with the MMN deflection. We can obtain averaged ERPs with a
clean MMN component by including in the average only the target waves in the best MMN
cluster (see for example Figure 6.14). On average, when the deviant waves are used as targets,
54.34% of the deviant waves and 30.31% of the standard waves were assigned to the best
MMN cluster. When using the difference waves as targets, 50.13% of the difference waves
and 30.90% of the standard waves were placed in the best MMN cluster. Visual inspection of the
best MMN clusters when using deviant and difference responses as targets revealed that
the MMN deflection was more clearly detected in the deviant waves.

To form the training sequence of the neural classifier we labeled as 1 the target responses
contained in the best MMN cluster and as 0 all the non-target waves belonging to the cluster
complementary to the best MMN cluster. Using such a training sequence we obtained the
classification results shown in Tables 6.8 and 6.9.

Averages of the trials classified by the network as MMN show a significant deflection in the
MMN time range (150 ms - 220 ms). Figures 6.15, 6.16, 6.17 and 6.18 show the shapes of
the signals obtained by averaging the target responses classified as MMN and the ones
classified as non-MMN.
                RANDOM   PSEUDO-RANDOM I   PSEUDO-RANDOM II
ADHD1171dev2    25       39                35
ADHD1160dev2    22       26                25
ADHD1167dev2    28       34                36
ADHD1165dev2    24       30                32
ADHD1168dev2    20       26                24
ADHD1176dev2    22       22                21

Table 6.5: Number of times that a MMN cluster was found when using the difference responses as targets. 50 clustering results were obtained by setting different initial cluster centers.
Subject     Cluster Size (%)   Standard (%)   Deviant (%)
ADHD1171    43.12%             33.85%         66.15%
ADHD1160    40.44%             39.90%         60.10%
ADHD1167    45.36%             37.80%         62.20%
ADHD1165    43.11%             32.42%         67.58%
ADHD1168    40.15%             34.72%         65.28%
ADHD1176    41.70%             36.32%         63.68%

Table 6.6: Characteristics of the best MMN clusters when the deviant responses are used as targets. The second column shows the size of the cluster as a percentage of the total number of samples; the two rightmost columns give the percentage of standard and deviant waves over the total number of samples in the cluster.
Subject     Cluster Size (%)   Standard (%)   Difference (%)
ADHD1171    41.11%             37.96%         62.04%
ADHD1160    40.24%             39.60%         60.40%
ADHD1167    40.36%             38.94%         61.06%
ADHD1165    40.35%             35.12%         64.88%
ADHD1168    40.71%             37.90%         62.10%
ADHD1176    40.46%             39.49%         60.51%

Table 6.7: Characteristics of the best MMN clusters when the difference responses are used as targets. The second column shows the size of the cluster as a percentage of the total number of samples; the two rightmost columns give the percentage of standard and difference waves over the total number of samples in the cluster.
Subject     Targets in standard waves   Targets in deviant waves
ADHD1171    17.78%                      50.33%
ADHD1160    14.74%                      36.65%
ADHD1167    11.42%                      42.14%
ADHD1165    9.05%                       31.10%
ADHD1168    13.01%                      29.00%
ADHD1176    16.60%                      22.40%

Table 6.8: Classification results when using the deviant waves as targets.
Subject     Targets in standard waves   Targets in difference waves
ADHD1171    14.76%                      35.23%
ADHD1160    9.16%                       21.11%
ADHD1167    7.14%                       27.14%
ADHD1165    12.20%                      27.16%
ADHD1168    11.95%                      23.90%
ADHD1176    13.69%                      18.25%

Table 6.9: Classification results when using the difference waves as targets.
6.5.4 Discussion
The obtained classification results clearly improved the classification ratios obtained when the ISODATA
preprocessing was not used. In general, the classification was more accurate when using
deviant waves as targets than when using difference waves. When using the former, the MMN
deflection was detected in 35.27% of the deviant responses and in 13.77% of the standard
waves. When using the latter, the deflection was detected in 25.46% of the difference
responses and in 10.29% of the standard waves. These results are similar to those obtained in
[72].

Even though the ratios of correct classification are still too low to make the system useful for
detecting the presence of the MMN in single ERP trials, it can be used for selectively averaging
ERP trials with a strong MMN deflection in order to obtain improved signal-to-noise ratios.
Figures 6.15 and 6.16 show the average deviant waves obtained in such a fashion versus the
grand average of the whole stream of trials and the average of the trials where the MMN
deflection was not detected. Averages of target trials exhibit a clear negative deflection in the
mastoid channels with a latency of around 140 ms after the appearance of the deviant stimulus.
Correspondingly, a frontocentral positive deflection is also clearly seen in the recordings
from channel Fz with a latency of about 150 ms. These results support the hypothesis that
the frontal MMN generator is activated later than the auditory cortex generator [70]. The
time difference between the temporal and frontal components of the MMN is very difficult to
distinguish in the grand averages over the whole set of ERP trials and is more clearly
seen in the selective averages of target trials. This time offset and the amplitudes of the two
separate sources of the MMN can be used as clinical parameters of study (see for example
[1]).

The average waves obtained for subject ADHD1176 show that the classification of the input trials
was not correct: a positive deflection appears in the mastoid channels during the MMN time
window. The classification ratios for this subject were especially poor. Up to 16.6% of the
standard responses were labeled as target waves and the MMN deflection was detected in only
22.40% of the deviant waves. Examination of the grand averages for this particular subject
(see Figure 6.4(f)) revealed a practically flat response at the MMN latencies for all channels,
which can explain the incorrect classification.
Figure 6.15: Averages of the deviant waves classified as targets (red) and non-targets (blue) versus grand averages (black), for subjects ADHD1171, ADHD1160 and ADHD1167 at electrodes A1 and Fz.
Figure 6.16: Averages of the deviant waves classified as targets (red) and non-targets (blue) versus grand averages (black), for subjects ADHD1165, ADHD1168 and ADHD1176 at electrodes A1 and Fz.
Figure 6.17: Averages of the difference waves classified as targets (red) and non-targets (blue) versus grand averages (black), for subjects ADHD1171, ADHD1160 and ADHD1167 at electrodes A1 and Fz.
Figure 6.18: Averages of the difference waves classified as targets (red) and non-targets (blue) versus grand averages (black), for subjects ADHD1165, ADHD1168 and ADHD1176 at electrodes A1 and Fz.
6.6 Second method: DWT+ICA+NN
In the last section we discussed a classification method for ERP trials with MMN
deflection based on wavelet features. We observed that, using that method, the detection
rates of the MMN component in deviant responses were very low (below 40% on average) and
the false detections of the MMN in standard waves were quite high (over 10% for all subjects).
These results suggest that the multiresolution features obtained using the Discrete Wavelet
Transform are not able to properly describe the MMN. Noise and undesired EEG components
overlap with the MMN in time and frequency, making the wavelet features inaccurate.

In this section we propose an alternative method for detecting the MMN. This method uses
Independent Component Analysis to separate the MMN from any other undesired component
(noise, artifacts, EEG rhythms, etc.). ICA allows the separation of components with
overlapping time latencies and frequency characteristics. When ICA and wavelets are combined,
a better description of the MMN can be obtained, yielding better classification rates.

The general scheme of the DWT+ICA+NN method is depicted in Figure 6.19. Very briefly, the
processing steps involved are these:
1. Perform a DWT of the raw individual responses.

2. Extract characteristic features of the MMN from the DWT coefficients of the raw ERP
trials. The main features used are the coefficients belonging to the band 6-12 Hz.

3. Mix the wavelet features coming from the different channels into a unique MMN channel.
The mixing coefficients are obtained through independent component analysis of the
filtered EEG data.

4. Classify the patterns representing single ERP trials as targets or non-targets using a
neural network. This neural network must be trained before proceeding to the actual
classification process: 50% of the available data was used for training the classifier and
the rest for testing the trained system. For the training process we used the standard
responses as non-target waves and either the deviant or difference waves as target waves.
6.6.1 Wavelet based feature extraction and filtering

As in the previous DWT+ISODATA+NN method, the DWT+ICA+NN method uses wavelet features
to describe the MMN component. First, we calculate the DWT of the individual response
that has been introduced to the system. We use as descriptors of each trial the 4 coefficients
from the 6-12 Hz band. Thus, we obtain for each trial 9 channels × 4 features/channel = 36
features. Later on, these 36-dimensional pattern vectors will be reduced to 4-dimensional
patterns by mixing the features according to the mixing coefficients obtained using ICA.

Figure 6.19: Scheme of the DWT+ICA+NN system.
We have only 9 available channels, which means that ICA will be able to extract at most
9 different independent components. The number of components present in the ERP trials
is most likely larger than 9; therefore we must filter the ERP trials before applying ICA
in order to eliminate any irrelevant component falling outside the MMN frequency range.
Another reason for filtering the EEG data is that not all ICA algorithms (for example
the Infomax algorithm) are capable of unmixing independent components with sub-Gaussian
(negative kurtosis) distributions. Nonetheless, sub-Gaussian sources can be found in ERPs
and spontaneous EEG recordings in the form of line noise, sensor noise and low-frequency
activity [5]. Wavelet filtering can be used to eliminate such components from the raw ERP
data. The filtering takes place as explained in section 6.4, i.e. we set to 0, for every single
trial, all the DWT coefficients except those representing the bands 3-6 Hz and 6-12 Hz.
6.6.2 ICA based feature reduction
We apply ICA to the filtered ERP data in order to separate the brain sources responsible for
the different components present in the ERP trials. It has been shown [42, 41, 5] that ICA
can effectively decompose multiple overlapping components from selected sets of large averages
of ERPs. In this work we apply ICA to single trials and small averages of trials, where many
irrelevant components are present. The general scheme used is a moving-window independent
component analysis (see Figure 6.19) based on the following steps:

1. First, the individual waves (320 ms/64 samples per trial) are band-pass filtered by
keeping only the wavelet coefficients belonging to the bands 6.2-12.5 Hz (4 coefficients)
and 3.1-6.2 Hz (2 coefficients) and performing the inverse DWT.

2. The next step is to perform a moving-window average of the ERP trials that we will
analyze using ICA. By default we have set the window size parameter to 10 trials. This
step can be skipped, but we have observed that ICA works better when the ERP data
is slightly smoothed using averaging.

3. Then, we divide the filtered ERP dataset into smaller sub-datasets of a certain number
of trials (by default 20), each one with some trials overlapping between consecutive sub-datasets
(by default 0). Each of those sub-datasets is then decomposed into independent
components using ICA. We have tested two different algorithms in this stage: Infomax
and JADE. The effect of varying the number of trials in each sub-dataset is studied
in section 6.6.3. In Figure 6.20 we can see a sample decomposition of the nine ERP
channels into nine independent components.

Figure 6.20: Ten ERP trials from subject ADHD1171dev2 and their decomposition into nine independent components. Component 5 was labeled as MMN correlated.
4. After that, the component or components most correlated with the MMN are
sought in each sub-dataset. This is done by means of a template matching procedure.
The template was obtained by averaging a group of components with strong MMN
deflection that were manually selected from the subjects under study. The MMN com-
ponent template is shown in Figure 6.21. For each sub-dataset, the component whose
average activation during the epochs in the sub-dataset had the highest correlation with
the MMN template was selected. This procedure does not take into account the possibility
that the scalp distribution of a component highly correlated with the MMN template
may not be correlated at all with the MMN characteristic scalp distribution. However,
it was observed that this happens very rarely. Figure 6.22 shows the projection of the
component that was selected using such a procedure from the set of nine independent
components shown in Figure 6.20.
In some sub-datasets ICA did not find any component similar to the MMN template.
Even though the correlation similarity measure becomes meaningless in that situation,
the component with the highest correlation was still (incorrectly) labeled as MMN. To
reduce the effect of this on the training procedure of the network, we used for training
only those deviant waves belonging to sub-datasets in which a component with a
correlation with the template higher than a threshold was found. The threshold was
different for each subject and was calculated in such a fashion that only the 50 % of
the deviant waves in the training set with the highest correlations with the template
were presented as targets to the network during training. The correlation matching
procedure has the great advantage of being totally automatic and computationally
inexpensive. An alternative to this method is offered by more complex techniques for
the automatic detection of MMN components, such as the one proposed by A. Bazhyna
in his Master's Thesis [5].

Figure 6.21: Template used in the matching procedure for the selection of a MMN-like component

Figure 6.22: Projections back to the electrodes of one independent component that was automatically labeled as MMN correlated. Note the clear positive deflection in the frontal and central electrodes and the negative peak in the mastoids A1 and A2.
5. In each sub-dataset we extract the row of the ICA unmixing matrix corresponding to
the component labeled as MMN. Thus, in each sub-dataset we obtain a single weights
vector that we will call the MMN unmixing vector. We use this vector to update the
coefficients ($c_0, \ldots, c_9$ in Figure 6.19) used to combine the DWT features coming from
the nine electrodes into a single MMN channel. Thus, if we denote by $V_i^{MMN}$ the
MMN unmixing vector associated with sub-dataset $i$, by $S_i$ the indexes of the epochs
contained in sub-dataset $i$, and by $C_j$ the DWT coefficients matrix for a certain trial $j$,
the pattern vector associated with that trial will be:

$P_j = V_i^{MMN} C_j, \qquad j \in S_i$   (6.6)

A condensed sketch of steps 4 and 5 is given after this list.
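To make steps 4 and 5 concrete, here is a minimal sketch in Python. It uses scikit-learn's FastICA as a stand-in for the Infomax and JADE algorithms actually tested, and all array shapes and variable names are illustrative assumptions; the absolute value of the correlation is taken because the polarity of ICA components is arbitrary.

import numpy as np
from sklearn.decomposition import FastICA  # stand-in for Infomax/JADE

def mmn_pattern_vectors(subdataset, template, dwt_feats):
    """subdataset: (n_trials, n_channels, n_samples) filtered ERP epochs
    template:   (n_samples,) MMN component template
    dwt_feats:  (n_trials, n_channels, n_features) DWT features per trial
    Returns one pattern vector of shape (n_features,) per trial (eq. 6.6)."""
    n_trials, n_channels, n_samples = subdataset.shape
    # Concatenate the epochs in time so that ICA sees a (time, channels) record
    X = subdataset.transpose(0, 2, 1).reshape(-1, n_channels)
    ica = FastICA(n_components=n_channels, max_iter=1000, random_state=0)
    S = ica.fit_transform(X).reshape(n_trials, n_samples, n_channels)
    # Step 4: pick the component whose average activation over the epochs
    # best matches the MMN template (template matching by correlation)
    mean_act = S.mean(axis=0)                        # (n_samples, n_channels)
    corr = [abs(np.corrcoef(mean_act[:, k], template)[0, 1])
            for k in range(n_channels)]
    best = int(np.argmax(corr))
    # Step 5: the matching row of the unmixing matrix is the MMN unmixing vector
    v_mmn = ica.components_[best]                    # (n_channels,)
    # Equation (6.6): P_j = V_i^MMN C_j for every trial j in the sub-dataset
    return np.array([v_mmn @ dwt_feats[j] for j in range(n_trials)])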
6.6.3 Neural classifier

The final step of our system is the classification of the pattern vectors obtained as explained
in the last two sections. We have used two kinds of neural classifiers for this task: a three-layer
perceptron with Bayesian regularization and early stopping, and an ensemble of ten neural
networks with different random initial weights. The members of the ensemble were also three-
layer perceptrons with Bayesian regularization and early stopping. The input layer had 4
linear neurons, the hidden layer 6 tangent sigmoid neurons and the output layer 2 linear
neurons. The training algorithm in all cases was Levenberg-Marquardt backpropagation.
When using the ensemble, a trial was classified as a target if the majority of the members
agreed in classifying it as a target. In any case, we can modify the number of members (the
threshold) that must agree to classify one epoch as a target, depending on what is more
important for us: high correct classification rates or low false positive rates.
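A minimal sketch of this thresholded majority vote, assuming each ensemble member already outputs a binary target/non-target decision per epoch (the member networks themselves are not reproduced here):

import numpy as np

def ensemble_decision(member_votes, threshold=5):
    """member_votes: (n_members, n_epochs) array of 0/1 decisions,
    one row per network in the ensemble.
    An epoch is classified as a target when at least `threshold`
    members vote for it (default 5 out of 10, i.e. a majority)."""
    votes_for_target = member_votes.sum(axis=0)
    return votes_for_target >= threshold

# Example: 10 members voting on 4 epochs
votes = np.random.randint(0, 2, size=(10, 4))
print(ensemble_decision(votes))        # default majority threshold
print(ensemble_decision(votes, 8))     # stricter: fewer false positives

Raising the threshold trades correct detections for a lower false positive rate, which is exactly the trade-off described above.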
For training the neural classifier, 50 % of the available EEG data was used. The remaining 50 %
was used for testing the classification performance of the system. In general, we observed
that, as in the case of the DWT+ISODATA+NN method, the classification performance
obtained using the deviant waves as targets was better than the results obtained when the
difference waves were used. Hence, we focused our study on the use of deviant waves. The
results obtained for the single neural network when using the deviant responses to train the
network are shown in Table 6.10. The results obtained for the ensemble when using the default
threshold of 5 members are shown in Table 6.11.
Examining the results obtained, we can conclude that they are clearly better than those
obtained using the previous DWT+ISODATA+NN method. To be sure that the classification
was correct, we checked the shapes of the average waves obtained when averaging only the
responses classified as targets by the network. Those waves can be seen in Figures 6.23 and
6.24. A clear positive deflection in the frontal electrode and a negative peak in the mastoids
can be observed in the latencies of the MMN for all subjects.
If we assume that the signal recorded at the electrodes is a stable mixture of several brain
sources, there would be no point in applying ICA over small subsets of trials
instead of using the whole stream of trials to calculate a constant mixing matrix for the
MMN experiment. However, we have observed that increasing the size of the subdatasets
over 30 or 40 trials degrades the performance of the system significantly. In Figure 6.25 we
show the performance of the system for a subject with strong MMN deflection (ADHD1171)
and another with weak MMN (ADHD1160) when varying the size of the subdatasets to
which ICA is applied. We see that the optimum performance seems to be reached for a size
of around 20 or 30 trials. Similar results were obtained for the other subjects under study.
6.6.4 Discussion

The results obtained showed that the proposed system can be used for automatic detection of
the presence of MMN in single ERP trials. The best performance was obtained using JADE
as the ICA algorithm and an ensemble of 10 neural networks as the classifier. Using such a
system, the average detection ratio in deviant waves was 54.52 %, while a component similar
to the MMN was detected in only 18 % of the standard responses.
The detection ratios obtained by using JADE were better than the ones obtained with Info-
max for all subjects except ADHD1160. Furthermore, the convergence of JADE was much
faster than the convergence of Infomax (on the order of 10 times faster). The small number
of available channels makes the memory requirements of JADE acceptable and hence JADE
is clearly preferable over Infomax for our system.
Analysis of the selective averages obtained using the waves classified as targets shows that
the rejected trials clearly degrade the signal to noise ratio of the average. For
example, in subject ADHD1171, despite the fact that more than 70 % of the trials were used
in the selective average, the positive peak value in the channel Fz increased from 2.3 µV in
the grand average to 4 µV in the selective average of target trials. Similarly, the peak value of
the negative deflection in the mastoid A1 grew from 2.2 µV to 2.8 µV.
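For reference, a selective average of this kind is simply the mean over the epochs flagged as targets; a minimal sketch, assuming a boolean mask produced by the classifier and illustrative array shapes:

import numpy as np

# epochs: (n_trials, n_samples) deviant responses at one electrode, e.g. Fz
# is_target: boolean mask produced by the neural classifier (here random)
epochs = np.random.randn(200, 64)
is_target = np.random.rand(200) > 0.3

grand_average = epochs.mean(axis=0)                  # all trials
selective_average = epochs[is_target].mean(axis=0)   # target trials only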
Subject         ICA algorithm   Targets in        Targets in
                                standard waves    deviant waves
ADHD1171dev2    Infomax         20.86 %           58.29 %
                JADE            20.43 %           69.89 %
ADHD1160dev2    Infomax         19.57 %           44.74 %
                JADE            17.60 %           41.21 %
ADHD1167dev2    Infomax         17.40 %           57.54 %
                JADE            19.48 %           65.43 %
ADHD1165dev2    Infomax         11.35 %           55.32 %
                JADE            15.65 %           63.54 %
ADHD1168dev2    Infomax         15.34 %           47.49 %
                JADE            17.48 %           53.72 %
ADHD1176dev2    Infomax         27.80 %           37.90 %
                JADE            26.84 %           39.65 %

Table 6.10: Results for the single Neural Network using deviant responses as targets.

Subject         ICA algorithm   Targets in        Targets in
                                standard waves    deviant waves
ADHD1171dev2    Infomax         23.54 %           61.57 %
                JADE            19.12 %           72.43 %
ADHD1160dev2    Infomax         16.32 %           43.46 %
                JADE            17.64 %           40.10 %
ADHD1167dev2    Infomax         13.40 %           55.64 %
                JADE            19.48 %           64.43 %
ADHD1165dev2    Infomax         10.27 %           51.32 %
                JADE            13.65 %           62.21 %
ADHD1168dev2    Infomax         16.45 %           46.21 %
                JADE            14.48 %           53.72 %
ADHD1176dev2    Infomax         29.00 %           36.36 %
                JADE            24.32 %           34.24 %

Table 6.11: Results for the ensemble using deviant responses as targets.
Figure 6.23: Averages of the deviant waves classified as targets (red) and non-targets (blue) versus grand averages (black); panels (a)-(f) show subjects ADHD1171, ADHD1160 and ADHD1167 at electrodes A1 and Fz. The ICA algorithm used was JADE and the classifier was the ensemble of neural networks.
Figure 6.24: Averages of the deviant waves classified as targets (red) and non-targets (blue) versus grand averages (black); panels (a)-(f) show subjects ADHD1165, ADHD1168 and ADHD1176 at electrodes A1 and Fz. The ICA algorithm used was JADE and the classifier was the ensemble of neural networks.
Figure 6.25: Effect of the size of the subdatasets used for the moving window ICA: percentage of targets detected in standard and deviant waves for subjects ADHD1171 and ADHD1160, for subdataset sizes of 10, 30, 50, 75 and 100 trials.
The performance of DWT+ICA+NN is clearly better than that of DWT+ISODATA+NN. The main
drawback of the former over the latter is that the pattern generation process is quite slow
due to the ICA step. If the size of the windows of trials used as input to the ICA block is
quite small (on the order of 20 trials), the results are expected to be better than with
larger windows, but the computation time increases dramatically. Thus, DWT+ICA+NN
would be suitable exclusively for offline processing of ERP data. DWT+ISODATA+NN would be
more suitable for applications where the computation time is critical. The training phase of
DWT+ISODATA+NN is quite slow, due to the high dimensionality of the data that must
be clustered using ISODATA, but once the system has been trained the only computation
required is a DWT and feeding the 4-dimensional pattern vector of wavelet features into the
already trained neural classifier.
Both DWT+ICA+NN and DWT+ISODATA+NN were trained for each subject separately.
If we train the systems using data from one subject and then test the performance using
MMN data from a different subject, the correct classification ratios are considerably worse.
This constitutes a drawback, since the systems must be trained every time we want to analyze
a different subject. However, data from only 6 subjects were used in this Thesis, which is not
enough for testing the utility of the proposed methods in analyzing the MMN characteristics
of groups of subjects. A topic for further research could be to train the systems with
EEG recordings of a group of subjects with a certain disease that is reflected in the MMN and,
after that, to compare the detection ratios obtained for new subjects with that disease against
the detection ratios for healthy subjects.
Chapter 7
Conclusions
Event Related Potentials (ERP) constitute an important source of data for the investigation
of the human brain. Unfortunately, low signal to noise ratios and the presence of many
overlapping components make the analysis of single ERP realizations almost impossible. In
this Thesis we performed a time-frequency analysis of a real dataset with MMN deflection,
finding that most of the energy of that component is located in the frequency band from
5-15 Hz and in the time latencies of 125 ms to 225 ms after the presentation of a deviant
stimulus to the subject under study. These results suggested the use of time-frequency features
such as wavelets in order to detect the presence of MMN deflection in single ERP trials.
Two different approaches based on wavelet features have been proposed for the detection of
the MMN component in single trials or averages of small sets of trials. The first method
used ISODATA for obtaining a suitable training data set for a neural classifier. The results
obtained with that approach were acceptable considering the great variability of the single
ERP realizations. However, we concluded that wavelet features on their own were not able to
effectively characterize the MMN component. This is due to the overlap in the time-frequency
plane of many ERP and EEG components.
In the second proposed approach, ICA was used in combination with wavelet features. ICA,
due to its statistical nature, is able to separate overlapping components that cannot be
differentiated by means of their wavelet features alone. The performance of this
method proved to be quite promising, reaching correct detection ratios of MMN deflection
in deviant waves of up to 72 % and incorrect detection ratios in standard waves of 19 %
for subject ADHD1171. We can also conclude that the performance of JADE was clearly
superior to the performance of the classical Infomax algorithm.
This Thesis has also shown the usefulness of neural networks in the analysis and classification
of ERP data. The difficulty of extracting expert rules from single ERP trials makes
neural classifiers a very suitable solution when very little information about the characteristic
features of the target data is available.
Possible applications of the procedures developed in this Thesis include:

- Automatic classification of single ERP trials to obtain the probability of appearance of
MMN deflection in single trials of a certain subject.

- Generation of averages of selected trials that have clear MMN deflection in order to
obtain better signal to noise ratios.

- Analysis of dynamic features of the MMN component, like the distinct latencies and
amplitudes of the temporal and frontocentral components of the MMN.

- Characterization of MMN data from different subjects or clinical groups. We could
train the proposed systems with data from subjects sharing a common disease and, after that,
test the system with new subjects suspected of having the same disease. High correct
classification rates may indicate that the subject is affected by that disease.
Bibliography
[1] C. Alain, D. L. Woods, and R. T. Knight. A distributed cortical network for auditory
sensory memory in humans. Brain Research, 812:23-37, 1998.

[2] K. Alho, M. Tervaniemi, M. Huotilainen, J. Lavikainen, H. Tiitinen, R. J. Ilmoniemi,
J. Knuutila, and R. Naatanen. Processing of complex sounds in the human auditory
cortex revealed by magnetic brain responses. Psychophysiology, 33:369-375, 1996.

[3] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind source
separation. Advances in Neural Information Processing 8, pages 757-763, 1996.

[4] H. B. Barlow. Unsupervised learning. Neural Computation, 1:295-311, 1989.

[5] A. Bazhyna. Signal processing of biomedical data in application to human performance
monitoring in wireless telemedicine. Master's thesis, University of Jyvaskyla, Depart-
ment of Mathematical Information Technology, 2001.

[6] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separa-
tion and blind deconvolution. Neural Computation, 7(6):1004-1034, 1995.

[7] E. Callaway. Brain Electrical Potentials And Individual Psychological Differences. Grune
& Stratton Inc., 1975.

[8] J. F. Cardoso. Eigen-structure of the fourth-order cumulant tensor with application to
the blind source separation problem. In Proceedings ICASSP'90, pages 2655-2658, 1990.

[9] J. F. Cardoso. High-order contrasts for independent component analysis. Neural Com-
putation, 11:157-192, 1999.

[10] J. F. Cardoso. Infomax and maximum likelihood for blind source separation. IEEE
Signal Processing Letters, 4(4):112-114, 1997.

[11] J. F. Cardoso. Blind signal separation: statistical principles. In Proceedings of the IEEE,
volume 86, pages 2009-2025, 1998.
[12] J. F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In IEE
Proceedings F, volume 140, pages 362-370, 1993.

[13] P. Celsis, K. Boulanouar, B. Doyon, J. P. Ranjeva, I. Berry, J. L. Nespoulous, and
F. Chollet. Differential fMRI responses in the left posterior superior temporal gyrus
and left supramarginal gyrus to habituation and change detection in syllables and tones.
Neuroimage, 9:135-144, 1999.

[14] M. Cheour, P. H. T. Leppanen, and N. Kraus. Mismatch negativity (MMN) as a tool
for investigating auditory discrimination and sensory memory in infants and children.
Clinical Neurophysiology, 111:4-16, 2000.

[15] L. Cohen. Time-frequency distributions - a review. In IEEE Proceedings, volume 77,
pages 941-980, 1989.

[16] N. Cowan, I. Winkler, W. Teder, and R. Naatanen. Memory prerequisites of the mis-
match negativity in the auditory event-related potential (ERP). Journal of Experimental
Psychology: Learning, Memory, and Cognition, 19:909-921, 1993.

[17] V. Csepe, G. Karmos, and M. Molnar. Evoked potential correlates of stimulus deviance
during wakefulness and sleep in cat - animal model of mismatch negativity. Electroen-
cephalography and Clinical Neurophysiology, 66:571-578, 1987.

[18] V. Csepe, J. Osman-Sagi, M. Molnar, and M. Gosy. Impaired speech perception in aphasic
patients: event-related potential and neuropsychological assessment. Neuropsychologia,
39:1194-1208, 2001.

[19] V. Csepe, G. Pantev, M. Hoke, S. Hampson, and B. Ross. Evoked magnetic responses of
the human auditory cortex to minor pitch changes: Localization of the mismatch field.
Electroencephalography and Clinical Neurophysiology, 84:538-548, 1992.

[20] A. Delorme and S. Makeig. EEGLAB v4.0 for Matlab. Swartz Center for Computational
Neuroscience (SCCN), Institute for Neural Computation, University of California San
Diego (UCSD), 2002. http://www.sccn.ucsd.edu/eeglab/.

[21] D. Donoho, M. R. Duncan, X. Huo, and O. Levi. Wavelab 802 for Matlab 5.x.
http://www-stat.stanford.edu/~wavelab/.

[22] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons,
2nd edition, 2001.
[23] C. Escera, M. J. Corral, and E. Yago. An electrophysiological and behavioral investiga-
tion of involuntary attention towards auditory frequency, duration and intensity changes.
Cognitive Brain Research, 14:325-332, 2002.

[24] S. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances
in Neural Information Processing Systems, volume 2, pages 524-532, 1990.

[25] S. Fisk. Neural Control Of Music Synthesis. University of York. Available online at
http://www-users.york.ac.uk/~scf104/neuralmusic/.

[26] F. D. Foresee and M. T. Hagan. Gauss-Newton approximation to Bayesian regularization.
In Proceedings of the 1997 International Joint Conference on Neural Networks, pages
1930-1935, 1997.

[27] J. H. Friedman and J. H. Tukey. A projection pursuit algorithm for exploratory data
analysis. IEEE Transactions on Computers, 23(9):881-890, 1974.

[28] P. Gendron. Introduction to wavelet methods, November 1998.

[29] M. H. Giard, F. Perrin, J. Pernier, and P. Bouchet. Brain generators implicated in
processing of auditory stimulus deviance: a topographic event-related potential study.
Psychophysiology, 27:627-640, 1990.

[30] J. C. Goswami and A. K. Chan. Fundamentals of Wavelets: Theory, Algorithms and
Applications. John Wiley & Sons, Inc., 1999.

[31] C. Grau, M. D. Polo, E. Yago, A. Gual, and C. Escera. Auditory sensory memory
as indicated by mismatch negativity in chronic alcoholism. Clinical Neurophysiology,
112:728-731, 2001.

[32] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 12(10):993-1001, 1990.

[33] H. H. Harman. Modern Factor Analysis. University of Chicago Press, 1967.

[34] E. Huupponen. Advances in the detection of sleep EEG signal waveforms. PhD thesis,
Tampere University of Technology, 2002.

[35] A. Hyvarinen. Fast and robust fixed-point algorithms for independent component anal-
ysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.

[36] A. Hyvarinen. Survey on independent component analysis. Neural Computing Surveys,
2:94-128, 1999.
[37] A. K. Jain and J. Mao. Artificial neural networks: A tutorial. Computer, 1996.

[38] H. H. Jasper. The ten-twenty electrode system of the international federation. Elec-
troencephalography and Clinical Neurophysiology, 10:371-375, 1958.

[39] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

[40] T. Jung, S. Makeig, T. Lee, M. McKeown, G. Brown, A. Bell, and T. Sejnowski. Re-
moving electroencephalographic artifacts by blind source separation. Psychophysiology,
37:163-178, 2000.

[41] T. P. Jung, S. Makeig, M. J. McKeown, A. J. Bell, T. W. Lee, and T. J. Sejnowski.
Imaging of brain dynamics using independent component analysis. In IEEE Proceedings,
volume 89, pages 1107-1122, 2001.

[42] T. P. Jung, S. Makeig, T. W. Lee, M. J. McKeown, G. Brown, A. J. Bell, and T. J. Se-
jnowski. Independent component analysis of biomedical signals. In Second International
Workshop on Independent Component Analysis and Signal Separation, pages 633-644,
2000.

[43] T. Kalayci, O. Ozdamar, and N. Erdol. The use of wavelet transform as a preprocessor for
the neural network detection of EEG spikes. In Proceedings of the IEEE Southeastcon '94,
pages 1-3, 1994.

[44] M. F. Kelly, P. A. Parker, and R. N. Scott. The application of neural networks to
myoelectric signal analysis: a preliminary study. IEEE Transactions on Biomedical
Engineering, 37(3):221-230, 1990.

[45] L. Khadra, H. Dickhaus, and A. Lipp. Representations of ECG late potentials in the
time frequency plane. Journal on Medical Engineering and Technology, 17(6):228-231,
1993.

[46] L. Khadra, M. Matalgah, B. El-Asir, and S. Mawagdeh. The wavelet transform and its
applications to phonocardiogram signal analysis. Medical Informatics, 16(3):271-277,
1991.

[47] N. Kraus, T. McGee, T. Littman, T. Nicol, and C. King. Encoding of acoustic change
involves non-primary auditory thalamus. J. Neurophysiology, 72:1270-1277, 1994.

[48] J. D. Kropotov, R. Naatanen, A. V. Sevostianov, K. Alho, K. Reinikainen, and O. V.
Kropotova. Mismatch negativity to auditory stimulus change recorded directly from the
human temporal cortex. Psychophysiology, 32:418-422, 1995.
[49] T. Kujala and R. Naatanen. The mismatch negativity in evaluating central auditory
dysfunction in dyslexia. Neuroscience and Biobehavioral Reviews, 25:535-543, 2001.

[50] M.I.T. Lincoln Laboratory, editor. DARPA Neural Network Study. AFCEA Interna-
tional Press, 1988.

[51] S. Makeig, T. P. Jung, A. J. Bell, D. Ghahremani, and T. J. Sejnowski. Blind separation
of auditory event-related brain responses into independent components. In Proceedings
of the National Academy of Sciences, volume 94, pages 10979-10984, 1997.

[52] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representa-
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693,
1989.

[53] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.

[54] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, volume 5, pages 115-133, 1943.

[55] Y. Meyer. Wavelets: Algorithms and Applications. SIAM, 1993.

[56] P. T. Michie. What has MMN revealed about the auditory system in schizophrenia?
International Journal of Psychophysiology, 42:177-194, 2001.

[57] R. Naatanen. Attention and brain functions. Lawrence Erlbaum Associates, 1992.

[58] R. Naatanen. Mismatch negativity: perspectives for application. International Journal
of Psychophysiology, 37:3-10, 2000.

[59] R. Naatanen, A. Lehtokoski, M. Lennes, et al. Language-specific phoneme represen-
tations revealed by electric and magnetic brain responses. Nature, 385:432-434, 1997.

[60] R. Naatanen, P. Paavilainen, K. Alho, K. Reinikainen, and M. Sams. Do event-related
potentials reveal the mechanism of the auditory sensory memory in the human brain?
Neuroscience Letters, 98:217-221, 1989.

[61] Ö. Özdamar and T. Kalayci. Detection of spikes with artificial neural networks using
raw EEG. Computers and Biomedical Research, 31:122-124, 1998.

[62] E. Pekkonen, J. Hirvonen, I. P. Jääskeläinen, S. Kaakkola, and J. Huttunen. Auditory
sensory memory and the cholinergic system: Implications for Alzheimer's disease. Neu-
roImage, 14:376-382, 2001.
[63] E. Pekkonen, V. Jousmäki, K. Reinikainen, and J. Partanen. Automatic auditory dis-
crimination is impaired in Parkinson's disease. Electroencephalography and Clinical Neu-
rophysiology, 95:47-52, 1995.

[64] T. W. Picton, C. Alain, L. Otten, W. Ritter, and A. Achim. Mismatch negativity:
different water in the same river. Audiol. Neurootol., 5:111-139, 2000.

[65] E. Pihko, P. Leppasaari, and H. Lyytinen. Brain reacts to occasional changes in the
duration of elements in a continuous sound. Neuroreport, 6:1215-1218, 1995.

[66] R. Polikar. The engineer's ultimate guide to wavelet analysis. Signal Processing and
Pattern Recognition Laboratory (SPPRL) at Rowan University. Available online at
http://engineering.rowan.edu/~polikar/WAVELETS/WTtutorial.html.

[67] R. Polikar, M. H. Greer, L. Udpa, and F. Keinert. Multiresolution wavelet analysis of
ERPs for the detection of Alzheimer's disease. In Proceedings of the 19th IEEE/EMBS
International Conference, volume 30, 1997.

[68] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection.
Pattern Recognition Letters, 15:1119-1125, 1994.

[69] S. Ramakrishnan. Wavelet domain nonlinear filtering for evoked potential signal en-
hancement. Computers and Biomedical Research, 55, 2000.

[70] T. Rinne, K. Alho, R. J. Ilmoniemi, J. Virtanen, and R. Naatanen. Separate time
behaviors of the temporal and frontal mismatch negativity sources. Neuroimage, 12:14-
19, 2000.

[71] M. D. Rugg and M. G. H. Coles. Electrophysiology of mind: Event-related brain potentials
and cognition. Oxford University Press, 1995.

[72] D. Rusanovskyy. Applying the wavelet transformation and neural network techniques to
the detection of mismatch negativity from EEG. Master's thesis, University of Jyvaskyla,
Department of Mathematical Information Technology, 2001.

[73] M. Salmi. DSAMP v4.7 User's and Programmer's Manual. University of Jyvaskyla,
Department of Psychology, 1990.

[74] M. Sams, P. Paavilainen, K. Alho, and R. Naatanen. Auditory frequency discrimina-
tion and event-related potentials. Electroencephalography and Clinical Neurophysiology,
62:437-448, 1985.
[75] S. J. Schiff, J. Milton, J. Heller, and S. L. Weinstein. Wavelet transforms and surrogate
data for electroencephalographic spike and seizure localization. Optical Engineering,
33(7):2162-2169, 1994.

[76] M. Tervaniemi, S. V. Medvedev, K. Alho, S. V. Pakhomov, M. S. Roudas, T. L. Van
Zuijen, and R. Naatanen. Lateralized automatic auditory processing of phonetic versus
musical information: a PET study. Human Brain Mapping, 10:74-79, 2000.

[77] S. Tervaniemi. Toward the optimal recording and analysis of the mismatch negativity.
Audiol. Neurootol., 5:235-246, 2000.

[78] The Mathworks, Inc. Neural Network Toolbox User's Guide.

[79] J. T. Tou and R. C. Gonzalez. Pattern Recognition Principles. Addison-Wesley, 1974.

[80] L. J. Trejo and M. J. Shensa. Feature extraction of event-related potentials using
wavelets: An application to human performance monitoring. Brain and Language, 1999.

[81] Cognitive Brain Research Unit (CBRU). Mismatch negativity and EEG overview. Uni-
versity of Helsinki.

[82] M. Unser. A review of wavelets in biomedical applications. In Proceedings of the IEEE,
volume 84, April 1996.

[83] C. Valens. A really friendly guide to wavelets. Available online at
http://perso.wanadoo.fr/polyvalens/clemens/wavelets/wavelets.html.

[84] I. Winkler, N. Cowan, V. Csepe, I. Czigler, and R. Naatanen. Interactions between tran-
sient and long-term auditory memory as reflected by the mismatch negativity. Journal
of Cognitive Neuroscience, 6:403-415, 1996.

[85] I. Winkler, G. Karmos, and R. Naatanen. Adaptive modeling of the unattended acous-
tic environment reflected in the mismatch negativity event-related potential. Brain Re-
search, 742:239-252, 1996.

[86] Z. Xiao-Ping and M. D. Desai. Nonlinear adaptive noise suppression based on wavelet
transform. In Proceedings of the 1998 International Conference on Acoustics, Speech,
and Signal Processing, volume 3, pages 1589-1592, 1998.

[87] Z. H. Zhou, J. X. Wu, Y. Jiang, and S. F. Chen. Genetic algorithm based selective neural
network ensemble. In Proceedings of the International Joint Conference on Artificial
Intelligence, volume 2, pages 797-802, 2001.
[88] Z. H. Zhou, J. X. Wu, W. Tang, and Z. Q. Chen. Selectively ensembling neural classifiers.
In Proceedings of the International Joint Conference on Neural Networks, volume 2,
pages 1411-1415, 2002.
Appendix A
K-means and ISODATA
In this Thesis we have used clustering for unsupervised preprocessing of the MMN dataset
in order to obtain a cluster representative of the MMN component and free of outliers. We
have seen that such a cluster can be used for training a neural classifier in the task of detecting
MMN deflection in single ERP trials. ISODATA is based on the same principles as the very
well known k-means clustering algorithm. In this appendix we explain very briefly how
these two algorithms work. For a detailed description of the algorithms we refer to [22]
and [79].
A.1 K-means
The k-means method aims to minimize the sum of squared distances between all points and
the corresponding cluster centre. The procedure consists of the following steps, as described
by Tou and Gonzalez [79]:

1. Choose $K$ initial cluster centres $z_1(1), \ldots, z_K(1)$.

2. At the $k$-th iterative step, distribute the samples among the $K$ clusters using the relation

   $x \in C_j(k)$ if $\|x - z_j(k)\| < \|x - z_i(k)\|$   (A.1)

   for all $i = 1, \ldots, K$, $i \neq j$, where $C_j(k)$ denotes the set of samples whose cluster centre
   is $z_j(k)$.

3. Compute the new cluster centres $z_j(k+1)$, $j = 1, \ldots, K$, such that the sum of the squared
   distances from all points in $C_j(k)$ to the new cluster centre,

   $J_j = \sum_{x \in C_j(k)} \|x - z_j(k+1)\|^2$   (A.2)

   is minimized. Assuming Euclidean distance, $z_j(k+1)$ is simply the sample mean of $C_j(k)$.

4. If all cluster centres are unchanged, i.e. $z_j(k+1) = z_j(k)$ for all $j$, terminate. Otherwise, go
   to step 2.
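A compact sketch of these four steps in Python (random initial centres and Euclidean distance are assumed, and a fixed iteration cap guards against slow convergence):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic k-means following steps 1-4 above.
    X: (N, n) data matrix. Returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial cluster centres (random samples)
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: move each centre to the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(K)])
        # Step 4: stop when the centres no longer change
        if np.allclose(new, centres):
            break
        centres = new
    return centres, labels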
A.2 ISODATA
ISODATA is similar to the k-means algorithm but it incorporates a set of heuristic proce-
dures. The algorithm consists of these steps [79]:

1. Initialize $N_c$ cluster centres and specify the following parameters:

   $K$ = desired number of cluster centres.
   $K_{min}$ = minimum number of clusters.
   $K_{max}$ = maximum number of clusters.
   $\theta_N$ = the minimum number of samples in a cluster.
   $\theta_S$ = standard deviation parameter.
   $\theta_c$ = lumping parameter.
   $L$ = maximum number of pairs of cluster centres which can be lumped.
   $I$ = number of allowed iterations.

2. Distribute the $N$ samples among the present $N_c$ cluster centres, by

   $x \in C_j$ if $\|x - z_j\| < \|x - z_i\|$, $i = 1, \ldots, N_c$, $i \neq j$   (A.3)

3. Discard sample subsets with fewer than $\theta_N$ members, i.e. discard any cluster $C_j$ such
   that $\mathrm{card}(C_j) < \theta_N$. Update $N_c$.

4. Update each cluster centre $z_j$, $j = 1, \ldots, N_c$, by setting it equal to the mean of $C_j$:

   $z_j = \frac{1}{\mathrm{card}(C_j)} \sum_{x \in C_j} x$, $\quad j = 1, \ldots, N_c$   (A.4)

5. Compute the average distance $\bar{D}_j$ of samples in $C_j$ from their centre:

   $\bar{D}_j = \frac{1}{\mathrm{card}(C_j)} \sum_{x \in C_j} \|x - z_j\|$, $\quad j = 1, \ldots, N_c$   (A.5)

6. Compute the overall average distance of samples from their respective cluster centres:

   $\bar{D} = \frac{1}{\sum_{j=1}^{N_c} \mathrm{card}(C_j)} \sum_{j=1}^{N_c} \mathrm{card}(C_j) \, \bar{D}_j$   (A.6)

7. If this is the last iteration: set $\theta_c = 0$ and go to step 11.
   Else, if this is an even-numbered iteration: go to step 11.
   Else, if $N_c \geq K_{max}$: go to step 11.
   Else: go to the next step.

8. Find the standard deviation vector $\sigma_j = (\sigma_{1j}, \sigma_{2j}, \ldots, \sigma_{nj})^T$ for each cluster, using

   $\sigma_{ij} = \sqrt{\frac{1}{\mathrm{card}(C_j)} \sum_{x \in C_j} (x_{ik} - z_{ij})^2}$, $\quad i = 1, \ldots, n$, $j = 1, \ldots, N_c$   (A.7)

   where $n$ is the dimensionality of the data, $x_{ik}$ is the $i$-th component of the $k$-th sample in
   $C_j$ and $z_{ij}$ is the $i$-th component of $z_j$.

9. Find the maximum component of each $\sigma_j$ and denote it by $\sigma_{j\,max}$.

10. If for any $\sigma_{j\,max}$ we have $\sigma_{j\,max} > \theta_S$ and either

    $\bar{D}_j > \bar{D}$ and $\mathrm{card}(C_j) > 2(\theta_N + 1)$, or $N_c \leq K_{min}$,

    then split $C_j$. Update $N_c = N_c + 1$.
    If there has been splitting in this step then go to step 2. Otherwise go to the next step.

11. Compute the pairwise distances $D_{ij}$ between all cluster centres:

    $D_{ij} = \|z_i - z_j\|$, $\quad i = 1, \ldots, N_c - 1$, $j = i+1, i+2, \ldots, N_c$   (A.8)

12. Compare the pairwise distances $D_{ij}$ against the parameter $\theta_c$. Arrange the $L$ smallest
    distances which are less than $\theta_c$ in ascending order: $D_{i_1 j_1} \leq D_{i_2 j_2} \leq \ldots \leq D_{i_L j_L}$.

13. Pairwise lumping. Start with $D_{i_1 j_1}$. The corresponding cluster centres are $z_{i_1}$ and $z_{j_1}$.
    Merge these two cluster centres into a new centre $z^*_1$:

    $z^*_1 = \frac{1}{\mathrm{card}(C_{i_1}) + \mathrm{card}(C_{j_1})} \left[ \mathrm{card}(C_{i_1}) \, z_{i_1} + \mathrm{card}(C_{j_1}) \, z_{j_1} \right]$   (A.9)

    Update $N_c = N_c - 1$. Repeat this step for all $2 \leq k \leq L$ for the centres which have not
    already been lumped in this iteration.

14. Terminate if this is the last iteration. Otherwise, go to step 1 if the parameters require
    change, or go to step 2. Update the iteration number.
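A simplified Python sketch of this procedure follows. Several bookkeeping details above are deliberately reduced (only the single closest pair is lumped per iteration instead of up to L pairs, and the split offset along the largest-variance dimension is one common heuristic rather than the textbook prescription), so this is an assumption-laden illustration, not a faithful reimplementation; parameter names mirror step 1.

import numpy as np

def isodata(X, Nc=3, K_min=2, K_max=10, theta_N=5,
            theta_S=1.0, theta_c=1.0, I=10, seed=0):
    """Simplified ISODATA: k-means-style assignment plus
    discard / split / lump heuristics (steps 2-13 above)."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=Nc, replace=False)]        # step 1
    for it in range(I):
        # step 2: assign samples to the nearest centre
        d = np.linalg.norm(X[:, None] - Z[None], axis=2)
        labels = d.argmin(axis=1)
        clusters = [X[labels == j] for j in range(len(Z))]
        # step 3: discard clusters with fewer than theta_N members
        clusters = [c for c in clusters if len(c) >= theta_N]
        # step 4: recompute centres as cluster means
        Z = np.array([c.mean(axis=0) for c in clusters])
        # steps 5-6: per-cluster and overall average distances
        Dj = np.array([np.linalg.norm(c - z, axis=1).mean()
                       for c, z in zip(clusters, Z)])
        D = np.average(Dj, weights=[len(c) for c in clusters])
        last = (it == I - 1)
        # step 7: split on odd (1-based) iterations, lump otherwise
        if not last and it % 2 == 0 and len(Z) < K_max:
            new_Z, split = [], False
            for c, z, dj in zip(clusters, Z, Dj):             # steps 8-10
                s_max = c.std(axis=0).max()
                if (s_max > theta_S and
                        ((dj > D and len(c) > 2 * (theta_N + 1))
                         or len(Z) <= K_min)):
                    delta = np.zeros_like(z)
                    delta[c.std(axis=0).argmax()] = 0.5 * s_max
                    new_Z += [z + delta, z - delta]           # split C_j
                    split = True
                else:
                    new_Z.append(z)
            Z = np.array(new_Z)
            if split:
                continue                                      # back to step 2
        # steps 11-13: lump the closest pair of centres if nearer than theta_c
        tc = 0.0 if last else theta_c
        if len(Z) > K_min:
            i, j = np.triu_indices(len(Z), k=1)
            dist = np.linalg.norm(Z[i] - Z[j], axis=1)
            k = dist.argmin()
            if dist[k] < tc:
                ni, nj = len(clusters[i[k]]), len(clusters[j[k]])
                merged = (ni * Z[i[k]] + nj * Z[j[k]]) / (ni + nj)  # (A.9)
                Z = np.vstack([np.delete(Z, [i[k], j[k]], axis=0), merged])
    return Z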