www.elsevier.com/locate/medengphy
Technical note
Pathological voice quality assessment using artificial neural networks
R.T. Ritchings a,∗, M. McGillion a, C.J. Moore b
a Department of Computer Science, University of Salford, Salford, UK
b North Western Medical Physics, Christie Hospital NHS Trust, Manchester, UK
Abstract
This paper describes a prototype system for the objective assessment of voice quality in patients recovering from various stages
of laryngeal cancer. A large database of male subjects steadily phonating the vowel /i/ was used in the study, and the quality of
their voices was independently assessed by a speech and language therapist (SALT) according to their seven-point ranking of
subjective voice quality. The system extracts salient short-term and long-term time-domain and frequency-domain parameters from
impedance (EGG) signals and these are used to train and test an artificial neural network (ANN). Multi-layer perceptron (MLP)
ANNs were investigated using various combinations of these parameters, and the best results were obtained using a combination
of short-term and long-term parameters, for which an accuracy of 92% was achieved. It is envisaged that this system could be
used as an assessment tool, providing a valuable aid to the SALT during clinical evaluation of voice quality. © 2002 IPEM.
Published by Elsevier Science Ltd. All rights reserved.
Keywords: Intelligent systems; Neural networks; Electrolaryngograph signals; Voice quality; Larynx cancer
1350-4533/02/$22.00 © 2002 IPEM. Published by Elsevier Science Ltd. All rights reserved.
PII: S1350-4533(02)00064-4
562 R.T. Ritchings et al. / Medical Engineering & Physics 24 (2002) 561–564
2. Data capture

The data used to develop the system were captured off-line under clinical conditions at the Christie and Withington Hospitals in Manchester, using an Electrolaryngograph PCLX system [4]. This system is used to capture electrical impedance (EGG) signals via pads placed either side of the neck, synchronously with acoustic signals captured using a microphone. Both EGG and acoustic data channels were captured synchronously at 20 kHz for up to 3 s while the subject phonated the vowel /i/ as steadily as possible. Other vowel sounds were also recorded; however, the /i/ vowel is most frequently used by SALTs, as the onset of phonation occurs more quickly than in other vowels and is a good indication of vocal fold health.

The EGG signal provides information about vocal fold contact behaviour during voice production, as the electrical impedance varies with the opening and closing of the glottis. This signal is modulated by the resonant cavities of the vocal, oral and nasal tracts to produce the acoustic signal. As the EGG signal is much less complex than the acoustic signal, its visual appearance is used by the SALT in conjunction with their perception of the acoustic sound when making their assessment of voice quality.

Although speech data were recorded for both male and female patients, the largest pathological group was male, so it is these speech signals that were used in the study. Implicit in this is that the SALT had made a voice quality assessment of each patient using their own seven-point ranking.

3. Data processing

An automated voicing analysis was performed upon each 3 s EGG and acoustic speech signal to determine whether the subject had voiced during phonation. If voicing was considered to have occurred, the EGG signal was processed to extract first the long-term features and then the short-term features for classification of voice quality. The long-term features include the mean of the fundamental frequency f0 (Mf0), the standard deviation of f0 (SDf0), and the percentage of the 3 s signal that is voiced (V+), while the short-term features include parameters related to the spectral envelope of the first few glottal harmonics, and the glottal noise.

The voicing test involved taking 50 ms frames from the signals and applying cepstral analysis techniques [5,6] to identify the voiced frames. Each frame was then pre-emphasised by forward differencing to reduce the effects of drifting signal amplitude, and its autocovariance was multiplied by a Hanning window, prior to transformation to the frequency domain using the fast Fourier transform [7]. An estimate of f0 for each frame, deduced from the voicing analysis, was used to derive the FHN normalised spectral representation. This process removed the large observed inter-patient variability in f0 and its harmonics, thus allowing a more effective modelling of the spectral envelope among groups of patients. Once the FHN spectrum had been determined, Gaussians were fitted to the data around f0 and its first few harmonics [8]. Each Gaussian, Gh (h=0 up to typically 8), was parameterised as:

Gh = (positionh, widthh, amplitudeh)

It was observed that the mixture of Gaussians gave a better 'fit' to the FHN spectrum for the less abnormal patients, and so a parameter related to the goodness of fit, called the harmonic linearity measure (HLM), was calculated for each frame. Finally, as glottal noise is considered to be an important measure of voice quality, a parameter based on the normalised noise energy (NNE) [9,10], but derived from the FHN spectrum, FHNNE, was calculated for the data.

The parameters extracted from the speech data and used for the ANN classification tests comprised three long-term parameters (Mf0, SDf0, V+) and 18 short-term parameters (f0, G1, G2, G3, G4, G5, HLM, FHNNE). Full details of the data processing and extraction of these parameters can be found in McGillion [11].

4. Data classification

In total, 77 abnormal speech signals were available for training and testing. For each of the seven classes, 450 patterns were used for training/validation and 200 for testing. Unfortunately, as a result of the relatively small dataset, there were different numbers of patients in each class. As it is desirable to have equal numbers in each class to train an ANN adequately, additional frames were taken from some patients in classes with the fewest patients, and a small percentage of extra patterns was produced by adding normally distributed noise to the short-term features derived from these frames.

A two-layer, seven-output MLP, as shown schematically in Fig. 1, was trained using the back-propagation training algorithm, softmax activation function, and cross-entropy error function [12]. The advantage of using the softmax activation function is that the output across all seven classes sums to 1.0 and can therefore be interpreted as a probability of membership of each of the seven classes. A further constraint placed upon the MLP is that for any single class to be declared the 'winner' the output for that class must be greater than 50% (0.5). MLP structures with different numbers of hidden nodes and subsets of the 21 input parameters were investigated in order to determine the combination that provided the minimum classification error.
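The classifier described above (a two-layer MLP with softmax outputs, a cross-entropy error, and a 50% winner rule) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the hidden-layer size, the tanh hidden activation, and the placeholder weights are assumptions, while the 21 inputs, seven outputs, and 0.5 threshold come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUTS, N_HIDDEN, N_CLASSES = 21, 10, 7   # 21 input parameters, 7 SALT classes

# Placeholder weights; in the study these were learned by back-propagation.
W1 = rng.normal(scale=0.1, size=(N_INPUTS, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, N_CLASSES))
b2 = np.zeros(N_CLASSES)

def softmax(z):
    # Outputs are positive and sum to 1.0, so they read as class probabilities.
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def mlp_forward(x):
    h = np.tanh(x @ W1 + b1)         # hidden layer (tanh is an assumed activation)
    return softmax(h @ W2 + b2)      # seven-class probability vector

def cross_entropy(p, target_class):
    # Cross-entropy error for a one-hot target, as minimised during training.
    return -np.log(p[target_class])

def classify(x, threshold=0.5):
    # Winner rule: a class is declared only if its output exceeds 50% (0.5).
    p = mlp_forward(x)
    k = int(np.argmax(p))
    return k if p[k] > threshold else None
```

With untrained placeholder weights the outputs stay near the uniform 1/7, so `classify` returns `None`; after training, a confidently classified feature vector yields a single winning class index.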
Fig. 2. The MLP estimate of class probability for the SALT’s pre-classified class 3 abnormals.
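The frame-level f0 estimation described in Section 3 (50 ms frames, pre-emphasis by forward differencing, a Hanning window, an FFT, and cepstral analysis) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the 20 kHz rate and 50 ms frames come from the paper, while the 50–500 Hz search band and the synthetic test signal are assumptions for a male voice.

```python
import numpy as np

FS = 20_000                  # 20 kHz sampling rate, as in the paper
FRAME = int(0.05 * FS)       # 50 ms frames, as in the voicing test

def frame_f0(frame, fmin=50.0, fmax=500.0):
    # Pre-emphasis by differencing to reduce drifting signal amplitude.
    frame = np.diff(frame, prepend=frame[0])
    # Hanning window before transformation to the frequency domain.
    frame = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(frame))
    # Real cepstrum: inverse FFT of the log-magnitude spectrum.
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    # A voiced frame shows a cepstral peak at quefrency FS/f0 samples.
    qmin, qmax = int(FS / fmax), int(FS / fmin)
    q = qmin + int(np.argmax(cepstrum[qmin:qmax]))
    return FS / q

# Quick check on a synthetic, steadily phonated frame at 120 Hz
# (a band-limited sawtooth standing in for a glottal waveform).
t = np.arange(FRAME) / FS
frame = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 80))
f0 = frame_f0(frame)
```

A rahmonic peak at twice the quefrency can compete with the true peak, which is one reason cepstral f0 estimates are usually restricted to a plausible band, as done here.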