Professional Documents
Culture Documents
Voice Pathology Detection Based On The Modified Voice Contour and SVM
Voice Pathology Detection Based On The Modified Voice Contour and SVM
Voice Pathology Detection Based On The Modified Voice Contour and SVM
Available at www.sciencedirect.com
ScienceDirect
LETTER
Received 16 September 2015; received in revised form 1 October 2015; accepted 25 October 2015
KEYWORDS Abstract
Artificial intelligence; In this study, a novel method based on the voice intensity of a speech signal is used for automatic
Voice disorder detection;
pathology detection with continuous speech. The proposed method determines the peaks from
Modified voice contour;
the speech signal to form a voice contour. The area under the voice contour allows us to discrim-
Continuous speech;
Simpson’s rule inate between normal and disordered subjects. In the case of disordered subjects, the calculated
area under the voice contour is lower than that for a normal subject due to the malfunctioning of
vocal folds, which makes the voice weaker and breathier. Some long-term features such as shim-
mer and jitter are based on the accurate estimation of fundamental frequency, which is itself a
difficult task, especially for disordered speech signals. The proposed feature does not need to
estimate the pitch period or fundamental frequency during the calculation of the voice contour
and they provide a single value for the whole utterance similar to other long-term features. The
voice disorder database used in this study includes 71 normal subjects and the same number of
disordered subjects. Each disordered subject has one of the following voice disorders: vocal folds
cysts, laryngopharyngeal reflux disease, vocal folds polyps, unilateral vocal folds paralysis and
sulcus vocalis. The accuracy of the proposed method is 100%.
Ó 2015 Elsevier B.V. All rights reserved.
* Corresponding author at: Centre for Intelligent Signal and Imaging Research (CISIR), Department of Electrical and Electronic Engineering,
Universiti Teknologi PETRONAS, Perak, Malaysia.
E-mail addresses: zuali@ksu.edu.sa, zulfiqar_g02579@utp.edu.my (Z. Ali), msuliman@ksu.edu.sa (M. Alsulaiman), irraivan_elamvazuthi@
petronas.com.my (I. Elamvazuthi), Ghulam@ksu.edu.sa (G. Muhammad), tmesallam@ksu.edu.sa (T.A. Mesallam), mfarahat@ksu.edu.sa
(M. Farahat), kalmalki@ksu.edu.sa (K.H. Malki).
http://dx.doi.org/10.1016/j.bica.2015.10.004
2212-683X/Ó 2015 Elsevier B.V. All rights reserved.
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
2 Z. Ali et al.
Normal Subject, First 500 Samples Disordered Subject, First 500 Samples
1 1
0.8
0.6
0.5
0.4
0.2
Amplitude
Amplitude
0 0
-0.2
-0.4
-0.5
-0.6
-0.8
-1 -1
0 100 200 300 400 500 0 100 200 300 400 500
Number of Samples Number of Samples
(a) (b)
Fig. 1 Amplitudes in voice samples of (a) normal subjects and (b) disordered subjects.
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
Voice pathology detection based on the modified voice contour and SVM 3
studies (Anusuya & Katti, 2010; Childers & Sung-Bae, 1992; section’ provides information about the experimental
Gelzinis, Verikas, & Bacauskiene, 2008; Marinaki, setup, and explains the baseline results as well as the
Kotropoulos, Pitas, & Maglaveras, 2004; Neto, Costa, results of the proposed method. ‘Discussion section’ pre-
Fechine, & Muppah, 2007) to develop voice pathology sents a discussion and provides a comparison with existing
assessment systems. The correct acceptance rate of 73% methods. Finally, ‘Conclusion section’ draws some
with LPC and 73% with LPCC was obtained in Neto et al. conclusion.
(2007), when edema was detected from normal samples
and other pathologies, cysts, nodules, paralysis and Method
polyps. The efficiencies of LPC and LPCC were 85% and
80%, respectively. To conduct the study, 120 subjects,
The proposed method is based on the idea that normal and
including 67 patients and 53 normal persons from the Mas-
dysphonic patients can be classified by measuring the voice
sachusetts Eye & Ear Infirmary (MEEI) database
intensity of a speech signal. The method is never used for
(Massachusetts Eye & Ear Infirmary Voice & Speech Lab,
voice pathology detection. The method measures the voice
1994), were considered and experiments were performed
intensity of an utterance by calculating the area under the
by using the sustained vowel /a/. MFCC coefficients were
MVC. The contour is obtained by normalizing the cubic poly-
also calculated to make a comparison with LPC and LPCC
nomial fitted through the peaks. These peaks are found from
and they achieved an efficiency of 52%, very low com-
each frame of the utterance, and Simpson’s rule is used to
pared to LPC and LPCC. The high false acceptance rate
calculate the area under the contour.
of 74% showed that MFCC was unable to detect edema
The method is divided into five major components, and
from other pathologies as well as other features. For the
these are grouped into three steps: (1) frame blocking and
second case, when all normal persons were grouped into
peak calculation, (2) fitting and adjustment of a polynomial
one class and all pathologies were combined in the second
and (3) calculation of the area under the MVC by using Simp-
class, the results for MFCC were much better than the
son’s rule. To make the decision between normal and disor-
other features, which shows that MFCC performed well
dered speech, a binary classifier SVM is used. A block
for the detection of disorders but less well for the discrim-
diagram for the proposed AVPD system is shown in Fig. 2.
ination between the types of disorders. As for Fourier
To determine the MVC, peaks are found after blocking
transformation-based MFCC (Bou-Ghazale & Hansen,
the whole speech signal into frames. The length of each
2000; Murphy & Akande, 2007), non-parametric features
frame is 32 ms, and it contains 512 samples. The peaks
simulate the behavior of the human ear, and the study
higher than a certain threshold value thresh are determined
also concluded that MFCC behaves like a clinician, because
in each frame. A frame showing the calculated peaks, with
for a clinician, it is also easier to detect a disorder by
thresh = 0.05, is depicted in Fig. 3(a).
auditory perception than to discriminate between disor-
After the calculation of the peaks in each frame, they are
ders. By contrast, LPC and LPCC, as parametric features,
joined together. Then, a polynomial of degree three, g(x), is
represent the human vocal tract system, and hence may
fitted through these peaks to form a curve as shown in Fig. 3
perform better in the classification of VFDs. In Gelzinis
(b). The curve is composed by circles. As observed in Fig. 3
et al. (2008), MFCC and LPC fed into SVM and k-nearest
(b), the fitted polynomial passes through the peak points,
neighbors for the classification of three classes: healthy,
and hence does not make an envelope over the joined peak
diffuse and nodular. The database used in the study con-
points. Therefore, to get an MVC, a factor is added in the
tained sustained vowels only as recorded at the Depart-
polynomial g(x). The factor is given by Eq. (1) as:
ment of Medicine, Lithuania, The classification rate
obtained for MFCC was 73.08% and that for LPC was factor ¼ ðmaxðpeaksÞ maxðgÞÞ 0:70 ð1Þ
67.31%.
where peaks is a vector containing the peaks of all frames
In Ali, Alsulaiman, Muhammad, Elamvazuthi, and
and g is a vector containing all the points on the fitted poly-
Mesallam (2013), an MFCC-based system for disorder detec-
nomial. After adding the factor into the fitted polynomial g
tion by using text-dependent running speech was devel-
(x), the MVC composed by ‘diamonds’, shown in Fig. 3(b), is
oped. The running speech of a limited number of normal
obtained as:
and disordered subjects (12 and 26, respectively) was used
to evaluate the developed system. An accuracy of 91.66% MVC ¼ gðxÞ þ factor ð2Þ
was reported. The Gaussian mixture model was imple- The area under the MVC is calculated with Simpson’s rule
mented as the classification technique with a varying num- of numerical integration. SVM takes the area under the MVC
ber of Gaussian mixtures. A limited number of samples to make the decision about the presence of voice pathology.
were used for the experiments; therefore, no reliable con- Dysphonic patients are represented by +1 and normal per-
clusions could be made. sons are represented by 1. SVM is implemented by using
In this paper, a feature based on the voice intensity of a LIBSVM (Chang & Lin, 2011) with a radial basis function as
speech signal is proposed. An AVPD system by means of con- a kernel, given by Eq. (3).
nected speech is developed, and SVM is used for the classi-
fication of normal and pathological speech samples. Kðx; x 0 Þ ¼ exp ckx x 0 k2 ð3Þ
The rest of the paper is organized as follows. ‘Method
section’ explains the proposed method. ‘Data’ describes where x is the training sample, x0 is the testing sample and c
the speech database. ‘Experimental setup and results is a free parameter. SVM is a linear classifier; however, in
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
4 Z. Ali et al.
Determination Fitting of a
Divided into of Peaks in Polynomial
Frames each Frame through Peaks
Modified Voice
Area under Calculation of Contour Adjustment of
SVM Area under the Polynomial
MVC MVC (MVC)
Area Calculation
Normal Person
Or
Dysphonic Patient
Fig. 2 Block diagram of the proposed method for the automatic voice disorder detection system.
0.4 1
Joined Peaks
0.35
Fitted Polynomial
0.3 MVC
0.25 0.8
0.2
0.15
Amplitude
Amplitude
0.6
0.1
0.05
0 0.4
-0.05
-0.1
-0.15 0.2
-0.2
-0.25
-0.3 0
1025 1100 1200 1300 1400 1536 0 50 100 150 200 250 300 350
Samples Peaks
(a) (b)
Fig. 3 (a) Peaks determination in a frame and (b) MVC.
most cases, the data are not linearly separable. Therefore, speaker recorded one utterance of each Arabic digit from
the kernel function is implemented to map the original one to nine, as mentioned in Table 2. All speakers were Arab
input space to the higher dimensional space, whose features natives. The total number of available samples for the
are lineally separable. investigation is 1278 (=142 9).
Pathological speakers have five different types of VFDs,
Data namely cysts, laryngopharyngeal reflux disease, unilateral
vocal fold paralysis, polyps and sulcus vocalis. The classifi-
cation of these voice disorders was based on the available
The speech database used for the experiments in this paper
clinical data, including video-laryngostroboscopic examina-
has been considered in many studies of automatic pathology
tion. In some cases such as polyps and cysts, the clinical
detection and classification systems and automatic speech
diagnosis was confirmed by histopathology after the exci-
recognition systems (Alsulaiman, 2014; Muhammad et al.,
sion of the lesion.
2011, 2012). The speech database was recorded at the
Research Chair of Voice, Swallowing, and Communication
Disorders, King Abdul-Aziz Hospital, Riyadh in a soundproof
room. Standardized recording equipment, KayPentax Com- Table 1 Statistics of the speech database.
puterized Speech Lab (CSL Model 4300), was used for the Speakers Male Female Total
recording. The database contains speech samples of both
normal and pathological speakers. The total number of sub- Normal 53 18 71
jects is 142; 50% of them are patients and 50% are normal Patients 50 21 71
speakers, as listed in Table 1. The normal speakers did not Total 103 39 142
have any previous or current history of voice disorders. Each
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
Voice pathology detection based on the modified voice contour and SVM 5
Table 2 List of Arabic digits. Results of the proposed method for pathology
detection
Digits
Symbol In Roman English In Arabic The area under the MVC of normal speakers and dysphonic
1 Wahed ﻭﺍﺣﺪ patients was calculated by following the steps mentioned
2 Athnayn ﺃﺛﻨﻴﻦ in Fig. 2. The calculated area was fed into SVM to differen-
3 Thalathah ﺛﻼﺛﻪ tiate between normal and dysphonic people. Two different
4 Arbaah ﺃﺭﺑﻌﻪ values of the variable thresh, i.e. 0.03 and 0.04, were con-
5 Khamsah ﺧﻤﺴﻪ sidered in the peak determination process. The experimen-
6 Setah ﺳﺘﻪ tal results provided in Table 3 show that the values of thresh
7 Sabaah ﺳﺒﻌﻪ affect the detection rate. In Table 3, 95% C.I. represents the
8 Thamanyah ﺛﻤﺎﻧﻴﺔ 95% confidence interval and AUC stands for the area under
9 Tesaah ﺗﺴﻌﻪ the ROC (receiver operating characteristic) curves. The
accuracy for each fold is determined, and their averaged
accuracy with standard deviation (STD) is presented in
Table 3. The developed system was verified for male and
Experimental setup and results female speakers separately. Seventy percent of male speak-
ers were used for training and the remaining 30% were used
for testing the system. Similarly, female speakers were also
To ensure the performance of the proposed system, all
divided by the same ratio.
results were obtained by using fivefold cross validation.
With thresh = 0.03, an accuracy of 81% with STD of 2.07
The performance of automatic systems were carried out
is achieved for male speakers. The developed system recog-
using the following parameters:
nizes normal persons well as the acceptance of TNs is 92%
but does not perform well for dysphonic patients as the
(a) Sensitivity (SE): The likelihood that the system
acceptance of TPs is 70%. FPs are only 8%, but FNs are up
detects a pathological voice when the input is also
to 30%, which is very high. The long range of the C.I., from
pathological.
75.6 to 91.6, reflects the unreliability of the system. The
TP performance parameters for thresh lower than 0.03 also
SE ¼ 100
TP þ FN show the same kind of behavior. For thresh = 0.04, the
accuracy of the pathology detection system is 99% with
(b) Specificity (SP): The likelihood that the system STD = 0.83. TPs and TNs are 100% and 98%, respectively.
detects a normal voice when the input is also normal. The system responded well for both normal and dysphonic
patients. The 95% C.I. is [96 99] for males and [100 100]
TN for females, which shows the reliability of the system.
SP ¼ 100
TN þ FP The AUC for male and female subjects are 83.6% and
(c) Accuracy (ACC): The ratio between correctly detected 100%, respectively, as shown in Fig. 4(a). In Fig. 4(b), the
files and the total number of files. ROC curves for both genders with thresh = 0.04 are shown,
and the AUC is 100% for each. The performance of the
TP þ TN
ACC ¼ 100 developed system with thresh higher than 0.04 did not show
TP þ TN þ FP þ FN any improvement in the results. Therefore, they are not
where true negative (TN) means that the system included in the Table 3.
detects a normal subject as a normal subject, true
positive (TP) means that the system detects a patho- Baseline results with LPCC and MFCC for pathology
logical subject as a pathological subject, false nega- detection
tive (FN) means that the system detects a
pathological subject as a normal subject and false The results of the pathology detection system with the dif-
positive (FP) means that the system detects a normal ferent numbers of LPCC and MFCC are presented in Table 4.
subject as a pathological subject. Twenty-four coefficients contain 12 static coefficients and
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
6 Z. Ali et al.
12 delta coefficients. Similarly, 36 coefficients contain 12 more transient and noisy. In this paper, a novel method
static, 12 delta and 12 delta–delta coefficients. based on the voice intensity of the speech signal is proposed
Accuracy decreases for both LPCC and MFCC as the num- to develop an AVPD system.
ber of coefficients increases, as depicted in Table 4. The The method calculates one feature for a whole utterance
best obtained accuracy for LPCC is 79.13%, where TPs and produced by a subject. The peaks from the speech signal
TNs are 73.23% and 85.02%, respectively. The maximum are determined to make a voice contour. Breathiness
obtained accuracy for the MFCC-based system is 89.69%, in the voice of a patient makes the voice weaker; therefore,
where TP is 89.92% and TN is 89%. The detection rate of the amplitude in the dysphonic speech signal is lower than
MFCC is better than that of LPCC. The FNs and FPs of MFCC the amplitude in a normal speech signal. The area under the
are also lower than those of LPCC. The performance of the voice contour for a disordered subject is also less than that
proposed system is better than both LPCC- and MFCC-based of a normal subject due to the lower amplitude in dysphonic
AVPD systems. subjects. Moreover, the proposed method did not depend on
The ROC curves for the different numbers of LPCC and the language of the database. In case of any language,
MFCC are shown in Fig. 5. It can be observed that the AUC speech signal of a disordered subject will contain lower
for both LPCC- and MFCC-based detection systems amplitude than a speech signal of a normal subject.
decreases as the number of coefficients increases. The max- To explain the working of the voice intensity-based AVPD
imum AUC is for MFCC (93%) with 12 coefficients, and the system, the area under the MVC for 53 normal and the same
95% C.I. is [91.5–94.5]. number of disordered subjects are plotted in Fig. 6. As can
be observed from Fig. 1 that the amplitude of normal speak-
ers is higher than that of pathological patients. Fig. 6 also
Discussion shows that normal speakers have higher voice intensity than
pathological speech samples. To observe the significance of
The speech production system of a subject suffering from the area under the MVC for normal and disordered subjects,
voice pathology differs from a normal subject. Phonation, a two-tailed t-test is performed. For male subjects, the p-
resonance and articulation are three important phases to value is 0.00004 (<0.05), which shows that the area under
produce speech. Vocal folds directly affect phonation and the MVC for normal and disordered subjects is statistically
resonance in the speech production system. The abnormal significant. Similarly, the p-value = 0.000003 for female sub-
behavior of the vocal folds makes the voice weaker, whis- jects shows the significance of the two classes.
pering and breathier due to too far apart vocal folds. This Similar to other long-term features such as shimmer, jit-
malfunctioning of the vocal folds makes a speech signal ter (Arjmandi et al., 2011) and CPP (Heman-Ackah et al.,
1 1
True positive rate (Sensitivity)
True positive rate (Sensitivity)
Male
0.8 0.8 Female
Male
Female
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
False positive rate (1−Specificity) False positive rate (1−Specificity)
(a) (b)
Fig. 4 ROC curve for the proposed method with (a) thresh = 0.03 and thresh = 0.04.
Table 4 Performance of the LPCC and MFCC for voice pathology detection.
Performance measures Number of LPCC coefficients Number of MFCC coefficients
12 24 36 12 24 36
SE 73% 93% 97% 90% 96% 97%
SP 85% 33% 22% 89% 39% 18%
ACC (%) ± STD 79.1 ± 4.8 63.1 ± 6.6 59.6 ± 6.0 89.6 ± 2.7 67.7 ± 3.9 57.4 ± 6.7
AUC 85.8% 77.8% 67.4% 93% 85.6% 67.5%
95% C.I. 83.8–87.8 75.3–80.4 64.4–70.3 91.5–94.5 83.5–87.7 64.6–70.0
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
Voice pathology detection based on the modified voice contour and SVM 7
0.6 0.6
LPCC MFCC
LPCC−delta MFCC−delta
0.4 LPCC−delta−delta 0.4 MFCC−delta−delta
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
False positive rate (1−Specificity) False positive rate (1−Specificity)
(a) (b)
Fig. 5 ROC curve for different numbers of (a) LPCC and (b) MFCC.
2003), the proposed feature also provides a single value for sustained vowel-based systems. Most automatic pathology
a whole utterance. The measurement of long-term acoustic assessment systems, pathology detection and pathology
features normally involves the accurate estimation of the classification presented in the literature are developed by
pitch period, which is a very difficult task, especially in using a sustained vowel /ah/ (Lee et al., 2013, 2014;
pathological samples. The proposed feature does not need Markaki & Stylianou, 2011; Muhammad & Melhem, 2014).
to estimate the pitch period or fundamental frequency dur- A speech signal remains stationary in the case of a sustained
ing the calculation of the voice contour, and this is a posi- vowel, but it varies over time in the case of continuous
tive aspect of the proposed feature. speech. This is the reason why pathology assessment sys-
In this paper, the developed AVPD system used running tems that use continuous speech are challenging and require
speech, which is a comparatively difficult task than more investigation. Moreover, these systems are more real-
istic because people use continuous speech in their conver-
sations in daily life. Running speech contains fluctuations in
450 Normal vocal characteristics in relation to voice onset, voice termi-
Dysphonic nation and voice breaks, which are considered as crucial in
400
quality voice evaluation. These characteristics are not fully
Area under MVC
Table 5 Comparison of the proposed system with existing connected speech-based systems.
Reference Database No. of Samples (N + P) Features Accuracy (%)
Godino-Llorente, Fraile, Sáenz-Lechón, MEEI 23 + 117 MFCC 96
Osma-Ruiz, and Gómez-Vilda (2009)
Klára, Viktor, and Krisztina (2012) Private 26 + 33 MFCC, shimmer, 86
jitter, and HNR
Ali et al. (2013) Private 12 + 26 MFCC 91.66
Proposed method Private 71 + 71 MVC 100
N stands for normal samples and P stands for disordered samples.
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
8 Z. Ali et al.
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004
Voice pathology detection based on the modified voice contour and SVM 9
Parsa, V., & Jamieson, D. G. (2001). Acoustic discrimination of Roy, N., Merrill, R. M., Thibeault, S., Parsa, R. A., Gray, S. D., &
pathological voice: Sustained vowels versus continuous speech. Smith, E. M. (2004). Prevalence of voice disorders in teachers
Journal of Speech, Language, and Hearing Research, 44, and the general population. Journal of Speech, Language, and
327–339. Hearing Research, 47(2), 281–293.
Paulraj, M. P., Yaacob, S., & Hariharan, M. (2009). Diagnosis of Watts, C. R., & Awan, S. N. (2011). Use of spectral/cepstral
vocal fold pathology using time-domain features and systole analyses for differentiating normal from hypofunctional voices
activated neural network. In Fifth international colloquium on in sustained vowel and continuous speech contexts. Journal of
signal processing & its applications (pp. 29–32). Speech, Language, and Hearing Research, 54, 1525–1537.
Please cite this article in press as: Ali, Z et al. Voice pathology detection based on the modified voice contour and SVM.
Biologically Inspired Cognitive Architectures (2015), http://dx.doi.org/10.1016/j.bica.2015.10.004