
Overview of Voice Disorder Identification by using

Machine Learning Techniques

Benkhay Abdellah a, Daali Noussaiba a

a Faculty of Sciences of Rabat, Mohammed V University
Research Master in Computer Science and Telecommunications, Speech Processing

ARTICLE INFO

Keywords:
Speech recognition
Machine learning
Voice disorders
Support Vector Machine (SVM)

ABSTRACT:

The identification of voice disorders plays an important role in everyday life, and many of these diseases should be analyzed at an early stage, before they lead to a critical condition. Acoustic analysis can be used to identify voice disorders as a complementary strategy to conventional invasive techniques such as laryngoscopy. Mobile technologies not only offer means of communicating multimedia content, clinical audio-visual notes and medical records, but also promising solutions for people who want to assess their health condition anywhere and at any time. In this paper, we present several methods for voice disorder classification based on machine learning; identifying an algorithm that discriminates between disordered and healthy voices with high accuracy is necessary to realize a valid and precise mobile health system. All analyses are performed on datasets of voices selected from SVD [7], MEEI [8] and AVPD [8]. The results obtained are evaluated in terms of accuracy, sensitivity and specificity, and the main classifier used is the SVM.

INTRODUCTION:

According to Wikipedia, voice disorders affect the ability to speak normally. They include laryngitis, paralyzed vocal cords, and nerve problems that cause the vocal cords to spasm; the voice may tremble, sound rough, or sound strained or uneven. Speech, or more generally the voice signal, is used in several kinds of applications, ranging from emotion recognition [9] to the recognition of a patient's health state [10]. Amazon's Alexa, Apple's Siri, Google Assistant and Microsoft's Cortana are some of today's well-known Automatic Speech Recognition (ASR) systems. These voice-activated digital assistants help us in the workplace, in banking, in healthcare and so on.

The voice is defined as the sound produced in a person's larynx and uttered through the mouth, as speech or song. The introduction of mobile devices for data transmission or for disease control and monitoring has been a principal focus of the research and business communities. Such devices offer, in fact, many opportunities to build effective mobile health systems: they allow patients and clinicians to access medical records, clinical audio-visual notes and medication data anywhere and at any time from a mobile device, such as a tablet or smartphone, in order to monitor several conditions. In recent years, probably also due to the diffusion of the Internet of Things (IoT) and cloud technologies, monitoring systems have been developed in an unobtrusive, portable and easy way using wearable sensors and wireless communications, such as the solutions described in [1], [2], [3], [4], [5], [6]. Several acoustic parameters are estimated to evaluate the state of health of the voice. Unfortunately, the accuracy of these parameters in the detection of voice disorders is often related to the algorithms used to estimate them. For this reason, the main effort of researchers is oriented to the study of acoustic parameters and to the application of classification techniques able to obtain a high discrimination accuracy. Recently, speech pathology research has focused on machine learning techniques.

This paper gives a meta-analysis of the relevant research articles that directly address voice disorders, the datasets used for detection, and the machine learning methods used for recognition. In short, this overview covers the three most popular datasets: SVD [7], MEEI [8] and AVPD [8].

The rest of this paper is organized as follows. Section 1 gives a short introduction to voice disorders and to the datasets we have targeted, as related work. Section 2 presents the methodologies used to conduct this overview. The findings of this systematic assessment are reported in Section 3. Section 4 deals with the methodology we chose to present. The conclusion of the whole paper is provided in Section 5, together with limitations, research gaps and recommendations for further investigation.

1. RELATED WORK

As noted previously, in this overview we give a meta-analysis of the relevant research articles that directly address voice disorders, the datasets used for detection, and the machine learning methods used for recognition. In short, this overview covers the three most popular datasets: SVD [7], MEEI [8] and AVPD [8].

Many studies have focused on the identification of parameters that measure voice quality and on new techniques able to detect voice disorders. Other studies have focused on peer-reviewed articles that used machine learning to recognize voice disorders in voice recordings. We concentrated mostly on the research papers related to these criteria in order to understand the problem through machine learning or its implementation. This includes articles that mostly used voice recordings from the SVD [7], MEEI [8] and AVPD [8] databases to detect voice disorders with a high success rate; we also ensured that the selected research papers use machine learning-based approaches.

All articles that do not include machine learning, or an algorithm in which the disease is defined, were eliminated. This also excludes papers based solely on a qualitative examination that are not analyzed in terms of accuracy and quantitative results. These criteria highlight the accuracy of the machine learning algorithms and strategies applied in all of the selected, quantitatively reviewed publications.

Afterwards, we present a meta-analysis of the detection of voice disorders using the SVD [7], MEEI [8] and AVPD [8] databases, review the outcomes and accuracy of 30 relevant articles, and identify the research gap in this field. The selected articles use machine learning techniques and include voice filtering and segmentation techniques, an application, or any software used to detect the disease through the voice. All articles are in English.

In Table 1 it can be noticed that SVM is the most used algorithm for the identification of voice disorders in all three datasets. The recognition of voice disorders plays an important role in our lives today, and many of these disorders should be treated at an early stage of incidence, before they progress to a critical condition. SVMs have become a popular tool for discriminative labeling, and speech synthesis is a promising field for recent SVM applications [11].

Table 1: Summary of the 30 selected studies with data extraction.

no | Author / year | Dataset | Feature selection | Classifier | Accuracy (%)
1 | A. Al-Nasheri et al. / 2017 [12] | SVD / MEEI / AVPD | MFCC | SVM | 99.53 / 99.54 / 96.01
2 | Zulfiqar Ali et al. / 2017 [13] | SVD / MEEI / AVPD | MFCC | GMM | 80.02 / 94.6 / 83.6
3 | Fonseca et al. / 2020 [14] | SVD | SE, ZCR, SH | SVM | 95
4 | Garcia et al. / 2019 [15] | SVD | GBR scale | GMR | --
5 | Guedes et al. / 2019 [16] | SVD | PCA | DLN | 80
6 | Hammami et al. / 2020 [17] | SVD | HOS, DWT | SVM | 99.3
7 | Panek et al. / 2016 [18] | SVD | PCA | K-means clustering | 100
8 | Markaki et al. / 2009 [19] | SVD | Mutual information b/w subjective voice quality and computed features | SVM | 94.1
9 | Markaki et al. / 2011 [20] | SVD | Mutual information b/w voice classes (normophonic/dysphonic) | SVM | 94.1
10 | Miramont et al. / 2020 [21] | SVD / MEEI | CPP, SDNPCV, NPCV, HNR | SVM | 86.53 / ---
11 | Muhammad et al. / 2017 [22] | SVD / MEEI / AVPD | Glottal source excitation | SVM | 93.2 / 99.4 / 91.5
12 | Kadiri et al. / 2020 [23] | SVD | Glottal source features and MFCC | SVM | 76.19
13 | Teixeira et al. / 2018 [24] | SVD | Jitter, shimmer, HNR, MFCC | SVM | 71
14 | Teixeira et al. / 2017 [25] | SVD | Jitter, shimmer and HNR | MLP-ANN | 95
15 | Amami et al. / 2017 [26] | MEEI | DBSCAN and MFCCs | SVM | 98
16 | Londono et al. / 2010 [27] | MEEI | MFCC | HMM | 82.14
17 | Arjmandi et al. / 2011 [28] | MEEI | MDVP parameters | QD, NM, KNN, SVM, ML | 78.9
18 | Barreira et al. / 2020 [29] | MEEI | HASS-KLD, H-KLD, MFCC | GNB | 99.55
19 | Cordeiro et al. / 2017 [30] | MEEI | MFCC, LSF | SVM, GMM, DA | 98.7
20 | Cordeiro et al. / 2018 [31] | MEEI | RPPC | SVM | 94.2
21 | Fang et al. / 2019 [32] | MEEI | MFCC | DNN, SVM, GMM | 99.14
22 | Llorente et al. / 2009 [33] | MEEI | MFCC | MLP-ANN | 96
23 | Mahmood / 2019 [34] | MEEI | MFCC | ANN, SVM, RF | 72.70
24 | Muhammad et al. / 2014 [35] | MEEI | MPEG-7 features | SVM | 99.994
25 | Ghulam Muhammad / 2016 [36] | SVD / MEEI | MFCC | SVM / GMM | 87.85 / 98.23
26 | Lili Chen / 2021 [37] | VED II | LPCC | KNN | 93.3
27 | Laura Verde / 2021 [38] | ENT | HNR, jitter, shimmer | RF | 83.3
28 | Li Deng / 2016 [39] | TIMIT | DNN-derived features | DNN, CNN, RNN | 81.7
29 | Mousumi Malakar / 2021 [40] | MEEI | HNR, jitter, shimmer | KNN | 96.1
30 | Alireza A. Dibazar / 2002 [41] | MEEI | MFCC | LDC, GMM, NMC | 98.59
The Support Vector Machine (SVM) is a well-established classification approach that has attracted great scientific interest, especially in the fields of classification, regression and machine learning. SVMs are trained with the known class labels, after a stage of feature filtering or extraction. Even when no prediction of unknown samples is required, feature selection and SVM classification have been used together, since they can identify the main feature sets that take part in the class discrimination process. Consider, as examples, the classification of pathological and healthy voices. A first study is based on MPEG-7 features to train and test an SVM classifier, and it reached an accuracy of 99.994% [35]; in this research the database used is MEEI, which contains sustained vowel /AH/ recordings by 53 normal speakers, with a duration of around 3 s, and by 657 pathological speakers with a wide variety of diseases, with a duration of around 1 s. In a second study, a method was developed for short-time jitter, and PCA reached 80% [16]. Mel-frequency cepstral coefficients (MFCC) are popular features in speech analysis and have also been used in voice pathology detection: in a third study [32], the detection of pathological voice was performed with MFCC features, and SVM, GMM and DNN classifiers reached an accuracy of 99.14% on a subset of samples from the Massachusetts Eye and Ear Infirmary (MEEI) database [8]. Another study [27] applied an HMM classifier to the MEEI dataset with MFCC features and obtained an accuracy of 82.14%.

2. MATERIALS AND METHODS:

2.1 The datasets:

The first database used in this review is the Saarbruecken Voice Database (SVD) [7], a collection of voice recordings from over 2000 individuals. It contains recordings of the vowels [i, a, u] produced at normal, high and low pitch, recordings of the vowels [i, a, u] with rising pitch, and a recording of the sentence "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). The database includes a text file with all relevant information about the dataset. These characteristics make it a good choice for experimenters. All recorded SVD voices were sampled at 50 kHz with a resolution of 16 bits.

The second database used in this survey is the Massachusetts Eye and Ear Infirmary (MEEI) database [8]. It contains over 1,400 voice samples of the sustained vowel /a/ and of the first part of the Rainbow Passage, recorded by the MEEI Voice and Speech Lab. It has been commercialized by Kay Elemetrics and was recorded in two different environments [42]. The sampling frequency was 50 kHz. It is used in most voice pathology detection and classification experiments, despite the fact that the different conditions and sound levels used to capture normal and pathological voices have several disadvantages. In this collection, a few tools, such as stroboscopy, acoustic assessment and physical neck and mouth examinations, were used to assess the speech disorders (this information was provided by Kay Elemetrics).

The third database used in this review is the Arabic Voice Pathology Database (AVPD) [8]. Samples of words and voices were recorded in different sessions at the King Abdul Aziz University Hospital in Riyadh, Saudi Arabia, in the Communication and Swallowing Disorders Unit. In a sound-treated room, a standard recording protocol was used by experienced phoneticians to collect the voices of the patients. The database protocol was designed to avoid specific shortcomings of the MEEI database [43]. The AVPD provides recordings of sustained vowels and of disordered voice folds, together with the same recordings from normal speakers.

2.2 Features used for classification:

Feature extraction is the first step in any voice disorder detection system. In this step, the given voice signals are converted into representative acoustic features using different digital signal processing techniques. We now discuss the most commonly used techniques for acoustic analysis and feature extraction in the related area. We have selected the main features used in the related studies existing in the literature concerning the use of machine learning strategies for voice classification.

Mel-frequency cepstral coefficients: the Mel-frequency cepstral coefficient (MFCC) is a standard feature extraction method that makes use of knowledge of the human auditory system. The general steps for extracting the MFCC features for a single frame [44], [45] are as follows:
i. Computation of the discrete Fourier transform coefficients.
ii. Filtering with Mel-spaced triangular filters.
iii. Computation of sub-band energies.
iv. Computation of the discrete cosine transform coefficients.
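As an illustration of steps i to iv, the sketch below computes the MFCCs of a single frame with NumPy and SciPy; the frame length, sample rate and filter-bank size are assumed values chosen for the example, not parameters prescribed by the cited studies.

```python
# Minimal per-frame MFCC sketch following steps i-iv above (illustrative only;
# frame length, filter count and sample rate are assumed example values).
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_coeffs=13):
    n_fft = len(frame)
    # i. Discrete Fourier transform coefficients (power spectrum of the windowed frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # ii. Mel-spaced triangular filter bank between 0 Hz and the Nyquist frequency
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        if center > left:
            fbank[i - 1, left:center] = np.linspace(0, 1, center - left, endpoint=False)
        if right > center:
            fbank[i - 1, center:right] = np.linspace(1, 0, right - center, endpoint=False)
    # iii. Sub-band (log) energies
    energies = np.log(fbank @ spectrum + 1e-10)
    # iv. Discrete cosine transform of the log energies, keep the first coefficients
    return dct(energies, type=2, norm='ortho')[:n_coeffs]

# Example: a 25 ms frame of a synthetic signal at 16 kHz
frame = np.random.randn(400)
print(mfcc_frame(frame).shape)  # (13,)
```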
Harmonic-to-Noise Ratio (HNR): this quantifies the ratio of the signal information over the noise due to turbulent airflow resulting from an incomplete vocal fold closure in speech pathologies.
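The following is a rough, simplified HNR estimate based on the normalized autocorrelation peak, in the spirit of the harmonicity measure popularized by tools such as Praat; it is a sketch with assumed parameter values, not the exact procedure used in the reviewed papers.

```python
# Rough autocorrelation-based HNR estimate for a single voiced frame (sketch only).
import numpy as np

def hnr_db(frame, sr=16000, f0_min=75.0, f0_max=400.0):
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                # normalized autocorrelation, r(0) = 1
    lag_min, lag_max = int(sr / f0_max), int(sr / f0_min)
    r_max = np.max(ac[lag_min:lag_max])      # peak at the candidate fundamental period
    r_max = np.clip(r_max, 1e-6, 1.0 - 1e-6)
    # harmonic fraction r_max versus noise fraction (1 - r_max), expressed in dB
    return 10.0 * np.log10(r_max / (1.0 - r_max))

voiced = np.sin(2 * np.pi * 150 * np.arange(0, 0.04, 1 / 16000)) + 0.05 * np.random.randn(640)
print(round(hnr_db(voiced), 1))  # high value: the frame is mostly harmonic
```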
Discrete wavelet transform: the discrete wavelet transform (DWT) can be used in a variety of signal processing applications. It is discrete in time and in scale. The voice is a time-variant signal and sometimes needs to be converted into the frequency domain for analysis. The DWT is capable of performing a joint time−frequency analysis, as well as the analysis of the high-frequency characteristics of pathological voices; hence, it is a useful tool for voice disorder detection. The DWT coefficients may have real values, but the time and scale values used to index these coefficients are integers [46], [47].
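A minimal decomposition example is given below; it assumes the PyWavelets package, and the wavelet family and decomposition depth are illustrative choices rather than values taken from the reviewed studies.

```python
# Multilevel DWT decomposition of a voice signal using PyWavelets (assumed library).
import numpy as np
import pywt

signal = np.random.randn(16000)                        # stand-in for a 1 s voice signal at 16 kHz
coeffs = pywt.wavedec(signal, wavelet='db4', level=5)  # [cA5, cD5, cD4, cD3, cD2, cD1]
# Sub-band energies are a common summary used as classifier input features
energies = [float(np.sum(c ** 2)) for c in coeffs]
print([round(e, 1) for e in energies])
```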
Jitter: this describes the instabilities of the oscillating pattern of the vocal folds, quantifying the cycle-to-cycle changes in fundamental frequency.
Shimmer: this indicates the instabilities of the oscillating pattern of the vocal folds, quantifying the cycle-to-cycle changes in amplitude.
The jitter and shimmer features were estimated using the method presented in [48].
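A minimal sketch of the local jitter and shimmer computations is shown below; it assumes that the glottal cycle periods and peak amplitudes have already been extracted (the non-trivial part in practice, typically done with a pitch tracker), and the numeric values are hypothetical.

```python
# Local jitter and shimmer from per-cycle measurements (sketch with hypothetical data).
import numpy as np

def local_jitter(periods):
    periods = np.asarray(periods, dtype=float)
    # mean absolute cycle-to-cycle period difference, relative to the mean period
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    amplitudes = np.asarray(amplitudes, dtype=float)
    # mean absolute cycle-to-cycle amplitude difference, relative to the mean amplitude
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

periods_ms = [7.9, 8.1, 8.0, 8.2, 7.8]   # hypothetical glottal cycle lengths (ms)
peaks = [0.50, 0.48, 0.52, 0.47, 0.51]   # hypothetical per-cycle peak amplitudes
print(round(local_jitter(periods_ms) * 100, 2), '%', round(local_shimmer(peaks) * 100, 2), '%')
```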
Glottal flow signal parameters: the glottal flow signal can be obtained by performing an inverse filtering of the voice signal, which consists of eliminating the influence of the vocal tract and of the voice radiation caused by the mouth while preserving the glottal flow signal characteristics.
Linear Predictive Coding: LPC and LPCC model the vocal tract characteristics based on the human speech production model, whereas MFCC mimics the human perception of hearing. LPC and LPCC were widely used earlier, as they are considered easy to compute.
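A short example of LPC extraction is given below; it assumes the librosa library, and the model order is an illustrative choice for a 16 kHz signal, not a value specified in the reviewed papers.

```python
# LPC coefficients of a voiced frame via librosa (assumed dependency, sketch only).
import numpy as np
import librosa

sr = 16000
t = np.arange(0, 0.03, 1 / sr)
frame = np.sin(2 * np.pi * 120 * t) + 0.01 * np.random.randn(t.size)  # stand-in voiced frame
a = librosa.lpc(frame, order=12)   # all-pole model coefficients, a[0] == 1
print(a.shape)                     # (13,)
```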
MDVP: the application of acoustic analysis to voice disorder detection is well documented, for example by Sonu and Sharma. The Multidimensional Voice Program (MDVP) is standard software for acoustic analysis, and it is the most popularly used in AVDD systems. It is reliable and very comprehensive: using MDVP, one can estimate 33 voice parameters, which measure frequency-related, intensity-related, noise-related and tremor-related quantities, together with their perturbation measures, among others.

2.3 Machine learning algorithms:

We have selected different machine learning algorithms; each of them has been chosen as a representative of a class of algorithms, based on a set of shared qualities. These methods include:

Support Vector Machine (SVM): the SVM is widely used for data classification and is known for its high prediction capabilities in many signal processing applications [11]. The SVM objective is to find the optimal hyperplane that separates the two classes while maximizing the margin between the separating boundary and the closest samples to it (the support vectors).
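A minimal training and evaluation sketch is shown below; it assumes scikit-learn, and random vectors stand in for real acoustic features such as MFCCs, jitter and shimmer.

```python
# Minimal SVM training/evaluation loop over extracted voice features
# (scikit-learn assumed; synthetic data stands in for real feature vectors).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))            # 200 samples x 13 features (e.g. MFCCs)
y = rng.integers(0, 2, size=200)          # 0 = healthy, 1 = pathological (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
clf.fit(X_tr, y_tr)
print('accuracy:', clf.score(X_te, y_te))
```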
Hidden Markov Model (HMM): the HMM is used to model the spectral variability of each speech sound using a mixture of Gaussian distributions. It acts as a stochastic finite state machine, assumed to be built from a finite set of possible states (hidden during evaluation), where each state is associated with a mixture of Gaussian probability density functions [49], [50], [51].

Gaussian Mixture Model (GMM): the GMM is used to model the probability distribution of feature values, which are continuous in nature. Since the voice acoustic measures are continuous, the GMM is suitable for modeling the voice characteristics. The GMM parameters are estimated during training using the iterative expectation−maximization (EM) algorithm or maximum a posteriori (MAP) estimation.
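A common way to use GMMs for detection is to fit one mixture per class by EM and then classify a sample according to the larger log-likelihood; the sketch below assumes scikit-learn and synthetic stand-in features.

```python
# One GMM per class fitted by EM, classification by higher log-likelihood
# (scikit-learn's GaussianMixture assumed; data is synthetic stand-in features).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
healthy = rng.normal(0.0, 1.0, size=(150, 13))
pathological = rng.normal(0.8, 1.3, size=(150, 13))

gmm_h = GaussianMixture(n_components=4, covariance_type='diag', random_state=0).fit(healthy)
gmm_p = GaussianMixture(n_components=4, covariance_type='diag', random_state=0).fit(pathological)

test = rng.normal(0.8, 1.3, size=(10, 13))   # unseen "pathological" samples
pred = (gmm_p.score_samples(test) > gmm_h.score_samples(test)).astype(int)
print(pred)  # 1 = the pathological model wins the likelihood comparison
```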
Convolutional Neural Network (CNN): CNNs are attracting interest across a range of domains, including radiology. A CNN is designed to learn spatial hierarchies of features automatically and adaptively through numerous building blocks, including convolution layers, pooling layers and fully connected layers [49]. CNN is a deep learning method that is commonly used for solving difficult problems.

Artificial Neural Network (ANN): the ANN algorithm works based on the concepts of biological neural networks in humans. It consists of a set of neurons connected to each other, where the output of one neuron can be fed as input to another neuron. The neurons are arranged in layers, i.e., an input layer, hidden layers and an output layer, also referred to as a multilayer perceptron (MLP); the number of neurons in each layer and the total number of layers depend on the type of application.

Deep Neural Network (DNN): a DNN is basically a feed-forward network with more layers, and hence is more powerful than the traditional ANN. A special attraction of the DNN is the bottleneck layer, which typically consists of fewer nodes than the previous layer; this layer produces an abstract representation of the incoming information by compressing it. At the extreme, DNNs can be used to successfully develop an end-to-end ASR system [52].

Decision Tree (DT): this technique is used to classify categorical data, in which the learned function is represented by a decision tree. Decision trees are easy to interpret and are capable of working with missing values and with categorical and continuous data, characteristics of the medical field. We have used J48, an implementation of the C4.5 algorithm [53], the most popular tree classifier.

It is important to remark here that other classification techniques are not reported in this study due to the poor performance achieved during our experiments.
3. RESULTS AND DISCUSSION:

In this section, we discuss the strengths and weaknesses of SVD, MEEI and AVPD, after a detailed analysis of the studies, including the techniques used and the outcome measurements. As noted previously, the performance of the selected machine learning classification techniques was assessed in terms of accuracy, sensitivity and specificity. The accuracy, that is, the percentage of correctly classified instances, is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

while the sensitivity and the specificity, which represent respectively the test's ability to detect positive results and to identify negative results, are defined as:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

where:
o True Positive (TP): the voice sample is pathological and the algorithm recognizes this;
o True Negative (TN): the voice sample is healthy and the algorithm recognizes this;
o False Positive (FP): the voice sample is healthy but the algorithm recognizes it as pathological;
o False Negative (FN): the voice sample is pathological but the algorithm recognizes it as healthy.
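The sketch below computes these three measures from a pair of label lists; the label convention (1 = pathological, 0 = healthy) is an assumption made for the example.

```python
# Accuracy, sensitivity and specificity from the confusion counts defined above
# (plain Python; label convention assumed: 1 = pathological, 0 = healthy).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
print(metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
```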
3.1 Results

In our study, we have chosen to examine the machine learning classification techniques over the overall databases and, additionally, over three different databases, by selecting some of the computed features and applying the SVM.

A quantitative analysis showing the significance of SVM is presented in Figures 1, 2 and 3. SVM is the algorithm most widely used in the detection of voice disorders; for many years, SVM and its application in the medical area have been a research topic for many researchers, and it is the preferred machine-learning algorithm of many scientists because of its accuracy. In Figures 1, 2 and 3 it can be observed that, with variations in the features, different accuracies have been obtained with SVM as a common algorithm on the SVD [7], MEEI [8] and AVPD [8] databases. The features with the best accuracy are MDVP, MPEG-7 and MFCC in the three datasets, respectively. In general, however, all the extracted features perform well, since the accuracy of each of them is greater than 50%, except for the VTAI feature on the MEEI dataset.

SVM accuracy on SVD by feature set (%): peak, lag, entropy [11]: 99.53; MDVP [60]: 99.68; eight frequency bands [57]: 90.9; SE, ZCR, SH [13]: 95; HOS, DWT [16]: 99.35; glottal flow features [58]: 98.43; mutual information between voice classes [19]: 94.1; CPP, SDNPCV, NPCV, HNR, noise level, D2, shimmer, K2 [20]: 86.53; glottal source excitation [21]: 93.2; glottal source features and MFCC [22]: 76.19; jitter, shimmer, HNR, MFCC [23]: 71.

Figure 1: Bar graph representing the overall accuracy of the SVM algorithm used in SVD with different features.

SVM accuracy on MEEI by feature set (%): peak, lag, entropy [11]: 99.54; MDVP [60]: 88.21; eight frequency bands [57]: 99.8; MFCC, LSF [29]: 98.7; RPPC [55]: 94.2; MPEG-7 low-level audio features [54]: 99.99; VTAI feature [8]: 0; wavelet packet transform based energy/entropy [59]: 92.24; glottal flow features [58]: 93.57; spectral, inferior colliculus coefficients: 99.9; CPP, SDNPCV, NPCV, HNR, noise level, D2, shimmer, K2 [20]: 87.06; MPEG-7 features [34]: 99.99; glottal source excitation [21]: 99.4; parametric wavelet and adaptive wavelet transform [36]: 98.3; nonlinear dynamic parameterization [39]: 96.73.

Figure 2: Bar graph representing the overall accuracy of the SVM algorithm used in MEEI with different features.

SVM accuracy on AVPD by feature set (%): peak, lag, entropy [11]: 96.02; MDVP [60]: 72.53; eight frequency bands using correlation functions [57]: 91.6; MFCC [56]: 93.6; glottal source excitation [21]: 91.5.

Figure 3: Bar graph representing the overall accuracy of the SVM algorithm used in AVPD with different features.

A quantitative analysis comparing the other algorithms on all the selected databases is shown in Figures 4, 5 and 6. It can be observed that, besides SVM, some algorithms also produce good accuracies. For example, on SVD, Zulfiqar Ali et al. [12] used a GMM and obtained an accuracy of 80.02%, with a sensitivity of 91.22% and a specificity of 94.27%. A Gaussian mixture model (GMM), a weighted sum of Gaussian components, is a parametric probability density function. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as spectral vocal-tract characteristics in a speech recognition system; the GMM parameters can be estimated from training data using the iterative EM algorithm or Maximum a Posteriori (MAP) estimation from a well-trained prior model [63]. In Moon et al. [61], the Random Forest algorithm is used to detect voice disorders and the resulting accuracy is 84.87%; however, overall sensitivity and specificity were not reported. RF is an ensemble of classification and regression trees [68], each trained on a dataset of the same size as the training set, called a bootstrap. Once a tree is grown, the samples that do not appear in its bootstrap (the out-of-bag (OOB) samples) are used as its test set. The OOB estimate of the generalization error is the classification error rate over all these test sets. In 1996, Breiman [64] found that the OOB error is as accurate as using a test set of the same size as the training set for the bagged classifiers; the OOB estimate therefore removes the need for a separate test set.
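The out-of-bag estimate described above can be obtained directly in common toolkits; the sketch below assumes scikit-learn and synthetic data standing in for real voice measurements.

```python
# Random Forest with out-of-bag (OOB) error estimation as described above
# (scikit-learn assumed; synthetic features stand in for real voice measurements).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# oob_score=True evaluates each tree on the samples left out of its bootstrap
rf = RandomForestClassifier(n_estimators=200, oob_score=True, bootstrap=True, random_state=0)
rf.fit(X, y)
print('OOB accuracy estimate:', round(rf.oob_score_, 3))
```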
On SVD, the highest reported accuracy is 99% [60]. After SVM, GMM [12], [14] and RT [58], convolutional neural networks have also been used in the detection of voice disorders and have produced good outcomes. The convolutional neural network (CNN), a class of models that has been influential in various computer vision tasks, is attracting interest across a range of domains, including radiology. A CNN is designed to learn spatial hierarchies of features automatically and adaptively through numerous building blocks, including convolution layers, pooling layers and fully connected layers [65]. CNN is a deep learning method that is commonly used for solving difficult problems and overcomes some limitations of traditional machine learning [66]. In [15], a CNN is used and the reported accuracy is 78%.

Figure 4: Bar graph representing the overall accuracy, sensitivity and specificity of multiple algorithms used in SVD.

(Algorithms compared: GMM [12], ANN [58], HMM [34], LDA and NN [35], HMM [25], QD, NM and KNN classifiers [27], Naive Bayes [28], KNN, PNN and CART [59]; reported accuracies range from about 78.9% to 99.55%.)

Figure 5: Bar graph representing the overall accuracy, sensitivity and specificity of multiple algorithms used in MEEI.

(Algorithms compared: GMM [12], VQ [40], GMM [40] and HMM [40]; reported values range from about 78.4% to 91.6%.)

Figure 6: Bar graph representing the overall accuracy, sensitivity and specificity of multiple algorithms used in AVPD.

In Figure 5, for MEEI, Naïve Bayes [28] has the lowest reported accuracy, 72.70%. Apart from Naïve Bayes, algorithms such as HMM [34], [25], LDA [35], GMM [12], RF [61], PNN [59], KNN [59] and ANN [58] all have accuracies ranging between 90% and 100%, which is again considered a good reported outcome in terms of accuracy.

On AVPD, all the algorithms used, GMM [12], GMM [40], VQ [40] and HMM [40], have accuracies higher than 83%, which is again considered a good reported outcome in terms of accuracy.

3.2 Discussion:

We notice that the quantitative analysis used only one unsupervised technique, and only on SVD, in Panek et al./2016 [17]; its resulting accuracy rises to 99%, with no sensitivity or specificity reported. Apart from this, no researcher has used an unsupervised technique for voice pathology detection. The validation of PCA by k-means clustering and cross-validation loses 10% of the signal (a variance of 90%) from the initial feature vector and produces worse results than the analysis using the original 28 feature vectors.

In comparison, the analysis based on kPCA, which included all the pitches analyzed, showed the most accurate evidence of the patient's health condition. The analogous analysis of the recordings showed 100% accuracy for the 28 feature vectors, for the relevant number of key components for each pitch, and for the kPCA result for each vowel. The k-means algorithm provides perfect separation of the data for male recordings, which is the opposite of the female analysis using 28 parameters and PCA. This issue was addressed, and 99% classification accuracy was reached with the kPCA analysis, a non-linear data transformation; this indicates that separating the data in a linear fashion was not adequate. In addition, the k-means algorithm presents artifacts allocated by distance to the closest cluster [17]. It has nevertheless been suggested that researchers should focus more on unsupervised techniques and evaluate these databases with them.

Tissue diseases, systemic changes, mechanical stress, surface irritation, tissue changes, neurological and muscular changes, and other factors [59] can cause voice disease. Vocal pathology affects the agility, strength and shape of the vocal folds, resulting in abnormal noise and reduced acoustic tone. Both subjective and objective evaluations of vocal problems have been approached until now [67]. The first group (subjective assessment) is the auditory and visual analysis of the vocal folds in a hospital [68]. The second category (objective evaluation) is focused on automatic, computer-based processing of acoustic signals to measure and identify the underlying vocal pathology, which may not even be detectable by a human [40]; this type of assessment is therefore inherently non-subjective. In practice, voices can now easily be captured and stored globally via cloud technologies using many intelligent devices.
Many databases have been commonly used by researchers for the objective assessment of speech pathology: the Massachusetts Eye and Ear Infirmary (MEEI) database [8], the Saarbrücken Voice Database (SVD) [7], and the Arabic Voice Pathology Database (AVPD) [8]. These repositories also have some pitfalls. For example, certain databases are highly unevenly distributed between healthy and unhealthy groups, and the datasets show troubling differences in the number of samples per type of pathology (e.g., there are pathologies with fewer than three samples in a database). Some repositories do not give details on the severity of the disease or on the pathology symptoms during phonation, so some of the samples may sound healthy despite being labeled pathological, and vice versa. Moreover, more than one type of pathology may be used to label a recording, and it is particularly challenging to include or exclude samples in different languages [68].

Regarding the limitations of this systematic review, we cannot deny that the number of included publications is low. Secondly, only articles published in English were selected, which can restrict the representation of work from non-English-speaking countries and limit the generalizability of the results. Thirdly, there is a real possibility that the search strategy for this review missed some relevant studies, since studies published in conference proceedings were mostly avoided.

4. CONCLUSIONS:

In recent years, the use of mobile multimedia services and applications in the healthcare sector has been increasing significantly. Mobile health applications allow people to access medical information and data of interest at any time and anywhere, which is useful for the monitoring and detection of specific diseases, such as dysphonia, an often underestimated voice disorder that affects a great percentage of people.

Research on mobile automatic systems to assess voice disorders has received considerable attention in the last few years due to its objectivity and non-invasive nature. Machine learning techniques can be a valid support to investigate new signal processing approaches in an easy and fast way that can be implemented in an m-health solution. This study compares the performance of different voice pathology identification methods, taking into account the main machine learning techniques. Moreover, in this work we focus on identifying appropriate voice signal features through a comparative study of different classifiers. All analyses are performed on a wide dataset of voices selected from the Saarbruecken Voice Database.

The tests have been carried out over the overall dataset and over three different subsets, where we only considered the features selected by three specific feature selection methods. The results have shown that the best accuracy in voice pathology detection is achieved using the Support Vector Machine algorithm. This technique classifies a voice as pathological or healthy with an accuracy of about 86% using all parameters. This result was confirmed by the experimental tests in which the InfoGainAttributeEval method and the PCA method were applied (accuracy values were, respectively, equal to 84.16% and 71.75%). Meanwhile, when we considered only the parameters selected with the Correlation method, the best accuracy was obtained with the Decision Tree technique.

Although the accuracy values are smaller than the values obtained in other studies in the literature, it is necessary to highlight that all these studies are performed on very limited and often non-accessible datasets. To enhance the classification rate obtained, we are interested in improving the classification phase by developing a hybrid system using a combination of several machine-learning techniques.

REFERENCES

[1]: S. Naddeo, L. Verde, M. Forastiere, G. De Pietro, and G. Sannino, "A real-time m-health monitoring system: An integrated solution combining the use of several wearable sensors and mobile devices," in HEALTHINF, 2017, pp. 545–552.

[2]: M. S. Hossain and G. Muhammad, "Cloud-assisted industrial internet of things (IIoT)-enabled framework for health monitoring," Computer Networks, vol. 101, pp. 192–202, 2016.

[3]: G. Sannino, I. De Falco, and G. De Pietro, "A supervised approach to automatically extract a set of rules to support fall detection in an mhealth system," Applied Soft Computing, vol. 34, pp. 205–216, 2015.

[4]: J. Mohammed, C.-H. Lung, A. Ocneanu, A. Thakral, C. Jones, and A. Adler, "Internet of things: Remote patient monitoring using web services and cloud computing," in 2014 IEEE International Conference on Internet of Things (iThings), Green Computing and Communications (GreenCom), and Cyber, Physical and Social Computing (CPSCom), IEEE, 2014, pp. 256–263.

[5]: R. Gravina and G. Fortino, "Automatic methods for the detection of accelerative cardiac defense response," IEEE Transactions on Affective Computing, vol. 7, no. 3, pp. 286–298, 2016.
response,” IEEE Transactions on Affective assessment of voice disorders, Eng. Appl. Artific.
Computing, vol. 7, no. 3, pp. 286–298, 2016. Intell., 82 (2019), 236–-251.
[15]: V. Guedes, F. Teixeira, A. Oliveira, J. Fernandes,
[6]: S. Iyengar, F. T. Bonda, R. Gravina, A. Guerrieri, L. Silva, A. Junior, et al., Transfer Learning with
G. Fortino, and A. Sangiovanni-Vincentelli, “A AudioSet to Voice Pathologies Identification in
framework for creating healthcare monitoring Continuous Speech, Proced. Comput. Sci., 164 (2019),
applications using wireless body sensor networks,” in 662–669.
Proceedings of the ICST 3rd international conference [16]: I. Hammami, L. Salhi, S. Labidi, Voice
on Body area networks. ICST (Institute for Computer pathologies classification and detection using EMD-
Sciences, Social-Informatics and Telecommunications DWT analysis based on higher order statistic features,
Engineering), 2008, p. 8. IRBM, 41 (2020), 161–171.
[17]: D. Hemmerling, A. Skalski, J. Gajda, Voice data
[7]: Saarbruecken Voice Database—Handbook, mining for laryngeal pathology assessment, Comput.
Stimmdatenbank.coli.uni-saarland.de. [Online]. Biol. Med., 69 (2016), 270–276.
Available: http://www.stimmdatenbank.coli.uni- [18]: M. Markaki, Y. Stylianou, Using modulation
saarland.de/help_en.php4. spectra for voice pathology detection and
classification, 2009 Annual International Conference
[8]: M. OpenCourseWare, Lab Database | Laboratory of the IEEE Engineering in Medicine and Biology
on the Physiology, Acoustics, and Perception of Society, 3–6 Sept. 2009, pp. 2514–2517.
Speech | Electrical Engineering and Computer Science [19]: M. Markaki, Y. Stylianou, Voice pathology
| MIT OpenCourseWare, Ocw.mit.edu. [Online]. detection and discrimination based on modulation
Available: https://ocw.mit.edu/courses/electrical- spectral features, IEEE Transact. Aud. Speech Langu.
engineering-and-computer-science/6-542j-laboratory- Process, 19 (2011), 1938–1948.
[20]: J. M. Miramont, J. F. Restrepo, J. Codino, C.
on-the-physiology-acoustics-and-perception-of-
Jackson-Menaldi, G. Schlotthauer, Voice signal typing
speech-fall-2005/lab-database/ using a pattern recognition approach, J. Voice, 2020.
[9]: M. S. Hossain, G. Muhammad, M. F. Alhamid, B. [21]: G. Muhammad, M. Alsulaiman, Z. Ali, T. A.
Song, and K. Al- Mutib, “Audio-visual emotion Mesallam, M. Farahat, K. H. Malki, et al., Voice
recognition using big data towards 5g,” Mobile pathology detection using interlaced derivative pattern
on glottal source excitation, Biomed. Signal Process.
Networks and Applications, vol. 21, no. 5, pp. 753–
Control, 31 (2017),156–164.
763, 2016. [22]: S. R. Kadiri, P. Alku, Analysis and detection of
[10]: M. S. Hossain, “Patient state recognition system pathological voice using glottal source features, IEEE
for healthcare using speech and facial expressions,” J. Select. Topics Signal Process., 14 (2020), 367–379.
[23]: F. Teixeira, J. Fernandes, V. Guedes, A. Junior, J.
Journal of medical systems, vol. 40, no. 12, p. 272,
P. Teixeira, Classification of control/pathologic
2016.
subjects with support vector machines, Proced.
[11]: A. Al-Nasheri, G. Muhammad, M. Alsulaiman, Comput. Sci., 138 (2018), 272–279.
Z. Ali, K. H. Malki, T. A. Mesallam, et al., Voice [24]: J. P. Teixeira, P. O. Fernandes, N. Alves, Vocal
pathology detection and classification using auto- acoustic analysis—classification of dysphonic voices
correlation and entropy features in different frequency with artificial neural networks, Proced. Comput. Sci.,
regions, IEEE Access, 6, 6961–6974. 121 (2017), 19–26.
[12]: Z. Ali, M. Alsulaiman, G. Muhammad, I. [25]: R. Amami, A. Smiti, An incremental method
Elamvazuthi, A. Al-Nasheri, T. A. Mesallam, K. H. combining density clustering and support vector
Malki, et al., Intra- and inter-database study for Arabic, machines for voice pathology detection, Comput.
English, and German databases: Do conventional Electr. Eng., 57 (2017), 257–265.
speech features detect voice pathology?, J. Voice, 31 [26]: J. D. Arias-Londoño, J. I. Godino-Llorente, N.
(2017), 386.e1–e8. Sáenz-Lechón, V. Osma-Ruiz, G. Castellanos-
[13]: E. S. Fonseca, R. C. Guido, S. B. Junior, H. Domínguez, An improved method for voice pathology
Dezani, R. R. Gati, D. C. Mosconi Pereira, Acoustic detection by means of a HMM-based feature space
investigation of speech pathologies based on the transformation, Patt. Recogn., 43 (2010), 3100–3112.
discriminative paraconsistent machine (DPM), [27]: M. K. Arjmandi, M. Pooyan, M. Mikaili, M.
Biomed. Signal Process. Control, 55 (2020). Vali, A. Moqarehzadeh, Identification of voice
[14]: J. A. Gómez-García, L. Moro-Velázquez, J. disorders using long-time features and support vector
Mendes-Laureano, G. Castellanos-Dominguez, J. I. machine with different feature reduction methods, J.
Godino-Llorente, Emulating the perceptual capabilities Voice, 25 (2011), e275–e289.
of a human evaluator to map the GRB scale for the [28]: R. R. A. Barreira, L. L. Ling, Kullback–leibler
divergence and sample skewness for pathological

11
[29]: H. Cordeiro, J. Fonseca, I. Guimarães, C. Meneses, Hierarchical classification and system combination for automatically identifying physiological and neuromuscular laryngeal pathologies, J. Voice, 31 (2017), 384.

[30]: H. T. Cordeiro, C. M. Ribeiro, Spectral envelope first peak and periodic component in pathological voices: A spectral analysis, Proced. Comput. Sci., 138 (2018), 64–71.

[31]: S. H. Fang, Y. Tsao, M. J. Hsiao, J. Y. Chen, Y. H. Lai, F. C. Lin, et al., Detection of pathological voice using cepstrum vectors: A deep learning approach, J. Voice, 33 (2019), 634–641.

[32]: J. I. Godino-Llorente, R. Fraile, N. Sáenz-Lechón, V. Osma-Ruiz, P. Gómez-Vilda, Automatic detection of voice impairments from text-dependent running speech, Biomed. Signal Process. Control, 4 (2009), 176–182.

[33]: A. Mahmood, A solution to the security authentication problem in smart houses based on speech, Proced. Comput. Sci., 155 (2019), 606–611.

[34]: G. Muhammad, M. Melhem, Pathological voice detection and binary classification using MPEG-7 audio features, Biomed. Signal Process. Control, 11 (2014), 1–9.

[35]: J. Nayak, P. S. Bhat, R. Acharya, U. V. Aithal, Classification and analysis of speech abnormalities, ITBM-RBM, 26 (2005), 319–327.

[36]: P. Henriquez, J. B. Alonso, M. A. Ferrer, C. M. Travieso, J. I. Godino-Llorente, F. Diaz-de-Maria, Characterization of healthy and pathological voice through measures based on nonlinear dynamics, IEEE Transact. Audio Speech Lang. Process., 17 (2009), 1186–1195.

[37]: P. Salehi, Using patient's speech signal for vocal fold disorders detection based on lifting scheme, 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), 5–6 Nov. 2015, pp. 561–568.

[38]: N. Sáenz-Lechón, J. I. Godino-Llorente, V. Osma-Ruiz, P. Gómez-Vilda, Methodological issues in the development of automatic systems for voice pathology detection, Biomed. Signal Process. Control, 1 (2006), 120–128.

[39]: C. M. Travieso, J. B. Alonso, J. R. Orozco-Arroyave, J. F. Vargas-Bonilla, E. Nöth, A. G. Ravelo-García, Detection of different voice diseases based on the nonlinear characterization of speech signals, Expert Systems Appl., 82 (2017), 184–195.

[40]: T. A. Mesallam, F. Mohamed, K. H. Malki, A. Mansour, A. Zulfiqar, A. N. Ahmed, et al., Development of the Arabic voice pathology database and its evaluation by using speech features and machine learning algorithms, J. Healthc. Eng., (2017), 1–13.

[41]: S. E. Shia, T. Jayasree, Detection of pathological voices using discrete wavelet transform and artificial neural networks, 2017 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), 23–25 March 2017, pp. 1–6.

[42]: K. Daoudi, B. Bertrac, On classification between normal and pathological voices using the MEEI-KayPENTAX database: Issues and consequences, INTERSPEECH 2014, Sep. 2014, Singapore. hal-01010857.

[43]: N. Sáenz-Lechón, J. I. Godino-Llorente, V. Osma-Ruiz, P. Gómez-Vilda, Methodological issues in the development of automatic systems for voice pathology detection, Biomed. Signal Process. Control, 1 (2006), 120–128.

[44]: L. R. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

[45]: M. Slaney, Auditory Toolbox: a Matlab toolbox for auditory modeling, Technical Report, Interval Research Corporation, 1998, 29–32.

[46]: M. Weeks, Digital Signal Processing Using Matlab and Wavelets, Infinity Science Press LLC, 2006.

[47]: E. S. Fonseca, R. C. Guido, P. R. Scalassara, et al., Wavelet time-frequency analysis and least squares support vector machines for the identification of voice disorders, Comput. Biol. Med., 37 (2007), 571–578.

[48]: M. Farrús, J. Hernando, P. Ejarque, "Jitter and shimmer measurements for speaker recognition," in Eighth Annual Conference of the International Speech Communication Association, 2007.

[49]: S. E. Levinson, L. R. Rabiner, M. M. Sondhi, An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition, Bell Syst. Tech. J., 62 (1983), 1035–1074.

[50]: L. R. Rabiner, B. H. Juang, Speech recognition: statistical methods, Encyclopedia of Language & Linguistics, 2nd ed., 1–18.

[51]: P. V. S. Rao, VOICE: an integrated speech recognition synthesis system for the Hindi language, Speech Commun., 13 (1993), 197–205.

[52]: P. Motlíček, Feature extraction in speech coding and recognition, Technical Report of PhD research internship in ASP Group, 2002.

[53]: S. L. Salzberg, "C4.5: Programs for machine learning by J. Ross Quinlan, Morgan Kaufmann Publishers, Inc., 1993," Machine Learning, vol. 16, no. 3, pp. 235–240, 1994.

[54]: A. E. Aronson, Clinical Voice Disorders, 3rd ed., Thieme Medical Publishers, New York, 1990, pp. 3–11.

[55]: L. Verde, G. De Pietro, G. Sannino, Voice disorder identification by using machine learning techniques, IEEE Access, 6 (2018), 16246–16255.
[56]: A. G. David, J. B. Magnus, Diagnosing Parkinson by using artificial neural networks and support vector machines, Global J. Comput. Sci. Technol., (2009), 63–71.

[57]: A. Al-Nasheri, G. Muhammad, M. Alsulaiman, Z. Ali, Investigation of voice pathology detection and classification on different frequency regions using correlation functions, J. Voice, 31 (2017), 3–15.

[58]: K. Ezzine, M. Frikha, Investigation of glottal flow parameters for voice pathology detection on SVD and MEEI databases, 2018 4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 21–24 March 2018, pp. 1–6.

[59]: M. Hariharan, K. Polat, R. Sindhu, S. Yaacob, A hybrid expert system approach for telemonitoring of vocal fold pathology, Appl. Soft Comput., 13 (2013), 4148–4161.

[60]: A. Al-Nasheri, G. Muhammad, M. Alsulaiman, Z. Ali, T. A. Mesallam, M. Farahat, et al., An investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification, J. Voice, 31 (2017), 113.e9–e18.

[61]: J. Moon, S. Kim, An approach on a combination of higher-order statistics and higher-order differential energy operator for detecting pathological voice with machine learning, 2018 International Conference on Information and Communication Technology Convergence (ICTC), 17–19 Oct. 2018, pp. 46–51.

[62]: P. Bradley, Voice disorders: Classification, Otolaryngol. Head Neck Surgery, (2010), 555–562.

[63]: D. Reynolds, Gaussian Mixture Models, in: S. Z. Li, A. Jain (eds.), Encyclopedia of Biometrics, Springer, Boston, MA, 2009.

[64]: L. Breiman, Bagging predictors, Mach. Learn., 24 (1996), 123–140.

[65]: S. Indolia, A. Goswami, S. Mishra, P. Asopa, Conceptual understanding of convolutional neural network - A deep learning approach, Proced. Comput. Sci., 132 (2018), 679–688.

[66]: R. Yamashita, M. Nishio, R. Do, K. Togashi, Convolutional neural networks: An overview and application in radiology, Insights Imag., 9 (2018), 611–629.

[67]: D. D. Mehta, R. E. Hillman, Voice assessment: Updates on perceptual, acoustic, aerodynamic, and endoscopic imaging methods, Curr. Opin. Otolaryngol. Head Neck Surg., 16 (2008), 211.

[68]: P. Harar, Z. Galaz, J. Alonso-Hernandez, J. Mekyska, R. Burget, Z. Smekal, Towards robust voice pathology detection, Neural Comput. Appl., 2018.
