Voice Disorder Classification Using Speech Enhancement and Deep Learning Models
Mounira Chaiani a, Sid Ahmed Selouani b,*, Malika Boudraa a, Mohammed Sidi Yakoub b

a Laboratory of Speech Communication and Signal Processing, University of Sciences and Technology Houari Boumediene, Algiers, Algeria
b Research Laboratory in Human-System Interaction, Université de Moncton, Shippagan Campus, New Brunswick E8S 1P6, Canada
ARTICLE INFO

Article history:
Received 12 June 2021
Received in revised form 28 February 2022
Accepted 4 March 2022
Available online 17 March 2022

Keywords:
Voice disorder classification
CHNR
Speech enhancement
CNN-LSTM
Trigonometric activation function

ABSTRACT

With the recent development of speech-enabled interactive systems using artificial agents, there has been substantial interest in the analysis and classification of voice disorders to provide more inclusive systems for people living with specific speech and language impairments. In this paper, a two-stage framework is proposed to perform an accurate classification of diverse voice pathologies. The first stage consists of speech enhancement processing based on the original premise that considers impaired voice as a noisy signal. To put this hypothesis into practice, the noise level is estimated by the cepstral harmonic-to-noise ratio (CHNR). The second stage consists of a convolutional neural network with long short-term memory (CNN-LSTM) architecture designed to learn complex features from spectrograms of the first-stage enhanced signals. A new sinusoidal rectified unit (SinRU) is proposed to be used as an activation function by the CNN-LSTM network. The experiments are carried out by using two subsets of the Saarbruecken voice database (SVD) with different etiologies covering eight pathologies. The first subset contains voice recordings of patients with vocal cordectomy, psychogenic dysphonia, pachydermia laryngis and frontolateral partial laryngectomy, and the second subset contains voice recordings of patients with vocal fold polyp, chronic laryngitis, functional dysphonia, and vocal cord paresis. Dysarthria severity level identification on the Nemours and Torgo databases is also carried out. The experimental results showed that using the minimum mean square error (MMSE)-based signal enhancer prior to the CNN-LSTM network using SinRU led to a significant improvement in the automatic classification of the investigated voice disorders and dysarthria severity levels. These findings support the hypothesis that an appropriate speech enhancement preprocessing has positive effects on the accuracy of the automatic classification of voice pathologies, thanks to the reduction of the intrinsic noise induced by the voice impairment.

© 2022 Nalecz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences. Published by Elsevier B.V. All rights reserved.
* Corresponding author at: Information Management Dept., Université de Moncton, 218 Boul. J.-D. Gauthier, Shippagan, New Brunswick E8S 1P6, Canada.
E-mail addresses: mchaiani@usthb.dz (M. Chaiani), sid-ahmed.selouani@umoncton.ca (S.A. Selouani), mboudraa@usthb.dz (M. Boudraa), mohammed.sidi.yakoub@umoncton.ca (M. Sidi Yakoub).
https://doi.org/10.1016/j.bbe.2022.03.002
0168-8227/© 2022 Nalecz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences. Published by Elsevier B.V. All rights reserved.
464 diabetes research and clinical practice 4 2 ( 2 0 2 2 ) 4 6 3 –4 8 0
descriptive vector to compare the performance of SVM, GMM and DNN with three hidden layers, which gave the best accuracy. The authors in [23] opted for MFCCs as inputs for the DNN, which essentially consists of two sparse autoencoders. A comparison of objective and subjective detection of laryngeal cancer was conducted in [24]. The subjective approach consisted of playing back the voice recordings to 4 volunteers, including 2 trained laryngologists. The objective method was based on algorithms such as SVM, XGBoost, light gradient boosted machine (LGBM), artificial neural network (ANN), one-dimensional convolutional neural network (1D-CNN) and two-dimensional convolutional neural network (2D-CNN). The feature extraction provided MFCCs, short-time Fourier transform (STFT), jitter, shimmer, harmonic-to-noise ratio (HNR), fundamental frequency and the raw voice signal. According to their evaluation on a database of 50 males with laryngeal cancer and 45 healthy males, most of the automatic methods surpassed the expert diagnosis, which was 69.9% accurate. The raw signal processed by a 1D-CNN gave the best result of 85.2%.

It is worth mentioning that most recent studies opted for pairwise classification, i.e., the classification between two voice pathologies at a time (one vs. one or one vs. all configurations). This pairwise classification was carried out by the systems presented in [6–8,19] on three common voice pathologies (cysts, polyps and paralysis) extracted from the SVD, Arabic voice pathology database (AVPD) and Massachusetts eye and ear infirmary (MEEI) databases. In [12], the pairwise approach was applied to classify 5 pathologies (adductor spasmodic dysphonia, keratosis, vocal fold nodules, vocal fold polyp and paralysis) of the MEEI database. The pairwise-based system presented in [9] classified 5 pathologies (dysphonia, laryngitis, funktionelle dysphonia, rekurrensparese and hyperfunktionelle dysphonia) of the SVD and 4 pathologies (Alzheimer, Parkinson, chronic laryngitis and paralysis) of a private database.

In real-life conditions, the pairwise approach is not helpful in regard to the time needed to perform an accurate diagnosis. A precise identification of one pathology that has to be recognized among many others leads to a multiclass identification, which is by far a more complicated task.

1.3. Goal and main contributions

In this work, a new approach is proposed to provide an efficient solution that allows the precise multiclass identification of vocal pathologies. For this purpose, various configurations are investigated to assess the relevance of a speech enhancement pretreatment and the optimal structure based on deep neural networks for a multiclass recognition of voice disorders. This is far more complex than the widely used pairwise approaches. Besides this, we have extended the application of the proposed system to the classification of dysarthria severity levels. The goal is to evaluate the effectiveness of the enhancement approach within a specific speech pathology, namely dysarthria. Our original contributions are threefold:

(i) to design a two-stage multiclassifier of voice disorders where the first stage is a speech enhancement module and the second stage is a CNN-LSTM that performs the identification of voice impairments by using two subsets of the Saarbruecken voice database (SVD) [25] with different etiologies covering the following eight pathologies: vocal cordectomy, psychogenic dysphonia, pachydermia laryngis, frontolateral partial laryngectomy, vocal fold polyp, chronic laryngitis, functional dysphonia, and vocal cord paresis; an automatic classification of dysarthria severity levels is also performed by using the Nemours [26] and Torgo [27] databases;

(ii) to propose the cepstral harmonic-to-noise ratio (CHNR) as an innovative estimator of pathological noise to enable proper functioning of the enhancement module;

(iii) to propose the sinusoidal rectified unit (SinRU) as a new activation function that is expected to improve the performance of the CNN-LSTM classifier.

The remainder of this paper is structured as follows. Section 2 introduces the two-stage voice pathology classification, which is composed of a voice enhancement module based on the CHNR for noise estimation and a deep neural network. Section 3 presents the databases, evaluation metrics, evaluated speech enhancement algorithms, and proposed activation function, as well as the results of the evaluation on two sets of pathological voices and on dysarthria severity levels. We discuss the obtained experimental results in Section 4. Finally, conclusions are outlined in Section 5.

2. Proposed approach: Deep learning of pathologically enhanced speech

In this section, the proposed voice pathology classification system is presented. As illustrated by the block diagram in Fig. 1, two main stages cooperate to achieve the classification process. The first stage focuses on enhancing the pathological voice signal based on the premise that there are some similarities between disordered voice and noisy speech uttered in adverse conditions. Unlike studies dealing with adverse conditions such as those under the channel effect [28] or with an external source of noise [29], our approach makes the hypothesis that a pathological voice is intrinsically noisy. This assumption means that the voice impairment induces a "noise effect" on the voice produced by a person suffering from a voice disorder. Based on preliminary observations and experimental evidence presented in [30,31], the use of speech enhancement techniques to improve the quality and intelligibility of pathological speech is logically adopted to reduce the effect of this induced noise. Although the origin of the degradation differs, we can formulate the hypothesis that conventional noisy speech results from the degradation of the environment, while pathological voice results from the degradation of the speech production system. During the second stage, the enhanced signal is projected into the time–frequency domain through a spectrogram representation. This 2D representation is used by the deep learning-based feature learning module to extract the relevant characteristics allowing the classification of voice pathologies.
2.1. Enhancement of pathological voice

Speech enhancement aims at reducing the impact of noise without deteriorating the quality and intelligibility of speech. Recently, the application of speech enhancement was extended to improve the robustness of dysarthric speech recognition systems [30,31], since a positive correlation was found between the intelligibility scores of dysarthric and noisy speech [32]. To assess the impact of using speech enhancement techniques, our study focuses on dysphonia, which is characterized by alterations in voice pitch, loudness and quality. Impaired voice quality can be described as hoarseness [33]. Hoarseness, quantified by the harmonics-to-noise ratio [34], is considered to be the class that encompasses breathiness and roughness [35] and is thus technically perceived as noise or strange tones [36]. The level of hoarseness can be estimated by evaluating the extent to which noise replaces the harmonic structure in the spectrogram. The relationship between the harmonic component and noise is quantified as the harmonics-to-noise ratio (HNR). This quantifiable indicator can constitute an alternative to the conventional signal-to-noise ratio (SNR) and therefore offers the possibility of using speech enhancement techniques to improve the quality of pathological voice. For the purposes of this study, we use an adapted HNR developed in the cepstral domain. This calculation is detailed in the context of the minimum mean square error (MMSE)-based signal enhancer [37].

Let us consider a noisy voice signal $x(i)$, which is the sum of a clean voice signal $s(i)$ and a noise $n(i)$:

$$x(i) = s(i) + n(i). \quad (1)$$

The short-term Fourier transform of the noisy signal is given by

$$X(t,f) = S(t,f) + N(t,f), \quad (2)$$

where $0 \le t \le T$ and $0 \le f \le F$ are the time frame and spectral component, respectively. To estimate the clean voice signal $\hat{S}$, the noisy signal is attenuated by a gain function $G$:

$$\hat{S}(t,f) = G \, X(t,f). \quad (3)$$

The signals are enhanced by the log-MMSE estimator, which minimizes the mean-square error between the log-magnitude spectra of the clean and estimated signals, leading to noise attenuation without distorting the signal too much [38]. Its spectral gain is given by

$$G(\varepsilon,\nu) = \frac{\varepsilon(t,f)}{1+\varepsilon(t,f)} \exp\!\left( \frac{1}{2} \int_{\nu(t,f)}^{\infty} \frac{e^{-y}}{y} \, dy \right), \quad (4)$$

$$\nu(t,f) = \gamma(t,f)\, \frac{\varepsilon(t,f)}{1+\varepsilon(t,f)}, \quad (5)$$

where $\gamma$ and $\varepsilon$ are the a posteriori and a priori SNRs, respectively, defined by

$$\gamma(t,f) = \frac{|X(t,f)|^2}{E\!\left[|N(t,f)|^2\right]}, \quad (6)$$

and

$$\varepsilon(t,f) = \frac{E\!\left[|S(t,f)|^2\right]}{E\!\left[|N(t,f)|^2\right]}. \quad (7)$$

Since $\varepsilon$ cannot be calculated in concrete terms, it is estimated by the decision-directed approach:

$$\varepsilon(t,f) = a\, \gamma(t-1,f)\, G(t-1,f)^2 + (1-a)\, \max\big(\gamma(t,f)-1,\, 0\big), \quad (8)$$

where $a$ is a smoothing coefficient. The noise $E\!\left[|N(t,f)|^2\right]$ is estimated from the silent frames of the noisy signal.

The data used throughout our experiments are the sustained vowels /a/ that are extracted from the SVD corpus. The vowels are characterized by their quasiperiodicity, which results from vocal cord vibration at a fundamental frequency $F_0$. Since these signals do not contain any silence, we replaced the a posteriori SNR $\gamma$ with a CHNR whose calculation steps, illustrated by Fig. 2, are detailed below.

The CHNR is a ratio of overall energy to noise energy (in dB) and is used in evaluating voice quality [34]. It measures the amount of noise in a voice signal, and its calculation is based on the cepstrum [39]. Segments are first framed with an adaptive window whose length is five times the fundamental period $1/F_0$; the nonlinear least squares (NLS) estimator of $F_0$ [40] is used. Based on the hypothesis that the voice signal can be expressed as a source signal filtered by the vocal tract, the log magnitude spectrum of the windowed signal results in the summation of the logarithms of the source spectrum $\log X(f)$ and the spectral envelope $\log H(f)$, as shown by

$$Y(f) = \log |S(f)| = \log X(f) + \log H(f). \quad (9)$$

Fig. 2 – CHNR calculation steps.
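For illustration, the log-MMSE gain of Eqs. (4)–(5) and the decision-directed estimate of Eq. (8) can be sketched in NumPy/SciPy; the exponential integral in Eq. (4) is `scipy.special.exp1`. The function and variable names below are our own, since the paper does not publish code.

```python
import numpy as np
from scipy.special import exp1  # E1(nu) = integral from nu to inf of exp(-y)/y dy

def log_mmse_gain(xi, gamma_):
    """Log-MMSE spectral gain, Eqs. (4)-(5): G = xi/(1+xi) * exp(0.5 * E1(nu))."""
    nu = gamma_ * xi / (1.0 + xi)                      # Eq. (5)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(nu))    # Eq. (4)

def decision_directed_xi(gamma_prev, gain_prev, gamma_cur, a=0.98):
    """A priori SNR estimate, Eq. (8); `a` is the smoothing coefficient."""
    return a * (gain_prev ** 2) * gamma_prev + (1.0 - a) * np.maximum(gamma_cur - 1.0, 0.0)
```

In the proposed scheme, the a posteriori SNR `gamma_cur` would itself be replaced by the CHNR-derived value described below, rather than a noise estimate from silent frames.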
To separate the noise in the source signal from the harmonic part and the spectral envelope, a transition to the real cepstral domain, Eq. (10), is required, and a liftering of the cepstrum is processed:

$$y(\tau) = \mathrm{IFFT}\big(\log |S(f)|\big), \quad (10)$$

with $\tau$ the quefrency.

The lifter consists of the following steps:

- detection of the rahmonic peaks in the cepstrum;
- first derivation of the cepstrum and detection of the sign change on the two sides of the peak to determine the peak width;
- zeroing of the rahmonic peak to obtain the comb-liftered part $C(\tau)$.

The Fourier transform of the comb-liftered part is an approximation of the noise spectrum $\mathrm{ApNoise}(f)$, which is corrected, Eq. (13), by subtracting the deviance $D(f)$, Eq. (12), so that the noise spectrum drops below the harmonic minima:

$$\mathrm{ApNoise}(f) = \mathrm{FFT}\big(C(\tau)\big), \quad (11)$$

with $(i-1)\,F_0 < f < i\,F_0$ and $i \ge 1$,

$$\mathrm{Noise}(f) = \mathrm{ApNoise}(f) - D(f). \quad (13)$$

The CHNR is then given by

$$\mathrm{CHNR}_b = 20\,\big(Y_b - \mathrm{Noise}_b\big), \quad (14)$$

where $0 \le b \le f_s/2$ is used to define a frequency band, with $f_s$ the sampling frequency.

It can be seen that $\mathrm{CHNR}_b$ is the a posteriori SNR in dB. We therefore replaced $\gamma$ in Eq. (8) with $10^{\mathrm{CHNR}_b/10}$. Once the signals are enhanced, their representation in the time–frequency domain is calculated through the spectrogram, which is used throughout all of the experiments. The use of spectrograms is motivated by the two-dimensional nature of this feature representation, which makes it well suited to convolutional networks. This representation visualizes the entire spectral decomposition of the voice signal on a single graph and is obtained from the squared magnitude of the short-time Fourier transform. The latter is the Fourier transform of the signal $x(n)$ windowed by an $N$-length window $w(n-t)$ shifted across the signal, as given by

$$X_{\mathrm{STFT}}(t,f) = \sum_{n=0}^{N-1} x(n)\, w(n-t)\, e^{-j 2\pi n f / N}. \quad (15)$$

The resulting spectrograms are fed into the feature learner.

2.2. CNN-LSTM architecture for voice disorder classification

The proposed network, illustrated in Fig. 3, learns to extract local features from the spectrogram by one 1D convolutional layer with a rectified linear unit (ReLU) activation function and one maximum pooling layer. Moreover, it learns to extract global features using one LSTM layer. The circled unit in Fig. 3 is magnified in Fig. 4. The CNN-LSTM combination is then used to classify pathological voices. The application of the 1D convolution on the spectrogram was not previously used except by [41] for seizure detection on electroencephalogram signals.

The convolution layer, which aims to learn to extract local features from its input, includes several filters, also known as convolution kernels. As shown in Fig. 3, given a spectrogram $I \in \mathbb{R}^{T \times F}$ with $T$ frames and $F$ frequency bins, the convolution layer convolves $I$ with $L$ kernels $K_l \in \mathbb{R}^{M \times F}$, Eq. (16). The multitude of kernels allows the creation of multiple preactivated feature maps $C_l \in \mathbb{R}^{T_c \times 1}$, with $T_c = T - M + 1$:

$$C_l(u) = \mathrm{conv}(I, K)_u = \sum_{i=1}^{M} \sum_{j=1}^{F} K_l(i,j)\, I(u+i,\, j). \quad (16)$$

Each of the $L$ preactivated feature maps is obtained once the kernel has been shared by all temporal components of the spectrogram.

To allow the network to learn complex data, an elementwise nonlinear function is applied to the preactivated feature maps. The ReLU function [42], which sets its negative inputs to zero and preserves the positive input values, allows rapid training of the network thanks to its simple form $\sigma(C_l) = \max(C_l, 0)$ and performs very well in several contexts [43].

To reduce the resolution of the features and increase their robustness, the data must pass through a pooling layer. The max pooling layer [44] performs a nonlinear subsampling of its input. It calculates the maximum of each nonoverlapped region of length $p$ along the time axis:

$$P_l(s) = \max_{i=0}^{p-1} \big\{ A_l(sp - i) \big\}, \quad s = 1, 2, \ldots, \frac{T_c}{p}. \quad (18)$$

Features that have a very low probability of being active can be separated using this layer. The resulting feature maps are fed into an LSTM layer to capture their short- and long-term contextual dependencies. LSTM is a recurrent neural network that learns the short- and long-term dependencies of its input sequences. It is composed of $G$ memory blocks, which in turn consist of an activation cell and an input, a forget and an output gate, as shown in Fig. 4.

At time $t$, the current input data $X_t$ and the data from the previous hidden state $h_{t-1}$ are fed to the memory block to update its state, defined by Eq. (23). This is accomplished using Eqs. (19)–(22):

$$F_g = \mathrm{sigmoid}\big( W_g^f X_t + U_g^f h_{t-1} + b_g^f \big) \quad (19)$$

$$I_g = \mathrm{sigmoid}\big( W_g^i X_t + U_g^i h_{t-1} + b_g^i \big) \quad (20)$$

$$C_g = \tanh\big( W_g^c X_t + U_g^c h_{t-1} + b_g^c \big) \quad (21)$$

$$O_g = \mathrm{sigmoid}\big( W_g^o X_t + U_g^o h_{t-1} + b_g^o \big) \quad (22)$$
$$c_t = I_g \odot C_g + F_g \odot c_{t-1} \quad (23)$$

$$h_t = O_g \odot \tanh(c_t) \quad (24)$$

where $W_g$ and $U_g$ are weight matrices, $b_g$ are bias vectors, and $\odot$ is the elementwise (Hadamard) product. $F_g$, the forget gate output, determines which information from the previous cell state $c_{t-1}$ to forget and which to retain. $I_g$, the input gate output, determines the important information of $C_g$, the input modulation, after their multiplication. The activation cell, as given by Eq. (23), updates the state of the cell by summing the important information of the input data and the information retained from the previous cell state. $O_g$, the output gate output, controls the hidden state of the next cell, $h_t$, defined by Eq. (24). Some outputs of the LSTM layer are dropped out to prevent overfitting [45].

To finally classify the pathologies, an output layer with a number of neurons equal to the number of classes $B$ is required. Using the softmax activation function, the outputs of the output layer are

$$S_i(x) = \frac{e^{x_i}}{\sum_{j=1}^{B} e^{x_j}}. \quad (25)$$

Algorithm 1: Proposed voice disorder classification system
Input: Pathological voice signals s
Output: Optimized parameters of the learner and the pathology label for the input signal
  s_enf = enframe(s, window-size = 32 ms, frame-shift = 16 ms)
  for each frame of s_enf do
      Calculate F0.
  end for
  for each 5 periods (1/F0) of s do
      Calculate CHNR
  end for
  s_enh = logMMSE(s, CHNR)
  spec = spectrogram(s_enh, window = 30 ms, shift = 15 ms)
  Create a set of D mini spectrograms of 36 frames
  for d ∈ D do
      S = CNN-LSTM(d)
      b̂ = argmax(S)
      f(b, b̂) = CrossEntropyLoss(b, S)
      Backpropagate gradient and update learner parameters
  end for
F0 is the fundamental frequency. b and b̂ are the true label and the predicted one, respectively.
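The softmax of Eq. (25) and the prediction/loss steps of Algorithm 1 can be sketched in NumPy (helper names are ours, for illustration only):

```python
import numpy as np

def softmax(x):
    """Eq. (25); the max-shift improves numerical stability without
    changing the result."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(true_class, scores):
    """Cross-entropy loss between the true class index b and the
    softmax scores S, as used in Algorithm 1."""
    return -np.log(scores[true_class])

def predict(scores):
    """b_hat = argmax(S), as in Algorithm 1."""
    return int(np.argmax(scores))
```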
Fig. 5 – Comparison of the objective quality measures for the conventional and proposed MMSE-based enhancement
methods applied to each pathology: (a) short-time objective intelligibility (STOI), (b) signal to noise ratio (SNR), and (c)
perceptual evaluation of speech quality (PESQ).
[4, 8, 16, 32, 64, 128], the grid of the kernel size is [2, 3, 4], the grid of the pool size is [2, 3, 4], the grid of LSTM units is [50, 100, 150, 200], and the grid of the dropout rate is [0.4, 0.5, 0.6]. After evaluation, the learner's convolution layer consists of 32 filters of size 4 with a stride of 1, allowing for the learning of 32 feature maps. The kernel size in the max pooling layer is 4. A hundred units constitute the LSTM layer, and the dropout rate is 0.4. The learner is trained over 200 epochs at a learning rate of 0.0001 using the Adam optimization algorithm [49].

The experiments were implemented using Python 3.5; spectrograms were calculated using the SciPy 1.4.1 library; and the network was constructed using the Keras 2.1.6 library on the TensorFlow 2.3.0 platform.

3.4. Evaluation metrics

The confusion matrix, accuracy and macro F1 score are used to evaluate the system performance. The rows and columns of the confusion matrix indicate the true classes and predicted classes, respectively. The diagonal of this matrix represents the number of correctly predicted spectrograms. The sum of the diagonal elements divided by the total number of tested spectrograms is the accuracy. The F1 score for each class is the harmonic mean of the sensitivity and the precision ratios, Eq. (27). For a class $B_1$, the precision is the ratio of correctly classified $B_1$ spectrograms to all spectrograms classified as $B_1$. The sensitivity is the ratio of correctly classified $B_1$ spectrograms to all tested $B_1$ spectrograms. The mean of the F1 scores of all classes is the macro F1:

$$\mathrm{F1\ score} = 2\, \frac{\mathrm{precision} \times \mathrm{sensitivity}}{\mathrm{precision} + \mathrm{sensitivity}}. \quad (27)$$

Each system is evaluated ten times due to the network's random initialization. The average accuracy of the system over the 10 iterations and its standard deviation, as well as the average and standard deviation of the F1 score, are provided.

3.5. Impact of the cross-validation design

An n-fold cross-validation approach is used in our experiments. The files of each class are divided into n folds. For each of the n iterations, n − 1 folds from each class are used for training, and the remaining fold is used for testing. After the final iteration, all files have been tested. To decide how to evaluate the system, we tested two designs. The first is the 5-fold cross-validation, motivated by maintaining speaker-independent experiments, since some speakers have multiple files. The second is the 10-fold cross-validation. By studying the details of the two configurations, we noted that the accuracy of the 10-fold system is more than 8% better than that of the 5-fold one. Further investigation has shown that this optimistic 10-fold accuracy is likely due to the presence of files from the same speaker in both the training and test data. Indeed, for the 10-fold cross-validation, 90% of the voice recordings of each class are used for training; as a result, some recordings of the same speaker may appear in the test set. Therefore, to keep the experiments speaker-independent, the 5-fold configuration is used.

3.6. Pathological voice enhancement

To explicitly show the benefits of using the CHNR, a direct comparison of the conventional MMSE (using Eq. (8)) and the modified MMSE (using the CHNR) is provided. The results obtained by using standard speech enhancement evaluation metrics, such as SNR, perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI), are reported in Fig. 5. As shown in Fig. 5, the proposed MMSE-based configuration performed better than the conventional MMSE whatever the evaluation metric.

In addition to the unsupervised log-MMSE enhancement algorithm, and for comparison purposes, spectral subtraction and Wiener filter enhancement methods are used, as well as a deep speech enhancement generative adversarial network (DSEGAN) and the Large-Deep complex U-Net with phone-fortified perceptual loss (LDCU-Net-PFPL), which is a recent supervised deep learning-based enhancement method [50].

Spectral subtraction is the simplest method of noise reduction. It consists of subtracting the average power of the noise from the instantaneous power spectrum of the noisy signal. This generates musical noise; hence the improvement that subtracts an overestimation of the noise [38].

Wiener filtering for speech enhancement applies an optimal linear filter to minimize the mean square error (MSE) between the clean signal and its estimate [38].
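The speaker-independent fold design of Section 3.5 can be sketched as follows. This is a minimal illustration with hypothetical speaker IDs; the paper does not specify its fold-assignment code. The key point is that whole speakers, not individual files, are assigned to folds, so no speaker appears in both training and test data.

```python
from collections import defaultdict

def speaker_independent_folds(files, n_folds=5):
    """Assign whole speakers to folds. `files` is a list of
    (speaker_id, filename) pairs; returns a list of n_folds file lists."""
    by_speaker = defaultdict(list)
    for spk, fname in files:
        by_speaker[spk].append(fname)
    folds = [[] for _ in range(n_folds)]
    # Round-robin speakers over folds, largest speakers first, to keep
    # the fold sizes roughly balanced.
    for i, spk in enumerate(sorted(by_speaker, key=lambda s: -len(by_speaker[s]))):
        folds[i % n_folds].extend(by_speaker[spk])
    return folds
```

Training then uses n − 1 folds and tests on the remaining one, exactly as in the 5-fold design retained by the authors.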
Table 6 – Voice disorder classification performance using several activation functions with original and enhanced signals. The task consists of identifying 4 classes: Cordectomy, Frontolateral laryngectomy, Pachydermia laryngis and Psychogenic dysphonia. Boldface indicates best performance.

Activation function     | ReLU         | Swish        | SELU         | GELU         | SinRU
Original — Accuracy     | 69.37±2.38   | 69.74±1.67   | 70.56±2.24   | 69.8±1.62    | 70.62±2.27
Original — F1 score     | 0.672±0.0236 | 0.678±0.0189 | 0.684±0.0242 | 0.68±0.0173  | 0.687±0.0249
MMSE 15 — Accuracy      | 70.94±2.18   | 71.82±1.99   | 71.8±1.66    | 72±2         | 72.69±1.8
MMSE 15 — F1 score      | 0.694±0.0254 | 0.701±0.0221 | 0.702±0.0183 | 0.703±0.0228 | 0.709±0.017
MMSE 25 — Accuracy      | 72±2.1       | 71.43±1.49   | 72.03±2.28   | 70.96±2.01   | 72.56±2.01
MMSE 25 — F1 score      | 0.705±0.022  | 0.697±0.0162 | 0.707±0.0245 | 0.694±0.0211 | 0.708±0.0209
MMSE 35 — Accuracy      | 71.81±2.18   | 70.84±1.97   | 71.65±1.36   | 71.07±2.04   | 72.83±1.14
MMSE 35 — F1 score      | 0.701±0.0234 | 0.69±0.0219  | 0.699±0.0187 | 0.697±0.0205 | 0.71±0.0173
Table 7 – Voice disorder classification performance using ReLU and SinRU activation functions with original and enhanced
signals for female and male. Boldface indicates best performance.
Female Male
Table 10 – Voice disorder classification performance using several activation functions with original and enhanced signals. The task consists of identifying 3 classes: Pachydermia laryngis, Psychogenic dysphonia and a combined class composed of Cordectomy and Frontolateral laryngectomy. Boldface indicates best performance.

Activation function     | ReLU         | Swish        | SELU         | GELU         | SinRU
Original — Accuracy     | 85.65±1.32   | 84.84±1.13   | 85.15±1.30   | 85.78±1.35   | 86.44±1.18
Original — F1 score     | 0.83±0.0224  | 0.822±0.016  | 0.827±0.0205 | 0.835±0.0169 | 0.845±0.0157
MMSE 15 — Accuracy      | 84.63±1.29   | 84.98±1.22   | 85.41±1.89   | 85.84±0.79   | 87.32±0.84
MMSE 15 — F1 score      | 0.827±0.0168 | 0.829±0.0164 | 0.829±0.0284 | 0.841±0.0104 | 0.855±0.0128
MMSE 25 — Accuracy      | 85.98±1.42   | 85.93±1.22   | 85.70±1.92   | 85.41±1.27   | 87.1±0.9
MMSE 25 — F1 score      | 0.84±0.0195  | 0.839±0.0164 | 0.834±0.0273 | 0.833±0.0168 | 0.853±0.0119
MMSE 35 — Accuracy      | 86.26±1.56   | 85.86±1.24   | 85.40±1.25   | 85.42±1.56   | 87.38±0.64
MMSE 35 — F1 score      | 0.839±0.0197 | 0.838±0.0172 | 0.829±0.0176 | 0.833±0.0168 | 0.854±0.0102
Table 11 – Voice disorder classification performance using ReLU and SinRU activation functions with original and enhanced signals. The task consists of identifying 4 classes: Vocal fold polyp, Chronic laryngitis, Functional dysphonia and Vocal cord paresis. Boldface indicates best performance.

Configuration | Female and male: ReLU & Original | Female and male: SinRU & MMSE 35 | Female: ReLU & Original | Female: SinRU & MMSE 35 | Male: ReLU & Original | Male: SinRU & MMSE 35
Accuracy      | 48.4±0.34   | 49.26±2.32  | 58.75±0.76  | 61.79±0.15  | 51.19±1.02  | 56.55±2.11
F1 score      | 0.428±0.01  | 0.443±0.018 | 0.546±0.011 | 0.588±0.013 | 0.485±0.007 | 0.515±0.019
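The activation functions compared in the tables above can be sketched in NumPy. Note that the exact SinRU definition appears in a part of the paper not reproduced here; the `sinru` form below, a sinusoidal term added to the positive part of a ReLU, is only an assumption consistent with the paper's description of "additional nonlinearity on the positive input part", not the authors' published formula.

```python
import numpy as np

def relu(x):
    # Standard rectified linear unit: max(x, 0)
    return np.maximum(x, 0.0)

def sinru(x):
    # Illustrative SinRU (assumed form): f(x) = x + sin(x) for x > 0, else 0.
    # This is a placeholder consistent with the textual description only.
    return np.where(x > 0, x + np.sin(x), 0.0)
```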
spectrograms in Fig. 6. Indeed, the results confirm that the use of pure deep learning approaches such as LDCU-Net-PFPL and DSEGAN in the proposed scheme was not as effective as the combination of conventional and deep-learning-based enhancement techniques. Recently, this combination has demonstrated its effectiveness in the case of external noises [68,69]. In our study, where the noises are assumed to be of intrinsic origin, the combination of the modified MMSE and DSEGAN achieved better performance. In the context of pathological signals, where pure noisy zones and clean signals are not available to train deep-learning-based methods, the support provided by conventional techniques seems to be the best strategy to face this unfavorable situation.

The experimental comparison between the different activation functions showed that the proposed SinRU function was the most efficient. The SinRU activation function improved the performance of the CNN-LSTM voice pathology classifier with the different enhancement techniques. This is probably due to the sinusoidal component, which introduces additional nonlinearity into the positive input part of the function. This allows better learning of complex and strongly nonlinear patterns from pathological spectrograms and may be the reason for its efficiency, which can lead to an accuracy of 87.38%.

In regard to the good performance achieved by the proposed CNN-LSTM framework for the classification of dysarthric severity levels, it becomes possible to consider its integration within pathological speech recognition systems to improve their effectiveness. Indeed, many recent approaches to pathological speech recognition recommend designing systems that are well adapted to the severity level of the speech disorder. For instance, in [70], a tempo adjustment approach is proposed to perform robust personalized dysarthric speech recognition. The results presented in this latter study showed that the system based on phoneme-based tempo adjustment performs best for moderate and severe cases. However, the authors also pointed out the need for a robust mapping model of dysarthric speech dynamics at the level of every single phoneme. This feature can be provided by our proposed system, which achieved less than a 1% error rate in distinguishing between dysarthric severity levels.

The development of assistive technologies for the assessment and therapeutic judgement of voice disorders opens the way for new opportunities in the field of telemedicine by providing screening tools that could be used in triage and pre-diagnosis, in a prospect of improving the current clinical routine. Prospects are also oriented towards the advantages provided by mobile health (m-health) systems, which can constitute a simple and rapid support for the detection of vocal pathologies. A flowchart of a possible m-health system for voice health classification using machine learning is presented in [14]. In the perspective of integrating the proposed system in telemedicine applications, it would be important to consider the aspects related to the compression of the speech signal on the transmission channels. Indeed, as

Recent advances in machine learning have provided an opportunity to automate the diagnosis and assessment of voice, paving the way to the development of clinically applicable tools that could be used by voice specialists. However, although the results demonstrate the general viability of machine learning algorithms for the assistive diagnosis of voice disorders, it is important to mention that the generalization of the results to new patients is not necessarily achieved nor measured properly, because the training is usually performed on limited or non-accessible datasets. Besides the shortcomings of the existing datasets, the gaps in the adopted evaluation methodologies and the lack of standardization of clinical assessment protocols constitute the main obstacles to universal comparison and interpretation [72]. Recent methodological developments show an awareness of the need to create favorable conditions by establishing guidelines for machine learning algorithms, such as those presented in [63], to reach sufficient maturity that will allow full acceptability in real-life healthcare settings and environments.

5. Conclusions

In this paper, a two-stage multiclass voice pathology classification framework was proposed. The first stage carried out pathological voice enhancement by estimating the noise in the cepstral domain using the CHNR. Supervised and unsupervised speech enhancement algorithms were evaluated, namely, spectral subtraction, Wiener filtering, log-MMSE, LDCU-Net-PFPL and DSEGAN. The second stage learned relevant feature maps from the spectrograms using a hybrid CNN-LSTM neural network. To strengthen the learning of complex patterns from spectrograms, a new activation function, SinRU, was proposed, and the results confirmed its effectiveness. The classification experiments were carried out considering the different etiologies reflected in two subsets of the SVD database, as well as dysarthria severity levels on the Nemours and Torgo databases.

The results showed that the proposed framework is etiology-independent and therefore could be used to perform a multiclass recognition of voice disorders, which is by far more complex than the widely used pairwise methods. In contrast to several approaches where fixed phonemic contexts are used, the deep structure of the proposed CNN-LSTM implicitly learns contextual information from feature maps encompassing relevant spectrogram components. Our study confirms that the use of sustained vocalization is an effective way to perform voice pathology analysis, as demonstrated by several studies evaluating the effects of non-invasive therapy as well as measurements of some pathological voice characteristics. The use of sustained vocalization also has the advantage of providing language-independent systems. However, the effect of speech signal compression should be considered in transmission
demonstrated in [71], compression algorithms have notice- channels because they modify the spacial area of vowels.
able effects on speech intelligibility by modifying the vowels’ This work formulates the original hypothesis that a patho-
space area. Our system does not use data compression. How- logical voice is intrinsically noisy. This assumption means
ever, in a telemedicine context, it is critical to be mindful of that the voice impairment induces a ‘‘noise effect” in sus-
these changes in intelligibility because our approach uses tained vocalization. The hypothesis is experimentally vali-
sustained vowels as units of classification. dated through the improvement achieved when
478 biocybernetics and biomedical engineering 42 (2022) 463–480

enhancement of voice segments is used prior to voice pathology classification. In the context of pathological signals, where the noise is considered to be of intrinsic origin, our study demonstrated the advantage of combining deep-learning-based methods and conventional speech enhancement techniques to achieve better classification performance.

The high classification accuracy of dysarthria severity levels achieved by our system opens up many possibilities for its joint use with recent dysarthric speech recognition systems. In the context of the need for a robust mapping model of dysarthric speech dynamics that many research studies have highlighted, our system can be placed upstream of dysarthric speech recognizers to adapt their configuration according to severity levels.

Finally, the experimental evidence of this study showed that the proposed two-stage architecture is effective in performing a multiclass identification of voice disorders. Nevertheless, in light of the thoughts shared by the authors in [73], we can state that both the features and the conceptual models used in this study are to a large extent relatively simple compared to the inexhaustible richness and variability of voice, speech, and language pathologies. Therefore, to overcome these limitations, it is necessary to deepen the multidisciplinary collaboration between machine learning, behavioral signal processing and clinical knowledge.

CRediT authorship contribution statement

Mounira Chaiani: Conceptualization, Writing - original draft, Writing - review & editing, Visualization, Investigation, Methodology. Sid Ahmed Selouani: Conceptualization, Methodology, Software, Writing - review & editing, Supervision, Validation. Malika Boudraa: Conceptualization, Supervision. Mohammed Sidi Yakoub: Data curation, Software, Validation, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] American Speech-Language-Hearing Association. Definitions of communication disorders and variations [relevant paper], available from www.asha.org/policy; 1993. URL: https://www.asha.org/policy/rp1993-00208/.
[2] Martins RHG, do Amaral HA, Tavares ELM, Martins MG, Gonçalves TM, Dias NH. Voice disorders: etiology and diagnosis. J Voice 2016;30(6):761.e1–761.e9.
[3] American Speech-Language-Hearing Association. Voice disorders (practice portal). Retrieved December 30, 2021. URL: https://www.asha.org/Practice-Portal/Clinical-Topics/Voice-Disorders/.
[4] Baker J. The role of psychogenic and psychosocial factors in the development of functional voice disorders. Int J Speech-Language Pathol 2008;10(4):210–30.
[5] Chandrakala S, Rajeswari N. Representation learning based speech assistive system for persons with dysarthria. IEEE Trans Neural Syst Rehabil Eng 2017;25(9):1510–7.
[6] Muhammad G, Altuwaijri G, Alsulaiman M, Ali Z, Mesallam TA, Farahat M, Malki KH, Al-nasheri A. Automatic voice pathology detection and classification using vocal tract area irregularity. Biocybern Biomed Eng 2016;36(2):309–17.
[7] Al-nasheri A, Muhammad G, Alsulaiman M, Ali Z, Mesallam TA, Farahat M, Malki KH, Bencherif MA. An investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification. J Voice 2017;31(1):113.e9–113.e18.
[8] Al-Nasheri A, Muhammad G, Alsulaiman M, Ali Z, Malki KH, Mesallam TA, Ibrahim MF. Voice pathology detection and classification using auto-correlation and entropy features in different frequency regions. IEEE Access 2018;6:6961–74.
[9] Hammami I, Salhi L, Labidi S. Voice pathologies classification and detection using EMD-DWT analysis based on higher order statistic features. IRBM 2020;41(3):161–71.
[10] Karan B, Sahu SS, Mahto K. Parkinson disease prediction using intrinsic mode function based features from speech signal. Biocybern Biomed Eng 2020;40(1):249–64.
[11] Hossain MS, Muhammad G. Healthcare big data voice pathology assessment framework. IEEE Access 2016;4:7806–15.
[12] Ali Z, Elamvazuthi I, Alsulaiman M, Muhammad G. Automatic voice pathology detection with running speech by using estimation of auditory spectrum and cepstral coefficients based on the all-pole model. J Voice 2016;30(6):757.e7–757.e19.
[13] Harar P, Galaz Z, Alonso-Hernandez JB, Mekyska J, Burget R, Smekal Z. Towards robust voice pathology detection. Neural Comput Appl 2018:1–11.
[14] Verde L, Pietro GD, Sannino G. Voice disorder identification by using machine learning techniques. IEEE Access 2018;6:16246–55.
[15] España-Bonet C, Fonollosa JAR. Automatic speech recognition with deep neural networks for impaired speech. In: Abad A, Ortega A, Teixeira A, Mateo CG, Hinarejos CDM, Perdigão F, Batista F, Mamede N, editors. Advances in Speech and Language Technologies for Iberian Languages. Cham: Springer International Publishing; 2016. p. 97–107.
[16] Zaidi BF, Selouani SA, Boudraa M, Yakoub MS. Deep neural network architectures for dysarthric speech analysis and recognition. Neural Comput Appl 2021:1–20.
[17] Alhussein M, Muhammad G. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access 2018;6:41034–41.
[18] Mohammed MA, Abdulkareem KH, Mostafa SA, Ghani MKA, Maashi MS, Garcia-Zapirain B, Oleagordia I, Alhakami H, AL-Dhief FT. Voice pathology detection and classification using convolutional neural network model. Applied Sciences 2020;10(11):3723.
[19] Alhussein M, Muhammad G. Automatic voice pathology monitoring using parallel deep models for smart healthcare. IEEE Access 2019;7:46474–9.
[20] Harar P, Alonso-Hernandezy JB, Mekyska J, Galaz Z, Burget R, Smekal Z. Voice pathology detection using deep learning: a preliminary study. In: 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI); 2017. p. 1–4.
[21] Wu H, Soraghan J, Lowit A, Di-Caterina G. A deep learning method for pathological voice detection using convolutional deep belief networks. In: Proc. Interspeech 2018; 2018. p. 446–50.
[22] Fang S-H, Tsao Y, Hsiao M-J, Chen J-Y, Lai Y-H, Lin F-C, Wang C-T. Detection of pathological voice using cepstrum vectors: A deep learning approach. J Voice 2019;33(5):634–41.
[23] Chen L, Chen J. Deep neural network for automatic classification of pathological voice signals. J Voice 2020.
[24] Kim H, Jeon J, Han YJ, Joo Y, Lee J, Lee S, Im S. Convolutional neural network classifies pathological voice change in laryngeal cancer with high accuracy. J Clin Med 2020;9(11):3415.
[25] Pützer M, Barry WJ. Saarbruecken Voice Database. Institut für Phonetik, Universität des Saarlandes; May 2007. URL: http://www.stimmdatenbank.coli.uni-saarland.de/help_en.php4.
[26] Menendez-Pidal X, Polikoff JB, Peters SM, Leonzio JE, Bunnell HT. The Nemours database of dysarthric speech. In: Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP'96), Vol. 3. IEEE; 1996. p. 1962–5.
[27] Rudzicz F, Namasivayam AK, Wolff T. The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resour Eval 2012;46(4):523–41.
[28] Hsu Y-T, Zhu Z, Wang C-T, Fang S-H, Rudzicz F, Tsao Y. Robustness against the channel effect in pathological voice detection. CoRR abs/1811.10376; 2018. arXiv:1811.10376.
[29] Souli S, Amami R, Yahia SB. A robust pathological voices recognition system based on DCNN and scattering transform. Appl Acoust 2021;177:107854.
[30] Bhat C, Das B, Vachhani B, Kopparapu SK. Dysarthric speech recognition using time-delay neural network based denoising autoencoder. In: Proc. Interspeech 2018; 2018. p. 451–5.
[31] Yakoub MS, Selouani SA, Zaidi B-F, Bouchair A. Improving dysarthric speech recognition using empirical mode decomposition and convolutional neural network. EURASIP J Audio Speech Music Processing 2020;2020(1):1–7.
[32] Borrie SA, Baese-Berk M, Engen KV, Bent T. A relationship between processing speech in noise and dysarthric speech. J Acoust Soc Am 2017;141(6):4660–7.
[33] Stachler RJ, Francis DO, Schwartz SR, Damask CC, Digoy GP, Krouse HJ, McCoy SJ, Ouellette DR, Patel RR, Reavis CCW, Smith LJ, Smith M, Strode SW, Woo P, Nnacheta LC. Clinical practice guideline: Hoarseness (dysphonia) (update). Otolaryngology-Head Neck Surgery 2018;158(1_suppl):S1–S42.
[34] Gómez-García J, Moro-Velázquez L, Arias-Londoño J, Godino-Llorente J. On the design of automatic voice condition analysis systems. Part III: review of acoustic modelling strategies. Biomed Signal Process Control 2021;66:102049.
[35] Moers C, Möbius B, Rosanowski F, Nöth E, Eysholdt U, Haderlein T. Vowel- and text-based cepstral analysis of chronic hoarseness. J Voice 2012;26(4):416–24.
[36] Aronson AE, Bless D. Clinical Voice Disorders. Thieme Publishers Series. Thieme; 2009.
[37] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 1985;33(2):443–5.
[38] Loizou P. Speech Enhancement: Theory and Practice. Signal Processing and Communications. Taylor & Francis; 2007.
[39] de Krom G. A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J Speech Language Hearing Res 1993;36(2):254–66.
[40] Nielsen JK, Jensen TL, Jensen JR, Christensen MG, Jensen SH. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing 2017;135:188–97.
[41] Jana GC, Sharma R, Agrawal A. A 1D-CNN-spectrogram based approach for seizure detection from EEG signal. Procedia Computer Science 2020;167:403–12.
[42] Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: ICML; 2010. p. 807–14.
[43] Witten IH, Frank E, Hall MA, Pal CJ. Chapter 10 - Deep learning. In: Witten IH, Frank E, Hall MA, Pal CJ, editors. Data Mining. 4th ed. Morgan Kaufmann; 2017. p. 417–66.
[44] Boureau Y-L, Ponce J, LeCun Y. A theoretical analysis of feature pooling in visual recognition. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). p. 111–8.
[45] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. J Mach Learning Res 2014;15(56):1929–58.
[46] Pützer M, Wokurek W. Electroglottographic and acoustic parametrization of phonatory quality provide voice profiles of pathological speakers. J Voice 2021.
[47] Sasaki CT. Accessed 2021-05-07. URL: https://www.merckmanuals.com/fr-ca/professional/affections-de-l-oreille,-du-nez-et-de-la-gorge/troubles-laryngiens/ulc%C3%A8res-de-contact-du-larynx.
[48] Mouawad F, Chevalier D, Santini L, Fakhry N, Bozec A, Espitalier F. Chapitre 9 - Traitement chirurgical par cervicotomie et reconstruction laryngée. In: Barry B, Malard O, Morinière S, editors. Cancers du Larynx. Paris: Elsevier Masson; 2019. p. 89–115.
[49] Kingma DP, Ba J. Adam: A method for stochastic optimization; 2017. arXiv:1412.6980.
[50] Hsieh T-A, Yu C, Fu S-W, Lu X, Tsao Y. Improving perceptual quality by phone-fortified perceptual loss using Wasserstein distance for speech enhancement. In: Proc. Interspeech 2021; 2021. p. 196–200.
[51] Phan H, McLoughlin IV, Pham L, Chén OY, Koch P, Vos MD, Mertins A. Improving GANs for speech enhancement. IEEE Signal Process Lett 2020;27:1700–4.
[52] Pouchoulin G, Fredouille C, Bonastre J-F, Ghio A, Giovanni A. Frequency study for the characterization of the dysphonic voices. In: Proc. Interspeech 2007; 2007. p. 1198–201.
[53] Hendrycks D, Gimpel K. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR abs/1606.08415; 2016. arXiv:1606.08415.
[54] Klambauer G, Unterthiner T, Mayr A, Hochreiter S. Self-normalizing neural networks. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017. p. 972–81.
[55] Ramachandran P, Zoph B, Le QV. Searching for activation functions. CoRR abs/1710.05941; 2017. arXiv:1710.05941.
[56] Kim M, Kim Y, Yoo J, Wang J, Kim H. Regularized speaker adaptation of KL-HMM for dysarthric speech recognition. IEEE Trans Neural Syst Rehabil Eng 2017;25(9):1581–91.
[57] Kadi KL, Selouani SA, Boudraa B, Boudraa M. Fully automated speaker identification and intelligibility assessment in dysarthria disease using auditory knowledge. Biocybern Biomed Eng 2016;36(1):233–47.
[58] Guedes V, Teixeira F, Oliveira A, Fernandes J, Silva L, Junior A, Teixeira JP. Transfer learning with AudioSet to voice pathologies identification in continuous speech. Procedia Computer Science 2019;164:662–9.
[59] Yilmaz E, Mitra V, Bartels C, Franco H. Articulatory features for ASR of pathological speech; 2018. arXiv:1807.10948.
[60] Oue S, Marxer R, Rudzicz F. Automatic dysfluency detection in dysarthric speech using deep belief networks. In: Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies. Dresden, Germany: Association for Computational Linguistics; 2015. p. 60–4.
[61] Brückl M, Ghio A, Viallet F. Measurement of tremor in the voices of speakers with Parkinson's disease. Procedia Computer Science 2018;128:47–54.
[62] Suppa A, Asci F, Saggio G, Leo PD, Zarezadeh Z, Ferrazzano G, Ruoppolo G, Berardelli A, Costantini G. Voice analysis with machine learning: One step closer to an objective diagnosis of essential tremor. Mov Disord 2021;36(6):1401–10.
[63] Rusz J, Tykalova T, Ramig LO, Tripoliti E. Guidelines for speech recording and acoustic analyses in dysarthrias of movement disorders. Mov Disord 2021;36(4):803–14.
[64] Nilsson C, Nyberg J, Strömbergsson S. How are speech sound disorders perceived among children? A qualitative content analysis of focus group interviews with 10–11-year-old children. Child Language Teaching Therapy 2021;37(2):163–75.
[65] Lin F-C, Chien H-Y, Kao Y-C, Wang C-T. Multi-dimensional investigation of the clinical effectiveness and prognostic factors of voice therapy for benign voice disorders. J Formos Med Assoc 2021.
[66] Suppa A, Asci F, Saggio G, Marsili L, Casali D, Zarezadeh Z, Ruoppolo G, Berardelli A, Costantini G. Voice analysis in adductor spasmodic dysphonia: Objective diagnosis and response to botulinum toxin. Parkinsonism Relat Disord 2020;73:23–30.
[67] Khan T, Westin J, Dougherty M. Classification of speech intelligibility in Parkinson's disease. Biocybern Biomed Eng 2014;34(1):35–45.
[68] Zezario RE, Huang J-W, Lu X, Tsao Y, Hwang H-T, Wang H-M. Deep denoising autoencoder based post filtering for speech enhancement. In: 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC); 2018. p. 373–7.
[69] Zhang Q, Nicolson A, Wang M, Paliwal KK, Wang C. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation. IEEE/ACM Trans Audio Speech Language Processing 2020;28:1404–15.
[70] Xiong F, Barker J, Christensen H. Phonetic analysis of dysarthric speech tempo and applications to robust personalised dysarthric speech recognition. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). p. 5836–40.
[71] Utianski RL, Sandoval S, Berisha V, Lansford KL, Liss JM. The effects of speech compression algorithms on the intelligibility of two individuals with dysarthric speech. Am J Speech-Language Pathology 2019;28(1):195–203.
[72] Patel RR, Awan SN, Barkmeier-Kraemer J, Courey M, Deliyski D, Eadie T, Paul D, Švec JG, Hillman R. Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function. Am J Speech-Language Pathology 2018;27(3):887–905.
[73] Corcoran C, Cecchi G. Using language processing and speech analysis for the identification of psychosis and other disorders. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 2020;5(8):770–9.