Voice Disorder Classification Using Speech Enhancement and Deep Learning Models


Biocybernetics and Biomedical Engineering 42 (2022) 463–480

Available at www.sciencedirect.com


journal homepage: www.elsevier.com/locate/bbe

Original Research Article

Voice disorder classification using speech enhancement and deep learning models

Mounira Chaiani a, Sid Ahmed Selouani b,*, Malika Boudraa a, Mohammed Sidi Yakoub b

a Laboratory of Speech Communication and Signal Processing, University of Sciences and Technology Houari Boumediene, Algiers, Algeria
b Research Laboratory in Human-System Interaction, Université de Moncton, Shippagan Campus, New Brunswick E8S 1P6, Canada

A R T I C L E   I N F O

Article history:
Received 12 June 2021
Received in revised form 28 February 2022
Accepted 4 March 2022
Available online 17 March 2022

Keywords:
Voice disorder classification
CHNR
Speech enhancement
CNN-LSTM
Trigonometric activation function

A B S T R A C T

With the recent development of speech-enabled interactive systems using artificial agents, there has been substantial interest in the analysis and classification of voice disorders to provide more inclusive systems for people living with specific speech and language impairments. In this paper, a two-stage framework is proposed to perform an accurate classification of diverse voice pathologies. The first stage consists of speech enhancement processing based on the original premise that considers impaired voice as a noisy signal. To put this hypothesis into practice, the noise level is estimated using the cepstral harmonic-to-noise ratio (CHNR). The second stage consists of a convolutional neural network with long short-term memory (CNN-LSTM) architecture designed to learn complex features from spectrograms of the first-stage enhanced signals. A new sinusoidal rectified unit (SinRU) is proposed to be used as an activation function by the CNN-LSTM network. The experiments are carried out by using two subsets of the Saarbruecken voice database (SVD) with different etiologies covering eight pathologies. The first subset contains voice recordings of patients with vocal cordectomy, psychogenic dysphonia, pachydermia laryngis and frontolateral partial laryngectomy, and the second subset contains voice recordings of patients with vocal fold polyp, chronic laryngitis, functional dysphonia, and vocal cord paresis. Identification of dysarthria severity levels in the Nemours and Torgo databases is also carried out. The experimental results showed that using the minimum mean square error (MMSE)-based signal enhancer prior to the CNN-LSTM network using SinRU led to a significant improvement in the automatic classification of the investigated voice disorders and dysarthria severity levels. These findings support the hypothesis that using an appropriate speech enhancement preprocessing has positive effects on the accuracy of the automatic classification of voice pathologies, thanks to the reduction of the intrinsic noise induced by the voice impairment.

© 2022 Nalecz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences. Published by Elsevier B.V. All rights reserved.

* Corresponding author at: Information Management Dept., Université de Moncton, 218 Boul. J.-D. Gauthier, Shippagan, New Brunswick E8S 1P6, Canada.
E-mail addresses: mchaiani@usthb.dz (M. Chaiani), sid-ahmed.selouani@umoncton.ca (S.A. Selouani), mboudraa@usthb.dz (M. Boudraa), mohammed.sidi.yakoub@umoncton.ca (M. Sidi Yakoub).
https://doi.org/10.1016/j.bbe.2022.03.002
© 2022 Nalecz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences. Published by Elsevier B.V. All rights reserved.

1. Introduction

The production of speech is a very complex process that involves the coordination of multiple neuromotor systems, including respiration, phonation and articulation. Any damage to these systems or any part of them results in articulation, fluency, or voice disorders [1]. Therefore, the commonly used taxonomy classifies voice disorders according to their origin, which can be organic, functional or psychogenic in nature. While functional voice disorders are idiopathic, organic voice disorders result from underlying structural or neurological causes. Psychogenic voice disorders are a manifestation of one or more types of psychological problems [2].

1.1. Overview of voice disorders

Functional voice disorders usually relate to a misuse of the voice resulting in an altered voice quality without an affected physiology of the voice production organs [3].

Organic voice disorders occur when the voice production organs are impacted, and are divided into structural and neurogenic. Structural abnormalities are due to physical deformation of the voice production organs, such as vocal fold lesions. Neurogenic abnormalities are due to problems with the nervous system that alter voice production [3].

Psychogenic voice disorders can originate from one or more types of psychological processes. These processes can include depression, the emotional state following a traumatic or stressful event, and anxiety. These psychological states may induce a partial loss of control of some speech production components, leading to disturbances in fluency, stammering and voice changes [4].

The ultimate goal of speech therapists is to improve speech communication for the benefit of people living with voice disorders, who often experience poor self-esteem and embarrassment that can lead to many negative psychological impacts. The therapist assesses the patient's condition and voice quality to accurately identify the voice disorder. This evaluation is carried out through auditory perception or by using medical instruments. Auditory perception is based on the speech therapist's analysis of the patient's voice quality during a conversation. This subjective description depends on the experience of the speech therapist and may differ from one speech therapist to another. In contrast, assessment using instruments such as a rigid endoscope introduced through the mouth or a flexible endoscope inserted through the nose allows visual perception and objective assessment. Another objective evaluation is the calculation and analysis of acoustic measurements of a voice signal picked up by a microphone using a computer. The automation of this method is attracting increasing attention from researchers due to its noninvasive nature, which is more comfortable for the patient and offers a fast detection tool for speech therapists.

1.2. Related work

Several studies address the processing of pathological speech in various aspects that include speech recognition for people suffering from speech pathologies. To cope with the high variability of impaired speech, some recognizers perform implicit modeling of phonetic variations. An effective approach consists of using adaptation and representation learning to improve the recognition rate of dysarthric speech [5]. However, to overcome the phonetic modeling and adaptation that often require a priori knowledge or transcribed data, end-to-end machine learning techniques have been developed in various areas of pathological speech processing, including voice pathology detection and classification.

Detection is a classification task involving two classes: the first class contains individuals with voice pathology and the second contains healthy individuals. Voice pathology detection systems typically combine several features extracted from the voice signal and use different machine learning algorithms to perform the detection. By using a support vector machine (SVM) as a classifier, [6] evaluated statistical measurements of the vocal tract area, [7] examined the multidimensional voice program parameters, and [8] reported that the 1 to 8 kHz frequency bands contain relevant information, using the entropy and the autocorrelation peak and its corresponding delay. [9] statistically measured the discrete wavelet transform signals after an empirical mode decomposition (EMD). [10] also used EMD to propose intrinsic mode function related features, including intrinsic mode function cepstral coefficients (IMFCC). [11] performed a fusion of the decisions of an extreme learning machine, a Gaussian mixture model (GMM) and an SVM, which had as input the fusion of MPEG-7 audio parameters and interlaced derivative pattern parameters. [12] opted for GMM and processed the spectrum based on the psychophysical conditions of hearing. The performance of extreme gradient boosting (XGBoost), DenseNet and Isolation Forest was compared by [13] using as input a combination of the raw audio signal, spectrogram, Mel-frequency cepstral coefficients (MFCC) and acoustic parameters. [14] compared the performance of the classification algorithms SVM, decision tree, Bayesian classification, logistic model tree, k-nearest neighbor (k-NN) and K* with an input vector composed of MFCC, fundamental frequency, perturbation-related measures and noise-related measures.

The increasing interest in and dramatic development of deep-learning-based approaches have enabled practical applications in pathological speech processing, achieving major breakthroughs and greatly improved performance compared to traditional signal processing methods. For instance, deep neural network architectures are used to perform dysarthric speech analysis and recognition [15,16]. In the voice pathology detection field, the transfer learning approach was adopted by [17] using VGG16 and CaffeNet as feature extractors with an SVM as a classifier, by [18] using ResNet34, and by [19], who performed a connectionist fusion of the features obtained from three pretrained AlexNet networks arranged in parallel.

To train deep neural network (DNN) architectures from scratch, [20] opted for a CNN-LSTM with a raw speech signal as input to detect voice pathology and achieved 68.08% accuracy on the SVD database. In [21], 71% accuracy was obtained for the detection of 6 organic pathologies of the SVD database using a convolutional deep belief network (CDBN) with spectrograms as inputs. The authors in [22] chose MFCC as a

descriptive vector to compare the performance of SVM, GMM and a DNN with three hidden layers, the latter giving the best accuracy. The authors in [23] opted for MFCCs as inputs for a DNN that essentially consists of two sparse autoencoders. A comparison of objective and subjective detection of laryngeal cancer was conducted in [24]. The subjective approach consisted of playing back the voice recordings to 4 volunteers, including 2 trained laryngologists. The objective method was based on algorithms such as SVM, XGBoost, light gradient boosted machine (LGBM), artificial neural network (ANN), one-dimensional convolutional neural network (1D-CNN) and two-dimensional convolutional neural network (2D-CNN). The feature extraction provided MFCCs, short-time Fourier transform (STFT), jitter, shimmer, harmonic-to-noise ratio (HNR), fundamental frequency and the raw voice signal. According to their evaluation on a database of 50 males with laryngeal cancer and 45 healthy males, most of the automatic methods surpassed the expert diagnosis, which was 69.9% accurate. The raw signal processed by a 1D-CNN gave the best result of 85.2%.

It is worth mentioning that most recent studies opted for pairwise classification, i.e., the classification between two voice pathologies at a time (one vs. one or one vs. all configurations). This pairwise classification was carried out by the systems presented in [6–8,19] on three common voice pathologies (cysts, polyps and paralysis) extracted from the SVD, Arabic voice pathology database (AVPD) and Massachusetts eye and ear infirmary (MEEI) databases. In [12], the pairwise approach was applied to classify 5 pathologies (adductor spasmodic dysphonia, keratosis, vocal fold nodules, vocal fold polyp and paralysis) of the MEEI database. The pairwise-based system presented in [9] classified 5 pathologies (dysphonia, laryngitis, funktionelle dysphonia, rekurrensparese and hyperfunktionelle dysphonia) of the SVD and 4 pathologies (Alzheimer, Parkinson, chronic laryngitis and paralysis) of a private database.

In real-life conditions, the pairwise approach is not helpful with regard to the time needed to perform an accurate diagnosis. A precise identification of one pathology that has to be recognized among many others leads to a multiclass identification, which is by far a more complicated task.

1.3. Goal and main contributions

In this work, a new approach is proposed to provide an efficient solution that allows the precise multiclass identification of vocal pathologies. For this purpose, various configurations are investigated to assess the relevance of a speech enhancement pretreatment and the optimal structure based on deep neural networks for a multiclass recognition of voice disorders. This is far more complex than the widely used pairwise approaches. Besides this, we have extended the application of the proposed system to the classification of dysarthria severity levels. The goal is to evaluate the effectiveness of the enhancement approach within a specific speech pathology, namely dysarthria. Our original contributions are threefold:

(i) to design a two-stage multiclassifier of voice disorders where the first stage is a speech enhancement module and the second stage is a CNN-LSTM that performs the identification of voice impairments by using two subsets of the Saarbruecken voice database (SVD) [25] with different etiologies covering the following eight pathologies: vocal cordectomy, psychogenic dysphonia, pachydermia laryngis, frontolateral partial laryngectomy, vocal fold polyp, chronic laryngitis, functional dysphonia, and vocal cord paresis; an automatic classification of dysarthria severity levels is also performed by using the Nemours [26] and Torgo [27] databases;

(ii) to propose the cepstral harmonic-to-noise ratio (CHNR) as an innovative estimator of pathological noise to enable proper functioning of the enhancement module;

(iii) to propose the sinusoidal rectified unit (SinRU) as a new activation function that is expected to improve the performance of the CNN-LSTM classifier.

The remainder of this paper is structured as follows. Section 2 introduces the two-stage voice pathology classification, which is composed of a voice enhancement module based on the CHNR for noise estimation and a deep neural network. Section 3 presents the databases, evaluation metrics, evaluated speech enhancement algorithms, and proposed activation function, as well as the results of the evaluation on two sets of pathological voices and on dysarthria severity levels. We discuss the obtained experimental results in Section 4. Finally, conclusions are outlined in Section 5.

2. Proposed approach: Deep learning of pathologically enhanced speech

In this section, the proposed voice pathology classification system is presented. As illustrated by the block diagram in Fig. 1, two main stages cooperate to achieve the classification process. The first stage focuses on enhancing the pathological voice signal based on the premise that there are some similarities between disordered voice and noisy speech uttered in adverse conditions. Unlike studies dealing with adverse conditions, such as those under the channel effect [28] or those dealing with an external source of noise [29], our approach makes the hypothesis that a pathological voice is intrinsically noisy. This assumption means that the voice impairment induces a "noise effect" on the voice produced by a person suffering from various voice disorders. Based on preliminary observations and experimental evidence presented in [30,31], the use of speech enhancement techniques to improve the quality and intelligibility of pathological speech is logically adopted to reduce the effect of this induced noise. Although the origin of the degradation differs, we can formulate the hypothesis that conventional noisy speech results from the degradation of the environment, while pathological voice results from the degradation of the speech production system. During the second stage, the enhanced signal is projected into the time–frequency domain thanks to a spectrogram representation. This 2D representation is used by the deep learning-based feature learning module to extract the relevant characteristics allowing the classification of voice pathologies.

Fig. 1 – Two-stage voice pathology classification system.

2.1. Enhancement of pathological voice

Speech enhancement aims at reducing the impact of noise without deteriorating the quality and intelligibility of speech. Recently, the application of speech enhancement was extended to improve the robustness of dysarthric speech recognition systems [30,31], since a positive correlation is found between the intelligibility scores of dysarthric and noisy speech [32]. To assess the impact of using speech enhancement techniques, our study focuses on dysphonia, which is characterized by alterations in voice pitch, loudness and quality. Impaired voice quality can be described as hoarseness [33]. Hoarseness, quantified by the harmonics-to-noise ratio [34], is considered to be the class that encompasses breathiness and roughness [35] and is thus technically perceived as noise or strange tones [36]. The level of hoarseness can be estimated by evaluating the extent to which noise replaces the harmonic structure in the spectrogram. The relationship between the harmonic component and noise is quantified as the harmonics-to-noise ratio (HNR). This quantifiable indicator can constitute an alternative to the conventional signal-to-noise ratio (SNR) and therefore offers the possibility of using speech enhancement techniques to improve the quality of pathological voice. For the purposes of this study, we use an adapted HNR developed in the cepstral domain. This calculation is detailed in the context of the minimum mean square error (MMSE)-based signal enhancer [37].

Let us consider a noisy voice signal x(i), which is the sum of a clean voice signal s(i) and a noise n(i):

x(i) = s(i) + n(i).    (1)

The short-term Fourier transform of the noisy signal is given by

X(t,f) = S(t,f) + N(t,f),    (2)

where 0 ≤ t ≤ T and 0 ≤ f ≤ F are the time frame and spectral component, respectively. To estimate the clean voice signal Ŝ, the noisy signal is attenuated by a gain function G:

Ŝ(t,f) = G · X(t,f).    (3)

The signals are enhanced by the log-MMSE estimator, which minimizes the mean-square error between the log-magnitude spectra of the clean and estimated signals, leading to noise attenuation without distorting the signal too much [38]. Its spectral gain is given by

G(ε,ν) = [ε(t,f) / (1 + ε(t,f))] · exp[ (1/2) ∫_{ν(t,f)}^{∞} (e^{−y} / y) dy ],    (4)

ν(t,f) = γ(t,f) · ε(t,f) / (1 + ε(t,f)),    (5)

where γ and ε are the a posteriori and a priori SNRs, respectively, defined by

γ(t,f) = |X(t,f)|² / E[ |N(t,f)|² ],    (6)

and

ε(t,f) = E[ |S(t,f)|² ] / E[ |N(t,f)|² ].    (7)

Since ε cannot be calculated in concrete terms, it is estimated by the decision-directed approach:

ε(t,f) = α · γ(t−1,f) · G(t−1,f)² + (1−α) · max(γ(t,f) − 1, 0),    (8)

where α is a smoothing coefficient. The noise E[ |N(t,f)|² ] is estimated from the silent frames of the noisy signal.

The data used throughout our experiments are the sustained vowels /a/ extracted from the SVD corpus. The vowels are characterized by their quasiperiodicity, which results from vocal cord vibration at a fundamental frequency F_0. Since these signals do not contain any silence, we replaced the a posteriori SNR γ with a CHNR whose calculation steps, illustrated by Fig. 2, are detailed below.

The CHNR is a ratio of overall energy to noise energy (in dB) and is used in evaluating voice quality [34]. It measures the amount of noise in a voice signal, and its calculation is based on the cepstrum [39]. Segments are first framed with an adaptive window of a length five times the fundamental period 1/F_0; the nonlinear least squares (NLS) estimator of F_0 [40] is used. Based on the hypothesis that the voice signal can be expressed as a source signal filtered by the vocal tract, the log-magnitude spectrum of the windowed signal is the sum of the logarithms of the source spectrum log X(f) and the spectral envelope log H(f), as shown by

Y(f) = log|S(f)| = log X(f) + log H(f).    (9)

Fig. 2 – CHNR calculation steps.
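To make the enhancement stage concrete, the following minimal NumPy sketch implements the log-MMSE gain of Eqs. (4)–(5) and the decision-directed estimate of Eq. (8). The smoothing value α = 0.98 is a common default from the speech enhancement literature, not a value reported in this paper.

```python
import numpy as np
from scipy.special import exp1  # E1(v) = integral from v to infinity of e^(-y)/y dy

def log_mmse_gain(snr_prior, snr_post):
    """Spectral gain of Eqs. (4)-(5) from the a priori and a posteriori SNRs."""
    nu = snr_post * snr_prior / (1.0 + snr_prior)                     # Eq. (5)
    return (snr_prior / (1.0 + snr_prior)) * np.exp(0.5 * exp1(nu))   # Eq. (4)

def decision_directed(snr_post, snr_post_prev, gain_prev, alpha=0.98):
    """Decision-directed a priori SNR of Eq. (8).  In the proposed method the
    a posteriori SNR is derived from the CHNR as 10**(CHNR / 10)."""
    return (alpha * snr_post_prev * gain_prev**2
            + (1.0 - alpha) * np.maximum(snr_post - 1.0, 0.0))
```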

To separate the noise in the source signal from the harmonic part and the spectral envelope, a transition to the real cepstral domain, Eq. (10), is required, and a liftering of the cepstrum is processed:

y(τ) = IFFT(log|S(f)|),    (10)

with τ the quefrency.

The lifter consists of the following steps:

- detection of the rahmonic peaks in the cepstrum;
- first derivation of the cepstrum and detection of the sign change on the two sides of the peak to determine the peak width;
- zeroing of the rahmonic peak to obtain the comb-liftered part C(τ).

The Fourier transform of the comb-liftered part is an approximation of the noise spectrum ApNoise(f), which is corrected, Eq. (13), by subtracting the deviance D(f), Eq. (12), so that the noise spectrum drops below the harmonic minima:

ApNoise(f) = FFT(C(τ)),    (11)

D(f) = | min( Y(f) − ApNoise(f) ) |, with (i−1)·F_0 < f < i·F_0 and i ≥ 1,    (12)

Noise(f) = ApNoise(f) − D(f).    (13)

The CHNR is then given by

CHNR_b = 20 · (Y_b − Noise_b),    (14)

where 0 ≤ b ≤ f_s/2 is used to define a frequency band, with f_s the sampling frequency.

It can be seen that CHNR_b is the a posteriori SNR in dB. We therefore replaced γ in Eq. (8) with 10^{CHNR_b/10}. Once the signals are enhanced, their representation in the time–frequency domain is calculated thanks to the spectrogram, which is used throughout all of the experiments. The use of spectrograms is motivated by the two-dimensional nature of this feature representation, which makes it well suited to convolutional networks. This representation visualizes the entire spectral decomposition of the voice signal on a single graph and is obtained from the squared magnitude of the short-time Fourier transform, i.e., the Fourier transform of the signal x(n) windowed by an N-length window w(n−t) shifted across the signal:

X_STFT(t,f) = Σ_{n=0}^{N−1} x(n) · w(n−t) · e^{−j2πnf/N}.    (15)

The resulting spectrograms are fed into the feature learner.
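Before turning to the feature learner, the CHNR estimation of Eqs. (10)–(14) can be summarized by the rough sketch below. It assumes a frame spanning about five fundamental periods and an externally estimated F_0 (e.g., by the NLS estimator); a fixed lifter width stands in for the derivative-based peak-width detection, and the deviance correction is applied globally rather than per harmonic interval.

```python
import numpy as np

def chnr(frame, f0, fs, band_hz=None):
    n = len(frame)
    Y = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)    # log magnitude spectrum, Eq. (9)
    c = np.fft.irfft(Y, n)                            # real cepstrum, Eq. (10)
    t0 = int(round(fs / f0))                          # fundamental period in samples
    w = max(1, t0 // 10)                              # assumed half-width of the lifter
    for k in range(1, n // (2 * t0) + 1):             # comb-lifter: zero each rahmonic
        c[k * t0 - w:k * t0 + w + 1] = 0.0
        c[n - k * t0 - w:n - k * t0 + w + 1] = 0.0    # mirrored (negative) quefrencies
    ap_noise = np.fft.rfft(c, n).real                 # approximate noise spectrum, Eq. (11)
    d = np.abs(np.min(Y - ap_noise))                  # deviance, Eq. (12) (global here)
    noise = ap_noise - d                              # corrected noise spectrum, Eq. (13)
    if band_hz is not None:                           # restrict Eq. (14) to a band b
        hi = int(band_hz * n / fs)
        Y, noise = Y[:hi], noise[:hi]
    return 20.0 * float(np.mean(Y) - np.mean(noise))  # Eq. (14), in dB
```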
enhanced, their representation in the time–frequency domain p
is calculated thanks to the spectrogram that is used through- Features that have a very low probability of being active
out all of the experiments. The use of spectrograms is moti- can be separated using this layer. The resulting feature
vated by the two-dimensional nature of this feature maps are fed into an LSTM layer to capture their short
representation and is therefore more suitable for convolu- and long-term contextual dependencies. LSTM is a recur-
tional networks. This representation visualizes the entire rent neural network that learns the short and long-term
spectral decomposition of the voice signal on a single graph dependencies of its input sequences. It is composed of G
and is obtained from the squared magnitude of the short- memory blocks which in turn consist of an activation cell
lived Fourier transform. This latter is the Fourier transform and an input, a forget and an output gate, as shown in
of a windowed signal by an N-length window shifted Fig. 4.
wðn  tÞ across the signal xðnÞ as given by At time t, the current input data Xt and the data from pre-
X
N1
j2nf p
vious hidden state ht1 are fed to the memory block to update
XSTFT ðt; f Þ ¼ xðnÞwðn  tÞe N : ð15Þ its state defined by Eq. (23). This is accomplished using Eqs.
n¼0
(19)–(22).
The resulting spectrograms are fed into the feature  
f
learner. Fg ¼ sigmoid Wgf  Xt þ Ugf  ht1 þ bg ð19Þ

 
i
2.2. CNN-LSTM architecture for voice disorder Ig ¼ sigmoid Wgi  Xt þ Ugi  ht1 þ bg ð20Þ
classification
 c
Cg ¼ tanh Wgc  Xt þ Ugc  ht1 þ bg ð21Þ
The proposed network, illustrated in Fig. 3, learns to  o
extract local features from the spectrogram by one 1D con- Og ¼ sigmoid Wgo  Xt þ Ugo  ht1 þ bg ð22Þ

Fig. 3 – Proposed feature learner architecture.

At time t, the current input data X_t and the data from the previous hidden state h_{t−1} are fed to the memory block to update its state, defined by Eq. (23). This is accomplished using Eqs. (19)–(22):

F_g = sigmoid( W_g^f · X_t + U_g^f · h_{t−1} + b_g^f ),    (19)
I_g = sigmoid( W_g^i · X_t + U_g^i · h_{t−1} + b_g^i ),    (20)
C_g = tanh( W_g^c · X_t + U_g^c · h_{t−1} + b_g^c ),    (21)
O_g = sigmoid( W_g^o · X_t + U_g^o · h_{t−1} + b_g^o ),    (22)
c_t = I_g ⊙ C_g + F_g ⊙ c_{t−1},    (23)
h_t = O_g ⊙ tanh(c_t),    (24)

where W_g and U_g are weight matrices, b_g are bias vectors, and ⊙ is the elementwise (Hadamard) product. F_g, the forget gate output, determines which information from the previous cell state c_{t−1} to forget and which to retain. I_g, the input gate output, determines the important information of C_g, the input modulation, after their multiplication. The activation cell, as given by Eq. (23), updates the state of the cell by summing the important information of the input data and the information retained from the previous cell state. O_g, the output gate output, controls the hidden state of the next cell h_t, defined by Eq. (24). Some outputs of the LSTM layer are dropped out to prevent overfitting [45].

To finally classify the pathologies, an output layer with a number of neurons equal to the number of classes B is required. Using the softmax activation function, the outputs of the output layer are

S_i(x) = e^{x_i} / Σ_{j=1}^{B} e^{x_j},    (25)

with 1 ≤ i ≤ B. From these outputs, a cross-entropy loss given by Eq. (26) is calculated between the true labels b of the input data, presented as one-hot vectors, and the predictions S:

f = − Σ_{i=1}^{B} b_i · log(S_i).    (26)

The gradient of the obtained loss is backpropagated through the network to optimize the error by updating the learner's parameters.

Algorithm 1 summarizes the processing steps followed by a pathological voice until the classification decision is obtained.

Algorithm 1: Proposed voice disorder classification system
  Input: pathological voice signals s
  Output: optimized parameters of the enhanced learner and the pathology label for the input signal
  s_enf = enframe(s, window-size = 32 ms, frame-shift = 16 ms)
  for each frame of s_enf do
      calculate F_0
  end for
  for each 5 periods (1/F_0) of s do
      calculate CHNR
  end for
  s_enh = logMMSE(s, CHNR)
  spec = spectrogram(s_enh, window = 30 ms, shift = 15 ms)
  create a set D of minispectrograms of 36 frames
  for d ∈ D do
      S = CNN-LSTM(d)
      b̂ = argmax(S)
      f(b, b̂) = CrossEntropyLoss(b, S)
      backpropagate the gradient and update the learner parameters
  end for
  F_0 is the fundamental frequency; b and b̂ are the true and predicted labels, respectively.
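A minimal Keras sketch of the learner invoked in Algorithm 1 is given below. It wires up the layers of Fig. 3 with the hyperparameters retained after the grid search described in Section 3.3 (32 filters of width 4, pooling size 4, 100 LSTM units, dropout rate 0.4), using the ReLU baseline activation and four output classes; the authors' exact implementation may differ in detail.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_learner(n_classes=4, n_frames=36, n_bins=750):
    # Input: one minispectrogram (36 frames x 750 frequency bins).
    model = keras.Sequential([
        layers.Input(shape=(n_frames, n_bins)),
        layers.Conv1D(32, kernel_size=4, strides=1, activation="relu"),  # Eqs. (16)-(17)
        layers.MaxPooling1D(pool_size=4),                                # Eq. (18)
        layers.LSTM(100),                                                # Eqs. (19)-(24)
        layers.Dropout(0.4),
        layers.Dense(n_classes, activation="softmax"),                   # Eq. (25)
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",                       # Eq. (26)
                  metrics=["accuracy"])
    return model
```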
3. Experimental setup and results

In this section, we describe the datasets used in this study, and we present the unsupervised and supervised enhancement algorithms used for comparison purposes. We also detail the hyperparameters of the feature learner in addition to the proposed activation function. Moreover, we present the results of the pathology classification system on two sets of pathologies.

Fig. 4 – Architecture of an LSTM memory block.

3.1. Data

The selected datasets used in this study are extracted from a German voice database freely available online [25]. Developed by the Phonetics Institute of Saarland University, the Saarbruecken Voice Database (SVD) contains approximately 2041 voice recording sessions of 681 healthy and 1019 pathological individuals. Speakers who were recorded before their pathology and those recorded after their recovery were counted in both the healthy and pathological classes. A pathological subject suffers from at least one of the 71 pathologies mentioned in the database. A voice recording session consists of the sentence 'Guten Morgen, wie geht es Ihnen?' ('Good morning, how are you?') and the vowels /a/, /i/, and /u/ pronounced at different pitches (normal, high, low and rising–falling). Each voice file is recorded in a sound-treated room using a Computerized Speech Lab station (model 4300B) at a sampling frequency of 50 kHz and an amplitude resolution of 16 bits. The distance between the lips and the microphone was kept constant through the use of a headset condenser microphone (NEM 192.15, Beyerdynamic, Heilbronn, Germany) [46]. For a large part of the voice files, an electroglottographic (EGG) signal is available. In our experiments, we opted for the sustained vowel /a/ pronounced with a neutral pitch due to its rich harmonic content.

3.2. Description of selected pathologies

The first selected dataset, presented in Table 1, covers pathologies based on the following etiologies: psychogenic, organic, and impairment following surgery on the vocal cords.¹ This pathology selection allows us to validate the hypothesis which considers pathological voice as a noisy signal, especially on a disorder acquired after surgical intervention. Indeed, in real contexts, an automatic system capable of distinguishing vocal pathologies of different etiologies is necessary for the future integration of systems based on machine learning in a medical environment.

Psychogenic dysphonia is a voice alteration of psychological etiology, such as depression and anxiety. A patient may present it unconsciously as an organically caused disorder to avoid confronting emotional conflict, stress or personal failure [4].

Pachydermia laryngis occurs due to lesions of the arytenoid cartilage, where the vocal cords are attached, that appear like ulcers. These ulcers are usually due to voice misuse in the form of repeated sudden glottal attacks [47].

Cordectomy and frontolateral partial laryngectomy are surgical treatments for glottic cancers [48]. The first is the removal of a vocal fold or part of it. The second involves resection of the tumorous vocal fold, the anterior commissure, an anterior portion of the contralateral vocal fold and an anterior fragment of the thyroid cartilage.

¹ Cordectomy and laryngectomy are not, strictly speaking, vocal pathologies (they are acquired disorders), but they have been considered as such in the SVD database. In this study, we kept the same nomenclature as that used in the SVD.

By analyzing the first row of Fig. 6, we can visually notice that the spectrograms belonging to the two pathologies cordectomy and frontolateral partial laryngectomy contain more noise than those of the two pathologies pachydermia laryngis and psychogenic dysphonia. Visual observations of the first subset waveforms show similarity to noisy speech signals, leading us to assess the impact of the use of speech enhancement processing on the automatic classification of the voice pathologies of the first subset.

The second SVD subset used in this study contains recordings of voice disorders based on etiologies related to organic-structural, organic-neurogenic, and functional impairments. Hence, the following pathologies have been selected: vocal cord paresis, vocal fold polyp, chronic laryngitis and functional dysphonia. This second subset has the advantage of having a balanced number of files with respect to the gender of the speakers. The demographic information about the patients is given in Table 2. We can note that the numbers of male and female speakers are balanced and the effect of age is neutralized, since all groups have the same average age.

3.3. Setup

The spectrogram is calculated with a Hamming window of 30 ms and a shift of 15 ms, resulting in a spectrogram of size T × 750, where 750 represents the frequency bins and T the number of frames. Each spectrogram is divided into minispectrograms of size 36 × 750, where 36 represents the number of frames of the smallest voice signal. This division into minispectrograms allows data augmentation. To classify a voice signal, the mean of the predictions of its segments is calculated.
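A sketch of this preprocessing with SciPy, under the settings just described, is shown below; the exact framing of the authors' implementation may differ slightly.

```python
import numpy as np
from scipy.signal import spectrogram

def minispectrograms(x, fs, win_s=0.030, hop_s=0.015, seg_frames=36):
    # Hamming-windowed spectrogram (30 ms window, 15 ms shift), then slicing
    # along the time axis into 36-frame minispectrograms.
    nperseg = int(win_s * fs)
    noverlap = nperseg - int(hop_s * fs)
    _, _, S = spectrogram(x, fs=fs, window="hamming",
                          nperseg=nperseg, noverlap=noverlap)
    S = S.T                                      # shape: (T frames, frequency bins)
    n = S.shape[0] // seg_frames
    return np.stack([S[i * seg_frames:(i + 1) * seg_frames] for i in range(n)])
```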
Grids of predefined values are evaluated to obtain the optimal learner parameters.

Table 1 – Number of records for each pathology and speakers' age.

Pathology                          | Female records | Average female age ± std | Male records | Average male age ± std | Total
Cordectomy                         | 2              | 51.5 ± 0.71              | 33           | 61.36 ± 6.99           | 35
Frontolateral partial laryngectomy | 0              | /                        | 27           | 57.85 ± 7.84           | 27
Pachydermia laryngis               | 1              | 61 ± 0                   | 30           | 56.33 ± 12.32          | 31
Psychogenic dysphonia              | 35             | 51.17 ± 9.48             | 12           | 50.67 ± 9.39           | 47

Fig. 5 – Comparison of the objective quality measures for the conventional and proposed MMSE-based enhancement methods applied to each pathology: (a) short-time objective intelligibility (STOI), (b) signal-to-noise ratio (SNR), and (c) perceptual evaluation of speech quality (PESQ).

The grid of filters is [4, 8, 16, 32, 64, 128], the grid of the kernel size is [2, 3, 4], the grid of the pool size is [2, 3, 4], the grid of LSTM units is [50, 100, 150, 200], and the grid of the dropout rate is [0.4, 0.5, 0.6]. After evaluation, the learner's convolution layer consists of 32 filters of size 4 with a stride of 1, allowing for the learning of 32 feature maps. The kernel size in the max pooling layer is 4. One hundred units constitute the LSTM layer, and the dropout rate is 0.4. The learner is trained over 200 epochs at a learning rate of 0.0001 using the Adam optimization algorithm [49].

The experiments were implemented using Python 3.5; the spectrograms were calculated using the SciPy 1.4.1 library; and the network was constructed using the Keras 2.1.6 library on the TensorFlow 2.3.0 platform.

3.4. Evaluation metrics

The confusion matrix, accuracy and macro F1 score are used to evaluate the system performance. The rows and columns of the confusion matrix indicate the true classes and predicted classes, respectively. The diagonal of this matrix represents the number of correctly predicted spectrograms. The sum of the diagonal elements divided by the total number of tested spectrograms is the accuracy. The F1 score for each class is the harmonic mean of the sensitivity and precision ratios, Eq. (27). For a class B_1, the precision is the ratio of correctly classified B_1 spectrograms to all spectrograms classified as B_1. The sensitivity is the ratio of correctly classified B_1 spectrograms to all tested B_1 spectrograms. The mean of the F1 scores of all classes is the macro F1:

F1 score = 2 · (precision × sensitivity) / (precision + sensitivity).    (27)

Each system is evaluated ten times due to the network's random initialization. The average accuracy of the system over the 10 iterations and its standard deviation, as well as the average and standard deviation of the F1 score, are provided.
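For illustration, these metrics can be reproduced with scikit-learn on hypothetical label vectors:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])       # hypothetical true classes
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3])       # hypothetical predictions
print(confusion_matrix(y_true, y_pred))           # rows: true classes, columns: predicted
print(accuracy_score(y_true, y_pred))             # sum of the diagonal / total
print(f1_score(y_true, y_pred, average="macro"))  # mean of per-class F1, Eq. (27)
```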
3.5. Impact of the cross-validation design

An n-fold cross-validation approach is used in our experiments. The files of each class are divided into n folds. For each of the n iterations, n − 1 folds from each class are used for training, and the remaining fold is used for testing.

Fig. 6 – Vowel /a/ spectrograms of four pathologies and their enhanced versions.

Table 2 – Number of records for each pathology and speakers' age.

Pathology             | Female records | Average female age ± std | Male records | Average male age ± std | Total
Vocal fold polyp      | 19             | 53.68 ± 15.46            | 19           | 49.58 ± 11.27          | 38
Chronic laryngitis    | 19             | 49.95 ± 13.87            | 19           | 50 ± 11.23             | 38
Functional dysphonia  | 19             | 51.47 ± 14.3             | 19           | 49.21 ± 14.88          | 38
Vocal cord paresis    | 19             | 52 ± 8.83                | 19           | 49.58 ± 12.67          | 38

After the final iteration, all files have been tested. To decide how to evaluate the system, we tested two designs. The first is 5-fold cross-validation, motivated by maintaining speaker-independent experiments, since some speakers have multiple files. The second is 10-fold cross-validation. By studying the details of the two configurations, we noted that the accuracy of the 10-fold system is more than 8% better than that of the 5-fold one. Further investigation has shown that this optimistic 10-fold accuracy is likely due to the presence of files from the same speaker in both the training and test data. Indeed, for the 10-fold cross-validation, 90% of the voice recordings of each class are used for training and, as a result, some recordings of the same speaker may appear in the test set. Therefore, to keep the experiments speaker-independent, the 5-fold configuration is used.
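One way to enforce this constraint directly (not necessarily the authors' procedure) is to group the files by speaker when building the folds, for instance with scikit-learn's GroupKFold; files, labels and speakers below are hypothetical parallel arrays:

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(files, labels, groups=speakers):
    # No speaker ever appears in both the training and the test fold,
    # which removes the leakage diagnosed above for the 10-fold design.
    train_files, test_files = files[train_idx], files[test_idx]
```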
3.6. Pathological voice enhancement

To explicitly show the benefits of using the CHNR, a direct comparison of the conventional MMSE (using Eq. (8)) and the modified MMSE (using the CHNR) is provided. The results obtained using standard speech enhancement evaluation metrics, such as the SNR, the perceptual evaluation of speech quality (PESQ), and the short-time objective intelligibility (STOI), are reported in Fig. 5. As shown in Fig. 5, the proposed MMSE-based configuration performed better than the conventional MMSE whatever the evaluation metric.

In addition to the unsupervised log-MMSE enhancement algorithm, and for comparison purposes, spectral subtraction and Wiener filter enhancement methods are used, as well as a deep speech enhancement generative adversarial network (DSEGAN) and the Large-Deep complex U-Net with phone-fortified perceptual loss (LDCU–Net-PFPL), a recent supervised deep learning-based enhancement method [50].

Spectral subtraction is the simplest method of noise reduction. It consists of subtracting the average power of the noise from the instantaneous power spectrum of the noisy signal. This generates musical noise; hence the improvement, which subtracts an overestimation of the noise [38].

Wiener filtering for speech enhancement applies an optimal linear filter to minimize the mean square error (MSE) between the clean signal and its estimate [38].
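Minimal sketches of these two classical enhancers are given below; the overestimation factor and the spectral floor are illustrative values, not parameters reported in this paper.

```python
import numpy as np

def spectral_subtraction(x_mag, noise_mag, over=2.0, floor=0.01):
    # Power spectral subtraction with noise overestimation to limit musical
    # noise; the floor keeps the estimated power spectrum positive.
    p = x_mag**2 - over * noise_mag**2
    return np.sqrt(np.maximum(p, floor * noise_mag**2))

def wiener_gain(snr_prior):
    # Wiener gain from the a priori SNR; minimizes the MSE between the
    # clean spectrum and its estimate.
    return snr_prior / (1.0 + snr_prior)
```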

As mentioned in Subsection 2.1, the noise in these methods is estimated from the cepstral domain.

The deep speech enhancement generative adversarial network (DSEGAN) [51] is composed of one discriminator D and two generative networks G_j, j ∈ {1, 2}. The first generator G_1, having as input the noisy signal x and a noise n_1 following a normal distribution N(0, I), generates an enhanced signal G_1(n_1, x) = ŝ_1 whose distribution approaches that of the original signal s after each update. The second generator G_2 receives as input the enhanced signal ŝ_1 of its predecessor and the noise n_2 to improve the approximation to the original signal distribution, G_2(n_2, ŝ_1) = ŝ_2. To ensure approximation to the original signal distribution, the discriminator classifies the pair (s, x) as real and the pairs (ŝ_1, x) and (ŝ_2, x) as false. This informs the generators of their errors, which they try to reduce at each update. The mathematical expressions summarizing this are given by

min_D V(D) = (1/2) E_{s,x∼P_data(s,x)} [ (D(s,x) − 1)² ] + Σ_{j=1}^{2} (1/4) E_{n_j∼P_n(n), x∼P_data(x)} [ D(G_j(n_j, ŝ_{j−1}), x)² ],    (28)

min_G V(G) = Σ_{j=1}^{2} (1/4) E_{n_j∼P_n(n), x∼P_data(x)} [ (D(G_j(n_j, ŝ_{j−1}), x) − 1)² ] + Σ_{j=1}^{2} λ_j ‖G_j(n_j, ŝ_{j−1}) − s‖_1,    (29)

with E the expectation, ŝ_0 = x, λ_1 = 50 and λ_2 = 100. Eqs. (28) and (29) are the least-squares losses of the discriminator and generators, respectively.

Each generator has a fully convolutional encoder-decoder architecture, and the discriminator is a fully convolutional network.

DSEGAN is trained with the database used in [51]. The training of DSEGAN requires the availability of both clean and noisy signals, a requirement that is not met by pathological signals. Thus, we introduce a two-stage enhancement: the signals are first enhanced by MMSE and considered clean, and the original signals are considered noisy. By adding the original and MMSE-enhanced signals to the training data, the SEGAN2 network is obtained. In the test phase, the enhanced pathological signals are obtained by passing the original signals through the trained networks.

The Large-Deep complex U-Net with phone-fortified perceptual loss (LDCU–Net-PFPL) [50] uses the short-time Fourier transform (complex spectrum) of the noisy signal to generate a complex ratio mask, which is multiplied by the complex spectrum of the noisy signal to obtain the complex spectrum of the enhanced signal. The obtained spectrum is transformed into the time domain by the inverse short-time Fourier transform. The enhanced signal ŝ as well as the clean signal s are then represented by latent representations v̂ and v, respectively, rich in phonetic information resulting from their passage through a wav2vec encoder. These representations allow the enhancement system to be trained taking into consideration the phonetic information contained in the speech signal, using the perceptual loss given by Eq. (30):

L(s, ŝ) = ‖s − ŝ‖_1 + sup_{g∈G} ( E_μ[g(v)] − E_ν[g(v̂)] ),    (30)

with μ and ν the densities of v and v̂, and g a function belonging to the set G = { g : ℝⁿ → ℝⁿ | ‖g(a) − g(b)‖ ≤ 1·‖a − b‖, ∀ a, b ∈ ℝⁿ } of all 1-Lipschitz functions.

Fig. 6 shows spectrograms of the vowel /a/ as pronounced by speakers with a cordectomy, pachydermia laryngis, frontolateral partial laryngectomy and psychogenic dysphonia. The spectrograms are of signals without preprocessing, namely 'Original', and of signals enhanced by the different enhancement methods: 'MMSE' for log-MMSE, 'SpecSub' for spectral subtraction, 'Wiener' for Wiener filtering, 'SEGAN' for SEGAN2 and 'LDCU–Net-PFPL' for LDCU–Net-PFPL.

Table 3 – Accuracy of the CNN-LSTM network using different voice signals.

Signals   | Methods        | Accuracy     | F1 score
Original  | /              | 69.37 ± 2.38 | 0.672 ± 0.0236
Enhanced  | MMSE           | 71.15 ± 2.23 | 0.695 ± 0.0225
Enhanced  | SPECSUB        | 70.54 ± 2.18 | 0.684 ± 0.0215
Enhanced  | Wiener         | 70.2 ± 2.44  | 0.684 ± 0.0291
Enhanced  | SEGAN2         | 70.56 ± 2.46 | 0.694 ± 0.0262
Enhanced  | LDCU–Net-PFPL  | 65.16 ± 2.88 | 0.624 ± 0.0388

Table 3 provides the accuracy and F1 score of the CNN-LSTM network using the different speech enhancement methods. From the results, the spectral subtraction-based, Wiener filter-based and MMSE-based enhancements exhibited a performance improvement. The MMSE-based enhancement achieved 71.15% accuracy, an improvement of 1.78%. On the other hand, for the deep learning-based enhancement methods, SEGAN2 achieved an accuracy improvement of 1.19%, whereas the LDCU–Net-PFPL method decreased the accuracy by 4.21%.

Since the best performance is achieved by the MMSE-based enhancement, and for a more in-depth look, this enhancement is performed considering only the noise estimated over the frequency ranges of 0–1.5 kHz, 0–2.5 kHz, and 0–3.5 kHz. For simplicity, these ranges are denoted as MMSE 15, MMSE 25 and MMSE 35, respectively. The limitation to these frequency bands is motivated by an investigation by [52], which concluded that a division into frequency subbands, specifically the 0–3 kHz frequency band containing the formants, is more suitable for classifying dysphonia than the entire frequency band. The results in Table 4 show an improved performance in the classification of voice pathologies. Compared to the original signals, the MMSE-based enhancement in the 0–1.5 kHz, 0–2.5 kHz, and 0–3.5 kHz frequency bands yielded improvements of 1.57%, 2.63%, and 2.44%, respectively.

Table 4 – Accuracy of the CNN-LSTM network using MMSE enhancement limited to different frequency bands.

Signal    | Methods  | Accuracy     | Macro F1 score
Enhanced  | MMSE 15  | 70.94 ± 2.18 | 0.694 ± 0.0254
Enhanced  | MMSE 25  | 72 ± 2.1     | 0.705 ± 0.022
Enhanced  | MMSE 35  | 71.81 ± 2.18 | 0.701 ± 0.0234

3.7. Effect of activation functions

Activation functions are primordial building blocks in a neural network, as they allow for the learning of complex forms in the data. Given a neuron's input, the activation function determines whether it should be activated and passed to the next layer. To explore the effect of the activation functions on the classification of voice pathologies, several activation functions are employed in addition to ReLU: the Gaussian error linear unit (GELU) [53], considered a smoothed form of ReLU with positive and negative values; the scaled exponential linear unit (SELU), introduced by [54] to ensure network self-normalization; and Swish [55], which combines the input and the sigmoid function. The aforementioned activation functions, shown in Fig. 7, are compared to the SinRU that we propose. SinRU is a combination of the ReLU function and the periodic sine function, resulting in a monotonic function. As the input increases, the activation increases either rapidly in the intervals [2nπ, π/2 + 2nπ] and [3π/2 + 2nπ, 2π + 2nπ] or slowly in the interval [π/2 + 2nπ, 3π/2 + 2nπ], with n ∈ ℤ. For negative inputs, the function output is 0, as illustrated in Fig. 8.

Fig. 8 – Proposed activation function plot.

The activation functions are listed with their mathematical expressions in Table 5. They are implemented within the convolutional layer. Each system is then trained with the different activation functions and the different input signals, namely, 'Original', 'MMSE 15', 'MMSE 25', and 'MMSE 35'. Table 6 summarizes the performance of the systems and highlights the best achieved performance.
mented within the convolutional layer. Each system is then
Despite the imbalance between the number of male and
trained with different types of activation functions and differ-
female recordings in the first SVD dataset, we investigate
ent input signals, namely, ‘Original’, ‘MMSE 15’, ‘MMSE 25’,
the effect of gender on the systems’ accuracy. Table 7 shows
and ‘MMSE 35’. Table 6 summarizes the performance of the
that for the system with the SinRU activation function and
systems and highlights the best achieved performance.
enhanced signal input, the female overall accuracy is 87.36%
The results show that for each learner, an improvement in
and the male overall accuracy is 66.37%. Compared to the
the classification performance is reported for the enhanced
baseline system where the female and male overall accura-
signals compared to the original signals. This improvement
cies are, respectively, 84.74% and 62.06%, an improvement is
is achieved for all activation functions. The MMSE 25 signals
noticed for both genders.
improved the accuracy of the ReLU and SELU functions by
Tables 8 and 9 present the confusion matrices of the sys-
2.63% and 1.47% in accuracy and 3.3% and 2.3% in F1 score,
tems with the original and MMSE 35 enhanced input signals,
respectively. The MMSE 15 signals improved the accuracy
respectively. It is clear that the pathological voice enhance-
for the swish and GELU functions by 2.08% and 2.2%, respec-
ment improves the discrimination between the two classes
tively, in accuracy, and 2.3% in F1 score. The MMSE 35 signals
of cordectomy and frontolateral laryngectomy. As these two
improved the accuracy for the SinRU function by 2.21% in
classes have a large overlap of speakers, the voice recordings
accuracy and 2.3% in F1 score. When comparing the effect
of a cordectomized speaker are used for network training, and
of the activation functions, for each type of signal, an
the voice recordings of the same speaker who subsequently
underwent frontolateral laryngectomy are used for testing
and vice versa. Based on this observation, an additional
experiment, which consists of merging the data of the two
classes (cordectomy and frontolateral laryngectomy), was car-
ried out. Therefore, the recognition task consists of identify-
ing 3 classes. The results presented in Table 10 shows that
there is a significant improvement when the cordectomy
and frontolateral classes are combined. The accuracy and F1
score for the MMSE 35 system, when using SinRU activation
function, reach 87.38% and 0.854, respectively. These results
are 14.55% and 14.4% higher than the accuracy and F1 score,
respectively, achieved by the same system when the two
classes are distinct.

3.8. Performance with respect of gender

To extend our hypothesis on pathologies based on different


etiologies than those of the first subset while considering
the effect of the gender, additional experiments are per-
Fig. 7 – Activation functions plot. formed by using a second SVD dataset.

Table 5 – Activation functions.

Activation function | Expression
ReLU   | ReLU(x) = x if x > 0; 0 otherwise
Swish  | Swish(x) = x / (1 + e^{−βx}), with β = 1
SELU   | SELU(x) = βx if x > 0; βα(e^x − 1) otherwise, with β = 1.05070098 and α = 1.67326324
GELU   | GELU(x) = (x/2) · (1 + tanh(√(2/π) · (x + 0.044715x³)))
SinRU  | SinRU(x) = x + sin(x) if x > 0; 0 otherwise

Table 8 – Confusion matrix of the best-achieving performance using the SinRU activation function with original signals. Cord.: cordectomy, F.l.: frontolateral laryngectomy, P.l.: pachydermia laryngis, P.d.: psychogenic dysphonia.

True \ Predicted | Cord. | F.l. | P.l. | P.d.
Cord.            | 20    | 10   | 1    | 4
F.l.             | 7     | 19   | 1    | 0
P.l.             | 2     | 2    | 22   | 5
P.d.             | 2     | 0    | 4    | 41

Table 9 – Confusion matrix of the best-achieving performance using the SinRU activation function with MMSE 35 signals. Cord.: cordectomy, F.l.: frontolateral laryngectomy, P.l.: pachydermia laryngis, P.d.: psychogenic dysphonia.

True \ Predicted | Cord. | F.l. | P.l. | P.d.
Cord.            | 22    | 7    | 2    | 4
F.l.             | 5     | 20   | 1    | 1
P.l.             | 1     | 1    | 22   | 7
P.d.             | 1     | 0    | 5    | 41

Table 6 – Voice disorder classification performance using several activation functions with original and enhanced signals. The task consists of identifying 4 classes: cordectomy, frontolateral laryngectomy, pachydermia laryngis and psychogenic dysphonia. Boldface indicates the best performance.

Signals  | Metric   | ReLU           | Swish          | SELU           | GELU           | SinRU
Original | Accuracy | 69.37 ± 2.38   | 69.74 ± 1.67   | 70.56 ± 2.24   | 69.8 ± 1.62    | 70.62 ± 2.27
Original | F1 score | 0.672 ± 0.0236 | 0.678 ± 0.0189 | 0.684 ± 0.0242 | 0.68 ± 0.0173  | 0.687 ± 0.0249
MMSE 15  | Accuracy | 70.94 ± 2.18   | 71.82 ± 1.99   | 71.8 ± 1.66    | 72 ± 2         | 72.69 ± 1.8
MMSE 15  | F1 score | 0.694 ± 0.0254 | 0.701 ± 0.0221 | 0.702 ± 0.0183 | 0.703 ± 0.0228 | 0.709 ± 0.017
MMSE 25  | Accuracy | 72 ± 2.1       | 71.43 ± 1.49   | 72.03 ± 2.28   | 70.96 ± 2.01   | 72.56 ± 2.01
MMSE 25  | F1 score | 0.705 ± 0.022  | 0.697 ± 0.0162 | 0.707 ± 0.0245 | 0.694 ± 0.0211 | 0.708 ± 0.0209
MMSE 35  | Accuracy | 71.81 ± 2.18   | 70.84 ± 1.97   | 71.65 ± 1.36   | 71.07 ± 2.04   | 72.83 ± 1.14
MMSE 35  | F1 score | 0.701 ± 0.0234 | 0.69 ± 0.0219  | 0.699 ± 0.0187 | 0.697 ± 0.0205 | 0.71 ± 0.0173

Table 7 – Voice disorder classification performance using the ReLU and SinRU activation functions with original and enhanced signals for females and males. Boldface indicates the best performance.

Metric   | Female: ReLU & Original | Female: SinRU & MMSE 35 | Male: ReLU & Original | Male: SinRU & MMSE 35
Accuracy | 84.74 ± 4.53            | 87.37 ± 4.53            | 62.06 ± 2.7           | 66.37 ± 2.24
F1 score | 0.438 ± 0.049           | 0.454 ± 0.0472          | 0.607 ± 0.029         | 0.643 ± 0.0248

The baseline system, with ReLU as the activation function and the original input signals, is compared to the one with SinRU as the activation function and enhanced input signals. The systems are evaluated for females exclusively, males exclusively, and females and males together. The accuracies and F1 scores are provided in Table 11.

From the obtained results, it can be observed that both the pathological voice enhancement and the SinRU function improve the performance of the pathological voice classification system. An improvement is achieved for the female, male and gender-mixed systems. For females, the accuracy and F1 score improvements are 3.04% and 4.2%, respectively. For males, the accuracy and F1 score improvements are 5.36% and 3%, respectively. When the genders are mixed, a small improvement of approximately 0.86% is observed in accuracy, while the F1 score improves by 1.5%. Nevertheless, this performance could be increased by searching for an optimal CNN configuration, since we reproduced the experiments with the same system and conditions used with the first subset. The pathological voices are best classified for women, with an average accuracy of 61.79% and an average F1 score of 0.588.

Table 10 – Voice disorder classification performance using several activation functions with original and enhanced signals.
The task consists of identifying 3 classes: Pachydermia laryngis, Psychogenic dysphonia and a combined class composed of
Cordectomy and Frontolateral laryngectomy. Boldface indicates best performance.
Activation function ReLU Swish SELU GELU SinRU

Original
Accuracy 85.651.32 84.841.13 85.151.30 85.781.35 86.441.18
F1 score 0.830.0224 0.8220.16 0.8270.0205 0.8350.0169 0.8450.0157
MMSE 15
Accuracy 84.631.29 84.981.22 85.411.89 85.840.79 87.320.84
F1 score 0.82700.0168 0.8290.0164 0.8290.0284 0.8410.0104 0.8550.0128
MMSE 25
Accuracy 85.981.42 85.931.22 85.701.92 85.411.27 87.10.9
F1 score 0.840.0195 0.8390.0164 0.8340.0273 0.8330.0168 0.8530.0119
MMSE 35
Accuracy 86.261.56 85.861.24 85.401.25 85.421.56 87.380.64
F1 score 0.83900.0197 0.8380.0172 0.8290.0176 0.8330.0168 0.8540.0102

Table 11 – Voice disorder classification performance using ReLU and SinRU activation functions with original and enhanced
signals. The task consists of identifying 4 classes: Vocal fold polyp, Chronic laryngitis, Functional dysphonia and Vocal cord
paresis. Boldface indicates best performance.
Female and male Female Male

Configuration ReLU & SinRU & ReLU & SinRU & ReLU & SinRU &
Original MMSE 35 Original MMSE 35 Original MMSE 35
Accuracy 48.40.34 49.262.32 58.750.76 61.790.15 51.191.02 56.552.11
F1 score 0.4280.01 0.4430.018 0.5460.011 0.5880.013 0.4850.007 0.5150.019

with an average accuracy of 61.79% and an average F1 score of 0.588.

3.9. Dysarthria severity levels' classification

Dysarthria is characterized by multiple deviant speech dimensions, such as perturbation of pitch and loudness, reduced stress, inappropriate silence, variable rate, and harsh and breathy voice [56]. Both pitch and amplitude are perturbed, which results in jitter² and shimmer³ that can be considered as modulation noise [39].

To assess the effect of our CHNR-based system on the classification of the severity levels of dysarthria, an experiment is conducted under the same experimental conditions as those deployed in a previous work [57], on the Nemours [26] and Torgo [27] databases. The performance of our CNN-LSTM system with enhanced spectrograms and those of the systems of Kadi K. L et al. [57] are presented in Table 12. As reported in [57], the best accuracies of the GMM and GMM-SVM systems are achieved by using a fusion of auditory cues and MFCCs; the SVM-based system performs best by using only MFCCs. As shown in Table 12, the proposed CNN-LSTM system improved the performance by 6.14% compared to the best conventional system, achieving less than a 1% error rate in classifying dysarthria severity levels.

² Jitter is the fundamental frequency variation of the voice signal from cycle to cycle.
³ Shimmer is the amplitude variation of the voice signal from cycle to cycle.
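For illustration, a minimal sketch of the common "local" formulations of these two measures, computed from per-cycle pitch periods and peak amplitudes, is given below; it is a simplified example rather than the exact extraction pipeline used in this work.

import numpy as np

def jitter_percent(periods):
    # Local jitter: mean absolute cycle-to-cycle period difference,
    # relative to the mean period (in %).
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_percent(amplitudes):
    # Local shimmer: mean absolute cycle-to-cycle amplitude difference,
    # relative to the mean amplitude (in %).
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Pitch periods (s) and peak amplitudes of consecutive glottal cycles
print(jitter_percent([0.0100, 0.0103, 0.0098, 0.0101]))
print(shimmer_percent([0.82, 0.79, 0.84, 0.80]))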
Table 12 – Comparison of the performance of our CNN-LSTM system with those of Kadi K. L et al. in classifying three levels of dysarthria severity.

                  Our system          Kadi K. L et al. [57]
Configuration     SinRU & MMSE 35     SVM      GMM      GMM-SVM
Accuracy          99.34               76.6     93.2     78.8
F1 score          0.995               -        -        -

4. Discussion

The proposed CNN-LSTM using the new SinRU activation function is an innovative architecture designed for voice pathology classification that competes very well with state-of-the-art systems. The proposed configuration, composed of a 1D-CNN layer and an LSTM layer using spectrograms, achieved an accuracy of 70.62% in the classification of four voice pathologies (vocal cordectomy, psychogenic dysphonia, pachydermia laryngis, and frontolateral partial laryngectomy). This can be considered a satisfactory accuracy, since multiclass identification is a complicated task compared to the (usually binary) detection task. This statement was confirmed in [58], where four classes, three pathological and one healthy, were identified using the VGGish network as a feature extractor; the best accuracy was 41%, whereas when the task was reduced to a pairwise classification (discriminating between pathological and healthy voice), the accuracy increased to 80% for vocal cord paralysis detection using an LSTM architecture, and to 66% and 67% for dysphonia and laryngitis detection, respectively, using a 1D-CNN. Using the SVD database, the system presented in [20] achieved an accuracy of 68.08% in voice pathology detection (binary task) using a 1D-CNN-LSTM architecture with segmented raw signals as inputs. In [21], spectrograms were used to detect organic dysphonia (binary task) by a network composed of 2D-CNN and fully connected layers pretrained by a convolutional deep belief network, achieving accuracies of 71% and 77% with and without pretraining, respectively.
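To give the reader a concrete picture, a minimal Keras-style sketch of such a 1D-CNN + LSTM classifier over spectrogram frames is shown below. The input shape, layer sizes and optimizer settings are illustrative placeholders rather than the exact hyperparameters of the proposed system, and ReLU stands in for the SinRU activation discussed later.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(time_frames=128, freq_bins=129, n_classes=4):
    # Spectrogram input: a sequence of time frames, each a vector of frequency bins.
    model = models.Sequential([
        layers.Input(shape=(time_frames, freq_bins)),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),  # 1D convolution along time
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(64),                                # summarizes the frame sequence
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),  # one unit per pathology class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model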
One of the prominent advantages of the LSTM component used within our proposed system is its ability to learn short- and long-term contextual dependencies by retaining relevant information from previous states and forgetting irrelevant information. The importance of contextual information in acoustic modeling was demonstrated in pathological speech recognition systems using DNNs. Usually, the contextual information is provided by a fixed-size context window. For instance, in the phonetic variation modeling and speaker adaptation for dysarthric speech recognition system proposed in [56], the context window size is 11. [59] opted for a context of size 15 using both acoustic and articulatory features. In [60], the results showed that the accuracies of stuttering detection in dysarthric and non-dysarthric speech are improved after using the fixed-size context window. In contrast to these studies, where the best size of the context window was determined empirically, our approach does not use a fixed-size context, since it implicitly integrates contextual information with a changing window size thanks to the cells and gates of the LSTM.
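For reference, this gating mechanism follows the standard LSTM formulation, where $\sigma$ denotes the logistic sigmoid, $\odot$ the element-wise product, $x_t$ the input frame and $h_t$ the hidden state:

\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t).
\end{aligned}

The forget gate $f_t$ discards irrelevant parts of the cell state $c_{t-1}$, while the input gate $i_t$ admits new information, which is how the network retains context over a window of changing size.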
To put these results into perspective, it is worth mentioning the effectiveness of other competing techniques, such as frequency and amplitude tremor measurements, which have been used to characterize utterances of speakers suffering from Parkinson's disease [61] and those with voice tremor [62]. However, in the automatic detection of organic voice pathologies, features based on tremor measurements were not among the top three most contributing parameters that improve the performance [7]. Instead, using spectrograms seems to have the advantage of coping with a wider range of voice pathologies, since it is not restricted in terms of spectrum coverage. The use of time–frequency representations seems to be an effective approach for the identification and detection of pathologies. Indeed, in [29], the authors found that architectures based on DNNs perform better when scatter wavelet features are used as inputs of deep convolutional neural networks. This latter system achieved 99.62% in classifying normal and pathological voices, which is a binary classification task. On a more complex task, which consists of classifying three dysarthria severity levels, our proposed system using time–frequency spectrograms as input features achieved 99.34% accuracy.
The utterances of the SVD database were recorded in a sound-treated room. The lips-to-microphone distance was kept constant by the headset condenser microphone. The sampling frequency is above the minimum required, as mentioned in [63]. Knowing that the recording conditions are the same for all files, the new paradigm, which consists of using pathological voice enhancement as preprocessing prior to voice pathology classification in a two-stage system, is found to be effective when noise estimation is performed by the CHNR in the cepstral domain. The CHNR is sensitive to both additive noise and modulation noise, since it decreases with increased noise or jitter [39]. The CHNR feature related to the additive noise seems to be effective in discriminating the SVD pathologies investigated in this study. The results also demonstrated the effectiveness of the CHNR-based system in the context of modulation noise, since it achieved the best performance in classifying dysarthria severity levels.

The speech enhancement module improved the classification of voice pathologies using conventional speech enhancers, namely spectral subtraction, Wiener filtering and MMSE. The MMSE-based enhancement, which uses log-magnitude spectra, gives the best performance when using the CHNR to estimate noise. The effect of the CHNR in enhancing voice is measured by objective intelligibility metrics instead of subjective ones, such as in the language-dependent experiments carried out in [64]. Since our experiments are language-independent thanks to the use of sustained vocalization (vowel /a/), the proposed approach based on the CHNR seems to be adequate. Therefore, our study confirms that using sustained vocalization is recommended to perform voice pathology analysis, as demonstrated in studies that measured the effects of non-invasive therapy on organic vocal pathologies [65], as well as when performing voice tremor measurements [62]. Sustained vocalization has also been successfully used by machine learning algorithms to track improvements in voice quality following a specific treatment based on botulinum injection, where a significant correlation was found between dysphonia clinical scores and ANN-based scores [66]. These latest findings are promising signs that machine learning algorithms can be trained to be sensitive and accurate in monitoring the rehabilitation process of people with voice disorders.
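For orientation, the per-bin gain rules underlying these three enhancers are recalled below in simplified textbook form (cf. [37,38]); the implementations evaluated in this work may differ in their SNR estimation and smoothing details.

import numpy as np
from scipy.special import exp1

# Y2: noisy power spectrum per bin; N2: estimated noise power per bin.
def spectral_subtraction_gain(Y2, N2):
    # Power spectral subtraction with half-wave rectification.
    return np.sqrt(np.maximum(Y2 - N2, 0.0) / Y2)

def wiener_gain(Y2, N2):
    # Wiener filter driven by a crude a priori SNR estimate xi.
    xi = np.maximum(Y2 - N2, 0.0) / N2
    return xi / (1.0 + xi)

def log_mmse_gain(xi, gamma):
    # Log-spectral amplitude MMSE estimator [37]:
    # G = xi/(1+xi) * exp(0.5 * E1(v)), with v = xi*gamma/(1+xi),
    # where gamma is the a posteriori SNR and E1 the exponential integral.
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))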
The noise estimation in limited bands increases the performance of voice pathology classification compared to its estimation over the entire spectrum. Since these bands contain the harmonics and formants of the vowel /a/, the improvement in performance can be explained by the wise choice of estimating the noise in the bands where the harmonics are corrupted by noise. Our approach confirms the importance of selecting adequate frequency bands for the detection and classification of voice pathologies. Frequency band selection was also recommended by [8], where focusing on the 1–8 kHz band was found effective. Another experimental investigation, carried out in [67], found that for Parkinson's disease intelligibility classification, the mean absolute deviation of the cepstral separation difference between source and filter computed between 0–1 kHz, together with the 4th MFCC, is strongly correlated with the clinical scores. Similarly, the deep structure of the proposed CNN-LSTM performs a blind selection of spectrograms by activating, thanks to the SinRU function, relevant feature maps for the classification task.
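To make the CHNR-based noise estimation concrete, the following simplified sketch follows the spirit of the cepstrum-based method of de Krom [39]: the rahmonics at multiples of the pitch period are comb-liftered out of the cepstrum, a noise-floor spectrum is rebuilt, and the harmonic-to-noise energy ratio is taken in dB. The actual first stage additionally applies corrections and restricts the estimation to the selected frequency bands, so this code is an illustration rather than the exact implementation.

import numpy as np

def chnr_db(frame, fs, f0):
    # Windowed log-magnitude spectrum of one voiced frame.
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(n))) + 1e-12
    log_spec = np.log(spec)
    ceps = np.fft.irfft(log_spec)                  # real cepstrum
    t0 = int(round(fs / f0))                       # pitch period in samples
    liftered = ceps.copy()
    for k in range(1, len(liftered) // (2 * t0)):
        liftered[k * t0 - 2 : k * t0 + 3] = 0.0            # zero each rahmonic peak
        liftered[n - k * t0 - 2 : n - k * t0 + 3] = 0.0    # and its mirror image
    # Rebuild an (approximate) noise-floor spectrum from the liftered cepstrum.
    noise_spec = np.exp(np.fft.rfft(liftered).real[: len(log_spec)])
    total_e, noise_e = np.mean(spec ** 2), np.mean(noise_spec ** 2)
    return 10.0 * np.log10(max(total_e - noise_e, 1e-12) / noise_e)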
When the LDCU–Net-PFPL, which is a supervised deep-learning-based approach, is used in the first stage for speech enhancement, a performance degradation is noticed. This may be explained by the fact that the system is not well suited to intrinsic noises like those of pathological voices, which explains the appearance of horizontal lines on the enhanced spectrograms in Fig. 6. Indeed, the results confirm that the use of pure deep learning approaches such as LDCU–Net-PFPL and DSEGAN in the proposed scheme was not as effective as the combination of conventional and deep-learning-based enhancement techniques. Recently, this combination has demonstrated its effectiveness in the case of external noises [68,69]. In our study, where the noises are assumed to be of intrinsic origin, the combination of the modified MMSE and DSEGAN achieved better performance. In the context of pathological signals, where pure noisy zones and clean signals are not available to train deep-learning-based methods, the support provided by conventional techniques seems to be the best strategy to face this unfavorable situation.

The experimental comparison between the different activation functions showed that the proposed SinRU function was the most efficient. The SinRU activation function improved the performance of the CNN-LSTM voice pathology classifier with the different enhancement techniques. This is probably due to the sinusoidal component, which introduces additional nonlinearity into the positive input part of the function. This allows better learning of complex and strongly nonlinear patterns from pathological spectrograms, and may be the reason for its efficiency, which can lead to an accuracy of 87.38%.
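Since the exact analytic form of SinRU is given earlier in the paper, the snippet below should be read as a hypothetical illustration of the idea described here, namely a rectifier whose positive branch is perturbed by a sinusoidal term:

import numpy as np

def sinru(x):
    # Hypothetical illustrative form: ReLU plus a sine term on the positive
    # side; the exact SinRU definition appears earlier in the paper and may
    # differ from this sketch.
    x = np.asarray(x, dtype=float)
    return np.where(x > 0.0, x + np.sin(x), 0.0)

def relu(x):
    # Baseline rectifier for comparison.
    return np.maximum(np.asarray(x, dtype=float), 0.0)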
In regard to the good performance achieved by the proposed CNN-LSTM framework for the classification of dysarthric severity levels, it becomes possible to consider its integration within pathological speech recognition systems to improve their effectiveness. Indeed, many recent approaches to pathological speech recognition recommend designing systems that are well adapted to the severity level of the speech disorder. For instance, in [70], a tempo adjustment approach is proposed to perform robust personalized dysarthric speech recognition. The results presented in this latter study showed that the system based on phoneme-based tempo adjustment performs best for moderate and severe cases. However, the authors also pointed out the need for a robust mapping model of dysarthric speech dynamics at the level of every single phoneme. This feature can be provided by our proposed system, which achieved less than a 1% error rate in distinguishing between dysarthric severity levels.

The development of assistive technologies for the assessment and therapeutic judgement of voice disorders opens the way for new opportunities in the field of telemedicine by providing screening tools that could be used in triage and pre-diagnosis, in the prospect of improving the current clinical routine. Prospects are also oriented towards the advantages provided by mobile health (m-health) systems, which can constitute a simple and rapid support for the detection of vocal pathologies. A flowchart of a possible m-health system for voice health classification using machine learning is presented in [14]. In the perspective of integrating the proposed system into telemedicine applications, it would be important to consider the aspects related to the compression of the speech signal on the transmission channels. Indeed, as demonstrated in [71], compression algorithms have noticeable effects on speech intelligibility by modifying the vowel space area. Our system does not use data compression. However, in a telemedicine context, it is critical to be mindful of these changes in intelligibility, because our approach uses sustained vowels as units of classification.

Recent advances in machine learning have provided an opportunity to automate the diagnosis and assessment of voice, paving the way to the development of clinically applicable tools that could be used by voice specialists. However, although the results demonstrate the general viability of machine learning algorithms for the assistive diagnosis of voice disorders, it is important to mention that the generalization of the results to new patients is not necessarily achieved nor measured properly, because the training is usually performed on limited or non-accessible datasets. Besides the shortcomings of the existing datasets, the gaps in the adopted evaluation methodologies and the lack of standardization of clinical assessment protocols constitute the main obstacles to universal comparison and interpretation [72]. Recent methodological developments show an awareness of the need to create favorable conditions by establishing guidelines for machine learning algorithms, such as those presented in [63], to reach sufficient maturity that will allow full acceptability in real-life healthcare settings and environments.

5. Conclusions

In this paper, a two-stage multiclass voice pathology classification framework was proposed. The first stage carried out pathological voice enhancement by estimating the noise in the cepstral domain using the CHNR. Supervised and unsupervised speech enhancement algorithms were evaluated, namely, spectral subtraction, Wiener filtering, log-MMSE, LDCU–Net-PFPL and DSEGAN. The second stage learned relevant feature maps from the spectrograms using a hybrid CNN-LSTM neural network. To strengthen the learning of complex patterns from spectrograms, a new activation function, SinRU, was proposed, and the results confirmed its effectiveness. The classification experiments were carried out considering different etiologies reflected in two subsets of the SVD database, and dysarthria severity levels on the Nemours and Torgo databases.

The results showed that the proposed framework is etiology-independent and therefore could be used to perform a multiclass recognition of voice disorders, which is by far more complex than the widely used pairwise methods. In contrast to several approaches where fixed phonemic contexts are used, the deep structure of the proposed CNN-LSTM implicitly learns contextual information from feature maps encompassing relevant spectrogram components.

Our study confirms that the use of sustained vocalization is an effective way to perform voice pathology analysis, as demonstrated by several studies evaluating the effects of non-invasive therapy, as well as when performing measurements of some pathological voice characteristics. The use of sustained vocalization also has the advantage of providing language-independent systems. However, the effect of speech signal compression should be considered in transmission channels, because it modifies the vowel space area.

This work formulates the original hypothesis that a pathological voice is intrinsically noisy. This assumption means that the voice impairment induces a "noise effect" in sustained vocalization. The hypothesis is experimentally validated through the improvement achieved when
enhancement of voice segments is used prior to voice pathology classification. In the context of pathological signals, where the noise is considered to be of intrinsic origin, our study demonstrated the advantage of combining deep-learning-based methods and conventional speech enhancement techniques to achieve better classification performance.

The high classification accuracy of dysarthria severity levels achieved by our system opens up many possibilities in the perspective of its joint use with recent dysarthric speech recognition systems. In the context of the need for a robust mapping model of dysarthric speech dynamics that many research studies have highlighted, our system can be placed upstream of dysarthric speech recognizers to adapt their configuration according to severity levels.

Finally, the experimental evidence of this study showed that the proposed two-stage architecture is effective in performing a multiclass identification of voice disorders. Nevertheless, in light of the thoughts shared by the authors in [73], we can state that both the features and the conceptual models used in this study are, to a large extent, relatively simple compared to the inexhaustible richness and variability of voice, speech, and language pathologies. Therefore, to overcome these limitations, it is necessary to deepen the multidisciplinary collaboration between machine learning, behavioral signal processing and clinical knowledge.

CRediT authorship contribution statement

Mounira Chaiani: Conceptualization, Writing - original draft, Writing - review & editing, Visualization, Investigation, Methodology. Sid Ahmed Selouani: Conceptualization, Methodology, Software, Writing - review & editing, Supervision, Validation. Malika Boudraa: Conceptualization, Supervision. Mohammed Sidi Yakoub: Data curation, Software, Validation, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

R E F E R E N C E S

[1] American Speech-Language-Hearing Association. Definitions of communication disorders and variations [relevant paper]; 1993. URL: https://www.asha.org/policy/rp1993-00208/.
[2] Martins RHG, do Amaral HA, Tavares ELM, Martins MG, Gonçalves TM, Dias NH. Voice disorders: etiology and diagnosis. J Voice 2016;30(6):761.e1–761.e9.
[3] American Speech-Language-Hearing Association. Voice disorders (practice portal); retrieved December 30, 2021. URL: https://www.asha.org/Practice-Portal/Clinical-Topics/Voice-Disorders/.
[4] Baker J. The role of psychogenic and psychosocial factors in the development of functional voice disorders. Int J Speech-Language Pathol 2008;10(4):210–30.
[5] Chandrakala S, Rajeswari N. Representation learning based speech assistive system for persons with dysarthria. IEEE Trans Neural Syst Rehabil Eng 2017;25(9):1510–7.
[6] Muhammad G, Altuwaijri G, Alsulaiman M, Ali Z, Mesallam TA, Farahat M, Malki KH, Al-nasheri A. Automatic voice pathology detection and classification using vocal tract area irregularity. Biocybern Biomed Eng 2016;36(2):309–17.
[7] Al-nasheri A, Muhammad G, Alsulaiman M, Ali Z, Mesallam TA, Farahat M, Malki KH, Bencherif MA. An investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification. J Voice 2017;31(1):113.e9–113.e18.
[8] Al-Nasheri A, Muhammad G, Alsulaiman M, Ali Z, Malki KH, Mesallam TA, Ibrahim MF. Voice pathology detection and classification using auto-correlation and entropy features in different frequency regions. IEEE Access 2018;6:6961–74.
[9] Hammami I, Salhi L, Labidi S. Voice pathologies classification and detection using EMD-DWT analysis based on higher order statistic features. IRBM 2020;41(3):161–71.
[10] Karan B, Sahu SS, Mahto K. Parkinson disease prediction using intrinsic mode function based features from speech signal. Biocybern Biomed Eng 2020;40(1):249–64.
[11] Hossain MS, Muhammad G. Healthcare big data voice pathology assessment framework. IEEE Access 2016;4:7806–15.
[12] Ali Z, Elamvazuthi I, Alsulaiman M, Muhammad G. Automatic voice pathology detection with running speech by using estimation of auditory spectrum and cepstral coefficients based on the all-pole model. J Voice 2016;30(6):757.e7–757.e19.
[13] Harar P, Galaz Z, Alonso-Hernandez JB, Mekyska J, Burget R, Smekal Z. Towards robust voice pathology detection. Neural Comput Appl 2018:1–11.
[14] Verde L, Pietro GD, Sannino G. Voice disorder identification by using machine learning techniques. IEEE Access 2018;6:16246–55.
[15] España-Bonet C, Fonollosa JAR. Automatic speech recognition with deep neural networks for impaired speech. In: Abad A, Ortega A, Teixeira A, Mateo CG, Hinarejos CDM, Perdigão F, Batista F, Mamede N, editors. Advances in Speech and Language Technologies for Iberian Languages. Cham: Springer International Publishing; 2016. p. 97–107.
[16] Zaidi BF, Selouani SA, Boudraa M, Yakoub MS. Deep neural network architectures for dysarthric speech analysis and recognition. Neural Comput Appl 2021:1–20.
[17] Alhussein M, Muhammad G. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access 2018;6:41034–41.
[18] Mohammed MA, Abdulkareem KH, Mostafa SA, Ghani MKA, Maashi MS, Garcia-Zapirain B, Oleagordia I, Alhakami H, AL-Dhief FT. Voice pathology detection and classification using convolutional neural network model. Applied Sciences 2020;10(11):3723.
[19] Alhussein M, Muhammad G. Automatic voice pathology monitoring using parallel deep models for smart healthcare. IEEE Access 2019;7:46474–9.
[20] Harar P, Alonso-Hernandezy JB, Mekyska J, Galaz Z, Burget R, Smekal Z. Voice pathology detection using deep learning: a preliminary study. In: 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI); 2017. p. 1–4.
[21] Wu H, Soraghan J, Lowit A, Di-Caterina G. A deep learning method for pathological voice detection using convolutional deep belief networks. In: Proc. Interspeech 2018; 2018. p. 446–50.
[22] Fang S-H, Tsao Y, Hsiao M-J, Chen J-Y, Lai Y-H, Lin F-C, Wang C-T. Detection of pathological voice using cepstrum vectors: A deep learning approach. J Voice 2019;33(5):634–41.
[23] Chen L, Chen J. Deep neural network for automatic classification of pathological voice signals. J Voice 2020.
[24] Kim H, Jeon J, Han YJ, Joo Y, Lee J, Lee S, Im S. Convolutional neural network classifies pathological voice change in laryngeal cancer with high accuracy. J Clinical Med 2020;9(11):3415.
[25] Pützer M, Barry WJ. Saarbruecken Voice Database. Institut für Phonetik, Universität des Saarlandes; May 2007. URL: http://www.stimmdatenbank.coli.uni-saarland.de/help_en.php4.
[26] Menendez-Pidal X, Polikoff JB, Peters SM, Leonzio JE, Bunnell HT. The Nemours database of dysarthric speech. In: Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP'96), Vol. 3. IEEE; 1996. p. 1962–5.
[27] Rudzicz F, Namasivayam AK, Wolff T. The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resour Eval 2012;46(4):523–41.
[28] Hsu Y-T, Zhu Z, Wang C-T, Fang S-H, Rudzicz F, Tsao Y. Robustness against the channel effect in pathological voice detection. CoRR abs/1811.10376; 2018. arXiv:1811.10376.
[29] Souli S, Amami R, Yahia SB. A robust pathological voices recognition system based on DCNN and scattering transform. Appl Acoust 2021;177:107854.
[30] Bhat C, Das B, Vachhani B, Kopparapu SK. Dysarthric speech recognition using time-delay neural network based denoising autoencoder. In: Proc. Interspeech 2018; 2018. p. 451–5.
[31] Yakoub MS, Selouani SA, Zaidi B-F, Bouchair A. Improving dysarthric speech recognition using empirical mode decomposition and convolutional neural network. EURASIP J Audio Speech Music Processing 2020;2020(1):1–7.
[32] Borrie SA, Baese-Berk M, Engen KV, Bent T. A relationship between processing speech in noise and dysarthric speech. J Acoust Soc Am 2017;141(6):4660–7.
[33] Stachler RJ, Francis DO, Schwartz SR, Damask CC, Digoy GP, Krouse HJ, McCoy SJ, Ouellette DR, Patel RR, Reavis CCW, Smith LJ, Smith M, Strode SW, Woo P, Nnacheta LC. Clinical practice guideline: Hoarseness (dysphonia) (update). Otolaryngology-Head Neck Surgery 2018;158(1_suppl):S1–S42.
[34] Gómez-García J, Moro-Velázquez L, Arias-Londoño J, Godino-Llorente J. On the design of automatic voice condition analysis systems. Part III: review of acoustic modelling strategies. Biomed Signal Process Control 2021;66:102049.
[35] Moers C, Möbius B, Rosanowski F, Nöth E, Eysholdt U, Haderlein T. Vowel- and text-based cepstral analysis of chronic hoarseness. J Voice 2012;26(4):416–24.
[36] Aronson AE, Bless D. Clinical Voice Disorders. Thieme Publishers Series. Thieme; 2009.
[37] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 1985;33(2):443–5.
[38] Loizou P. Speech Enhancement: Theory and Practice. Signal Processing and Communications. Taylor & Francis; 2007.
[39] de Krom G. A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J Speech Language Hearing Res 1993;36(2):254–66.
[40] Nielsen JK, Jensen TL, Jensen JR, Christensen MG, Jensen SH. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing 2017;135:188–97.
[41] Jana GC, Sharma R, Agrawal A. A 1D-CNN-spectrogram based approach for seizure detection from EEG signal. Procedia Computer Science 2020;167:403–12.
[42] Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: ICML; 2010. p. 807–14.
[43] Witten IH, Frank E, Hall MA, Pal CJ. Chapter 10 - Deep learning. In: Witten IH, Frank E, Hall MA, Pal CJ, editors. Data Mining. 4th ed. Morgan Kaufmann; 2017. p. 417–66.
[44] Boureau Y-L, Ponce J, LeCun Y. A theoretical analysis of feature pooling in visual recognition. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). p. 111–8.
[45] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. J Mach Learning Res 2014;15(56):1929–58.
[46] Pützer M, Wokurek W. Electroglottographic and acoustic parametrization of phonatory quality provide voice profiles of pathological speakers. J Voice 2021.
[47] Sasaki CT. Accessed: 2021-05-07. URL: https://www.merckmanuals.com/fr-ca/professional/affections-de-l-oreille,-du-nez-et-de-la-gorge/troubles-laryngiens/ulc%C3%A8res-de-contact-du-larynx.
[48] Mouawad F, Chevalier D, Santini L, Fakhry N, Bozec A, Espitalier F. Chapitre 9 - Traitement chirurgical par cervicotomie et reconstruction laryngée. In: Barry B, Malard O, Morinière S, editors. Cancers du Larynx. Paris: Elsevier Masson; 2019. p. 89–115.
[49] Kingma DP, Ba J. Adam: A method for stochastic optimization; 2017. arXiv:1412.6980.
[50] Hsieh T-A, Yu C, Fu S-W, Lu X, Tsao Y. Improving perceptual quality by phone-fortified perceptual loss using Wasserstein distance for speech enhancement. In: Proc. Interspeech 2021; 2021. p. 196–200.
[51] Phan H, McLoughlin IV, Pham L, Chén OY, Koch P, Vos MD, Mertins A. Improving GANs for speech enhancement. IEEE Signal Process Lett 2020;27:1700–4.
[52] Pouchoulin G, Fredouille C, Bonastre J-F, Ghio A, Giovanni A. Frequency study for the characterization of the dysphonic voices. In: Proc. Interspeech 2007; 2007. p. 1198–201.
[53] Hendrycks D, Gimpel K. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR abs/1606.08415; 2016. arXiv:1606.08415.
[54] Klambauer G, Unterthiner T, Mayr A, Hochreiter S. Self-normalizing neural networks. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017. p. 972–81.
[55] Ramachandran P, Zoph B, Le QV. Searching for activation functions. CoRR abs/1710.05941; 2017. arXiv:1710.05941.
[56] Kim M, Kim Y, Yoo J, Wang J, Kim H. Regularized speaker adaptation of KL-HMM for dysarthric speech recognition. IEEE Trans Neural Syst Rehabil Eng 2017;25(9):1581–91.
[57] Kadi KL, Selouani SA, Boudraa B, Boudraa M. Fully automated speaker identification and intelligibility assessment in dysarthria disease using auditory knowledge. Biocybern Biomed Eng 2016;36(1):233–47.
[58] Guedes V, Teixeira F, Oliveira A, Fernandes J, Silva L, Junior A, Teixeira JP. Transfer learning with AudioSet to voice pathologies identification in continuous speech. Procedia Computer Science 2019;164:662–9.
[59] Yilmaz E, Mitra V, Bartels C, Franco H. Articulatory features for ASR of pathological speech; 2018. arXiv:1807.10948.
[60] Oue S, Marxer R, Rudzicz F. Automatic dysfluency detection in dysarthric speech using deep belief networks. In: Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies. Dresden, Germany: Association for Computational Linguistics; 2015. p. 60–4.
[61] Brückl M, Ghio A, Viallet F. Measurement of tremor in the voices of speakers with Parkinson's disease. Procedia Computer Science 2018;128:47–54.
[62] Suppa A, Asci F, Saggio G, Leo PD, Zarezadeh Z, Ferrazzano G, Ruoppolo G, Berardelli A, Costantini G. Voice analysis with machine learning: One step closer to an objective diagnosis of essential tremor. Mov Disord 2021;36(6):1401–10.
[63] Rusz J, Tykalova T, Ramig LO, Tripoliti E. Guidelines for speech recording and acoustic analyses in dysarthrias of movement disorders. Mov Disord 2021;36(4):803–14.
[64] Nilsson C, Nyberg J, Strömbergsson S. How are speech sound disorders perceived among children? A qualitative content analysis of focus group interviews with 10–11-year-old children. Child Language Teaching Therapy 2021;37(2):163–75.
[65] Lin F-C, Chien H-Y, Kao Y-C, Wang C-T. Multi-dimensional investigation of the clinical effectiveness and prognostic factors of voice therapy for benign voice disorders. J Formos Med Assoc 2021.
[66] Suppa A, Asci F, Saggio G, Marsili L, Casali D, Zarezadeh Z, Ruoppolo G, Berardelli A, Costantini G. Voice analysis in adductor spasmodic dysphonia: Objective diagnosis and response to botulinum toxin. Parkinsonism Related Disorders 2020;73:23–30.
[67] Khan T, Westin J, Dougherty M. Classification of speech intelligibility in Parkinson's disease. Biocybern Biomed Eng 2014;34(1):35–45.
[68] Zezario RE, Huang J-W, Lu X, Tsao Y, Hwang H-T, Wang H-M. Deep denoising autoencoder based post filtering for speech enhancement. In: 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC); 2018. p. 373–7.
[69] Zhang Q, Nicolson A, Wang M, Paliwal KK, Wang C. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation. IEEE/ACM Trans Audio Speech Language Processing 2020;28:1404–15.
[70] Xiong F, Barker J, Christensen H. Phonetic analysis of dysarthric speech tempo and applications to robust personalised dysarthric speech recognition. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). p. 5836–40.
[71] Utianski RL, Sandoval S, Berisha V, Lansford KL, Liss JM. The effects of speech compression algorithms on the intelligibility of two individuals with dysarthric speech. Am J Speech-Language Pathology 2019;28(1):195–203.
[72] Patel RR, Awan SN, Barkmeier-Kraemer J, Courey M, Deliyski D, Eadie T, Paul D, Švec JG, Hillman R. Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function. Am J Speech-Language Pathology 2018;27(3):887–905.
[73] Corcoran C, Cecchi G. Using language processing and speech analysis for the identification of psychosis and other disorders. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 2020;5(8):770–9.