1 s2.0 S1746809423006924 Main

Biomedical Signal Processing and Control 86 (2023) 105259
Contents lists available at ScienceDirect
Biomedical Signal Processing and Control

journal homepage: www.elsevier.com/locate/bspc
Non-invasive way to diagnose dysphagia by training deep learning model

with voice spectrograms
Heekyu Kim a, 1, Hae-Yeon Park b, 1, DoGyeom Park c, Sun Im b, 2, *, Seungchul Lee a, c, 2, *
a
Department of Mechanical Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
b
Department of Rehabilitation Medicine, Bucheon St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
c
Graduate School of Artificial Intelligence, Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
A R T I C L E I N F O A B S T R A C T
Keywords: Background and objective: Patients with dysphagia show changes in articulation and voice quality, and recent
Dysphagia studies using machine learning models have been employed to help in the classification. This study aimed to
Disease detection apply a novel deep learning method using only the patient‘s voice to classify normal controls from dysphagia
Acoustic data
patients and determine whether this new deep learning method may help provide a rapid and accurate means to
Deep learning
STFT
supplement the existing clinical methods in dysphagia screening and assessment.
MFCC Methods: Voice samples from 299 healthy controls and 290 patients with post-stroke dysphagia; who performed
four simple phonation tasks; were obtained in a prospective manner at a university-affiliated hospital using a
smart digital device. Deep learning methods were employed as follows: firstly, a spectrogram is obtained through
the short time Fourier transform (STFT) and Mel-frequency cepstral coefficients (MFCC) on a sound signal,
respectively. Secondly, the STFT and MFCC spectrograms obtained for each protocol are fed to each multibranch
model. Finally, during the test, each model is ensembled in a soft voting method to distinguish normal and
dysphagia classes.
Results: Five evaluation metrics are used to evaluate the performance of the model: AUC, Sensitivity, Specificity,
Positive predictive value (PPV), and Negative predictive value (NPV). Among the performance metrics, sensi
tivity and specificity levels are compared with the existing diagnostic tools. The ensemble model incorporating
all four tasks showed an AUC of 0.950 ± 0.004, with sensitivity and specificity levels as high as 94.7% and
77.9%, respectively.
Conclusions: The novel deep learning model proposed in this paper shows promising performance levels in
dysphagia classification. Our results show that the ensemble method used in this study may be utilized as a
convenient and rapid digital biomarker of dysphagia in a non-invasive and automated manner.
1. Introduction undiagnosed can lead to aspiration pneumonia, which is known to lead

to poor functional recovery and increase poststroke mortality. There
1.1. Dysphagia after stroke fore, early diagnosis and prompt treatment of those with swallowing
disturbance, especially after a stroke, is of paramount importance. In
Post-stroke dysphagia may result in dysfunction of multiple fact, many international guidelines recommend dysphagia screening as
oropharyngeal muscles including those around the vocal cords. As a part of acute stroke management [1–3].
result, when people with dysphagia attempt to swallow, the vocal cords Several methods exist for diagnosing dysphagia starting from
may not close properly, allowing food or secretion to enter the airways bedside screening tests [4–6], to instrumental tests such as the video
resulting in aspiration and respiratory complications. Dysphagia left fluoroscopic swallowing study (VFSS) [7], and fiberoptic endoscopic
* Corresponding authors at: Department of Rehabilitation Medicine, Bucheon St. Mary’s Hospital, College of Medicine, the Catholic University of Korea, 327 Sosa-
ro, Bucheon-si, Gyeonggi-do 14647, Republic of Korea (Sun Im); Department of Mechanical Engineering, POSTECH, 77 Cheongam-ro, Pohang, Gyeongbuk, 37673,
South Korea (Seungchul Lee).
E-mail addresses: lafolia@catholic.ac.kr (S. Im), seunglee@postech.ac.kr (S. Lee).
1
These authors contributed equally: Heekyu Kim and Hae-Yeon Park.
2
These authors jointly supervised this work: Seungchul Lee and Sun Im.
https://doi.org/10.1016/j.bspc.2023.105259
Received 11 October 2022; Received in revised form 25 April 2023; Accepted 3 July 2023
Available online 10 July 2023
1746-8094/© 2023 Elsevier Ltd. All rights reserved.
H. Kim et al. Biomedical Signal Processing and Control 86 (2023) 105259
examination of swallowing (FEES) [8]. The former simple in its appli (MLP) neural network or conventional machine learning models (SVM,
cation may show variable diagnostic performance and the latter two random forest, etc.). Extracting features from spectrograms requires
tools, considered as gold standard methods, require specialized staff and disease-related domain knowledge, and extracting appropriate features
expensive equipment. Recent studies have advocated using the acoustic greatly affects performance. Also, in general, convolutional neural
information from voice changes in dysphagia patients as a promising network (CNN)-based neural network models are known to have higher
method to supplement current assessment tools. These voice changes are performance than MLP due to the formation of a deeper network layer
hypothesized to reflect the impaired vocal cord closure and limited and a large number of parameters [18]. To date, there have been limited
laryngeal function. Nonetheless, specialized recording devices and studies that have applied CNN in analyzing voice changes in dysphagia
analytical tools are required to extract these voice features. The use of patients. Therefore, we aimed to construct a model that does not need to
smart devices has led to multiple studies advocating the use of compu directly extract features by receiving the spectrogram itself as an input.
tational techniques to detect changes in speech and voice analysis in We instead proposed a model that utilizes the strength of CNN. It follows
COVID-19 infection [9,10] mood disorder [9,10] and even cardiovas a simple procedure of converting speech data into short time Fourier
cular disorders [11]. Simple automated methods using digital devices transform (STFT) and MFCC spectrograms and inputting them into the
that can help screen stroke patients who may need formal dysphagia model.
assessments after a stroke could potentially be helpful.
1.2. Application of machine learning using acoustic features in voice 1.3. Contribution of this work
disorders
The contributions of this work are as follows:
With the rapid development of machine learning in recent years,
several studies have been conducted to diagnose diseases using machine 1) Because dysphagia is diagnosed using deep learning with only the
learning. These studies have mainly adopted the method of diagnosing patient’s voice, it is a non-invasive, low-cost, and simple test method
diseases from voice data because it is a non-invasive, low-cost, and that may easily be applicable in the clinical setting.
simple diagnostic method. The voice also contains information such as 2) Unlike previous studies, extracting features from voice signals is
pitch, intensity, and formant, and from this, various features related to unnecessary. Only the spectrograms are taken as an input, so it can
paralysis of the vocal cords can be obtained. Jitter, shimmer, and be applied more generally. Higher performances were obtained by
harmonics-to-noise ratio (HNR) [12] have been widely used as features using a CNN-based model instead of existing machine learning
for diagnosing diseases from voice. When diagnosing disease through a methods.
machine learning model, sustained vowel or conversational speech is 3) Through voice analysis, we found dysphagia patients have differ
widely used as voice data. With these data, linguistic or vocal disorders ences in fundamental frequency. These peculiarities are included in
can be detected, and diseases can be diagnosed by capturing charac MFCC spectrograms. In addition, because the STFT spectrograms
teristics different from normal people through voice signal analysis. display changes in amplitude and frequency over time, and CNN can
When collecting sustained vowel data, the patient is asked to produce detect local changes using a sliding window approach, applying CNN
the commanded pronunciation as steady as possible (in terms of to STFT spectrogram can include pathological features such as jitter
amplitude and pitch) over a period of time. and shimmer. Thus, we propose a multibranch model that takes both
There are several prior studies related to diagnosing vocal pathology STFT spectrograms and MFCC spectrograms as the input of the model
from sound signals using machine learning. Felipe L. and Joao Paulo to improve the model performances.
Teixeria [13] proposed a diagnosis method using a multi-layer- 4) Not limited to the old way of classifying pathologies from one
perceptron neural network with Mel-frequency cepstral coefficients phonation task protocol, we used four phonation protocols to
(MFCC), jitter, shimmer, HNR, and autocorrelation. They performed maximize the reliability of model classification with the protocol
classification using sustained vowel and conversational speech datasets ensemble method. Four models are trained with four protocols,
and showed 65.9 % specificity and 78 % sensitivity when distinguishing respectively. The predictions of the model for each protocol data are
normal people from patients. Fang, S.-H., et al. [14] compared the per aggregated and then make a final decision.
formance using both the shallow learning (support vector machine,
Gaussian mixture model) and the deep learning method, and showed the 2. Background
potential of the deep learning method by showing the highest perfor
mance with an accuracy of 99 % in deep learning. To validate the per 2.1. Short time Fourier transform (STFT)
formance of the deep neural network (DNN), they utilized the voice
disorder database from Massachusetts Eye and Ear Infirmary (MEEI), Fast Fourier transform (FFT) [19] has the advantage of transforming
which consists of sustained vowel. They extracted features from MFCC the time axis into the frequency axis so that it can be analyzed in the
and used them as input data for deep learning models. A. Al-Nasheri frequency domain. However, FFT only shows what frequency the signal
et al. [15] conducted a study on extracting features from speech data has, and does not tell when the frequency exists. STFT [20] is a means to
using a correlation function and classifying patients using a support compensate for these shortcomings of FFT. It divides a non-stationary
vector machine (SVM). The performance of the MEEI database consist signal into short sections that can be roughly assumed to be station
ing of sustained vowel showed 98.571 % sensitivity and 99.545 % ary, performs FFT, and connects the results of FFT for each section to see
specificity. D. Hemmerling el al. [16] used the Saarbruecken Voice the change in frequency over time. FFT is a faster version of discrete
Database consisting of sustained vowel data, extracted 28 features from Fourier transform (DFT), and DFT follows Equation (1).
the voice data, processed the features with principal component analysis ∑
N− 1
(PCA) methods, and then proceeded with classification using a random X[k] = x[n]e− j2Nπk n
(1)
forest model. S. Jothilakshmi [17] used data composed of sustained n=0
vowel, extracted the linear prediction coefficients (LPC) and MFCC from
voice data, and conducted a study to distinguish normal voices and for k = 0,1,⋯,N − 1, where k corresponds to the frequency f(k) = k Nfs , fs
pathological voices with Gaussian mixture model (GMM) and hidden is the sampling frequency in hertz [21]. The original signal x[n] can be
[ ]
Markov model (HMM) proceeded.
divided into → x 1 [n], →
x 2 [n], →
x 3 [n], ⋯, →
x p [n], ⋯, →
x P [n] , where →
x p [n] is
Prior studies have mostly been carried out by extracting features
from spectrograms and inputting them into a multi-layer perceptron a signal having N samples in the pth segment. Then, DFT can be taken for
2
each segment, according to Equation (2). trum. It is expressed by Equation (5).

∑
N− 1 ∑
F { }
d(2m − 1)π
Xp [k] = x[n]ω[n]e− j2Nπk n
(2) Mdp {x[n] } = LFBp (m, k)cos (5)
n=0 m=1
2F
Here, ω[n] is a window function, and hamming window ω[n] = Where Mdp denotes the dth MFCC of the pth segment of the original
( )
0.54 − 0.46cos nNπ is typically used. signal x[n]. As a result, the MFCC spectrum M for the original signal x[n]
follows as:
2.2. Mel-frequency cepstral coefficient (MFCC) M = [M1 , M2 , M3 , ⋯, Mp , ⋯, MP ] (6)
Human hearing is sensitive to changes at low frequencies, but tends Fig. 1 shows the process of converting the raw signal to MFCC
to be insensitive to changes at high frequencies. To reflect these human spectrum.
auditory characteristics, the Mel frequency was devised, and the rela
tionship between the Mel frequency ϕf and the linear frequency lf is 2.3. Convolutional neural network architecture
( )
lf
represented by ϕf = 2595*log10 1 +700 where lf (k) = k Nfs . Mel filter
CNN [23] architectures are used in many fields, such as image
bank consists of several triangular filters with different center fre
classification, object detection, and pose estimation, and play an
quencies lf c (m) and partially overlapping each other [22]. Equation (3)
important role. CNN architectures generally consist of convolutional,
is a mathematical representation of the Mel filter bank.
pooling, fully connected layers, and activation functions. When the in
⎧
⎪ 0 puts go through convolutional layers, feature maps are produced, and
⎪
⎪
⎪
⎪ l l the goal of the convolutional layers is to automatically search for fea
⎪
⎪ (k) − fc (m − 1)
tures needed for feature detection or classification. The convolutional
f
⎪
⎪
⎨ lfc (m) − lfc (m − 1)
Mel(m, k) for lf (k) < lfc (m − 1) layer consists of several convolutional kernels, and different feature
⎪
⎪ lf (k) − lfc (m + 1) maps are output when the input passes through different kernels. When
⎪
⎪
⎪
⎪
⎪
⎪
lfc (m) − lfc (m + 1) the value at the (i, j) position of the input x of the l-th layer passes
⎪
⎩
0 through the convolutional layers, feature maps z are obtained, and
mathematically it is shown in Equation (7).
for lfc (m − 1) ≤ lf (k) < lfc (m)
(7)
T
zli,j = ωl xi,j
l
+ bl
for lfc (m) ≤ lf (k) < lfc (m + 1)
where, ω is the weight and b is the bias. The activation function plays a
for lf (k) ≥ lfc (m + 1) (3) role in enabling non-linear mapping, and generally goes through the
convolutional layer and then the activation function. When the value of
Here, m = 1, 2, ⋯, F and F is the number of Mel filters. Then, the Mel the (i, j) position of the feature map z of the l-th layer passes through the
spectrum, the result of passing through the Mel filter bank, is given as in activation function α, the output a is obtained, which is mathematically
Equation (4). equivalent to Equation (8).
{ } ( )
∑
N− 1 ⃒ ⃒
LFBp (m, k) = ln ⃒
Mel(m, k)* Xp (k) ⃒ (4) ali,j = α zli,j (8)
k=0
Sigmoid, tanh, and the rectified linear unit (ReLU) are widely used
where m = 1, 2, ⋯, F and p = 1, 2, ⋯, P. That is, Mel spectrum is the activation functions, and among them, ReLU is the most used. Pooling
product of the Mel filter bank, which is an F × N matrix, and the such as max pooling, average pooling, and L2-norm pooling has
magnitude spectrum |X|, which is an N × P matrix. Next, MFCC can be translation-invariance properties and reduces the size of feature maps.
obtained by applying discrete cosine transform (DCT) to the Mel spec Fully connected layers are generally used in the last part of the CNN
Fig. 1. Overall process to get Mel-frequency cepstral coefficient (MFCC). In the process of converting Mel-spectrum to MFCC, inverse fast Fourier transform (IFFT) is
used, which corresponds to cepstral analysis. Discrete cosine transform (DCT) can be replaced with IFFT.
3
architecture, and have a structure in which each node in the previous For this study, the participants were required to produce the
layer is connected to all nodes in the next layer. Mathematically, it can following four tasks. 1) sustained vowel phonation ‘e’ for at least 3 s, 2) a
be expressed by Equation (7), but in CNN, the dimension of the input is voluntary “cough” with maximal effort as if to remove secretion, 3) pitch
RH×W×Cin , whereas in the fully connected layer, it is Rnin ×1 , and the elevation with phonation of the “eee” with effort moving from a low to a
dimension of the weights and bias changes according to the input and high pitch and 4) counting from 1 to 5. The dataset’s structure is shown
output dimensions. Where nin is the dimension of the input, H is the in Fig. 2.
height of the image, W is the width of the image, and Cin is the input The recording was conducted at a sampling rate of 22,050 Hz using
channel of the image. the same smart device in a closed room where external noise was
blocked. The details regarding the recording have been detailed in the
previous publication [27]. In research related to the acoustical energy of
2.4. Transfer learning
human speech and voice communication, it has been common practice
to focus on the frequency range that is lower than approximately 5000
Transfer learning [24] is a method designed to use a model trained in
Hz [28]. Therefore, the sampling rate of 22,050 Hz is sufficient to meet
the source domain in the target domain, and it uses a pre-trained model
the Nyquist theorem. Also, the silent section at both ends of the signal
and improves performance through fine-tuning, and is a widely used
was cut out. Table 2 shows the amount of data for each protocol of train
method in tasks related to CNNs. Training deep CNN models requires
and test datasets. We conducted experiments using five randomly
large amount of data, often using millions of datasets such as ImageNet
created independent databases that followed the data configuration of
to build pre-trained models. Transfer learning works well when the
Table 2. In the case of train datasets, there can be multiple data of one
source domain and the target domain are similar. For example, the
person in each protocol, but in the case of test datasets, the number of
ImageNet datasets include ‘tabby cat’ and ‘maltese dog’ classes. If we
data was adjusted because there should not be multiple data of one
pre-train the model with ImageNet datasets and perform the task of
person for each protocol.
classifying cats and dogs, the classification performance will be high.
However, S. Kim et al. [25] experimentally proved that transfer learning
4. Methodology
works well even when the domain gap is large by using ImageNet
datasets in the source domain and DAGM datasets as the target domain.
The overall workflow of this paper is shown in Fig. 3. The task is to
classify between normal control and dysphagia.
2.5. Ensemble model
4.1. Convert to STFT and MFCC spectrogram
Ensemble uses several models harmoniously, that is, combines
several weak classifiers to create a strong classifier to improve perfor First, to examine the difference between people with and without
mance [26]. Among ensemble methods, the voting method is often used dysphagia, raw signal data was analyzed using Praat (Boersma & Wee
in classification tasks, which is a method to make a final decision by nink). In Figure S1, paying attention to F1 and F2, it can be seen that the
aggregating the results predicted by the model. The voting method can line of a normal person is a straight line, and the line of a person with
be divided into two types: the hard voting method in which the final dysphagia is unstable. Formants are resonance points created in the
decision is made by aggregating each final prediction result made by the vocal track. F1 changes with the volume of the pharyngeal cavity, and
model in a majority voting method, and the soft voting method in which F2 changes with the length of the oral cavity according to the anterior-
the final judgment is made by averaging the results predicted by the posterior position of the tongue. The positions of the pharyngeal cavity
model (the values before taking softmax and argmax). The method used and oral cavity are shown in Fig. 4.
in this paper is the soft voting method, and the detailed method will be When a person makes a sound, not only the frequency corresponding
explained in Section 4. to the sound but also frequency components that are multiples of that
frequency appear. Glottal flow is a form of harmonics. When we speak,
3. Datasets the glottal airflow is amplified by meeting the resonance point in the
vocal track. This is called source-filter theory and is shown in Fig. 5.
The data sets were obtained in a prospective manner using a smart Here, the resonance points of the vocal track are called F1, F2, F3, F4,
device and the details have been published elsewhere [27]. For this and F5 in order from the lowest frequency. It is not easy to find formants
study protocol, voice samples from 290 patients with dysphagia were in the resulting spectrum resulting from the interaction of glottal flow
recruited in this prospective study. All patients were diagnosed with and the resonance point of the vocal track, because it is difficult to find
dysphagia by a fully certified swallowing specialist with more than 10 the peak point if the harmonics deviate even a little from the resonance
years of experience in both the FEES and VFSS. The appropriate level of point. Therefore, in order to obtain formants, cepstral analysis must be
aspiration and level of feeding were recorded at the assessment. Char
acteristics and details are provided in Table 1.
Individuals with no known health problems or any previous history
of dysphagia or voice disorder were recruited as controls (N = 299, age
60.8 ± 14.5). The Institutional Review Board (HC19EESE0060) of the
Catholic University of Korea, Bucheon St. Mary’s hospital approved the
protocols of this study. All participants gave informed consent.
Table 1
Dysphagia characteristics of the stroke participants.
Clinical Parameters
Age (years) 68.8 ± 12.6

Penetration Aspiration Scale 5.2 ± 2.4
Functional Oral Intake Scale 3.6 ± 2.0
Mann Assessment of Swallowing Assessment 172.0 ± 20.0
Mini–Mental State Examination 19.8 ± 8.1
Fig. 2. Structure of datasets. Four protocol data from both male and female are
National Institutes of Health Stroke Scale 7.6 ± 6.8
collected for each of two classes.
4
Table 2 proceeds through the fully connected layer and softmax. We selected
The available voice data for each protocol for (A) Train and (B) Test datasets. DenseNet121 as a pre-trained model because it can enhance feature
(A) propagation and prevent the vanishing gradient problem that occurs in
models with deep layers [29].
Normal Dysphagia
In addition, we increased the accuracy of prediction by building an
Protocol 1 407 705 ensemble model using majority voting. The majority voting method we
Protocol 2 368 653
Protocol 3 379 549
used is soft voting. This is a method to determine whether a patient is a
Protocol 4 281 370 normal person or a patient with dysphagia by training a multibranch
model independently for four protocols and synthesizing the decisions
(B)
made by each multibranch model. To be specific, for the normal and
Normal Dysphagia
dysphagia classes, the softmax value for each protocol will be calculated.
Protocol 1 110 186 After adding all these values, the value divided by four is adopted as the
Protocol 2 110 186 final softmax output value. Since this method makes decisions by col
Protocol 3 110 186
Protocol 4 110 186
lecting consecutive values, it works more reasonably than the voting
method based on hard voting. For example, when two models output the
values [0.33, 0.67], [0.55, 0.45] as probability distributions for two
performed from the spectrum. After converting the STFT results to Mel- classes, according to the hard voting, the first output value is judged as
spectrum, inverse fast Fourier transform (IFFT) is performed to obtain class 2 and the second output value as class 1. Therefore, the score is 1:1
MFCC, and the process of performing IFFT here corresponds to cepstral in the hard voting, and a decision between the two classes cannot be
analysis. See Fig. 1 for cepstral analysis. Therefore, if the spectrogram made. However, according to soft voting, class 2 is clearly chosen
obtained through MFCC is used, it can be thought that formants infor because the score is 0.88:1.12. An explanation of the soft voting method
mation is inherent, and the difference between F1 and F2 mentioned is shown in Fig. 7.
above can be used as a feature.
Another thing that can be used as a feature is a spectrogram from
STFT. It is assumed that people with dysphagia will generally have large
changes in amplitude and frequency in their voices. And this can be
thought of as represented in the STFT spectrogram showing frequency
and amplitude over time. Therefore, we decided to use the STFT spec
trogram as the feature.
4.2. Model for each protocol data and ensemble method
We built a multibranch model to use both MFCC and STFT spectro

grams, and the structure is shown in Fig. 6. To make the input size of the
model the same, the STFT spectrogram was resized to (224, 224) and the
MFCC spectrogram was resized to (40, 20). In the MFCC part, blocks
consisting of Convolution-Batch Normalization-ReLU are stacked in four
layers. In the STFT part, the STFT spectrogram passes through the
DenseNet121. After the STFT and MFCC spectrograms passed through
Fig. 4. Positions of the pharyngeal cavity and oral cavity.
the model are flattened, they are concatenated. Then, the classification
Fig. 3. Overall workflow. The CNN-based models in this paper follow the same procedure.
5
Fig. 5. Source-filter theory. Output spectrum is the result of the interaction between glottal airflow and vocal track filter.
Fig. 6. Multibranch model. STFT and MFCC spectrogram are simultaneously input, and they are concatenated after going through each CNN layers.
5. Result all five metrics to indicate performance.
We evaluated the performance of the model for each protocol and the 5.1. Evaluation metrics
ensemble model. Accuracy, area under the curve (AUC), sensitivity,
specificity, negative predictive value (NPV), and positive predictive First, true positive (TP), true negative (TN), false positive (FP), and
value (PPV) were used as performance measures. Since the number of false negative (FN) can be defined as shown in Fig. 8. Then, accuracy,
data corresponding to dysphagia is twice as large as the number of data sensitivity, specificity, NPV, and PPV are defined as follows:
corresponding to normal, a class imbalance exists. In this situation,
TN + TP
rather than using only accuracy as an evaluation metric, it is fair to use Accuracy = (9)
TN + FN + FP + TP
6
Fig. 7. Soft voting. The output values are averaged in an unweighted way.
The values in bold indicate the best performance among the four pro
tocols. The multibranch model (Table 3A) outperformed the model that
takes only MFCC spectrograms as input (Table 3C) in all performance
metrics except for sensitivity in the “Cough” protocol. Similarly, the
model that takes only STFT spectrograms as input (Table 3B) out
performed the model that takes only MFCC spectrograms as input in all
performance metrics except for sensitivity in the “Cough” protocol. It is
suggested that the STFT spectrograms may play a crucial role in the
diagnosis of dysphagia. In terms of each evaluation metric, the perfor
mance of the model that takes only STFT spectrograms is comparable to
that of the multibranch model. When a protocol ensemble was con
Fig. 8. Confusion matrix to explain the evaluation metric. ducted, however, the multibranch model showed superior performance
to the model that only takes STFT spectrograms as input in all perfor
mance metrics, specifically exhibiting a higher accuracy of 4.6% and a
Sensitivity =
TP
(10) higher AUC of 1.1%. In addition, the multibranch model showed reliable
TP + FN diagnostic performances with more than 50 % lower standard deviation
than that of the model that takes only STFT spectrograms in all perfor
TN
Specificity = (11) mance metrics. Also, the protocol ensemble of the multibranch model
FP + TN
demonstrated superior performances in all performance metrics as well
TP as a lower standard deviation than that of the model that only takes
PPV = (12) MFCC spectrograms as input. Thus, we have built a more reliable and
TP + FP
accurate diagnostic model by utilizing a multibranch model that takes
TN both STFT and MFCC spectrograms as inputs.
NPV = (13)
FN + TN Fig. 9 shows the ROC curve and confusion matrix of the trial that
AUC is a widely used evaluation metric in the biomedical field and is showed the highest performance in the Ensemble model. There were
especially used to compare the performance of diagnostic tests [30]. more cases where Class 0 (Normal) was judged as Class 1 (Dysphagia)
AUC is defined as the area under the receiver operating characteristic than when Class 1 was judged as Class 0. This is thought to be because
(ROC) curve drawn when the horizontal axis value is (1 - sensitivity) and more training for Class 1 occurred during the training process because
the vertical axis value is sensitivity. the data corresponding to Class 1 is twice as large as the data corre
sponding to Class 0. Despite the class imbalance problem, the average
accuracy is about 88 %, which 35 out of 296 are wrong. One additional
5.2. Model performances
point to emphasize was that our study aimed to develop a novel deep
learning method that can be applied in clinical settings to screen
The test was implemented five times independently, and the average
dysphagia cases with high sensitivity levels. While it is true that in real-
and standard deviation of the model performances are shown in Table 3.
7
Table 3
Performances of (A) a model that takes both STFT spectrograms and MFCC spectrograms (multibranch model), (B) a model that takes only STFT spectrograms as input,
and (C) a model that takes only MFCC spectrograms as input. Performances of models for each protocol data and protocol ensemble results are shown.
(A)
Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV
Phonation 78.20 ± 4.96 0.852 ± 0.055 0.843 ± 0.063 0.687 ± 0.023 0.818 ± 0.021 0.729 ± 0.085
Cough 79.00 ± 3.58 0.878 ± 0.022 0.817 ± 0.090 0.757 ± 0.062 0.853 ± 0.020 0.727 ± 0.084
High-pitch phonation 80.80 ± 1.94 0.880 ± 0.020 0.882 ± 0.044 0.696 ± 0.082 0.832 ± 0.032 0.784 ± 0.046
Count 81.20 ± 1.17 0.887 ± 0.014 0.878 ± 0.044 0.714 ± 0.060 0.851 ± 0.007 0.771 ± 0.043
Ensemble 88.00 ± 1.10 0.950 ± 0.004 0.947 ± 0.034 0.779 ± 0.048 0.879 ± 0.022 0.903 ± 0.052
(B)
Phonation 78.60 ± 2.42 0.861 ± 0.015 0.847 ± 0.086 0.693 ± 0.137 0.829 ± 0.047 0.756 ± 0.098
Cough 77.40 ± 1.62 0.871 ± 0.016 0.836 ± 0.107 0.679 ± 0.167 0.826 ± 0.068 0.744 ± 0.094
Count 79.80 ± 1.60 0.880 ± 0.027 0.866 ± 0.075 0.699 ± 0.133 0.836 ± 0.057 0.776 ± 0.078
Ensemble 83.40 ± 3.07 0.939 ± 0.018 0.933 ± 0.066 0.675 ± 0.188 0.840 ± 0.076 0.891 ± 0.093
(C)
Phonation 73.60 ± 1.74 0.781 ± 0.028 0.829 ± 0.037 0.589 ± 0.053 0.782 ± 0.035 0.660 ± 0.058
Cough 71.80 ± 1.94 0.799 ± 0.011 0.879 ± 0.070 0.461 ± 0.150 0.747 ± 0.054 0.709 ± 0.095
Count 70.60 ± 4.80 0.777 ± 0.048 0.797 ± 0.076 0.567 ± 0.067 0.766 ± 0.040 0.619 ± 0.107
Ensemble 80.80 ± 2.64 0.895 ± 0.009 0.944 ± 0.035 0.577 ± 0.083 0.799 ± 0.044 0.859 ± 0.078
Fig. 9. (left) ROC curve and (right) confusion matrix for Ensemble model.
world scenarios the normal control group is typically larger than the trees was utilized. We extracted these features utilizing Python packages
dysphagia group, by including a larger dysphagia sample in our study, openSMILE [33] and Librosa [34]. The data used for the baseline models
we were able to increase the model’s exposure to positive cases, which and the proposed model were identical. As shown in Table 4, for almost
helped it better learn to identify dysphagia cases with high sensitivity all performance metrics in each protocol, the proposed method shows
levels [31]. Similar approaches have been employed in other algorithms higher values than the existing machine learning methods. Only Ref 3
to classify vocal fold pathologies, indicating the validity and value of our among the existing machine learning methods outperformed the pro
approach [14,32]. posed method for “Phonation” and “Cough” in terms of sensitivity, and
Furthermore, we compared the results of the proposed model with for “Phonation” in terms of specificity. However, Ref 3 approach lagged
the three baseline models [13,14,16], which are referred to as Ref 1, Ref behind the proposed method by 5 % to 8 % in terms of accuracy and AUC
2, and Ref 3, respectively. In Ref 1 [13], the input features consisted of for each protocol. The computational time of the proposed model is
13 MFCC parameters, jitter, shimmer, HNR, and autocorrelation, and a greater than that of machine learning, as shown in Table S1, but the
multi-layer perceptron (MLP) model with a hidden layer of 75 nodes was diagnostic results are more accurate than machine learning. Further
used. In Ref 2 [14], the input features consisted of 13 MFCC parameters more, our method diagnoses patients more rapidly than the clinical
and 13 delta-MFCC parameters, and an MLP model with three hidden dysphagia screening tests, which may make them ideal to be used in the
layers, each consisting of 300 nodes, was used. In Ref 3 [16], input supplement to existing diagnostic methods (VFSS, FEES).
features included fundamental frequency, jitter, shimmer, energy, 0-, 1-,
2-, 3-order moments, kurtosis, power factor, 1-, 2-, 3-formants ampli
tudes, 1-, 2-, 3-formants frequency, maximum and minimum values of
the signal, and 10 MFCCs parameters. A random forest model with 300
8
Table 4
Predictive performances of existing machine learning method for each protocol. The bold numbers represent the performance of the model that showed the highest
performance in each protocol.
Reference Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV
Ref1 Phonation 69.79 ± 1.60 0.729 ± 0.016 0.781 ± 0.034 0.618 ± 0.044 0.605 ± 0.038 0.750 ± 0.022
Cough 65.61 ± 1.25 0.692 ± 0.026 0.751 ± 0.035 0.559 ± 0.066 0.603 ± 0.024 0.716 ± 0.018
Count 70.60 ± 2.44 0.758 ± 0.016 0.740 ± 0.067 0.597 ± 0.044 0.600 ± 0.041 0.783 ± 0.027
Ref2 Phonation 67.90 ± 2.31 0.729 ± 0.019 0.757 ± 0.036 0.626 ± 0.084 0.600 ± 0.049 0.738 ± 0.017
Cough 66.75 ± 1.86 0.700 ± 0.035 0.781 ± 0.040 0.548 ± 0.043 0.574 ± 0.034 0.716 ± 0.014
Count 70.67 ± 2.87 0.747 ± 0.027 0.760 ± 0.052 0.606 ± 0.028 0.618 ± 0.061 0.773 ± 0.033
Ref3 Phonation 73.36 ± 2.60 0.787 ± 0.020 0.854 ± 0.024 0.706 ± 0.157 0.635 ± 0.135 0.755 ± 0.024
Cough 73.70 ± 0.92 0.794 ± 0.017 0.883 ± 0.031 0.532 ± 0.059 0.684 ± 0.039 0.746 ± 0.013
Count 76.40 ± 2.63 0.839 ± 0.011 0.833 ± 0.020 0.602 ± 0.024 0.680 ± 0.050 0.800 ± 0.028
Proposed Phonation 78.20 ± 4.96 0.852 ± 0.055 0.843 ± 0.063 0.687 ± 0.023 0.818 ± 0.021 0.729 ± 0.085
Method Cough 79.00 ± 3.58 0.878 ± 0.022 0.817 ± 0.090 0.757 ± 0.062 0.853 ± 0.020 0.727 ± 0.084
Count 81.20 ± 1.17 0.887 ± 0.014 0.878 ± 0.044 0.714 ± 0.060 0.851 ± 0.007 0.771 ± 0.043
6. Discussion (GUSS) test [42], voluntary coughing, throat clearing, and voice changes
after swallowing saliva are scored. Similarly, the first part of the Toronto
6.1. Automated classification of dysphagia using four tasks swallowing test also tests for voice quality [41]. The performance level
of CNN for each of these two tasks showed high sensitivity levels com
This study aimed to demonstrate whether the use of deep learning parable to those reported in these previous clinical studies. Therefore,
with only the patient’s voice recorded performing four different tasks one could conclude that the tasks included in the protocol are clinically
via a digital smart device shows good diagnostic properties in classifying relevant and could be easily incorporated into existing swallowing
dysphagia. The model did not require extracted voice features but screening tests.
instead received the spectrogram itself as an input. The methods were Although all tasks showed excellent AUC levels, “1–5” counting
non-invasive, low-cost, and simple, with an automated workflow that showed the highest AUC, followed by effortful pitch glide elevation.
can be used immediately in the clinical field. In addition, since there is Effortful pitch glide showed the highest sensitivity level, followed by
no need to extract task-specific features, the methodologies applied in “1–5” counting. Effortful pitch glide, a combination of a pitch glide and
this study are not limited to the task of diagnosing dysphagia per se but pharyngeal squeeze maneuver, is known to correlate well with the
have the potential to be applicable to use in other diseases. biomechanics of swallowing [43]. These two maneuvers reflect short
The ensemble model incorporating the four tasks showed an AUC of ening and constriction of the pharynx and have been proven to be a valid
0.950, with sensitivity and specificity levels as high as 94.7% and and surrogate measure of pharyngeal motor integrity [44,45]. Dimin
77.9%, respectively. The novel model proposed in this paper showed at ished pharyngeal motor integrity as assessed by these maneuvers is at
least 18.4 % higher performance in sensitivity and 1.7 % in specificity higher risk of aspirating [46]. The high accuracy levels shown in our
than the existing Bedsides Screening Tests [5–7]. model advocate the use of this task as a valuable component in the
Through the results of this paper, it was shown that this novel deep swallowing screening assessment. By contrast, counting “1–5” reflects
learning CNN using voice data spectrogram without extracting de articulation and prosody which reflect tongue motion. Speech produc
scriptors, may be applied as an indicator of dysphagia. Unlike existing tion and swallowing share anatomical structures and networks. Thus,
models that only receive a single spectrogram as input, we built a CNN dysphagia can be characterized by speech alternations [47]. A bedside
model that trains by receiving both STFT and MFCC spectrograms in evaluation of speech that included reading sentences predicted
order to maximize the information needed for dysphagia diagnosis. In dysphagia with an AUC of 0.74 [48]. A recent ML study [49] has
addition, when a deep learning model is trained using only a single incorporated acoustic features from rapid alternations of syllables and
protocol as in the previous study, the model’s performance deviation is spontaneous monology and has shown to correlate with dysphagia with
large. However, by using “Protocol ensemble,” a method that trains each an AUC of 0.91. The AUC level of 0.95 from this study is higher than past
model for several protocols and ensembles results of each model, we studies [37,49]. In summary, the four tasks used in our study, which
were able to not only increase the overall performance of the model, but were simple to follow and required little time, reflect the integrity of the
also reduce the performance deviation. Furthermore, we discuss the variable anatomic structures involved in swallowing, and is of no
explainability of the proposed model in Supplementary Material. coincidence that the ensemble thus showed the highest level of
For this study, the participants were asked to perform four tasks: accuracy.
simple vowel phonation, voluntary coughing, high-pitch gliding, and
counting “1-to-5”. Assembling and analyzing these four tasks into an
ensemble is the critical feature that distinguishes our technique from 6.2. Automated classification of dysphagia and “Telestroke” care
past studies [35–38] that usually incorporated phonation tasks using
single vowels. Swallowing incorporates the use of multiple oropharyn Our methods required a commercially available iPad with no special
geal muscles, and the four tasks were designed to reflect these muscles’ sensors. The procedure was safe and did not require bolus testing to
functions. As a result, the ensemble model showed the highest AUC level assess voice changes [35–38]. Voice testing after direct bolus swallow
than the single task model. ing is one of the most studied methods, but the methodologies and
Stroke patients with a risk of aspiration show changes in voice clinical validity of extracting voice features after direct bolus testing
quality during phonation and a reduced cough force due to poor vocal have been questioned [37]. Also, this method can be dangerous to
fold closure and respiratory muscle recruitment [39]. These non- someone with a high risk of respiratory complications. By contrast, our
swallowing tasks are similar to those that are incorporated in many protocols can be safely performed without direct contact and potentially
screening protocols [1,4,40–42]. In the first Gugging Swallowing Screen be performed as part of telemedicine in the future.
The rise of telerehabilitation and online stroke monitoring has
9
allowed the widespread use of mobile digital devices in stroke man Acknowledgments
agement. With telemedicine, stroke experts can now determine crucial
decisions such as candidacy for thrombolysis or thrombectomy and This research was supported in part by Ministry of Trade, Industry
recommend treatment online. Since dysphagia screening is integral to and Energy (MOTIE) (Development of Meta Soft Organ Module
stroke management [1–3], our techniques may help to automatically Manufacturing Technology without Immunity Rejection and Module
and noninvasively screen dysphagia with online tele stroke care. Assembly Robot System, 20012378), in part by the Priority Research
Centers Program through the National Research Foundation of Korea
6.3. Limitation (NRF) funded by the Ministry of Education (2020R1A6A1A03047902),
in part by the NRF (no. 2020R1F1A1065814), in part by the Korean
The following are some pitfalls that need to be discussed. First, for Fund for Regenerative Medicine (KFRM) grant funded by the Korea
the ensemble, voice data from only those with enough vigilance to government (23C0121L1), and in part by the Po-Ca Networking Groups
follow the four tasks were included in this study. Though the four tasks funded by the Postech-Catholic Biomedical Engineering Institute (No. 5-
were simple to perform within a short time frame, the techniques may 2021-B0001-00303).
not apply to those with altered consciousness or global aphasia. As seen
in our results, though most patients were able to perform the phonation,
coughing and high glide pitch tasks, some failed in the counting task. Institutional Review Board Statement
Second, all protocols were performed in one language, and the perfor
mance level may change with the counting task if counting “1–5” if Institutional Review Board (HC19EESE0060) of the Catholic Uni
performed in different languages. Third, gender distribution showed versity of Korea, Bucheon St. Mary’s hospital approved the use of
differences between the control and patient group, though this issue was pertinent clinical information.
solved through gender matching in the test data set; which showed
excellent performance across both genders. Finally, the participants Appendix A. Supplementary data
were mostly related to stroke, and the generalizability of our protocol to
those with swallowing problems related to glottic cancer or Parkinson’s Supplementary data to this article can be found online at https://doi.
disease needs to be verified in future studies. org/10.1016/j.bspc.2023.105259.
7. Conclusion References
Smartphones or mobile apps, with their ease of accessibility, have [1] E.E. Smith, D.M. Kent, K.R. Bulsara, L.Y. Leung, J.H. Lichtman, M.J. Reeves,
A. Towfighi, W.N. Whiteley, D.B. Zahuranec, A.H.A.S. Council, Effect of Dysphagia
advantages over traditional approaches, allow real-time interactions Screening Strategies on Clinical Outcomes After Stroke, Stroke 49 (2018)
and feedback, and can be scaled up to large populations cost-effectively e123–e128.
[50]. They also can be used as devices that can record patient-generated [2] L.K. Casaubon, J.-M. Boulanger, E. Glasser, D. Blacquiere, S. Boucher, K. Brown,
T. Goddard, J. Gordon, M. Horton, J. Lalonde, Canadian stroke best practice
health data. Nevertheless, future large-scale studies are warranted to recommendations: acute inpatient stroke care guidelines, update 2015, Int. J.
investigate whether the novel CNN method used in this study, when Stroke 11 (2016) 239–252.
integrated into current standards of stroke care, will help ensure that [3] K.L. Furie, S.E. Kasner, R.J. Adams, G.W. Albers, R.L. Bush, S.C. Fagan, J.
L. Halperin, S.C. Johnston, I. Katzan, W.N. Kernan, American Heart Association
stroke patients are screened for dysphagia automatically and precisely. Stroke Council, Council on Cardiovascular Nursing, Council on Clinical Cardiology,
Our technique could lead to an earlier referral to instrumental assess and Interdisciplinary Council on Quality of Care and Outcomes Research.
ments for proper diagnosis and, if proven clinically valid by future Guidelines for the prevention of stroke in patients with stroke or transient ischemic
attack: a guideline for healthcare professionals from the American Heart
studies, help reduce medical complications such as aspiration
Association/American Stroke Association, Stroke 42 (2011) 227–276.
pneumonia. [4] M.S. Chong, P.K. Lieu, Y.Y. Sitoh, Y.Y. Meng, L.P. Leow, Bedside clinical methods
useful as screening test for aspiration in elderly patients with recent and previous
strokes, Ann Acad Med Singap 32 (2003) 790–794.
CRediT authorship contribution statement
[5] K.L. DePippo, M.A. Holas, M.J. Reding, Validation of the 3-oz water swallow test
for aspiration following stroke, Arch Neurol 49 (1992) 1259–1261.
Heekyu Kim: Methodology, Writing – original draft, Visualization. [6] S. Teramoto, Y. Fukuchi, Detection of aspiration and swallowing disorder in older
Hae-Yeon Park: Conceptualization, Investigation, Writing – review & stroke patients: simple swallowing provocation test versus water swallowing test,
Arch Phys Med Rehabil 81 (2000) 1517–1519.
editing. DoGyeom Park: Software. Sun Im: Validation, Data curation, [7] M. Peladeau-Pigeon, C.M. Steele, Technical aspects of a videofluoroscopic
Writing – review & editing, Project administration, Funding acquisition. swallowing study, Can J Speech-Language Pathol Audiol 37 (2013) 216–226.
Seungchul Lee: Validation, Resources, Supervision, Project adminis [8] S.E. Langmore, K. Schatz, N. Olsen, Fiberoptic endoscopic examination of
swallowing safety: a new procedure, Dysphagia 2 (1988) 216–219.
tration, Funding acquisition. [9] U.S. S.M, G. R, J. Katiravan, R. M, R.K. R, Mobile application based speech and
voice analysis for COVID-19 detection using computational audit techniques,
Declaration of Competing Interest International Journal of Pervasive Computing and Communications, ahead-of-print
(2020).
[10] P. Mouawad, T. Dubnov, S. Dubnov, Robust Detection of COVID-19 in Cough
The authors declare the following financial interests/personal re Sounds, SN Computer Science 2 (2021) 34.
lationships which may be considered as potential competing interests: [11] E. Maor, J.D. Sara, D.M. Orbelo, L.O. Lerman, Y. Levanon, A. Lerman, Voice Signal
Characteristics Are Independently Associated With Coronary Artery Disease, Mayo
The Industry-Academic Cooperation Foundation of the Catholic uni
Clin. Proc. 93 (2018) 840–847.
versity of Korea, College of Medicine Industry-Academic Cooperation [12] J.P. Teixeira, C. Oliveira, C. Lopes, Vocal acoustic analysis–jitter, shimmer and hnr
Foundation and Pohang University of Science and Technology (POST parameters, Procedia Technol. 9 (2013) 1112–1122.
[13] F.L. Teixeira, J.P. Teixeira, Deep-learning in Identification of Vocal Pathologies,
ECH) hold patent application rights (not public, no 10-2216160, US:
BIOSIGNALS (2020) 288–295.
Serial No. 17/908,629, Eu: 21765455.7) presented in this study. [14] S.-H. Fang, Y. Tsao, M.-J. Hsiao, J.-Y. Chen, Y.-H. Lai, F.-C. Lin, C.-T. Wang,
Detection of pathological voice using cepstrum vectors: A deep learning approach,
Data availability J. Voice 33 (2019) 634–641.
[15] A. Al-Nasheri, G. Muhammad, M. Alsulaiman, Z. Ali, Investigation of voice
pathology detection and classification on different frequency regions using
Data will be made available on request. correlation functions, J. Voice 31 (2017) 3–15.
[16] D. Hemmerling, A. Skalski, J. Gajda, Voice data mining for laryngeal pathology
assessment, Comput. Biol. Med. 69 (2016) 270–276.
[17] S. Jothilakshmi, Automatic system to detect the type of voice pathology, Appl. Soft
Comput. 21 (2014) 244–249.
10
[18] A. Botalb, M. Moinuddin, U. Al-Saggaf, S.S. Ali, Contrasting convolutional neural [35] Y.A. Kang, J. Kim, S.J. Jee, C.W. Jo, B.S. Koo, Detection of voice changes due to
network (CNN) with multi-layer perceptron (MLP) for big data analysis, 2018 aspiration via acoustic voice analysis, Auris Nasus Larynx 45 (2018) 801–806.
International conference on intelligent and advanced system (ICIAS), IEEE, 2018, [36] J.S. Ryu, S.R. Park, K.H. Choi, Prediction of laryngeal aspiration using voice
pp. 1-5. analysis, Am J Phys Med Rehabil 83 (2004) 753–757.
[19] H.J. Nussbaumer, The fast Fourier transform, Fast Fourier Transform and [37] K.W.d. Santos, E.d.C. Rodrigues, R.S. Rech, E.M.d.R. Wendland, M. Neves, F.N.
Convolution Algorithms, Springer1981, pp. 80-111. Hugo, J.B. Hilgert, Using Voice Change as an Indicator of Dysphagia: A Systematic
[20] L. Durak, O. Arikan, Short-time Fourier transform: two fundamental properties and Review, Dysphagia, (2021) 1-13.
an optimal implementation, IEEE Trans. Signal Process. 51 (2003) 1231–1242. [38] T. Warms, J. Richards, “Wet Voice” as a predictor of penetration and aspiration in
[21] S. Sigurdsson, K.B. Petersen, T. Lehn-Schiøler, Mel Frequency Cepstral Coefficients: oropharyngeal dysphagia, Dysphagia 15 (2000) 84–88.
An Evaluation of Robustness of MP3 Encoded Music, ISMIR (2006) 286–289. [39] Y.M. Choi, G.Y. Park, Y. Yoo, D. Sohn, Y. Jang, S. Im, Reduced Diaphragm
[22] S.K. Kopparapu, M. Laxminarayana, Choice of Mel filter bank in computing MFCC Excursion During Reflexive Citric Acid Cough Test in Subjects With Subacute
of a resampled speech, in: 10th International Conference on Information Science, Stroke, Respir Care 62 (2017) 1571–1581.
Signal Processing and Their Applications (ISSPA 2010), 2010, pp. 121–124. [40] C. Henke, C. Foerch, S. Lapa, Early screening parameters for dysphagia in acute
[23] S. Albawi, T.A. Mohammed, S. Al-Zawi, Understanding of a convolutional neural ischemic stroke, Cerebrovasc Dis 44 (2017) 285–290.
network, 2017 international conference on engineering and technology (ICET), [41] R. Martino, F. Silver, R. Teasell, M. Bayley, G. Nicholson, D.L. Streiner, N.
Ieee, 2017, pp. 1-6. E. Diamant, The Toronto Bedside Swallowing Screening Test (TOR-BSST):
[24] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 development and validation of a dysphagia screening tool for patients with stroke,
(2009) 1345–1359. Stroke 40 (2009) 555–561.
[25] S. Kim, Y.-K. Noh, F.C. Park, Efficient neural network compression via transfer [42] T. Warnecke, S. Im, C. Kaiser, C. Hamacher, S. Oelenberg, R. Dziewas, Aspiration
learning for machine vision inspection, Neurocomputing 413 (2020) 294–304. and dysphagia screening in acute stroke - the Gugging Swallowing Screen revisited,
[26] O. Sagi, L. Rokach, Ensemble learning: A survey, Wiley Interdisciplinary Reviews, Eur. J. Neurol. (2017) 1–8.
Data Min. Knowl. Disc. 8 (2018) e1249. [43] K.V. Miloro, W.G. Pearson, S.E. Langmore, Effortful Pitch Glide: A Potential New
[27] H.-Y. Park, D. Park, H.S. Kang, H. Kim, S. Lee, S. Im, Post-stroke respiratory Exercise Evaluated by Dynamic MRI, J. Speech Lang. Hear. Res. 57 (2014)
complications using machine learning with voice features from mobile devices, 1243–1250.
(2022). [44] L.G. Close, J.E. Avi, Laryngeal Adductor Reflex and Pharyngeal Squeeze as
[28] B.B. Monson, E.J. Hunter, A.J. Lotto, B.H. Story, The perceptual significance of Predictors of Laryngeal Penetration and Aspiration, Laryngoscope (2015) 1–4.
high-frequency energy in the human voice, Front. Psychol. 5 (2014) 587. [45] S.C. Fuller, R. Leonard, S. Aminpour, P.C. Belafsky, Validation of the pharyngeal
[29] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected squeeze maneuver, Otolaryngology -, Head Neck Surg 140 (2009) 391–394.
convolutional networks, in: Proceedings of the IEEE conference on computer vision [46] P.W. Perlman, M.A. Cohen, M. Setzen, P.C. Belafsky, J. Guss, K.F. Mattucci,
and pattern recognition, 2017, pp. 4700–4708. M. Ditkoff, The risk of aspiration of pureed food as determined by flexible
[30] M.S. Pepe, Receiver Operating Characteristic Methodology, J. Am. Stat. Assoc. 95 endoscopic evaluation of swallowing with sensory testing, Otolaryngol Head Neck
(2000) 308–311. Surg 130 (2004) 80–83.
[31] N.J. Donovan, S.K. Daniels, J. Edmiaston, J. Weinhardt, D. Summers, P.H. Mitchell, [47] K. Tjaden, Speech and Swallowing in Parkinson’s Disease, Top Geriatr Rehabil 24
Dysphagia screening: state of the art: invitational conference proceeding from the (2008) 115–126.
State-of-the-Art Nursing Symposium, International Stroke Conference 2012, Stroke [48] E. Festic, J.S. Soto, L.A. Pitre, M. Leveton, D.M. Ramsey, W.D. Freeman, M.
44 (2013) e24–e31. G. Heckman, A.S. Lee, Novel bedside phonetic evaluation to identify dysphagia and
[32] H.-C. Hu, S.-Y. Chang, C.-H. Wang, K.-J. Li, H.-Y. Cho, Y.-T. Chen, C.-J. Lu, T.- aspiration risk, Chest 149 (2016) 649–659.
P. Tsai, O.-K.-S. Lee, Deep learning application for vocal fold disease prediction [49] S. Roldan-Vasco, A. Orozco-Duque, J.C. Suarez-Escudero, J.R. Orozco-Arroyave,
through voice recognition: preliminary development study, J. Med. Internet Res. Machine learning based analysis of speech dimensions in functional oropharyngeal
23 (2021) e25247. dysphagia, Comput Methods Programs Biomed 208 (2021), 106248.
[33] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open- [50] J. Webb, S. Peerbux, P. Smittenaar, S. Siddiqui, Y. Sherwani, M. Ahmed, H.
source audio feature extractor, Proceedings of the 18th ACM international MacRae, H. Puri, S. Bhalla, A. Majeed, A Randomized Controlled Trial of a Digital
conference on Multimedia, 2010, pp. 1459-1462. Therapeutic Intervention for Smoking Cessation, medRxiv, (2020)
[34] B. McFee, C. Raffel, D. Liang, D.P. Ellis, M. McVicar, E. Battenberg, O. Nieto, 2020.2006.2025.20139741.
librosa: Audio and music signal analysis in python, Proceedings of the 14th python
in science conference, 2015, pp. 18-25.
11

1 s2.0 S1746809423006924 Main

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S1746809423006924 Main

Uploaded by

Copyright:

Available Formats

Biomedical Signal Processing and Control 86 (2023) 105259

Contents lists available at ScienceDirect

Biomedical Signal Processing and Control

Non-invasive way to diagnose dysphagia by training deep learning model

1. Introduction undiagnosed can lead to aspiration pneumonia, which is known to lead

each segment, according to Equation (2). trum. It is expressed by Equation (5).

Age (years) 68.8 ± 12.6

4.2. Model for each protocol data and ensemble method

We built a multibranch model to use both MFCC and STFT spectro

5. Result all five metrics to indicate performance.

Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV

Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV

Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV

You might also like

1 s2.0 S1746809423006924 Main

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S1746809423006924 Main

Uploaded by

Copyright:

Available Formats

Biomedical Signal Processing and Control 86 (2023) 105259

Contents lists available at ScienceDirect

Biomedical Signal Processing and Control

Non-invasive way to diagnose dysphagia by training deep learning model

1. Introduction undiagnosed can lead to aspiration pneumonia, which is known to lead

each segment, according to Equation (2). trum. It is expressed by Equation (5).

Age (years) 68.8 ± 12.6

4.2. Model for each protocol data and ensemble method

We built a multibranch model to use both MFCC and STFT spectro­

5. Result all five metrics to indicate performance.

Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV

Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV

Protocols Accuracy (%) AUC Sensitivity Specificity NPV PPV

You might also like

We built a multibranch model to use both MFCC and STFT spectro