38 Classification of Lung Sounds Using Scalogram Representation of Sound Segments and Convolutional Neural Network

Journal of Medical Engineering & Technology
ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/ijmt20
Classification of lung sounds using scalogram representation of sound segments and convolutional neural network
Huong Pham Thi Viet, Huyen Nguyen Thi Ngoc, Vu Tran Anh & Huy Hoang Quang
To cite this article: Huong Pham Thi Viet, Huyen Nguyen Thi Ngoc, Vu Tran Anh & Huy
Hoang Quang (2022): Classification of lung sounds using scalogram representation of sound
segments and convolutional neural network, Journal of Medical Engineering & Technology, DOI:
10.1080/03091902.2022.2040624
To link to this article: https://doi.org/10.1080/03091902.2022.2040624
Published online: 25 Feb 2022.
RESEARCH ARTICLE
Huong Pham Thi Viet (a), Huyen Nguyen Thi Ngoc (b), Vu Tran Anh (b) and Huy Hoang Quang (b)
(a) International School, Vietnam National University, Hanoi, Vietnam; (b) School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Vietnam
ABSTRACT

Lung auscultation is one of the most common methods for screening lung diseases. The increasingly high rate of respiratory diseases leads to the need for robust methods to detect abnormalities in patients’ breathing sounds. Lung sound analysis stands out as a promising approach to automatic screening of lung diseases, serving as a second opinion for doctors or as a stand-alone device for preliminary screening in remote areas. In previous research on lung sound classification using the ICBHI database on Kaggle, lung audio recordings are converted to spectral images and fed into deep neural networks for training. A few studies use the scalogram; however, they focus on classification among different lung diseases, and scalograms are rarely used to categorise sound types. In this paper, we combine scalograms and neural networks to classify lung sound types. Padding methods and augmentation are also evaluated for their impact on the classification score. Ensemble learning is incorporated to increase classification accuracy through voting among many models. The trained model shows a clear improvement in classification on the benchmark ICBHI database.

ARTICLE HISTORY
Received 23 August 2021; Revised 27 January 2022; Accepted 7 February 2022

KEYWORDS
Lung sounds; classification; wheeze; crackle; scalogram; convolutional neural network
CONTACT Vu Tran Anh, Vu.trananh@hust.edu.vn, School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Vietnam
© 2022 Informa UK Limited, trading as Taylor & Francis Group

Introduction

As reported by the WHO on World Lung Day [1], there are more than 65 million cases of COPD every year, killing around 3 million people. Asthma, the most common chronic respiratory condition, affects 344 million people, whereas around 10 million people contract tuberculosis and 1.4 million die from it each year. More notably, pneumonia is a leading cause of death in children under the age of five. At the early stage of diagnosing respiratory diseases, doctors normally use lung auscultation to examine the state of the lungs. While good physicians can use prominent sound traits from the lungs to rule out respiratory disorders without the need for invasive techniques, various factors affect this method. These include noise that can interfere with sound transmission, and the inherently limited hearing capability of humans, which sometimes leaves adventitious sounds undetected. With advances in digital filtering and recording in electronic stethoscopes, computational lung sound analysis (CLSA), which aims to automatically detect the adventitious sounds in a patient’s breathing, has been made possible. Research on lung sound classification has received more and more contributions on sound feature extraction and on building efficient learning pipelines that enable accurate, real-time lung disease screening devices.

During breathing, the tracheobronchial tree produces characteristic sounds that can be heard at the chest wall. According to [2], lung sounds are composed of normal breathing sounds and adventitious sounds such as crackle, wheeze, squawk, and stridor. Adventitious sounds can be distinguished from normal breathing sounds to aid in diagnosing respiratory disorders. Normal breathing sounds, or normal vesicular sounds, have peak energy at around 100 Hz, with energy dropping off between 100 Hz and 200 Hz, though in sensitive cases they can sometimes be detected at frequencies up to 800 Hz. Wheeze and stridor are among the sound types classified as continuous, as they have sinusoidal waveforms lasting about 80–100 ms. Wheeze peak energy is around 400 Hz, while that of stridor centres at or below 200 Hz. Wheeze can be traced back to patients with airway obstruction, i.e., asthma and chronic
obstructive pulmonary disease (COPD), though it is not always present, as it can be missing in cases of extreme obstruction. Wheeze can also be present in various other conditions. Crackles are a group of sounds with non-musical characteristics that can be further divided into coarse and fine, based on pitch, loudness, and timing. Crackles are discontinuous, meaning their duration is short, at about 25 ms. The energy of crackles is distributed from 60 to 2000 Hz, with most of the energy concentrated in 60–1200 Hz. Crackles are associated with a number of disorders, namely pneumonia, bronchiectasis, COPD, diffuse parenchymal lung disease, and in some cases congestive heart failure. Crackles heard in COPD patients are usually linked with severe airway obstruction [2].

Many attempts have been made at lung sound classification. Early attempts focussed on extracting time-domain features. Statistical and music information retrieval features are utilised in [3] to detect crackles using logistic regression. In Mendes et al. [4], a wheeze signature in spectrogram space was proposed to detect frames with signs of wheeze. In Nandini Sengupta et al. [5], MFCC (Mel frequency cepstral coefficients), LFCC (linear frequency cepstral coefficients) and IMFCC (inverse Mel frequency cepstral coefficients) are modelled with a GMM (Gaussian mixture model) to build a classifier based on statistical characteristics, attaining an accuracy of 97.2%. Statistical features are also used in [6], where the authors extracted HOS (higher-order statistical) features and used k-NN (k-nearest neighbour) and Bayes classifiers to distinguish fine crackles, coarse crackles, and monophonic and polyphonic wheezes. The work in [6] was able to classify five sound types with an overall accuracy of roughly 90%. In [7], a classification pipeline was constructed using AR models with SVM (support vector machine) and k-NN classifiers, the best feature set reaching accuracies of 97.7% and 98.8% for inspiratory and expiratory data segments, respectively. Classification using time-frequency and time-scale features with different window types and classifiers was devised in [8], attaining a highest overall accuracy of 81.1% with an SVM classifier. In [9], wheeze detection based on the discrete wavelet transform and an ANN achieved an accuracy of 89.29%, whereas features from the short-time Fourier and wavelet transforms were combined with a kernel SVM to reach 60% accuracy [10]. In Aykanat et al. [11], a CNN-based method was proposed for classification of rale, rhonchus and normal sounds as well as multilabel classification, with accuracy comparable to the benchmark method of MFCC and SVM, at 86% for healthy or pathological categorisation. In Tran et al. [12], the discrete wavelet transform was used to classify the different types of lung sounds; energies of the reconstructed sub-band signals were used as features for machine learning models, achieving an average accuracy of 89%.

Most recent research incorporates deep convolutional networks to extract lung sound features for classification tasks. In Dalal Bardou et al. [13], spectrograms and an AlexNet-based network topology were tested, with an accuracy of 95.1%. VGG-16 was used in [14] for feature extraction, with an SVM to classify lung sounds, giving a correct classification rate of 65%. Light-weight frameworks have attracted attention lately as more efficient techniques are proposed, such as the Mel-log spectrogram and light-weight CNN, attaining ICBHI scores of 78.3% and 81% for 4-class classification, respectively [15]. Pham et al. [16] also proposed a novel model, C-DNN + MoE, for classification of spectrogram features with a reported boost in performance. Some recent works have turned to the continuous wavelet transform and the use of scalograms as features instead of the short-time Fourier transform, with promising results. In Jayalakshmy et al. [17], scalogram-based classification was introduced using scalograms of IMF (intrinsic mode function) signals derived from EMD (empirical mode decomposition) and an AlexNet convolutional network, which attained an accuracy of 83.78%. The authors in [18] experimented with both traditional scalograms and IMF-based ones on the task of pathological classification, attaining an accuracy of 99.05% for differentiating six pathological classes.

The use of different datasets in the above research makes it difficult to compare the proposed approaches against each other. Some utilised open datasets such as RALE, which was used in [5,13] but contains too few lung sound samples for the classification to generalise, while others evaluated their methods on self-collected private datasets that cannot be made public. More recent works usually take advantage of the ICBHI database released in 2017 [11,12,14–16,18], which contains enough lung sound samples for deep learning approaches and provides a benchmark dataset for comparison among different works.

In this paper, we apply a scalogram-based approach to classification of the four sound types, namely crackle, wheeze, normal vesicular, and both (crackle and wheeze), using deep convolutional neural networks on the benchmark ICBHI dataset. Though MFCC and spectrograms are widely used on this dataset, the use of scalograms has been less prominent, especially for the sound types
classification task. Inspired by [15], which studied the effect of different padding methods and frame split lengths, we also incorporate them into this study with scalograms. We introduce a new padding method, reverse sample padding, to handle the discontinuity between padded samples. Data augmentation has been studied in some previous research on lung sound classification. However, it was commonly performed on audio, with vocal tract length perturbation and speed stretching. In this research, we apply augmentation to the scalograms to show its effectiveness in the classification task, which, to the best of our knowledge, has not been tried before. Ensemble models are also considered to increase classification performance. ICBHI scores are then computed for this pipeline to compare the performance of the approach with some other methods.

The paper is organised as follows: the first part introduces related concepts, the second part details the steps of scalogram-based lung sound classification, the third part gives experimental results and discussion, and the final part concludes the paper.

Methodology

The block diagram of the procedure is shown in Figure 1, which contains the main steps: audio preprocessing, conversion of audio to scalograms, splitting and padding of scalogram frames, and training on the extracted features.

Figure 1. Block diagram of scalogram based sound types classification.

Database: ICBHI challenge and respiratory sound database

The database was first released at ICBHI 2017. The data include 920 audio recordings from 126 patients. The recordings last from 10 to 90 s and mostly comprise more than one respiratory cycle each. Since its release, the dataset has served as a benchmark for much research on lung sound analysis. There is a total of 6898 respiratory cycles. Each .wav file comes with an annotation file giving the timing of each cycle in the recording and a label indicating whether the cycle contains wheeze and/or crackle. In addition, a separate document lists the respiratory disease diagnosed for each patient. The dataset therefore provides a basis for the different classification tasks proposed by the ICBHI challenge [19]. There are two main tasks, which are further divided into sub-tasks. The first task involves classifying lung sounds into two groups (normal or abnormal) or into four groups (wheeze, crackle, normal, and both). The second task involves classifying the patients’ conditions as
healthy, non-chronic or chronic, and classifying the patients as healthy or diseased, with either chronic or non-chronic symptoms. In this work, we focus on sound type classification of four kinds of lung sounds: wheeze, crackle, normal, and both. The data is retrieved from the Kaggle webpage of the Respiratory Sound Database [20].

Preprocessing of audios

The Kaggle database contains audio recordings from various settings. These recordings contain noise from both internal sources (heart sounds) and external sources (people talking in the background). To alleviate noise effects, wavelet shrinkage denoising is applied to all of the audio recordings. We applied a soft threshold and chose sym4 as the mother wavelet. The sampling rates of the recordings vary because many different devices were used to record the lung sounds. For a more standardised set of recordings, researchers tend to resample the sound signal to a fixed sampling frequency. In this paper, we also set the sampling rate to 4000 Hz, which is achieved by downsampling the recordings from Meditron (44,100 Hz) to 4000 Hz. Recordings that differ in amplitude level are also amplitude normalised to avoid any uncertainty caused by amplitude deviation. The normalised audio recordings have sample amplitudes in the range [–1, 1].

Scalogram conversion

Scalogram-based lung sound classification was proposed in [17], where the authors used the continuous wavelet transform to create features for the four types of lung sounds. Unlike the Short Time Fourier Transform (STFT), which has a fixed resolution, the Continuous Wavelet Transform (CWT) uses a set of basis functions that are scaled versions of a mother wavelet. Dilation and translation of these versions allow the CWT to detect characteristics of the signal in both the time and frequency domains:

$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} s(t)\, \psi\!\left(\frac{t - b}{a}\right) dt \tag{1}$$

where $s(t)$ is the signal, $\psi$ is the mother wavelet, $a$ is the scale and $b$ is the translation factor in the time domain [21]. A possible drawback of the CWT is its higher time and computational complexity compared with the STFT.

The scalograms used in this paper are obtained from the magnitude of the CWT coefficients. We used the MATLAB function cwt with the “bump” wavelet. The transform originally produced more than 120 scales. However, since higher scales correspond to lower frequencies, which are not of interest for lung sound classification, we limited the number of scales to 85, covering frequencies from 10 to 2000 Hz. The scalogram representations of different sound types are illustrated in Figure 2.

Frame splitting and padding

Since the length of each respiratory cycle is different, the horizontal lengths of the scalogram representations also differ. To handle this inhomogeneity in respiratory scalogram frames, the images are resized in [14]. However, resizing images of varied length leads to stretched or compressed sound traits on the scalograms. The authors proposed discarding audio recordings longer than a predefined threshold, but that might lose data with valuable information. Nguyen and Pernkopf [15] handled the varied length by splitting and padding the audio so that all segments have a fixed length. Motivated by their work, in this paper we also split and pad the scalogram frames with essentially the same pipeline as in Nguyen and Pernkopf [15], but the procedure is conducted on scalograms instead of audio signals. The detailed splitting and padding algorithm is illustrated in Figure 3.
Figure 2. Scalogram representation of different types of lung sounds from ICBHI database
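As a rough illustration of Equation (1), the sketch below computes a small scalogram in plain Python. It is not the authors' MATLAB pipeline: the "bump" wavelet is swapped for a complex Morlet wavelet, and the function names, the centre frequency `omega0 = 6`, the scale-to-frequency mapping `a = omega0 / (2*pi*f)`, the truncation of the wavelet and the 200 Hz test tone are all assumptions made for illustration.

```python
import cmath
import math


def morlet(u, omega0=6.0):
    """Complex Morlet wavelet (assumed stand-in for MATLAB's 'bump' wavelet)."""
    return cmath.exp(1j * omega0 * u) * math.exp(-0.5 * u * u)


def scalogram(signal, fs, freqs, omega0=6.0, support=4.0):
    """Return |W(a, b)|: one row per analysed frequency, one column per sample."""
    dt = 1.0 / fs
    rows = []
    for f in freqs:
        a = omega0 / (2.0 * math.pi * f)   # scale matched to frequency f (assumption)
        half = int(support * a * fs)       # wavelet truncated at |u| <= support
        # Sampled kernel: conj(psi((t - b) / a)) * dt / sqrt(a), as in Equation (1)
        kernel = [morlet(k * dt / a, omega0).conjugate() * dt / math.sqrt(a)
                  for k in range(-half, half + 1)]
        row = []
        for b in range(len(signal)):
            acc = 0j
            for j, w in enumerate(kernel):
                idx = b + j - half
                if 0 <= idx < len(signal):  # samples outside the signal count as zero
                    acc += signal[idx] * w
            row.append(abs(acc))
        rows.append(row)
    return rows


# Hypothetical test tone: 0.1 s of a 200 Hz sine sampled at 4000 Hz.
fs = 4000
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(400)]
freqs = [100, 200, 400, 800]
scal = scalogram(tone, fs, freqs)
energy = [sum(row) / len(row) for row in scal]
strongest = freqs[energy.index(max(energy))]
```

With these assumptions, the row whose scale matches the tone carries the most energy. A real pipeline would use an optimised CWT routine, since this direct summation is slow for full-length recordings.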
Figure 3. (a) Splitting and padding of scalogram frames and (b) Different padding methods.
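The splitting and padding step can be sketched as follows. This is only one plausible reading of the procedure, not the authors' code: a scalogram is represented as a list of time columns, a cycle is cut into consecutive pieces of the target length, and only the last, shorter piece is padded. Reverse sample padding is interpreted here as mirroring every other copy. The function names and the toy data are hypothetical.

```python
def pad_columns(cols, target_len, mode="sample"):
    """Pad (or truncate) a list of time columns to exactly target_len columns."""
    if mode not in ("zero", "sample", "reverse"):
        raise ValueError(f"unknown padding mode: {mode}")
    if not cols:
        raise ValueError("cannot pad an empty frame")
    out = [list(c) for c in cols[:target_len]]
    flipped = False
    while len(out) < target_len:
        missing = target_len - len(out)
        if mode == "zero":
            # Fill the remainder with all-zero columns.
            out.extend([0.0] * len(cols[0]) for _ in range(missing))
            continue
        if mode == "reverse":
            flipped = not flipped  # mirror every other copy (assumed interpretation)
        source = cols[::-1] if flipped else cols
        out.extend(list(c) for c in source[:missing])
    return out


def split_and_pad(cols, target_len, mode="sample"):
    """Cut a cycle into fixed-length frames, padding the last one if it is short."""
    return [pad_columns(cols[start:start + target_len], target_len, mode)
            for start in range(0, len(cols), target_len)]


# Toy cycle with one scale and three time columns, padded to eight columns.
cycle = [[1], [2], [3]]
zero_padded = pad_columns(cycle, 8, "zero")
sample_padded = pad_columns(cycle, 8, "sample")
reverse_padded = pad_columns(cycle, 8, "reverse")
```

In this toy case sample padding repeats 1, 2, 3 and jumps from 3 back to 1 at each seam, whereas the mirrored variant continues 3, 2, 1, so neighbouring columns at the seam are equal.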
Since a fixed length is used for all recorded samples, it is logical to experiment with different split lengths of scalogram frames. In this paper, the frame lengths are chosen from a range of values to study their impact on the performance of the classification models. The frame lengths examined were 3.5 s, 4 s, 4.5 s and 5 s.

Padding methods were also studied in [15], where the authors examined the effects of different padding on the classification task. We also test padding methods in this work. The padding methods of interest were zero padding and sample padding. We propose a new padding method, reverse sample padding, which aims to ease the cut-off effects in sample padding. Instead of stacking copies of the features one after another, the features are flipped in time before being stacked onto the previous sample, which minimises sudden changes in pixel values and creates a smoother transition between padded features. In that sense, models can focus more on learning the characteristic discriminating patterns of the signals.

Z-score normalisation

Nguyen and Pernkopf [15] mentioned feature rolling as a method to account for the cyclic nature of respiratory signals, together with a z-score normalisation of the spectrogram. Z-score is among the common scaling methods in machine learning, together with min-max scaling. Z-score standardisation works well for normalising data with a normal distribution, and in the scalogram-based approach that we conduct, we also incorporate this step in the processing of the scalogram frames:

$$v = \frac{\text{value} - \mu}{\sigma} \tag{2}$$

where $v$ is the normalised value, value is the scalogram pixel intensity, and $\mu$ and $\sigma$ are the mean and standard deviation of the pixel values, respectively.

Data augmentation

In previous research, data augmentation was used in some works [13,15,18] for sound classification. Augmentation plays an important role in handling data imbalance and helps boost the performance of the classifier [13]. Therefore, in our approach, we perform augmentation on the scalogram frames, unlike previous works that commonly performed it on audio, the most popular techniques being time stretching and vocal tract length perturbation [13,15]. We incorporated a number of augmentation methods introduced in the Audio Augment toolbox in MATLAB [22]. The toolbox was originally introduced for spectrograms, but the same techniques can be used with scalograms. After the scalograms are extracted, they are augmented using methods such as vocal tract length normalisation (VTLN), frequency masking (the scales of the wavelet transform act as frequency), random shift of the scalogram (random time-scale shift), and the sum of two arbitrary scalograms, to enhance data generalisation. The details of
each augmentation practice can be found in [22]. Figure 4 illustrates the application of augmentation methods to a wheeze sound sample. The scalograms in Figure 4 represent (from left to right): the original sound, the sound augmented with random shift, the sound augmented with masking, and the sound augmented with VTLN. The random shift technique randomly shifts some scales and times in the scalogram, while masking sets certain scale and time ranges to zero (we masked two time ranges and one scale range in this paper, but the choice can vary to generate more data). VTLN is commonly used in speech recognition and verification to eliminate the effect of different vocal tract lengths among speakers. For augmentation of lung sounds using VTLN, a warp factor is randomly generated at training to warp the scale axis, which slightly perturbs the sounds. Another method used for data augmentation was the sum of two arbitrary scalograms of the same class to produce a new one.

Figure 4. Different augmentation methods performed on a scalogram containing wheeze.

The data distribution of each sound class before and after augmentation can be seen in Figure 5.

[Figure 5: bar chart of sample counts (y-axis 0 to 5000) for the classes Normal, Crackle, Wheeze and Both, with three series: original train samples, augmented train samples and test samples. Bar labels as extracted: 4652, 3644, 3468, 3316, 2864, 1721, 717, 732, 510, 431, 183, 128.]
Figure 5. Available samples of sound classes before and after the augmentation step.

Convolutional neural network and training

In Dalal Bardou et al. [13], the spectrograms were trained on a neural network whose topology was similar to AlexNet's, with the same initial weights and backbone as the pretrained model. The authors of [14] incorporated a pretrained VGG-16 neural network in their training. An AlexNet backbone was also used in [17] to train on scalogram features. In the recent works [15,18], light-weight neural networks were built and reported to be more efficient for training on respiratory features. Given the performance of VGG-16 on the ImageNet database [23], we decided to build our neural network on that model’s backbone, which has a suitable configuration for our application. The final network configuration is shown in Table 1.

In this paper, our proposed CNN is composed of 11 convolutional layers. There are 5 convolution blocks in total, each composed of 2 consecutive convolutional layers followed by a max pooling layer and a LeakyReLU activation function. The number of filters differs across blocks, at 64, 128, 256, and 512 filters, as we go deeper into the network. Conv2D is used as a feature extractor, and the max pooling layer downsizes the images by half for computational efficiency. The activation function is LeakyReLU. The output of the final convolutional layer is flattened and followed by fully connected layers of 512 nodes. Finally, a softmax activation function outputs the probability of each of the 4 classes of sound types. For the pooling layer, a stride of 2 is used to reduce the network complexity and
help eliminate the effect of over-fitting. Between the dense (fully connected) layers, a dropout rate of 0.2 was applied to avoid overfitting. The experiment shows that with a few layers discarded, the custom model still works well on the classification task. The default 4096 nodes of the dense layers in VGG-16 were reduced to 512-node layers, which significantly reduces the number of parameters in training while retaining the same level of accuracy. The resulting model is lighter than the original VGG-16, at only 17 M trainable parameters, compared to 60 M parameters of the original model.

Table 1. Neural network configuration based on VGG-16 structure.

No. | Name | Description | Type | Output | Learnables
1 | Data | 85×350×1 images with batch size 64 | Image input | 85×350×64 |
2 | Conv1 | 3×3×64 convolutions, stride [1 1], padding same | Convolution | 85×350×64 | Weights 3×3×64, bias 1×64
3 | Conv2 | 3×3×64 convolutions, stride [1 1], padding same | Convolution | 85×350×64 | Weights 3×3×64, bias 1×64
4 | MaxPool1 | 3×3 max pooling, stride [2 2], and LeakyReLU activation | Max Pooling | 43×175×64 |
5 | Conv3 | 3×3×128 convolutions, stride [1 1], padding same | Convolution | 43×175×128 | Weights 3×3×128, bias 1×128
6 | Conv4 | 3×3×128 convolutions, stride [1 1], padding same | Convolution | 43×175×128 | Weights 3×3×128, bias 1×128
7 | MaxPool2 | 3×3 max pooling, stride [2 2], and LeakyReLU activation | Max Pooling | 21×87×128 |
12 | Conv8 | 3×3×512 convolutions, stride [1 1], padding same | Convolution | 11×44×512 | Weights 3×3×512, bias 1×512
17 | AveragePool1 | 3×3 max pooling, stride [2 2] | Average Pooling | 2×11×512 |
18 | Flatten | | Flatten | 11,264 |
19 | Dense | 512 nodes, dropout rate = 0.2, ReLU activation | Dense | 512 | Weights 512×512, bias 512×1
20 | Dense | 512 nodes, dropout rate = 0.2, ReLU activation | Dense | 512 | Weights 512×512, bias 512×1
21 | Softmax | | Softmax | 4 |

Ensemble learning

In this paper, we incorporated ensemble learning to enhance classification performance. The method was proposed in the paper “Train 1, get M for free”, where the authors proposed training multiple models in the same amount of time needed to train one model. Snapshots of the various models are stored, each capturing a characteristic difference among the sound patterns. The learning rate schedule used is called the cyclic cosine learning rate schedule, or cyclic cosine annealing [24]. For prediction with the ensemble, the class probabilities are averaged across all snapshot models to determine the final class prediction.

Training set-up

Train-test split:
The train-test split was set at 80/20. 5% of the training set is held out for validation. This train-test split is fixed across the experiments to set a baseline for comparing different cycle lengths and padding methods.

Loss function:
We used the categorical cross-entropy loss, which is one of the most common loss functions for multi-class classification tasks:

$$L = -\sum_{i=0}^{n} y_i \log(\hat{y}_i) \tag{3}$$

where $y_i$ is the target and $\hat{y}_i$ is the predicted probability of class $i$, and $n$ is the total number of classes. The minus sign ensures that the loss gets smaller as the distributions get closer.

The losses on the training and validation sets for the single model and the ensemble models are shown in Figure 6. The hyperparameters are set as follows: the batch size was 64, and each training run lasted 150 epochs. The learning rate was set at 1e-4 and the optimiser was Adam, with parameters β1 = 0.9 and β2 = 0.999. All feature extraction was performed in MATLAB R2019a, and the neural network was trained with the Keras framework on a Google Colab Pro GPU (12 GB RAM).
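The snapshot-ensemble idea described under Ensemble learning can be sketched in a few lines. The sketch uses the cyclic cosine annealing form commonly associated with snapshot ensembles and the paper's stated values of 150 epochs and a learning rate of 1e-4. The number of cycles (5 here), the per-epoch rather than per-iteration schedule, the function names and the example probabilities are assumptions, since the paper does not state them.

```python
import math


def cyclic_cosine_lr(epoch, total_epochs=150, cycles=5, lr_max=1e-4):
    """Learning rate that restarts at lr_max and decays along a cosine in each cycle."""
    cycle_len = math.ceil(total_epochs / cycles)
    position = epoch % cycle_len
    return 0.5 * lr_max * (math.cos(math.pi * position / cycle_len) + 1.0)


def snapshot_epochs(total_epochs=150, cycles=5):
    """Zero-based epochs at the end of each cycle, where a model snapshot is saved."""
    cycle_len = math.ceil(total_epochs / cycles)
    return [min((k + 1) * cycle_len, total_epochs) - 1 for k in range(cycles)]


def ensemble_predict(prob_vectors):
    """Average class probabilities over snapshots and return (mean, argmax)."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    mean = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    return mean, mean.index(max(mean))


# Hypothetical softmax outputs of three snapshots for one frame
# (class order assumed: normal, crackle, wheeze, both).
snapshots = [
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
]
mean_probs, predicted_class = ensemble_predict(snapshots)
```

In Keras, a schedule of this kind could be attached through a learning-rate scheduler callback, with the model weights saved at the epochs returned by `snapshot_epochs`. Whether the authors did it this way is not stated.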
[Figure 6: two line plots of training loss and validation loss (y-axis: Loss, 0.3 to 1.1) against Epochs (x-axis: 0 to 140).]
Figure 6. Training and validation losses with respect to single model (left) and ensemble models (right).
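As a small worked example of the categorical cross-entropy loss in Equation (3), the sketch below evaluates the loss for a one-hot target. The function name, the clipping constant `eps` and the example probabilities are assumptions for illustration; in practice the loss would come from the training framework.

```python
import math


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i); eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(target, predicted))


# One-hot target for the second class (class order is hypothetical).
target = [0, 1, 0, 0]
uncertain = categorical_cross_entropy(target, [0.1, 0.7, 0.1, 0.1])
confident = categorical_cross_entropy(target, [0.0, 0.99, 0.005, 0.005])
```

For a one-hot target the loss reduces to the negative log of the probability assigned to the true class, so the more confident correct prediction gives the smaller loss.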
Result and discussion

In this paper, we used the ICBHI scores to measure the performance of the model, namely sensitivity, specificity, and their average (the ICBHI score). The ICBHI challenge includes two tasks. The first is to differentiate between normal and abnormal sound types (2-class task) and the second is to classify among a total of four types of sounds, namely wheeze, crackle, normal and both (4-class task). Sensitivity is defined as $\text{Sensitivity} = P_{\text{crackle or wheeze}} / N_{\text{crackle or wheeze}}$ for the 2-class task and $\text{Sensitivity} = (P_{\text{crackle}} + P_{\text{wheeze}} + P_{\text{both}}) / (N_{\text{crackle}} + N_{\text{wheeze}} + N_{\text{both}})$ for the 4-class task, where $P$ denotes the number of correctly predicted samples and $N$ the total number of samples of the class. $\text{Specificity} = P_{\text{normal}} / N_{\text{normal}}$ is the same for both classification tasks. The ICBHI score, which is their average, is calculated as $\text{ICBHI score} = (\text{Sensitivity} + \text{Specificity}) / 2$.

We conduct experiments on classification performance for the two tasks in the ICBHI challenge. In this work, we also study the effect of split lengths, padding methods, data augmentation and ensemble learning on the classification tasks; the results of each experiment are shown in the tables below.

Different split length and padding method test

The ICBHI scores for different scalogram frame lengths and padding methods are summarised in Table 2. From these experiments, we can see that the best score is attained with a split length of 3.5 s and the sample padding method. Zero padding performed slightly worse than the sample and reverse sample padding methods, possibly because sample padding reinforces respiratory traits by repeating them. The best choice of length and padding method is used as the configuration for the augmentation and ensemble learning tests (Table 2).

Table 2. The ICBHI score with respect to different lengths of the scalogram frames and padding methods.

Task | Padding method | 3.5 s | 4 s | 4.5 s | 5 s
Task 1 (2 classes) | Sample padding | 0.809 | 0.798 | 0.785 | 0.785
Task 1 (2 classes) | Reverse sample padding | 0.801 | 0.792 | 0.783 | 0.782
Task 1 (2 classes) | Zero padding | 0.789 | 0.782 | 0.781 | 0.780
Task 2 (4 classes) | Sample padding | 0.771 | 0.770 | 0.756 | 0.755
Task 2 (4 classes) | Reverse sample padding | 0.770 | 0.770 | 0.755 | 0.754
Task 2 (4 classes) | Zero padding | 0.769 | 0.767 | 0.753 | 0.751

Augmentation test

Table 3 shows classification results with and without augmentation. Training with augmented data significantly outperforms training without augmentation, which underlines the importance of data balancing. With the original class distribution of the dataset, the model focuses on learning traits from the class with the greatest number of samples, namely the normal class, and fails to capture patterns from the other classes. Data augmentation helps balance the class samples and boosts network performance in classifying sound types.

Ensemble learning

The results of snapshot ensemble learning are shown in Table 4. The different models of snapshot ensemble learning (Figure 7) increased the ICBHI score by about 2.6 to 3.9 percentage points, which confirms the effectiveness of ensemble learning. However, the many models learned also increase computational complexity, which can lengthen the prediction time.
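The sensitivity, specificity and ICBHI score defined at the start of this section can be computed as in the sketch below. The label strings, the function name and the toy predictions are assumptions for illustration. For the 2-class task, any abnormal prediction for an abnormal cycle is counted as correct, which is one reading of the 2-class formula.

```python
def icbhi_scores(true_labels, predicted_labels, task="4-class"):
    """Return (sensitivity, specificity, ICBHI score) for one set of predictions."""
    if task not in ("2-class", "4-class"):
        raise ValueError(f"unknown task: {task}")
    n_normal = hit_normal = n_abnormal = hit_abnormal = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == "normal":
            n_normal += 1
            hit_normal += guess == "normal"
        else:
            n_abnormal += 1
            if task == "2-class":
                hit_abnormal += guess != "normal"  # any abnormal label counts
            else:
                hit_abnormal += guess == truth     # the exact sound type is required
    sensitivity = hit_abnormal / n_abnormal
    specificity = hit_normal / n_normal
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy example with eight cycles (not data from the paper).
truth = ["normal"] * 4 + ["crackle", "crackle", "wheeze", "both"]
guess = ["normal", "normal", "normal", "crackle",
         "crackle", "wheeze", "wheeze", "normal"]
four_class = icbhi_scores(truth, guess, task="4-class")
two_class = icbhi_scores(truth, guess, task="2-class")
```

In this toy case the 4-class sensitivity is lower than the 2-class one, because a crackle predicted as wheeze is still "abnormal" but not the right sound type.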
Table 3. The ICBHI score for un-augmented and augmented dataset.

Task | No augment: Specificity | No augment: Sensitivity | No augment: ICBHI score | With augment: Specificity | With augment: Sensitivity | With augment: ICBHI score
Task 1 (2 classes) | 0.754 | 0.708 | 0.731 | 0.823 | 0.794 | 0.809
Task 2 (4 classes) | 0.588 | 0.794 | 0.691 | 0.772 | 0.770 | 0.771
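The scalogram-level augmentations evaluated in Table 3 (masking, random shift and the sum of two scalograms of the same class) can be sketched as below. This is a plain-Python approximation, not the MATLAB toolbox of [22]: a scalogram is a list of rows (scales) by columns (time), the shift is implemented as a circular roll, and the function names, ranges and toy values are assumptions. VTLN is omitted.

```python
import random


def mask(scal, scale_range, time_ranges):
    """Zero out one range of scales (rows) and several ranges of time columns."""
    out = [row[:] for row in scal]
    for r in range(scale_range[0], min(scale_range[1], len(out))):
        out[r] = [0.0] * len(out[r])
    for t0, t1 in time_ranges:
        for row in out:
            for c in range(t0, min(t1, len(row))):
                row[c] = 0.0
    return out


def roll(scal, d_scale, d_time):
    """Circularly shift the scalogram by d_scale rows and d_time columns."""
    n_s, n_t = len(scal), len(scal[0])
    return [[scal[(r - d_scale) % n_s][(c - d_time) % n_t] for c in range(n_t)]
            for r in range(n_s)]


def random_shift(scal, max_scale, max_time, rng):
    """Apply a random time-scale shift drawn from the given random generator."""
    return roll(scal,
                rng.randint(-max_scale, max_scale),
                rng.randint(-max_time, max_time))


def mix(scal_a, scal_b):
    """Element-wise sum of two same-sized scalograms of the same class."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(scal_a, scal_b)]


# Toy 3-scale by 4-column scalogram.
scal = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
masked = mask(scal, (0, 1), [(1, 2), (3, 4)])  # one scale range, two time ranges
shifted = roll(scal, 1, 1)
jittered = random_shift(scal, 1, 2, random.Random(0))
mixed = mix(scal, scal)
```

The masking call mirrors the paper's choice of two time ranges and one scale range. Applying these functions to minority-class frames is one way the class counts could be balanced.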
Figure 7. Confusion matrix of 4-class (left) and 2-class (right) classification.
Table 4. The ICBHI score for single model and ensemble learning.

Task | No ensemble: Specificity | No ensemble: Sensitivity | No ensemble: ICBHI score | With ensemble: Specificity | With ensemble: Sensitivity | With ensemble: ICBHI score
Task 1 (2 classes) | 0.823 | 0.794 | 0.809 | 0.864 | 0.831 | 0.848
Task 2 (4 classes) | 0.772 | 0.770 | 0.771 | 0.842 | 0.751 | 0.797
Table 5. Comparison with other methods on the ICBHI score.

Method | Task 1: Specificity | Task 1: Sensitivity | Task 1: ICBHI score | Task 2: Specificity | Task 2: Sensitivity | Task 2: ICBHI score | Parameters
LSTM [25] | 0.84 | 0.64 | 0.74 | – | – | 0.81 | –
MNRNN [26] | – | – | – | 0.74 | 0.56 | 0.65 | –
Mel-log Spectrogram and CNN [15] | 0.873 | 0.801 | 0.837 | 0.873 | 0.694 | 0.784 | 39M
Kaggle Baseline [27] | 0.832 | 0.796 | 0.814 | 0.832 | 0.665 | 0.748 | 42M
Our proposed method | 0.851 | 0.845 | 0.848 | 0.842 | 0.751 | 0.797 | 68M
The ICBHI scores attained by this approach are compared with some other state-of-the-art works in Table 5.

Conclusion

To the best of our knowledge, this is the first scalogram-based study of sound type classification (wheeze, crackle and normal vesicular), rather than disease classification, on the ICBHI database. The project also combines studies of cycle lengths and data augmentation applied to scalograms for respiratory classification, which has not been prominent in previous research. Based on the empirical results, the authors recommend splitting respiratory cycles into frames of 3.5 s and using the sample padding method for the classification task. The project also recommends the use of data augmentation and ensemble learning, if the training capacity is sufficient, to boost model performance. The processing time for scalogram feature extraction is 0.3295 ± 0.05 s/frame and training takes about 15 s/epoch. Our results with scalogram features and a convolutional neural network compare favourably with other state-of-the-art methods. However, further studies could investigate even more light-weight CNNs for scalogram-based features to reduce model complexity and, consequently, time and computational costs.

Acknowledgement

This research is funded by International School, Vietnam National University, Hanoi (VNU-IS) under project number CS.2022-01.
Disclosure statement

No potential conflict of interest was reported by the author(s).

ORCID

Huong Pham Thi Viet http://orcid.org/0000-0002-4883-5969
Huyen Nguyen Thi Ngoc http://orcid.org/0000-0002-8065-4893
Vu Tran Anh http://orcid.org/0000-0002-3439-6491
Huy Hoang Quang http://orcid.org/0000-0002-4603-1297

References

[1] [cited 2020 October 25]. Available from: https://www.who.int/gard/publications/The_Global_Impact_of_Respiratory_Disease.pdf.
[2] Sarkar M, Madabhavi I, Niranjan N, et al. Auscultation of the respiratory system. Ann Thorac Med. 2015;10(3):158–168.
[3] Mendes L, Vogiatzis IM, Perantoni E, et al. Detection of crackle events using a multi-feature approach. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2016; 2016. pp. 3679–3683.
[4] Mendes L, et al. Detection of wheezes using their signature in the spectrogram space and musical features. In: 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan; 2015. pp. 5581–5584.
[5] Sengupta N, Sahidullah M, Saha G. Lung sound classification using cepstral-based statistical features. Comput Biol Med. 2016;75:118–129.
[6] Naves R, Barbosa BH, Ferreira DD. Classification of lung sounds using higher-order statistics: a divide-and-conquer approach. Comput Methods Programs Biomed. 2016;129:12–20.
[7] Jin F, Sattar F, Goh DYT. New approaches for spectro-temporal feature extraction with applications to respiratory sound classification. Neurocomputing. 2014;123:362–371.
[8] Serbes G, Okan SC, Kahya YP, et al. Pulmonary crackle detection using time–frequency and time–scale analysis. Digit Signal Process. 2013;23(3):1012–1021.
[9] Hashemi A, Hossein A. Agin Khosrow: classification of wheeze sounds using wavelets and neural networks. In: International Conference on Biomedical Engineering and Technology, IPCBEE; 2011.
[10] Serbes G, Ulukaya S, Kahya YP. An automated lung sound preprocessing and classification system based on spectral analysis methods. In: Maglaveras N, Chouvarda I, de Carvalho P, editors. Precision Medicine Powered by pHealth and Connected Health, ICBHI 2017, IFMBE Proceedings, vol 66. Singapore: Springer; 2018.
[11] Aykanat M, Kılıç Ö, Kurt B, et al. Classification of lung sounds using convolutional neural networks. J Image Video Proc. 2017;2017(1):65.
[12] Tran V, Trinh T, Nguyen H, Tran H, Nguyen K, Hoang H, Pham H. Lung sounds classification using wavelet reconstructed sub-bands signal and machine learning. In: Tran DT, Jeon G, Nguyen TDL, Lu J, Xuan TD, editors. Intelligent Systems and Networks. ICISN 2021. Lecture Notes in Networks and Systems. Singapore: Springer; 2021, vol. 243, pp. 215–224.
[13] Bardou D, Zhang K, Ahmad SM. Sayed mohammad ahmad: lung sounds classification using convolutional neural networks. Artif Intell Med. 2018;88:58–69.
[14] Demir F, Sengur A, Bajaj V. Convolutional neural networks based efficient approach for classification of lung diseases. Health Inf Sci Syst. 2020;8(1):4.
[15] Nguyen T, Pernkopf F. Lung sound classification using snapshot ensemble of convolutional neural networks. In: 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC); 2020, pp. 760–763.
[16] Pham L, McLoughlin I, Phan H, et al. Robust deep learning framework for predicting respiratory anomalies and diseases. In: 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada; 2020, pp. 164–167.
[17] Jayalakshmy S, Sudha G. Scalogram based prediction model for respiratory disorders using optimized convolutional neural networks. Artif Intell Med. 2020;103:101809.
[18] Shuvo SB, Ali SN, Swapnil SI, et al. A lightweight CNN model for detecting respiratory diseases from lung auscultation sounds using EMD-CWT-based hybrid scalogram. IEEE J Biomed Health Inform. 2021;25(7):2595–2603.
[19] Rocha B, Filos D, Mendes L, et al. Maglaveras: a respiratory sound database for the development of automated classification. In: International Conference on Biomedical and Health Informatics. 2017;2017:33–37.
[20] [cited 2020 Oct 25]. Available from: https://www.kaggle.com/vbookshelf/respiratory-sound-database.
[21] [cited 2021 Jan 25]. Available from: https://www.mathworks.com/help/wavelet/ref/cwt.html.
[22] Maguolo G, Paci M, Nanni L, et al. Audiogmenter: a MATLAB Toolbox for Audio Data Augmentation. 2019; [cited 2021 May 25]. Available from: https://arxiv.org/abs/1912.05472.
[23] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2015; [cited 2021 May 25]. Available from: https://arxiv.org/abs/1409.1556.
[24] Huang G, Li Y, Pleiss G, et al. Snapshot ensembles: Train 1, get M for free. 2017; [cited 2021 May 25]. Available from: https://arxiv.org/abs/1704.00109.
[25] Perna D, Tagarelli A. Deep auscultation: predicting respiratory anomalies and diseases via recurrent neural networks. In: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS); 2019. pp. 50–55.
[26] Kochetov K, Putin E, Balashov M, et al. Noise masking recurrent neural network for respiratory sound classification. In: International Conference on Artificial Neural Networks. 2018; 208–217.
[27] [cited 2021 Jan 25]. Available from: https://www.kaggle.com/eatmygoose/cnn-detection-of-wheezes-and-crackles.
