Classification of Lung Sounds Using Scalogram Representation of Sound Segments and Convolutional Neural Network
Huong Pham Thi Viet, Huyen Nguyen Thi Ngoc, Vu Tran Anh & Huy Hoang Quang
To cite this article: Huong Pham Thi Viet, Huyen Nguyen Thi Ngoc, Vu Tran Anh & Huy
Hoang Quang (2022): Classification of lung sounds using scalogram representation of sound
segments and convolutional neural network, Journal of Medical Engineering & Technology, DOI:
10.1080/03091902.2022.2040624
RESEARCH ARTICLE
CONTACT Vu Tran Anh Vu.trananh@hust.edu.vn School of Electronics and Telecommunications, Hanoi University of Science and Technology,
Hanoi, Vietnam
© 2022 Informa UK Limited, trading as Taylor & Francis Group
2 T. V. H. PHAM ET AL.
obstructive pulmonary diseases (COPD), though not always present, as it can be missing in cases of extreme obstruction. Wheeze can also present in various conditions. Crackles are a group of sounds with non-musical characteristics, which can be further divided into coarse and fine based on sound pitch, loudness, and timing. Crackles are discontinuous, which means their duration is short, at about 25 ms. The energy of crackles is distributed over the range 60 to 2000 Hz, with most of the energy concentrated in 60–1200 Hz. Crackles are associated with a number of disorders, namely pneumonia, bronchiectasis, COPD, diffuse parenchymal lung disease, and in some cases congestive heart failure. Crackles heard in COPD patients are usually linked with severe airway obstruction [2].
Many attempts have been made at lung sound classification. Early attempts focussed on extracting time-domain features. Statistics and music information retrieval features are utilised in [3] to detect crackles using logistic regression. In Mendes et al. [4], a wheeze signature in spectrogram space was proposed to detect frames with signs of wheeze. In Nandini Sengupta et al. [5], MFCC (Mel frequency cepstral coefficients), LFCC (linear frequency cepstral coefficients) and IMFCC (inverse Mel frequency cepstral coefficients) features are modelled with a GMM (Gaussian mixture model) to propose a classification based on statistical characteristics, attaining an accuracy of 97.2%. Statistical features are also used in [6], where the authors extracted HOS (higher order statistical) features and used k-NN (k-nearest neighbour) and Bayes classifiers to distinguish fine crackle, coarse crackle, monophonic wheeze and polyphonic wheeze samples. The work in [6] was able to classify five sound types with an overall accuracy of roughly 90%. In [7], a classification pipeline was constructed using AR models with SVM (support vector machine) and k-NN classifiers, with the best feature set reaching accuracies of 97.7% and 98.8% for inspiratory and expiratory data segments, respectively. Classification using time-frequency and time-scale features with different window types and classifiers was devised in [8], which attained its highest overall accuracy of 81.1% with an SVM classifier. In [9], wheeze detection based on the discrete wavelet transform and an ANN achieved an accuracy rate of 89.29%, whereas features from the short-time Fourier and wavelet transforms were combined with a kernel SVM to get a 60% accuracy rate [10]. In Aykanat et al. [11], a CNN-based classification was proposed for classification of rale, rhonchus and normal sounds as well as multilabel classification, with accuracy reported relative to the benchmark method of MFCC and SVM, at 86% for healthy or pathological categorisation. In Tran et al. [12], the discrete wavelet transform was used to classify the different types of lung sounds; energies of reconstructed sub-band signals were used as features for machine learning models, achieving an average accuracy of 89%.
Most recent research incorporates deep convolutional networks to extract lung sound features for classification tasks. In Dalal Bardou et al. [13], spectrograms and an AlexNet-based neural network topology were tested, with an accuracy of 95.1%. VGG-16 was used in [14] for feature extraction, with an SVM to classify lung sounds, giving a correct classification rate of 65%. Light-weight frameworks have attracted attention lately as more efficient techniques are proposed, such as the Mel-log spectrogram and light-weight CNN, attaining ICBHI scores of 78.3% and 81% for 4-class classification, respectively [15]. Pham et al. [16] also proposed a novel model, C-DNN + MoE, for classification of spectrogram features with a reported boost in performance. Some recent works have been paying attention to the continuous wavelet transform and the use of scalograms as features instead of the short-time Fourier transform, which show promising results. In Jayalakshmy et al. [17], scalogram-based classification was introduced using scalograms of IMF (intrinsic mode function) signals derived from EMD (empirical mode decomposition) and an AlexNet convolutional network, which attained an accuracy rate of 83.78%. The authors in [18] experimented with both traditional scalograms and IMF-based ones on the task of pathological classification, attaining an accuracy of 99.05% for differentiating six pathological classes.
The use of different datasets in the above research presents a challenge for cross-validating the proposed approaches against each other. Some utilised open datasets like RALE, which was used repeatedly in [5,13] but contains too few lung sound samples for the classification to generalise, while others realised their methods on self-collected private datasets that cannot be made public. More recent works usually take advantage of the ICBHI database released in 2017 [11,12,14–16,18], which contains enough lung sound samples for deep learning approaches and provides a benchmark dataset for comparison among different works.
In this paper, we apply a scalogram-based approach to classification of the four sound types, namely crackle, wheeze, normal vesicular, and both, using deep convolutional neural networks on the benchmark ICBHI dataset. Though MFCC and spectrogram features are widely used on this dataset, the use of scalograms has been less prominent, especially for sound types
JOURNAL OF MEDICAL ENGINEERING & TECHNOLOGY 3
classification task. Inspired by [15], which studied the effect of different padding methods and frame split lengths, we also incorporate them into this study with scalograms. We introduce a new padding method, reverse sample padding, to handle the discontinuity between padded samples. Data augmentation has been studied in some previous research on lung sound classification. However, it was commonly performed on audio, with vocal tract length perturbation and speed stretching. In this research, we attempt to apply augmentation to the scalograms to show its effectiveness in the classification task, which, to the best of our knowledge, has not been experimented with before. Ensemble models are also considered to increase classification performance. ICBHI scores are then computed for this pipeline to compare the performance of the approach with some other methods.
The paper is organised as follows: the first part introduces related concepts, the second part details the steps of scalogram-based lung sound classification, the third part gives experimental results and discussion, and the final part concludes the paper.
Methodology
The block diagram of the procedure is shown in Figure 1, which contains the main steps: audio preprocessing, conversion of audio to scalograms, splitting and padding of scalogram frames, and training on the extracted features.
Database: ICBHI challenge and respiratory sound database
The database was first released at ICBHI 2017. The data include sound recordings from 126 patients and 920 audio samples. The duration of the audio is from 10 to 90 s, and most recordings comprise more than one respiratory cycle. Since its release, the dataset has served as a benchmark for much research on lung sound analysis. There is a total of 6898 respiratory cycles. Each .wav file comes with an annotation document noting the timing of each cycle in the recording and a label of the cycle as containing wheeze and/or crackle. In addition, there is a document that lists which patients have contracted which type of respiratory disease. Therefore, the dataset presents a basis for the different classification tasks proposed by the ICBHI challenge [19]. There are two main tasks, which are further divided into sub-tasks. The first task involves the classification of lung sounds into two groups in the case of normal or abnormal classification, and four groups in the case of wheeze, crackle, normal and both sounds classification. The second task involves classifying the patients’ conditions of
healthy, non-chronic and chronic, and classifying the patients as healthy or diseased, with either chronic or non-chronic symptoms. In this work, we focus on sound type classification of four kinds of lung sounds: wheeze, crackle, normal, and both. The data is retrieved from the Kaggle webpage on the Respiratory Database [20].
Preprocessing of audios
The audio recordings from the database on the Kaggle site come from various settings. These recordings contain noise from both internal factors (heart sounds) and external factors (people talking in the background). In order to reduce noise effects, wavelet shrinkage denoising is applied to all of the audio recordings. We applied a soft threshold and chose sym4 as the mother wavelet. The varying sampling rates of the lung recordings are the result of the many devices used in lung sound recording experiments. For a more standardised set of lung recording samples, researchers tend to resample the sound signal to a fixed sampling frequency. In this paper, we also set the sampling rate to 4000 Hz, which is achieved by downsampling the recordings from Meditron (44.1 kHz) to 4000 Hz. Audio with deviations in amplitude level is also amplitude-normalised to avoid any uncertainty caused by amplitude deviation. The normalised audio recordings have sample amplitudes in the range [−1, 1].
Scalogram conversion
Scalogram-based lung sound classification was proposed in [17], where the authors utilised the continuous wavelet transform to create features for the four types of lung sounds. Unlike the Short Time Fourier Transform (STFT), which has a fixed resolution, the Continuous Wavelet Transform (CWT) utilises a set of basis functions which are scaled versions of a mother wavelet. Dilation and translation of these versions allow the CWT to detect characteristics of the signal in both the time and frequency domains:
$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} s(t)\, \psi\!\left(\frac{t - b}{a}\right) dt \qquad (1)$$
where a is the scale and b is the translation factor in the time domain for the transform [21]. A possible drawback of the CWT is its time and computational complexity in comparison with the STFT.
The scalograms used in this paper are obtained from the magnitude of the CWT coefficients. We utilised the MATLAB function cwt, with the “bump” wavelet. The transform originally produced more than 120 scales. However, since higher scales correspond to lower frequency ranges, which are not of interest for lung sound classification tasks, we limit the number of scales to 85, covering frequencies from 10 to 2000 Hz. The scalogram representations of different sound types are illustrated in Figure 2.
Frame splitting and padding
Since the length of each respiratory cycle is different, the horizontal lengths of the scalogram representations also differ. To handle this inhomogeneity in respiratory scalogram frames, the images are resized in [14]. However, resizing images of varied length leads to stretched or compressed sound traits on the scalograms. The authors also proposed discarding the audio recordings which are longer than a predefined threshold, but that might lead to missing data with valuable information. Nguyen and Pernkopf [15] handled the varied length by splitting and padding the audio so that all segments have a fixed length. Motivated by their work, in this paper we also split and pad the scalogram frames with essentially the same pipeline as in Nguyen and Pernkopf [15], except that the procedure is conducted on scalograms instead of audio signals. The detailed splitting and padding algorithm is illustrated in Figure 3.
Figure 2. Scalogram representation of different types of lung sounds from ICBHI database
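As a rough illustration of Equation (1), the sketch below evaluates the CWT magnitude by direct numerical integration in plain Python. It is not the MATLAB cwt call used in this paper: a complex Morlet wavelet stands in for the “bump” wavelet, and the names `morlet` and `cwt_magnitude`, the parameter `w0` and the toy 20 Hz tone are all assumptions made for the example.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (assumed stand-in for MATLAB's "bump")."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_magnitude(signal, scales, fs):
    """Return |W(a, b)| as a list of rows, one row per scale a (in seconds).

    Direct discretisation of Equation (1). Because the wavelet here is
    complex, its conjugate is used inside the integral, as is conventional.
    """
    dt = 1.0 / fs
    n = len(signal)
    scalogram = []
    for a in scales:
        row = []
        for b in range(n):  # translation b, expressed in samples
            acc = 0j
            for k in range(n):
                acc += signal[k] * morlet((k - b) * dt / a).conjugate()
            row.append(abs(acc) * dt / math.sqrt(a))
        scalogram.append(row)
    return scalogram


# Toy input: a 20 Hz tone sampled at 200 Hz (not lung sound data).
fs = 200
tone = [math.sin(2 * math.pi * 20 * k / fs) for k in range(128)]

# For this Morlet wavelet, scale a responds most strongly near w0 / (2*pi*a) Hz.
scale_20hz = 6.0 / (2 * math.pi * 20)
scale_5hz = 6.0 / (2 * math.pi * 5)

scalogram = cwt_magnitude(tone, [scale_20hz, scale_5hz], fs)
mean_20hz = sum(scalogram[0]) / len(scalogram[0])
mean_5hz = sum(scalogram[1]) / len(scalogram[1])
```

The row computed at the scale matched to 20 Hz carries far more energy than the row matched to 5 Hz, which is the behaviour a scalogram image encodes as brightness. The triple loop is quadratic in the signal length and is only meant to mirror the formula; practical implementations use FFT-based convolution.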
Figure 3. (a) Splitting and padding of scalogram frames and (b) Different padding methods.
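The splitting and padding step can be sketched as follows. This is a simplified illustration and not the exact algorithm of Figure 3 or of [15]: the scalogram is assumed to be a list of time columns, frames are cut without overlap, and only the last, shorter frame is padded. The function names and the `method` values are hypothetical. For reverse sample padding, the sketch alternates the flip on every repetition so that each junction is continuous, which is one possible reading of "flipped in time before being stacked".

```python
def pad_frame(frame, length, method="sample"):
    """Extend `frame` (a list of time columns) to exactly `length` columns.

    method="zero":    append all-zero columns
    method="sample":  repeat the frame from its start
    method="reverse": repeat the frame, flipping it in time on each repetition
    """
    if not frame:
        raise ValueError("cannot pad an empty frame")
    if method not in ("zero", "sample", "reverse"):
        raise ValueError("unknown padding method: " + method)

    padded = list(frame)
    if method == "zero":
        n_scales = len(frame[0])
        while len(padded) < length:
            padded.append([0.0] * n_scales)
        return padded

    chunk = list(frame)
    while len(padded) < length:
        if method == "reverse":
            chunk = chunk[::-1]  # flip in time before stacking
        padded.extend(chunk)
    return padded[:length]


def split_and_pad(columns, frame_len, method="sample"):
    """Cut a scalogram into frames of `frame_len` columns, padding the last one."""
    frames = []
    for start in range(0, len(columns), frame_len):
        frames.append(pad_frame(columns[start:start + frame_len], frame_len, method))
    return frames


# One scale per column keeps the example readable; real frames have many scales.
cycle = [[1], [2], [3]]
zero_padded = pad_frame(cycle, 5, "zero")        # [[1], [2], [3], [0.0], [0.0]]
sample_padded = pad_frame(cycle, 5, "sample")    # [[1], [2], [3], [1], [2]]
reverse_padded = pad_frame(cycle, 5, "reverse")  # [[1], [2], [3], [3], [2]]
frames = split_and_pad([[v] for v in range(1, 8)], 3, "sample")
```

In the sample-padded result the values jump from 3 back to 1 at the junction, whereas the reverse-padded result continues from 3 to 3, which is the smoother transition that reverse sample padding is intended to provide.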
Since a fixed length is used for all recorded samples, it is logical to experiment with different split lengths of scalogram frames. In this paper, the lengths of the scalogram frames are chosen from a range of values to study their impact on the performance of the classification models. The frame lengths examined were 3.5 s, 4 s, 4.5 s and 5 s.
Padding methods were also proposed in [15], where the authors examined the effects of different padding methods on the classification task. We also incorporate a test on padding methods in this work. The padding methods of interest were zero padding and sample padding. We propose a new padding method which aims to ease the cut-off effects in sample padding. Instead of stacking features one after another, the features are flipped in time before being stacked onto the previous sample, which helps to minimise sudden changes in pixel values and creates a smoother transition between padded features. In that sense, models can focus more on learning the characteristic, discriminating patterns of the signals.
Z-score normalisation
The authors in Nguyen and Pernkopf [15] mentioned feature rolling as a method to account for the cyclic nature of respiratory signals, together with a z-score normalisation of the spectrogram. Z-score is among the common scaling methods in machine learning, together with Min-Max scaling. Z-score standardisation works well in normalising data with a normal distribution, and in the scalogram-based approach that we conduct, we also incorporate this step in the processing of the scalogram frames:
$$v = \frac{\mathrm{value} - \mu}{\sigma} \qquad (2)$$
where v is the normalised value for the scalogram, value is the scalogram's pixel intensity, and μ and σ are the mean and standard deviation of the pixel values, respectively.
Data augmentation
In previous research, the use of data augmentation was mentioned in some works [13,15,18] for sound classification. Augmentation plays an important role in handling data imbalance and helps boost the performance of the classifier [13]. Therefore, in our approach, we attempt to perform the augmentation on the scalogram frames, unlike some previous works that commonly performed it on audio, with the most popular techniques being time stretching and vocal tract length perturbation [13,15]. We incorporated a number of augmentation methods introduced in the Audio Augment toolbox in MATLAB [22]. The toolbox was originally introduced for spectrograms, but the same techniques can be used with scalograms. After the scalograms are extracted, they are augmented using several methods: vocal tract length normalisation (VTLN), frequency masking (the scales of the wavelet transform act as frequency), random shift of the scalogram (random time-scale shift), and the sum of two arbitrary scalograms, to enhance data generalisation. The details of
each augmentation practice can be found at [22]. Figure 4 illustrates the application of the augmentation methods to a wheeze sound sample. The scalograms in Figure 4 represent (from left to right): the original sound, the sound augmented with random shift, the sound augmented with masking, and the sound augmented with VTLN. The random shift technique randomly shifts some scales and times in the scalogram, while masking sets certain scale and time ranges to zero (we used masking of two time ranges and one scale range in this paper, but the choice can vary to generate more data). VTLN is commonly used in speech recognition and verification to eliminate the effect of different vocal tract lengths among speakers. For augmentation of lung sounds using VTLN, a warp factor is randomly generated at training time to warp the scale axis, which slightly perturbs the sounds. Another method used for data augmentation was the sum of two arbitrary scalograms of the same class to produce a new one.
The data distribution of each sound class before and after augmentation can be seen in Figure 5.
[Figure 5: bar chart; surviving labels: 5000, 4652, 2000, 1721]
Figure 5. Available samples of sound classes before and after the augmentation step.
Convolutional neural network and training
In Dalal Bardou et al. [13], the spectrograms were trained on a neural network whose topology was similar to AlexNet’s, with the same initial weights and backbone as the pretrained model. The authors of [14] incorporated a pretrained VGG16 neural network in their training. An AlexNet backbone was also utilised in [17] to train on scalogram features. In the recent works of [15,18], light-weight neural networks were constructed and reported to be more efficient for training on respiratory features. Seeing the performance of VGG-16 on the ImageNet database [23], we decided to build our neural network based on that model’s backbone, which has a suitable configuration for our application. The final network configuration is shown in Table 1.
In this paper, our proposed CNN is composed of 11 convolutional layers. There are 5 convolution blocks in total, each composed of 2 consecutive convolutional layers followed by a max pooling layer and a ReLU activation function. The number of filters differs between blocks, at 64, 128, 256, and 512 filters, as we get deeper into the network. Conv2D is used as a feature extractor, and the max pooling layer downsizes the images by half for computational efficiency. The activation function is Leaky ReLU. The output of the final convolutional layer is flattened and followed by fully connected layers of 512. Finally, a softmax activation function outputs the probability of each of the 4 classes of sound types. For the pooling layer, a stride of 2 is used to reduce the network complexity and
[Figure 6: two panels plotting Loss (y-axis, 0.3 to 0.9) against Epochs (x-axis, 0 to 140)]
Figure 6. Training and validation losses with respect to single model (left) and ensemble models (right).
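Before the results, the feature-level steps described in the methodology, namely the z-score normalisation of Equation (2) and augmentation applied directly to scalograms, can be sketched in plain Python. This is not the Audio Augment toolbox [22]: the scalogram is assumed to be a list of rows (one per scale), the standard deviation is taken over all pixels of a frame, and the helper names `z_score`, `mask`, `random_ranges` and `mix` are hypothetical. Random shift and VTLN are not covered here.

```python
import math
import random


def z_score(scalogram):
    """Apply Equation (2) using the mean and standard deviation of all pixels."""
    values = [v for row in scalogram for v in row]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0.0:  # constant image: avoid dividing by zero
        return [[0.0 for _ in row] for row in scalogram]
    return [[(v - mu) / sigma for v in row] for row in scalogram]


def mask(scalogram, time_ranges, scale_ranges):
    """Zero out [start, stop) ranges of time columns and of scale rows."""
    out = [list(row) for row in scalogram]
    for start, stop in scale_ranges:
        for r in range(start, min(stop, len(out))):
            out[r] = [0.0] * len(out[r])
    for start, stop in time_ranges:
        for row in out:
            for c in range(start, min(stop, len(row))):
                row[c] = 0.0
    return out


def random_ranges(n, count, max_width, rng):
    """Draw `count` random [start, stop) ranges that fit inside 0..n."""
    ranges = []
    for _ in range(count):
        width = rng.randint(1, min(max_width, n))
        start = rng.randint(0, n - width)
        ranges.append((start, start + width))
    return ranges


def mix(first, second):
    """Element-wise sum of two same-shape scalograms of the same class."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(first, second)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
frame = [[float(r * 6 + c) for c in range(6)] for r in range(4)]  # 4 scales x 6 time steps
normalised = z_score(frame)
# Two time ranges and one scale range, as used in this paper.
time_ranges = random_ranges(6, 2, 2, rng)
scale_ranges = random_ranges(4, 1, 1, rng)
masked = mask(normalised, time_ranges, scale_ranges)
mixed = mix(frame, frame)
```

Masking and mixing operate on the image after the CWT, so no audio has to be regenerated or re-transformed to produce a new training sample.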
Result and discussion
In this paper, we used ICBHI scores to measure the performance of the model, which are sensitivity, specificity, and F1-score. The ICBHI challenge includes two tasks. The first is to differentiate between normal and abnormal sound types (2-class task) and the second is to classify among a total of four types of sounds, namely wheeze, crackle, normal and both sounds (4-class task). The sensitivity for the 2-class and 4-class classifications, respectively, is given by:
$$\mathrm{Sensitivity} = \frac{P_{\text{crackle or wheeze}}}{N_{\text{crackle or wheeze}}}$$
$$\mathrm{Sensitivity} = \frac{P_{\text{crackle}} + P_{\text{wheeze}} + P_{\text{both}}}{N_{\text{crackle}} + N_{\text{wheeze}} + N_{\text{both}}}$$
where P denotes the number of correctly predicted samples and N the total number of samples of the class. The specificity is the same for both classification tasks:
$$\mathrm{Specificity} = \frac{P_{\text{normal}}}{N_{\text{normal}}}$$
The ICBHI score, which is their average, is calculated as:
$$\mathrm{ICBHI\ score} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}$$
We conduct experiments on classification performance with respect to the two tasks in the ICBHI challenge. In this work, we also study the effect of split lengths, sample padding, data augmentation and ensemble learning on the classification tasks, and the results of each of the experiments are shown in the tables below.
Different split length and padding method test
The ICBHI scores of classification performance with respect to different lengths of the scalogram frames and padding methods are summarised in Table 2. From these experiments, we can see that the best score is attained with a splitting length of 3.5 s and the sample padding method. Zero padding performed slightly worse than the sample and reverse sample padding methods, possibly due to the enhanced repeated respiratory traits obtained with the sample padding methods. The best choice of length and padding method is used as the configuration for the augmentation and ensemble learning tests (Table 2).
Table 2. The ICBHI score with respect to different lengths of the scalogram frames and padding methods.
                                      Lengths of frames (seconds)
Task                 Padding method   3.5     4       4.5     5
Task 1 (2 classes)   Sample padding   0.809   0.798   0.785   0.785
                     Rev padding      0.801   0.792   0.783   0.782
                     Zero padding     0.789   0.782   0.781   0.780
Task 2 (4 classes)   Sample padding   0.771   0.770   0.756   0.755
                     Rev padding      0.770   0.770   0.755   0.754
                     Zero padding     0.769   0.767   0.753   0.751
Augmentation test
Table 3 shows classification results with and without augmentation. The model trained with augmented data significantly outperforms the one trained without augmentation, which underlines the importance of data balancing. With only the original samples of each class of the original dataset, we can see that the model focuses on learning traits from the sound class with the greatest number of samples, namely the normal class, and fails to capture patterns from the other classes. Data augmentation helps balance the class samples and boosts network performance in classifying sound types.
Ensemble learning
The results of snapshot ensemble learning are shown in the table below. The different models of snapshot ensemble learning (Figure 7) have increased the classification performance by 2%, which confirms the effectiveness of ensemble learning. However, the many models learned also increase computational complexity, which can further lengthen the prediction time (Table 4).
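The way several snapshot models can be combined at prediction time is sketched below. The combination rule is not stated in this section, so plain averaging of the per-class softmax outputs is an assumption, as are the function name `ensemble_predict`, the class order in `CLASSES` and the made-up probabilities.

```python
CLASSES = ("normal", "crackle", "wheeze", "both")  # assumed class order


def ensemble_predict(snapshot_probs):
    """Average per-class probabilities over snapshots and pick the best class.

    `snapshot_probs` holds one probability vector per snapshot model,
    each with one entry per class in CLASSES.
    """
    n = len(snapshot_probs)
    mean = [sum(p[i] for p in snapshot_probs) / n for i in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=lambda i: mean[i])
    return CLASSES[best], mean


# Illustrative softmax outputs of three snapshots for one scalogram frame.
probs = [
    [0.50, 0.30, 0.15, 0.05],
    [0.20, 0.60, 0.15, 0.05],
    [0.30, 0.50, 0.10, 0.10],
]
label, mean = ensemble_predict(probs)
```

Here the first snapshot alone would predict "normal", but the averaged probabilities favour "crackle". Every snapshot has to be evaluated for each frame, which is the source of the extra prediction time noted above.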
Table 4. The ICBHI score for single model and ensemble learning.
                     No ensemble                             With ensemble
                     Specificity  Sensitivity  ICBHI score   Specificity  Sensitivity  ICBHI score
Task 1 (2 classes)   0.823        0.794        0.809         0.864        0.831        0.848
Task 2 (4 classes)   0.772        0.770        0.771         0.842        0.751        0.797
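The scores reported in Table 4 follow the sensitivity, specificity and ICBHI score definitions given in the results section, which can be sketched as below. The function name `icbhi_score`, the label strings and the toy labels are assumptions for illustration; in the 2-class case an abnormal cycle counts as correct whenever it is predicted as any abnormal class.

```python
def icbhi_score(true_labels, predicted_labels, four_class=True):
    """Return (sensitivity, specificity, ICBHI score) for per-cycle labels.

    Labels are "normal", "crackle", "wheeze" or "both".
    """
    normal_total = normal_hit = abnormal_total = abnormal_hit = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == "normal":
            normal_total += 1
            normal_hit += int(pred == "normal")
        else:
            abnormal_total += 1
            if four_class:
                abnormal_hit += int(pred == truth)     # exact class required
            else:
                abnormal_hit += int(pred != "normal")  # any abnormal class counts
    if normal_total == 0 or abnormal_total == 0:
        raise ValueError("need at least one normal and one abnormal cycle")
    sensitivity = abnormal_hit / abnormal_total
    specificity = normal_hit / normal_total
    return sensitivity, specificity, (sensitivity + specificity) / 2


truth = ["normal", "normal", "crackle", "wheeze", "both", "crackle"]
pred = ["normal", "crackle", "crackle", "crackle", "both", "normal"]
four_class_result = icbhi_score(truth, pred, four_class=True)
two_class_result = icbhi_score(truth, pred, four_class=False)
```

On the toy labels the 4-class call gives a sensitivity of 0.5, a specificity of 0.5 and a score of 0.5, while the 2-class call gives 0.75, 0.5 and 0.625. The same averaging is consistent with Table 4: for Task 1 without ensemble, (0.794 + 0.823)/2 ≈ 0.809.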
The ICBHI scores attained by this approach are compared with some other state-of-the-art works in Table 5.
Conclusion
This is the first scalogram-based research on sound type classification (wheeze, crackle and normal vesicular), rather than disease classification, with respect to the ICBHI database. The project also combines studies on cycle lengths and data augmentation applied to scalograms for respiratory classification, which has not been prominent in previous research. Based on the empirical results, the authors recommend splitting respiratory cycles with a length of 3.5 s and using the sample padding method for the classification task. The project also recommends the use of data augmentation and ensemble learning to boost model performance if the training capacity is sufficient. The processing time for scalogram feature extraction is 0.3295 ± 0.05 s/frame and training takes about 15 s/epoch. Our results with scalogram features and a convolutional neural network are better than those of the other state-of-the-art methods compared. However, further studies can investigate even more light-weight CNNs for scalogram-based features that reduce model complexity and consequently time and computational costs.
Acknowledgement
This research is funded by International School, Vietnam National University, Hanoi (VNU-IS) under project number CS.2022-01