
Applied Acoustics 202 (2023) 109168

Contents lists available at ScienceDirect

Applied Acoustics
journal homepage: www.elsevier.com/locate/apacoust

CNN hyper-parameter optimization for environmental sound classification

Özkan İnik
Department of Computer Engineering, Tokat Gaziosmanpasa University, Tokat, Turkey
E-mail address: ozkan.inik@gop.edu.tr

Article info

Article history: Received 10 May 2022; Received in revised form 23 November 2022; Accepted 7 December 2022; Available online 15 December 2022.

Keywords: Environmental sound classification (ESC); CNN; Particle swarm optimization (PSO); Hyper-parameter optimization; Urbansound8k; ESC-50

Abstract

Environmental sounds are widely used in our lives. They are especially used in tasks such as managing smart cities, location determination, surveillance systems, machine hearing, and environmental monitoring. The main method for this, environmental sound classification (ESC), has been increasingly studied in recent years. However, the classification of these sounds is more difficult than that of other sounds because there are too many parameters that generate noise. This study tried to find the convolutional neural network (CNN) model that gives the highest accuracy for ESC tasks through the optimization of hyper-parameters. For this purpose, the Particle Swarm Optimization (PSO) algorithm was rearranged to represent the CNN architecture. Thus, the hyper-parameters in the CNN are represented exactly, without any transformation during optimization. Studies were carried out on the ESC-10, ESC-50, and Urbansound8k data sets, which are the state of the art for ESC tasks. Some data augmentation techniques were used on the data sets in the training of the CNN models. The CNN models obtained with PSO achieved success rates of 98.64 % for ESC-10, 93.71 % for ESC-50, and 98.45 % for Urbansound8k, respectively. These results are the best accuracy values obtained with a pure CNN model when compared with previous studies. As a result, it has been made possible to automatically design CNN models that give high classification accuracy for the classification of urban sounds. Thus, researchers who do not know much about CNN design can use this method on their desired datasets without the need for expert knowledge.

© 2022 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.apacoust.2022.109168

1. Introduction

Sound data contain more semantic information than visual data [1]. In particular, sound data become more important for obtaining information about an environment. To realize some applications in daily life, it is necessary to use environmental sounds, unlike speech and music sounds. For this reason, studies on the classification of urban sounds have intensified in recent years. Environmental sound classification (ESC) is known as one of the most important issues of the non-speech sound classification task [2]. ESC is of critical importance in many problems such as noise pollution analysis [3,4], surveillance systems [5–7], context-aware applications [1,8–13], machine hearing [14–17], environmental monitoring [18], crime alert systems [19], soundscape assessment [20,21], and smart cities [22,23]. Different data sets have been created for the ESC task. The ESC-10, ESC-50 [24], and Urbansound8k (US8K) [25] datasets are used extensively. Different statistical and machine learning methods have been used for ESC in the literature [1,26–33]. The success rates of these methods are relatively low compared to deep-learning-based studies. Deep learning [34] achieved a high success rate in the ImageNet [35] competition in 2012. Due to this success, deep learning models have been used frequently in recent years for ESC tasks [36–50]. These studies are described in the next section.

1.1. Related work

In the study by Piczak [36], a CNN model was used to classify three different data sets. The proposed CNN model consists of two convolution layers, a pooling layer, and two fully connected layers. The results showed that the CNN model performs better than other existing methods. In the study by Salamon et al. [37], a CNN was used for ESC, and data augmentation was also applied during CNN training. As a result of the training, it was shown that the CNN model performs better with data augmentation. A CNN model was proposed for the detection of acoustic events by Takahashi et al. [38]. The proposed CNN model is inspired by VGGNet [51]. In addition, a new method was proposed for augmenting the data used in the training process. In the training phase of deep learning models, a different strategy was developed by Tokozume et al. [39] to feed the data to the model.


In this method, called Between-Class learning, sounds between classes are produced by mixing two sounds belonging to different classes at random rates. In the study, it was stated that the Between-Class learning method performed well in sound recognition networks and data augmentation. A network was also defined for the classification of ESC and trained with the proposed method. It was stated that the error rate obtained is lower than the human sound recognition error rate.
Classification was performed on ESC-10, ESC-50 and US8K data-
sets by Boddapati et al. [40] using AlexNet and GoogLeNet mod-
els. Audio signals in datasets were converted into images using
spectrogram, MFCC and CRP methods. Then, the classification pro-
cess was carried out using deep learning models on the images
obtained. An original stacked CNN model was proposed in the
study by Li et al. [41]. Multiple convolution layers are used with
reduced filter numbers in the proposed network. In the method,
two different CNN networks are trained using raw audio signal
data. These two models were then combined using Dempster–Sha-
fer (DS) evidence theory, and a new model for CNN named DS-CNN
was proposed. It is stated that the proposed model outperforms
other CNN-based models in ESC-10, ESC-50 and US8K datasets.
In the study by Su et al. [42], it was suggested to combine two
different features of ESC to represent it more comprehensively. A
four-layer CNN network named TSCNN-DS was designed for the
classification process. The proposed model achieved 97.2 % accu-
racy on the US8K dataset. In the study by Mushtaq et al. [43], it
was emphasized that overlapping sounds, the presence of many
sound sources in the environment, and the distance between the
acoustic source and the microphone all increase the complexity.
Due to this complexity, a deep convolutional neural
network (DCNN) is used with the validated audio properties of
the ESC. In the study, three sound feature extraction techniques,
Mel spectrogram, Mel Frequency Cepstral Coefficient, and Log-
Mel were considered for the ESC-10, ESC-50, and US8K datasets.
The obtained accuracy values were 94.94 %, 89.28 %, and 95.37 %
for the ESC-10, ESC-50 and US8K, respectively. In the study con-
ducted by Mushtaq et al. [44], an approach on spectral images in
classification with CNN was introduced using the method of direct
data augmentation of audio data. In the presented approach, the
Mel spectrogram feature is used. Randomly selected 7- or 9-layer
CNN models and ESC-10, ESC-50 and US8K data were used. It
was emphasized that effective and high accuracy was achieved
by using the meaningful data augmentation method directly over
the voice. Among the models used, accuracy rates of 99.04 % for
ESC-10 and 99.49 % for US8K were achieved with ResNet-152,
and 97.57 % for ESC-50 with DenseNet-161.
In the study by Chen et al. [45], the ESC problem was handled with dilated CNN. It has been stated that using this structure results in higher classification accuracy than the max-pooling process used after the convolution layer. At the same time, the effect of different dilation ratios and the number of convolution layers on the results was investigated. In the ESC problem, the dilated CNN achieved better results than the max-pooling CNN; however, it was stated that increasing the number and ratio of filters negatively affects the classification accuracy.

Fig. 1. The basic structure of CNN architectures in deep learning.

Fig. 2. Flow chart of particle swarm optimization algorithm.

In the study by Abdoli et al. [46], a one-dimensional (1D) CNN network was used for ESC classification. Frames are taken from the audio signal as the input data. In the experimental studies, an average accuracy value of 89 % was obtained on the US8K data set. It was stated that the use of raw input data showed the best performance among the end-to-end studies. In addition, it was stated that the proposed method has fewer parameters than other models in the literature.

Conditional Neural Network (CLNN) and its extension Masked Conditional Neural Network (MCLNN) have been proposed by Medhat et al. [47]. In the study, the time-frequency representation of the sound was modeled by making use of the natural state of the sound. With the proposed MCLNN, the classification of different music and ESC data sets was achieved. MCLNN classification accuracy outperforms CNN-based models.

Fig. 3. An example representation of records of classes in the ESC10 dataset.

Table 1
Categories in the ESC50 dataset and the names of the classes in each category.

Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises
Dog | Rain | Crying baby | Door knock | Helicopter
Rooster | Sea waves | Sneezing | Mouse click | Chainsaw
Pig | Crackling fire | Clapping | Keyboard typing | Siren
Cow | Crickets | Breathing | Door, wood creaks | Car horn
Frog | Chirping birds | Coughing | Can opening | Engine
Cat | Water drops | Footsteps | Washing machine | Train
Hen | Wind | Laughing | Vacuum cleaner | Church bells
Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane
Sheep | Toilet flush | Snoring | Clock tick | Fireworks
Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw


Table 2
Number of records used for training, validation, and testing in the US8K data set.

Class | Number of images | Train | Validation | Test
Air conditioner | 1000 | 600 | 200 | 200
Car horn | 429 | 257 | 86 | 86
Children playing | 1000 | 600 | 200 | 200
Dog bark | 1000 | 600 | 200 | 200
Drilling | 1000 | 600 | 200 | 200
Engine idling | 1000 | 600 | 200 | 200
Gun shot | 374 | 224 | 75 | 75
Jackhammer | 1000 | 600 | 200 | 200
Siren | 929 | 557 | 186 | 186
Street music | 1000 | 600 | 200 | 200

Fig. 4. An example representation of records of classes in the US8K dataset.

In the study conducted by Zhang et al. [48], the effects of the sizes and activation functions of the filters in the convolutional layers of CNNs on ESC were investigated. For this purpose, a dilated CNN-based model (D-CNN-ESC) is proposed. The proposed system was applied to three different sound data sets and achieved a lower error rate on the US8K dataset compared to other methods.

In the study by Lim et al. [49], a CNN-based method was proposed for the classification of sound events. The proposed method classifies 30 different sound events on different datasets with an accuracy rate of 81.5 %. In Akbal's study [50], a stable feature extraction method is presented to determine the location of the activity from environmental sounds. This method consists of three basic stages, which include feature creation, selection, and classification. In this study, the proposed method is applied to the ESC-10 data set and the classification of the sounds in the data set is provided. With this proposed method, an accuracy rate of 90.25 % was obtained. Although deep learning models achieve high accuracies, the models can be misled by noise added to the data. This has been addressed in a detailed study by Tripathi and Mishra [52]. In their study, the effect of adversarial attacks on ESC data classified with CNN models was investigated. They also created two datasets that will serve as benchmarks in such studies. Finally, in Tuncer et al. [53], ESC classification was performed by using spiral patterns and two-dimensional maximum, minimum, median, and mean (2D-M4) pooling methods. This impressive study has been applied to the ESC-10 and ESC-50 datasets, and high accuracy has been achieved.

Fig. 5. The architecture of the proposed method.

Fig. 6. Sound (up) and transformed image according to a scalogram (down) of classes in the US8K data set.

1.2. Motivation

In general, it has been observed that the ESC success rates obtained with deep learning models are better than those of other artificial intelligence methods. The main reason for this can be summarized as automatic feature discovery in deep learning models. However, many parameters need to be adjusted in the design of deep learning models. These are divided into two parts: optimization parameters and model design parameters. This situation can be seen as both an advantage and a disadvantage. As an advantage, it allows different models to be designed by different researchers for the solution of a problem. On the other hand, the fact that it is almost impossible to design the best model for a problem can be seen as a disadvantage. A lot of work has been done in recent years to overcome this disadvantage of CNN models. These studies used different algorithms for the optimization of CNN parameters. Optimization methods such as the genetic algorithm, the evolutionary algorithm, and particle swarm optimization were used for the parameter optimization of CNN models. Satisfactory results have been obtained for the optimization of CNN parameters [54–56]. Recently, CNN models [2,36–38,41–46,48,49,57] have come to be used for ESC tasks. However, no studies have been found so far regarding the most appropriate layer structure and parameter values for a CNN to be used in ESC tasks. The main motivation of this work is to perform the optimization of CNN hyper-parameters, which gives the highest accuracy for the classification of urban sounds.

1.3. Contributions

Signal processing methods, classical machine learning methods, and deep learning models are used for the classification of urban sounds. The contributions of this study, in addition to these studies, are given below.

1. Optimization of the hyper-parameters of a CNN used for the classification of urban sounds was carried out for the first time in this study.
2. The PSO [58] algorithm was adapted for CNN parameter optimization. Thus, the CNN parameters were used directly in the PSO algorithm without converting them into any numerical values.
3. Unlike transfer learning or CNN models designed in different architectures, a basic CNN model that works with the highest accuracy was obtained.
4. With the parameter optimization performed in the study, a CNN model was automatically obtained for the classification of environmental sound data in image format. Thus, it is possible to obtain an automatic CNN for those who want to work on any data set in image format but have no knowledge of CNN design or are beginners.

1.4. Paper structure

This paper is organized as follows. In Section 2, information about the CNN, PSO, and ESC data sets is given. In Section 3, information about the proposed method is given; this section describes how to optimize CNN models with PSO. In addition, the CNN models that give the highest classification accuracy for each data set with the proposed method are presented. Experimental studies are given in Section 4. Finally, the conclusion is explained in Section 5.

Fig. 7. The flow chart of PSO used in CNN optimization.

2. Background

2.1. Convolutional neural networks

Convolutional neural networks (CNNs) are accepted as the main architecture of deep learning because they discover unique features by performing the learning process on raw data. They exhibit high performance in problems such as classification, identification, and segmentation. For this reason, they have become widespread in many fields such as engineering, medicine, and the defense industry. They have become even more popular with their use on big data, especially thanks to automatic feature discovery [59]. CNNs are deep networks that consist of multiple layers in succession and extract feature maps of objects during network training. In addition to the advantage that CNNs extract the feature maps themselves, they also leave the researcher free in the number of layers in the model and the parameters used in these layers. In this way, a vast space of trials is left to the researcher for reaching the highest performance with the model to be created. The basic structure of a CNN is given in Fig. 1.

Fig. 8. Representation of CNN layers as particles.

Looking at the CNN architecture, the section consisting of the input layer, convolution layers, and pooling layers is used for feature extraction, while classification is performed in the section created by the fully connected layers. In the working principle of CNNs, the appropriate data received in the input layer are passed through blocks consisting of convolution and pooling layers and transferred to the last fully connected layer. After this layer, the output data are obtained. These data are compared with the desired result, and the difference is accepted as the error. To minimize this error, the weights are updated using a gradient-based backpropagation algorithm. Network training is terminated when the desired epoch value is reached [60].
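To make the training principle above concrete, a minimal sketch in Python/PyTorch is given below (the study itself used Matlab, so this is an illustration rather than the paper's code); model and loader stand for an arbitrary CNN and a labeled data loader, and the SGDM settings only mirror Table 5 for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CNN training principle described above:
# forward pass -> error vs. desired output -> gradient-based weight
# update, repeated until the desired epoch value is reached.
def train(model, loader, epochs=10, lr=0.001):
    criterion = nn.CrossEntropyLoss()  # error between output and label
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # SGDM
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # forward pass + error
            loss.backward()                          # backpropagation
            optimizer.step()                         # weight update
    return model
```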

Fig. 9. Calculation of Gbest-Particle or Pbest-Particle.

Fig. 10. Calculation of velocity.

Fig. 11. Update of particle.


Table 3
Particle swarm optimization initial values.

Name of Parameter | Value
Number of iterations | 3
Swarm size | 50
c | 0.8

Table 4
Parameter values of the CNN model architecture.

Name of Parameter | Value
Minimum number of layers | 3
Maximum number of layers | 15
Minimum number of filters in a convolutional layer | 16
Maximum number of filters in a convolutional layer | 256
Minimum size of a convolutional kernel | 2
Maximum size of a convolutional kernel | 11
Minimum kernel size of a pooling layer | 2
Maximum kernel size of a pooling layer | 7
Minimum stride of a pooling layer | 2
Maximum stride of a pooling layer | 7
Minimum number of neurons in a fully connected layer | 10
Maximum number of neurons in a fully connected layer | 1024

Table 5
CNN training parameters.

Parameter | Value
Optimizer | SGDM
Epoch used to obtain the best CNN | 10
Epoch for training the best CNN | 50
Dropout rate | 0.5
Mini-batch size | 256
Initial learning rate | 0.001

2.2. Particle swarm optimization

Particle swarm optimization (PSO), developed by Kennedy and Eberhart in 1995, was inspired by the movements of herd animals in search of food [58]. The population created in this algorithm is called the swarm, and each individual in the population is called a particle. Each particle searches for the best solution, and its own best solution is called 'Pbest'. The best solution found by the whole population is called 'Gbest'. The flow diagram of the algorithm is given in Fig. 2. After the initial parameters are defined, the particles in the population are randomly distributed to create a swarm. The fitness function value is calculated according to the initial positions of the particles, and then the Pbest and Gbest values are updated. The velocities and positions of the particles are then recalculated. The position of each particle is updated according to Equation (1), using the velocity calculated according to Equation (2):

x_i(t+1) = x_i(t) + v_i(t+1)    (1)

v_i(t+1) = w · v_i(t) + C_1 · Rand_1 · (Pbest_i − x_i(t)) + C_2 · Rand_2 · (Gbest_i − x_i(t))    (2)

where i is the particle index, t is the iteration number, v is the particle velocity, x is the particle position, Rand_1 and Rand_2 are random functions, Pbest_i is the best position ever visited by particle i, Gbest is the best position discovered so far by the whole swarm, and w, C_1, and C_2 are positive constants (the inertia weight and the acceleration coefficients).
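As a reference point before the CNN-specific adaptation in Section 3, a minimal numeric implementation of the standard update of Equations (1) and (2) might look as follows (a Python sketch; the objective function and all constant values are illustrative, not those of the study).

```python
import numpy as np

def pso(fitness, dim, n_particles=50, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, lb=-5.0, ub=5.0, seed=0):
    """Standard PSO minimizer implementing Eq. (1) and Eq. (2)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n_particles, dim))   # particle positions
    v = np.zeros_like(x)                          # particle velocities
    pbest = x.copy()
    pbest_val = np.apply_along_axis(fitness, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(n_iter):
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (2)
        x = np.clip(x + v, lb, ub)                                 # Eq. (1)
        val = np.apply_along_axis(fitness, 1, x)
        better = val < pbest_val                  # update personal bests
        pbest[better], pbest_val[better] = x[better], val[better]
        gbest = pbest[pbest_val.argmin()].copy()  # update global best
    return gbest, pbest_val.min()

# Example: minimize the sphere function x -> sum(x^2)
best_x, best_f = pso(lambda p: float(np.sum(p ** 2)), dim=3)
```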

Fig. 12. Convergence graphs of the CNN-ESC10 model during the training.


Fig. 13. Convergence graphs of the CNN-ESC50 model during the training.

Fig. 14. Convergence graphs of the CNN-US8K model during the training.

2.3. Data sets

Three different data sets have been used extensively for the classification of environmental sounds in recent years. These datasets are ESC-10, ESC-50, and UrbanSound8K, respectively. Information about these datasets is given below.

2.3.1. ESC-10 data set
The ESC-10 dataset was created by Karol J. Piczak [24]. There are ten classes in the dataset; these classes are 'Dog bark', 'Rain', 'Sea waves', 'Baby cry', 'Clock tick', 'Person sneeze', 'Helicopter', 'Chainsaw', 'Rooster', and 'Fire crackling'. Each recording is on average 5 s long. In total, there are on average 40 audio recordings for each class. An example representation of the sounds belonging to each class in the dataset is given in Fig. 3. This dataset is a subset of the ESC-50 dataset.

2.3.2. ESC-50 data set
The ESC-50 dataset is an enhanced version of the ESC-10 dataset. There are 50 classes in this dataset. Each class consists of 40 recordings of 5 s in length. It has 2000 sound clips from urban settings. These classes fall under the main categories 'Animals', 'Natural soundscapes & water sounds', 'Human non-speech sounds', 'Interior/domestic sounds', and 'Exterior/urban noises'. The names of each class under these main categories are given in Table 1.

2.3.3. UrbanSound8K
UrbanSound8K (US8K) is a dataset consisting of 8732 labeled sounds prepared by Salamon et al. [25]. Each record in the dataset is approximately 4 s long. The dataset has 10 classes. These classes are: 'air conditioner', 'car horn', 'children playing', 'dog bark', 'drilling', 'engine idling', 'gunshot', 'jackhammer', 'siren', and 'street music'. The number of records for each class in this data set is given in Table 2. An example representation of the sounds belonging to each class in this dataset is given in Fig. 4.

3. Proposed method

The architecture of the proposed method is given in Fig. 5. In the proposed method, firstly, the audio data is converted to image format using the scalogram method. Secondly, the CNN model that gave the best results for the data set was obtained with the PSO

algorithm. The optimization of the hyper-parameters in the CNN model is detailed in Section 3.2.

Fig. 15. Confusion matrix giving the average accuracy value of the CNN-ESC10 model.

Fig. 16. Confusion matrix giving the average accuracy of the CNN-ESC50 model.

3.1. Data set preprocessing

In the study, sounds were converted from signal to image format using the scalogram method. The scalogram is the absolute value of the continuous wavelet transform of a signal, plotted as a function of time and frequency. The Wavelet Toolbox of Matlab R2020b software was used for the conversion process. As an example, records converted from sound to image in the US8K dataset are given in Fig. 6.

Since the models obtained in the experimental studies are deeper, it was understood that more data are needed for the training of these models. For this reason, the models were trained with both the original data sets and augmented data sets. Translation and flip methods were used for data augmentation. By applying these methods in the vertical, horizontal, and vertical-horizontal directions, each method tripled the data set, and the data sets were increased 6 times in total.
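As an illustration only, a comparable conversion can be sketched in Python with PyWavelets instead of the Matlab Wavelet Toolbox used in the study; the file name, the 'morl' wavelet, and the scale range are assumptions.

```python
import numpy as np
import pywt
import soundfile as sf
import matplotlib.pyplot as plt

signal, sr = sf.read("dog_bark.wav")      # hypothetical US8K clip
if signal.ndim > 1:
    signal = signal.mean(axis=1)          # mix down to mono

scales = np.arange(1, 128)                # illustrative scale range
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1.0 / sr)

plt.figure(figsize=(2.24, 2.24), dpi=100)     # ~224x224 px CNN input
plt.imshow(np.abs(coeffs), aspect="auto")     # |CWT| is the scalogram
plt.axis("off")
plt.savefig("dog_bark_scalogram.png", bbox_inches="tight", pad_inches=0)

# Flip-based augmentation of the saved image (translation can be done
# similarly, e.g. with np.roll along an axis):
img = plt.imread("dog_bark_scalogram.png")
flipped = [np.flipud(img), np.fliplr(img), np.flipud(np.fliplr(img))]
```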


Fig. 17. Average confusion matrix of the CNN-US8K model.

Fig. 18. Boxplot of the CNN-ESC10, CNN-ESC50, and CNN-US8K models.

3.2. Optimization of CNN hyper-parameters

CNN parameter optimization aims to perform the parameter selection for the CNN model that gives the highest accuracy for any task. However, this is not an easy process because of the number of parameters to be adjusted, and the computational cost is very high. For this reason, it is necessary to use optimization algorithms that converge in fewer iterations.
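To give an idea of the size of this search space, the following hedged sketch shows one way a candidate CNN architecture (a particle) could be encoded and sampled within the ranges of Table 4; the tuple encoding is illustrative and not the paper's exact representation.

```python
import random

def random_particle():
    """Sample a candidate CNN architecture within the Table 4 ranges."""
    n_layers = random.randint(3, 15)          # min/max number of layers
    layers = []
    for _ in range(n_layers - 1):             # last slot reserved for FC
        if random.random() < 0.5:
            layers.append(("conv",
                           random.randint(16, 256),   # number of filters
                           random.randint(2, 11)))    # kernel size
        else:
            layers.append(("pool",
                           random.randint(2, 7),      # kernel size
                           random.randint(2, 7)))     # stride
    layers.append(("fc", random.randint(10, 1024)))   # neurons, always last
    return layers
```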


Table 6
Performance values obtained with the proposed models.

Proposed Model | Data Augmentation | Mean Accuracy | Max Accuracy | Min Accuracy | Standard Deviation
CNN-ESC10 | Yes | 98.64 | 99.64 | 97.86 | 0.59
CNN-ESC10 | No | 88.50 | 95.00 | 80.00 | 4.64
CNN-ESC50 | Yes | 96.77 | 97.36 | 96.36 | 0.34
CNN-ESC50 | No | 74.85 | 81.00 | 71.00 | 2.82
CNN-US8K | Yes | 98.45 | 98.64 | 98.18 | 0.13
CNN-US8K | No | 91.17 | 92.90 | 88.89 | 1.03

In this study, the PSO algorithm is used to obtain the CNN model that gives the highest accuracy for the classification of urban sounds. For optimization with PSO, the calculations made in Equation (1) and Equation (2) must first be adapted to the CNN structure. For this reason, a new method has been developed for the representation of CNN layer structures, similar to the studies in [61,62]. In the proposed method, CNN layers are used directly, without being converted into any numerical values. The flow chart of the updated PSO algorithm is given in Fig. 7.

In Fig. 7, the first step in the flow chart is the creation of the particles. As shown in Fig. 8, each particle was formed to represent a CNN model. The best model is obtained after the created CNN models are trained with the data set. At this stage, Gbest represents the best-performing CNN model, and Pbest represents the particle's own best CNN model. In the next step of the flow diagram, the velocity is calculated according to Equation (3):

Velocity = Pbest − P, if r > rv
Velocity = Gbest − P, otherwise    (3)

The r in the equation refers to a randomly generated number in the range 0–1. The rv represents the reference value and determines from which source each layer will be taken in the comparison. In the algorithm, the Gbest − P or Pbest − P difference is calculated as shown in Fig. 9. In Fig. 9, each convolution and pooling layer is compared; identical layers take the letter 'O', and identical fully connected layers take 'O (fc)'. If the values in a layer are different, the Pbest and Gbest values are taken into account. An example of velocity calculation according to Equation (3) is given in Fig. 10. As seen in Fig. 10, the rv value was determined as 0.8. For the first layer, r was generated as 0.74; hence, the first layer of the velocity is taken from (Gbest − Particle). In this way, the velocity value is obtained after comparing all layers according to the randomly generated r value. After the velocity is obtained, the particles are updated as shown in Fig. 11. In Fig. 11, if a velocity layer is 'O', the particle's own layer is kept; if the velocity and particle layers are different, the velocity layer is taken.
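A hedged sketch of this layer-wise velocity and update scheme (Equation (3) and Figs. 9–11) might look as follows in Python; the layer tuples reuse the illustrative encoding sketched earlier in this section, and particles with different layer counts are simplified here by truncation.

```python
import random

def velocity(pbest, gbest, particle, rv=0.8):
    """Layer-wise velocity per Eq. (3): take each layer from Pbest when
    r > rv, otherwise from Gbest; identical layers are marked 'O'."""
    vel = []
    for p_l, g_l, x_l in zip(pbest, gbest, particle):  # truncates to shortest
        src = p_l if random.random() > rv else g_l     # Eq. (3)
        vel.append("O" if src == x_l else src)
    return vel

def update(particle, vel):
    """Fig. 11: keep the particle's layer where velocity is 'O' (no change),
    otherwise replace it with the velocity layer."""
    return [x_l if v_l == "O" else v_l for x_l, v_l in zip(particle, vel)]
```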
4. Experimental studies

In this section, the training of the CNN models obtained with PSO and the test results obtained by these models are given. The success values obtained were compared with the studies in the literature. In addition, both the values of the parameters used in the design of the CNN models and the initial values of the parameters used in the PSO are given. Experimental studies were carried out on a computer with an Intel Core i9-7900X 3.30 GHz x 20 processor, 64 GB RAM, and 2 x GeForce RTX 2080 Ti graphics cards. Matlab R2021b 64-bit (win64) was used as the software platform.

4.1. Parameters used in the proposed method

The parameters used for the adapted PSO are given in Table 3. As seen in Table 3, the swarm size was determined as 50. Each particle in the swarm represents a CNN model; therefore, 50 different candidate CNNs are created in each iteration. Since the convergence speed of PSO is very high, the number of iterations was chosen as low as 3. The c value in the table represents the rv value in Eq. (3). An example usage of this value in the proposed method is given in Fig. 10.

The parameter values used for the optimization of the CNN model are given in Table 4. In particular, the parameter ranges have been kept wide so that the most suitable model can be found by searching the CNN models in a wide solution space. The maximum number of layers of the models has been selected as 15. The minimum number of layers is set to three, and one of them must be a fully connected layer. In addition, the fully connected layer must always be placed last in the sequence of layers. These layer counts cover only the convolution, pooling, and fully connected layers; the ReLU layers after the convolution layers and the DropOut layers after the FC layers are added automatically.

The parameters used for the training of each CNN model are given in Table 5. In the study, there are two types of epoch values: the epoch value of the CNNs trained during optimization and the epoch value of the best CNN found after optimization is completed. In the first, the epoch value was chosen as 10 to find the CNN model with the best performance in the training phase. Indeed, 10 epochs is a low value for training a CNN model. However, it is acceptable for all candidate CNN models to have an epoch value of 10 and to choose the best among them. In the second, the best candidate is trained with the final training parameters and an epoch value of 50.

4.2. Best CNN models obtained with adapted PSO

The best CNN models obtained for the ESC10, ESC50, and US8K datasets are named CNN-ESC10, CNN-ESC50, and CNN-US8K, respectively. The layer architectures and parameters of these models are given in Appendix A. The CNN-ESC10 model consists of a total of 22 layers, including 4 convolutional, 3 pooling, and 2 fully connected layers. The CNN-ESC50 model consists of a total of 24 layers, including 5 convolutional, 2 pooling, and 2 fully connected layers. The CNN-US8K model consists of a total of 18 layers, including 4 convolutional, 3 pooling, and 2 fully connected layers.

4.3. Training and results of CNN models

The training process for the CNN models was performed with 10-fold cross-validation. The convergence graphs of CNN-ESC10, CNN-ESC50, and CNN-US8K during the training phase are given in Fig. 12, Fig. 13, and Fig. 14, respectively. When all three figures are examined, it can be seen that the accuracy and validation accuracy curves, and the error and validation error curves, are very close to each other; therefore, the models do not overfit during training. In addition, CNN-ESC10 converges faster than the other models.
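The statistics reported in Table 6 follow directly from this protocol; a minimal sketch of the workflow (assumed, not the paper's Matlab code) is given below, where train_and_eval is a hypothetical helper that trains one CNN on a split and returns its test accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(images, labels, train_and_eval, k=10, seed=0):
    """Run k-fold cross-validation and return Table 6-style statistics."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    accs = np.array([train_and_eval(images[tr], labels[tr],
                                    images[te], labels[te])
                     for tr, te in skf.split(images, labels)])
    return {"mean": accs.mean(), "max": accs.max(),
            "min": accs.min(), "std": accs.std()}
```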

The confusion matrix giving the average accuracy of the CNN-ESC10 model is given in Fig. 15. Looking at Fig. 15, the model labeled the Rooster class with the highest accuracy and the Dog bark class with the lowest accuracy. In addition, the confusion matrices obtained by this model for each fold are given in Appendix B. The highest accuracy rate was obtained in Fold 9, and the lowest accuracy rate was obtained in Fold 4.

The confusion matrix giving the average accuracy of the CNN-ESC50 model is given in Fig. 16. Looking at Fig. 16, the model labels 26 classes with 100 % accuracy. The confusion matrices obtained by the model for each fold are given in Appendix C.

The confusion matrix giving the average accuracy of the CNN-US8K model during the test phase is given in Fig. 17. Looking at Fig. 17, the class labeled with the highest accuracy by the model is Air conditioner, and the classes labeled with the lowest accuracy are Car horn and Street music. In addition, the confusion matrices obtained by this model for each fold are given in Appendix D.

Boxplots of all three models are given in Fig. 18. As seen in Fig. 18, CNN-ESC10 showed the best performance, while the CNN-ESC50 model showed the lowest performance. In addition, the CNN-US8K model showed the lowest standard deviation.

The performance values of the models obtained in this study on the data sets are presented in Table 6. In the table, the results obtained with and without data augmentation are given for each data set. The average accuracy values obtained on the ESC-10, ESC-50, and US8K datasets are 88.50, 74.85, and 91.17 % without data augmentation, and 98.64, 96.77, and 98.45 % with data augmentation, respectively. The models achieved higher values with data augmentation. The largest gap between the original data and the augmented data was seen in the ESC-50 data set. The main reason for this is that there are 50 classes in this dataset and less training data per class, so the model could not find enough data to update its weights. With the increase of the data, the success values for the 50 classes increased even more.

4.4. Comparison with other studies

Different studies based on deep learning have been conducted on the ESC data sets. The comparison of the average accuracy value obtained with the proposed method with other studies is given in Table 7. The '–' character in Table 7 means that the relevant method was not applied to that data set. It has been observed that the proposed CNN models have higher success than the baseline models [24,25] on all data sets. Compared to all studies in the table, the proposed method outperformed all except one [44]. Since transfer learning is used in that study, a higher accuracy rate is expected because, in transfer learning, the models are applied to urban sounds after being trained on a different data set consisting of millions of samples. The proposed models yielded better results than human performance on the ESC-10 and ESC-50 datasets.

4.5. Discussion

Classification of urban sounds with artificial intelligence methods has increased in recent years. Due to the high success of deep learning models on other data types, they have been used frequently for urban sounds. Obtaining CNN models that perform better in the classification of urban sounds is still being investigated by many researchers. However, in the studies carried out so far, the models were designed manually. Therefore, it is almost impossible to obtain CNN models that work with the highest accuracy, because the number of parameters that need to be set in a CNN is very large. For example, just the order of the layers can be arranged in a million different ways. In addition, considering that the parameter values

Table 7
Comparison of the accuracy value obtained by the proposed method with other methods' accuracy (%).

Year - Authors [Reference] | Method | US8K | ESC10 | ESC50
2014 - Salamon et al. [25] | SVM (Baseline) | 68.00 | – | –
2015 - Piczak [24] | Random Forest Ensemble (Baseline) | – | 72.70 | 44.30
2015 - Piczak [24] | Human Performance | – | 95.70 | 81.30
2017 - Boddapati et al. [40] | AlexNet, GoogLeNet, CRNN | 93.00 | 91.00 | 73.00
2020 - Inik and Seker [63] | CNN | 82.45 | – | –
2017 - X. Zhang et al. [48] | D-CNN (activation = LeakyReLU) | 81.90 | – | 68.10
2017 - X. Zhang et al. [48] | D-CNN (activation = PReLU) | 81.40 | – | 66.20
2017 - X. Zhang et al. [48] | D-CNN (activation = ReLU) | 81.20 | – | 67.14
2016 - Aytar et al. [64] | SoundNet | – | 92.20 | 74.20
2016 - Salamon and Bello [37] | DCNN + data augmentation | 79.00 | – | –
2017 - X. Zhang et al. [48] | D-CNN (activation = ELU) | 78.90 | – | 68.00
2020 - Demir et al. [2] | Pyramid-Combined CNN | 78.14 | 94.80 | 81.40
2019 - Chen et al. [45] | Dilated CNN | 78.00 | – | –
2017 - Ye et al. [65] | Feature learning | 77.36 | – | –
2017 - X. Zhang et al. [48] | D-CNN (activation = Softplus) | 73.70 | – | 53.00
2017 - Dai et al. [66] | CNN | 71.68 | – | –
2018 - Pons and Serra [67] | VGG and SVMs | 70.74 | – | –
2020 - Akbal [50] | SVM | – | 90.25 | –
2021 - Mushtaq et al. [44] | ResNet-152, DenseNet-161 | 99.49 | 99.04 | 97.57
2018 - Zhu B. et al. [68] | WaveMsNet | – | 93.75 | 79.10
2019 - Xinyu Li et al. [69] | Multi-Stream CNN | – | 93.70 | 83.50
2021 - S. Luz et al. [70] | Handcrafted + Deep | 96.80 | – | 86.20
2021 - Tripathi and Mishra [71] | Self-supervised learning (SSL) | – | 91.67 | –
2021 - Zhang et al. [72] | RNN | – | 93.70 | 86.10
2021 - Tripathi and Mishra [73] | Residual Network | – | 92.00 | –
2022 - Zhang et al. [74] | PSO + ensemble CBiLSTM | – | 93.00 | –
2020 - Medhat et al. [47] | Masked Conditional Neural Networks | 74.22 | 85.25 | 66.60
Proposed | CNN optimization with PSO | 91.17 | 88.50 | 74.85
Proposed (Data Augmentation) | CNN optimization with PSO | 98.45 | 98.64 | 96.77


in each layer need to be adjusted, it is understood how wide the solution space is that must be searched for a CNN. For this reason, obtaining the ideal solution in this wide solution space can only be achieved with optimization algorithms. Therefore, in this study, the CNN model with the best performance was sought using the PSO algorithm for the classification of urban sounds. The most important reason for using the PSO algorithm is its high convergence speed. Therefore, it is the ideal choice for CNNs with high computational costs, as it gives better results in fewer iterations.

One of the most important issues for CNN parameter optimization is the representation of these models during the optimization. In this study, a unique representation method was created by modifying the PSO algorithm. Thus, unlike some studies [54,75,76], the CNN parameters were used directly in the PSO algorithm without converting them to any numerical values. Compared to other studies [37,45,48,66] using pure CNN in the classification of urban sounds, the CNN models obtained with PSO performed much better. It has been seen that the CNN models obtained with PSO are deeper. It was understood that the weights of these deep networks would need more data to reach their optimum values, and the models were trained by generating synthetic data in the data sets. Especially with the high number of classes in the ESC-50 data set and the low amount of training data in each class, the weights of the CNN-ESC50 model could not be updated sufficiently with the raw data. When this data set was augmented and the model was trained again, the accuracy rate increased from 74.85 % to 96.77 %. This result suggests that CNN models obtained for ESC will also achieve higher accuracy rates in ESC classification after training on a larger data set. Accordingly, when the proposed models were trained only with ESC sounds, they gave lower results than their competitor [44] in all three datasets; that cutting-edge model achieved its high success rates with transfer learning.

In addition, some ensemble-based methods [70,74], which combine different features, achieved better results when compared with the proposed models without data augmentation, because these methods used machine learning and handcrafted features in addition to the features found by CNNs, which can be expected to achieve better results.

In future studies, it is thought that more effective transfer learning studies will be carried out, using PSO-based CNN models obtained on smaller datasets with less computational cost in the training processes on large datasets.

5. Conclusion

In the study, the PSO algorithm was used to obtain the highest-accuracy CNN model on the ESC-10, ESC-50, and Urbansound8k data sets, which are accepted as the state of the art in the classification of urban sounds. In the proposed method, the PSO algorithm was modified to optimize the CNN parameters. The best models for the ESC-10, ESC-50, and Urbansound8k datasets were CNN-ESC10, CNN-ESC50, and CNN-US8K, with 88.50, 74.85, and 91.17 % accuracy rates, respectively. These results were obtained from the raw data. The data sets were increased seven times with data augmentation techniques, and the same models were trained again. The results were obtained with 98.64, 96.77, and 98.45 % accuracy, respectively. When these results are compared with competing studies, the CNN models obtained for the classification of urban sounds were very successful.

CRediT authorship contribution statement

Özkan İnik: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

Data availability

Data will be made available on request.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Layer architectures and parameters of the CNN-ESC10, CNN-ESC50, and CNN-US8K models

Fig. A1. Table A1. Fig. A2. Table A2. Fig. A3. Table A3.

Fig. A1. Layer architecture of the CNN-ESC10 model.


Table A1
Parameter values in each layer of the CNN-ESC10 model.

No | Layer | Activations | Number of Filters | Filter Size | Stride | Total Parameters
1 | Input Layer | 224×224×3 | – | – | – | –
2 | Conv. | 224×224×140 | 140 | 6×6 | 1×1 | 6×6×3×140
3 | Batch Norm. | 224×224×140 | – | – | – | 1×1×140
4 | ReLU | 224×224×140 | – | – | – | –
5 | Avg. Pooling | 45×45×140 | – | 4×4 | 5×5 | –
6 | Conv. | 45×45×205 | 205 | 10×10 | 1×1 | 10×10×140×205
7 | Batch Norm. | 45×45×205 | – | – | – | 1×1×205
8 | ReLU | 45×45×205 | – | – | – | –
9 | Max. Pooling | 21×21×205 | – | 4×4 | 2×2 | –
10 | Conv. | 21×21×183 | 183 | 9×9 | 1×1 | 9×9×205×183
11 | Batch Norm. | 21×21×183 | – | – | – | 1×1×183
12 | ReLU | 21×21×183 | – | – | – | –
13 | Max. Pooling | 3×3×183 | – | 5×5 | 7×7 | –
14 | Conv. | 3×3×103 | 103 | 10×10 | 1×1 | 10×10×183×103
15 | Batch Norm. | 3×3×103 | – | – | – | 1×1×103
16 | ReLU | 3×3×103 | – | – | – | –
17 | Fully Connected | 1×1×763 | – | – | – | 763×927
18 | ReLU | 1×1×763 | – | – | – | –
19 | DropOut | 1×1×763 | – | – | – | –
20 | Fully Connected | 1×1×10 | – | – | – | 10×763
21 | Softmax | 1×1×10 | – | – | – | –
22 | Classification | 1×1×10 | – | – | – | –
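Under the assumption of 'same' padding for the stride-1 convolutions (which reproduces the activation sizes listed in Table A1), the CNN-ESC10 architecture can be transcribed into a framework such as PyTorch; the following is a hedged reconstruction for readers, not the study's Matlab definition, with the softmax folded into the training loss.

```python
import torch
import torch.nn as nn

cnn_esc10 = nn.Sequential(
    nn.Conv2d(3, 140, kernel_size=6, padding="same"),     # Table A1, row 2
    nn.BatchNorm2d(140), nn.ReLU(),
    nn.AvgPool2d(kernel_size=4, stride=5),                # 224 -> 45
    nn.Conv2d(140, 205, kernel_size=10, padding="same"),  # row 6
    nn.BatchNorm2d(205), nn.ReLU(),
    nn.MaxPool2d(kernel_size=4, stride=2),                # 45 -> 21
    nn.Conv2d(205, 183, kernel_size=9, padding="same"),   # row 10
    nn.BatchNorm2d(183), nn.ReLU(),
    nn.MaxPool2d(kernel_size=5, stride=7),                # 21 -> 3
    nn.Conv2d(183, 103, kernel_size=10, padding="same"),  # row 14
    nn.BatchNorm2d(103), nn.ReLU(),
    nn.Flatten(),                                         # 3*3*103 = 927
    nn.Linear(927, 763), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(763, 10),                                   # softmax in the loss
)

print(cnn_esc10(torch.randn(1, 3, 224, 224)).shape)       # -> [1, 10]
```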

Fig. A2. Layer architecture of the CNN-ESC50 model.

Table A2
Parameter values in each layer of the CNN-ESC50 model.

No | Layer | Activations | Number of Filters | Filter Size | Stride | Total Parameters
1 | Input | 224×224×3 | – | – | – | –
2 | Conv. | 224×224×109 | 109 | 2×2 | 1×1 | 2×2×3×109
3 | Batch Norm. | 224×224×109 | – | – | – | 1×1×109
4 | ReLU | 224×224×109 | – | – | – | –
5 | Avg. Pooling | 56×56×109 | – | 4×4 | 4×4 | –
6 | Conv. | 56×56×203 | 203 | 2×2 | 1×1 | 2×2×109×203
7 | Batch Norm. | 56×56×203 | – | – | – | 1×1×203
8 | ReLU | 56×56×203 | – | – | – | –
9 | Max. Pooling | 18×18×203 | – | 4×4 | 3×3 | –
10 | Conv. | 18×18×181 | 181 | 3×3 | 1×1 | 3×3×203×181
11 | Batch Norm. | 18×18×181 | – | – | – | 1×1×181
12 | ReLU | 18×18×181 | – | – | – | –
13 | Conv. | 18×18×210 | 210 | 4×4 | 1×1 | 4×4×181×210
14 | Batch Norm. | 18×18×210 | – | – | – | 1×1×210
15 | ReLU | 18×18×210 | – | – | – | –
16 | Conv. | 18×18×169 | 169 | 4×4 | 1×1 | 4×4×210×169
17 | Batch Norm. | 18×18×169 | – | – | – | 1×1×169
18 | ReLU | 18×18×169 | – | – | – | –
19 | Fully Connected | 1×1×850 | – | – | – | 850×54756
20 | ReLU | 1×1×850 | – | – | – | –
21 | DropOut | 1×1×850 | – | – | – | –
22 | Fully Connected | 1×1×50 | – | – | – | 50×850
23 | Softmax | 1×1×50 | – | – | – | –
24 | Classification | 1×1×50 | – | – | – | –


Fig. A3. Layer architecture of the CNN-US8K model.

Table A3
Parameter values in each layer of the CNN-US8K model.

No | Layer | Activations | Number of Filters | Filter Size | Stride | Total Parameters
1 | Input | 224×224×3 | – | – | – | –
2 | Conv. | 224×224×18 | 18 | 8×8 | 1×1 | 8×8×3×18
3 | ReLU | 224×224×18 | – | – | – | –
4 | Avg. Pooling | 37×37×18 | – | 5×5 | 6×6 | –
5 | Conv. | 37×37×214 | 214 | 8×8 | 1×1 | 8×8×18×214
6 | ReLU | 37×37×214 | – | – | – | –
7 | Avg. Pooling | 11×11×214 | – | 5×5 | 3×3 | –
8 | Conv. | 11×11×249 | 249 | 4×4 | 1×1 | 4×4×214×249
9 | ReLU | 11×11×249 | – | – | – | –
10 | Conv. | 11×11×229 | 229 | 3×3 | 1×1 | 3×3×249×229
11 | ReLU | 11×11×229 | – | – | – | –
12 | Avg. Pooling | 3×3×229 | – | 3×3 | 4×4 | –
13 | Fully Connected | 1×1×714 | – | – | – | 714×2061
14 | DropOut | 1×1×714 | – | – | – | –
15 | ReLU | 1×1×714 | – | – | – | –
16 | Fully Connected | 1×1×10 | – | – | – | 10×714
17 | Softmax | 1×1×10 | – | – | – | –
18 | Classification | 1×1×10 | – | – | – | –


Appendix B. Confusion matrices obtained by the CNN-ESC10 model. The highest accuracy was achieved in Fold-9


Appendix C. Confusion matrices obtained by the CNN-ESC50 model. The highest accuracy was achieved in Fold-3


Appendix D. Confusion matrices obtained by the CNN-US8K model. The highest accuracy was achieved in Fold-6


References

[1] Chu S, Narayanan S, Kuo C-C-J. Environmental sound recognition with time–frequency audio features. IEEE Trans Audio Speech Lang Process 2009;17:1142–58.
[2] Demir F, Turkoglu M, Aslan M, Sengur A. A new pyramidal concatenated CNN approach for environmental sound classification. Appl Acoust 2020;170:107520.
[3] Aumond P, Lavandier C, Ribeiro C, Boix EG, Kambona K, D'Hondt E, et al. A study of the accuracy of mobile technology for measuring urban noise pollution in large scale participatory sensing campaigns. Appl Acoust 2017;117:219–26.
[4] Cao J, Cao M, Wang J, Yin C, Wang D, Vidal P-P. Urban noise recognition with convolutional neural network. Multimed Tools Appl 2019;78:29021–41.
[5] Radhakrishnan R, Divakaran A, Smaragdis A. Audio analysis for surveillance applications. In: 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005, pp. 158–61.
[6] Crocco M, Cristani M, Trucco A, Murino V. Audio surveillance: A systematic review. ACM Computing Surveys (CSUR) 2016;48:1–46.
[7] Laffitte P, Wang Y, Sodoyer D, Girin L. Assessing the performances of different neural network architectures for the detection of screams and shouts in public transportation. Expert Syst Appl 2019;117:29–41.
[8] Heittola T, Mesaros A, Eronen A, Virtanen T. Audio context recognition using audio event histograms. In: 2010 18th European Signal Processing Conference, 2010, pp. 1272–6.
[9] Xu M, Xu C, Duan L, Jin JS, Luo S. Audio keywords generation for sports video analysis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 2008;4:1–23.
[10] Waibel A, Steusloff H, Stiefelhagen R. "CHIL - Computers in the human interaction loop. 5th Intern," in Workshop on Image Analysis for Multimedia Interactive Services, 2004.
[11] Ellis DP, Lee K. "Minimal-impact audio-based personal archives," in Proceedings of the 1st ACM workshop on Continuous archival and retrieval of personal experiences, 2004, pp. 39–47.
[12] Eronen AJ, Peltonen VT, Tuomi JT, Klapuri AP, Fagerlund S, Sorsa T, et al. Audio-based context recognition. IEEE Trans Audio Speech Lang Process 2005;14:321–9.
[13] Barchiesi D, Giannoulis D, Stowell D, Plumbley MD. Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Process Mag 2015;32:16–34.
[14] Li H, Ishikawa S, Zhao Q, Ebana M, Yamamoto H, Huang J. "Robot navigation and sound based position identification," in 2007 IEEE International Conference on Systems, Man and Cybernetics, 2007, pp. 2449–54.
[15] Lyon RF. Machine hearing: An emerging field [exploratory dsp]. IEEE Signal Process Mag 2010;27:131–9.
[16] Chu S, Narayanan S, Kuo C-C-J, Mataric MJ. "Where am I? Scene recognition for mobile robots using audio features," in 2006 IEEE International Conference on Multimedia and Expo, 2006, pp. 885–8.
[17] Huang J. "Spatial auditory processing for a hearing robot," in Proceedings, IEEE International Conference on Multimedia and Expo, 2002, pp. 253–6.
[18] Green M, Murphy D. Environmental sound monitoring using machine learning on mobile devices. Appl Acoust 2020;159:107041.
[19] Intani P, Orachon T. "Crime warning system using image and sound processing," in 2013 13th International Conference on Control, Automation and Systems (ICCAS 2013), 2013, pp. 1751–3.
[20] Torija AJ, Ruiz DP, Ramos-Ridao ÁF. A tool for urban soundscape evaluation applying support vector machines for developing a soundscape classification model. Sci Total Environ 2014;482:440–51.
[21] Romero VP, Maffei L, Brambilla G, Ciaburro G. Modelling the soundscape quality of urban waterfronts by artificial neural networks. Appl Acoust 2016;111:121–8.
[22] Agha A, Ranjan R, Gan W-S. Noisy vehicle surveillance camera: A system to deter noisy vehicle in smart city. Appl Acoust 2017;117:236–45.
[23] Ntalampiras S. Universal background modeling for acoustic surveillance of urban traffic. Digital Signal Process 2014;31:69–78.
[24] Piczak KJ. "ESC: Dataset for environmental sound classification," in Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 1015–8.
[25] Salamon J, Jacoby C, Bello JP. "A dataset and taxonomy for urban sound research," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 1041–4.
[26] Bisot V, Serizel R, Essid S, Richard G. Feature learning with matrix factorization applied to acoustic scene classification. IEEE/ACM Trans Audio Speech Lang Process 2017;25:1216–29.
[27] Stowell D, Giannoulis D, Benetos E, Lagrange M, Plumbley MD. Detection and classification of acoustic scenes and events. IEEE Trans Multimedia 2015;17:1733–46.
[28] Dhanalakshmi P, Palanivel S, Ramalingam V. Classification of audio signals using AANN and GMM. Appl Soft Comput 2011;11:716–23.
[29] Ludena-Choez J, Gallardo-Antolin A. Acoustic Event Classification using spectral band selection and Non-Negative Matrix Factorization-based features. Expert Syst Appl 2016;46:77–86.
[30] Salamon J, Bello JP. "Unsupervised feature learning for urban sound classification," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 171–5.
[31] Geiger JT, Helwani K. "Improving event detection for audio surveillance using Gabor filterbank features," in 2015 23rd European Signal Processing Conference (EUSIPCO), 2015, pp. 714–8.
[32] Mulimani M, Koolagudi SG. Segmentation and characterization of acoustic event spectrograms using singular value decomposition. Expert Syst Appl 2019;120:413–25.
[33] Xie J, Zhu M. Investigation of acoustic and visual features for acoustic scene classification. Expert Syst Appl 2019;126:20–9.
[34] Krizhevsky A, Sutskever I, Hinton GE. "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–105.
[35] Deng J, Berg A, Satheesh S, Su H, Khosla A, Fei-Fei L. ImageNet large scale visual recognition competition 2012 (ILSVRC2012). See: http://www.image-net.org/challenges/LSVRC/2012/.
[36] Piczak KJ. "Environmental sound classification with convolutional neural networks," in 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), 2015, pp. 1–6.
[37] Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett 2017;24:279–83.
[38] Takahashi N, Gygli M, Pfister B, Van Gool L. "Deep convolutional neural networks and data augmentation for acoustic event detection," arXiv preprint arXiv:1604.07160, 2016.
[39] Tokozume Y, Ushiku Y, Harada T. "Learning from between-class examples for deep sound recognition," arXiv preprint arXiv:1711.10282, 2017.
[40] Boddapati V, Petef A, Rasmusson J, Lundberg L. Classifying environmental sounds using image recognition networks. Procedia Comput Sci 2017;112:2048–56.
[41] Li S, Yao Y, Hu J, Liu G, Yao X, Hu J. An ensemble stacked convolutional neural network model for environmental event sound recognition. Appl Sci 2018;8:1152.
[42] Su Y, Zhang K, Wang J, Madani K. Environment sound classification using a two-stream CNN based on decision-level fusion. Sensors 2019;19:1733.
[43] Mushtaq Z, Su S-F. Environmental sound classification using a regularized deep convolutional neural network with data augmentation. Appl Acoust 2020;167:107389.
[44] Mushtaq Z, Su S-F, Tran Q-V. Spectral images based environmental sound classification using CNN with meaningful data augmentation. Appl Acoust 2021;172:107581.
[45] Chen Y, Guo Q, Liang X, Wang J, Qian Y. Environmental sound classification with dilated convolutions. Appl Acoust 2019;148:123–32.
[46] Abdoli S, Cardinal P, Koerich AL. End-to-end environmental sound classification using a 1D convolutional neural network. Expert Syst Appl 2019;136:252–63.
[47] Medhat F, Chesmore D, Robinson J. Masked Conditional Neural Networks for sound classification. Appl Soft Comput 2020;90:106073.
[48] Zhang X, Zou Y, Shi W. "Dilated convolution neural network with LeakyReLU for environmental sound classification," in 2017 22nd International Conference on Digital Signal Processing (DSP), 2017, pp. 1–5.
[49] Lim M, Lee D, Park H, Kang Y, Oh J, Park J-S, et al. "Convolutional Neural Network based Audio Event Classification," KSII Transactions on Internet & Information Systems, vol. 12, 2018.
[50] Akbal E. An automated environmental sound classification methods based on statistical and textural feature. Appl Acoust 2020;167:107413.
[51] Simonyan K, Zisserman A. "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[52] Tripathi AM, Mishra A. Adv-ESC: Adversarial attack datasets for an environmental sound classification. Appl Acoust 2022;185:108437.
[53] Tuncer T, Subasi A, Ertam F, Dogan S. A novel spiral pattern and 2D M4 pooling based environmental sound classification method. Appl Acoust 2020;170:107508.
[54] Ma B, Li X, Xia Y, Zhang Y. Autonomous deep learning: A genetic DCNN designer for image classification. Neurocomputing 2020;379:152–61.
[55] Gonçalves CB, Souza JR, Fernandes H. CNN architecture optimization using bio-inspired algorithms for breast cancer detection in infrared images. Comput Biol Med 2022;142:105205.
[56] Singh P, Chaudhury S, Panigrahi BK. Hybrid MPSO-CNN: Multi-level particle swarm optimized hyperparameters of convolutional neural network. Swarm Evol Comput 2021;63:100863.
[57] Zhang Z, Xu S, Qiao T, Zhang S, Cao S. "Attention based convolutional recurrent neural network for environmental sound classification," in Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2019, pp. 261–71.
[58] Kennedy J, Eberhart R. "Particle swarm optimization (PSO)," in Proc. IEEE International Conference on Neural Networks, Perth, Australia, 1995, pp. 1942–8.
[59] Dev D. Deep Learning with Hadoop. Packt Publishing Ltd; 2017.
[60] Özkan İ, Ülker E. Derin Öğrenme ve Görüntü Analizinde Kullanılan Derin Öğrenme Modelleri [Deep learning and the deep learning models used in image analysis]. Gaziosmanpaşa Bilimsel Araştırma Dergisi 2017;6:85–104.
[61] Junior FEF, Yen GG. Particle swarm optimization of deep neural networks architectures for image classification. Swarm Evol Comput 2019;49:62–74.
[62] Passricha V, Aggarwal RK. PSO-based optimized CNN for Hindi ASR. Int J Speech Technol 2019;22:1123–33.
[63] Inik O, Seker H. "CnnSound: Convolutional Neural Networks for the Classification of Environmental Sounds," in 2020 The 4th International Conference on Advances in Artificial Intelligence, 2020, pp. 79–84.
[64] Aytar Y, Vondrick C, Torralba A. "SoundNet: Learning sound representations from unlabeled video," in Advances in Neural Information Processing Systems, 2016, pp. 892–900.
[65] Ye J, Kobayashi T, Murakawa M. Urban sound event classification based on local and global features aggregation. Appl Acoust 2017;117:246–56.
[66] Dai W, Dai C, Qu S, Li J, Das S. "Very deep convolutional neural networks for raw waveforms," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 421–5.
[67] Pons J, Serra X. "Randomly weighted CNNs for (music) audio classification," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 336–40.
[68] Zhu B, Wang C, Liu F, Lei J, Huang Z, Peng Y, et al. "Learning environmental sounds with multi-scale convolutional neural network," in 2018 International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1–8.
[69] Li X, Chebiyyam V, Kirchhoff K. "Multi-stream network with temporal attention for environmental sound classification," arXiv preprint arXiv:1901.08608, 2019.
[70] Luz JS, Oliveira MC, Araujo FH, Magalhães DM. Ensemble of handcrafted and deep features for urban sound classification. Appl Acoust 2021;175:107819.
[71] Tripathi AM, Mishra A. Self-supervised learning for Environmental Sound Classification. Appl Acoust 2021;182:108183.
[72] Zhang Z, Xu S, Zhang S, Qiao T, Cao S. Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing 2021;453:896–903.
[73] Tripathi AM, Mishra A. Environment sound classification using an attention-based residual neural network. Neurocomputing 2021;460:409–23.
[74] Zhang L, Lim CP, Yu Y, Jiang M. Sound classification using evolving ensemble models and Particle Swarm Optimization. Appl Soft Comput 2022;116:108322.
[75] Xie L, Yuille A. "Genetic CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1379–88.
[76] Sinha T, Haidar A, Verma B. "Particle swarm optimization based approach for finding optimal values of convolutional neural network parameters," in 2018 IEEE Congress on Evolutionary Computation (CEC), 2018, pp. 1–6.
