
Chapter 4

Design and Implementation

To implement acoustic scene classification, the proposed system is designed with five processing steps: preprocessing, segmentation, acoustic feature extraction, feature normalization, and acoustic scene classification with channel attention, as shown in Figure 4.1.

(Figure 4.1 shows the processing pipeline: input audio files → preprocessing (framing and windowing) → log-mel feature extraction → feature normalization → residual network model with channel attention → classification into scene labels such as beach, bus, café/restaurant, office, library, and train.)
Figure 4.1. Overview of the System

4.1. Preprocessing of the System

The proposed ASC system aims to handle variable-length audio signals by directly learning a discriminative representation from the audio signal, which yields effective classification performance on the various sounds present in the environment.
4.1.1. Frame Blocking

The input data for this system architecture consists of audio recordings from the DCASE 2016 or 2017 dataset, which contain various acoustic scenes captured in real-world environments. In this system, particularly when working with datasets like DCASE 2016 and 2017, the default segment length of 10 seconds may not be optimal for all real-world applications due to the varying dynamics and characteristics of different acoustic scenes. To address this, a granularity analysis is performed with segment lengths ranging from 1 second to 10 seconds to capture the dynamic nature of acoustic scenes more comprehensively.

Figure 4.2. Framing the Input Audio Signal into Several Frames


As sound captured from the environment can have diverse durations, it becomes crucial
to adapt audio signals of different lengths for analysis. The audio signals are therefore segmented into overlapping frames, which capture temporal dynamics more effectively. Each frame of the audio signal then undergoes a Short-Time Fourier Transform (STFT), described in the next section, which converts it into the frequency domain.
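
As a simple illustration of the frame-blocking step described above, a minimal sketch is given below; the 50 % overlap, the sampling rate, and the use of NumPy are assumptions for this example rather than values fixed in the text.

```python
import numpy as np

def frame_signal(signal, frame_length, hop_length):
    """Split a 1-D audio signal into overlapping frames.

    frame_length and hop_length are given in samples; any trailing
    samples that do not fill a complete frame are dropped.
    """
    num_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.stack([
        signal[i * hop_length : i * hop_length + frame_length]
        for i in range(num_frames)
    ])  # shape: (num_frames, frame_length)

# Example: 10 s of audio at 44.1 kHz split into 1 s segments with 50 % overlap
sr = 44100
audio = np.random.randn(10 * sr)                       # placeholder waveform
segments = frame_signal(audio, frame_length=sr, hop_length=sr // 2)
print(segments.shape)                                  # (19, 44100)
```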

4.1.2. Windowing and STFT for Variable-Length Segments

In the proposed ASC system, each segment undergoes a straightforward preprocessing


step to enhance the quality of subsequent analysis. This starts with windowing each segment individually, regardless of its duration. The signal is gradually tapered to zero at its endpoints using a Hamming window, which reduces spectral leakage and yields clearer frequency representations.
To maintain consistent frequency resolution across segments of varying lengths, the
window size for the STFT is adjusted based on each segment's duration. Shorter segments can
require longer window sizes for adequate frequency resolution, while longer segments can
tolerate shorter ones. This dynamic adjustment ensures accurate frequency representation in
resulting spectrograms, crucial for reliable feature extraction.

Additionally, overlapping segments are used to capture temporal dynamics effectively.


Based on the length of each segment, the degree of overlap between them is set. Less overlap is needed for longer segments, whereas more is needed for shorter ones in order to capture fast temporal variations. This flexible approach enables a more detailed characterization of the time-dependent variations that exist in acoustic environments.

Once windowed and overlapped, each segment is processed using the STFT, producing a
spectrogram that represents its frequency content over time. This spectrogram serves as the basis
for subsequent feature extraction, enabling the system to derive informative features for
classification. The next section describes feature extraction.
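
To illustrate the windowing and STFT step, a hedged sketch is given below; the hop length (50 % frame overlap within a segment) and the specific mapping from segment duration to FFT size are illustrative guesses, not values prescribed in the text.

```python
import numpy as np
import librosa

def segment_spectrogram(segment, n_fft):
    """Apply a Hamming window and the STFT to one audio segment.

    The analysis window length (n_fft) is chosen per segment, following
    the idea that shorter segments can use longer analysis windows.
    """
    stft = librosa.stft(segment, n_fft=n_fft, hop_length=n_fft // 2,
                        window="hamming")
    return np.abs(stft)                      # magnitude spectrogram

sr = 44100
short_seg = np.random.randn(1 * sr)          # 1 s segment: longer analysis window
long_seg = np.random.randn(10 * sr)          # 10 s segment: shorter analysis window
S_short = segment_spectrogram(short_seg, n_fft=2048)
S_long = segment_spectrogram(long_seg, n_fft=1024)
```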
4.2. Acoustic Feature Extraction

Feature extraction is a pivotal step in the process of the proposed ASC system,
transforming raw audio signals into a structured set of features suitable for deep residual learning
network models. In this work, log-mel spectrograms are employed as the primary feature
extraction technique, combined with nearest neighbor interpolation techniques as used in the
DCASE 2016 and 2017 challenges, to capture the essential characteristics of various acoustic
scenes.

4.2.1. Log Mel Spectrogram

In this system, log mel spectrogram is used to extract the characteristics of audio signals,
providing a rich representation that is useful for distinguishing between different acoustic
environments. The process of computing the log mel spectrogram involves several steps.
Initially, the audio signals are resampled to a consistent sampling rate, to ensure uniformity
across the dataset. The amplitude of the audio signals is then normalized to reduce the effects of
varying loudness levels, ensuring that the dynamic range of the signals is comparable.

As mentioned before, the continuous audio signal is divided into short, overlapping
frames, ranging from 1 s to 10 s, with a 10 ms overlap. This framing ensures that important features of the audio are captured effectively for the proposed ASC system. Each frame is
multiplied by the Hamming window function to minimize spectral leakage and improve the
frequency resolution of the subsequent analysis.

STFT is applied to each frame to convert the time-domain signal into the frequency
domain, resulting in a spectrogram that represents the magnitude of the signal across different
frequency bins over time. The spectrogram is then passed through a mel filter bank to convert it
to the mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance
from one another. The mel scale emphasizes frequencies that are more important to human
perception. Finally, logarithmic scaling is applied to compress the dynamic range of the mel
spectrogram, making the features more robust to variations in loudness.
Log-mel spectrograms, when used in combination with advanced techniques, have been
instrumental in achieving good results in ASC tasks. In this system, nearest neighbor
interpolation techniques are employed as a feature engineering technique to enhance the feature
extraction process.
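
A minimal sketch of the log-mel computation described above is shown below; the concrete values of target_sr, n_fft, and n_mels are illustrative assumptions, since the text does not fix them.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, target_sr=44100, n_fft=2048, n_mels=128):
    """Compute a log-mel spectrogram following the steps described above:
    resampling, amplitude normalization, framing/windowing, STFT,
    mel filtering, and logarithmic compression."""
    y, sr = librosa.load(path, sr=target_sr)            # resample to a common rate
    y = y / (np.max(np.abs(y)) + 1e-9)                  # amplitude normalization
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=n_fft // 2,
        window="hamming", n_mels=n_mels)                # STFT + mel filter bank
    return librosa.power_to_db(mel)                     # logarithmic compression
```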

4.2.2. Enhancing Log-Mel Features through Normalization

In this system, normalization is a crucial step in the preprocessing pipeline for audio
feature extraction, especially when working with log-mel spectrograms. By applying
normalization techniques, the quality of the log-mel features is enhanced, leading to improved
performance in ASC tasks. Because nearest-neighbor interpolation is straightforward to use, efficient, and effective at preserving audio information, it has great potential in the field of ASC research. Normalization is applied to log-mel spectrograms to enhance their robustness and comparability, and the normalization procedure, together with nearest-neighbor interpolation, is described in this part of the pipeline.

Initially, the input audio file is divided into several equal-length segments within the range of one to ten seconds. Each segment represents a portion of the audio signal, which allows for targeted processing and analysis. The feature extraction stage begins after segmentation. In this stage, each audio segment is converted into a log-mel spectrogram, a process that involves multiple mathematical operations. The audio segment undergoes processing through a Fast Fourier Transform (FFT), resulting in a spectrogram representation denoted by S, as defined in (4.1).

S = FFT(y) (4.1)

Subsequently, a Mel filter bank is employed to map the frequencies within the spectrogram to the mel scale, producing M, as defined in (4.2).

M = MelFilterBank(S) (4.2)

Finally, the logarithm function is applied to M to obtain the log-mel spectrogram, denoted as log M in (4.3).

log M = log(M) (4.3)


Each log-mel spectrogram is resized to a uniform size of 128x128 features using nearest-
neighbor interpolation, as defined in (4.4) and (4.5). To calculate the scaling factors required for
resizing, the following formulas are utilized:

scale_x = 128 / M (4.4)

scale_y = 128 / N (4.5)

In these equations, M and N represent the dimensions of the original log-mel spectrogram. The scaling factors are calculated by dividing the target size (128 x 128) by the dimensions of the original spectrogram.

Each pixel in the resized image is mapped to its corresponding position in the original
log-mel spectrogram using the following equations:

x = ⌊ i × scale_x ⌋ (4.6)

y = ⌊ j × scale_y ⌋ (4.7)

In these equations, i and j represent the pixel coordinates in the resized image, and (x, y) represents the corresponding coordinates in the original log-mel spectrogram. The values are rounded down to the nearest integer using the floor function ⌊⋅⌋ to ensure alignment with the discrete grid of the spectrogram. The combined mapping is given in equation (4.8).

(x, y) = (⌊ i × scale_x ⌋, ⌊ j × scale_y ⌋) (4.8)

Finally, the value of the nearest pixel in the original spectrogram is assigned to the
corresponding pixel in the resized image, completing the nearest-neighbor interpolation process.
This systematic method coordinates the extraction of features, segmentation, and scaling of audio data, all of which are supported by the formulas above. Audio inputs are converted into log-mel spectrograms, resized to a standard format, and prepared for further analysis. When computational simplicity and efficiency are valued more highly than interpolation accuracy, nearest-neighbor interpolation proves particularly beneficial, making it appropriate for certain applications in ASC systems.
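
The resizing step can be sketched as follows. Note that the index mapping here divides each output index by the scale factors of (4.4) and (4.5) (equivalently, multiplies by the original-to-target ratio), which is the standard nearest-neighbor form; the 128 x 431 example input is an arbitrary illustrative size.

```python
import numpy as np

def resize_nearest(logmel, target=128):
    """Resize a log-mel spectrogram to (target, target) with
    nearest-neighbor interpolation, following (4.4)-(4.8)."""
    M, N = logmel.shape                      # original dimensions
    scale_x = target / M                     # equation (4.4)
    scale_y = target / N                     # equation (4.5)
    out = np.empty((target, target), dtype=logmel.dtype)
    for i in range(target):
        for j in range(target):
            # Map the output pixel back onto the original grid and copy
            # the nearest original value (floor of the inverse scaling).
            x = min(int(np.floor(i / scale_x)), M - 1)
            y = min(int(np.floor(j / scale_y)), N - 1)
            out[i, j] = logmel[x, y]
    return out

resized = resize_nearest(np.random.randn(128, 431))   # e.g. 128 mel bins x 431 frames
print(resized.shape)                                   # (128, 128)
```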
There are various benefits that Natural Neighbor Interpolation provides for Acoustic
Scene Classification (ASC) systems.

 Preservation of Feature Boundaries: Meaningful feature extraction from audio data is a prerequisite for ASC systems. Because Natural Neighbor Interpolation maintains these feature boundaries during interpolation, it yields more realistic depictions of acoustic scenes.
 Preserving Local Variability: When using Natural Neighbor Interpolation, the local
density and distribution of data points are taken into account. Because acoustic
situations can change dramatically in short amounts of time in ASC, feature
conservation helps the algorithm identify minute changes in the audio data, improving
classification accuracy.
 Flexibility with Data Structures: ASC systems frequently handle audio recordings that are inconsistently distributed and were recorded in a variety of settings. Natural Neighbor Interpolation is an effective technique for handling the variety of data types found in ASC tasks because it does not require the data to be organized on a regular grid or pattern.
 Adaptation to Varying Data Densities: Audio recordings in ASC datasets may exhibit
varying densities, with some scenes being more densely represented than others.
Natural Neighbor Interpolation dynamically adjusts to these variations, assigning
appropriate weights to different regions of the feature space based on data density.
This adaptability improves the robustness of ASC systems across diverse audio
samples.
In this research, the nearest neighbor interpolation method has been employed within the
log mel spectrograms. This method serves as a crucial component in the processing pipeline,
facilitating the transformation of raw audio signals into a more manageable and informative
representation. By leveraging nearest neighbor interpolation, the log mel spectrograms can be
resized efficiently while preserving the essential spectral characteristics of the audio signals. This
allows for more effective analysis and interpretation of the audio data, enhancing the overall
performance of the research objectives.

4.3. Acoustic Scene Classification of the System

In this section, a powerful deep learning technique is used to classify audio environments.
It starts by preprocessing the audio data, turning it into log Mel-spectrograms to extract
important features. These spectrograms are then fed into a deep residual learning network
classifier, which is well-known for its skill in understanding complex patterns. During this phase,
the deep residual learning network classifier learns to associate the extracted features from the
log Mel-spectrograms with their corresponding labels (the environment in which the recording
was made). After training, the classifier is evaluated using a separate set of unlabeled audio
samples (test set). The classifier’s performance is analyzed by measuring the errors in its
predictions. The aim of this study is to precisely determine the environment where an audio
recording originated, such as a park or beach.

4.3.1. Deep Residual Learning Network

In this research, the deep residual learning network classifier serves as a cornerstone of
the proposed system for Acoustic Scene Classification (ASC). At its core lies the concept of
residual learning, which addresses the challenges associated with training deep neural networks
by introducing shortcut connections within residual blocks. These blocks enable the network to
learn residual functions, capturing the difference between the desired output and the current
output of the block. Mathematically, the output of a residual block is obtained by adding the
input to the residual mapping computed by the block. This innovative approach mitigates the
vanishing gradient problem and facilitates the training of networks with many layers. Figure 4.3 shows the overall proposed deep residual learning network structure.

The fundamental unit of a residual learning network is a residual block. A residual block
consists of two main paths: the identity path and the shortcut path. The identity path simply
passes the input through a sequence of layers without any modification. The shortcut path,
however, provides a shortcut for the input to jump over one or more layers.

Figure 4.3. Overall Proposed Deep Residual Learning Network Structure

In a deep residual learning network, a shortcut connection skips one or more layers and is added to the output of deeper layers. This provides a direct path for information to flow through the network and forms a residual block. Residual networks are built by stacking these residual blocks together. The idea behind this approach is that, instead of the layers learning the underlying mapping directly, the network is allowed to fit the residual mapping. Let x represent the input to a residual block. The output of the residual block is given in equation (4.9).

H(x) = F(x) + x (4.9)

In the initial stages of network training, the mapping function, H(x) is employed to fit the
network. This mapping function represents the desired relationship between the input data x and
the corresponding output. By attempting to fit the network to this initial mapping function, the
network aims to learn the underlying patterns and correlations present in the training data. This
process involves adjusting the parameters of the network, such as weights and biases, through
optimization algorithms like gradient descent, to minimize the discrepancy between the predicted
outputs of the network and the true outputs provided in the training data. As training progresses,
the network iteratively refines its parameters to better approximate the desired mapping H(x),
ultimately leading to improved performance and generalization on unseen data.
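
To make the residual formulation concrete, a minimal sketch of such a block is given below; PyTorch, the two-convolution body, and the 1x1 projection on the shortcut path are assumptions for illustration, since the text does not name a specific framework.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block implementing H(x) = F(x) + x from (4.9).

    Layer sizes are illustrative; a 1x1 projection is used on the skip
    path whenever the channel count changes."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(                       # F(x): the residual mapping
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1))   # shortcut path
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))    # H(x) = F(x) + x

h = ResidualBlock(8, 16)(torch.randn(1, 8, 64, 64))      # -> torch.Size([1, 16, 64, 64])
```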

The proposed system uses log-mel spectrogram images as its input. In order to construct a downsampled (pooled) feature map, the average value over patches of a feature map is calculated using the average pooling method, which is commonly employed following a convolutional layer. The proposed channel attention network uses soft attention to determine how strongly to emphasize the separate channels. Activation functions are used to introduce nonlinearity into the DRLN; the Rectified Linear Unit (ReLU), Softmax, and Sigmoid activation functions have been employed. In practice, ReLU has been found to be superior to sigmoidal functions for intermediate network activations. The output of the channel attention block for input X is given by:

X^(l+1) = F_avg(X) + F_max(X) (4.10)

where F_avg(·) performs average pooling, 1×1 convolution, 3×3 convolution, batch normalization, and ReLU activation, and F_max(·) performs maximum pooling, 1×1 convolution, 3×3 convolution, batch normalization, and ReLU activation.
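
A hedged sketch of this block is shown below; the 3 x 3 stride-1 'same' pooling and the 128 → 192 → 256 channel progression are taken from the later description of the third residual block, and the implementation itself is an illustrative assumption rather than the exact code used.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Sketch of the channel attention block of (4.10):
    X^(l+1) = F_avg(X) + F_max(X)."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        def branch(pool):
            return nn.Sequential(
                pool,                                     # max or average pooling
                nn.Conv2d(in_ch, mid_ch, 1),              # 1x1 convolution
                nn.Conv2d(mid_ch, out_ch, 3, padding=1),  # 3x3 convolution
                nn.BatchNorm2d(out_ch),                   # batch normalization
                nn.ReLU(inplace=True),                    # ReLU activation
            )
        self.f_max = branch(nn.MaxPool2d(3, stride=1, padding=1))
        self.f_avg = branch(nn.AvgPool2d(3, stride=1, padding=1))

    def forward(self, x):
        return self.f_avg(x) + self.f_max(x)              # element-wise addition

y = ChannelAttentionBlock(128, 192, 256)(torch.randn(1, 128, 8, 8))  # -> (1, 256, 8, 8)
```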

(Figure 4.4 illustrates the convolution of a 4 x 4 x 1 input with a 3 x 3 x 8 filter, producing a 4 x 4 x 8 output.)
Figure 4.4. Convolution Layer in Deep Residual Learning Network Architecture

The input features in the proposed system are convolved using filters of size 3 x 3,
starting with an initial sizing of [128 x 128 x 1], resulting in an output of size [128 x 128 x 8].
The input data is converted into a higher-dimensional feature space, simplifying the extraction of
discriminative features for classification tasks. This transformation helps models capture
complex patterns, improving classification accuracy. The system is structured around two-
layered building blocks, each incorporating a convolutional layer with a kernel size of (3x3), a
Stride (S) of 1, and Padding (P) set to 'same'. The number of channels progressively increases
through the layers, with values of 8, 16, 32, 64, 128, 192, and 256, respectively. Within each
building block, a skip connection is established between the input and output, facilitating the
flow of information and mitigating potential vanishing gradient issues commonly encountered in
deep neural networks.

For a given input feature with dimensions W_in × H_in and a filter with dimensions K × K,
the convolutional operation is performed with a stride of S and padding P. The output size of the
convolutional layer is determined by Equation (4.11) and (4.12), which governs the spatial
dimensions of the output feature map based on the input dimensions, filter size, stride, and
padding parameters.

W_out = (W_in − K + 2P) / S + 1 (4.11)

H_out = (H_in − K + 2P) / S + 1 (4.12)

where W_out and H_out represent the width and height of the output feature map, respectively, and W_in and H_in denote the width and height of the input feature map. K is the kernel size (3 in this case), S is the stride (1 in this case), and P is the padding size (determined by the 'same' setting in this context). By applying Equations (4.11) and (4.12) to each convolutional layer within the building blocks, the spatial dimensions of the feature maps are calculated at each stage of the network. The width and height of the output feature map (W_out and H_out) are computed based on the dimensions of the input feature map (W_in and H_in), the kernel size (K), the stride (S),
and the padding (P). The step size that the convolutional filter takes while traversing the input
feature map is determined by the stride (S) value. One pixel at a time is moved by the filter when
the stride value is 1. The output feature map's spatial dimensions are managed by the padding
(P), especially when the stride is larger than 1 or when the dimensions of the input and output
feature maps must coincide. The input size, padding, stride, and kernel size are the variables used
in the output size calculations to determine the output feature map's dimensions. The formula
ensures that the output dimensions fit the required configuration by handling the reduction in size
caused by convolutional operations and adjusting for the padding. The output size calculation for
Convolutional Layer Conv_1 can be expressed using the provided equation:

Output_Conv_1 = ((W_in − K + 2P) / S + 1) × ((H_in − K + 2P) / S + 1) (4.13)

Substituting the given values:

Output_Conv_1 = ((128 − 3 + 2 × 1) / 1 + 1) × ((128 − 3 + 2 × 1) / 1 + 1)

= ((125 + 2) / 1 + 1) × ((125 + 2) / 1 + 1)

= (127 + 1) × (127 + 1)

= 128 × 128
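
The same arithmetic can be written as a small helper function; this is only a sketch of the calculation in (4.11) and (4.12), using integer division.

```python
def conv_output_size(w_in, k, p, s):
    """Spatial output size of a convolution, as in (4.11) and (4.12)."""
    return (w_in - k + 2 * p) // s + 1

# Conv_1: 128 x 128 input, 3 x 3 kernel, 'same' padding (p = 1), stride 1
print(conv_output_size(128, k=3, p=1, s=1))   # 128
```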

The output size of Conv_1 is 128 x 128. After Conv_1, Batch Normalization (BN_1) is
applied. Batch Normalization is a technique used to alleviate the effect of unstable gradients
within deep neural networks. It normalizes the activations of each batch across the batch
dimension, effectively stabilizing and accelerating the training process. The feature maps produced by Convolutional Layer Conv_1 are then passed through the Rectified Linear Unit (ReLU) activation function, known as ReLU_1. ReLU is a popular activation function for introducing non-linearity into neural networks. Equation (4.14) defines it as follows:

f(x) = max(0, x) (4.14)

According to this equation, ReLU outputs the maximum of zero and the input value x. Put differently, ReLU keeps positive input values unaltered while replacing any negative input value with zero. For instance, ReLU_1 operates element-wise on each value of a 4x4 feature matrix, as shown in Figure 4.5, replacing any negative values with zero and keeping any positive values.

Figure 4.5. ReLU Activation Layer in DRLN architecture

The feature maps are transmitted through Conv_2, the second convolutional layer, once
the ReLU activation function (ReLU_1) has been applied.

The output size of a MaxPooling layer can be calculated using the following equations:

Output_MaxPooling = ((W_in − K) / S + 1) × ((H_in − K) / S + 1) (4.15)

Substituting the given values:

Output_MaxPool_1 = ((128 − 3) / 2 + 1) × ((128 − 3) / 2 + 1) × 8

= (125 / 2 + 1) × (125 / 2 + 1) × 8

= [64 × 64 × 8]

Conv_2 uses the same filter size, stride, and padding parameters as Conv_1 to apply
convolutional operations to the input feature maps. To stabilize and normalize the activations,
Batch Normalization (BN_2) is used after Conv_2. During training, this phase keeps the network
stable and avoids vanishing or exploding gradient problems. To add non-linearity to the network
and improve its representational capacity and ability to learn complicated features, an additional
ReLU activation function (ReLU_2) is then added.

(Figure 4.6 illustrates 3 x 3 max pooling with a stride of 2 applied to an input feature map.)
Figure 4.6. MaxPooling Layer in DRLN architecture

The convolutional layer Conv_2 in this study uses a filter size of [3 x 3] with a stride of
2, and the input feature map is [128 x 128 x 8]. The output feature map's spatial dimensions and
channel count are established by this setup. All things considered, the set of operations involving
Conv_2, BN_2, ReLU_2, and MaxPool_1 helps to extract features from the input data in a
hierarchical manner, gradually decreasing the spatial dimensions and increasing the
discriminative power, which in turn improves performance in ASC tasks.

ReLU activation (ReLU_2) is applied to the feature map of size [128 x 128 x 8] at the beginning of the procedure, and this serves as the input for the first residual block, as shown in Figure 4.7. First, the spatial dimensions are reduced to [64 x 64 x 8] through the use of MaxPooling (MaxPool_1), the first of several operations in this block. An output feature map measuring [64 x 64 x 16] is obtained by applying two convolutional layers, Conv_3 and Conv_4, followed by Batch Normalization (BN_3 and BN_4) and ReLU activation (ReLU_3 and ReLU_4). As shown in Figure 4.7, the output of ReLU_4 is combined via a skip connection with skipConv_1, a 1 x 1 convolutional layer, to generate Add_1 of size [64 x 64 x 16]. This procedure helps to maintain important features across the network and improves gradient flow. Add_1 is then processed by MaxPooling (MaxPool_2) in order to further downsample the feature map, resulting in [32 x 32 x 16].
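
Continuing with the same framework assumption as the earlier sketches, the first residual block might be written roughly as follows; the 2 x 2 pooling window and the skip connection taken from the MaxPool_1 output are inferences from Figure 4.7 rather than stated details.

```python
import torch
import torch.nn as nn

class FirstResidualBlock(nn.Module):
    """Sketch of the first residual block of Figure 4.7: MaxPool_1,
    Conv_3/BN_3/ReLU_3, Conv_4/BN_4/ReLU_4, and a 1x1 skipConv_1 into Add_1."""
    def __init__(self):
        super().__init__()
        self.maxpool_1 = nn.MaxPool2d(2)                 # 128x128x8 -> 64x64x8 (2x2 pooling assumed)
        self.conv_3 = nn.Conv2d(8, 16, 3, padding=1)
        self.bn_3 = nn.BatchNorm2d(16)
        self.conv_4 = nn.Conv2d(16, 16, 3, padding=1)
        self.bn_4 = nn.BatchNorm2d(16)
        self.skip_conv_1 = nn.Conv2d(8, 16, 1)           # skipConv_1: 1 x 1 x 16
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                # x: [N, 8, 128, 128]
        p = self.maxpool_1(x)                            # [N, 8, 64, 64]
        y = self.relu(self.bn_3(self.conv_3(p)))         # ReLU_3
        y = self.relu(self.bn_4(self.conv_4(y)))         # ReLU_4
        return y + self.skip_conv_1(p)                   # Add_1: [N, 16, 64, 64]

add_1 = FirstResidualBlock()(torch.randn(1, 8, 128, 128))
print(add_1.shape)                                       # torch.Size([1, 16, 64, 64])
```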

(Block structure: ReLU_2 [128 x 128 x 8] → MaxPool_1 [64 x 64 x 8] → Conv_3 [64 x 64 x 16] → BN_3 → ReLU_3 → Conv_4 → BN_4 → ReLU_4 [64 x 64 x 16] → Add_1, with a parallel skipConv_1 (1 x 1 x 16) feeding Add_1.)
Figure 4.7. Residual Block in DRLN architecture

The output of the feature map is [32 x 32 x 32] after going through Conv_5 and Conv_6, Batch Normalization (BN_5 and BN_6), and ReLU activation (ReLU_5 and ReLU_6). The second residual block then takes the ReLU_6 output, starting with MaxPooling (MaxPool_3), which further reduces the spatial dimensions to [16 x 16 x 32]. Conv_7 and Conv_8, Batch Normalization (BN_7 and BN_8), and ReLU activation (ReLU_7 and ReLU_8) come next. The output of ReLU_8 is joined with a 1 x 1 convolutional layer called skipConv_2 to create Add_2, which has dimensions of [16 x 16 x 64]. After that, Add_2 goes through MaxPooling (MaxPool_4), which further reduces the feature map's resolution to [8 x 8 x 64]. The output of the feature map is [8 x 8 x 128] after passing through Conv_9 and Conv_10, Batch Normalization (BN_9 and BN_10), and ReLU activation (ReLU_9 and ReLU_10).
These subsequent layers and operations build on the robust, multi-level feature representations obtained from the residual blocks, ultimately improving the network's performance on various ASC tasks.

4.3.2. Channel Attention

In this research, a channel attention module is incorporated into the deep residual network
architecture. It focuses on emphasizing important feature channels. This mechanism enhances
the network's ability to capture critical information, thereby improving performance in ASC
tasks within the proposed system. By dynamically adjusting the importance of each channel,
channel attention aids the network classifier in learning more effective feature representations.

ReLU_10 is fed into the third residual block, which is the final block in the hierarchical
feature extraction process. A channel attention mechanism is implemented in the third block, as
shown in Figure 4.8.

(Block structure: ReLU_10 [8 x 8 x 128] feeds two parallel branches, MaxPool_5 and AvgPool_5 [8 x 8 x 128], followed by Conv_11_pre / Conv_11_r_pre [8 x 8 x 192], Conv_11 / Conv_11_r [8 x 8 x 256], BN_11 / BN_11_r, and ReLU_11 / ReLU_11_r [8 x 8 x 256], which are combined in Add_3 [8 x 8 x 256].)
Figure 4.8. Enhanced Residual Block (3rd block) with Embedded Channel Attention Mechanism
Through channel-specific feature selection, this technique seeks to improve the network's
representation power. First, max pooling and average pooling are applied simultaneously to the
input feature maps. Whereas average-pooled (AvgPool_5) [8 x 8 x 128] has the same filter size,
stride, and padding configuration as max pooling, max-pooled (MaxPool_5) [8 x 8 x 128] uses a
3 x 3 filter size with a stride of 1 and 'same' padding. There are two reasons to use max pooling
and average pooling at the same time: Max pooling makes it possible to gather distinguishing
characteristics, which permits more focused channel-wise attention. Max pooling highlights the
most notable characteristics in each channel by choosing the maximum value inside each pooling
frame.

In contrast, average pooling aggregates spatial information across the feature maps. It provides a wider view of the input data, capturing the overall spatial distribution of characteristics by averaging the values within each pooling window. The channel
attention mechanism is able to achieve a better balance between aggregating spatial information
and emphasizing unique features by combining the insights gathered from max pooling and
average pooling [78]. This allows for more discriminative and informed feature representation
within the network. In the third residual block, an embedded channel attention module is
introduced, incorporating two pooling functions: max pooling and average pooling. These
pooling functions are defined by the following formulas:

P_max = max_{(p,q)=(0,0)}^{(h,w)} { x_(p,q) } (4.16)

P_avg = (1 / (h × w)) ∑_{(p,q)=(0,0)}^{(h,w)} x_(p,q) (4.17)

The results of the max pooling and average pooling processes are denoted by P_max and P_avg, respectively. The kernel size for both pooling procedures is h × w, where h and w represent the pooling window's height and width, respectively. The feature maps undergo these pooling processes concurrently, and the feature map value at the pth column and qth row within the pooling window is indicated by the notation x_(p,q). After the pooling operations, the outputs are each convolved twice along the channel dimension. Conv_11_pre and Conv_11_right_pre are the two convolutional layers used in this step. Conv_11_right_pre acts on the average-pooled output, likewise producing a feature map of size [8 x 8 x 192], while Conv_11_pre operates on the max-pooled output. All things considered, this channel attention method efficiently gathers spatial information and captures distinguishing features by utilizing both average and max pooling. The network learns channel-wise attention weights over the ensuing convolutional operations, which improves the feature maps' discriminative capability in the third residual block.
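
As a purely illustrative sketch of the two pooling formulas, the per-window maximum and mean can be computed as below; unlike the block itself, this example uses stride 1 without padding and a single channel.

```python
import numpy as np

def window_pools(x, h=3, w=3):
    """P_max and P_avg of (4.16)-(4.17) for every h x w window position
    (stride 1, no padding, single channel)."""
    H, W = x.shape
    p_max = np.empty((H - h + 1, W - w + 1))
    p_avg = np.empty_like(p_max)
    for p in range(H - h + 1):
        for q in range(W - w + 1):
            win = x[p:p + h, q:q + w]        # the h x w pooling window
            p_max[p, q] = win.max()          # equation (4.16)
            p_avg[p, q] = win.mean()         # equation (4.17)
    return p_max, p_avg

pm, pa = window_pools(np.arange(36, dtype=float).reshape(6, 6))
print(pm.shape, pa.shape)                    # (4, 4) (4, 4)
```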

Within the third residual block, two separate operations are carried out following the
convolutional layers Conv_11 and Conv_11_right, which apply a filter size of [3 x 3] with 256
channels, stride 1, and 'same' padding. In order to stabilize training and add non-linearity, Conv_11_right is subjected to additional batch normalization (BN) and rectified linear activation operations, which are widely used in deep learning networks. By doing this, the
network's representational capacity is improved and the feature maps are guaranteed to retain
desired statistical features. Following that, the outputs of Conv_11 and Conv_11_right undergo
pointwise convolution, where every spatial position is convolved separately for every channel.
To further enrich the recovered features with non-linearities and normalize them, rectified linear
unit (ReLU) activation and batch normalization (BN) are performed after this procedure.
Eventually, Add_3, a feature map with dimensions of [8 x 8 x 256], is created by combining the
resultant feature maps from Conv_11 and Conv_11_right via element-wise addition. Through the
suppression of irrelevant features and the network's concentration on pertinent features, this
fusion process enables channel attention. Add_3 improves the feature map's discriminative
strength by highlighting important features and reducing noise or unnecessary information. This
helps the network recognize prominent patterns and make more intelligent judgments at later
processing stages.

In the network's final stage, Global Average Pooling (GAP) is applied to the feature maps, producing a single value per channel for the classification task. The spatial information of each feature map is aggregated during this pooling process to provide a fixed-size output of [1 x 1 x 256]. Notably, GAP introduces no trainable parameters, which simplifies the model
and lowers the possibility of learning false patterns unique to the training set, hence mitigating
overfitting. A dropout layer is added after GAP to improve the model's generalization
capabilities even more. In order to properly introduce noise and prevent the network from
depending too heavily on any one set of features, dropout randomly deactivates a fraction of
neurons during training. By encouraging the network to acquire stronger and more broadly applicable representations of the data, this regularization technique helps defend against overfitting. The network is encouraged to learn more resilient and diversified feature
representations while also decreasing computational complexity by introducing dropout after
GAP. As a result, the model becomes more effective and economical and performs better when
generalizing to new data.

A fully-connected layer is used in the last stage of the neural network to link each neuron
in the layer before it to each neuron in the layer after it. In order to predict the output neuron, this
layer acts as the classifier, flattening the input representation and turning it into a feature vector
that is then routed through a network of neurons. The final activation function of the neural
network, as defined in Equation 4.18, is the softmax function. It is utilized to calculate the
probabilities for each class in the classification task. The network's output is normalized using
the softmax function into a probability distribution over the anticipated output classes. By
exponentiating the input value and dividing it by the total of the exponentiated values of all
possible classes, it determines the probability of each class.

softmax(x_i) = e^(x_i) / ∑_{j=1}^{N} e^(x_j) (4.18)

where N is the number of possible classes, x_i is the input value for class i, and the denominator sums the exponentials of all inputs x_j. The cross-entropy loss between the true labels connected to the input data and the probability estimates produced by the softmax function is calculated by the classification layer during training. The difference between the actual and predicted distributions of class probabilities is measured by this loss function. The effective deep residual learning network is trained to accurately categorize input data into various predefined classes, including "bus," "train," "library," "residential area," and others.
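
A minimal sketch of this classification head, under the same framework assumption as the earlier sketches, is shown below; the dropout rate and the number of output classes are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of the final stage: global average pooling, dropout, a
    fully-connected layer, and the softmax of (4.18)."""
    def __init__(self, in_ch=256, num_classes=15, p_drop=0.5):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)     # [N, 256, 8, 8] -> [N, 256, 1, 1]
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, x):
        z = self.gap(x).flatten(1)             # fixed-size feature vector [N, 256]
        logits = self.fc(self.drop(z))
        return torch.softmax(logits, dim=1)    # class probabilities, equation (4.18)

head = ClassificationHead()
probs = head(torch.randn(4, 256, 8, 8))        # [4, 15]; each row sums to 1
# During training, the cross-entropy loss would be computed between these
# probability estimates (or the underlying logits) and the true class labels.
```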

4.4. Summary

The chapter outlines the architecture of the proposed system for classifying audio scenes.
It starts by explaining how input audio data is organized into different categories. Then, the
preprocessing steps are discussed in Section 4.1, where framing, windowing, and the STFT
process are explained. In Section 4.2, the focus is on extracting important features from the
audio, such as the log mel spectrogram, and improving this extraction process using nearest
neighbor interpolation. Finally, in Section 4.3, the proposed acoustic scene classification system
is developed. In this section, deep residual learning networks and channel attention mechanisms
are introduced to enhance the model's performance in accurately identifying different audio
scenes.
