Voice Activity Detection Using Deep Learning (Speech Enhancement) For GSM Telephony
By
Daniel Graham Boaz
May, 2022
DEDICATION
UNIVERSITY OF BUEA
CERTIFICATION
The dissertation of Daniel Graham Boaz (CT19P008) entitled “Voice Activity
Detection Using Deep Learning for GSM Telephony”, submitted to the Department of
Electrical and Electronic Engineering, College of Technology of the University of Buea,
in partial fulfillment of the requirements for the award of the Master of Technology
(M.Tech.) degree in Telecommunications and Networks, has been read, examined and
approved by the examination panel composed of:
Date:
Dr. Sone Ekonde Michael (AP)
(Director)
ACKNOWLEDGEMENT
First and foremost, I thank God for letting me live to see this dissertation through. I am
deeply grateful to my supervisors who were more than generous with their expertise and
precious time. A special thanks to Dr. Sone Ekonde Michael and Dr. Feudjio Cyrille, for
their countless hours of reflecting, reading, encouraging, advising and, most of all,
their patience throughout the entire process.
I would like to thank all my Master’s lecturers, the various heads of department, and all
the College of Technology staff for the strong theoretical knowledge which laid the
foundation for this work.
I’m most grateful to my parents for their support, encouragement and motivation
throughout this work. To my mother Sahan Annie Angele, I heartily appreciate her
prayers and efforts towards the completion of this work and for supporting me strongly
in my endeavors.
Last but not least, I express sincere thanks to my whole family, all my classmates and
friends who have patiently extended all sorts of help for the accomplishment of this
undertaking.
ABSTRACT
Voice Activity Detectors (VAD) are algorithms for detecting the presence of speech
signals in a mixture of speech and noise. They play an essential role in speech coders
for GSM telephony, where they operate as binary classifiers that flag the audio frames
in which voice is detected. However, in a low-SNR environment, the presence of babble
noise drastically reduces the intelligibility of speech, resulting in poor VAD decisions.
In this dissertation, we propose using deep learning to learn the mapping between
noisy speech and clean speech features in order to improve VAD decisions. Specifically,
we propose using fully convolutional neural networks (CNN), which automatically
extract distinctive features of noisy and clean speech spectra using few network
parameters. The proposed network model showed improved subjective and
objective measures, with an average PESQ of 2.4755, an average STOI of 0.7016, an improved
average SNR of 5.1737 dB, and an average channel capacity improvement of 118% for noisy
speech samples at 0 dB SNR.
Key Words: Speech, Voice Activity Detection (VAD), Deep Neural Network (DNN),
Convolutional Neural Networks (CNN).
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.3. Objectives
CHAPTER 2: LITERATURE REVIEW
2.2.1. Speech Detection Using Energy and Zero Crossing Rate
2.2.4. Speech Detection Using Cepstral Features and Mel-Energy Features
2.3. Demystifying the VAD Problem (The Rationale of our Work)
CHAPTER 3: METHODOLOGY
CHAPTER 4: RESULTS
CONCLUSION
Summary
Remarks
REFERENCES
APPENDICES
Appendix E: test_tf_record.py
Appendix F: dataset.py
Appendix H: Clean Speech, Noisy Speech, and Denoised Speech spectrograms
LIST OF TABLES
LIST OF FIGURES
Figure 2: 2-layer fully connected network (Badescu & Cavez, 2021)
Figure 6: Modified Convolutional Encoder-Decoder Network (CED) (Park & Lee, 2016)
Figure 7: Proposed Redundant CED (R-CED) (Park & Lee, 2016)
Figure 9: The Librosa downsampler subsystem (Signals and Systems - OpenStax CNX, 2021)
Figure 10: Librosa split method (Librosa — Librosa 0.8.1 Documentation, 2021)
Figure 11: Librosa stft function (Librosa — Librosa 0.8.1 Documentation, 2021)
Figure 13: Deep learning training scheme (The MathWorks, Inc., 2021)
Figure 14: STFT predictor and target vector inputs (The MathWorks, Inc., 2021)
Figure 18: Reducing the MSE of our DNN model (Silva, 2019)
ABBREVIATIONS
API Application Programming Interface
DD Decision Directed
FT Fourier Transform
ML Machine Learning
PC Personal Computer
WT Wavelet Transform
CHAPTER 1
INTRODUCTION
1.1. Overview
Speech is the predominant means of communication between human beings, and since the
invention of the telephone by Alexander Graham Bell in 1876, speech services have
remained the core service in almost all telecommunication systems. Speech coders
in GSM telephony are used to compress the bit rate (bandwidth) of the speech signal
before its transmission while keeping an acceptable perceived quality of the decoded
output speech signal. This speech signal is often corrupted by an interfering signal
(babble noise), which has a harmful contaminating effect on the signal-to-noise ratio of
the resulting speech signal.
With the recent advances in speech signal processing techniques, the need to accurately
detect the presence of speech in the incoming signal under different noise environments
has become a major industry concern. Separation of the speech fragment from the non-speech
fragment in an audio signal has been achieved over the years using Voice Activity
Detectors (VAD). VADs are a class of signal processing methods that detect the
presence or absence of speech in short segments of an audio signal. They play a pivotal role
as the preprocessing block in a wide range of speech applications, thereby providing
improved channel capacity, reduced co-channel interference and power consumption in
portable electronic devices in cellular radio systems, and simultaneous voice and data
applications in multimedia communications.
In the past decades, a lot of work has been done with regard to enhancing the speech on the
one hand and enhancing VAD decisions on the other. Though there might be some
closeness between these two approaches, the difference in the outcome lies in
the voice frequency band, which in some cases could be considered as the unwanted
signal.
The voice frequency band – which ranges from approximately 300 to 3400 Hz and is
present in both the additive noise and the clean input speech signal – seems to cause
some obfuscation to the voice activity detector (VAD), which in turn renders a
perceptually non-speech frame as a clean speech frame.
Most VAD algorithms assume the background noise is stationary – often following a
Gaussian distribution – in one speech frame, and the same assumption is made for
consecutive speech frames. In reality, however, interfering noise signals can
sometimes switch from one form to another (e.g., from a railway station to a talking
crowd), thereby often causing VAD detection/decision errors.
1.3. Objectives
In order to attenuate the decision errors made by VADs, we set out to denoise the noisy
speech (enhance the speech) right before the Voice Activity Detector.
Thus, our main objective in this dissertation is to propose a speech enhancement model
that learns the mapping between noisy speech spectra and clean speech spectra, using
deep neural networks (DNN), to suppress both stationary and non-stationary background
noise, thereby bringing about improved VAD decisions. To this end, we seek to answer the
following research questions:
1. Can decision errors made by voice activity detectors, in very low SNR conditions,
be attenuated with the help of Speech Enhancement, bearing in mind mobile
device constraints in narrowband GSM Telephony?
2. How well does speech enhancement affect the perception and intelligibility of the
denoised/enhanced speech signals?
This work will be limited to the ITU-T G.729 Annex B recommendation of the Voice
Activity Detection algorithm. The performance of the speech enhancement method will
be evaluated based on subjective and objective measures, as suggested by
(Krishnamoorthy, 2011). The speech and noise sounds used for analysis and
implementation will be sound files freely provided by (Mozilla Common Voice, 2021)
and (Salamon et al., 2014), respectively. The speech sounds will be a subset of Mozilla
Common Voice provided by MATLAB (The MathWorks, Inc., 2021). The evaluation
process will be limited to simulations using the aforementioned sound files and no real-
time implementation of GSM telephony will be done.
This study is structured as follows: Chapter 2 introduces the literature about VADs, GSM
Speech Coders, digital spectral analysis, deep neural networks, digital signal processing,
and previous works. It also presents our hypotheses based on our problem statement.
Chapter 3 describes the strategy developed in this dissertation to address the issues of
voice activity detection. It presents how our neural network is designed and implemented
with VAD. Chapter 4 presents the results of our implementation and conclusions are
drawn.
CHAPTER 2
LITERATURE REVIEW
In this section, we present the conceptual idea behind voice activity detection and a literature
survey of the existing work done in the area. We also go through some recent research
works in speech enhancement with deep learning, introduce deep learning, and present
the models used for our hypothesis.
VAD, also known as speech activity detection (SAD), aims to detect the presence of
speech in an audio signal. This might include a scenario of identifying when the signal
from a hidden microphone contains speech so that a voice recorder can operate (also
known as a Voice Operated Switch or VOS). Another example would be a mobile phone
using a VAD to decide when one person in a call is speaking so it transmits only when
speech is present (by not transmitting frames that do not contain speech, the device might
save over 50% of radio bandwidth and operating power during a typical
conversation) (McLoughlin, 2016).
In cellular radio systems (for instance GSM and CDMA systems) based on Discontinuous
Transmission (DTX) mode, VAD is essential for enhancing system capacity by reducing
co-channel interference and power consumption in portable digital devices. To reduce the
annoying modulation of the background noise at the receiver (noise contrast effects),
Comfort Noise Generation (CNG) is used, inserting a coarse reconstruction of the
background noise at the receiver (Scourias, 1995).
(Hahn & Park, 1992) propose a simple yet effective speech detection algorithm that
classifies frames based on differential logarithmic energy and the zero-crossing rate.
Another VAD technique to improve word boundary detection for varying background
noise levels was suggested by (Craciun & Gabrea, 2004), where noise parameters are
initially estimated from the initial frames and then updated using a first-order
autoregressive filter during the silence periods. The correlation coefficients for the
instantaneous spectrum and an average of the background noise spectrum are the
parameters employed in this approach. Subsequently, a statistical approach using a basic
binary Markov model is used for voice activity detection.
(Sohn et al., 1999) proposed a Statistical Model-Based VAD (SMVAD), in which the
decision rule was obtained from the Likelihood Ratio Test (LRT) by utilizing the
Maximum Likelihood (ML) criterion to estimate the unknown parameters. Further
improvements were made by optimizing the decision rule for the estimate of unknown
parameters using the Decision-directed (DD) technique (Jongseo Sohn & Wonyong Sung,
1998). To achieve robustness in low SNRs, the proposed algorithm further optimized the
decision rule by adapting the decision threshold using the measured noise energy.
Haigh et al. demonstrated the robustness to various background noise levels for successful
end-of-speech identification using cepstral feature-based thresholds (Haigh & Mason,
1993). Chin-Teng Lin et al. suggested Enhanced/Improved Time-Frequency (ETF)
and Minimum Mel-Scale Frequency Band (MIMSB) parameters, collected from a multi-band
spectral analysis using Mel-scale frequency banks, to create a robust word boundary
detection method (ETF VAD) (Chin-Teng Lin et al., 2002).
Jong Won et al. proposed a detection algorithm in which the distributions of noise spectra
and noisy speech spectra including speech-inactive intervals are modeled by a set of
GΓD’s and applied to the LRT for VAD. The parameters of GΓD are estimated through
an online Maximum Likelihood (ML) estimation procedure where the Global Speech
Absence Probability (GSAP) is incorporated under a forgetting scheme. The proposed
VAD algorithm based on GΓD proved to outperform the algorithms based on other
statistical models discovered so far (Jong Won Shin et al., 2005).
(Tashev & Mirsamadi, 2016) proposed an algorithm for causal VAD based on DNNs.
The DNN is trained on segments of several consecutive audio frames, and with all
frequency bins together to utilize the correlation between the frames and bins. No
assumptions are made for any prior distribution of the noise and speech signals and the
DNN is expected to learn the dependency between the input features and the VAD
decision. It is shown that the proposed algorithm and DNN structure outperform the classic,
statistical model-based VAD for both seen and unseen noises.
(Farahani, 2017) proposed the ANS method which, instead of removing the lower-lag
autocorrelation components of the noisy signal – as is the case with other autocorrelation-based
noise suppression methods – tries to estimate the noise autocorrelation sequence and
subtracts it from the noisy signal autocorrelation sequence. It uses the average
autocorrelation of a number of non-speech frames of the noisy utterance as an estimate
of the noise autocorrelation sequence, given by:
$$\hat{r}_{vv}(k) = \frac{1}{P}\sum_{i=0}^{P-1} r_{yy}(i,k), \qquad 0 \le k \le N-1 \tag{1}$$
where 𝑃 is the number of non-speech frames, 𝑟𝑦𝑦 (𝑖̇, 𝑘) the autocorrelation sequence of
the noisy speech frame. This resulted in obtaining the autocorrelation sequence of the
clean speech signal expressed as
$$r_{yy}(m,k) = \frac{1}{N-k}\sum_{i=0}^{N-1-k} y(m,i)\,y(m,i+k) \tag{3}$$
with 𝑦(𝑚, 𝑖) being the noisy speech sample composed of the clean speech 𝑥(𝑚, 𝑖) and
the noise 𝑣(𝑚, 𝑖), 𝑁 is the frame length, 𝑖 is the discrete time index in the frame, and 𝑘 is
the autocorrelation sequence index within each frame.
However, this method requires a VAD to obtain the non-speech frames used to estimate
the noise autocorrelation, which doesn’t align with our objectives, as the VAD might
make errors in classifying clean speech and non-speech frames. We throw more light on
this in section 2.3.
It can be observed from the VAD decision depicted in APPENDIX B and magnified in
APPENDIX C, that the VAD red markers don’t seem to efficiently demarcate the
unvoiced and voiced parts of the noisy speech, as there is a combination of both the clean
speech and the unwanted speech babble noise in the noisy signal.
Thus, it would be more advantageous, if the noisy speech signal were denoised
(enhanced) before being passed through a voice activity detector – a topic which we
discuss further in section 2.4.
Speech Enhancement has been a concern for a long time now. It aims to improve speech
quality by attenuating interfering noise. We want to filter out unwanted noise from an
input noisy signal without damaging the speech quality. For instance, if someone is
talking in a phone call conversation while a piece of music is playing in the background
or while running, a speech enhancement system's job, in this case, is to remove or filter
out the background noise (i.e., background music, or body movement sounds) to improve
the speech signal.
Speech enhancement techniques can be classified into two types based on the number of
microphones used. Single-channel techniques recover speech from a noisy signal captured by
only one microphone, whereas multi-channel techniques require several microphones and are
more expensive to implement and compute. Our focus in this dissertation is on single-channel
speech enhancement.
Most single-channel speech enhancement techniques operate in the spectral domain
(Kawamura et al., 2012), which is preferable for use in a cell phone. (Ortega-Garcia
& Gonzalez-Rodriguez, 1996) give an overview of single-channel speech enhancement
techniques. However, the major limitation of speech enhancement in the spectral domain
is the fact that it still assumes the noise process to be stationary and, hence, will not be
successful for non-stationary forms of background noise.
On the other hand, there has been a lot of recent progress in deep neural networks (DNN)
for different signal processing tasks and several deep learning methods for single-channel
speech enhancement have been developed. Also, recent innovations in convolutional
neural networks (CNN) make them beneficial for speech enhancement by training the
model on spectrogram features.
The next section reviews some of the recent works in speech enhancement with DNNs.
In the following subsections, we explore some of the DNN techniques used for speech
enhancement proposed in the literature.
2.4.2.1. 2Hz
(Kumar & Florencio, 2016) proposed a speech enhancement method that focuses
primarily on the presence of multiple noises simultaneously corrupting the speech.
Specifically, it deals with improving speech quality in an office environment where
multiple stationary, as well as non-stationary noises, can be simultaneously present in
speech. It is shown that noise-aware training is quite helpful in speech enhancement, as
well as in complex noise conditions.
(Park & Lee, 2016) try to solve the problem of speech enhancement by finding a
‘mapping’ between noisy speech spectra and clean speech spectra via supervised learning.
Specifically, they propose using fully Convolutional Neural Networks (CNN), which
consist of fewer parameters than fully connected networks. The CNN used
is a new architecture, the Redundant Convolutional Encoder-Decoder (R-CED), which is shown
to be 12 times smaller in size than other networks while achieving better performance. The
network extracts redundant representations of a noisy spectrum at the encoder and maps
them back to a clean spectrum at the decoder. This can be viewed as mapping the spectrum
to higher dimensions and projecting the features back to lower dimensions.
In section 2.5., we briefly present why deep learning is so popular nowadays, the areas
where it is implemented, and some of the most used architectures.
In recent years, the advance in deep learning technologies has provided great support for
the progress in Image Processing, Video Processing, Machine Translation, and Speech
Enhancement research fields. Unlike traditional speech enhancement approaches that
depend on statistical models, like spectral subtraction, Wiener filtering, and minimum
mean square error, deep learning approaches built on a data-driven paradigm have shown
outstanding speech enhancement performance over their predecessors (Yuliani et al.,
2021). This is mainly due to their ability to model complex non-linear mapping functions
(Shivakumar & Georgiou, 2016).
The next section provides an overview of Deep Learning and how it works.
Deep Learning is a new area of Machine Learning research, which has been introduced
with the objective of moving Machine Learning closer to one of its original goals:
Artificial Intelligence. Deep Learning is all about learning multiple levels of
representation and abstraction that help to make sense of data such as images, sound, and
text.
Deep Learning relies on Deep Neural Networks, which are in charge of learning from
the input training data. In this dissertation, we will be using two specific types of Deep Neural
Networks, which we describe below:
Convolutional Neural Networks (CNN): These are networks that learn directly
from samples by optimizing their filters (or kernels) through automated learning,
as compared to traditional algorithms where these filters are rather hand-
engineered.
Autoencoders: These are networks that work like Restricted Boltzmann Machines
(RBM), but use encoders to encode an unlabeled input dataset into short-codes
and use them to reconstruct (decode) the original input data while extracting the
most valuable information (features) from the input data. (Samaya et al., 2021)
The most basic Neural Network is the fully connected network, which is composed of a
deep network of linear classifiers.
To better understand how a linear classifier works, Figure 1 represents its common
architecture. Its equation is expressed as follows:

$$Y = WX + b \tag{4}$$
The network's ability to learn is determined by the weights and biases. The network's goal
is to learn, from the training data, the weight and bias parameters that minimize the error.
The loss function is the function that measures the error during the learning process.
Cross-Entropy and Mean Squared Error are typical loss functions that can be used
to minimize this error: Cross-Entropy is more commonly used in classification, while
Mean Squared Error is more commonly used in regression. An optimizer is required to
reduce the error; Gradient Descent, particularly Stochastic Gradient
Descent (SGD), is a well-known optimizer. Linear models are stable, but they have
many limitations. To retain the parameters of linear functions while making the overall
model non-linear, a step further must be taken and non-linearities (activation functions) must be introduced.
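As a minimal sketch of these ideas (not the model used later in this dissertation), the snippet below builds a small fully connected network in Keras with a ReLU non-linearity, a Mean Squared Error loss, and an SGD optimizer; the data are randomly generated placeholders.

```python
# Minimal sketch: a dense network (Y = WX + b per layer) with a ReLU non-linearity,
# trained with Stochastic Gradient Descent on a Mean Squared Error loss.
import numpy as np
import tensorflow as tf

X = np.random.randn(100, 4).astype("float32")   # toy inputs (placeholder data)
Y = np.random.randn(100, 1).astype("float32")   # toy regression targets

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),  # weights and bias are learned
    tf.keras.layers.Dense(1),                                       # linear output layer
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
model.fit(X, Y, epochs=5, batch_size=16, verbose=0)
```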
Finally, the concepts of an epoch, batch size, iteration, learning rate, and overfitting must
be explained. An epoch corresponds to the processing of the entire training dataset once.
A batch, on the other hand, is a smaller portion of the dataset over which the gradient is
calculated, so several iterations are required before an epoch is completed. This is an
advantage, as the system gets to be faster, which is why it is required to establish a batch size.
As its name indicates, the learning rate sets the speed of learning: a learning rate that is too
high increases the error, while one that is too low slows convergence and encourages overfitting.
When overfitting appears, the training should stop. To identify it, the training loss and the
validation loss must be observed: when the validation loss is increasing while the training loss
is decreasing, it is a clear signal of overfitting.
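A sketch of how this check can be automated with Keras is shown below; the EarlyStopping callback halts training when the validation loss stops improving. The model and dataset names are placeholders.

```python
# Stop training automatically when the validation loss stops improving,
# which is the overfitting signal described above.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",          # watch the validation loss
                           patience=3,                   # tolerate 3 epochs without improvement
                           restore_best_weights=True)    # roll back to the best epoch

history = model.fit(X_train, Y_train,                    # placeholder training data
                    validation_data=(X_val, Y_val),      # placeholder validation data
                    epochs=100, batch_size=32,
                    callbacks=[early_stop])
```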
One important type of Neural Networks is the Convolutional Neural Network, which has
been greatly used over the years, to enable machines to view the world as humans do,
perceive it in a similar manner, and even use the knowledge for a multitude of tasks such
as Image & Video recognition, Image Analysis & Classification, Media Recreation,
Recommendation Systems, Natural Language Processing, to name a few.
In section 2.7. and section 2.8., we review the literature behind Autoencoders, CNNs,
CNN extensions proposed by (Park & Lee, 2016), and their relevance to this dissertation.
2.7. Autoencoders
Autoencoders are neural networks that compress the input into a lower-dimensional code
and then decode (reconstruct) the output from this representation (code). This code is a
compact “summary” or “compression” of the input, also called the latent-space
representation.
2.7.1. Working Principle
An autoencoder consists of 3 main components: encoder, code and decoder. The encoder
summarizes (compresses) the input and produces a code, which is used by the decoder to
reconstruct the input. The figure below is a depiction of the architecture of an
autoencoder.
First, the input in Figure 3 is passed through the encoder, which is a fully connected neural
network, to produce the code. The decoder, which has a similar neural network
structure, then produces the output using this code. The idea here is to get an output
identical to the input.
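A minimal Keras sketch of this principle is given below; the 128-dimensional input and 16-dimensional code are illustrative sizes, not those used in our model.

```python
# Minimal fully connected autoencoder: the encoder compresses the input into a short code,
# the decoder reconstructs the input from that code; trained to reproduce its own input.
import tensorflow as tf

inputs  = tf.keras.Input(shape=(128,))
code    = tf.keras.layers.Dense(16, activation="relu")(inputs)    # encoder -> latent code
outputs = tf.keras.layers.Dense(128, activation="linear")(code)   # decoder -> reconstruction

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=10)  # note: the target is the input itself
```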
In section 2.8., we review CNNs and how they can be nested with autoencoder layers.
Inspired by early findings in the study of biological vision, the name "convolutional
neural network" indicates that the network employs a mathematical operation called
convolution. Convolutional networks are a specialized type of neural network that uses
convolution in place of general matrix multiplication in at least one of their layers.
The architecture of a CNN is inspired by the organization of the Visual Cortex and is
analogous to the connectivity pattern of Neurons in the Human Brain. Individual neurons
can only respond to stimuli in a small area of the visual field called the Receptive Field.
A number of similar fields can be stacked on top of each other to span the full visual field
(Saha, 2018).
In computer vision applications, the CNN algorithm takes an input image and gives
relevance (learnable weights and biases) to various aspects/objects in the image, allowing
it to distinguish between them. When compared to other classification algorithms, the
amount of pre-processing required by a CNN is significantly less: while filters are hand-engineered
in traditional algorithms, CNNs are able to learn these filters through training.
In addition, each filter h_i is replicated across the entire layer. These replicated units share
the same parametrization (weight vector and bias) and form a feature map. In Figure 5 a
CNN feature map can be observed.
For these reasons, Convolutional Neural Networks are perfectly fit for image and video
processing, but also for audio processing.
The image in Figure 6 shows a simple CNN architecture for classifying handwritten digits
images.
Figure 6 above shows us how the CNN layers reduce the dimension of images into a form
that is easier to process, without losing features that are critical for getting a good
prediction.
Having seen the working principle of CNNs, we will now review some of the CNN
extensions proposed by (Park & Lee, 2016).
A Convolutional Encoder-Decoder (CED) network consists of symmetric encoding
and decoding layers in which each block represents a feature. This is depicted in Figure
7.
Figure 7: Modified Convolutional Encoder-Decoder Network (CED) (Park & Lee, 2016)
No pooling layer is present, and thus no up-sampling layer is required. Opposite to CED,
R-CED encodes the features into higher dimensions along the encoder and achieves
compression along the decoder. The number of filters is kept symmetric: at the encoder,
the number of filters is gradually increased, and at the decoder, the number of filters is
gradually decreased. The last layer is a convolution layer, which makes R-CED a fully
convolutional network.
Compared to an R-CED of the same network size (i.e., with the same number of
parameters), the Cascaded R-CED (CR-CED) achieves better performance, both in terms of
intelligibility and perceptual quality, with less convergence time.
In the next section, we develop our hypothesis based on this deep learning network.
The previous subsections discussed existing VAD and speech enhancement methods and
the theory concerning VAD, speech enhancement, and Deep Learning. With the help of
the research conducted by (Park & Lee, 2016), we formulate our hypothesis in two parts.
H1. Given a segment of noisy spectra $\{x_t\}_{t=1}^{T}$ and clean spectra $\{y_t\}_{t=1}^{T}$, we aim to
learn a mapping $f$ which generates a segment of denoised spectra $\{f(x_t)\}_{t=1}^{T}$ that
approximates the clean spectra $y_t$ in the $\ell_2$ norm.
Specifically, we formulate $f$ using a fully convolutional neural network, such that the past
$n_T$ noisy spectra $\{x_i\}_{i=t-n_T+1}^{t}$ are considered to denoise the current spectrum, i.e., the
network is trained to minimize

$$\sum_{t=1}^{T}\left\|y_t - f(x_{t-n_T+1},\ldots,x_t)\right\|_2^2 \tag{6}$$
H2. Can the denoised spectra, {𝑓(𝑥𝑡 )}𝑇𝑡=1 , obtained from H1, improve the decisions
made by the ITU-T G.729 Annex B recommended voice activity detection
algorithm?
For this dissertation, we shall adopt the CR-CED network model to perform single
channel speech enhancement of noisy speech signals in order to improve VAD algorithm
decisions.
In the following chapter, we will present the methodology used to implement our CR-
CED DNN architecture for eliminating noise from speech signals.
CHAPTER 3
METHODOLOGY
This chapter explains how we use Deep Learning to create an architecture capable of
mapping noisy speech signals to their clean variants. To begin, we'll go over the
technologies used for building/training our denoising algorithm and the dataset we used to
train and test our speech enhancement model. Next, we discuss the modules we used for
preparing the dataset for the Neural Network. Then, we will present the architecture of
our suggested Deep Neural Network model. Finally, we test hypothesis H2 using our
DNN model.
In this section, we present our computing toolkit used for testing our hypothesis. Our
toolkit can be classified into 2 types, namely;
Computing: which involves the computing resources we used for building our
DNN architecture, e.g., servers, computers, etc.
Analytics: which involves any software tool we used for testing hypotheses.
Traditional machine learning techniques are often used when the dataset size is small;
however, their performance greatly degrades when the dataset size gets larger. On the other
hand, deep learning exhibits advantageous scalability with huge dataset sizes, hence the
need for great computing power. Graphics Processing Units (GPUs) are usually
responsible for delivering the computing power needed for these tasks instead of CPUs, as
they provide a large number of concurrent threads compared to the single-thread
performance optimization provided by a CPU (“Why Are GPUs Necessary for Training
Deep Learning Models?,” 2017).
Given that we could not afford a PC with a good GPU and the recommended 16GB of RAM
(Running Kaggle Kernels with a GPU, 2021), we opted to use Kaggle Kernels.
Kaggle provides free access to NVIDIA K80 GPUs in kernels (Running Kaggle Kernels
with a GPU, 2021) with 16GB of RAM available. This results in a 12.5× speedup during
training of a deep learning model, with a total run-time of 994 seconds as compared to
13,419 seconds with only one CPU.
This toolkit consists of the software and libraries we used in building, training, and testing
our denoising model. This comprises of:
Programming languages
Software packages
A powerful deep learning API we will be using for creating our DNN model is Keras, which
runs on top of TensorFlow and was developed with a focus on enabling fast
experimentation with neural network architectures thanks to its total modularity, minimalism,
and extensibility. Furthermore, it supports convolutional networks, recurrent networks,
and combinations of both, including multi-input and multi-output training.
Pandas: which is used for data analysis and manipulation of our audio signals
Numpy: which adds support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate
on these arrays
Scikit-learn: which is used for clustering and dimensionality reduction within
our model
Librosa: which is a music and audio analysis library that provides the
building blocks necessary to create music information retrieval systems.
DSP toolbox: This provides algorithms, apps, and scopes for designing,
simulating, and analyzing signal processing systems. This will be used to
resample audio signals within MATLAB for SNR analysis.
Audio toolbox: This provides tools for audio processing, speech analysis, and
acoustic measurements. This will be used to read audio files into MATLAB for
SNR analysis.
Simulink toolbox: This is a MATLAB-based graphical programming
environment for modeling, simulating, and analyzing multidomain dynamical
systems. This will be used in the results section of this dissertation to test our
hypothesis H2.
3.2. Datasets
The experiment was conducted using 2 publicly available audio datasets namely:
The Mozilla Common Voice (MCV): This dataset contains as many as 75,879
recorded clean speech clips – about 65GB, comprising 2,637 validated hours
spread over short MP3 files. Due to the lack of adequate computing resources,
we use a minified version of this dataset provided by MathWorks at (Denoise
Speech Using Deep Learning Networks, 2021). It only contains 2,800 recorded
clean speech samples and weighs only 988MB. The MCV project is open source
and anyone can contribute to it. The wide range of speakers in this dataset is
one of its best features: it includes fragments of male and female recordings from
a wide range of ages and foreign accents.
UrbanSound8K: This dataset contains 8732 labeled sound excerpts of urban
noise sounds classified into 10 different commonly found urban sounds. This
includes: air conditioner, car horn, children playing, dog bark, drilling, engine
idling, gun shot, jackhammer, siren, and street music. These classes are drawn
from the urban sound taxonomy and can be found at
https://urbansounddataset.weebly.com/urbansound8k.html.
We will use these urban sounds as noise signals added to the clean speech samples from the
MCV dataset. In other words, similar to Figure 9, we shall first take a clean speech signal
– this can be someone speaking a random sentence from the MCV dataset – then we add
noise to it, thereby synthetically creating a scenario where a woman is speaking and a
dog is barking in the background, and finally we use this artificially created noisy signal
as the input to our deep learning model. Our Neural Network, in turn, will receive this
noisy signal and try to compute a clean representation of it.
Figure 9 displays a visual representation of a clean input signal from the MCV, a noise
signal from the UrbanSound dataset, and the resulting noisy input – that is, the input speech
after adding noise to it. Also, note that the noise power is set so that the signal-to-noise
ratio (SNR) is zero dB (decibels).
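A sketch of how such a 0 dB mixture can be synthesized is given below; the file names are placeholders, and the scaling simply equalizes the clean and noise powers.

```python
# Mix clean speech with an urban noise clip at a target SNR of 0 dB.
import numpy as np
import librosa

clean, sr = librosa.load("clean_speech.mp3", sr=8000)   # placeholder MCV sample
noise, _  = librosa.load("dog_bark.wav", sr=8000)       # placeholder UrbanSound8K sample
noise = np.resize(noise, clean.shape)                    # match the lengths

target_snr_db = 0.0
clean_power = np.mean(clean ** 2)
noise_power = np.mean(noise ** 2)
scale = np.sqrt(clean_power / (noise_power * 10 ** (target_snr_db / 10)))
noisy = clean + scale * noise                            # resulting SNR is 0 dB
```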
This section deals with a crucial step in any deep learning project, which consists of
implementing some data preprocessing modules that allow for the extraction of the
features required for training and testing our deep learning network. This entails
downsampling the audio signals, removing silent frames, and computing their short-time spectra.
One reason for downsampling the audio signals lies in the fact that our dataset contains 48kHz
recordings of subjects speaking short sentences, which would impose a very heavy computational
load on the network.
Another reason for downsampling the audio signals is to mimic the sample rate of speech
coders used for narrowband GSM telephony applications (EETimes, 2003), which is
known to be 8kHz.
For us to perform this downsampling process without risking any aliasing in our signals,
we use a Python library called Librosa, which assists in loading a signal at our
desired new sampling rate. The library automatically takes care of choosing an appropriate
anti-aliasing filter as well as a proper decimation factor for our desired sampling rate
(8kHz). The following diagram depicts what our Librosa signal downsampler consists of.
Figure 10: The Librosa downsampler subsystem. (Signals and Systems - OpenStax CNX, 2021)
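In code, this entire subsystem reduces to a single call; a minimal sketch with a placeholder file name:

```python
# librosa.load resamples the recording to the requested rate (8 kHz here),
# applying an anti-aliasing filter internally before decimation.
import librosa

audio_8k, sr = librosa.load("speech_sample.mp3", sr=8000)  # placeholder file name
print(sr)  # 8000
```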
Another preprocessing stage involves removing silent frames from our audio signals.
Similar to the downsampling process, the idea here is to reduce the computational load of
our Deep Neural Network, thereby reducing the required processing power and processing
time, and increasing the training accuracy of the network.
To achieve this, we use the below method from Librosa to split our audio file on silence:
Figure 11: Librosa split method (Librosa — Librosa 0.8.1 Documentation, 2021)
This method splits an audio signal into non-silent intervals, treating frames whose level is
more than top_db below the reference as silence. It also
takes an optional parameter, hop_length, which specifies the number of samples between
the frames under analysis. We use 20dB for our top_db parameter and 64 for our
hop length.
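A minimal sketch of this step, using the parameters quoted above (the reassembly of the kept intervals by concatenation is our own assumption):

```python
# Drop silent regions: keep only the intervals librosa flags as non-silent.
import numpy as np
import librosa

intervals = librosa.effects.split(audio_8k, top_db=20, hop_length=64)
voiced = np.concatenate([audio_8k[start:end] for start, end in intervals])
```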
This stage involves computing the spectral vectors of our audio signals with a 256-point
Short-Time Fourier Transform (32ms Hamming window) with a window shift of 64 points
(8ms) and a frequency resolution of 31.25 Hz (= 4kHz/128) per frequency bin. The
STFT formula is given by:

$$X(m,k) = \sum_{n=0}^{N-1} x(n + mH)\,w(n)\,e^{-j2\pi kn/N}$$

where $x(n)$ is the time-domain signal, $w(n)$ is the analysis (Hamming) window of length
$N = 256$, $H = 64$ is the window shift in samples, $m$ is the frame index, and $k$ is the
frequency-bin index.
The reason we are using a Hamming window is that the FFT transform of a short audio
segment from the main audio signal, erroneously uses the assumption that this signal is
periodic and repeats infinitely before and after the analyzed segment in time. This
erroneous assumption leads to edge effects between repeating segments and therefore to
what is known as spectral leakage (a lack of frequency resolution caused by spectral
information “leaking” from one frequency position into adjacent values) which can be
reduced with the use of a windowing function such as Hamming window. The Hamming
window reduces the amplitude of the discontinuities at the side lobes of each finite
segment including any non-harmonic content, hence improving the frequency resolution
of our audio signal. In Figure 12, we illustrate how spectral leakage is reduced using the
hamming window function using a segment of an audio signal from our dataset.
To fulfill this stage of splitting the signal into discrete short time frames before feeding
these to our network, we use the stft function from the Librosa library in Python, as
shown in Figure 12 below;
Figure 12: Librosa stft function (Librosa — Librosa 0.8.1 Documentation, 2021)
As per the parameters required by this function, the values we used are listed below:
n_fft: 256
win_length: 256
window: hamming
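A sketch of the call with these parameters (the 64-sample hop length comes from the 8ms window shift stated above):

```python
# Compute the STFT and separate it into magnitude and phase.
import numpy as np
import librosa

stft = librosa.stft(voiced, n_fft=256, win_length=256, hop_length=64, window="hamming")
magnitude, phase = np.abs(stft), np.angle(stft)   # stft has shape (129, num_frames)
```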
In this section, using the vectors obtained from the preprocessing stage, we proceed in
implementing a Deep Neural Network model for denoising in noisy environments. The
deep learning training scheme is shown below.
The magnitude spectra of the noisy and clean audio signals are used as the predictor and
target network signals, respectively. The magnitude spectrum of the denoised signal is
the network's output. The regression network minimizes the mean squared error between
its output and the target by using the predictor input. The denoised audio is then transferred
back to the time domain using the output magnitude spectrum and the phase of the noisy
signal.
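A sketch of that reconstruction step is shown below; denoised_mag and noisy_phase stand for the network output and the phase of the noisy STFT.

```python
# Recombine the denoised magnitude with the noisy phase and invert the STFT.
import numpy as np
import librosa

denoised_stft  = denoised_mag * np.exp(1j * noisy_phase)   # placeholder arrays
denoised_audio = librosa.istft(denoised_stft, hop_length=64, win_length=256,
                               window="hamming")
```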
Once we obtain the STFT vectors, as elaborated in the previous section, we reduce the
size of the spectral vector to 129 by dropping the frequency samples corresponding to
negative frequencies (because the time-domain speech signal is real, this does not lead
to any information loss). Our predictor input consists of 8 consecutive noisy STFT vectors,
so that each STFT output estimate is computed based on the current noisy STFT and the
7 previous noisy STFT vectors. In other words, the DNN model is an autoregressive
system that predicts the current signal based on past observations. The target
signals therefore consist of a single STFT frequency representation of shape (129, 1) from the clean
audio. The diagram below depicts this process.
Figure 15: STFT predictor and target vector inputs (The MathWorks, Inc., 2021)
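A sketch of how these predictor windows can be assembled from a noisy magnitude spectrogram (the padding of the first frames is our own assumption):

```python
# Pair each frame with itself and its 7 predecessors -> predictors of shape (T, 129, 8).
import numpy as np

def make_predictors(noisy_mag, n_frames=8):
    """noisy_mag has shape (129, T)."""
    # repeat the first column so that the earliest frames also have 7 "predecessors"
    padded = np.concatenate([np.tile(noisy_mag[:, :1], (1, n_frames - 1)), noisy_mag], axis=1)
    return np.stack([padded[:, t:t + n_frames] for t in range(noisy_mag.shape[1])])
```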
Now that we’ve laid out how our DNN will interact with our various STFT vectors, we
now need to see how the DNN model proper, will work.
Our DNN model will be largely based on the work done by (Park & Lee, 2016), where
the authors proposed a Cascaded Redundant Convolutional Encoder-Decoder Network
(CR-CED). Hence, our model will be based on a symmetric encoder-decoder architecture,
in which both components contain repeated blocks of Convolution, ReLU, and Batch
Normalization, giving our network 16 such blocks, which add up to roughly 33,000
parameters, i.e., roughly 132kB of memory, and can therefore be implemented in an embedded
system, like a mobile phone.
The figures below show the structure of our network model defined with Keras, which
we generated using the Tensorflow plot_model function. The figure was split into various
parts for space management.
Figure 16: DNN model structure, part 1
Figure 17: DNN model structure, part 2
Figure 18: DNN model structure, part 3
It is important to note that there are skip connections between some of the encoder and
decoder blocks, such that the feature vectors from these blocks are combined through addition.
These skip connections speed up convergence and reduce the vanishing of gradients
during training.
Another point to highlight is that, since one of our assumptions is to use the CR-CED
network, which is an extension of CNNs (originally designed for computer vision), it is
important to be aware that audio data differ a lot from images. Audio data, in its
raw form, is a 1-dimensional time series, as compared to images, which are 2-dimensional
representations of an instant in time.
For these reasons, we have to transform our audio signals into 2D (time/frequency)
representations. More specifically, given an input spectrum of shape (129 × 8) to our
network, convolution is only performed along the frequency axis. This ensures that the
frequency axis remains constant during forward propagation.
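The sketch below illustrates this kind of architecture in Keras; the filter counts, kernel sizes, number of blocks and the Adam optimizer are illustrative assumptions, not the exact configuration of our network.

```python
# CR-CED-style sketch: Conv1D slides only along the 129 frequency bins (the 8 STFT
# frames act as input channels); repeated Conv + ReLU + BatchNorm blocks with
# additive skip connections, trained to minimize the MSE against the clean spectrum.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size):
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.Activation("relu")(x)
    return layers.BatchNormalization()(x)

inputs = tf.keras.Input(shape=(129, 8))        # (frequency bins, consecutive frames)
x1 = conv_block(inputs, 18, 9)
x2 = conv_block(x1, 30, 5)
x3 = conv_block(x2, 8, 9)
x4 = conv_block(x3, 18, 9)
x4 = layers.Add()([x4, x1])                    # skip connection combined by addition
x5 = conv_block(x4, 30, 5)
x5 = layers.Add()([x5, x2])                    # skip connection combined by addition
outputs = layers.Conv1D(1, 9, padding="same")(x5)   # final convolution -> (129, 1)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```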
Once our network produces an output estimate, we optimize (minimize) the mean squared
error (MSE) between the output and the target (clean audio) signals. Figure 19
illustrates how our DNN model attempts to optimize the MSE between noisy spectral
vectors and target clean audio spectral vectors. It uses the frequency spectrogram of these
audio signals for illustration purposes.
Figure 19: Reducing the MSE of our DNN model (Silva, 2019)
In the next subsection, we present the training process of our model.
Training a model simply means learning (determining) good values for all the weights
and the bias from labeled samples, in this case, our clean and noisy audio signals.
However, to improve the prediction performance of our deep learning algorithm during
this training phase, we split our dataset into train, validation, and test sets. Splitting our
dataset this way also helps avoid a situation where the model fails to make predictions on
stationary noise attributes it has never seen – a concept called overfitting. The test set
here is used to estimate how well our model will behave with unseen data. The validation
set is used to validate our model in different configurations such as optimizers or loss
functions. The train set will be used to train or fit our model.
Another important manipulation we need to make on our dataset is how to split it, which
requires us to take a few considerations into account.
This splitting is still aimed at avoiding the overfitting of our model. In machine learning,
a proper ratio for splitting the dataset usually falls in the range from 60%-20%-20% to
98%-1%-1%, as shown in the figure provided by (Data Splitting Technique to
Fit Any Machine Learning Model | by Sachin Kumar | Towards Data Science, 2021).
Finally, to fit our model, we make use of the test_tf_record.py and dataset.py
modules, which use TFRecord – the data format recommended by Tensorflow for training –
to save the features of our clean and noisy audio signals, which we can now use to fit
our model. These two modules are available in Appendices E & F respectively. We use the
fit function provided by Keras to train our model. It takes as arguments the entire input
data (source) and output data (target), the batch size, the number of epochs, and the
validation dataset (source and target). We use this function in the snippet below:
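The original snippet is a screenshot that is not reproduced in this extract; a sketch of the call it describes, with placeholder values, is:

```python
# Fit the model on the TFRecord-backed training dataset; validation data is
# evaluated at the end of every epoch (variable values are placeholders).
model.fit(train_dataset,
          steps_per_epoch=steps_per_epoch,
          validation_data=validation_dataset,
          epochs=epochs)
```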
where,
train_dataset is our input training set of clean and noisy audio features
(magnitude and phase spectral vectors)
steps_per_epoch defines the total number of steps (batches of samples) before
declaring one epoch finished and starting the next epoch during training
validation_data, is the data on which to evaluate the loss and any model metrics
at the end of each epoch. The model will not be trained on this data.
epochs defines the number of epochs to train the model.
In the following chapter, we will explain the methods used to evaluate the system and
present the results we obtained.
CHAPTER 4
RESULTS
In this section, we evaluate our proposed system against our hypotheses. To test hypothesis
H1, we use objective and subjective measurements; we test hypothesis H2 using graph
comparisons of our denoised signals; and we evaluate the performance of our system
relative to critical cellular GSM device parameters, such as channel capacity, co-channel
interference, and power consumption.
In all speech enhancement algorithms, the improvement in the quality and intelligibility
is of utmost importance for ease and accuracy of information exchange. The speech
quality and intelligibility can be quantified using subjective and objective measures
(Krishnamoorthy, 2011). We implement these measurements in the next subsections.
Subjective speech quality measures are usually obtained using listening tests in which
human participants rate the quality of the speech in accordance with a predetermined
opinion scale. Listeners are presented with the sample speech audios and asked to rate the
quality of the speech on a numerical scale, typically a 5-point scale with 1 indicating poor
quality and 5 indicating excellent quality – a scoring range called Mean Opinion Score
(MOS). However, according to (Taal et al., 2010), such evaluation methods turn out to be
costly and time-consuming.
Hence, to perform the subjective speech intelligibility test, (Rix et al., 2001) suggest the
use of Perceptual Evaluation of Speech Quality (PESQ) to predict the subjective
opinion score of a degraded or enhanced speech. This is because, PESQ is a quite
sophisticated algorithm which has been recommended by ITU-T (P.862) for speech
quality assessment of narrow-band handset telephony and narrow-band speech codecs
(Hu & Loizou, 2008).
The PESQ measure takes a reference signal and the enhanced signal and aligns them in
both time and level. This is followed by a range of perceptually significant transforms
which include Bark spectral analysis, frequency equalization, gain variation equalization,
and loudness mapping (Rix et al., 2001).
The range of the PESQ score is −0.5 to 4.5, where −0.5 corresponds to poor quality
and 4.5 corresponds to excellent speech quality.
Another metric that has proven to be able to quite accurately predict the intelligibility of
noisy/processed speech in a large range of acoustic scenarios, including speech processed
by mobile communication devices, is the Short-Time Objective Intelligibility (STOI).
Recent studies by (Chen et al., 2016) and (Healy et al., 2017), show a good
correspondence between STOI predictions of noisy speech enhanced by DNN-based
speech enhancement systems, and speech intelligibility.
STOI is based on the correlation between the envelopes of clean and degraded speech
signals – denoted by 𝑥 and 𝑦 respectively, decomposed into regions that are
approximately 400ms in length and uses a simple DFT-based Time-Frequency-
decomposition. According to (Taal et al., 2011), the output of STOI is a scalar value
which is expected to have a monotonic relation with the average intelligibility of 𝑦 (e.g.,
the percentage of correctly understood words averaged across a group of users). It is a
function of a Time Frequency dependent intermediate intelligibility measure, which
compares the temporal envelopes of the clean and degraded speech in short-time regions
by means of a correlation coefficient. The following vector notation is used to denote the
short-time temporal envelope of the clean speech:
$$x_{j,m} = \left[X_j(m-N+1),\,X_j(m-N+2),\,\ldots,\,X_j(m)\right]^{T} \tag{8}$$
where $N = 30$, which corresponds to an analysis length of approximately 400ms, and

$$X_j(m) = \sqrt{\sum_{k=k_1(j)}^{k_2(j)-1}\left|\hat{x}(k,m)\right|^{2}},$$

with $\hat{x}(k,m)$ denoting the $k^{th}$ DFT bin of the $m^{th}$ frame of the clean speech and
$k_1(j)$, $k_2(j)$ the edge bins of the $j^{th}$ one-third octave band. Similar notation, $y_{j,m}$,
is used for the short-time temporal envelope of the degraded speech.
Thus, the correlation coefficient between $x_{j,m}$ and $y_{j,m}$ is given by:

$$d_{j,m} = \frac{\left(x_{j,m}-\mu_{x_{j,m}}\right)^{T}\left(\bar{y}_{j,m}-\mu_{\bar{y}_{j,m}}\right)}{\left\|x_{j,m}-\mu_{x_{j,m}}\right\|\,\left\|\bar{y}_{j,m}-\mu_{\bar{y}_{j,m}}\right\|} \tag{9}$$

where $\bar{y}_{j,m}(n)=\min\!\left(\frac{\|x_{j,m}\|}{\|y_{j,m}\|}\,y_{j,m}(n),\;\left(1+10^{15/20}\right)x_{j,m}(n)\right)$ is the
normalized and clipped version of $y$ and $\mu_{(\cdot)}$ refers to the sample average of the
corresponding vector.
Finally, the average of the intermediate intelligibility measure over all frames, referred to
as the STOI score, is given by:

$$d = \frac{1}{JM}\sum_{j,m} d_{j,m} \tag{10}$$
where 𝑀 represents the total number of frames and 𝐽 the number of one-third octave
bands.
According to (Taal et al., 2010), the output of STOI, 𝑑, takes values −𝟏 ≤ 𝒅 ≤ 𝟏 but is
in practice non-negative (Intelligibility Prediction for Speech Mixed with White Gaussian
Noise at Low Signal-to-Noise Ratios: The Journal of the Acoustical Society of America:
Vol 149, No 2, 2021).
In this dissertation, we based our measurement on 30 clean speech and noise samples
from our test dataset discussed in the methodology. This test dataset comprises clean and
noise samples, which we added individually to obtain 30 noisy speech samples with SNR
of 0𝑑𝐵. These noisy speech samples are then fed to our speech enhancement DNN
algorithm to obtain denoised speech samples. The denoised samples, the noisy speech
samples, and the original clean speech samples are then used to perform the subjective
speech intelligibility test using the aforementioned metrics.
Despite the unavailability of its mathematical representation, we measure this metric with
the help of the ‘PESQ Software’ provided by ITU-T (P.862 : Perceptual Evaluation of
Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment
of Narrow-Band Telephone Networks and Speech Codecs, 2021) embedded in the pesq
function of the pysepm package from (schmiph2, 2019/2021).
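A sketch of how these scores were gathered per sample is shown below; it assumes pysepm exposes pesq() and stoi() functions taking (clean, degraded, sampling_rate), as in the version used in this work, and the file names are placeholders.

```python
# Score one noisy/denoised pair against its clean reference with pysepm.
import librosa
import pysepm

clean, fs   = librosa.load("clean_000.wav", sr=8000)      # placeholder file names
noisy, _    = librosa.load("noisy_000.wav", sr=8000)
denoised, _ = librosa.load("denoised_000.wav", sr=8000)

pesq_noisy    = pysepm.pesq(clean, noisy, fs)
pesq_denoised = pysepm.pesq(clean, denoised, fs)
stoi_noisy    = pysepm.stoi(clean, noisy, fs)
stoi_denoised = pysepm.stoi(clean, denoised, fs)
```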
The table below shows the results of the first 10 PESQ scores for the noisy and denoised
speech samples.
The results above show that, for our first 10 noisy and denoised speech samples, the
perceptual speech quality is significantly increased with the help of our DNN model.
This metric was measured with the help of the stoi function from the pysepm module.
The table below shows the results of the first 10 STOI values for the noisy and denoised
speech samples.
The table above shows varying intelligibility scores, with 0.6518 being the minimum
STOI and 0.777 being the maximum STOI value for this result set. It can be noticed that
some denoised speech samples have lower STOI values than their noisy counterparts, as
is the case with the rows highlighted in blue. This is due to the denoising errors encountered
by our model.
It is worth mentioning that the prediction root-mean-square error (RMSE) of our model
was evaluated at 0.4375 – indicating that our model denoises noisy
speech signals with up to 14.25% feature-prediction error. That is to say, 1.96 ×
43.75% = 85.75% of the extracted noisy features could be correctly predicted, hence the
improved speech quality scores reported above.
This dissertation focused on the use of the signal-to-noise ratio (SNR) to measure the
objective quality of the noisy and denoised speech, computed as suggested by
(Krishnamoorthy, 2011):

$$SNR_{dB} = 10\log_{10}\frac{\sum_{n} s^{2}(n)}{\sum_{n}\left[s(n)-\hat{s}(n)\right]^{2}} \tag{11}$$
where 𝑠(𝑛) is the clean speech and 𝑠̂ (𝑛) is the noisy or denoised speech signals. Since
SNR is very sensitive to the time alignment of the original and processed signal, we
padded the noisy and denoised signals with zeros to align them with the original or clean
speech signal.
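A sketch of equation (11) in NumPy, with the zero-padding alignment described above (array names are placeholders for s(n) and ŝ(n)):

```python
# SNR in dB between a clean reference and a processed (noisy or denoised) signal.
import numpy as np

def snr_db(clean, processed):
    n = max(len(clean), len(processed))
    s  = np.pad(clean,     (0, n - len(clean)))       # zero-pad to a common length
    sh = np.pad(processed, (0, n - len(processed)))
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - sh) ** 2))
```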
The table below shows the results of the first 10 SNR values for the noisy and denoised
speech samples.
The results above show that, for our first 10 noisy and denoised speech samples, the SNR
is significantly increased with the help of our DNN model, from the 0dB noisy signal to a
maximum of 7.4227dB.
The next table below shows the average speech metrics (PESQ, STOI, and SNR)
computed over 30 clean speech and noise samples in our test dataset.
Mean PESQ (noisy) | Mean PESQ (denoised) | Mean STOI (noisy) | Mean STOI (denoised) | Mean SNR (noisy) | Mean SNR (denoised)
It can be observed from these results that, based on 30 samples from our test dataset, there
is a significant increase in the PESQ, STOI, and SNR of the denoised signals obtained from
our DNN model.
In this section, we were able to test hypothesis H1 and observed that our DNN network
exhibited an improved performance based on both subjective (PESQ, STOI) and objective
quality (SNR) measures, in denoising noisy speech signals. We test hypothesis H2 in the
next section.
4.2.Testing Hypothesis H2
Now that we’ve been able to test hypothesis H1 using the mentioned speech quality
measures, visually inspecting the waveforms of the resulting signals can also easily tell
us how promising our denoising algorithm can be relative to its application to the ITU-T
To begin with, let's visualize our noisy input signal, which is composed of the last speech
sample from our test dataset (115418-9-0-20.wav). The figure below shows the waveforms of
these signals.
Passing this denoised signal in the third waveform from the figure above to the G.729
algorithm results in the following waveform.
Given that we've been able to test both hypotheses H1 and H2 and obtain improved
intelligibility and voice activity detection, it is also important that we measure how well our
enhanced voice activity decisions affect a critical GSM resource like the channel capacity.
Usually, once the GSM speech coder encodes our denoised speech, the encoded speech
is passed onto the GSM Traffic Channel (TCH). The TCH is responsible for carrying digitally
encoded speech on the forward and reverse links after a mobile has established a connection
with the GSM Base Transceiver Station (BTS). Two types of speech traffic channels exist:
TCH/FS: TCH/FS which stands for Full Rate Speech Channel (ECSTUFF4U for
Electronics Engineer, 2022), carries encoded speech at a rate of 22.8kbps.
TCH/HS: TCH/HS which stands for Half Rate Speech Channel carries up to
11.4kbps of encoded speech (ECSTUFF4U for Electronics Engineer, 2022). Its
main purpose is to support two calls in only one GSM channel.
Now, for a noiseless channel, the Nyquist capacity formula defines the theoretical
channel capacity $C$ as:

$$C = 2B\log_2 M$$

where $B$ is the channel bandwidth and $M$ is the number of discrete signal levels.
However, since we cannot have a noiseless channel in real life, we base our test on the
Shannon capacity, which gives the theoretical channel capacity for a noisy channel as:

$$C = B\log_2\left(1 + SNR\right)$$

where $SNR$ is the linear (not dB) signal-to-noise ratio.
Given that the $SNR_{dB}$ was calculated in section 4.1.2., we obtain the linear $SNR$ here using:

$$SNR = 10^{SNR_{dB}/10}$$
The channel capacity values in Table 6 indicate the maximum rate at which speech can
be transmitted through a 25kHz full-duplex channel with very small error probability
(Channel Capacity - an Overview | ScienceDirect Topics, 2022). Hence, it can be
observed that there is a significant increase in channel capacity when enhanced speech is
transmitted across the channel.
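As a sketch of this comparison, the Shannon capacity of the 25kHz channel can be computed directly from the measured SNR values; the denoised SNR below is the average reported earlier in this chapter.

```python
# Shannon capacity (bits/s) of a 25 kHz channel before and after denoising.
import numpy as np

def shannon_capacity(bandwidth_hz, snr_db):
    snr = 10 ** (snr_db / 10)              # convert SNR from dB to linear
    return bandwidth_hz * np.log2(1 + snr)

c_noisy    = shannon_capacity(25_000, 0.0)      # noisy speech at 0 dB SNR
c_denoised = shannon_capacity(25_000, 5.1737)   # average denoised SNR
print(c_noisy, c_denoised)                      # capacity roughly doubles
```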
Other important GSM resources, such as co-channel interference and power efficiency, are also worth analyzing, but this would require prior analysis of the GSM cell structure, the co-channel cell distance D, the cell radius R, and the interference power caused by an interfering co-channel base station, which is beyond the scope of our research work.
4.4. Results Discussion
The summary results in Table 5 give us an idea of how well our DNN model denoises noisy signals, based on 30 noisy sample signals. Comparing our average PESQ score (2.4755) with the 2.34 obtained by (Park & Lee, 2016), it becomes clear that our model maintains good speech quality while denoising.
Also, comparing our average STOI value (0.7016) with the 0.83 obtained by (Park & Lee, 2016), we can deduce that our model is reasonably good at preserving the intelligibility of denoised speech signals.
Another research work, by (Badescu & Cavez, 2021), reported a PESQ score of 2.1920 under -5 dB SNR conditions; compared with our PESQ score obtained under 0 dB SNR conditions, this indicates that our model provides a good perceptual quality level for 0 dB noisy speech signals.
Finally, given a full duplex channel with a bandwidth of 25 kHz and using 30 noisy and denoised speech samples, our work showed a significant increase in channel capacity, with an average of 54.598 kbps, which represents an increase of 118%.
Hence, based on these results, we can assert that we can generate denoised spectra {f(x_t)}, t = 1, ..., T, that approximate the clean spectra {y_t}, t = 1, ..., T, in the l2 norm while maintaining good perception and intelligibility levels, and that our denoised spectra can significantly improve the decisions made by the ITU-T G.729 Annex B recommended voice activity detection algorithm.
CONCLUSION
Summary
The results obtained from the proposed deep neural network, presented in the previous chapter, establish that VAD decision errors can be attenuated with the help of speech enhancement, which increases the signal-to-noise ratio and thereby improves perception and intelligibility levels.
Remarks
Our study showed a close resemblance between our performance metric results and those obtained in previous works by (Park & Lee, 2016) and (Badescu & Cavez, 2021).
Our speech enhancement model reached an accuracy of 85.75%, i.e., an RMSE of 14.25%, which indicates the error level our model exhibited during training.
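The listing below sketches one way such a figure can be derived; it assumes the RMSE is computed on magnitude spectra normalized to [0, 1] and that accuracy is simply taken as 1 - RMSE, which is how the two numbers above relate.

import numpy as np

def rmse_accuracy(predicted, target):
    # Both arrays are assumed to be magnitude spectra scaled to the [0, 1] range.
    rmse = np.sqrt(np.mean((predicted - target) ** 2))
    return rmse, 1.0 - rmse

predicted = np.random.rand(1000, 129)    # placeholder predicted STFT magnitudes
target = np.random.rand(1000, 129)       # placeholder clean STFT magnitudes
rmse, accuracy = rmse_accuracy(predicted, target)
print(f"RMSE: {rmse:.4f}  accuracy: {accuracy:.4f}")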
Despite the technical issues we encountered with our datasets and computational resources, we believe that, beyond these constraints, greater accuracy and hence a lower RMSE can be achieved.
Future Work
In our future work, we intend to increase the size of our clean speech dataset (MCV), which is about 71 GB, adjust the training parameters of the noise dataset in our model, and set appropriate train-validation-test ratios to improve the accuracy of our DNN model.
Lastly, to overcome the limited time-frequency resolution of the short-time Fourier transform, we intend to replace the STFT with the Wavelet Transform (WT) of the speech signals at the preprocessing stage, as sketched below.
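As a first step in that direction, the sketch below shows what a wavelet-based front end could look like, assuming the PyWavelets package; the Daubechies-4 wavelet, the decomposition level and the noisy.wav file name are illustrative choices only.

import librosa
import pywt

signal, fs = librosa.load('noisy.wav', sr=16000)         # placeholder input file
coeffs = pywt.wavedec(signal, wavelet='db4', level=4)     # multi-level discrete wavelet transform
# coeffs = [approximation, detail_4, detail_3, detail_2, detail_1]; these sub-band
# coefficients would replace the STFT magnitude features fed to the DNN.
reconstructed = pywt.waverec(coeffs, wavelet='db4')       # inverse transform for resynthesis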
REFERENCES
Asik, H., & Amca, H. (2019). Hand-over power level adjustment for minimizing cellular mobile
communication systems health concerns. Ciência e Técnica Vitivinícola, Vl. 34. No.
7, pp. 2416-3953.
Badescu, D. M., & Cavez, A. B. (n.d.). Speech Enhancement using Deep Learning. 33.
https://upcommons.upc.edu/bitstream/handle/2117/100596/Speech+Enhancement+usin
g+Deep+Learning.pdf?sequence=1.
Baghdasaryan, D. (2018). Real-Time noise suppression using deep learning. Towards Data Science. Retrieved on 17/02/2022 from, https://towardsdatascience.com/real-time-noise-suppression-using-deep-learning-38719819e051.
Benyassine, A., Shlomot, E., Su, H.-Y., Massaloux, D., Lamblin, C., & Petit, J.-P. (1997). ITU-
T Recommendation G.729 Annex B: A silence compression scheme for use with G.729
optimized for V.70 digital simultaneous voice and data applications. IEEE
Communications Magazine, Vl. 35. No. 9, pp. 64–73. Retrieved from,
https://doi.org/10.1109/35.620527.
Chaudhary, M. (2020). Activation functions: Sigmoid, tanh, relu, leaky relu, softmax. Medium.
Retrieved 14/02/2022 from, https://medium.com/@cmukesh8688/activation-functions-
sigmoid-tanh-relu-leaky-relu-softmax-50d3778dcea5.
Chen, J., Wang, Y., Yoho, S. E., Wang, D., & Healy, E. (2016). Large-scale training to increase
speech intelligibility for hearing-impaired listeners in novel noises. The Journal of the
Acoustical Society of America, Vl. 22. No. 6, pp 67-78. Retrieved from,
https://doi.org/10.1121/1.4948445.
Craciun, A., & Gabrea, M. (2004). Correlation coefficient-based voice activity detector algorithm.
Canadian Conference on Electrical and Computer Engineering. (IEEE Cat.
No.04CH37513). Retrieved from, https://doi.org/10.1109/CCECE.2004.1349763.
Dertat, A. (2017). Applied deep learning - Part 3: Autoencoders. Medium. Retrieved on 28/07/2022 from, https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d.
EETimes. (2003). Sorting through GSM codecs: A tutorial. EETimes. Retrieved on 14/03/2022 from, https://www.eetimes.com/sorting-through-gsm-codecs-a-tutorial.
Hahn, M., & Park, C. K. (1992). An improved speech detection algorithm for isolated Korean
utterances. [Proceedings] ICASSP-92: 1992 IEEE International Conference on
Acoustics, Speech, and Signal Processing. Retrieved from,
https://doi.org/10.1109/ICASSP.1992.2258.
Haigh, J. A., & Mason, J. S. (1993). Robust voice activity detection using cepstral features.
Proceedings of TENCON ’93. IEEE Region 10 International Conference on Computers,
Communications and Automation. Retrieved from, https://doi.org/10.1109/TENCON.1993.327987.
Healy, E. W., Delfarah, M., Vasko, J. L., Carter, B. L., & Wang, D. (2017). An algorithm to
increase intelligibility for hearing-impaired listeners in the presence of a competing
talker. The Journal of the Acoustical Society of America, Vl. 141. No. 6, pp. 4230–4239.
Retrieved from, https://doi.org/10.1121/1.4984271.
Hu, Y., & Loizou, P. C. (2008). Evaluation of objective quality measures for speech enhancement.
IEEE Transactions on Audio, Speech, and Language Processing, Vl. 16. No. 1, pp. 229–
238. Retrieved from, https://doi.org/10.1109/TASL.2007.911054.
ITU-T. (2021). P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Retrieved May 22, 2022, from https://www.itu.int/rec/T-REC-P.862-200102-I/en.
JalFaizy, S. (2017). Why are GPUs necessary for training deep learning models? Retrieved on
23/04/2022 from, https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-
deep-learning.
Jong, W. S., Joon-Hyuk, C., Barbara, S., Hwan, S. Y., & Nam, S. K. (2005). Voice activity
detection based on generalized gamma distribution. Proceedings. (ICASSP ’05). IEEE
International Conference on Acoustics, Speech, and Signal Processing. Retrieved from,
https://doi.org/10.1109/ICASSP.2005.1415230.
Jongseo, S., & Wonyong, S. (1998). A voice activity detector employing soft decision based noise
spectrum adaptation. Proceedings of the 1998 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181). Retrieved
from, https://doi.org/10.1109/ICASSP.1998.674443.
Kawamura, A., Thanhikam, W., & Iiguni, Y. (2012). Single channel speech enhancement
techniques in spectral domain. ISRN Mechanical Engineering. Retrieved from,
https://doi.org/10.5402/2012/919234.
Krishnamoorthy, P. (2011). An Overview of subjective and objective quality measures for noisy
speech enhancement algorithms. IETE Technical Review, Vl. 28. No. 4, pp. 292–301.
Retrieved from, https://doi.org/10.4103/0256-4602.83550.
Kumar, A., & Florencio, D. (2016). Speech enhancement in multiple-noise conditions using deep neural networks. Retrieved from, https://doi.org/10.21437/Interspeech.
Liu, D., Smaragdis, P., & Kim, M. (2014). Experiments on Deep Learning for Speech Denoising.
New York: Mc. Hill.
Mathworks: (2021). Denoise Speech Using Deep Learning Networks. Retrieved on 23/09/2021,
from https://www.mathworks.com/help/deeplearning/ug/denoise-speech-using-deep-
learning-networks.html.
Mozilla. (2020). Common Voice Corpus 9.0. Retrieved May 26, 2022, from https://commonvoice.mozilla.org/.
Park, S. R., & Lee, J. (2016). A Fully convolutional neural network for speech enhancement.
ArXiv:1609.07132 [Cs]. Retrieved on 24/11/21 from, http://arxiv.org/abs/1609.07132.
Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual evaluation of
speech quality (PESQ)-a new method for speech quality assessment of telephone
networks and codecs. IEEE International Conference on Acoustics, Speech, and Signal
Processing. Proceedings, Vl. 2. No. 1, pp. 749–752. Retrieved from, https://doi.org/10.1109/ICASSP.2001.941023.
Becker, D. (2018). Running kaggle kernels with a GPU. Retrieved on 4/11/2021, from
https://kaggle.com/dansbecker/running-kaggle-kernels-with-a-gpu.
Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. Proceedings of the 22nd ACM International Conference on Multimedia, pp. 1041–1044. Retrieved from, https://doi.org/10.1145/2647868.2655045.
Sachin, K. (2020). Data splitting technique to fit any machine learning model . Towards data
science. Retrieved 5/11/2021 from, https://towardsdatascience.com/data-splitting-
technique-to-fit-any-machine-learning-model-c0d7f3f1c790.
Samaya, M., Jeremy, N., Romeo, K., & Alex, A. (2021). Building deep learning models with
tensorflow—home | coursera [E-learning]. Building deep learning models with
tensorflow. Retrieved on 11/11/2021, from https://www.coursera.org/learn/building-
deep-learning-models-with-tensorflow/home/welcome.
Simone, G., & Carl, H. (2021). Intelligibility prediction for speech mixed with white Gaussian
noise at low signal-to-noise ratios. The Journal of the Acoustical Society of America: Vl.
149. No 2. Retrieved on 9/11/2021 from,
https://asa.scitation.org/doi/full/10.1121/10.0003557.
Shivakumar, P. G., & Georgiou, P. (2016). Perception optimized deep denoising autoencoders
for speech enhancement. In, Prashanth, G. S., & Panayiotis, G. (Eds.).
Interspeech. New York: University Press, pp. 3743–3747.
Silva, T. (2019). Practical deep learning audio denoising. Thalles' blog. Retrieved on 29/11/2021 from, https://sthalles.github.io/practical-deep-learning-audio-denoising/.
Sohn, J., Kim, N., & Sung, W. (1999). A statistical model-based voice activity detection. IEEE
Signal Processing Letters, Vl. 6. No. 1, pp. 1–3.
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2010). A short-time objective
intelligibility measure for time-frequency weighted noisy speech. IEEE Society (Ed.).
IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Society, pp. 4214–4217.
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An Algorithm for intelligibility
prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio,
Speech, and Language Processing, Vl. 19. No. 7, pp. 2125–2136. Retrieved from, https://doi.org/10.1109/TASL.2011.2114881.
Tashev, I. J., & Mirsamadi, S. (2016). DNN-based causal voice activity detector. Retrieved on 17/09/2021 from, https://www.semanticscholar.org.
The MathWorks, Inc. (n.d.). Denoise Speech using deep learning networks—MATLAB &
Simulink. Retrieved on 15/03/2021, from
https://www.mathworks.com/help/audio/ug/denoise-speech-using-deep-learning-
networks.html.
Yuliani, A. R., Amri, M. F., Suryawati, E., Ramdan, A., & Pardede, H. F. (2021). Speech enhancement using deep learning methods. Jurnal Elektronika dan Telekomunikasi, Vl. 21. No. 1, pp. 16-19. Retrieved from, https://doi.org/10.14203/jet.v21.19-26.
APPENDICES
Appendix E: test_tf_record.py
import tensorflow as tf
import numpy as np
from utils import play
from data_processing.feature_extractor import FeatureExtractor
train_tfrecords_filenames = '../kaggle/working/records/test_0.tfrecords'
def tf_record_parser(record):
    # Parse one serialized example into the noisy input features, the clean
    # target magnitude and the noisy phase (used later for resynthesis).
    keys_to_features = {
        "noise_stft_phase": tf.io.FixedLenFeature((), tf.string, default_value=""),
        'noise_stft_mag_features': tf.io.FixedLenFeature([], tf.string),
        "clean_stft_magnitude": tf.io.FixedLenFeature((), tf.string)
    }
    features = tf.io.parse_single_example(record, keys_to_features)

    n_features = 129   # frequency bins per STFT frame
    n_segments = 8     # consecutive frames fed to the network (see dataset.py)

    # Decode the raw byte strings back to float32 and restore the tensor shapes
    # (shapes assumed from numFeatures=129 and numSegments=8 used in dataset.py).
    noise_stft_mag_features = tf.io.decode_raw(features['noise_stft_mag_features'], tf.float32)
    clean_stft_magnitude = tf.io.decode_raw(features['clean_stft_magnitude'], tf.float32)
    noise_stft_phase = tf.io.decode_raw(features['noise_stft_phase'], tf.float32)

    noise_stft_mag_features = tf.reshape(noise_stft_mag_features, (n_features, n_segments, 1))
    clean_stft_magnitude = tf.reshape(clean_stft_magnitude, (n_features, 1, 1))
    noise_stft_phase = tf.reshape(noise_stft_phase, (n_features,))

    return noise_stft_mag_features, clean_stft_magnitude, noise_stft_phase
train_dataset = tf.data.TFRecordDataset([train_tfrecords_filenames])
train_dataset = train_dataset.map(tf_record_parser)
train_dataset = train_dataset.repeat(1)
train_dataset = train_dataset.batch(1000)
train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
window_length=256
overlap=64
sr = 16000
Appendix F: dataset.py
import librosa
import numpy as np
import math
from feature_extractor import FeatureExtractor
from utils import prepare_input_features
import multiprocessing
import os
from utils import get_tf_feature, read_audio
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
np.random.seed(999)
tf.random.set_seed(999)
class Dataset:
    def __init__(self, clean_filenames, noise_filenames, **config):
        self.clean_filenames = clean_filenames
        self.noise_filenames = noise_filenames
        self.sample_rate = config['fs']
        self.overlap = config['overlap']
        self.window_length = config['windowLength']
        self.audio_max_duration = config['audio_max_duration']

# Excerpt: a noise file is sampled at random to be mixed with each clean file.
noise_filename = self._sample_noise_filename()

# Excerpt: existing TFRecord shards are skipped, otherwise a writer is opened.
if os.path.isfile(tfrecord_filename):
    print(f"Skipping {tfrecord_filename}")
    counter += 1
    continue

writer = tf.io.TFRecordWriter(tfrecord_filename)

# Excerpt: each (noisy magnitude, clean magnitude, noisy phase) tuple is turned
# into the 129 x 8 input feature tensor expected by the network.
for o in out:
    noise_stft_magnitude = o[0]
    clean_stft_magnitude = o[1]
    noise_stft_phase = o[2]
    noise_stft_mag_features = prepare_input_features(noise_stft_magnitude,
                                                     numSegments=8, numFeatures=129)
    counter += 1
writer.close()
Appendix G:
audioSource = dsp.AudioFileReader('SamplesPerFrame',80,...
'Filename','noisy-input.wav',...
'OutputDataType', 'single');
scope = dsp.TimeScope(2, 'SampleRate', [8000/80, 8000], ...
'BufferLength', 80000, ...
'YLimits', [-0.3 1.1], ...
'ShowGrid', true, ...
'Title','Decision speech and speech data', ...
'TimeSpanOverrunAction','Scroll');
% Initialize VAD parameters
VAD_cst_param = vadInitCstParams;
clear vadG729
% Run for 5 seconds (500 frames of 10 ms each)
numTSteps = 500;
while(numTSteps)
% Retrieve 10 ms of speech data from the audio file reader
speech = audioSource();
% Call the VAD algorithm
decision = vadG729(speech, VAD_cst_param);
% Plot speech frame and decision: 1 for speech, 0 for silence
scope(decision, speech);
numTSteps = numTSteps - 1;
end
release(scope);
Appendix H: