Voice Activity Detection Using Deep Learning (Speech Enhancement) For GSM Telephony
By
Daniel Graham Boaz
May, 2022
DEDICATION
UNIVERSITY OF BUEA
CERTIFICATION
The dissertation of Daniel Graham Boaz (CT19P008) entitled “Voice Activity
Detection Using Deep Learning for GSM Telephony”, submitted to the Department of
Electrical and Electronic Engineering, College of Technology of the University of Buea,
in partial fulfillment of the requirements for the award of the Master of Technology
(M.Tech.) degree in Telecommunications and Networks, has been read, examined and
approved by the examination panel composed of:
Date:
Dr. Sone Ekonde Michael (AP)
(Director)
ACKNOWLEDGEMENT
First and foremost, I thank God for letting me live to see this dissertation through. I am
deeply grateful to my supervisors who were more than generous with their expertise and
precious time. A special thanks to Dr. Sone Ekonde Michael and Dr. Feudjio Cyrille, for
their countless hours of reflecting, reading, encouraging, advising and, most of all,
their patience throughout the entire process.
I would like to thank all my Master’s lecturers, the various heads of department, and all
the College of Technology staff for the strong theoretical knowledge which laid the
foundation for this work.
I’m most grateful to my parents for their support, encouragement and motivation
throughout this work. To my mother Sahan Annie Angele, I heartily appreciate her
prayers and efforts towards the completion of this work and for supporting me strongly
in my endeavors.
Last but not least, I express sincere thanks to my whole family, all my classmates and
friends who have patiently extended all sorts of help for the accomplishment of this
undertaking.
ABSTRACT
Voice Activity Detectors (VAD) are algorithms for detecting the presence of speech
signals in a mixture of speech and noise. They play an essential role in speech coders
for GSM telephony, where they operate as binary classifiers that flag the audio frames
in which voice is detected. However, in a low-SNR environment, the presence of babble
noise drastically reduces the intelligibility of speech, resulting in poor VAD decisions.
In this dissertation, we propose using deep learning to learn the mapping between
noisy speech and clean speech features in order to improve VAD decisions. Specifically,
we propose using fully convolutional neural networks (CNN), which automatically
extract distinctive features of noisy and clean speech spectra using few network
parameters. The proposed network model showed improved subjective and
objective measures, with an average PESQ of 2.4755, an average STOI of 0.7016, an improved
average SNR of 5.1737 dB, and an average channel capacity improvement of 118% for noisy
speech samples at 0 dB SNR.
Key Words: Speech, Voice Activity Detection (VAD), Deep Neural Network (DNN),
Convolutional Neural Networks (CNN).
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.3. Objectives
CHAPTER 2: LITERATURE REVIEW
2.2.1. Speech Detection Using Energy and Zero Crossing Rate
2.2.4. Speech Detection Using Cepstral Features and Mel-Energy Features
2.3. Demystifying the VAD Problem (The Rationale of our Work)
CHAPTER 3: METHODOLOGY
CHAPTER 4: RESULTS
CONCLUSION
Summary
Remarks
REFERENCES
APPENDICES
Appendix E: test_tf_record.py
Appendix F: dataset.py
Appendix H: Clean Speech, Noisy Speech, and Denoised Speech spectrograms
LIST OF TABLES
LIST OF FIGURES
Figure 2: 2-layer fully connected network (Badescu & Cavez, 2021)
Figure 6: Modified Convolutional Encoder-Decoder Network (CED) (Park & Lee, 2016)
Figure 7: Proposed Redundant CED (R-CED) (Park & Lee, 2016)
Figure 9: The Librosa downsampler subsystem (Signals and Systems - OpenStax CNX, 2021)
Figure 10: Librosa split method (Librosa — Librosa 0.8.1 Documentation, 2021)
Figure 11: Librosa stft function (Librosa — Librosa 0.8.1 Documentation, 2021)
Figure 13: Deep learning training scheme (The MathWorks, Inc., 2021)
Figure 14: STFT predictor and target vector inputs (The MathWorks, Inc., 2021)
Figure 18: Reducing the MSE of our DNN model (Silva, 2019)
ABBREVIATIONS
API Application Programming Interface
DD Decision Directed
FT Fourier Transform
ML Machine Learning
PC Personal Computer
WT Wavelet Transform
CHAPTER 1
INTRODUCTION
1.1. Overview
Speech is the predominant means of communication between human beings, and since the
invention of the telephone by Alexander Graham Bell in 1876, speech services have
remained the core service in almost all telecommunication systems. Speech coders
in GSM telephony are used to compress the bit rate (bandwidth) of the speech signal
before its transmission while keeping an acceptable perceived quality of the decoded
output speech signal. This speech signal is often corrupted by an interfering signal
(babble noise), which has a harmful contaminating effect on the signal-to-noise ratio of
the resulting speech signal.
With the recent advances in speech signal processing techniques, the need to accurately
detect the presence of speech in the incoming signal under different noise environments
has become a major industry concern. Separation of the speech fragment from the non-speech
fragment in an audio signal has been achieved over the years using Voice Activity
Detectors (VAD). VADs are a class of signal processing methods that detect the
presence or absence of speech in short segments of an audio signal. They play a pivotal role
as the preprocessing block in a wide range of speech applications, thereby providing
improved channel capacity, reduced co-channel interference and power consumption in
portable electronic devices in cellular radio systems, and simultaneous voice and data
applications in multimedia communications.
In the past decades, a lot of work has been done with regard to enhancing the speech on the
one hand and enhancing VAD decisions on the other. Though there might be some
closeness between these two approaches, the difference in the outcome lies in
the voice frequency band, which in some cases could be considered as the unwanted
signal.
The voice frequency band – which ranges from approximately 300 to 3400 Hz and is
present in both the additive noise and the clean input speech signal – seems to cause
some obfuscation to the voice activity detector (VAD), which in turn renders a
perceptually non-speech frame as a clean speech frame.
Most VAD algorithms assume the background noise is stationary – often following a
Gaussian distribution – in one speech frame, and the same assumption is made for
consecutive speech frames. In reality, however, interfering noise signals can
sometimes switch from one form to another (e.g., from a railway station to a talking
crowd), thereby often causing VAD detection/decision errors.
1.3. Objectives
In order to attenuate the decision errors made by VADs, we set out to denoise the noisy
speech (enhance the speech) right before the Voice Activity Detector.
Thus, our main objective in this dissertation is to propose a speech enhancement model
that learns the mapping between noisy speech spectra and clean speech spectra, using
deep neural networks (DNN), to suppress both stationary and non-stationary background
noise, thereby bringing about improved VAD decisions. To this end, we seek to answer the
following research questions:
1. Can decision errors made by voice activity detectors, in very low SNR conditions,
be attenuated with the help of Speech Enhancement, bearing in mind mobile
device constraints in narrowband GSM Telephony?
2. How well does speech enhancement affect the perception and intelligibility of the
denoised/enhanced speech signals?
This work will be limited to the ITU-T G.729 Annex B recommendation of the Voice
Activity Detection algorithm. The performance of the speech enhancement method will
be evaluated based on subjective and objective measures, as suggested by
(Krishnamoorthy, 2011). The speech and noise sounds used for analysis and
implementation will be sound files freely provided by (Mozilla Common Voice, 2021)
and (Salamon et al., 2014), respectively. The speech sounds will be a subset of Mozilla
Common Voice provided by MATLAB (The MathWorks, Inc., 2021). The evaluation
process will be limited to simulations using the aforementioned sound files and no real-
time implementation of GSM telephony will be done.
This study is structured as follows: Chapter 2 introduces the literature about VADs, GSM
Speech Coders, digital spectral analysis, deep neural networks, digital signal processing,
and previous works. It also presents our hypotheses based on our problem statement.
Chapter 3 describes the strategy developed in this dissertation to address the issues of
voice activity detection. It presents how our neural network is designed and implemented
with VAD. Chapter 4 presents the results of our implementation and conclusions are
drawn.
CHAPTER 2
LITERATURE REVIEW
In this section, we present the conceptual idea behind voice activity detection and a literature
survey of the existing work done in the area. We also go through some recent research
works in speech enhancement with deep learning, introduce deep learning, and present
the models used for our hypothesis.
VAD, also known as speech activity detection (SAD), aims to detect the presence of
speech in an audio signal. This might include a scenario of identifying when the signal
from a hidden microphone contains speech so that a voice recorder can operate (also
known as a Voice Operated Switch or VOS). Another example would be a mobile phone
using a VAD to decide when one person in a call is speaking so it transmits only when
speech is present (by not transmitting frames that do not contain speech, the device might
save over 50% of radio bandwidth and operating power during a typical
conversation) (McLoughlin, 2016).
In cellular radio systems (for instance GSM and CDMA systems) based on Discontinuous
Transmission (DTX) mode, VAD is essential for enhancing system capacity by reducing
co-channel interference and power consumption in portable digital devices. To reduce the
annoying modulation of the background noise at the receiver (noise contrast effects),
Comfort Noise Generation (CNG) is used, inserting a coarse reconstruction of the
background noise at the receiver (Scourias, 1995).
(Hahn & Park, 1992) propose a simple yet effective speech detection algorithm that
classifies frames based on differential logarithmic energy and the zero-crossing rate.
Another VAD technique to improve word boundary detection for varying background
noise levels was suggested by (Craciun & Gabrea, 2004), where noise parameters are
initially estimated from the initial frames and then updated using a first-order
autoregressive filter during the silence periods. The correlation coefficients for the
instantaneous spectrum and an average of the background noise spectrum are the
parameters employed in this approach. Subsequently, a statistical approach using a basic
binary Markov model is used for voice activity detection.
(Sohn et al., 1999) proposed a Statistical Model-Based VAD (SMVAD), in which the
decision rule was obtained from the Likelihood Ratio Test (LRT) by utilizing the
Maximum Likelihood (ML) criterion to estimate the unknown parameters. Further
improvements were made by optimizing the decision rule for the estimate of unknown
parameters using the Decision-directed (DD) technique (Jongseo Sohn & Wonyong Sung,
1998). To achieve robustness in low SNRs, the proposed algorithm further optimized the
decision rule by adapting the decision threshold using the measured noise energy.
Haigh et al. demonstrated the robustness to various background noise levels for successful
end-of-speech identification using cepstral feature-based thresholds (Haigh & Mason,
1993). Chin-Teng Lin et al. suggested Enhanced/Improved Time-Frequency (ETF)
and Minimum Mel-Scale Frequency Band (MIMSB) parameters, collected from a multi-band
spectral analysis using Mel-scale frequency banks, to create a robust word boundary
detection method (ETF VAD) (Chin-Teng Lin et al., 2002).
Jong Won et al. proposed a detection algorithm in which the distributions of noise spectra
and noisy speech spectra including speech-inactive intervals are modeled by a set of
GΓD’s and applied to the LRT for VAD. The parameters of GΓD are estimated through
an online Maximum Likelihood (ML) estimation procedure where the Global Speech
Absence Probability (GSAP) is incorporated under a forgetting scheme. The proposed
VAD algorithm based on GΓD proved to outperform the algorithms based on other
statistical models discovered so far (Jong Won Shin et al., 2005).
(Tashev & Mirsamadi, 2016) proposed an algorithm for causal VAD based on DNNs.
The DNN is trained on segments of several consecutive audio frames, and with all
frequency bins together to utilize the correlation between the frames and bins. No
assumptions are made for any prior distribution of the noise and speech signals and the
DNN is expected to learn the dependency between the input features and the VAD
decision. It is shown that the proposed algorithm and DNN structure outperform the classic,
statistical model-based VAD for both seen and unseen noises.
(Farahani, 2017) proposed the ANS method which, instead of removing the lower-lag
autocorrelation components of the noisy signal – as is the case with other autocorrelation-based
noise suppression methods – tries to estimate the noise autocorrelation sequence and
subtracts it from the noisy signal autocorrelation sequence. It uses the average
autocorrelation of a number of non-speech frames of the noisy utterance as an estimate
of the noise autocorrelation sequence, given by:
$$\hat{r}_{vv}(k) = \frac{1}{P}\sum_{i=0}^{P-1} r_{yy}(i,k), \qquad 0 \le k \le N-1 \tag{1}$$
where 𝑃 is the number of non-speech frames, 𝑟𝑦𝑦 (𝑖̇, 𝑘) the autocorrelation sequence of
the noisy speech frame. This resulted in obtaining the autocorrelation sequence of the
clean speech signal expressed as
$$r_{yy}(m,k) = \frac{1}{N-k}\sum_{i=0}^{N-1-k} y(m,i)\,y(m,i+k) \tag{3}$$
with 𝑦(𝑚, 𝑖) being the noisy speech sample composed of the clean speech 𝑥(𝑚, 𝑖) and
the noise 𝑣(𝑚, 𝑖), 𝑁 is the frame length, 𝑖 is the discrete time index in the frame, and 𝑘 is
the autocorrelation sequence index within each frame.
However, this method requires a VAD to obtain the non-speech frames used to estimate
the noise autocorrelation, which doesn’t align with our objectives, as the VAD might
make errors in classifying clean speech and non-speech frames. We throw more light on
this in section 2.3.
It can be observed from the VAD decision depicted in APPENDIX B and magnified in
APPENDIX C, that the VAD red markers don’t seem to efficiently demarcate the
unvoiced and voiced parts of the noisy speech, as there is a combination of both the clean
speech and the unwanted speech babble noise in the noisy signal.
Thus, it would be more advantageous, if the noisy speech signal were denoised
(enhanced) before being passed through a voice activity detector – a topic which we
discuss further in section 2.4.
Speech Enhancement has been a concern for a long time now. It aims to improve speech
quality by attenuating interfering noise. We want to filter out unwanted noise from an
input noisy signal without damaging the speech quality. For instance, if someone is
talking in a phone call conversation while a piece of music is playing in the background
or while running, a speech enhancement system's job, in this case, is to remove or filter
out the background noise (i.e., background music, or body movement sounds) to improve
the speech signal.
Speech enhancement techniques can be classified into two types based on the number of
microphones used. Single-channel techniques recover speech from a noisy signal captured by
only one microphone, whereas multi-channel techniques require several microphones and are
more expensive to implement and compute. Our focus in this dissertation is on single-channel
speech enhancement.
Most single-channel speech enhancement techniques operate in the spectral domain
(Kawamura et al., 2012), which is preferable for use in a cell phone. (Ortega-Garcia
& Gonzalez-Rodriguez, 1996) give an overview of single-channel speech enhancement
techniques. However, the major limitation of speech enhancement in the spectral domain
is the fact that it still assumes the noise process to be stationary and, hence, will not be
successful for non-stationary forms of background noise.
On the other hand, there has been a lot of recent progress in deep neural networks (DNN)
for different signal processing tasks and several deep learning methods for single-channel
speech enhancement have been developed. Also, recent innovations in convolutional
neural networks (CNN) make them beneficial for speech enhancement by training the
model on spectrogram features.
The next section reviews some of the recent works in speech enhancement with DNNs.
In the following subsections, we explore some of the DNN techniques used for speech
enhancement proposed in the literature.
2.4.2.1. 2Hz
(Kumar & Florencio, 2016) proposed a speech enhancement method that focuses
primarily on the presence of multiple noises simultaneously corrupting the speech.
Specifically, it deals with improving speech quality in an office environment where
multiple stationary, as well as non-stationary noises, can be simultaneously present in
speech. It is shown that noise-aware training is quite helpful in speech enhancement, as
well as in complex noise conditions.
(Park & Lee, 2016) try to solve the problem of speech enhancement by finding a
‘mapping’ between noisy speech spectra and clean speech spectra via supervised learning.
Specifically, they propose using fully Convolutional Neural Networks (CNN), which
consist of fewer parameters than fully connected networks. The CNN used
is a new architecture, the Redundant Convolutional Encoder-Decoder (R-CED), which is shown
to be 12 times smaller in size than other networks while achieving better performance. The
network extracts redundant representations of a noisy spectrum at the encoder and maps
them back to a clean spectrum at the decoder. This can be viewed as mapping the spectrum
to higher dimensions and projecting the features back to lower dimensions.
In section 2.5., we briefly present why deep learning is so popular nowadays, the areas
where it is implemented, and some of the most used architectures.
In recent years, the advance in deep learning technologies has provided great support for
the progress in Image Processing, Video Processing, Machine Translation, and Speech
Enhancement research fields. Unlike traditional speech enhancement approaches that
depend on statistical models, like spectral subtraction, Wiener filtering, and minimum
mean square error, deep learning approaches built on a data-driven paradigm have shown
outstanding speech enhancement performance over their predecessors (Yuliani et al.,
2021). This is mainly due to their ability to model complex non-linear mapping functions
(Shivakumar & Georgiou, 2016).
The next section provides an overview of Deep Learning and how it works.
Deep Learning is a new area of Machine Learning research, which has been introduced
with the objective of moving Machine Learning closer to one of its original goals:
Artificial Intelligence. Deep Learning is all about learning multiple levels of
representation and abstraction that help to make sense of data such as images, sound, and
text.
Deep Learning relies on Deep Neural Networks, which are in charge of learning from
the input training data. In this dissertation, we will be using two specific types of Deep Neural
Networks, which we describe below:
Convolutional Neural Networks (CNN): These are networks that learn directly
from samples by optimizing their filters (or kernels) through automated learning,
as compared to traditional algorithms where these filters are rather hand-
engineered.
Autoencoders: These are networks that work like Restricted Boltzmann Machines
(RBM), but use encoders to encode an unlabeled input dataset into short-codes
and use them to reconstruct (decode) the original input data while extracting the
most valuable information (features) from the input data. (Samaya et al., 2021)
The most basic Neural Network is the fully connected network, which is composed of a
deep network of linear classifiers.
To better understand how a linear classifier works, Figure 1 represents its common
architecture. Its equation is expressed as follows:

$$Y = WX + b \tag{4}$$
The network's ability to learn is determined by the weights and biases. The network's goal
is to learn, from the training data, the weight and bias parameters that minimize the error.
The loss function is the function that measures the error during the learning process.
Cross-Entropy and Mean Squared Error are typical loss functions that can be used
to minimize this error: Cross-Entropy is more commonly used in classification, while
Mean Squared Error is more commonly used in regression. An optimizer is required to
reduce the error; Gradient Descent, particularly Stochastic Gradient
Descent (SGD), is a well-known optimizer. Linear models are stable, but they have
many limitations. To retain the parameters of linear functions while making the overall
model non-linear, a step further must be taken and non-linearities (activation functions) must be introduced.
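As a minimal sketch of these ideas (not the model used later in this dissertation), the snippet below builds a small fully connected network in Keras with a ReLU non-linearity, a Mean Squared Error loss, and an SGD optimizer; the data are randomly generated placeholders.

```python
# Minimal sketch: a dense network (Y = WX + b per layer) with a ReLU non-linearity,
# trained with Stochastic Gradient Descent on a Mean Squared Error loss.
import numpy as np
import tensorflow as tf

X = np.random.randn(100, 4).astype("float32")   # toy inputs (placeholder data)
Y = np.random.randn(100, 1).astype("float32")   # toy regression targets

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),  # weights and bias are learned
    tf.keras.layers.Dense(1),                                       # linear output layer
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
model.fit(X, Y, epochs=5, batch_size=16, verbose=0)
```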
Finally, the concepts of an epoch, batch size, iteration, learning rate, and overfitting must
be explained. An epoch corresponds to the processing of the entire training dataset once.
A batch, on the other hand, is a smaller portion of the dataset over which the gradient is
calculated, so several iterations are required before an epoch is completed. This is an
advantage, as the system gets to be faster, which is why it is required to establish a batch size.
As its name indicates, the learning rate sets the speed of learning: a learning rate that is too
high increases the error, while one that is too low slows convergence and encourages overfitting.
When overfitting appears, the training should stop. To identify it, the training loss and the
validation loss must be observed: when the validation loss is increasing while the training loss
is decreasing, it is a clear signal of overfitting.
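A sketch of how this check can be automated with Keras is shown below; the EarlyStopping callback halts training when the validation loss stops improving. The model and dataset names are placeholders.

```python
# Stop training automatically when the validation loss stops improving,
# which is the overfitting signal described above.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",          # watch the validation loss
                           patience=3,                   # tolerate 3 epochs without improvement
                           restore_best_weights=True)    # roll back to the best epoch

history = model.fit(X_train, Y_train,                    # placeholder training data
                    validation_data=(X_val, Y_val),      # placeholder validation data
                    epochs=100, batch_size=32,
                    callbacks=[early_stop])
```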
One important type of Neural Networks is the Convolutional Neural Network, which has
been greatly used over the years, to enable machines to view the world as humans do,
perceive it in a similar manner, and even use the knowledge for a multitude of tasks such
as Image & Video recognition, Image Analysis & Classification, Media Recreation,
Recommendation Systems, Natural Language Processing, to name a few.
In section 2.7. and section 2.8., we review the literature behind Autoencoders, CNNs,
CNN extensions proposed by (Park & Lee, 2016), and their relevance to this dissertation.
2.7. Autoencoders
Autoencoders are neural networks that compress the input into a lower-dimensional code
and then decode (reconstruct) the output from this representation (code). This code is a
compact “summary” or “compression” of the input, also called the latent-space
representation.
2.7.1. Working Principle
An autoencoder consists of 3 main components: encoder, code and decoder. The encoder
summarizes (compresses) the input and produces a code, which is used by the decoder to
reconstruct the input. The figure below is a depiction of the architecture of an
autoencoder.
First, the input in Figure 3 is passed through the encoder, which is a fully connected neural
network, to produce the code. The decoder, which has a similar neural network
structure, then produces the output using this code. The idea here is to get an output
identical to the input.
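A minimal Keras sketch of this principle is given below; the 128-dimensional input and 16-dimensional code are illustrative sizes, not those used in our model.

```python
# Minimal fully connected autoencoder: the encoder compresses the input into a short code,
# the decoder reconstructs the input from that code; trained to reproduce its own input.
import tensorflow as tf

inputs  = tf.keras.Input(shape=(128,))
code    = tf.keras.layers.Dense(16, activation="relu")(inputs)    # encoder -> latent code
outputs = tf.keras.layers.Dense(128, activation="linear")(code)   # decoder -> reconstruction

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=10)  # note: the target is the input itself
```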
In section 2.8., we review CNNs and how they can be nested with autoencoder layers.
Inspired by early findings in the study of biological vision, the name "convolutional
neural network" indicates that the network employs a mathematical operation called
convolution. Convolutional networks are a specialized type of neural network that uses
convolution in place of general matrix multiplication in at least one of their layers.
The architecture of a CNN is inspired by the organization of the Visual Cortex and is
analogous to the connectivity pattern of Neurons in the Human Brain. Individual neurons
can only respond to stimuli in a small area of the visual field called the Receptive Field.
A number of similar fields can be stacked on top of each other to span the full visual field
(Saha, 2018).
In computer vision applications, the CNN algorithm takes an input image and gives
relevance (learnable weights and biases) to various aspects/objects in the image, allowing
it to distinguish between them. When compared to other classification algorithms, the
amount of pre-processing required by a CNN is significantly less: while filters are hand-engineered
in traditional algorithms, CNNs are able to learn these filters through training.
In addition, each filter h_i is replicated across the entire layer. These replicated units share
the same parametrization (weight vector and bias) and form a feature map. In Figure 5 a
CNN feature map can be observed.
For these reasons, Convolutional Neural Networks are perfectly fit for image and video
processing, but also for audio processing.
The image in Figure 6 shows a simple CNN architecture for classifying handwritten digits
images.
Figure 6 above shows us how the CNN layers reduce the dimension of images into a form
that is easier to process, without losing features that are critical for getting a good
prediction.
Having seen the working principle of CNNs, we will now review some of the CNN
extensions proposed by (Park & Lee, 2016).
A Convolutional Encoder-Decoder (CED) network consists of symmetric encoding
and decoding layers in which each block represents a feature. This is depicted in Figure
7.
Figure 7: Modified Convolutional Encoder-Decoder Network (CED) (Park & Lee, 2016)
No pooling layer is present, and thus no up-sampling layer is required. Opposite to CED,
R-CED encodes the features into higher dimensions along the encoder and achieves
compression along the decoder. The number of filters is kept symmetric: at the encoder,
the number of filters is gradually increased, and at the decoder, the number of filters is
gradually decreased. The last layer is a convolution layer, which makes R-CED a fully
convolutional network.
Compared to an R-CED of the same network size (i.e., with the same number of
parameters), the Cascaded R-CED (CR-CED) achieves better performance, both in terms of
intelligibility and perceptual quality, with less convergence time.
In the next section, we develop our hypothesis based on this deep learning network.
The previous subsections discussed existing VAD and speech enhancement methods and
the theory concerning VAD, speech enhancement, and Deep Learning. With the help of
the research conducted by (Park & Lee, 2016), we formulate our hypothesis in two parts.
H1. Given a segment of noisy spectra $\{x_t\}_{t=1}^{T}$ and clean spectra $\{y_t\}_{t=1}^{T}$, we aim to
learn a mapping $f$ which generates a segment of denoised spectra $\{f(x_t)\}_{t=1}^{T}$ that
approximates the clean spectra $y_t$ in the $\ell_2$ norm.
Specifically, we formulate $f$ using a fully convolutional neural network, such that the past
$n_T$ noisy spectra $\{x_i\}_{i=t-n_T+1}^{t}$ are considered to denoise the current spectrum, i.e., the
network is trained to minimize

$$\sum_{t=1}^{T}\left\|y_t - f(x_{t-n_T+1},\ldots,x_t)\right\|_2^2 \tag{6}$$
H2. Can the denoised spectra, {𝑓(𝑥𝑡 )}𝑇𝑡=1 , obtained from H1, improve the decisions
made by the ITU-T G.729 Annex B recommended voice activity detection
algorithm?
For this dissertation, we shall adopt the CR-CED network model to perform single
channel speech enhancement of noisy speech signals in order to improve VAD algorithm
decisions.
In the following chapter, we will present the methodology used to implement our CR-
CED DNN architecture for eliminating noise from speech signals.
CHAPTER 3
METHODOLOGY
This chapter explains how we use Deep Learning to create an architecture capable of
mapping noisy speech signals to their clean variants. To begin, we'll go over the
technologies used for building/training our denoising algorithm and the dataset we used to
train and test our speech enhancement model. Next, we discuss the modules we used for
preparing the dataset for the Neural Network. Then, we will present the architecture of
our suggested Deep Neural Network model. Finally, we test hypothesis H2 using our
DNN model.
In this section, we present our computing toolkit used for testing our hypothesis. Our
toolkit can be classified into 2 types, namely;
Computing: which involves the computing resources we used for building our
DNN architecture, e.g., servers, computers, etc.
Analytics: which involves any software tool we used for testing hypotheses.
Traditional machine learning techniques are often used when the dataset size is small;
however, their performance greatly degrades when the dataset size gets larger. On the other
hand, deep learning exhibits advantageous scalability with huge dataset sizes, hence the
need for great computing power. Graphics Processing Units (GPUs) are usually
responsible for delivering the computing power needed for these tasks instead of CPUs, as
they provide a large number of concurrent threads compared to the single-thread
performance optimization provided by a CPU (“Why Are GPUs Necessary for Training
Deep Learning Models?,” 2017).
Given that we could not afford a PC with a good GPU and the recommended 16GB of RAM
(Running Kaggle Kernels with a GPU, 2021), we opted to use Kaggle Kernels.
Kaggle provides free access to NVIDIA K80 GPUs in kernels (Running Kaggle Kernels
with a GPU, 2021) with 16GB of RAM available. This results in a 12.5× speedup during
training of a deep learning model, with a total run-time of 994 seconds as compared to
13,419 seconds with only one CPU.
This toolkit consists of the software and libraries we used in building, training, and testing
our denoising model. This comprises of:
Programming languages
Software packages
A powerful deep learning API we will be using for creating our DNN model is Keras, which
runs on top of TensorFlow and was developed with a focus on enabling fast
experimentation with neural network architectures thanks to its total modularity, minimalism,
and extensibility. Furthermore, it supports convolutional networks, recurrent networks,
and combinations of both, including multi-input and multi-output training.
Pandas: which is used for data analysis and manipulation of our audio signals
Numpy: which adds support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate
on these arrays
Scikit-learn: which is used for clustering and dimensionality reduction within
our model
Librosa: which is a music and audio analysis library that provides the
building blocks necessary to create music information retrieval systems.
DSP toolbox: This provides algorithms, apps, and scopes for designing,
simulating, and analyzing signal processing systems. This will be used to
resample audio signals within MATLAB for SNR analysis.
Audio toolbox: This provides tools for audio processing, speech analysis, and
acoustic measurements. This will be used to read audio files into MATLAB for
SNR analysis.
Simulink toolbox: This is a MATLAB-based graphical programming
environment for modeling, simulating, and analyzing multidomain dynamical
systems. This will be used in the results section of this dissertation to test our
hypothesis H2.
3.2. Datasets
The experiment was conducted using 2 publicly available audio datasets namely:
The Mozilla Common Voice (MCV): This dataset contains as many as 75,879
recorded clean speech clips – about 65GB, comprising 2,637 validated hours
spread over short MP3 files. Due to the lack of adequate computing resources,
we use a minified version of this dataset provided by MathWorks at (Denoise
Speech Using Deep Learning Networks, 2021). It only contains 2,800 recorded
clean speech samples and weighs only 988MB. The MCV project is open source
and anyone can contribute to it. The wide range of speakers in this dataset is
one of its best features: it includes fragments of male and female recordings from
a wide range of ages and foreign accents.
UrbanSound8K: This dataset contains 8732 labeled sound excerpts of urban
noise sounds classified into 10 different commonly found urban sounds. This
includes: air conditioner, car horn, children playing, dog bark, drilling, engine
idling, gun shot, jackhammer, siren, and street music. These classes are drawn
from the urban sound taxonomy and can be found at
https://urbansounddataset.weebly.com/urbansound8k.html.
We will use these urban sounds as noise signals added to the clean speech samples from the
MCV dataset. In other words, similar to Figure 9, we shall first take a clean speech signal
– this can be someone speaking a random sentence from the MCV dataset – then we add
noise to it, thereby synthetically creating a scenario where a woman is speaking and a
dog is barking in the background, and finally we use this artificially created noisy signal
as the input to our deep learning model. Our Neural Network, in turn, will receive this
noisy signal and try to compute a clean representation of it.
Figure 9 displays a visual representation of a clean input signal from the MCV, a noise
signal from the UrbanSound dataset, and the resulting noisy input – that is, the input speech
after adding noise to it. Also, note that the noise power is set so that the signal-to-noise
ratio (SNR) is zero dB (decibels).
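A sketch of how such a 0 dB mixture can be synthesized is given below; the file names are placeholders, and the scaling simply equalizes the clean and noise powers.

```python
# Mix clean speech with an urban noise clip at a target SNR of 0 dB.
import numpy as np
import librosa

clean, sr = librosa.load("clean_speech.mp3", sr=8000)   # placeholder MCV sample
noise, _  = librosa.load("dog_bark.wav", sr=8000)       # placeholder UrbanSound8K sample
noise = np.resize(noise, clean.shape)                    # match the lengths

target_snr_db = 0.0
clean_power = np.mean(clean ** 2)
noise_power = np.mean(noise ** 2)
scale = np.sqrt(clean_power / (noise_power * 10 ** (target_snr_db / 10)))
noisy = clean + scale * noise                            # resulting SNR is 0 dB
```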
This section deals with a crucial step in any deep learning project, which consists of
implementing some data preprocessing modules that allow for the extraction of the
features required for training and testing our deep learning network. This entails
downsampling the audio signals, removing silent frames, and computing their short-time spectra.
One reason for downsampling the audio signals lies in the fact that our dataset contains 48kHz
recordings of subjects speaking short sentences, which would impose a very heavy computational
load on the network.
Another reason for downsampling the audio signals is to mimic the sample rate of speech
coders used for narrowband GSM telephony applications (EETimes, 2003), which is
known to be 8kHz.
For us to perform this downsampling process without risking any aliasing in our signals,
we use a Python library called Librosa, which assists in loading a signal at our
desired new sampling rate. The library automatically takes care of choosing an appropriate
anti-aliasing filter as well as a proper decimation factor for our desired sampling rate
(8kHz). The following diagram depicts what our Librosa signal downsampler consists of.
Figure 10: The Librosa downsampler subsystem. (Signals and Systems - OpenStax CNX, 2021)
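In code, this entire subsystem reduces to a single call; a minimal sketch with a placeholder file name:

```python
# librosa.load resamples the recording to the requested rate (8 kHz here),
# applying an anti-aliasing filter internally before decimation.
import librosa

audio_8k, sr = librosa.load("speech_sample.mp3", sr=8000)  # placeholder file name
print(sr)  # 8000
```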
Another preprocessing stage involves removing silent frames from our audio signals.
Similar to the downsampling process, the idea here is to reduce the computational load of
our Deep Neural Network, thereby reducing the required processing power and processing
time, and increasing the training accuracy of the network.
To achieve this, we use the below method from Librosa to split our audio file on silence:
Figure 11: Librosa split method (Librosa — Librosa 0.8.1 Documentation, 2021)
This method splits an audio signal into non-silent intervals, treating frames whose level is
more than top_db below the reference as silence. It also
takes an optional parameter, hop_length, which specifies the number of samples between
the frames under analysis. We use 20dB for our top_db parameter and 64 for our
hop length.
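A minimal sketch of this step, using the parameters quoted above (the reassembly of the kept intervals by concatenation is our own assumption):

```python
# Drop silent regions: keep only the intervals librosa flags as non-silent.
import numpy as np
import librosa

intervals = librosa.effects.split(audio_8k, top_db=20, hop_length=64)
voiced = np.concatenate([audio_8k[start:end] for start, end in intervals])
```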
This stage involves computing the spectral vectors of our audio signals with a 256-point
Short-Time Fourier Transform (32ms Hamming window) with a window shift of 64 points
(8ms) and a frequency resolution of 31.25 Hz (= 4kHz/128) per frequency bin. The
STFT formula is given by:

$$X(m,k) = \sum_{n=0}^{N-1} x(n + mH)\,w(n)\,e^{-j2\pi kn/N}$$

where $x(n)$ is the time-domain signal, $w(n)$ is the analysis (Hamming) window of length
$N = 256$, $H = 64$ is the window shift in samples, $m$ is the frame index, and $k$ is the
frequency-bin index.
The reason we are using a Hamming window is that the FFT transform of a short audio
segment from the main audio signal, erroneously uses the assumption that this signal is
periodic and repeats infinitely before and after the analyzed segment in time. This
erroneous assumption leads to edge effects between repeating segments and therefore to
what is known as spectral leakage (a lack of frequency resolution caused by spectral
information “leaking” from one frequency position into adjacent values) which can be
reduced with the use of a windowing function such as Hamming window. The Hamming
window reduces the amplitude of the discontinuities at the side lobes of each finite
segment including any non-harmonic content, hence improving the frequency resolution
of our audio signal. In Figure 12, we illustrate how spectral leakage is reduced using the
hamming window function using a segment of an audio signal from our dataset.
To fulfill this stage of splitting the signal into discrete short time frames before feeding
these to our network, we use the stft function from the Librosa library in Python, as
shown in Figure 12 below;
Figure 12: Librosa stft function (Librosa — Librosa 0.8.1 Documentation, 2021)
As per the parameters required by this function, the values we used are listed below:
n_fft: 256
win_length: 256
window: hamming
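A sketch of the call with these parameters (the 64-sample hop length comes from the 8ms window shift stated above):

```python
# Compute the STFT and separate it into magnitude and phase.
import numpy as np
import librosa

stft = librosa.stft(voiced, n_fft=256, win_length=256, hop_length=64, window="hamming")
magnitude, phase = np.abs(stft), np.angle(stft)   # stft has shape (129, num_frames)
```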
In this section, using the vectors obtained from the preprocessing stage, we proceed in
implementing a Deep Neural Network model for denoising in noisy environments. The
deep learning training scheme is shown below.
The magnitude spectra of the noisy and clean audio signals are used as the predictor and
target network signals, respectively. The magnitude spectrum of the denoised signal is
the network's output. The regression network minimizes the mean squared error between
its output and the target by using the predictor input. The denoised audio is then transferred
back to the time domain using the output magnitude spectrum and the phase of the noisy
signal.
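A sketch of that reconstruction step is shown below; denoised_mag and noisy_phase stand for the network output and the phase of the noisy STFT.

```python
# Recombine the denoised magnitude with the noisy phase and invert the STFT.
import numpy as np
import librosa

denoised_stft  = denoised_mag * np.exp(1j * noisy_phase)   # placeholder arrays
denoised_audio = librosa.istft(denoised_stft, hop_length=64, win_length=256,
                               window="hamming")
```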
Once we obtain the STFT vectors, as elaborated in the previous section, we reduce the
size of the spectral vector to 129 by dropping the frequency samples corresponding to
negative frequencies (because the time-domain speech signal is real, this does not lead
to any information loss). Our predictor input consists of 8 consecutive noisy STFT vectors,
so that each STFT output estimate is computed based on the current noisy STFT and the
7 previous noisy STFT vectors. In other words, the DNN model is an autoregressive
system that predicts the current signal based on past observations. The target
signals therefore consist of a single STFT frequency representation of shape (129, 1) from the clean
audio. The diagram below depicts this process.
Figure 15: STFT predictor and target vector inputs (The MathWorks, Inc., 2021)
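A sketch of how these predictor windows can be assembled from a noisy magnitude spectrogram (the padding of the first frames is our own assumption):

```python
# Pair each frame with itself and its 7 predecessors -> predictors of shape (T, 129, 8).
import numpy as np

def make_predictors(noisy_mag, n_frames=8):
    """noisy_mag has shape (129, T)."""
    # repeat the first column so that the earliest frames also have 7 "predecessors"
    padded = np.concatenate([np.tile(noisy_mag[:, :1], (1, n_frames - 1)), noisy_mag], axis=1)
    return np.stack([padded[:, t:t + n_frames] for t in range(noisy_mag.shape[1])])
```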
Now that we’ve laid out how our DNN will interact with our various STFT vectors, we
now need to see how the DNN model proper, will work.
Our DNN model will be largely based on the work done by (Park & Lee, 2016), where
the authors proposed a Cascaded Redundant Convolutional Encoder-Decoder Network
(CR-CED). Hence, our model will be based on a symmetric encoder-decoder architecture,
in which both components contain repeated blocks of Convolution, ReLU, and Batch
Normalization, giving our network 16 such blocks, which add up to roughly 33,000
parameters, i.e., roughly 132kB of memory, and can therefore be implemented in an embedded
system, like a mobile phone.
The figures below show the structure of our network model defined with Keras, which
we generated using the Tensorflow plot_model function. The figure was split into various
parts for space management.
Figure 16: DNN model structure, part 1
Figure 17: DNN model structure, part 2
Figure 18: DNN model structure, part 3
It is important to note that there are skip connections between some of the encoder and
decoder blocks, such that the feature vectors from these blocks are combined through addition.
These skip connections speed up convergence and reduce the vanishing of gradients
during training.
Another point to highlight is that, since one of our assumptions is to use the CR-CED
network, which is an extension of CNNs (originally designed for computer vision), it is
important to be aware that audio data differ a lot from images. Audio data, in its
raw form, is a 1-dimensional time series, as compared to images, which are 2-dimensional
representations of an instant in time.
For these reasons, we have to transform our audio signals into 2D (time/frequency)
representations. More specifically, given an input spectrum of shape (129 × 8) to our
network, convolution is only performed along the frequency axis. This ensures that the
frequency axis remains constant during forward propagation.
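The sketch below illustrates this kind of architecture in Keras; the filter counts, kernel sizes, number of blocks and the Adam optimizer are illustrative assumptions, not the exact configuration of our network.

```python
# CR-CED-style sketch: Conv1D slides only along the 129 frequency bins (the 8 STFT
# frames act as input channels); repeated Conv + ReLU + BatchNorm blocks with
# additive skip connections, trained to minimize the MSE against the clean spectrum.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size):
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.Activation("relu")(x)
    return layers.BatchNormalization()(x)

inputs = tf.keras.Input(shape=(129, 8))        # (frequency bins, consecutive frames)
x1 = conv_block(inputs, 18, 9)
x2 = conv_block(x1, 30, 5)
x3 = conv_block(x2, 8, 9)
x4 = conv_block(x3, 18, 9)
x4 = layers.Add()([x4, x1])                    # skip connection combined by addition
x5 = conv_block(x4, 30, 5)
x5 = layers.Add()([x5, x2])                    # skip connection combined by addition
outputs = layers.Conv1D(1, 9, padding="same")(x5)   # final convolution -> (129, 1)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```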
Once our network produces an output estimate, we optimize (minimize) the mean squared
error (MSE) between the output and the target (clean audio) signals. Figure 19
illustrates how our DNN model attempts to optimize the MSE between noisy spectral
vectors and target clean audio spectral vectors. It uses the frequency spectrogram of these
audio signals for illustration purposes.
Figure 19: Reducing the MSE of our DNN model (Silva, 2019)
In the next subsection, we present the training process of our model.
Training a model simply means learning (determining) good values for all the weights
and the bias from labeled samples, in this case, our clean and noisy audio signals.
However, to improve the prediction performance of our deep learning algorithm during
this training phase, we split our dataset into train, validation, and test sets. Splitting our
dataset this way also helps avoid a situation where the model fails to make predictions on
stationary noise attributes it has never seen – a concept called overfitting. The test set
here is used to estimate how well our model will behave with unseen data. The validation
set is used to validate our model in different configurations such as optimizers or loss
functions. The train set will be used to train or fit our model.
Another important manipulation we need to make on our dataset is how to split it, which
requires us to take a few considerations into account.
This splitting is still aimed at avoiding the overfitting of our model. In machine learning,
a proper ratio for splitting the dataset usually falls in the range from 60%-20%-20% to
98%-1%-1%, as shown in the figure provided by (Data Splitting Technique to
Fit Any Machine Learning Model | by Sachin Kumar | Towards Data Science, 2021).
Finally, to fit our model, we make use of the test_tf_record.py and dataset.py
modules, which use TFRecord – the data format recommended by Tensorflow for training –
to save the features of our clean and noisy audio signals, which we can now use to fit
our model. These two modules are available in Appendices E & F respectively. We use the
fit function provided by Keras to train our model. It takes as arguments the entire input
data (source) and output data (target), the batch size, the number of epochs, and the
validation dataset (source and target). We use this function in the snippet below:
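The original snippet is a screenshot that is not reproduced in this extract; a sketch of the call it describes, with placeholder values, is:

```python
# Fit the model on the TFRecord-backed training dataset; validation data is
# evaluated at the end of every epoch (variable values are placeholders).
model.fit(train_dataset,
          steps_per_epoch=steps_per_epoch,
          validation_data=validation_dataset,
          epochs=epochs)
```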
where,
train_dataset is our input training set of clean and noisy audio features
(magnitude and phase spectral vectors)
steps_per_epoch defines the total number of steps (batches of samples) before
declaring one epoch finished and starting the next epoch during training
validation_data, is the data on which to evaluate the loss and any model metrics
at the end of each epoch. The model will not be trained on this data.
epochs defines the number of epochs to train the model.
In the following chapter, we will explain the methods used to evaluate the system and
present the results we obtained.
CHAPTER 4
RESULTS
In this section, we evaluate our proposed system against our hypotheses. To test hypothesis
H1, we use objective and subjective measurements; we test hypothesis H2 using graph
comparisons of our denoised signals; and we evaluate the performance of our system
relative to critical cellular GSM device parameters, such as channel capacity, co-channel
interference, and power consumption.
In all speech enhancement algorithms, the improvement in the quality and intelligibility
is of utmost importance for ease and accuracy of information exchange. The speech
quality and intelligibility can be quantified using subjective and objective measures
(Krishnamoorthy, 2011). We implement these measurements in the next subsections.
Subjective speech quality measures are usually obtained using listening tests in which
human participants rate the quality of the speech in accordance with a predetermined
opinion scale. Listeners are presented with the sample speech audios and asked to rate the
quality of the speech on a numerical scale, typically a 5-point scale with 1 indicating poor
quality and 5 indicating excellent quality – a scoring range called Mean Opinion Score
(MOS). However, according to (Taal et al., 2010), such evaluation methods turn out to be
costly and time-consuming.
Hence, to perform the subjective speech intelligibility test, (Rix et al., 2001) suggest the
use of Perceptual Evaluation of Speech Quality (PESQ) to predict the subjective
opinion score of a degraded or enhanced speech. This is because, PESQ is a quite
sophisticated algorithm which has been recommended by ITU-T (P.862) for speech
quality assessment of narrow-band handset telephony and narrow-band speech codecs
(Hu & Loizou, 2008).
The PESQ measure takes a reference signal and the enhanced signal and aligns them in
both time and level. This is followed by a range of perceptually significant transforms
which include Bark spectral analysis, frequency equalization, gain variation equalization,
and loudness mapping (Rix et al., 2001).
The range of the PESQ score is −0.5 to 4.5, where −0.5 corresponds to poor quality
and 4.5 corresponds to excellent speech quality.
Another metric that has proven to be able to quite accurately predict the intelligibility of
noisy/processed speech in a large range of acoustic scenarios, including speech processed
by mobile communication devices, is the Short-Time Objective Intelligibility (STOI).
Recent studies by (Chen et al., 2016) and (Healy et al., 2017), show a good
correspondence between STOI predictions of noisy speech enhanced by DNN-based
speech enhancement systems, and speech intelligibility.
STOI is based on the correlation between the envelopes of clean and degraded speech
signals – denoted by 𝑥 and 𝑦 respectively, decomposed into regions that are
approximately 400ms in length and uses a simple DFT-based Time-Frequency-
decomposition. According to (Taal et al., 2011), the output of STOI is a scalar value
which is expected to have a monotonic relation with the average intelligibility of 𝑦 (e.g.,
the percentage of correctly understood words averaged across a group of users). It is a
function of a Time Frequency dependent intermediate intelligibility measure, which
compares the temporal envelopes of the clean and degraded speech in short-time regions
by means of a correlation coefficient. The following vector notation is used to denote the
short-time temporal envelope of the clean speech:
$$x_{j,m} = \left[X_j(m-N+1),\,X_j(m-N+2),\,\ldots,\,X_j(m)\right]^{T} \tag{8}$$
where $N = 30$, which corresponds to an analysis length of approximately 400ms, and

$$X_j(m) = \sqrt{\sum_{k=k_1(j)}^{k_2(j)-1}\left|\hat{x}(k,m)\right|^{2}},$$

with $\hat{x}(k,m)$ denoting the $k^{th}$ DFT bin of the $m^{th}$ frame of the clean speech and
$k_1(j)$, $k_2(j)$ the edge bins of the $j^{th}$ one-third octave band. Similar notation, $y_{j,m}$,
is used for the short-time temporal envelope of the degraded speech.
Thus, the correlation coefficient between $x_{j,m}$ and $y_{j,m}$ is given by:

$$d_{j,m} = \frac{\left(x_{j,m}-\mu_{x_{j,m}}\right)^{T}\left(\bar{y}_{j,m}-\mu_{\bar{y}_{j,m}}\right)}{\left\|x_{j,m}-\mu_{x_{j,m}}\right\|\,\left\|\bar{y}_{j,m}-\mu_{\bar{y}_{j,m}}\right\|} \tag{9}$$

where $\bar{y}_{j,m}(n)=\min\!\left(\frac{\|x_{j,m}\|}{\|y_{j,m}\|}\,y_{j,m}(n),\;\left(1+10^{15/20}\right)x_{j,m}(n)\right)$ is the
normalized and clipped version of $y$ and $\mu_{(\cdot)}$ refers to the sample average of the
corresponding vector.
Finally, the average of the intermediate intelligibility measure over all frames, referred to
as the STOI score, is given by:

$$d = \frac{1}{JM}\sum_{j,m} d_{j,m} \tag{10}$$
where 𝑀 represents the total number of frames and 𝐽 the number of one-third octave
bands.
According to (Taal et al., 2010), the output of STOI, 𝑑, takes values −𝟏 ≤ 𝒅 ≤ 𝟏 but is
in practice non-negative (Intelligibility Prediction for Speech Mixed with White Gaussian
Noise at Low Signal-to-Noise Ratios: The Journal of the Acoustical Society of America:
Vol 149, No 2, 2021).
In this dissertation, we based our measurement on 30 clean speech and noise samples
from our test dataset discussed in the methodology. This test dataset comprises clean and
noise samples, which we added individually to obtain 30 noisy speech samples with SNR
of 0𝑑𝐵. These noisy speech samples are then fed to our speech enhancement DNN
algorithm to obtain denoised speech samples. The denoised samples, the noisy speech
samples, and the original clean speech samples are then used to perform the subjective
speech intelligibility test using the aforementioned metrics.
Despite the unavailability of its mathematical representation, we measure this metric with
the help of the ‘PESQ Software’ provided by ITU-T (P.862 : Perceptual Evaluation of
Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment
of Narrow-Band Telephone Networks and Speech Codecs, 2021) embedded in the pesq
function of the pysepm package from (schmiph2, 2019/2021).
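A sketch of how these scores were gathered per sample is shown below; it assumes pysepm exposes pesq() and stoi() functions taking (clean, degraded, sampling_rate), as in the version used in this work, and the file names are placeholders.

```python
# Score one noisy/denoised pair against its clean reference with pysepm.
import librosa
import pysepm

clean, fs   = librosa.load("clean_000.wav", sr=8000)      # placeholder file names
noisy, _    = librosa.load("noisy_000.wav", sr=8000)
denoised, _ = librosa.load("denoised_000.wav", sr=8000)

pesq_noisy    = pysepm.pesq(clean, noisy, fs)
pesq_denoised = pysepm.pesq(clean, denoised, fs)
stoi_noisy    = pysepm.stoi(clean, noisy, fs)
stoi_denoised = pysepm.stoi(clean, denoised, fs)
```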
The table below shows the results of the first 10 PESQ scores for the noisy and denoised
speech samples.
The results above show that, for our first 10 noisy and denoised speech samples, the
perceptual speech quality is significantly increased with the help of our DNN model.
This metric was measured with the help of the stoi function from the pysepm module.
The table below shows the results of the first 10 STOI values for the noisy and denoised
speech samples.
The table above shows varying intelligibility scores, with 0.6518 being the minimum
STOI and 0.777 being the maximum STOI value for this result set. It can be noticed that
some denoised speech samples have lower STOI values than their noisy counterparts, as
is the case with the rows highlighted in blue. This is due to the denoising errors encountered
by our model.
It is worth mentioning that the prediction root-mean-square error (RMSE) of our model
was evaluated at 0.4375 – indicating that our model denoises noisy
speech signals with up to 14.25% feature-prediction error. That is to say, 1.96 ×
43.75% = 85.75% of the extracted noisy features could be correctly predicted, hence the
improved speech quality scores reported above.
This dissertation focused on the use of the signal-to-noise ratio (SNR) to measure the
objective quality of the noisy and denoised speech, computed as suggested by
(Krishnamoorthy, 2011):

$$SNR_{dB} = 10\log_{10}\frac{\sum_{n} s^{2}(n)}{\sum_{n}\left[s(n)-\hat{s}(n)\right]^{2}} \tag{11}$$
where 𝑠(𝑛) is the clean speech and 𝑠̂ (𝑛) is the noisy or denoised speech signals. Since
SNR is very sensitive to the time alignment of the original and processed signal, we
padded the noisy and denoised signals with zeros to align them with the original or clean
speech signal.
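A sketch of equation (11) in NumPy, with the zero-padding alignment described above (array names are placeholders for s(n) and ŝ(n)):

```python
# SNR in dB between a clean reference and a processed (noisy or denoised) signal.
import numpy as np

def snr_db(clean, processed):
    n = max(len(clean), len(processed))
    s  = np.pad(clean,     (0, n - len(clean)))       # zero-pad to a common length
    sh = np.pad(processed, (0, n - len(processed)))
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - sh) ** 2))
```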
The table below shows the results of the first 10 SNR values for the noisy and denoised
speech samples.
The results above show that, for our first 10 noisy and denoised speech samples, the SNR
is significantly increased with the help of our DNN model, from the 0dB noisy signal to a
maximum of 7.4227dB.
The next table below shows the average speech metrics (PESQ, STOI, and SNR)
computed over 30 clean speech and noise samples in our test dataset.
Mean PESQ (noisy) | Mean PESQ (denoised) | Mean STOI (noisy) | Mean STOI (denoised) | Mean SNR (noisy) | Mean SNR (denoised)
It can be observed from these results that, based on 30 samples from our test dataset, there
is a significant increase in the PESQ, STOI, and SNR of the denoised signals obtained from
our DNN model.
In this section, we were able to test hypothesis H1 and observed that our DNN network
exhibited an improved performance based on both subjective (PESQ, STOI) and objective
quality (SNR) measures, in denoising noisy speech signals. We test hypothesis H2 in the
next section.
4.2.Testing Hypothesis H2
Now that we’ve been able to test hypothesis H1 using the mentioned speech quality
measures, visually inspecting the waveforms of the resulting signals can also easily tell
us how promising our denoising algorithm can be relative to its application to the ITU-T
To begin with, let's visualize our noisy input signal, which is composed of the last speech
sample from our test dataset (115418-9-0-20.wav). The figure below shows the waveforms of
these signals.
Passing this denoised signal in the third waveform from the figure above to the G.729
algorithm results in the following waveform.
Given that we've been able to test both hypotheses H1 and H2 and obtain improved
intelligibility and voice activity detection, it is also important that we measure how well our
enhanced voice activity decisions affect a critical GSM resource like the channel capacity.
Usually, once the GSM speech coder encodes our denoised speech, the encoded speech
is passed onto the GSM Traffic Channel (TCH). The TCH is responsible for carrying digitally
encoded speech on the forward and reverse links after a mobile has established a connection
with the GSM Base Transceiver Station (BTS). Two types of speech traffic channels exist:
TCH/FS: TCH/FS which stands for Full Rate Speech Channel (ECSTUFF4U for
Electronics Engineer, 2022), carries encoded speech at a rate of 22.8kbps.
TCH/HS: TCH/HS which stands for Half Rate Speech Channel carries up to
11.4kbps of encoded speech (ECSTUFF4U for Electronics Engineer, 2022). Its
main purpose is to support two calls in only one GSM channel.
Now, for a noiseless channel, the Nyquist capacity formula defines the theoretical
channel capacity $C$ as:

$$C = 2B\log_2 M$$

where $B$ is the channel bandwidth and $M$ is the number of discrete signal levels.
However, since we cannot have a noiseless channel in real life, we base our test on the
Shannon capacity, which gives the theoretical channel capacity for a noisy channel as:

$$C = B\log_2\left(1 + SNR\right)$$

where $SNR$ is the linear (not dB) signal-to-noise ratio.
Given that the $SNR_{dB}$ was calculated in section 4.1.2., we obtain the linear $SNR$ here using:

$$SNR = 10^{SNR_{dB}/10}$$
The channel capacity values in Table 6 indicate the maximum rate at which speech can
be transmitted through a 25kHz full-duplex channel with very small error probability
(Channel Capacity - an Overview | ScienceDirect Topics, 2022). Hence, it can be
observed that there is a significant increase in channel capacity when enhanced speech is
transmitted across the channel.
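As a sketch of this comparison, the Shannon capacity of the 25kHz channel can be computed directly from the measured SNR values; the denoised SNR below is the average reported earlier in this chapter.

```python
# Shannon capacity (bits/s) of a 25 kHz channel before and after denoising.
import numpy as np

def shannon_capacity(bandwidth_hz, snr_db):
    snr = 10 ** (snr_db / 10)              # convert SNR from dB to linear
    return bandwidth_hz * np.log2(1 + snr)

c_noisy    = shannon_capacity(25_000, 0.0)      # noisy speech at 0 dB SNR
c_denoised = shannon_capacity(25_000, 5.1737)   # average denoised SNR
print(c_noisy, c_denoised)                      # capacity roughly doubles
```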
Other important GSM resources, such as co-channel interference and power efficiency, are also worth analyzing, but this would require prior analysis of the GSM cell structure, the co-channel cell distance D, the cell radius R, and the interference power caused by an interfering co-channel base station, which is beyond the scope of our research work.
4.4. Results Discussion
The summary results in Table 5 give us an idea of how well our DNN model denoises noisy signals, based on 30 noisy sample signals. Comparing our average PESQ score (2.4755) with the 2.34 obtained by (Park & Lee, 2016), it becomes clear that our model maintains good speech quality while denoising.
Also, comparing our average STOI value (0.7016) with the 0.83 obtained by (Park & Lee, 2016), we can deduce that our model is reasonably good at preserving the intelligibility of denoised speech signals.
Another research work, by (Badescu & Cavez, 2021), reported a PESQ score of 2.1920 under -5 dB SNR conditions; compared with our PESQ score obtained under 0 dB SNR conditions, this indicates that our model provides a good perceptual quality level for 0 dB noisy speech signals.
Finally, given a full duplex channel with a bandwidth of 25 kHz and using 30 noisy and denoised speech samples, our work showed a significant increase in channel capacity, with an average of 54.598 kbps, which represents an increase of 118%.
Hence, based on these results, we can assert that we can generate denoised spectra {f(x_t)}, t = 1, ..., T, that approximate the clean spectra {y_t}, t = 1, ..., T, in the l2 norm while maintaining good perception and intelligibility levels, and that our denoised spectra can significantly improve the decisions made by the ITU-T G.729 Annex B recommended voice activity detection algorithm.
CONCLUSION
Summary
The results obtained from the proposed deep neural network, presented in the previous chapter, establish that VAD decision errors can be attenuated with the help of speech enhancement, which increases the signal-to-noise ratio and thereby improves perception and intelligibility levels.
Remarks
Our study showed a close resemblance between our performance metric results and those obtained in previous works by (Park & Lee, 2016) and (Badescu & Cavez, 2021).
Our speech enhancement model reached an accuracy of 85.75%, i.e., an RMSE of 14.25%, which indicates the error level our model exhibited during training.
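The listing below sketches one way such a figure can be derived; it assumes the RMSE is computed on magnitude spectra normalized to [0, 1] and that accuracy is simply taken as 1 - RMSE, which is how the two numbers above relate.

import numpy as np

def rmse_accuracy(predicted, target):
    # Both arrays are assumed to be magnitude spectra scaled to the [0, 1] range.
    rmse = np.sqrt(np.mean((predicted - target) ** 2))
    return rmse, 1.0 - rmse

predicted = np.random.rand(1000, 129)    # placeholder predicted STFT magnitudes
target = np.random.rand(1000, 129)       # placeholder clean STFT magnitudes
rmse, accuracy = rmse_accuracy(predicted, target)
print(f"RMSE: {rmse:.4f}  accuracy: {accuracy:.4f}")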
Despite the technical issues we encountered with our datasets and computational resources, we believe that, beyond these constraints, greater accuracy and hence a lower RMSE can be achieved.
Future Work
In our future work, we intend to increase the size of our clean speech dataset (MCV), which is about 71 GB, adjust the training parameters of the noise dataset in our model, and set appropriate train-validation-test ratios to improve the accuracy of our DNN model.
Lastly, to overcome the limited time-frequency resolution of the short-time Fourier transform, we intend to replace the STFT with the Wavelet Transform (WT) of the speech signals at the preprocessing stage, as sketched below.
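As a first step in that direction, the sketch below shows what a wavelet-based front end could look like, assuming the PyWavelets package; the Daubechies-4 wavelet, the decomposition level and the noisy.wav file name are illustrative choices only.

import librosa
import pywt

signal, fs = librosa.load('noisy.wav', sr=16000)         # placeholder input file
coeffs = pywt.wavedec(signal, wavelet='db4', level=4)     # multi-level discrete wavelet transform
# coeffs = [approximation, detail_4, detail_3, detail_2, detail_1]; these sub-band
# coefficients would replace the STFT magnitude features fed to the DNN.
reconstructed = pywt.waverec(coeffs, wavelet='db4')       # inverse transform for resynthesis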
REFERENCES
Asik, H., & Amca, H. (2019). Hand-over power level adjustment for minimizing cellular mobile
communication systems health concerns. Ciência e Técnica Vitivinícola, Vl. 34. No.
7, pp. 2416-3953.
Badescu, D. M., & Cavez, A. B. (n.d.). Speech Enhancement using Deep Learning. 33.
https://upcommons.upc.edu/bitstream/handle/2117/100596/Speech+Enhancement+usin
g+Deep+Learning.pdf?sequence=1.
Baghdasaryan, D. (2018). Real-Time noise suppression using deep learning. Towards Data Science. Retrieved on 17/02/2022 from, https://towardsdatascience.com/real-time-noise-suppression-using-deep-learning-38719819e051.
Benyassine, A., Shlomot, E., Su, H.-Y., Massaloux, D., Lamblin, C., & Petit, J.-P. (1997). ITU-
T Recommendation G.729 Annex B: A silence compression scheme for use with G.729
optimized for V.70 digital simultaneous voice and data applications. IEEE
Communications Magazine, Vl. 35. No. 9, pp. 64–73. Retrieved from,
https://doi.org/10.1109/35.620527.
Chaudhary, M. (2020). Activation functions: Sigmoid, tanh, relu, leaky relu, softmax. Medium.
Retrieved 14/02/2022 from, https://medium.com/@cmukesh8688/activation-functions-
sigmoid-tanh-relu-leaky-relu-softmax-50d3778dcea5.
Chen, J., Wang, Y., Yoho, S. E., Wang, D., & Healy, E. (2016). Large-scale training to increase
speech intelligibility for hearing-impaired listeners in novel noises. The Journal of the
Acoustical Society of America, Vl. 22. No. 6, pp 67-78. Retrieved from,
https://doi.org/10.1121/1.4948445.
Craciun, A., & Gabrea, M. (2004). Correlation coefficient-based voice activity detector algorithm.
Canadian Conference on Electrical and Computer Engineering. (IEEE Cat.
No.04CH37513). Retrieved from, https://doi.org/10.1109/CCECE.2004.1349763.
Dertat, A. (2017). Applied deep learning - Part 3: Autoencoders. Medium. Retrieved on 28/07/2022 from, https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d.
EETimes. (2003). Sorting through GSM codecs: A tutorial. EETimes. Retrieved on 14/03/2022 from, https://www.eetimes.com/sorting-through-gsm-codecs-a-tutorial.
Hahn, M., & Park, C. K. (1992). An improved speech detection algorithm for isolated Korean
utterances. [Proceedings] ICASSP-92: 1992 IEEE International Conference on
Acoustics, Speech, and Signal Processing. Retrieved from,
https://doi.org/10.1109/ICASSP.1992.2258.
Haigh, J. A., & Mason, J. S. (1993). Robust voice activity detection using cepstral features.
Proceedings of TENCON ’93. IEEE Region 10 International Conference on Computers,
Communications and Automation. Retrieved from, https://doi.org/10.1109/TENCON.1993.327987.
Healy, E. W., Delfarah, M., Vasko, J. L., Carter, B. L., & Wang, D. (2017). An algorithm to
increase intelligibility for hearing-impaired listeners in the presence of a competing
talker. The Journal of the Acoustical Society of America, Vl. 141. No. 6, pp. 4230–4239.
Retrieved from, https://doi.org/10.1121/1.4984271.
Hu, Y., & Loizou, P. C. (2008). Evaluation of objective quality measures for speech enhancement.
IEEE Transactions on Audio, Speech, and Language Processing, Vl. 16. No. 1, pp. 229–
238. Retrieved from, https://doi.org/10.1109/TASL.2007.911054.
ITU-T. (2021). P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Retrieved May 22, 2022, from https://www.itu.int/rec/T-REC-P.862-200102-I/en.
JalFaizy, S. (2017). Why are GPUs necessary for training deep learning models? Retrieved on
23/04/2022 from, https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-
deep-learning.
Jong, W. S., Joon-Hyuk, C., Barbara, S., Hwan, S. Y., & Nam, S. K. (2005). Voice activity
detection based on generalized gamma distribution. Proceedings. (ICASSP ’05). IEEE
International Conference on Acoustics, Speech, and Signal Processing. Retrieved from,
https://doi.org/10.1109/ICASSP.2005.1415230.
Jongseo, S., & Wonyong, S. (1998). A voice activity detector employing soft decision based noise
spectrum adaptation. Proceedings of the 1998 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181). Retrieved
from, https://doi.org/10.1109/ICASSP.1998.674443.
Kawamura, A., Thanhikam, W., & Iiguni, Y. (2012). Single channel speech enhancement
techniques in spectral domain. ISRN Mechanical Engineering. Retrieved from,
https://doi.org/10.5402/2012/919234.
Krishnamoorthy, P. (2011). An Overview of subjective and objective quality measures for noisy
speech enhancement algorithms. IETE Technical Review, Vl. 28. No. 4, pp. 292–301.
Retrieved from, https://doi.org/10.4103/0256-4602.83550.
Kumar, A., & Florencio, D. (2016). Speech enhancement in multiple-noise conditions using deep neural networks. Retrieved from, https://doi.org/10.21437/Interspeech.
Liu, D., Smaragdis, P., & Kim, M. (2014). Experiments on Deep Learning for Speech Denoising.
New York: Mc. Hill.
Mathworks: (2021). Denoise Speech Using Deep Learning Networks. Retrieved on 23/09/2021,
from https://www.mathworks.com/help/deeplearning/ug/denoise-speech-using-deep-
learning-networks.html.
Mozilla. (2020). Common Voice Corpus 9.0. Retrieved May 26, 2022, from https://commonvoice.mozilla.org/.
Park, S. R., & Lee, J. (2016). A Fully convolutional neural network for speech enhancement.
ArXiv:1609.07132 [Cs]. Retrieved on 24/11/21 from, http://arxiv.org/abs/1609.07132.
Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual evaluation of
speech quality (PESQ)-a new method for speech quality assessment of telephone
networks and codecs. IEEE International Conference on Acoustics, Speech, and Signal
Processing. Proceedings, Vl. 2. No. 1, pp. 749–752. Retrieved from, https://doi.org/10.1109/ICASSP.2001.941023.
Becker, D. (2018). Running kaggle kernels with a GPU. Retrieved on 4/11/2021, from
https://kaggle.com/dansbecker/running-kaggle-kernels-with-a-gpu.
Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. Proceedings of the 22nd ACM International Conference on Multimedia, pp. 1041–1044. Retrieved from, https://doi.org/10.1145/2647868.2655045.
Sachin, K. (2020). Data splitting technique to fit any machine learning model . Towards data
science. Retrieved 5/11/2021 from, https://towardsdatascience.com/data-splitting-
technique-to-fit-any-machine-learning-model-c0d7f3f1c790.
Samaya, M., Jeremy, N., Romeo, K., & Alex, A. (2021). Building deep learning models with
tensorflow—home | coursera [E-learning]. Building deep learning models with
tensorflow. Retrieved on 11/11/2021, from https://www.coursera.org/learn/building-
deep-learning-models-with-tensorflow/home/welcome.
Simone, G., & Carl, H. (2021). Intelligibility prediction for speech mixed with white Gaussian
noise at low signal-to-noise ratios. The Journal of the Acoustical Society of America: Vl.
149. No 2. Retrieved on 9/11/2021 from,
https://asa.scitation.org/doi/full/10.1121/10.0003557.
Shivakumar, P. G., & Georgiou, P. (2016). Perception optimized deep denoising autoencoders
for speech enhancement. In, Prashanth, G. S., & Panayiotis, G. (Eds.).
Interspeech. New York: University Press, pp. 3743–3747.
Silva, T. (2019). Practical deep learning audio denoising. Thalles' blog. Retrieved on 29/11/2021 from, https://sthalles.github.io/practical-deep-learning-audio-denoising/.
Sohn, J., Kim, N., & Sung, W. (1999). A statistical model-based voice activity detection. IEEE
Signal Processing Letters, Vl. 6. No. 1, pp. 1–3.
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2010). A short-time objective
intelligibility measure for time-frequency weighted noisy speech. IEEE Society (Ed.).
IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Society, pp. 4214–4217.
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An Algorithm for intelligibility
prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio,
Speech, and Language Processing, Vl. 19. No. 7, pp. 2125–2136. Retrieved from, https://doi.org/10.1109/TASL.2011.2114881.
Tashev, I. J., & Mirsamadi, S. (2016). DNN-based causal voice activity detector. Retrieved on 17/09/2021 from, https://www.semanticscholar.org.
The MathWorks, Inc. (n.d.). Denoise Speech using deep learning networks—MATLAB &
Simulink. Retrieved on 15/03/2021, from
https://www.mathworks.com/help/audio/ug/denoise-speech-using-deep-learning-
networks.html.
Yuliani, A. R., Amri, M. F., Suryawati, E., Ramdan, A., & Pardede, H. F. (2021). Speech enhancement using deep learning methods. Jurnal Elektronika dan Telekomunikasi, Vl. 21. No. 1, pp. 16-19. Retrieved from, https://doi.org/10.14203/jet.v21.19-26.
APPENDICES
Appendix E: test_tf_record.py
import tensorflow as tf
import numpy as np
from utils import play
from data_processing.feature_extractor import FeatureExtractor
train_tfrecords_filenames = '../kaggle/working/records/test_0.tfrecords'
def tf_record_parser(record):
    # Parse one serialized example into the noisy input features, the clean
    # target magnitude and the noisy phase (used later for resynthesis).
    keys_to_features = {
        "noise_stft_phase": tf.io.FixedLenFeature((), tf.string, default_value=""),
        'noise_stft_mag_features': tf.io.FixedLenFeature([], tf.string),
        "clean_stft_magnitude": tf.io.FixedLenFeature((), tf.string)
    }
    features = tf.io.parse_single_example(record, keys_to_features)

    n_features = 129   # frequency bins per STFT frame
    n_segments = 8     # consecutive frames fed to the network (see dataset.py)

    # Decode the raw byte strings back to float32 and restore the tensor shapes
    # (shapes assumed from numFeatures=129 and numSegments=8 used in dataset.py).
    noise_stft_mag_features = tf.io.decode_raw(features['noise_stft_mag_features'], tf.float32)
    clean_stft_magnitude = tf.io.decode_raw(features['clean_stft_magnitude'], tf.float32)
    noise_stft_phase = tf.io.decode_raw(features['noise_stft_phase'], tf.float32)

    noise_stft_mag_features = tf.reshape(noise_stft_mag_features, (n_features, n_segments, 1))
    clean_stft_magnitude = tf.reshape(clean_stft_magnitude, (n_features, 1, 1))
    noise_stft_phase = tf.reshape(noise_stft_phase, (n_features,))

    return noise_stft_mag_features, clean_stft_magnitude, noise_stft_phase
train_dataset = tf.data.TFRecordDataset([train_tfrecords_filenames])
train_dataset = train_dataset.map(tf_record_parser)
train_dataset = train_dataset.repeat(1)
train_dataset = train_dataset.batch(1000)
train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
window_length=256
overlap=64
sr = 16000
Appendix F: dataset.py
import librosa
import numpy as np
import math
from feature_extractor import FeatureExtractor
from utils import prepare_input_features
import multiprocessing
import os
from utils import get_tf_feature, read_audio
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
np.random.seed(999)
tf.random.set_seed(999)
class Dataset:
    def __init__(self, clean_filenames, noise_filenames, **config):
        self.clean_filenames = clean_filenames
        self.noise_filenames = noise_filenames
        self.sample_rate = config['fs']
        self.overlap = config['overlap']
        self.window_length = config['windowLength']
        self.audio_max_duration = config['audio_max_duration']

# Excerpt: a noise file is sampled at random to be mixed with each clean file.
noise_filename = self._sample_noise_filename()

# Excerpt: existing TFRecord shards are skipped, otherwise a writer is opened.
if os.path.isfile(tfrecord_filename):
    print(f"Skipping {tfrecord_filename}")
    counter += 1
    continue

writer = tf.io.TFRecordWriter(tfrecord_filename)

# Excerpt: each (noisy magnitude, clean magnitude, noisy phase) tuple is turned
# into the 129 x 8 input feature tensor expected by the network.
for o in out:
    noise_stft_magnitude = o[0]
    clean_stft_magnitude = o[1]
    noise_stft_phase = o[2]
    noise_stft_mag_features = prepare_input_features(noise_stft_magnitude,
                                                     numSegments=8, numFeatures=129)
    counter += 1
writer.close()
Appendix G:
audioSource = dsp.AudioFileReader('SamplesPerFrame',80,...
'Filename','noisy-input.wav',...
'OutputDataType', 'single');
scope = dsp.TimeScope(2, 'SampleRate', [8000/80, 8000], ...
'BufferLength', 80000, ...
'YLimits', [-0.3 1.1], ...
'ShowGrid', true, ...
'Title','Decision speech and speech data', ...
'TimeSpanOverrunAction','Scroll');
% Initialize VAD parameters
VAD_cst_param = vadInitCstParams;
clear vadG729
% Run for 5 seconds (500 frames of 10 ms each)
numTSteps = 500;
while(numTSteps)
% Retrieve 10 ms of speech data from the audio file reader
speech = audioSource();
% Call the VAD algorithm
decision = vadG729(speech, VAD_cst_param);
% Plot speech frame and decision: 1 for speech, 0 for silence
scope(decision, speech);
numTSteps = numTSteps - 1;
end
release(scope);
Appendix H: