
UNIVERSITY OF BUEA

COLLEGE OF TECHNOLOGY DEPARTMENT OF ELECTRICAL AND


ELECTRONICS ENGINEERING

VOICE ACTIVITY DETECTION USING DEEP LEARNING FOR GSM


TELEPHONY

By

Daniel Graham Boaz


B. Eng in Computer
Engineering

A Dissertation Submitted to the Department of Electrical and Electronic Engineering,


College of Technology of the University of Buea in Partial Fulfillment
of the Requirements for the Award of the Master of Technology
(M. Tech) Degree in Telecommunication and Networks.

May, 2022

DEDICATION

To my father, Mr. Sandjo Gustave



UNIVERSITY OF BUEA

COLLEGE OF TECHNOLOGY DEPARTMENT OF ELECTRICAL AND


ELECTRONICS ENGINEERING

CERTIFICATION
The dissertation of Daniel Graham Boaz (CT19P008) entitled “Voice Activity
Detection Using Deep Learning for GSM Telephony”, submitted to the Department of
Electrical and Electronic Engineering, College of Technology of the University of Buea,
in partial fulfillment of the requirements for the award of the Master of Technology
(M.Tech.) Degree in Telecommunications and Networks, has been read, examined and
approved by the examination panel composed of:

 Nana Cyrille (PhD), Chairperson (Associate Professor of Mathematics)


 Tchapga Christina (PhD), Member (Lecturer of Software Engineering)
 Sone Ekonde Michael (PhD), Supervisor (Associate Professor of Telecommunication)
 Feudjio Cyrille (PhD), Co-Supervisor (Lecturer of Software Engineering)

Dr. Feudjio Cyrille Dr. Sone Ekonde Michael (AP)


(Head of Department) (Supervisor)

Dr. Feudjio Cyrille


(Co-Supervisor)

This dissertation has been accepted by the College of Technology.

Date:
Dr. Sone Ekonde Michael (AP)
(Director)

ACKNOWLEDGEMENT

First and foremost, I thank God for letting me live to see this dissertation through. I am

deeply grateful to my supervisors who were more than generous with their expertise and

precious time. A special thanks to Dr. Sone Ekonde Michael and Dr. Feudjio Cyrille, for

their countless hours of reflecting, reading, encouraging, advising and most of all,

patience throughout this dissertation. Thank you all sincerely.

I would like to thank all my Master’s lecturers, the various heads of department, and all

the College of Technology staff for the strong theoretical knowledge which laid the

foundation for the successful completion of this project.

I’m most grateful to my parents for their support, encouragement and motivation

throughout this work. To my mother Sahan Annie Angele, I heartily appreciate her

prayers and efforts towards the completion of this work and for supporting me strongly

in my endeavors.

Last but not least, I express sincere thanks to my whole family, all my classmates and

friends who have patiently extended all sorts of help for the accomplishment of this

undertaking.

ABSTRACT

Voice Activity Detectors (VAD) are algorithms for detecting the presence of speech
signals in the mixture of speech and noise. They play an essential role in speech coders
for GSM telephony as they operate as a binary classifier, flagging audio frames where
voice is detected. However, in a low-SNR environment, the presence of babble noise
drastically reduces speech intelligibility, resulting in poor VAD decisions. In this
dissertation, we propose using deep learning to learn the mapping between noisy speech
and clean speech features in order to improve VAD decisions. Specifically, we propose
using fully convolutional neural networks (CNN), which automatically extract distinctive
features of noisy and clean speech spectra using relatively few network parameters. The
proposed network model showed improved subjective and objective measures, with an
average PESQ of 2.4755, an average STOI of 0.7016, an improved average SNR of
5.1737 dB, and an improved average channel capacity of 118% for noisy speech samples
at 0 dB SNR.
Key Words: Speech, Voice Activity Detection (VAD), Deep Neural Network (DNN),
Convolutional Neural Networks (CNN).

TABLE OF CONTENTS
DEDICATION............................................................................................................................. ii

CERTIFICATION ..................................................................................................................... iii

ACKNOWLEDGEMENT ......................................................................................................... iv

ABSTRACT ..................................................................................................................................v

TABLE OF CONTENTS........................................................................................................... vi


LIST OF TABLES .......................................................................................................................x

LIST OF FIGURES ................................................................................................................... xi

ABBREVIATIONS ................................................................................................................... xii

CHAPTER 1 .................................................................................................................................1

INTRODUCTION ........................................................................................................................1

1.1. Overview ........................................................................................................................1

1.2. Problem Statement .........................................................................................................1

1.3. Objectives.......................................................................................................................2

1.4. Research Questions ........................................................................................................2

1.5. Scope and Limitations ....................................................................................................2

1.6. Dissertation Outline .......................................................................................................3

CHAPTER 2 .................................................................................................................................4

LITERATURE REVIEW............................................................................................................4

2.1. Voice Activity Detection................................................................................................4

2.2. Review of VAD algorithms ...........................................................................................4

2.2.1. Speech Detection Using Energy and Zero Crossing Rate ......................................4

2.2.2. Speech Detection Using Correlation Coefficient ...................................................5

2.2.3. Speech Detection Using Statistical Model-Based VAD ........................................5

2.2.4. Speech Detection Using Cepstral Features and Mel-Energy Features ...................5

2.2.5. Speech Detection Using Generalized Gamma Distribution (GГD) .......................5

2.2.6. Speech Detection Using Deep Neural Networks (DNN) .......................................6

2.2.7. Autocorrelation-based noise subtraction (ANS) method .......................................6

2.3. Demystifying the VAD Problem (The Rationale of our Work) .....................................7

2.4. Speech Enhancement .....................................................................................................7

2.4.1. Traditional Single Channel Speech Enhancement .................................................8

2.4.2. Single-Channel Speech Enhancement with DNNs ................................................8

2.4.2.1. 2Hz .................................................................................................................9

2.4.2.2. Speech enhancement in office-environment noises........................................9

2.4.2.3. Speech enhancement with Supervised learning .............................................9

2.5. Why Deep Learning? .....................................................................................................9

2.6. An Overview of Deep Learning ...................................................................................10

2.7. Autoencoders ...............................................................................................................12

2.7.1. Working Principle ................................................................................................12

2.8. Convolutional Neural Networks ..................................................................................13

2.8.1. Working Principle ................................................................................................14

2.8.2. CNN Extensions ...................................................................................................15

2.8.2.1. Convolutional Encoder-Decoder Network (CED) .......................................15

2.8.2.2. Redundant Convolutional Encoder Decoder Network (R-CED) .................16

2.8.2.3. Cascaded R-CED Network (CR-CED) ........................................................16



2.9. Hypothesis Development .............................................................................................17

CHAPTER 3 ...............................................................................................................................18

METHODOLOGY.....................................................................................................................18

3.1. Computing Technologies ..................................................................................................18

3.1.1. Computing toolkit ................................................................................................18

3.1.2. Analytics toolkit ...................................................................................................19

3.1.2.1. Programming languages ...............................................................................19

3.1.2.2. Software packages ........................................................................................19

3.2. Datasets ........................................................................................................................20

3.3. Data Preprocessing .......................................................................................................21

3.3.1. Downsampling Audio signals ..............................................................................21

3.3.2. Removing Silent Frames ......................................................................................22

3.3.3. Computing STFT vectors .....................................................................................23

3.4. Proposed Deep Neural Network Architecture ..............................................................25

3.4.1. Our DNN model ...................................................................................................26

3.4.2. Training our model ...............................................................................................28

CHAPTER 4 ...............................................................................................................................31

RESULTS ...................................................................................................................................31

4.1. Testing Hypothesis H1 ......................................................................................................31

4.1.1. Subjective Speech Intelligibility Test ........................................................31

4.1.1.1. PESQ Measurement .....................................................................................33

4.1.1.2. STOI Measurement ......................................................................................34

4.1.2. Objective Speech Intelligibility Test ....................................................................36



4.2. Testing Hypothesis H2 .................................................................................................38

4.3. Testing GSM Channel Capacity...................................................................................40

4.4. Results Discussion .......................................................................................................42

CONCLUSION...........................................................................................................................44

Summary ..................................................................................................................................44

Remarks ...................................................................................................................................44

Future Work .............................................................................................................................44

REFERENCES ...........................................................................................................................45

APPENDICES ............................................................................................................................49

Appendix A: Interfering noise in speech transmission illustration ........................................49

Appendix B: VAD Decision of a noisy speech signal ...........................................................49

Appendix C: VAD Decision of Noisy Audio ............................................................................49

Appendix D: Improving VAD decisions with Speech Enhancer ...........................................50

Appendix E: test_tf_record.py...................................................................................................50

Appendix F: dataset.py..............................................................................................................51

Appendix G: MATLAB implementation of G.729 VAD ......................................................54

Appendix H: Clean Speech, Noisy Speech, and Denoised Speech spectrograms .................55

LIST OF TABLES

Table 1: Librosa stft function parameters ......................................................................24

Table 2: PESQ results from DNN model .....................................................................................34

Table 3: STOI results from DNN model ......................................................................................35

Table 4: SNR results from DNN model .......................................................................................37

Table 5: Average of subjective and objective metrics ......................................................38



LIST OF FIGURES

Figure 1: Logistic Classifier diagram (Badescu & Cavez, 2021).................................................10

Figure 2: 2-layer fully connected network (Badescu & Cavez, 2021) .........................................11

Figure 3: CNN exploiting spatially-local correlation (Badescu & Cavez, 2021).........................14

Figure 4: CNN feature map (Badescu & Cavez, 2021)................................................................14

Figure 5: A CNN sequence to classify handwritten digits (Saha, 2018)......................................15

Figure 6: Modified Convolutional Encoder-Decoder Network (CED) (Park & Lee, 2016) ........15

Figure 7: Proposed Redundant CED (R-CED) (Park & Lee, 2016) ............................................16

Figure 8: Plots of clean, noise, and noisy signals.........................................................................21

Figure 9: The Librosa downsampler subsystem. (Signals and Systems - OpenStax CNX,

2021) ............................................................................................................................................22

Figure 10: Librosa split method (Librosa — Librosa 0.8.1 Documentation, 2021).....................22

Figure 11: Librosa stft function (Librosa — Librosa 0.8.1 Documentation, 2021) .....................24

Figure 12: The spectral leakage reduction process ......................................................................24

Figure 13: Deep learning training scheme(The MathWorks, Inc., 2021) ....................................25

Figure 14: STFT predictor and target vector inputs (The MathWorks, Inc., 2021) .....................26

Figure 15: DNN model structure part 1 .......................................................................................27

Figure 16: DNN model structure part 2 .......................................................................................27

Figure 17: DNN model structure part 3 .......................................................................................27

Figure 18: Reducing the MSE of our DNN model (Silva, 2019) .................................................28

Figure 19: Splitting a Dataset .......................................................................................................29

Figure 20: Test noisy Input signal ................................................................................................39

Figure 21: VAD decision for noisy-input.wav .............................................................................39

Figure 22: Denoised speech signal ...............................................................................................40

Figure 23: VAD decision for denoised speech signal ..................................................................40



ABBREVIATIONS
API Application Programming Interface

ASR Automatic Speech Recognition

CDMA Code Division Multiple Access

CNG Comfort Noise Generation

CNN Convolutional Neural Network

CPU Central Processing Unit


CR-CED Cascaded Redundant Convolutional Encoder-Decoder

DD Decision Directed

DFT Discrete Fourier Transform


DNN Deep Neural Network

DSP Digital Signal Processing

DTX Discontinuous Transmission

ETF Enhanced Time Frequency

FFT Fast Fourier Transform

FT Fourier Transform

GPU Graphics Processing Unit

GSAP global speech absence probability

GSM Global System for Mobile Communications

ITU International Telecommunication Union

LMS Least mean square

LRT likelihood ratio test

MAD median absolute deviation

MATLAB matrix laboratory

MCV Mozilla Common Voice

MIMSB minimum Mel-scale frequency band

ML Machine Learning

MOS Mean Opinion Score

MP3 MPEG-1 Audio Layer III

MSE Mean Square Error

NLMS Normalized Least Mean Square

PC Personal Computer

PESQ Perceptual Evaluation of Speech Quality

PWPD Perceptual Wavelet Packet Decomposition

RAM Random Access Memory

RBM Restricted Boltzmann Machine

R-CED Redundant Convolutional Encoder-Decoder

ReLU Rectified Linear Units

RMSE root mean square error

RNN Recurrent Neural Networks

SAD speech activity detection

SGD Stochastic Gradient Descent

SMVAD statistical model-based VAD

SNR signal to noise ratio

STFT short time Fourier transform


STOI short time objective intelligibility

TDMA time division multiple access

TEO Teager Energy Operator

VAD voice activity detection

VOS Voice Operated Switch

WT Wavelet Transform

CHAPTER 1

INTRODUCTION

1.1. Overview

Speech is the predominant means of communication between human beings, and since
the invention of the telephone by Alexander Graham Bell in 1876, speech services have
remained the core service in almost all telecommunication systems. Speech coders in
GSM telephony are used to compress the bit rate (bandwidth) of the speech signal before
transmission while keeping an acceptable perceived quality of the decoded output speech
signal. This speech signal is often corrupted by an interfering signal (babble noise), which
has a harmful effect on the signal-to-noise ratio of the resulting speech signal.

With recent advances in speech signal processing techniques, the need to accurately
detect the presence of speech in the incoming signal under different noise environments
has become a major industry concern. Separation of the speech fragments from the
non-speech fragments in an audio signal has been achieved over the years using Voice
Activity Detectors (VAD). VADs are a class of signal processing methods that detect the
presence or absence of speech in short segments of an audio signal. They play a pivotal
role as the preprocessing block in a wide range of speech applications, thereby providing
improved channel capacity, reduced co-channel interference and power consumption in
portable electronic devices in cellular radio systems, and simultaneous voice and data
applications in multimedia communications.

However, in low SNR conditions and non-stationary environments, where speech is
heavily corrupted by noise, VADs, especially those used for narrowband GSM telephony,
often make detection errors when estimating the noise spectrum.

1.2. Problem Statement

In past decades, a lot of work has been done on enhancing speech on the one hand and
on enhancing VAD decisions on the other. Though the two approaches are closely
related, the difference in outcome lies in the voice frequency band, which in some cases
could be considered the unwanted signal.

This is often caused by two factors:

 The voice frequency band, which ranges from approximately 300 to 3400 Hz, is
present both in the additive noise and in the clean input speech signal. This
confuses the voice activity detector (VAD), which in turn labels a perceptually
non-speech frame as a clean speech frame.
 Most VAD algorithms assume the background noise is stationary, often Gaussian
distributed, within one speech frame, and the same assumption is made for
consecutive frames. In reality, however, interfering noise can switch from one
form to another (e.g., from a railway station to crowd chatter), often causing
VAD detection/decision errors.

1.3. Objectives

In order to attenuate decision errors made by VADs, we set out to denoise the noisy
speech (enhance the speech) right before the Voice Activity Detector.

Thus, our main objective in this dissertation is to propose a speech enhancement model
that learns the mapping between noisy speech spectra and clean speech spectra, using
deep neural networks (DNN), to suppress both stationary and non-stationary background
noise, thereby bringing about improved VAD decisions.

1.4. Research Questions

1. Can decision errors made by voice activity detectors, in very low SNR conditions,
be attenuated with the help of Speech Enhancement, bearing in mind mobile
device constraints in narrowband GSM Telephony?
2. How does speech enhancement affect the perception and intelligibility of the
denoised/enhanced speech signals?

1.5. Scope and Limitations

This work will be limited to the ITU-T G.729 Annex B recommendation of the Voice
Activity Detection algorithm. The performance of the speech enhancement method will
be evaluated based on subjective and objective measures, as suggested by
(Krishnamoorthy, 2011). The speech and noise conditions used for analysis and

implementation will be sound files freely provided by (Mozilla Common Voice, 2021)
and (Salamon et al., 2014) respectively. The speech sounds will be a subset of Mozilla
Common Voice, provided by MATLAB (The MathWorks, Inc., 2021). The evaluation
process will be limited to simulations using the aforementioned sound files and no real-
time implementation of GSM telephony will be done.

1.6. Dissertation Outline

This study is structured as follows: Chapter 2 introduces the literature about VADs, GSM
Speech Coders, digital spectral analysis, deep neural networks, digital signal processing,
and previous works. It also presents our hypotheses based on our problem statement.
Chapter 3 describes the strategy developed in this dissertation to address the issues of
voice activity detection. It presents how our neural network is designed and implemented
with VAD. Chapter 4 presents the results of our implementation and conclusions are
drawn.

CHAPTER 2

LITERATURE REVIEW

In this chapter, we present the conceptual idea behind voice activity detection and a
survey of existing work in the area. We also go through some recent research on speech
enhancement with deep learning, introduce deep learning, and present the models used
for our hypothesis.

2.1. Voice Activity Detection

VAD, also known as speech activity detection (SAD), aims to detect the presence of
speech in an audio signal. This might include a scenario of identifying when the signal
from a hidden microphone contains speech so that a voice recorder can operate (also
known as a Voice Operated Switch or VOS). Another example would be a mobile phone
using a VAD to decide when one person in a call is speaking so it transmits only when
speech is present (by not transmitting frames that do not contain speech, the device might
save over 50% of radio bandwidth and operating power during a typical
conversation) (McLoughlin, 2016).

In cellular radio systems (for instance GSM and CDMA systems) based on Discontinuous
Transmission (DTX) mode, VAD is essential for enhancing system capacity by reducing
co-channel interference and power consumption in portable digital devices. To reduce the
annoying modulation of the background noise at the receiver (noise contrast effects),
Comfort Noise Generation (CNG) is used, inserting a coarse reconstruction of the
background noise at the receiver (Scourias, 1995).

VAD is an important enabling technology for a variety of speech-based applications.


Various VAD algorithms have been developed that provide varying features and
compromises between latency, sensitivity, accuracy, and computational cost. We review
some of these in the next section.

2.2. Review of VAD algorithms

2.2.1. Speech Detection Using Energy and Zero Crossing Rate

(Hahn & Park, 1992) proposed a simple yet effective speech detection algorithm that
classifies frames based on differential logarithmic energy and zero-crossing rate
characteristics. In addition, a feature vector consisting of linear prediction coefficients,
full-band energy, low-band energy, and the zero-crossing rate is specified in the
G.729 Annex B recommendation (Benyassine et al., 1997).
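
To make the idea concrete, the following is a minimal Python sketch of a frame-level detector in the spirit of this energy and zero-crossing-rate approach; the frame length and thresholds are illustrative assumptions and not the values specified by (Hahn & Park, 1992) or by G.729 Annex B.

```python
import numpy as np

def energy_zcr_vad(x, frame_len=240, energy_db_thresh=-40.0, zcr_thresh=0.25):
    """Classify fixed-length frames as speech (True) or non-speech (False)
    from log energy and zero-crossing rate. Thresholds are illustrative."""
    decisions = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        # Log energy of the frame (small constant avoids log of zero)
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        # Zero-crossing rate: fraction of adjacent samples that change sign
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # Voiced speech: high energy; unvoiced speech: lower energy but high ZCR
        is_speech = (energy_db > energy_db_thresh) or \
                    (energy_db > energy_db_thresh - 10.0 and zcr > zcr_thresh)
        decisions.append(is_speech)
    return np.array(decisions)
```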

2.2.2. Speech Detection Using Correlation Coefficient

Another VAD technique to improve word boundary detection for varying background
noise levels was suggested by (Craciun & Gabrea, 2004), where noise parameters are
estimated from the initial frames and then updated using a first-order
autoregressive filter during the silence periods. The correlation coefficients for the
instantaneous spectrum and an average of the background noise spectrum are the
parameters employed in this approach. Subsequently, a statistical approach using a basic
binary Markov model is used for voice activity detection.

2.2.3. Speech Detection Using Statistical Model-Based VAD

(Sohn et al., 1999) proposed a Statistical Model-Based VAD (SMVAD), in which the
decision rule was obtained from the Likelihood Ratio Test (LRT) by utilizing the
Maximum Likelihood (ML) criterion to estimate the unknown parameters. Further
improvements were made by optimizing the decision rule for the estimate of unknown
parameters using the Decision-directed (DD) technique (Jongseo Sohn & Wonyong Sung,
1998). To achieve robustness in low SNRs, the proposed algorithm further optimized the
decision rule by adapting the decision threshold using the measured noise energy.

2.2.4. Speech Detection Using Cepstral Features and Mel-Energy Features

Haigh et al. demonstrated the robustness to various background noise levels for successful
end-of-speech identification using cepstral feature-based thresholds (Haigh & Mason,
1993). Chin-Teng Lin et al. suggested Enhanced Time-Frequency (ETF) and Minimum
Mel-Scale Frequency Band (MIMSB) parameters, collected from a multi-band spectral
analysis using Mel-scale frequency banks, to create a robust word boundary detection
method (ETF VAD) (Chin-Teng Lin et al., 2002).

2.2.5. Speech Detection Using Generalized Gamma Distribution (GГD)

Jong Won et al. proposed a detection algorithm in which the distributions of noise spectra
and noisy speech spectra including speech-inactive intervals are modeled by a set of
GΓDs and applied to the LRT for VAD. The parameters of the GΓD are estimated through
an online Maximum Likelihood (ML) estimation procedure where the Global Speech
Absence Probability (GSAP) is incorporated under a forgetting scheme. The proposed
VAD algorithm based on GΓD proved to outperform the algorithms based on other
statistical models discovered so far (Jong Won Shin et al., 2005).

2.2.6. Speech Detection Using Deep Neural Networks (DNN)

(Tashev & Mirsamadi, 2016) proposed an algorithm for causal VAD based on DNNs.
The DNN is trained on segments of several consecutive audio frames, and with all
frequency bins together to utilize the correlation between the frames and bins. No
assumptions are made for any prior distribution of the noise and speech signals and the
DNN is expected to learn the dependency between the input features and the VAD
decision. It is shown that the proposed algorithm and DNN structure outperform the
classic statistical model-based VAD for both seen and unseen noises.

2.2.7. Autocorrelation-based noise subtraction (ANS) method

(Farahani, 2017) proposed the ANS method which, instead of removing the lower-lag
autocorrelation components of the noisy signal, as is the case with other autocorrelation-based
noise suppression methods, estimates the noise autocorrelation sequence and
subtracts it from the noisy-signal autocorrelation sequence. It uses the average
autocorrelation of a number of non-speech frames of the noisy utterance as an estimate
of the noise autocorrelation sequence, given by:

$$\hat{r}_{vv}(k) = \frac{1}{P}\sum_{i=0}^{P-1} r_{yy}(i, k), \qquad 0 \le k \le N-1 \qquad (1)$$
where $P$ is the number of non-speech frames and $r_{yy}(i, k)$ is the autocorrelation
sequence of the $i$-th noisy speech frame. The autocorrelation sequence of the clean
speech signal is then obtained as

$$\hat{r}_{xx}(m, k) = r_{yy}(m, k) - \hat{r}_{vv}(k) \qquad (2)$$


where $r_{yy}(m, k)$ is the autocorrelation of the noisy speech sequence, given by:

$$r_{yy}(m, k) = \frac{1}{N-k}\sum_{i=0}^{N-1-k} y(m, i)\, y(m, i+k) \qquad (3)$$

with $y(m, i)$ being the noisy speech sample composed of the clean speech $x(m, i)$ and
the noise $v(m, i)$, $N$ the frame length, $i$ the discrete time index within the frame, and
$k$ the autocorrelation lag index within each frame.
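
The following is a minimal NumPy sketch of equations (1)–(3), assuming the noisy signal has already been split into frames and that the indices of the non-speech frames are known (for example, from a preliminary VAD decision); the function names are ours and purely illustrative.

```python
import numpy as np

def frame_autocorr(frame):
    """Equation (3): short-time autocorrelation r_yy(m, k) of a single frame."""
    N = len(frame)
    r = np.empty(N)
    for k in range(N):
        r[k] = np.sum(frame[:N - k] * frame[k:]) / (N - k)
    return r

def ans_clean_autocorr(frames, nonspeech_idx):
    """Equations (1) and (2): average the autocorrelations of the P non-speech
    frames to estimate the noise autocorrelation, then subtract it from every
    noisy-frame autocorrelation."""
    r_yy = np.array([frame_autocorr(f) for f in frames])   # r_yy(m, k) for all frames
    r_vv_hat = r_yy[nonspeech_idx].mean(axis=0)            # equation (1)
    return r_yy - r_vv_hat                                 # equation (2): estimated r_xx(m, k)
```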

However, this method requires a VAD to obtain the non-speech frames used to estimate
the noise autocorrelation, which doesn’t align with our objectives, as the VAD might
make errors in classifying clean speech and non-speech frames. We throw more light on
this in section 2.3.

2.3. Demystifying the VAD Problem (The Rationale of our Work)

Consider the scenario illustrated in Appendix A where, in a telephone conversation, the
background noise consists of children crying or people chattering (as in the audio file
https://bit.ly/chattering_noise) and the clean speech (https://bit.ly/clean_speech) is
transmitted over the network together with this noise.

It can be observed from the VAD decision depicted in Appendix B, and magnified in
Appendix C, that the red VAD markers do not efficiently demarcate the voiced and
unvoiced parts of the noisy speech, since the noisy signal is a combination of the clean
speech and the unwanted babble noise.

Thus, it would be more advantageous if the noisy speech signal were denoised
(enhanced) before being passed through a voice activity detector – a topic which we
discuss further in section 2.4.

2.4. Speech Enhancement

Speech Enhancement has been a concern for a long time now. It aims to improve speech
quality by attenuating interfering noise. We want to filter out unwanted noise from an
input noisy signal without damaging the speech quality. For instance, if someone is
talking in a phone call conversation while a piece of music is playing in the background
or while running, a speech enhancement system's job, in this case, is to remove or filter
out the background noise (i.e., background music, or body movement sounds) to improve
the speech signal.

Speech enhancement techniques can be classified into two types based on the number of
microphones available: single-channel and multi-channel speech enhancement
techniques. Single-channel speech enhancement aims to extract clean speech from noisy
speech using signals captured by only one microphone, as opposed to an array of two or
more microphones in multi-channel speech enhancement. Multi-channel speech
enhancement generally provides better performance than single-channel enhancement,
but at the cost of a more complex and computationally expensive implementation.
However, our focus in this dissertation will be on single-channel speech enhancement,
as it requires only one microphone to capture the signals for our experiment.

2.4.1. Traditional Single Channel Speech Enhancement

Most single-channel speech enhancement techniques operate in the spectral domain
(Kawamura et al., 2012), the approach typically used in cell phones. (Ortega-Garcia
& Gonzalez-Rodriguez, 1996) give an overview of single-channel speech enhancement
techniques. However, the major limitation of spectral-domain speech enhancement is
that it still assumes the noise process to be stationary, and hence it is not successful for
non-stationary forms of background noise.

On the other hand, there has been a lot of recent progress in deep neural networks (DNN)
for different signal processing tasks, and several deep learning methods for single-channel
speech enhancement have been developed. Also, recent innovations in convolutional
neural networks (CNN) make them well suited to speech enhancement by training the
model on spectrogram features.

The next section reviews some of the recent works in speech enhancement with DNNs.

2.4.2. Single-Channel Speech Enhancement with DNNs

DNN-based learning architectures have proven to be quite successful over the years in
related domains such as speech recognition. Deep neural networks (DNNs) have therefore
been investigated for noise reduction as a result of their success in automatic speech
recognition (ASR).

In the following subsections, we explore some of the DNN techniques used for speech
enhancement proposed in the literature.

2.4.2.1. 2Hz

(Baghdasaryan, 2018) developed a DNN-based enhancement strategy that eliminates any


whisper of noise in the speech signal. Though the technology isn’t disclosed, it is shown
that the DNN architecture produces remarkable results on a variety of noises.

2.4.2.2. Speech enhancement in office-environment noises

(Kumar & Florencio, 2016) proposed a speech enhancement method that focuses
primarily on the presence of multiple noises simultaneously corrupting the speech.
Specifically, it deals with improving speech quality in an office environment where
multiple stationary, as well as non-stationary noises, can be simultaneously present in
speech. It is shown that noise-aware training is quite helpful for speech enhancement,
including in such complex noise conditions.

2.4.2.3. Speech enhancement with Supervised learning

(Park & Lee, 2016) try to solve the problem of speech enhancement by finding a
‘mapping’ between noisy speech spectra and clean speech spectra via supervised learning.
Specifically, they propose using fully convolutional neural networks (CNN), which
have far fewer parameters than fully connected networks. The CNN used is a new
architecture, the Redundant Convolutional Encoder-Decoder (R-CED), which is shown
to be 12 times smaller than other networks while achieving better performance. The
network extracts redundant representations of a noisy spectrum at the encoder and maps
it back to a clean spectrum at the decoder. This can be viewed as mapping the spectrum
to higher dimensions and projecting the features back to lower dimensions.

In section 2.5., we briefly present why deep learning is so popular nowadays, the areas
where it is implemented, and some of the most used architectures.

2.5. Why Deep Learning?

In recent years, the advance in deep learning technologies has provided great support for
the progress in Image Processing, Video Processing, Machine Translation, and Speech
Enhancement research fields. Unlike traditional speech enhancement approaches that
depend on statistical models, like spectral subtraction, Wiener filtering, and minimum
mean square error, deep learning approaches built on a data-driven paradigm have shown
outstanding speech enhancement performance over their predecessors (Yuliani et al.,
2021). This is mainly due to their ability to model complex non-linear mapping functions
(Shivakumar & Georgiou, 2016).

The next section provides an overview of Deep Learning and how it works.

2.6. An Overview of Deep Learning

Deep Learning is a new area of Machine Learning research, which has been introduced
with the objective of moving Machine Learning closer to one of its original goals:
Artificial Intelligence. Deep Learning is all about learning multiple levels of
representation and abstraction that help to make sense of data such as images, sound, and
text.

Deep Learning consists of Deep Neural Networks, which are in charge of learning from
the input training data. In this dissertation, we will be using two specific types of Deep Neural
Networks, which we describe below:

 Convolutional Neural Networks (CNN): These are networks that learn directly
from samples by optimizing their filters (or kernels) through automated learning,
as compared to traditional algorithms where these filters are rather hand-
engineered.
 Autoencoders: These are networks that work like Restricted Boltzmann Machines
(RBM), but use encoders to encode an unlabeled input dataset into short codes
and then use these codes to reconstruct (decode) the original input data while
extracting the most valuable information (features) from the input data
(Samaya et al., 2021).

The most basic neural network is the fully connected network, which is composed of
stacked layers of linear classifiers.

Figure 1: Logistic Classifier diagram (Badescu & Cavez, 2021)

To better understand how a linear classifier works, Figure 1 represents its common
architecture. Its equation is expressed as follows:

𝑌 = 𝑊𝑋 + 𝑏 (4)

The network's ability to learn is determined by the weights and bias. The network's goal
is to learn the weights and bias parameters from the training data that minimize the error.
The loss function is the function that measures the error during the learning process.
Cross-Entropy and Mean Squared Error are typical loss functions that could be used
to minimize this error. Cross-Entropy is more commonly used in classification, while
Mean Squared Error is more commonly used in regression. An optimizer is required to
reduce the error. Gradient Descent, particularly Stochastic Gradient Descent (SGD),
is a well-known optimizer. Linear models are stable, but they have
lots of limitations. To retain the parameters in linear functions while making the overall
model non-linear, a step further must be taken and non-linearities must be introduced

(Badescu & Cavez, 2021).

Figure 2: 2-layer fully connected network (Badescu & Cavez, 2021)

In Figure 2, a fully connected Neural Network diagram is represented. In order to


introduce non-linearity to the model, a non-linear function (activation function) must
be inserted. The simplest nonlinear or activation function, which can be observed in
Figure 2 is the Rectified Linear Units function (ReLU). The number of layers of such
non-linear units is the number of “hidden layers”, which is another parameter to tune.
Other activation functions include Sigmoid, Tanh, Leaky ReLU, and Softmax. However,
to reduce the computational load of the network and avoid the vanishing gradient problem,
it is recommended to use ReLU (Chaudhary, 2020).

Finally, the concepts of an epoch, batch size, iteration, learning rate, and overfitting must
be explained. An epoch corresponds to one complete pass over the entire training dataset.
A batch, on the other hand, corresponds to computing the gradient over a smaller portion
of the dataset, so several iterations are required before an epoch is completed. This makes
training faster, which is why a batch size must be established. As its name indicates, the
learning rate sets the speed of learning: a learning rate that is too high can make the error
diverge, while one that is too low slows learning and can encourage overfitting. When
overfitting appears, training should stop. To identify it, the training loss and the validation
loss must be observed: when the validation loss is increasing while the training loss is
decreasing, it is a clear sign of overfitting.
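
To illustrate these concepts, the following is a minimal Keras sketch of a small fully connected network trained with the Mean Squared Error loss and SGD; the layer sizes, learning rate, and random data are illustrative assumptions only.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A small two-layer fully connected regression network: linear -> ReLU -> linear.
model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(257,)),  # hidden layer with ReLU
    layers.Dense(257),                                         # linear output layer
])

# Mean Squared Error loss minimized with Stochastic Gradient Descent (SGD).
model.compile(optimizer=keras.optimizers.SGD(learning_rate=1e-3), loss="mse")

# Random data used only to show the epoch / batch-size / validation mechanics.
x = np.random.randn(1000, 257).astype("float32")
y = np.random.randn(1000, 257).astype("float32")
history = model.fit(x, y, epochs=10, batch_size=32, validation_split=0.2)

# Overfitting shows up when history.history["val_loss"] starts rising while
# history.history["loss"] keeps decreasing.
```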

One important type of Neural Networks is the Convolutional Neural Network, which has
been greatly used over the years, to enable machines to view the world as humans do,
perceive it in a similar manner, and even use the knowledge for a multitude of tasks such
as Image & Video recognition, Image Analysis & Classification, Media Recreation,
Recommendation Systems, Natural Language Processing, to name a few.

In section 2.7. and section 2.8., we review the literature behind Autoencoders, CNNs,
CNN extensions proposed by (Park & Lee, 2016), and their relevance to this dissertation.

2.7. Autoencoders

Autoencoders are neural networks that compress the input into a lower-dimensional code
and then decode (reconstruct) the output from this representation (code). This code is a
compact “summary” or “compression” of the input, also called the latent-space
representation.
2.7.1. Working Principle

An autoencoder consists of 3 main components: encoder, code and decoder. The encoder
summarizes (compresses) the input and produces a code, which is used by the decoder to
reconstruct the input. The figure below is a depiction of the architecture of an
autoencoder.

Figure 3: Autoencoder architecture (Dertat, 2017)

First, the input in Figure 3 is passed through the encoder, which is a fully-connected neural
network, to produce the code. The decoder, which has a similar neural network
structure, then produces the output using this code. The idea here is to get an output
identical to the input.
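
As an illustration, a minimal Keras sketch of such an autoencoder is given below; the input and code dimensions are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 784, 32      # illustrative sizes (e.g., a flattened 28x28 input)

inputs = keras.Input(shape=(input_dim,))
code = layers.Dense(code_dim, activation="relu")(inputs)       # encoder: compress to the code
outputs = layers.Dense(input_dim, activation="sigmoid")(code)  # decoder: reconstruct the input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# The network is trained with the input as its own target, e.g. autoencoder.fit(x, x, ...)
```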
In section 2.8., we review CNNs and how they can be nested with autoencoder layers.

2.8. Convolutional Neural Networks

Inspired by early findings in the study of biological vision, the name "convolutional
neural network" indicates that the network employs a mathematical operation called
convolution. Convolutional networks are a specialized type of neural network that uses
convolution in place of general matrix multiplication in at least one of their layers.

The architecture of a CNN is inspired by the organization of the Visual Cortex and is
analogous to the connectivity pattern of Neurons in the Human Brain. Individual neurons
can only respond to stimuli in a small area of the visual field called the Receptive Field.
A number of similar fields can be stacked on top of each other to span the full visual field
(Saha, 2018).

In computer vision applications, the CNN algorithm takes an input image and gives
relevance (learnable weights and biases) to various aspects/objects in the image, allowing
it to distinguish between them. When compared to other classification algorithms, the
amount of pre-processing required by a CNN is significantly lower. While filters are
hand-engineered in primitive methods, CNNs can learn these filters/characteristics with
adequate training.

2.8.1. Working Principle

CNNs are designed to exploit spatially-local correlation by enforcing a local connectivity


pattern between neurons of adjacent layers (Badescu & Cavez, 2021). In other words, the
inputs of hidden units in layer m come from a subset of units in layer m-1 that have
spatially contiguous receptive fields, as we can see in the figure below:

Figure 4: CNN exploiting spatially-local correlation (Badescu & Cavez, 2021)

In addition, each filter h_i is replicated across the entire layer. These replicated units share
the same parametrization (weight vector and bias) and form a feature map. In Figure 5 a
CNN feature map can be observed.

Figure 5: CNN feature map (Badescu & Cavez, 2021)

For these reasons, Convolutional Neural Networks are well suited to image and video
processing, and also to audio processing.

The image in Figure 6 shows a simple CNN architecture for classifying handwritten digit
images.

Figure 6: A CNN sequence to classify handwritten digits (Saha, 2018)

Figure 6 shows how the CNN layers reduce the dimensions of the image into a form
that is easier to process, without losing the features that are critical for a good
prediction.
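
As an illustration, the following is a minimal Keras sketch of a CNN of this kind for classifying the 10 digit classes; the number of layers and filters are illustrative assumptions and do not reproduce the exact architecture of Figure 6.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution + pooling blocks followed by a fully connected softmax classifier
# for the 10 digit classes.
model = keras.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```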

2.8.2. CNN Extensions

Having seen the working principle of CNNs, we will now review some of the CNN
extensions proposed by (Park & Lee, 2016).

2.8.2.1. Convolutional Encoder-Decoder Network (CED)

A Convolutional Encoder-Decoder (CED) network consists of symmetric encoding and
decoding layers in which each block represents a feature. This is depicted in Figure 7.

Figure 7: Modified Convolutional Encoder-Decoder Network (CED) (Park & Lee, 2016)

The encoder consists of repetitions of a convolution, batch normalization (BN), max-pooling,
and a Rectified Linear Units (ReLU) activation layer. The decoder consists of
repetitions of a convolution, batch normalization, and an up-sampling layer. Similar to an
autoencoder, the CED compresses the features along the encoder and then reconstructs
them along the decoder. To make the CED a fully convolutional network, a convolution
layer is used as the last layer instead of a Softmax layer.

2.8.2.2. Redundant Convolutional Encoder Decoder Network (R-CED)

R-CED consists of repetitions of a convolution, batch normalization, and a ReLU
activation layer, with each block representing a feature, as shown in Figure 8.

Figure 8: Proposed Redundant CED (R-CED) (Park & Lee, 2016)

No pooling layer is present, and thus no up-sampling layer is required. In contrast to the
CED, R-CED encodes the features into higher dimensions along the encoder and achieves
compression along the decoder. The number of filters is kept symmetric: at the encoder,
the number of filters is gradually increased, and at the decoder, the number of filters is
gradually decreased. The last layer is a convolution layer, which makes R-CED a fully
convolutional network.

2.8.2.3. Cascaded R-CED Network (CR-CED)

The Cascaded Redundant Convolutional Encoder-Decoder Network (CR-CED) is a
variation of the R-CED network which consists of cascaded R-CED networks. These
cascades are formed during training of the network, where cascades are increased
autonomously in order to minimize the error.

Compared to the R-CED with the same network size (i.e., with the same number of
parameters), CR-CED achieves better performance, both in terms of intelligibility and
perceptual analysis, with less convergence time.
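
To illustrate this family of networks, the following is a simplified Keras sketch of an R-CED-style fully convolutional denoiser built from convolution, batch-normalization, and ReLU blocks; the layer counts, filter numbers, and kernel size are illustrative assumptions and do not reproduce the exact CR-CED architecture of (Park & Lee, 2016).

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size=9):
    """The repeating R-CED unit: convolution -> batch normalization -> ReLU."""
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

# Input: 8 consecutive 129-bin noisy magnitude spectra, treated as a 1-D signal
# over frequency with 8 channels.
inputs = keras.Input(shape=(129, 8))
x = inputs

# "Encoder": the number of filters is gradually increased (no pooling is used,
# so no up-sampling layer is needed later).
for filters in (16, 32, 64):
    x = conv_bn_relu(x, filters)

# "Decoder": the number of filters is gradually decreased, keeping the layout symmetric.
for filters in (32, 16):
    x = conv_bn_relu(x, filters)

# The last layer is a plain convolution, which keeps the network fully convolutional
# and produces a single denoised 129-bin spectrum.
outputs = layers.Conv1D(1, 9, padding="same")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```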

In section 2.9., we develop our hypothesis based on this deep learning network.

2.9. Hypothesis Development

The previous subsections discussed existing VAD and speech enhancement methods and
the theory concerning VAD, speech enhancement, and Deep Learning. With the help of
the research conducted by (Park & Lee, 2016), we formulate our hypothesis in two parts.

H1. Given a segment of noisy spectra $\{x_t\}_{t=1}^{T}$ and clean spectra $\{y_t\}_{t=1}^{T}$, we aim to
learn a mapping $f$ which generates a segment of denoised spectra $\{f(x_t)\}_{t=1}^{T}$ that
approximates the clean spectra $y_t$ in the $\ell_2$ norm, i.e.,

$$\min_{f} \sum_{t=1}^{T} \left\| y_t - f(x_t) \right\|_2^2 \qquad (5)$$

Specifically, we formulate $f$ using a fully convolutional neural network, such that the past
$n_T$ noisy spectra $\{x_i\}_{i=t-n_T+1}^{t}$ are considered to denoise the current spectrum, i.e.,

$$\min_{f} \sum_{t=1}^{T} \left\| y_t - f(x_{t-n_T+1}, \ldots, x_t) \right\|_2^2 \qquad (6)$$

In order to avoid overloading our DNN architecture, we follow the experiments conducted
by (Liu et al., 2014) and (Park & Lee, 2016), which showed improved performance (based
on subjective measures) with $n_T = 8$, so that the input spectra to the network are
equivalent to about 100 ms of speech, whereas the output spectrum of the network has a
duration of 32 ms.
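
The following is a minimal NumPy sketch of how such predictor and target pairs can be assembled from STFT magnitude matrices, under the assumption that the noisy and clean spectrograms are time-aligned; the function name is ours and purely illustrative.

```python
import numpy as np

def make_segments(noisy_mag, clean_mag, n_t=8):
    """Stack the past n_t noisy STFT frames as the predictor for each clean target frame.
    noisy_mag and clean_mag are magnitude spectrograms of shape (num_bins, num_frames)."""
    predictors, targets = [], []
    for t in range(n_t - 1, noisy_mag.shape[1]):
        predictors.append(noisy_mag[:, t - n_t + 1:t + 1])   # shape (num_bins, n_t)
        targets.append(clean_mag[:, t])                      # shape (num_bins,)
    return np.stack(predictors), np.stack(targets)
```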

H2. Can the denoised spectra $\{f(x_t)\}_{t=1}^{T}$ obtained from H1 improve the decisions
made by the ITU-T G.729 Annex B recommended voice activity detection
algorithm?

For this dissertation, we shall adopt the CR-CED network model to perform single
channel speech enhancement of noisy speech signals in order to improve VAD algorithm
decisions.

We will be using a MATLAB implementation of the ITU-T G.729 Annex B recommended
VAD to test our hypothesis H2.

In the following chapter, we will present the methodology used to implement our CR-
CED DNN architecture for eliminating noise from speech signals.

CHAPTER 3

METHODOLOGY

This chapter explains how we use Deep Learning to create an architecture capable of
mapping noisy speech signals to their clean variants. To begin, we go over the
technologies used for building and training our denoising algorithm and the dataset we
used to train and test our speech enhancement model. Next, we discuss the modules we used for
preparing the dataset for the Neural Network. Then, we will present the architecture of
our suggested Deep Neural Network model. Finally, we test hypothesis H2 using our
DNN model.

3.1. Computing Technologies

In this section, we present the computing toolkit used for testing our hypotheses. Our
toolkit can be classified into two categories, namely:

 Computing: which involves the computing resources we used for building our
DNN architecture, e.g., servers, computers, etc.
 Analytics: which involves any software tool we used for testing hypotheses.

3.1.1. Computing toolkit

Traditional machine learning techniques are often used when the dataset size is small.
However, their performance degrades greatly as the dataset size gets larger. Deep
learning, on the other hand, exhibits advantageous scalability with huge dataset sizes,
hence the need for substantial computing power. Graphics Processing Units (GPUs) are
usually responsible for delivering the computing power needed for these tasks instead of
CPUs, as they offer a large number of concurrent threads compared to the single-thread
performance optimization provided by a CPU (“Why Are GPUs Necessary for Training
Deep Learning Models?,” 2017).

Given that we could not afford a PC with a good GPU and the recommended 16GB of RAM
(Running Kaggle Kernels with a GPU, 2021), we opted to use Kaggle Kernels.
Kaggle provides free access to NVIDIA K80 GPUs in kernels (Running Kaggle Kernels
with a GPU, 2021) with 16GB of RAM available. This results in a 12.5x speedup during
training of a deep learning model, with a total run time of 994 seconds compared to
13,419 seconds with a single CPU.

3.1.2. Analytics toolkit

This toolkit consists of the software and libraries we used in building, training, and testing
our denoising model. It comprises:

 Programming languages
 Software packages

3.1.2.1. Programming languages


The programming language used for our model was Python, with Jupyter Notebooks
provided by Kaggle kernels. Built on top of Python is TensorFlow, a free, open-source
software library that is arguably the most widely used deep learning framework today.
The library can be run on computers of all kinds, even on smartphones.

A powerful deep learning API we use for creating our DNN model is Keras, which runs
on top of TensorFlow and was developed with a focus on enabling fast experimentation
with neural network architectures thanks to its modularity, minimalism, and extensibility.
Furthermore, it supports convolutional networks, recurrent networks, and combinations
of both, including multi-input and multi-output training.

Another programming language we used is MATLAB, which is a high-performance


language for technical computing developed by MathWorks. MATLAB allows matrix
manipulations, plotting of functions and data, implementation of algorithms, creation of
user interfaces, and interfacing with programs written in other languages, like Python,
Java, etc.

3.1.2.2. Software packages


The software packages used within Python include:

 Pandas: used for data analysis and manipulation of our audio signals.
 NumPy: adds support for large, multi-dimensional arrays and matrices, along
with a large collection of high-level mathematical functions to operate on these
arrays.
 Scikit-learn: used for clustering and dimensionality reduction within our model.
 Librosa: a music and audio analysis library that provides the building blocks
necessary to create music information retrieval systems.

The software packages used within MATLAB include:

 DSP toolbox: This provides algorithms, apps, and scopes for designing,
simulating, and analyzing signal processing systems. This will be used to
resample audio signals within MATLAB for SNR analysis.
 Audio toolbox: This provides tools for audio processing, speech analysis, and
acoustic measurements. This will be used to read audio files into MATLAB for
SNR analysis.
 Simulink toolbox: This is a MATLAB-based graphical programming
environment for modeling, simulating, and analyzing multidomain dynamical
systems. This will be used in the results section of this dissertation to test our
hypothesis H2.

3.2. Datasets

The experiment was conducted using two publicly available audio datasets, namely:

 The Mozilla Common Voice (MCV): This dataset contains as many as 75,879
recorded clean speech audio clips, about 65GB covering 2,637 validated hours
spread across short MP3 files. Due to the lack of adequate computing resources,
we use a reduced version of this dataset provided by MathWorks at (Denoise
Speech Using Deep Learning Networks, 2021), which contains only 2,800
recorded clean speech samples and weighs only 988MB. The MCV project is
open source and anyone can contribute to it. The wide range of speakers in this
dataset is one of its best features: it includes recordings of male and female
speakers from a wide range of ages and accents.
 UrbanSound8K: This dataset contains 8,732 labeled sound excerpts of urban
noise classified into 10 commonly found urban sound classes: air conditioner,
car horn, children playing, dog bark, drilling, engine idling, gun shot,
jackhammer, siren, and street music. These classes are drawn from the urban
sound taxonomy and can be found at
https://urbansounddataset.weebly.com/urbansound8k.html.

We will use these urban sounds as noise signals added to the clean speech samples from
the MCV dataset. In other words, as illustrated in Figure 9, we first take a clean speech
signal from the MCV dataset (for example, someone speaking a random sentence), then
add noise to it, synthetically creating a scenario where, say, a woman is speaking while a
dog is barking in the background. Finally, we use this artificially created noisy signal
as the input to our deep learning model. Our neural network, in turn, receives this
noisy signal and tries to compute a clean representation of it.

Figure 9 displays a visual representation of a clean input signal from the MCV dataset,
a noise signal from the UrbanSound8K dataset, and the resulting noisy input, i.e., the
input speech after adding noise to it. Note that the noise power is set so that the
signal-to-noise ratio (SNR) is zero dB (decibels).

Figure 9: Plots of clean, noise, and noisy signals
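
The following is a minimal NumPy sketch of this mixing step, scaling the noise so that the resulting SNR equals 0 dB; it assumes the clean and noise signals have already been trimmed to the same length.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db=0.0):
    """Scale the noise so that the clean-to-noise power ratio equals snr_db,
    then add it to the clean speech. Both signals must have the same length."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```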

3.3. Data Preprocessing

This section deals with a crucial step in any deep learning project: implementing the data
preprocessing modules that allow for the extraction of the features required for training
and testing our deep learning network. This entails:

 Downsampling the audio signals to 8kHz,


 Removing silent frames from the audio signals,
 Computing the spectral STFT vectors

3.3.1. Downsampling Audio signals

Downsampling is a process performed on a sequence of samples of a signal or a
continuous function to produce an approximation of the sequence that would have been
obtained by sampling the signal at a lower rate. The motivation for downsampling our
audio signals lies in the fact that our dataset contains 48kHz recordings of subjects
speaking short sentences, and processing them at this rate would impose an unnecessarily
high computational load on the network.

Another reason for downsampling the audio signals is to mimic the 8kHz sample rate of
the speech coders used in narrowband GSM telephony applications (EETimes, 2003).

To perform this downsampling without risking any aliasing in our signals, we use the
Python library Librosa, which loads a signal directly at the desired new sampling rate.
The library automatically selects an appropriate anti-aliasing filter as well as a proper
decimation factor for our desired sampling rate (8kHz). The diagram below depicts what
our Librosa signal downsampler consists of, and a short loading sketch follows the figure.

Figure 10: The Librosa downsampler subsystem. (Signals and Systems - OpenStax CNX, 2021)
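
A minimal loading sketch, assuming a 48kHz MP3 recording on disk (the file name is a
placeholder), could look as follows.

import librosa

target_sr = 8000  # narrowband GSM sampling rate
# librosa.load resamples to the requested rate and handles anti-aliasing internally
audio_8k, sr = librosa.load("common_voice_sample.mp3", sr=target_sr)
assert sr == target_sr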

3.3.2. Removing Silent Frames

Another preprocessing stage involves removing silent frames from our audio signals.
As with the downsampling step, the idea here is to reduce the computational load of our
deep neural network, thereby lowering the required processing power and processing time,
and improving the training accuracy of the network.

To achieve this, we use the following Librosa method to split our audio files on silence:

Figure 11: Librosa split method (Librosa — Librosa 0.8.1 Documentation, 2021)

This method splits an audio signal into non-silent intervals, where silence is defined as
any segment whose level falls more than top_db decibels below the reference level. It
also takes an optional parameter, hop_length, which specifies the number of samples
between the frames under analysis. We use 20dB for the top_db parameter and 64 for the
hop length, as in the sketch below.
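
A minimal sketch of this step, using the top_db and hop_length values above (the helper
name is our own), could look as follows.

import numpy as np
import librosa

def remove_silent_frames(audio, top_db=20, hop_length=64):
    # Indices of the non-silent intervals (regions louder than top_db below the peak)
    intervals = librosa.effects.split(audio, top_db=top_db, hop_length=hop_length)
    # Keep only the active segments and stitch them back together
    return np.concatenate([audio[start:end] for start, end in intervals])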

3.3.3. Computing STFT vectors

This stage involves computing the spectral vectors of our audio signals with a 256-point
Short-Time Fourier Transform (32ms Hamming window) with a window shift of 64 points
(8ms) and a frequency resolution of 31.25 Hz (= 4kHz/128) per frequency bin. The STFT
formula is given by:

$$X_l[k] = \sum_{n=0}^{N-1} w[n]\, x[n + lH]\, e^{-j 2\pi k n / N} \qquad (7)$$

where,

$l$: the frame number for shifting the window
$k$: the frequency index of the output STFT, $k = 0, 1, \dots, N-1$
$H$: the hop length
$N$: the length of the DFT frame
$X_l[k]$: the STFT of the input signal $x[n]$
$x[n]$: the input signal indexed by $n$ (usually time)
$x[n + lH]$: the shifted input signal indexed by $n + lH$ (usually time)
$w[n]$: the window function of length $L$, in our case a Hamming window

The reason we use a Hamming window is that the FFT of a short audio segment taken from
the main audio signal implicitly assumes that this segment is periodic and repeats
infinitely before and after the analyzed segment in time. This erroneous assumption leads
to edge effects between repeating segments and therefore to what is known as spectral
leakage (a loss of frequency resolution caused by spectral information “leaking” from one
frequency position into adjacent values), which can be reduced with a windowing function
such as the Hamming window. The Hamming window reduces the amplitude of the
discontinuities at the edges of each finite segment, attenuating the side lobes and any
non-harmonic content, and hence improving the effective frequency resolution of our audio
signal. In Figure 13, we illustrate how spectral leakage is reduced by the Hamming window
function, using a segment of an audio signal from our dataset.

To carry out this stage of splitting the signal into discrete short-time frames before
feeding them to our network, we use the stft function from the Librosa library in Python,
as shown in Figure 12 below.

Figure 12: Librosa stft function (Librosa — Librosa 0.8.1 Documentation, 2021)

Figure 13: The spectral leakage reduction process

As per the parameters required by this function, the table below lists the values we used.

y: our digitized audio signal (after downsampling to 8kHz)
n_fft: 256
hop_length: 64 (25% of win_length, i.e. 75% overlap)
win_length: 256
window: hamming

Table 1: Librosa stft function parameters
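
A minimal sketch of this call, with a placeholder file name for the downsampled signal,
could look as follows.

import numpy as np
import librosa

audio_8k, sr = librosa.load("speech_sample.wav", sr=8000)  # downsampled input

stft = librosa.stft(audio_8k, n_fft=256, hop_length=64,
                    win_length=256, window="hamming", center=True)
magnitude = np.abs(stft)   # spectral magnitude, shape (129, n_frames)
phase = np.angle(stft)     # phase, kept to reconstruct the time signal later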

The next section presents our proposed DNN architecture.



3.4. Proposed Deep Neural Network Architecture

In this section, using the vectors obtained from the preprocessing stage, we proceed to
implement a deep neural network model for denoising speech in noisy environments. The
deep learning training scheme is shown below.

Figure 14: Deep learning training scheme(The MathWorks, Inc., 2021)

The magnitude spectra of the noisy and clean audio signals are used as the predictor and
target network signals, respectively. The network's output is the magnitude spectrum of
the denoised signal. The regression network minimizes the mean square error between its
output and the target by using the predictor input. The output magnitude spectrum and the
phase of the noisy signal are then used to transform the denoised audio back to the time
domain.

Once we obtain the STFT vectors, as elaborated in the previous section, we reduce the
size of each spectral vector to 129 by dropping the samples corresponding to negative
frequencies (the time-domain speech signal is real, so this discards no information).
Our predictor input consists of 8 consecutive noisy STFT vectors, so that each STFT
output estimate is computed from the current noisy STFT vector and the 7 previous ones.
In other words, the DNN model is an autoregressive system that predicts the current
signal based on past observations. The target signal consists of a single STFT frequency
representation of shape (129, 1) taken from the clean audio. The diagram below depicts
this process, and a short sketch of how the predictor segments are assembled follows the
figure.

Figure 15: STFT predictor and target vector inputs (The MathWorks, Inc., 2021)
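
The following is a minimal sketch of how such (129, 8) predictor segments could be
assembled from a noisy magnitude spectrogram; the name mirrors the prepare_input_features
helper referenced in Appendix F, but the body shown here is our own illustration.

import numpy as np

def prepare_input_features(spectrogram, numSegments=8, numFeatures=129):
    # Pad 7 copies of the first frame in front so every frame has 7 predecessors
    padded = np.concatenate(
        [np.tile(spectrogram[:, :1], (1, numSegments - 1)), spectrogram], axis=1)
    frames = spectrogram.shape[1]
    features = np.zeros((numFeatures, numSegments, frames), dtype=np.float32)
    for i in range(frames):
        # Frame i together with its 7 previous (padded) frames
        features[:, :, i] = padded[:, i:i + numSegments]
    return features  # shape: (129, 8, n_frames)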

Now that we have laid out how our DNN interacts with the STFT vectors, we can describe
how the DNN model itself works.

3.4.1. Our DNN model

Our DNN model is largely based on the work of (Park & Lee, 2016), where the authors
proposed a Cascaded Redundant Convolutional Encoder-Decoder Network (CR-CED). Hence, our
model is based on a symmetric encoder-decoder architecture in which both components
contain repeated blocks of Convolution, ReLU, and Batch Normalization. Our network
comprises 16 such blocks, which add up to about 33,000 parameters – roughly 132kB of
memory – and can therefore be implemented in an embedded system such as a mobile phone.

The figures below show the structure of our network model defined with Keras, which
we generated using the Tensorflow plot_model function. The figure was split into various
parts for space management.

Figure 16: DNN model structure, part 1; Figure 17: DNN model structure, part 2;
Figure 18: DNN model structure, part 3

It is important to note that there are skip connections between some of the encoder and
decoder blocks, such that the feature vectors from these blocks are combined through
addition. These skip connections speed up convergence and reduce the vanishing of
gradients during training. A minimal sketch of this repeated block pattern is given below.
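
The following is a minimal Keras sketch of the repeated Conv–ReLU–BatchNorm pattern with
one additive skip connection; the filter counts and kernel sizes are illustrative
assumptions, not the exact CR-CED configuration.

from tensorflow.keras import layers, models

def conv_block(x, filters, kernel_size):
    # Convolution is applied along the frequency axis only (kernel width 1 in time)
    x = layers.Conv2D(filters, (kernel_size, 1), padding="same")(x)
    x = layers.Activation("relu")(x)
    return layers.BatchNormalization()(x)

inputs = layers.Input(shape=(129, 8, 1))   # 8 consecutive noisy STFT frames
e1 = conv_block(inputs, 18, 9)             # encoder block
e2 = conv_block(e1, 30, 5)
d1 = conv_block(e2, 18, 9)                 # decoder block
d1 = layers.Add()([d1, e1])                # skip connection (feature addition)
# Collapse the 8 time steps into a single (129, 1, 1) output frame
outputs = layers.Conv2D(1, (1, 8), padding="valid")(d1)
model = models.Model(inputs, outputs)
model.summary()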

Another point to highlight is that, since we use the CR-CED network, which is an
extension of CNNs originally designed for computer vision, it is important to be aware
that audio data differ considerably from images: raw audio is a 1-dimensional time
series, whereas an image is a 2-dimensional representation of an instant in time.

For these reasons, we transform our audio signals into 2D (time/frequency)
representations. More specifically, given an input spectrum of shape (129 × 8) to our
network, convolution is only performed along the frequency axis. This ensures that the
frequency axis remains constant during forward propagation.

Once our network produces an output estimate, we minimize the mean squared error (MSE)
between the output and the target (clean audio) spectra. Figure 19 illustrates how our
DNN model minimizes the MSE between the noisy spectral vectors and the target clean
spectral vectors; the frequency spectrograms of these audio signals are used for
illustration purposes.

Figure 19: Reducing the MSE of our DNN model (Silva, 2019)
In the next subsection, we present the training process of our model.

3.4.2. Training our model

Training a model simply means learning (determining) good values for all the weights and
biases from labeled samples, in this case our clean and noisy audio signals. To improve
the prediction performance of our deep learning algorithm during this training phase, we
split our dataset into train, validation, and test sets. Splitting the dataset this way
also helps avoid a situation where the model fails to generalize to noise attributes it
has never seen – a problem known as overfitting. The test set is used to estimate how
well our model behaves on unseen data. The validation set is used to validate our model
under different configurations, such as optimizers or loss functions. The train set is
used to train, or fit, our model.

When splitting the dataset, we need to take the following considerations into account:

 Computational cost in training the model,
 Computational cost in evaluating the model,
 Training set representativeness,
 Test set representativeness.

This splitting is still aimed at avoiding overfitting of our model. In machine learning,
a proper split ratio usually falls in the range 60%-20%-20% to 98%-1%-1%, as shown in the
figure provided by (Data Splitting Technique to Fit Any Machine Learning Model | by
Sachin Kumar | Towards Data Science, 2021).

Figure 20: Splitting a Dataset

The percentages are labeled as train-validation-test ratios. In our case, we used
36%-36%-15% for the clean speech dataset (MCV) and 88%-2%-10% for the noise dataset
(UrbanSound8K). The unusual ratios for the clean speech dataset are due to technical
challenges encountered while analyzing the train.csv file provided within the MCV folder;
we faced a similar challenge with the noise dataset, although we had intended to use
ratios in the range shown in the figure above.

Finally, to fit our model, we make use of the test_tf_record.py and dataset.py
modules (available in Appendices E and F respectively), which use TFRecord – the data
format recommended by TensorFlow for training – to store the features of our clean and
noisy audio signals. We then use the fit function provided by Keras to train our model.
It takes as arguments the input data (source) and output data (target), the batch size,
the number of epochs, and the validation dataset (source and target). We use this
function in the snippet below; a hedged compilation sketch follows the parameter list.

model.fit(train_dataset, steps_per_epoch=600, validation_data=test_dataset, epochs=999)

where,

 train_dataset is our input training set of clean and noisy audio features
(magnitude and phase spectral vectors)
 steps_per_epoch defines the total number of steps (batches of samples) before
declaring one epoch finished and starting the next epoch during training
 validation_data, is the data on which to evaluate the loss and any model metrics
at the end of each epoch. The model will not be trained on this data.
 epochs defines the number of epochs to train the model.
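
The following is a minimal sketch of the compilation step we assume precedes this call;
the Adam optimizer and its learning rate are illustrative choices on our part, while
model, train_dataset and test_dataset are the objects defined above.

import tensorflow as tf

# Mean squared error between predicted and clean magnitude spectra, with RMSE as a metric
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

model.fit(train_dataset, steps_per_epoch=600,
          validation_data=test_dataset, epochs=999)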

In the following chapter, we will explain the methods used to evaluate the system and
present the results we obtained.

CHAPTER 4

RESULTS

In this chapter, we evaluate our proposed system against our hypotheses. To test
hypothesis H1, we use objective and subjective measurements; to test hypothesis H2, we
use waveform comparisons of our denoised signals; and we evaluate the performance of our
system relative to critical GSM parameters such as channel capacity, co-channel
interference, and power consumption.

4.1. Testing Hypothesis H1

In all speech enhancement algorithms, the improvement in the quality and intelligibility
is of utmost importance for ease and accuracy of information exchange. The speech
quality and intelligibility can be quantified using subjective and objective measures
(Krishnamoorthy, 2011). We implement these measurements in the next subsections.

4.1.1. Subjective Speech Intelligibility Test

Subjective speech quality measures are usually obtained using listening tests in which
human participants rate the quality of the speech in accordance with a predetermined
opinion scale. Listeners are presented with the sample speech audios and asked to rate the
quality of the speech on a numerical scale, typically a 5-point scale with 1 indicating poor
quality and 5 indicating excellent quality – a scoring range called Mean Opinion Score
(MOS). However, according to (Taal et al., 2010), such evaluation methods turn out to be
costly and time-consuming.

Hence, to perform the subjective speech intelligibility test, (Rix et al., 2001) suggest
the use of the Perceptual Evaluation of Speech Quality (PESQ) to predict the subjective
opinion score of degraded or enhanced speech. This is because PESQ is a sophisticated
algorithm that has been recommended by ITU-T (P.862) for speech quality assessment of
narrow-band handset telephony and narrow-band speech codecs (Hu & Loizou, 2008).

The PESQ measure takes a reference signal and the enhanced signal and aligns them in
both time and level. This is followed by a range of perceptually significant transforms
which include Bark spectral analysis, frequency equalization, gain variation equalization,
and loudness mapping (Rix et al., 2001).

The range of the PESQ score is −0.5 to 4.5, where −0.5 corresponds to poor speech quality
and 4.5 to excellent speech quality.

Another metric that has proven to be able to quite accurately predict the intelligibility of
noisy/processed speech in a large range of acoustic scenarios, including speech processed
by mobile communication devices, is the Short-Time Objective Intelligibility (STOI).
Recent studies by (Chen et al., 2016) and (Healy et al., 2017), show a good
correspondence between STOI predictions of noisy speech enhanced by DNN-based
speech enhancement systems, and speech intelligibility.

STOI is based on the correlation between the envelopes of clean and degraded speech
signals – denoted by 𝑥 and 𝑦 respectively, decomposed into regions that are
approximately 400ms in length and uses a simple DFT-based Time-Frequency-
decomposition. According to (Taal et al., 2011), the output of STOI is a scalar value
which is expected to have a monotonic relation with the average intelligibility of 𝑦 (e.g.,
the percentage of correctly understood words averaged across a group of users). It is a
function of a Time Frequency dependent intermediate intelligibility measure, which
compares the temporal envelopes of the clean and degraded speech in short-time regions
by means of a correlation coefficient. The following vector notation is used to denote the
short-time temporal envelope of the clean speech:

$$x_{j,m} = \left[ X_j(m-N+1),\; X_j(m-N+2),\; \dots,\; X_j(m) \right]^{T} \qquad (8)$$

where $N = 30$, which corresponds to an analysis length of approximately 400ms, and
$X_j(m) = \sqrt{\sum_{k=k_1(j)}^{k_2(j)-1} |\hat{x}(k,m)|^2}$, with $\hat{x}(k,m)$
denoting the $k^{th}$ DFT-bin of the $m^{th}$ frame of the clean speech. Similar
notation, $y_{j,m}$, is used for the short-time temporal envelope of the degraded speech.

Thus, the correlation coefficient between $x_{j,m}$ and $y_{j,m}$ is given by

$$d_{j,m} = \frac{\left(x_{j,m} - \mu_{x_{j,m}}\right)^{T} \left(\bar{y}_{j,m} - \mu_{\bar{y}_{j,m}}\right)}{\left\| x_{j,m} - \mu_{x_{j,m}} \right\| \, \left\| \bar{y}_{j,m} - \mu_{\bar{y}_{j,m}} \right\|} \qquad (9)$$

where $\bar{y}_{j,m}(n) = \min\left( \frac{\left\| x_{j,m} \right\|}{\left\| y_{j,m} \right\|}\, y_{j,m}(n),\; \left(1 + 10^{15/20}\right) x_{j,m}(n) \right)$
is the normalized and clipped version of $y$, and $\mu_{(.)}$ refers to the sample
average of the corresponding vector.

Finally, the average of the intermediate intelligibility measure over all bands and
frames, referred to as the STOI score, is given by

$$d = \frac{1}{JM} \sum_{j,m} d_{j,m} \qquad (10)$$

where $M$ represents the total number of frames and $J$ the number of one-third octave
bands.

According to (Taal et al., 2010), the output of STOI, 𝑑, takes values −𝟏 ≤ 𝒅 ≤ 𝟏 but is
in practice non-negative (Intelligibility Prediction for Speech Mixed with White Gaussian
Noise at Low Signal-to-Noise Ratios: The Journal of the Acoustical Society of America:
Vol 149, No 2, 2021).
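
As an illustration only, the scalar STOI value can be computed in Python with the pystoi
package, which implements Taal et al. (2011); the dissertation itself uses the stoi
function of pysepm. The file names below are placeholders.

import librosa
from pystoi import stoi

clean, sr = librosa.load("clean.wav", sr=8000)       # clean reference
denoised, _ = librosa.load("denoised.wav", sr=8000)  # degraded or denoised speech
length = min(len(clean), len(denoised))              # compare equal-length signals

d = stoi(clean[:length], denoised[:length], sr, extended=False)  # scalar roughly in [0, 1]
print(f"STOI: {d:.4f}")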

In this dissertation, we based our measurement on 30 clean speech and noise samples
from our test dataset discussed in the methodology. This test dataset comprises clean and
noise samples, which we added individually to obtain 30 noisy speech samples with SNR
of 0𝑑𝐵. These noisy speech samples are then fed to our speech enhancement DNN
algorithm to obtain denoised speech samples. The denoised samples, the noisy speech
samples, and the original clean speech samples are then used to perform the subjective
speech intelligibility test using the aforementioned metrics.

4.1.1.1. PESQ Measurement

Since PESQ has no simple closed-form mathematical expression, we measure this metric with
the help of the ‘PESQ Software’ provided by ITU-T (P.862 : Perceptual Evaluation of
Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment
of Narrow-Band Telephone Networks and Speech Codecs, 2021), embedded in the pesq
function of the pysepm package from (schmiph2, 2019/2021).

The table below shows the results of the first 10 PESQ scores for the noisy and denoised
speech samples.

No. | Speech File | Noise File | pesq_before | pesq_after
1 | common_voice_en_17259150.wav | 83502-0-0-7.wav | 1.6925 | 2.8905
2 | common_voice_en_137653.wav | 115418-9-0-20.wav | 1.5764 | 3.0212
3 | common_voice_en_13900.wav | 99192-4-0-55.wav | 1.9117 | 2.4588
4 | common_voice_en_113715.wav | 73524-0-0-92.wav | 2.3059 | 2.9907
5 | common_voice_en_19680737.wav | 189982-0-0-35.wav | 1.7759 | 2.5065
6 | common_voice_en_18248920.wav | 136558-9-1-39.wav | 1.4425 | 2.6127
7 | common_voice_en_17957640.wav | 155241-9-0-50.wav | 1.6901 | 2.8396
8 | common_voice_en_19471376.wav | 73524-0-0-101.wav | 1.4425 | 3.0938
9 | common_voice_en_18467543.wav | 162134-7-5-0.wav | 2.2385 | 2.6869
10 | common_voice_en_18740242.wav | 129750-2-0-34.wav | 1.7119 | 2.4366

Table 2: PESQ results from DNN model

The results above show that, for our first 10 noisy and denoised speech samples, the
perceptual speech quality is significantly increased with the help of our DNN model. A
hedged sketch of how such scores can be computed is shown below.
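
As an illustration, the following minimal sketch uses the standalone pesq package (an
ITU-T P.862 wrapper) in narrow-band mode instead of pysepm; the file names are
placeholders.

import librosa
from pesq import pesq

ref, sr = librosa.load("clean.wav", sr=8000)     # reference (clean) speech
deg, _ = librosa.load("denoised.wav", sr=8000)   # degraded or denoised speech

score = pesq(sr, ref, deg, "nb")  # 'nb' = narrow-band mode for 8 kHz telephony
print(f"PESQ: {score:.4f}")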

4.1.1.2. STOI Measurement

This metric was measured with the help of the stoi function from the pysepm module.
The table below shows the results of the first 10 STOI values for the noisy and denoised
speech samples.

No. | Speech File | Noise File | stoi_before | stoi_after
1 | common_voice_en_17259150.wav | 83502-0-0-7.wav | 0.6394 | 0.777
2 | common_voice_en_137653.wav | 115418-9-0-20.wav | 0.6065 | 0.7007
3 | common_voice_en_13900.wav | 99192-4-0-55.wav | 0.811 | 0.7221
4 | common_voice_en_113715.wav | 73524-0-0-92.wav | 0.732 | 0.6976
5 | common_voice_en_19680737.wav | 189982-0-0-35.wav | 0.669 | 0.7057
6 | common_voice_en_18248920.wav | 136558-9-1-39.wav | 0.7336 | 0.67
7 | common_voice_en_17957640.wav | 155241-9-0-50.wav | 0.5037 | 0.6586
8 | common_voice_en_19471376.wav | 73524-0-0-101.wav | 0.788 | 0.735
9 | common_voice_en_18467543.wav | 162134-7-5-0.wav | 0.304 | 0.714
10 | common_voice_en_18740242.wav | 129750-2-0-34.wav | 0.71 | 0.6518

Table 3: STOI results from DNN model

The table above shows varying intelligibility scores, with 0.6518 being the minimum and
0.777 the maximum STOI value among the denoised samples in this result set. It can also
be noticed that some denoised speech samples have lower STOI values than their noisy
counterparts, as is the case with rows 3, 4, 6, 8, and 10. This is due to the denoising
errors introduced during the training of our DNN model.



It is worth mentioning that the prediction root-mean-square error (RMSE) of our model was
evaluated at 0.4375, indicating that our model denoises the noisy speech signals with up
to 14.25% feature prediction error; in other words, about 85.75% of the extracted noisy
features could be correctly predicted, hence the inaccuracy in some denoised samples.

4.1.2. Objective Speech Intelligibility Test

Objective quality measures predict perceived speech quality by computing a numerical
distance, or distortion, between the original and the processed speech.

This dissertation uses the signal-to-noise ratio (SNR), which measures the ratio of
signal energy to noise energy, expressed in decibels (dB), and is given by
(Krishnamoorthy, 2011):

$$SNR_{dB} = 10 \log_{10} \frac{\sum_n s^2(n)}{\sum_n \left[ s(n) - \hat{s}(n) \right]^2} \qquad (11)$$

where $s(n)$ is the clean speech and $\hat{s}(n)$ is the noisy or denoised speech signal.
Since the SNR is very sensitive to the time alignment of the original and processed
signals, we padded the noisy and denoised signals with zeros to align them with the
original (clean) speech signal. A minimal sketch of this computation is given below.
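
A minimal sketch implementing Eq. (11), with zero-padding for alignment, could look as
follows.

import numpy as np

def snr_db(clean, processed):
    # Pad (or trim) the processed signal to the length of the clean reference
    if len(processed) < len(clean):
        processed = np.pad(processed, (0, len(clean) - len(processed)))
    else:
        processed = processed[:len(clean)]
    noise_energy = np.sum((clean - processed) ** 2)
    return 10 * np.log10(np.sum(clean ** 2) / noise_energy)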

The table below shows the results of the first 10 SNR values for the noisy and denoised
speech samples.

Speech File Noise File snr_db_before snr_db_after

1 common_voice_en_17259150.wav 83502-0-0-7.wav 0.0000 6.9692

2 common_voice_en_137653.wav 115418-9-0-20.wav 0.0000 3.7557

3 common_voice_en_13900.wav 99192-4-0-55.wav 0.0000 5.6884

4 common_voice_en_113715.wav 73524-0-0-92.wav 0.0000 3.2927

5 common_voice_en_19680737.wav 189982-0-0-35.wav 0.0000 3.0613

6 common_voice_en_18248920.wav 136558-9-1-39.wav 0.0000 4.0556

7 common_voice_en_17957640.wav 155241-9-0-50.wav 0.0000 2.7045

8 common_voice_en_18467543.wav 162134-7-5-0.wav 0.0000 5.1004

9 common_voice_en_18740242.wav 129750-2-0-34.wav 0.0000 5.9649

10 common_voice_en_18467541.wav 115241-9-0-2.wav 0.0000 7.4227

Table 4: SNR results from DNN model

The results above show that, for our first 10 noisy and denoised speech samples, the SNR
is significantly increased with the help of our DNN model, from 0dB for the noisy signals
to a maximum of 7.4227dB after denoising.

The next table below shows the average speech metrics (PESQ, STOI, and SNR)
computed over 30 clean speech and noise samples in our test dataset.

Metric | Mean (noisy) | Mean (denoised)
PESQ (subjective) | 1.8016 | 2.4755
STOI (subjective) | 0.6724 | 0.7016
SNR (objective) | 0 dB | 5.1737 dB

Table 5: Average of subjective and objective metrics

It can be observed from these results that, based on 30 samples from our test dataset,
there is a significant increase in the PESQ, STOI, and SNR of the denoised signals
obtained at the output of our model.

In this section, we tested hypothesis H1 and observed that our DNN exhibited improved
performance, based on both subjective (PESQ, STOI) and objective (SNR) quality measures,
in denoising noisy speech signals. We test hypothesis H2 in the next section.

4.2. Testing Hypothesis H2

Now that we have tested hypothesis H1 using the speech quality measures above, visually
inspecting the waveforms of the resulting signals can also tell us how promising our
denoising algorithm is with respect to its application to the ITU-T G.729 recommended
VAD.

To begin, let us visualize our noisy input signal, which is composed of the last speech
file in Table 4 (common_voice_en_18467541.wav) and a noise file selected at random from
our test dataset (115418-9-0-20.wav). The figure below shows the waveforms of these
signals.

Figure 21: Test noisy Input signal


Passing the noisy speech signal shown in the figure above into the ITU-T G.729
recommended VAD algorithm implemented in MATLAB (provided in Appendix G) results in the
following waveform.

Figure 22: VAD decision for noisy-input.wav


The following figure shows the waveform of the resulting denoised speech signal.
Appendix D shows its corresponding spectrogram.

Figure 23: Denoised speech signal

Passing this denoised signal (the third waveform in the figure above) to the G.729
algorithm results in the following waveform.

Figure 24: VAD decision for denoised speech signal

It can be observed from Figure 24 that the G.729 VAD detects the presence of speech much
better than in Figure 22, where it could hardly approximate the presence of speech in the
signal.

4.3. Testing GSM Channel Capacity

Given that we have tested both hypotheses H1 and H2 and obtained improved intelligibility
and voice activity detection, it is also important to measure how our enhanced speech
affects a critical GSM resource such as channel capacity.

Usually, once the GSM speech coder encodes our denoised speech, the encoded speech is
passed onto the GSM Traffic Channel (TCH). The TCH is responsible for carrying digitally
encoded speech on the forward and reverse links after a mobile has established a
connection with the GSM Base Transceiver Station (BTS).

TCH supports 2 types of information rates, namely:

 TCH/FS: the Full Rate Speech Channel (ECSTUFF4U for Electronics Engineer, 2022),
which carries encoded speech at a rate of 22.8kbps.
 TCH/HS: the Half Rate Speech Channel, which carries up to 11.4kbps of encoded
speech (ECSTUFF4U for Electronics Engineer, 2022). Its main purpose is to support two
calls in a single GSM channel.

Now, for a noiseless channel, the Nyquist capacity formula defines the theoretical
channel capacity $C$ as:

$$C = 2 B \log_2(L) \qquad (12)$$

where $B$ is the bandwidth of the channel in hertz and $L$ is the number of signal levels
used to represent the encoded speech data.

However, since we cannot have a noiseless channel in real life, we base our test on the
Shannon capacity, which gives the theoretical capacity of a noisy channel as:

$$C = B \log_2(1 + SNR) \ \ \text{bits/sec} \qquad (13)$$

where $SNR$ is the (linear) signal-to-noise ratio of our encoded speech signal, and we
use $B = 25$ kHz, which is the full-duplex speech channel bandwidth used for
communication with GSM900 Base Stations (BTS) (Asik & Amca, 2022).

Given that the $SNR_{dB}$ was calculated in section 4.1.2, we obtain the linear $SNR$
here using:

$$SNR = 10^{SNR_{dB}/10} \qquad (14)$$
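
A minimal sketch of this conversion, combining Eqs. (13) and (14), could look as follows;
the 0 dB case reproduces the 25,000 bit/s baseline exactly, while the second value is
purely illustrative.

import numpy as np

B = 25_000  # full-duplex GSM900 speech channel bandwidth in Hz

def shannon_capacity(snr_db):
    snr_linear = 10 ** (snr_db / 10)      # Eq. (14): dB to linear ratio
    return B * np.log2(1 + snr_linear)    # Eq. (13): capacity in bits per second

print(shannon_capacity(0.0))    # 0 dB noisy input -> 25,000 bit/s
print(shannon_capacity(5.17))   # e.g. a denoised sample around the mean SNR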


The table below shows the results of the first 10 channel capacity values for the noisy and
denoised speech samples passed over a noisy channel.

No. | Speech File | Noise File | Capacity before (bit/s) | Capacity after (bit/s)
1 | common_voice_en_17259150.wav | 83502-0-0-7.wav | 25000 | 66810.28
2 | common_voice_en_137653.wav | 115418-9-0-20.wav | 25000 | 40307.62
3 | common_voice_en_13900.wav | 99192-4-0-55.wav | 25000 | 54915.89
4 | common_voice_en_113715.wav | 73524-0-0-92.wav | 25000 | 43538.49
5 | common_voice_en_19680737.wav | 189982-0-0-35.wav | 25000 | 40079.51
6 | common_voice_en_18248920.wav | 136558-9-1-39.wav | 25000 | 51855.73
7 | common_voice_en_17957640.wav | 155241-9-0-50.wav | 25000 | 42884.4
8 | common_voice_en_19471376.wav | 73524-0-0-101.wav | 25000 | 38657.97
9 | common_voice_en_18467543.wav | 162134-7-5-0.wav | 25000 | 52011.5
10 | common_voice_en_18740242.wav | 129750-2-0-34.wav | 25000 | 66601.62

Table 6: Channel capacity for noisy and denoised speech samples

The channel capacity values in Table 6 indicate the maximum rate at which speech can be
transmitted through a 25kHz full-duplex channel with very small error probability
(Channel Capacity - an Overview | ScienceDirect Topics, 2022). It can be observed that
there is a significant increase in channel capacity when enhanced speech is transmitted
across the channel.

Other important GSM resources, such as co-channel interference and power efficiency,
would also be worth analyzing, but this would require a prior analysis of the GSM cell
structure, the co-channel cell distance $D$, the cell radius $R$, and the interference
power caused by an interfering co-channel base station, which is outside the scope of our
research work.

4.4.Results Discussion

The summary results in Table 5 give an idea of how well our DNN model performs in
denoising noisy signals, based on 30 noisy sample signals. Comparing the average PESQ
score we obtained (2.4755) with that obtained by (Park & Lee, 2016), 2.34, it appears
that our model maintains good speech quality while denoising.

Also, comparing our average STOI value (0.7016) with that obtained by (Park & Lee, 2016),
0.83, we can deduce that our model remains reasonably good at preserving the
intelligibility of denoised speech signals.

Another research work, by (Badescu & Cavez, 2021), reported a PESQ score of 2.1920 under
SNR conditions of −5dB; compared with our PESQ score under SNR conditions of 0dB, this
suggests that our model provides a good intelligibility level for 0dB noisy speech
signals.

Finally, given a full-duplex channel with a bandwidth of 25kHz and using 30 noisy and
denoised speech samples, our work showed a significant increase in channel capacity, with
an average of 54.598 kbit/s, which represents an increase of about 118%.

Hence, based on these results, we can assert that we can generate denoised spectra
$\{f(x_t)\}_{t=1}^{T}$ that approximate the clean spectra $\{y_t\}_{t=1}^{T}$ in the
$\ell_2$ norm while maintaining good perception and intelligibility levels, and that our
denoised spectra can significantly improve the decisions made by the ITU-T G.729 Annex B
recommended voice activity detection algorithm.

CONCLUSION

Summary

In this dissertation, we aimed to implement a speech enhancement system capable of
improving VAD decisions for noisy speech signals using deep neural networks. Inspired by
the research conducted by (Park & Lee, 2016) and (Silva, 2019), we hypothesized that a
CR-CED DNN can effectively denoise speech with a small network size, and that the
resulting denoised speech can improve the decisions made by the ITU-T G.729 VAD.

The results obtained from the proposed deep neural network, presented in the previous
chapter, establish that VAD decision errors can be attenuated with the help of speech
enhancement, which improves perception and intelligibility levels and increases the
signal-to-noise ratio.

Remarks

Our study showed a close resemblance between our performance metric results and those
obtained in previous works by (Park & Lee, 2016) and (Badescu & Cavez, 2021).

Our speech enhancement model accuracy was 85.75%, i.e., an RMSE of 14.25%, indicating the
prediction error percentage our model exhibited during training.

Despite the technical issues encountered with our datasets and the limited computational
resources, we believe that, once these constraints are overcome, greater accuracy and
hence a lower RMSE percentage can be achieved.

Future Work

In future work, we intend to increase the size of our clean speech dataset (MCV) – about
71GB – adjust the training parameters for the noise dataset in our model, and set
appropriate train-validation-test ratios to improve the accuracy of our DNN model.

Lastly, to overcome the time-frequency resolution limitations of the short-time Fourier
transform, we intend to replace the STFT with the Wavelet Transform (WT) of the speech
signals at the preprocessing stage.

REFERENCES

Asik, H., & Amca, H. (2019). Hand-over power level adjustment for minimizing cellular mobile
communication systems health concerns. Ciência e Técnica Vitivinícola, Vl. 34. No.
7, pp. 2416-3953.

Badescu, D. M., & Cavez, A. B. (n.d.). Speech Enhancement using Deep Learning. 33.
https://upcommons.upc.edu/bitstream/handle/2117/100596/Speech+Enhancement+usin
g+Deep+Learning.pdf?sequence=1.
Baghdasaryan, D. (2018). Real-Time noise suppression using deep learning | by davit
baghdasaryan | towards data science. Real-Time noise suppression using deep learning.
Retrieved on 17/02/2022 from, https://towardsdatascience.com/real-time-noise-
suppression-using-deep-learning-38719819e051.

Benyassine, A., Shlomot, E., Su, H.-Y., Massaloux, D., Lamblin, C., & Petit, J.-P. (1997). ITU-
T Recommendation G.729 Annex B: A silence compression scheme for use with G.729
optimized for V.70 digital simultaneous voice and data applications. IEEE
Communications Magazine, Vl. 35. No. 9, pp. 64–73. Retrieved from,
https://doi.org/10.1109/35.620527.

Chaudhary, M. (2020). Activation functions: Sigmoid, tanh, relu, leaky relu, softmax. Medium.
Retrieved 14/02/2022 from, https://medium.com/@cmukesh8688/activation-functions-
sigmoid-tanh-relu-leaky-relu-softmax-50d3778dcea5.

Chen, J., Wang, Y., Yoho, S. E., Wang, D., & Healy, E. (2016). Large-scale training to increase
speech intelligibility for hearing-impaired listeners in novel noises. The Journal of the
Acoustical Society of America, Vl. 22. No. 6, pp 67-78. Retrieved from,
https://doi.org/10.1121/1.4948445.

Craciun, A., & Gabrea, M. (2004). Correlation coefficient-based voice activity detector algorithm.
Canadian Conference on Electrical and Computer Engineering. (IEEE Cat.
No.04CH37513). Retrieved from, https://doi.org/10.1109/CCECE.2004.1349763.

Dertat, A. (2017). Applied deep learning - Part 3: Autoencoders. Medium. Retrieved on
28/07/2022 from, https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-
1c083a f4d.

ECSTUFF4U for Electronics Engineer. (2018). Retrieved on 28/07/2022 from,
https://www.ecstuff4u.com/

EETimes. (2003). Sorting through GSM codecs: A tutorial. EETimes. Retrieved on
14/03/2022 from, https://www.eetimes.com/sorting-through-gsm-codecs-a-tutorial.

Farahani, G. (2017). Autocorrelation-based noise subtraction method with smoothing,
overestimation, energy, and cepstral mean and variance normalization for noisy speech
recognition. EURASIP Journal on Audio, Speech, and Music Processing, Vl. 1. No. 13,
pp. 13-22. Retrieved from, https://doi.org/10.1186/s13636-017-0110-8.

Hahn, M., & Park, C. K. (1992). An improved speech detection algorithm for isolated Korean
utterances. [Proceedings] ICASSP-92: 1992 IEEE International Conference on
Acoustics, Speech, and Signal Processing. Retrieved from,
https://doi.org/10.1109/ICASSP.1992.2258.

Haigh, J. A., & Mason, J. S. (1993). Robust voice activity detection using cepstral features.
Proceedings of TENCON ’93. IEEE Region 10 International Conference on Computers,
Communications and Automation. Retrieved from, https://doi.org/10.1109/ TENCON.
1993 .327987.

Healy, E. W., Delfarah, M., Vasko, J. L., Carter, B. L., & Wang, D. (2017). An algorithm to
increase intelligibility for hearing-impaired listeners in the presence of a competing
talker. The Journal of the Acoustical Society of America, Vl. 141. No. 6, pp. 4230–4239.
Retrieved from, https://doi.org/10.1121/1.4984271.

Hu, Y., & Loizou, P. C. (2008). Evaluation of objective quality measures for speech enhancement.
IEEE Transactions on Audio, Speech, and Language Processing, Vl. 16. No. 1, pp. 229–
238. Retrieved from, https://doi.org/10.1109/TASL.2007.911054.

ITU-T. (2021). P.862 : Perceptual evaluation of speech quality (PESQ): An objective method
for end-to-end speech quality assessment of narrow-band telephone networks and
speech codecs. (n.d.). Retrieved May 22, 2022, from https://www.itu.int/rec/T-REC-
P.862-200102-I/en.
JalFaizy, S. (2017). Why are GPUs necessary for training deep learning models? Retrieved on
23/04/2022 from, https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-
deep-learning.

Jong, W. S., Joon-Hyuk, C., Barbara, S., Hwan, S. Y., & Nam, S. K. (2005). Voice activity
detection based on generalized gamma distribution. Proceedings. (ICASSP ’05). IEEE
International Conference on Acoustics, Speech, and Signal Processing. Retrieved from,
https://doi.org/10.1109/ICASSP.2005.1415230.

Jongseo, S., & Wonyong, S. (1998). A voice activity detector employing soft decision based noise
spectrum adaptation. Proceedings of the 1998 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181). Retrieved
from, https://doi.org/10.1109/ICASSP.1998.674443.

Kawamura, A., Thanhikam, W., & Iiguni, Y. (2012). Single channel speech enhancement
techniques in spectral domain. ISRN Mechanical Engineering. Retrieved from,
https://doi.org/10.5402/2012/919234.

Krishnamoorthy, P. (2011). An Overview of subjective and objective quality measures for noisy
speech enhancement algorithms. IETE Technical Review, Vl. 28. No. 4, pp. 292–301.
Retrieved from, https://doi.org/10.4103/0256-4602.83550.

Kumar, A., & Florencio, D. (2016). Speech enhancement in multiple-noise conditions using
deep neural networks. Retrieved from, https://doi.org/10.21437/Interspeech.

Librosa. (2021). Librosa 0.8.1 documentation. (n.d.). Retrieved on 2/11/2021 from,


https://librosa.org/doc/main.

Liu, D., Smaragdis, P., & Kim, M. (2014). Experiments on Deep Learning for Speech Denoising.
New York: Mc. Hill.

Mathworks: (2021). Denoise Speech Using Deep Learning Networks. Retrieved on 23/09/2021,
from https://www.mathworks.com/help/deeplearning/ug/denoise-speech-using-deep-
learning-networks.html.

McLoughlin, I. (2016). Speech and audio processing: A Matlab-based approach. Cambridge:


University Press.

Mozilla. (2020). Common voice corpus 9.0. (n.d.). Retrieved May 26, 2022, from
https://commonvoice.mozilla.org/

Ortega-Garcia, J., & Gonzalez-Rodriguez, J. (1996). Overview of speech enhancement techniques


for automatic speaker recognition. Proceeding of Fourth International Conference on
Spoken Language Processing. ICSLP Journal, Vl. 96. No. 2, pp. 929–932. Retrieved
from, https://doi.org/10.1109/ICSLP.1996.607754.

Park, S. R., & Lee, J. (2016). A Fully convolutional neural network for speech enhancement.
ArXiv:1609.07132 [Cs]. Retrieved on 24/11/21 from, http://arxiv.org/abs/1609.07132.

Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual evaluation of
speech quality (PESQ)-a new method for speech quality assessment of telephone
networks and codecs. IEEE International Conference on Acoustics, Speech, and Signal
Processing. Proceedings, Vl. 2. No. 1, pp. 749–752. Retrieved from,
https://doi.org/10.1109/ ICASSP .2001.941023.

Becker, D. (2018). Running kaggle kernels with a GPU. Retrieved on 4/11/2021, from
https://kaggle.com/dansbecker/running-kaggle-kernels-with-a-gpu.

Saha, S. (2018, December 17). A Comprehensive Guide to Convolutional Neural Networks—The


ELI5 way. Medium. https://towardsdatascience.com/a-comprehensive-guide-to-
convolutional-neural-networks-the-eli5-way-3bd2b1164a53.

Salamon, J., Jacoby, C., & Bello, J. P. (2014). A Dataset and Taxonomy for Urban Sound

Research. Proceedings of the 22nd ACM International Conference on Multimedia, 1041–

1044. https://doi.org/10.1145/2647868.2655045

Sachin, K. (2020). Data splitting technique to fit any machine learning model . Towards data
science. Retrieved 5/11/2021 from, https://towardsdatascience.com/data-splitting-
technique-to-fit-any-machine-learning-model-c0d7f3f1c790.

Samaya, M., Jeremy, N., Romeo, K., & Alex, A. (2021). Building deep learning models with
tensorflow—home | coursera [E-learning]. Building deep learning models with
tensorflow. Retrieved on 11/11/2021, from https://www.coursera.org/learn/building-
deep-learning-models-with-tensorflow/home/welcome.

Simone, G., & Carl, H. (2021). Intelligibility prediction for speech mixed with white
Gaussian noise at low signal-to-noise ratios. The Journal of the Acoustical Society of
America, Vl. 149. No. 2. Retrieved on 9/11/2021 from,
https://asa.scitation.org/doi/full/10.1121/10.0003557.

Schmiph2, & Philipos, C. L. (2021). Pysepm—Python speech enhancement performance
measures (quality and intelligibility). Python package.

Scourias, J. (1995). Overview of the global system for mobile communications. Waterloo:
University Press.

Shivakumar, P. G., & Georgiou, P. (2016). Perception optimized deep denoising autoencoders
for speech enhancement. In, Prashanth, G. S., & Panayiotis, G. (Eds.).
Interspeech. New York: University Press, pp. 3743–3747.

Silva, T. (2019). Practical deep learning audio denoising. Thalles' blog. Retrieved on
29/11/2021 from, https://sthalles.github.io/practical-deep-learning-audio-denoising/.

Sohn, J., Kim, N., & Sung, W. (1999). A statistical model-based voice activity detection. IEEE
Signal Processing Letters, Vl. 6. No. 1, pp. 1–3.

Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2010). A short-time objective
intelligibility measure for time-frequency weighted noisy speech. IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4214–4217.

Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An Algorithm for intelligibility
prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio,
Speech, and Language Processing, Vl. 19. No. 7, pp. 2125–2136. https://doi.org/10.1109/
TASL. 2011.2114881.

Tashev, I. J., & Mirsamadi, S. (2016). DNN-based causal voice activity detector. 5. Retrived on
17/09/2021 from, https://www.semanticscholar.org.

The MathWorks, Inc. (n.d.). Denoise Speech using deep learning networks—MATLAB &
Simulink. Retrieved on15/03/2021, from
https://www.mathworks.com/help/audio/ug/denoise-speech-using-deep-learning-
networks.html.

Yuliani, A. R., Amri, M. F., Suryawati, E., Ramdan, A., & Pardede, H. F. (2021). Speech
enhancement using deep learning methods. Jurnal Elektronika dan Telekomunikasi, Vl. 21.
No. 1, pp. 16-19. Retrieved from, https://doi.org/10.14203/jet.v21.19-26.

APPENDICES

Appendix A: Interfering noise in speech transmission illustration

Appendix B: VAD Decision of a noisy speech signal

Appendix C: VAD Decision of Noisy Audio



Appendix D: Improving VAD decisions with Speech Enhancer

Appendix E: test_tf_record.py

import tensorflow as tf
import numpy as np
from utils import play
from data_processing.feature_extractor import FeatureExtractor

train_tfrecords_filenames = '../kaggle/working/records/test_0.tfrecords'


def tf_record_parser(record):
    # Describe the three serialized features stored in each TFRecord example
    keys_to_features = {
        "noise_stft_phase": tf.io.FixedLenFeature((), tf.string, default_value=""),
        'noise_stft_mag_features': tf.io.FixedLenFeature([], tf.string),
        "clean_stft_magnitude": tf.io.FixedLenFeature((), tf.string)
    }

    features = tf.io.parse_single_example(record, keys_to_features)

    # Decode the raw byte strings back into float32 tensors
    noise_stft_mag_features = tf.io.decode_raw(features['noise_stft_mag_features'], tf.float32)
    clean_stft_magnitude = tf.io.decode_raw(features['clean_stft_magnitude'], tf.float32)
    noise_stft_phase = tf.io.decode_raw(features['noise_stft_phase'], tf.float32)

    n_features = 129

    # Restore the original shapes: (129, 8, 1) predictor, (129, 1, 1) target, (129,) phase
    noise_stft_mag_features = tf.reshape(noise_stft_mag_features, (n_features, 8, 1), name="noise_stft_mag_features")
    clean_stft_magnitude = tf.reshape(clean_stft_magnitude, (n_features, 1, 1), name="clean_stft_magnitude")
    noise_stft_phase = tf.reshape(noise_stft_phase, (n_features,), name="noise_stft_phase")

    return noise_stft_mag_features, clean_stft_magnitude, noise_stft_phase


# Build the input pipeline from the TFRecord file
train_dataset = tf.data.TFRecordDataset([train_tfrecords_filenames])
train_dataset = train_dataset.map(tf_record_parser)
train_dataset = train_dataset.repeat(1)
train_dataset = train_dataset.batch(1000)
train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

window_length = 256
overlap = 64
sr = 16000

feature_extractor = FeatureExtractor(None, windowLength=window_length, overlap=overlap,
                                     sample_rate=sr)


def revert_features_to_audio(features, phase, cleanMean=None, cleanStd=None):
    # Scale the outputs back to the original range
    if cleanMean and cleanStd:
        features = cleanStd * features + cleanMean

    phase = np.transpose(phase, (1, 0))
    features = np.squeeze(features)
    # (the remainder of this function is clipped at the page break in the original listing)

Appendix F: dataset.py

import librosa
import numpy as np
import math
from feature_extractor import FeatureExtractor
from utils import prepare_input_features
import multiprocessing
import os
from utils import get_tf_feature, read_audio
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

np.random.seed(999)
tf.random.set_seed(999)


class Dataset:
    def __init__(self, clean_filenames, noise_filenames, **config):
        self.clean_filenames = clean_filenames
        self.noise_filenames = noise_filenames
        self.sample_rate = config['fs']
        self.overlap = config['overlap']
        self.window_length = config['windowLength']
        self.audio_max_duration = config['audio_max_duration']

    # some functions were clipped here for space management
    # ... ... ...

    def parallel_audio_processing(self, clean_filename):
        clean_audio, _ = read_audio(clean_filename, self.sample_rate)

        # remove silent frames from the clean audio
        clean_audio = self._remove_silent_frames(clean_audio)

        noise_filename = self._sample_noise_filename()

        # read the noise file
        noise_audio, sr = read_audio(noise_filename, self.sample_rate)

        # remove silent frames from the noise audio
        noise_audio = self._remove_silent_frames(noise_audio)

        # sample a random fixed-sized snippet of audio
        clean_audio = self._audio_random_crop(clean_audio, duration=self.audio_max_duration)

        # add noise to the clean input
        noiseInput = self._add_noise_to_clean_audio(clean_audio, noise_audio)

        # extract STFT features from the noisy audio
        noisy_input_fe = FeatureExtractor(noiseInput, windowLength=self.window_length,
                                          overlap=self.overlap,
                                          sample_rate=self.sample_rate)
        noise_spectrogram = noisy_input_fe.get_stft_spectrogram()

        # Or get the phase angle (in radians)
        # noisy_stft_magnitude, noisy_stft_phase = librosa.magphase(noisy_stft_features)
        noise_phase = np.angle(noise_spectrogram)

        # get the magnitude of the spectrum
        noise_magnitude = np.abs(noise_spectrogram)

        # extract STFT features from the clean audio
        clean_audio_fe = FeatureExtractor(clean_audio, windowLength=self.window_length,
                                          overlap=self.overlap,
                                          sample_rate=self.sample_rate)
        clean_spectrogram = clean_audio_fe.get_stft_spectrogram()
        # clean_spectrogram = cleanAudioFE.get_mel_spectrogram()

        # get the clean phase
        clean_phase = np.angle(clean_spectrogram)

        # get the clean spectral magnitude
        clean_magnitude = np.abs(clean_spectrogram)
        # clean_magnitude = 2 * clean_magnitude / np.sum(scipy.signal.hamming(self.window_length, sym=False))

        clean_magnitude = self._phase_aware_scaling(clean_magnitude, clean_phase, noise_phase)

        scaler = StandardScaler(copy=False, with_mean=True, with_std=True)
        noise_magnitude = scaler.fit_transform(noise_magnitude)
        clean_magnitude = scaler.transform(clean_magnitude)

        return noise_magnitude, clean_magnitude, noise_phase

    def create_tf_record(self, *, prefix, subset_size, parallel=True):
        counter = 0
        p = multiprocessing.Pool(multiprocessing.cpu_count())

        for i in range(0, len(self.clean_filenames), subset_size):

            tfrecord_filename = '/kaggle/working/records/' + prefix + '_' + str(counter) + '.tfrecords'

            if os.path.isfile(tfrecord_filename):
                print(f"Skipping {tfrecord_filename}")
                counter += 1
                continue
            writer = tf.io.TFRecordWriter(tfrecord_filename)

            clean_filenames_sublist = self.clean_filenames[i:i + subset_size]

            print(f"Processing files from: {i} to {i + subset_size}")

            if parallel:
                out = p.map(self.parallel_audio_processing, clean_filenames_sublist)
            else:
                out = [self.parallel_audio_processing(filename) for filename in clean_filenames_sublist]

            for o in out:
                noise_stft_magnitude = o[0]
                clean_stft_magnitude = o[1]
                noise_stft_phase = o[2]

                # segment the noisy magnitude into (129, 8) predictor windows
                noise_stft_mag_features = prepare_input_features(noise_stft_magnitude,
                                                                 numSegments=8, numFeatures=129)

                noise_stft_mag_features = np.transpose(noise_stft_mag_features, (2, 0, 1))
                clean_stft_magnitude = np.transpose(clean_stft_magnitude, (1, 0))
                noise_stft_phase = np.transpose(noise_stft_phase, (1, 0))

                noise_stft_mag_features = np.expand_dims(noise_stft_mag_features, axis=3)
                clean_stft_magnitude = np.expand_dims(clean_stft_magnitude, axis=2)

                # serialize one (predictor, target, phase) triple per frame
                for x_, y_, p_ in zip(noise_stft_mag_features, clean_stft_magnitude, noise_stft_phase):
                    y_ = np.expand_dims(y_, 2)
                    example = get_tf_feature(x_, y_, p_)
                    writer.write(example.SerializeToString())

            counter += 1
            writer.close()

Appendix G: MATLAB implementation of G.729 VAD

audioSource = dsp.AudioFileReader('SamplesPerFrame', 80, ...
    'Filename', 'noisy-input.wav', ...
    'OutputDataType', 'single');
scope = dsp.TimeScope(2, 'SampleRate', [8000/80, 8000], ...
    'BufferLength', 80000, ...
    'YLimits', [-0.3 1.1], ...
    'ShowGrid', true, ...
    'Title', 'Decision speech and speech data', ...
    'TimeSpanOverrunAction', 'Scroll');

% Initialize VAD parameters
VAD_cst_param = vadInitCstParams;
clear vadG729

% Run for 500 frames of 10 ms each (5 seconds)
numTSteps = 500;
while (numTSteps)
    % Retrieve 10 ms of speech data from the audio file
    speech = audioSource();
    % Call the VAD algorithm
    decision = vadG729(speech, VAD_cst_param);
    % Plot speech frame and decision: 1 for speech, 0 for silence
    scope(decision, speech);
    numTSteps = numTSteps - 1;
end
release(scope);

Appendix H:

Clean Speech, Noisy Speech, and Denoised Speech spectrograms
