
A New Biologically Inspired Fuzzy Expert System-Based Voiced/Unvoiced Decision Algorithm for Speech Enhancement

M. A. Ben Messaoud, A. Bouzid & N. Ellouze

Cognitive Computation
ISSN 1866-9956
Cogn Comput
DOI 10.1007/s12559-015-9376-2



Received: 7 March 2015 / Accepted: 28 December 2015
© Springer Science+Business Media New York 2016

Correspondence: M. A. Ben Messaoud, anouar.benmessaoud@yahoo.fr
1 National School of Engineers of Tunis, University of Tunis El Manar, Tunis, Tunisia

Abstract  In this paper, we propose a speech enhancement approach for a single-microphone system. The main idea is to apply a specific transformation on the speech signal depending on the voicing state of the signal. We apply a voiced/unvoiced algorithm based on the multi-scale product analysis with the use of fuzzy logic to make more cognitively inspired use of speech information. A comb filtering is applied on the voiced frames of the noisy speech signal, and a spectral subtraction is operated on the unvoiced frames of the same signal. Further, the harmonics are enhanced by performing a designed comb filtering using an adjustable bandwidth. The comb filter is tuned by an accurate fundamental frequency estimation method, based on computing the multi-scale product analysis of the noisy speech. Experimental results show that the proposed approach is capable of reducing noise in adverse noise environments with little speech degradation and outperforms several competitive methods.

Keywords  Speech enhancement · Multi-scale product · Voiced decision · Fundamental frequency estimation · Comb filter · Fuzzy logic

Introduction

Speech enhancement methods currently available in the literature can be subdivided into multi-channel and mono-channel approaches [1]. The mono-channel case is much more arduous, as the mutual relation between channels cannot be analysed.

The enhancement of a mono-channel speech signal degraded by additive background noise has been largely studied. Many methods in the field of speech enhancement have been proposed to improve the quality and increase the intelligibility of degraded speech signals and to reduce the listener's fatigue.

Speech enhancement methods can be generally divided into several categories: spectral subtractive algorithms, wavelet transforms, statistical-model-based algorithms, subspace methods, methods exploiting the harmonic structure of speech, and cognitive and brain-inspired intelligence.

Spectral subtraction (SS) [2, 3] is one of the first algorithms applied to the problem of speech enhancement. It is simple and provides a trade-off between speech distortion and residual noise to some extent, but it suffers from an artefact known as "musical noise": a perceptually annoying, unnatural structure composed of tones at random frequencies with an increased variance. Many methods have been proposed to eliminate this phenomenon, including perceptually motivated approaches [4, 5].

Wavelet transforms (WT) have been applied to various speech applications [6]. The main challenge in denoising approaches based on thresholding the wavelet coefficients of the noisy speech is the estimation of a threshold value that marks the difference between the wavelet coefficients of the noise and those of the clean speech. Then, given the threshold, the design of a thresholding scheme to minimize the effect of the wavelet
coefficients corresponding to the noise is another difficult task, considering the fact that conventional discrete WT-based denoising approaches exhibit a satisfactory performance only at a relatively high signal-to-noise ratio.

Statistical-model-based algorithms are one of the most commonly used classes of speech denoising methods. In the minimum mean-square error (MMSE) estimator [7, 8], the frequency spectrum of the noisy speech is modified to reduce the noise in the frequency domain. A relatively large variance of the spectral coefficients is the problem of such an estimator. While adapting the filter gains of the MMSE estimator, spectral outliers may emerge, which are especially difficult to avoid under noisy conditions. One of the major problems of Wiener-filter-based methods [9–11] is the requirement of obtaining the clean speech statistics necessary for their implementation. Both the MMSE and the Wiener estimators have a moderate computation load, but they offer no mechanism to control the trade-off between speech distortion and residual noise.

The signal subspace approach for speech enhancement is an interesting generalization of the spectral weighting methods. This technique was originally proposed in [12]. The speech estimation is considered there as a constrained optimization problem, where the speech distortions are minimized subject to the residual noise power level. Two linear estimators have been proposed: time-domain-constrained (TDC) and spectral-domain-constrained (SDC). Unlike DFT-based methods, signal subspace approaches decompose the signal space into a speech subspace and a noise subspace using the Karhunen–Loeve transform (KLT). Then, the spectral weighting is performed in the signal subspace only. The components projected onto the noise subspace are simply nulled, which results in a significantly better performance when compared to the conventional frequency-domain methods, where a full-band spectrum must be processed.

Another popular approach relies on the harmonic structure. It consists in using the harmonic structure of voiced speech for enhancing the speech quality [13–16]. In the work of [13, 14], the signal is modelled as harmonic plus noise components. Then, the voiced signal is enhanced by estimating the harmonic components. In [15], the authors use an extension of the hidden Markov model (HMM)-based MMSE estimator to enhance the harmonics of voiced speech by incorporating a ternary voicing state. In [16], the authors adopted a sinusoidal model in the speech enhancement context.

The perceptual aspects of speech are considerably more complicated and less well understood. However, there are a number of commonly accepted aspects of speech perception which play an important role in speech enhancement systems. Perceptual cues of highly degraded speech can be thought of at two levels, namely the cognitive and acoustic levels [17]. At the cognitive level, the perception of degraded speech is aided by the knowledge of the context of conversation, the syntax and semantics of the context and high-level features like intonation and duration. At the acoustic level, sound perception in degraded conditions happens mostly by extrapolation of information from the high-SNR regions to the low-SNR regions in the temporal domain. Furthermore, it is generally understood that cognitive and brain-inspired intelligence also plays a central role in the perception of speech [18, 19]. So, we may consider another category of speech enhancement methods, which consists in using cognitive and brain-inspired intelligence. Abel et al. [20, 21] propose a multimodal speech enhancement system, based on visual and audio feature extraction, to filter the noisy speech in a cognitively inspired manner by using a neuro-fuzzy system. The visual modality consists in employing a visually derived Wiener filtering based on a lip-tracking technique. Then, the authors use the beamforming approach to exploit the spatial diversity of source and noise. In [22, 23], DNNs (deep neural networks) were used to estimate a smoothed ideal ratio mask in the mel-frequency domain for robust automatic speech recognition. Xu et al. [24] propose to learn the complex mapping function from noisy to clean speech with nonlinear DNN-based regression models using multi-condition training data encompassing different key factors in noisy speech, including speakers, noise types and signal-to-noise ratios (SNRs).

During voiced speech segments, the regular glottal excitation of the vocal tract produces energy at the fundamental frequency F0 and its multiples. An alternative strategy of speech enhancement comprising all those approaches is to process the speech based on the estimation of the fundamental frequency F0. The voiced noisy speech is improved by enhancing the harmonic structures [25]. This harmonicity-based enhancement is performed by extracting the spectral components at integer multiples of the fundamental frequency (F0) with sinusoidal modelling [26] or an adaptive comb filter [27–30]. The F0 is used to pick out the voiced speech from the noise. In this strategy, robust and accurate F0 estimation and voicing detection are essential issues, because any errors in these values cause a severe deterioration in the achieved speech quality. For example, errors in F0 estimation may distort the target speech signal, while inaccuracies in voicing detection may cause the loss of voiced segments.

Most voicing decision algorithms exploit almost any elementary speech signal parameter that may be computed independently of the type of input signal: energy, amplitude, short-term autocorrelation coefficients, zero-crossings count, ratio of signal amplitudes in different sub-bands or, after pre-processing, the linear prediction error or the salience of a pitch estimate. Voicing decision algorithms can be grouped into three essential categories. The first has been designed specifically to solve the voiced/unvoiced
problem, using statistical inference on the training data [31, 32]. Most methods in this category use a statistical parametric method, which exploits only a few basic parameters (a fixed threshold) by examining whether the value of a certain feature exceeds a predetermined threshold. Inappropriate selection of the threshold, regardless of the input signal characteristics, results in performance degradation. The second category that has attempted to solve this problem is fundamental frequency estimators that, as an aside, make voiced/unvoiced decisions based on the degree of periodicity of the signal [33, 34]. In this category, nonparametric methods based on linear discrimination functions, multi-layer feed-forward and recurrent neural networks were adopted. The third category consists of more complex algorithms based on acoustical features, where pattern recognition techniques were used to separate the speech segments into voiced/unvoiced [35, 36].

In this paper, we propose an approach for speech enhancement based on a voiced/unvoiced classification algorithm that can work in low-SNR environments and under varying noise conditions. The decision to classify a segment as voiced or unvoiced is based on the values of the number of groups, the energy and the zero-crossings. Since these values are not precise, particularly in the presence of background noise, it is helpful to use fuzzy logic, in order to improve the performance of our method based on automatic cognitive classification and to increase the robustness of the approach under noisy conditions. So in this paper, we present a method that classifies the speech based on a fuzzy expert system applied to the spectral multi-scale product (SMP) peaks, the short-time energy of the multi-scale product (MP) and the zero-crossings of the MP.

For noisy voiced speech frames, the harmonic speech structure obtained through short-time Fourier analysis is enhanced by applying an adaptive comb filter. The quasi-periodic signal produced by the interaction between the glottal excitation and the vocal tract is not totally stationary. As a consequence, the spectral peaks belonging to higher harmonics are more shifted in frequency than those corresponding to lower harmonics. This inharmonicity causes high-order harmonics to be partially or totally removed when a purely harmonic comb filter is applied. To palliate this problem, we have implemented a comb filter with a rectangular pass-band structure in the spectral domain to enhance the harmonic structure. In unvoiced frames, a spectral subtraction technique based on the geometric approach, using connected time–frequency region noise estimation, is applied. The experimental results indicate the superiority of our approach in terms of speech quality and intelligibility under a wide range of noisy environments, such as white noise and real acoustic interferences.

The remainder of this article is structured as follows. A detailed description of our approach is provided in Sect. 2. In Sect. 3, we describe the evaluation results. Section 4 concludes this work.

Our Proposed Approach

This section describes the principal components of our approach for speech enhancement. It works as follows: first, the signal is divided into 25.6 ms windows with 50 % overlap between frames. This frame length was chosen empirically, as it is long enough for a good spectral estimation of the fundamental frequency value in each frame but not so long as to mask temporal changes in the speech signal. Then, the frames of the original speech signal are classified into voiced/unvoiced frames by a fuzzy logic decision algorithm. Second, we apply a designed comb filter on the voiced frames and a spectral subtraction method, using connected time–frequency region noise estimation, on the unvoiced frames to remove the residual noise. The comb filter is tuned by the estimated fundamental frequency, which is estimated in the voiced frames using the spectral compression multi-scale product method. Finally, we use the overlap-add procedure to reconstruct the enhanced speech signal.
The various steps implied in the proposed approach are illustrated in Fig. 1. Each one will be detailed in the following subsections.

The voiced/unvoiced (V/UV) decision is at the heart of our suggested approach, as the processing for the speech enhancement depends on the voicing state determined by the fuzzy expert system. Accordingly, the voiced/unvoiced (V/UV) decision algorithm will be closely analysed.

Voiced/Unvoiced Classification Algorithm

The voiced/unvoiced algorithm is based essentially on the multi-scale product (MP) characteristics. In fact, the MP attempts to enhance the peaks of the gradients caused by true edges, while suppressing false peaks caused by noise. It also permits to obtain a more simplified signal, having a periodic structure in the voiced frames and a quasi-null signal in the unvoiced frames [37].

Our voiced/unvoiced decision algorithm is given in the block diagram depicted in Fig. 2.

The WT is a multi-scale analysis which has been shown to be very well suited for speech processing tasks such as glottal closure instant (GCI) detection, pitch estimation, speech enhancement and recognition. Using the WT, a speech signal can be analysed at specific scales corresponding to the range of human speech.


Fig. 1 Block diagram of the proposed speech enhancement system: the noisy speech is framed and passed to the V/UV decision algorithm; voiced frames undergo comb filtering and unvoiced frames undergo spectral subtraction before reconstruction of the enhanced speech

Fig. 2 Block diagram of the voiced/unvoiced classification algorithm: each noisy speech frame is analysed by the wavelet transform at scales s1 = 1/2, s2 = 1 and s3 = 2; the multi-scale product (MP) feeds an FFT-based group classification, the short-time energy Eni and the short-time zero-crossings rate Zci, and the rules of the fuzzy expert system (ZCR, En, number of groups) yield the voiced/unvoiced decision

One of the most important WT applications is signal singularity detection. In fact, the continuous WT produces modulus maxima at signal singularities, allowing their localization. These maxima appear at specific singularity points, depending on the wavelet vanishing moments.

The choice of the mother wavelet is crucial to detect discontinuities. To choose an adequate wavelet, we must take into account two considerable characteristics. The first one is the number of vanishing moments, and the second one is related to the wavelet support. A WT with n vanishing moments can be interpreted as a multi-scale differential operator of nth order of a smoothed signal [38]. This provides a relationship between the differentiability of the signal and the decay of the wavelet modulus maxima at fine scales.

However, one-scale analysis does not give a good precision. So, decision algorithms using multiple scales have been proposed in different works to circumvent this problem [38, 39].

The MP was first used in image processing. It is based on the multiplication of wavelet transform coefficients (WTC) at some scales. It attempts to enhance the peaks of the gradients caused by true edges, while suppressing false peaks caused by noise. Xu et al. [40] rely on the variations in scale of the WT. They use the multiplication of the WT of the image at adjacent scales to distinguish important edges from noise. Sadler and Swami [41] have studied the MP method for signals in the presence of noise. In the wavelet domain, it is well known that edge structures are present in each sub-band, while noise decreases rapidly along the scales. It has been observed that multiplying the adjacent scales could sharpen edges while smoothing out noise [42].

The MP consists of making the product of the WTC of the noisy speech frame x(n) at three successive dyadic scales as follows:

p(n) = \prod_{j=-1}^{1} W_{2^j} x(n)   (1)

where W_{2^j} x(n) is the wavelet transform of the noisy speech frame x(n) at scale 2^j.

The wavelet used in this analysis is the quadratic spline function, which has a compact support and is continuously differentiable. It is the first derivative of the cubic spline function; thus, it has only one vanishing moment. The wavelet decomposition is made at scales s1 = 2^{-1}, s2 = 2^{0} and s3 = 2^{1}. The odd number of terms in p(n) preserves the sign of the maxima. Choosing the product of three levels of wavelet decomposition is generally optimal and allows the detection of small peaks.

Figure 3 depicts the clean speech signal (Fig. 3a), the same signal added to a Gaussian noise with an SNR equal to -5 dB (Fig. 3b), followed by the MP of the noisy speech.
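As a rough numerical illustration of Eq. (1), the sketch below approximates the wavelet response at each dyadic scale by a derivative-of-Gaussian filter (a stand-in for the paper's quadratic spline wavelet, which is not assumed here) and multiplies the three scales:

```python
import numpy as np

def multiscale_product(x, scales=(0.5, 1.0, 2.0)):
    """Multi-scale product of dyadic-scale wavelet responses (cf. Eq. 1).

    Each scale response is approximated by convolving the frame with a
    derivative-of-Gaussian kernel; the paper uses the quadratic spline
    wavelet instead, so this is only an illustrative approximation.
    """
    p = np.ones_like(x, dtype=float)
    for s in scales:
        t = np.arange(-4 * s, 4 * s + 1, dtype=float)
        g = np.exp(-0.5 * (t / s) ** 2)
        dg = -t / s**2 * g                           # derivative-of-Gaussian
        w = np.convolve(x, dg / np.abs(dg).sum(), mode="same")
        p *= w                                       # product over the scales
    return p
```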


Fig. 3 a Clean speech, b speech corrupted by -5 dB white noise, c multi-scale product (MP) of the noisy speech

We can clearly see the effect of the multi-scale product in eliminating the noise (Fig. 3c). The MP has the same behaviour regarding noise-like sounds.

In the voiced frames, the MP has a periodic structure with maxima corresponding to the glottal opening instants (GOI) and minima corresponding to the glottal closing instants (GCI), unlike the unvoiced speech case [38], so a voicing detection algorithm can be derived.

Figure 4 depicts a voiced region of a speech signal pronounced by a woman speaker. The speech signal is followed by its electroglottographic (EGG) signal, and at the bottom, we find the multi-scale product (MP) of the voiced speech signal. The EGG signal is the easiest way to measure the GCI and GOI, as it is a direct representation of the glottal activity. On the MP signal, Fig. 4c shows that minima corresponding to GCIs and maxima representing GOIs crop up.

The proposed voiced/unvoiced algorithm uses three criteria to decide the voicing nature: the group classification of the spectral multi-scale product (SMP) peaks, the short-time energy of the MP and the short-time zero-crossings of the MP.

The reason for using three parameters is that each parameter performs well and accurately only when its value is very high or very low. So the use of three parameters will help us in this matter. Also, to complete the decision, it is necessary to use more than one parameter. In this case, each parameter will correct some of the other parameters' mistakes, because each parameter's basis is different from the others.

The MP of a voiced frame gives a low zero-crossings rate, a high energy because of the periodic structure of the signal and a number of SMP groups equal to 1 or 2, whereas the MP of an unvoiced frame gives a high zero-crossings count, a low energy near zero and a number of SMP groups greater than 2.

Spectral MP group

The first parameter for the determination of the V/UV decision is the group classification. We compute the spectrum of the signal multi-scale product (SMP) by using the fast Fourier transform (FFT) algorithm. We determine the number of groups by computing the distances separating successive peak positions of the MP spectrum. Then, we rank these distances in growing order to compose the vector Ei, with i corresponding to the ith frame. If the difference between two successive elements of the vector is less than or equal to a threshold (THDIST), the two elements belong to the same group. If the number of groups is equal to 1 or 2, the frame is voiced; else, the frame is declared unvoiced.
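One possible reading of this grouping rule in code is sketched below; the peak-picking height, the FFT size and the role of th_dist (our stand-in for THDIST) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import find_peaks

def count_smp_groups(mp_frame, th_dist, nfft=8192):
    """Group classification of the spectral MP peaks (illustrative sketch).

    Successive peak spacings whose difference is <= th_dist are merged
    into one group; a frame is declared voiced when 1 or 2 groups remain.
    """
    spectrum = np.abs(np.fft.rfft(mp_frame, nfft))
    peaks, _ = find_peaks(spectrum, height=0.1 * spectrum.max())
    if len(peaks) < 2:
        return 1
    spacings = np.sort(np.diff(peaks))     # distances between successive peaks
    groups = 1
    for a, b in zip(spacings[:-1], spacings[1:]):
        if b - a > th_dist:                # jump in spacing -> new group
            groups += 1
    return groups

# voiced if count_smp_groups(p, th_dist) <= 2
```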


Fig. 4 Speech signal pronounced by a woman speaker: a speech signal, b EGG signal and c the multi-scale product

Short-time energy of MP

The second parameter for the determination of the V/UV decision is the energy. The energy must exceed the threshold THE to decide that the frame is voiced. For the energy parameter, the voiced frame has a high energy because of its periodicity.

Short-time zero-crossings of MP

The third parameter for the determination of the V/UV decision is the zero-crossings rate (ZCR) of the MP. It can be defined as:

ZCR(n) = \frac{1}{2N} \sum_{i=1}^{N} \left| sgn[p(i)] - sgn[p(i-1)] \right|, with sgn[p(i)] = 1 if p(i) \ge 0 and 0 if p(i) < 0   (2)

where p(n) is the multi-scale product of the speech frame and sgn[·] is the sign function. The ZCR must be lower than the threshold THZCR to decide that the frame is voiced.
The choice of the three thresholds is experimentally determined using the NOIZEUS database. The voiced/unvoiced decision is related to the optimal choice of the thresholds of the three parameters. The optimal thresholds are obtained after the computation of the requested feature for each frame; the median value is taken as the threshold. This procedure is used to extract THDIST, THE and THZCR, which are the median values of the three parameters over the frames of the speech database.

The energy of the multi-scale product (MP) for unvoiced speech is significantly smaller than for voiced speech; hence, the short-time energy can be used to distinguish voiced and unvoiced speech. The short-time energy of the MP can also be used to distinguish speech from silence. But the use of the short-time energy of the MP alone is not sufficient; hence, it is coupled with the zero-crossings of the MP and the spectrum of the MP in the classification of speech. Hence, we state that voiced speech should be characterized by a relatively high energy, a relatively low zero-crossings rate and a number of groups lower than or equal to 2, while unvoiced speech will have a relatively high zero-crossings rate, a relatively low energy and a number of groups higher than 2. Also, we have not said what we mean by high and low values of the short-time energy and the short-time zero-crossings rate, and it is really not possible to be precise. Hence, we see that this is a problem fit to be solved by fuzzy logic. In this paper, we propose a method to classify the speech into voiced and unvoiced frames using the short-time energy, the zero-crossings and the number of groups in a fuzzy logic system.

Fuzzy logic (FL) is an approach to computing based on "degrees of truth" rather than the usual Boolean logic. FL includes 0 and 1 as extreme cases of truth but also includes the various states of truth in between. FL seems closer to the way our brains work. We aggregate data and form a number of partial truths which we aggregate further into
higher truths which, in turn, when certain thresholds are exceeded, cause certain further results such as a motor reaction. A similar kind of process is used in artificial neural networks and expert systems. Fuzzy sets are a super-set of classical sets. Each element in a fuzzy set is associated with a real number which represents the degree of membership of the element in the set. FL allows highly nonlinear, poorly understood or mathematically complex systems to be modelled reliably and efficiently, and it deals well with noisy data. These characteristics show that FL might be an effective tool for speech enhancement.

Most applications of fuzzy logic use it as the underlying logic system for fuzzy expert systems (FES). A FES is an expert system that uses a collection of fuzzy membership functions and rules, instead of Boolean logic, to reason about data. The rules in a fuzzy expert system are usually of a form similar to the following:

if x is low and y is high, then z = medium

where x and y are input variables, z is an output variable, low is a membership function (fuzzy subset) defined on x, high is a membership function defined on y, and medium is a membership function defined on z. The part of the rule between the "if" and the "then" is the rule's premise. This is a fuzzy logic expression that describes to what degree the rule is applicable. The part of the rule following the "then" is the rule's consequent. This part of the rule assigns a membership function to each of one or more output variables.

Based on our work [43], we use an expert system generator called CLIPS. It is a productive development and delivery expert system tool which provides a complete environment for the construction of rule-based expert systems.

We apply the following rules:

R1: If a frame has a low zero-crossings rate, then it is voiced.
R2: If a frame has a high zero-crossings rate, then it is unvoiced.
R3: If a frame has a high energy, then it is voiced.
R4: If a frame has a low energy, then it is unvoiced.
R5: If a frame has a number of groups less than three, then it is voiced.
R6: If a frame has a number of groups more than two, then it is unvoiced.

We incorporate fuzzy logic with the rules to improve the decision-making capability of the expert system.

The priorities of the rules are given by the fuzzy CLIPS expert system. First, for each rule, the system calculates the error probability, and the one with the least error probability is chosen as the first priority. Then, to choose the second priority, the system calculates the conditional error probability for the rest of the rules.
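The paper delegates fuzzification and rule priorities to fuzzy CLIPS; as a toy stand-in, the sketch below combines rules R1, R3 and R5 with a linear ramp membership (an assumed shape, since the paper does not publish its membership functions), the complementary rules R2, R4 and R6 being the one-minus counterparts.

```python
import numpy as np

def ramp_low(x, th):
    """Membership of 'low': 1 at x = 0, 0.5 at the threshold th, 0 at 2*th.
    The shape is our assumption; fuzzy CLIPS defines the real memberships."""
    return float(np.clip((2 * th - x) / (2 * th), 0.0, 1.0))

def fuzzy_vuv(zcr, energy, n_groups, th_zcr, th_e):
    """Toy aggregation of rules R1-R6: average the 'voiced' memberships
    and cut at 0.5 (the complements R2, R4, R6 vote 'unvoiced')."""
    low_zcr = ramp_low(zcr, th_zcr)             # R1: low ZCR -> voiced
    high_e = 1.0 - ramp_low(energy, th_e)       # R3: high energy -> voiced
    few_groups = 1.0 if n_groups <= 2 else 0.0  # R5: <= 2 groups -> voiced
    return (low_zcr + high_e + few_groups) / 3.0 >= 0.5
```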
Enhancement of Voiced Frames

To enhance the speech, a designed comb filter is applied on the voiced frames. The idea consists of taking the voiced speech frame and eliminating any frequency components that do not coincide with the fundamental frequency and its harmonics.

The different steps are illustrated in Fig. 5. They consist of multiplying the frequency response of our designed filter by the magnitude spectrum of the voiced speech frame. Then, the inverse fast Fourier transform (IFFT) is used to obtain the final enhanced voiced frame. The phase must stay the same after comb filtering.

Fig. 5 Block diagram of the noisy voiced frame enhancement: the noisy voiced frame passes through the FFT, the harmonic comb filter tuned by F0 and the IFFT to give the enhanced voiced frame

Comb Filtering Description

A comb filter is a filter with multiple pass bands and stop bands. For transmitting only the harmonic components of the speech signal, the pass bands must be centred at multiples of the speech fundamental frequency, i.e. the frequency response of the comb filter has to be a periodic function with period equal to the fundamental frequency. Because voiced speech signals have a time-varying fundamental frequency, the comb filter for the enhancement of voiced speech has to be an adaptive filter tuned by the instantaneous fundamental frequency of the speech. It means that the comb filter varies from frame to frame.

We use a rectangular comb filter whose pass bands have a fixed width of 10 Hz and are tuned by the fundamental frequency F0. This comb filter is a feedforward filter with a number of delays equal to fs/F0 samples, where fs is the sampling frequency and F0 is the fundamental frequency. The harmonic transfer function of the designed comb filter is presented in Fig. 6. The spacing between the pass bands is defined by the fundamental frequency F0. The spectral maxima Spmax and the spectral minima Spmin are given by the required attenuation for the SNR. The frequency response of our designed comb filter is as follows:

H(f) = Sp_{max} for iF_0 - B_d/2 \le f \le iF_0 + B_d/2
H(f) = Sp_{min} for (i-1)F_0 + B_d/2 < f < iF_0 - B_d/2   (3)

where B_d < F_0/2 is the bandwidth of the comb filter and i = 1, 2, ..., fix(f_s/2F_0). The frequency range 0 \le f \le f_s/2 is
sampled with the frequency step f_s/N_f, where N_f is the dimension of the applied FFT.

Fig. 6 Harmonic transfer function of a designed comb filter, where Bd = 10 Hz and F0 = 88 Hz

Figure 6 illustrates the transfer function of our designed comb filter with adjustable bandwidth.

When a frame is considered as voiced, the signal is filtered by our comb filter tuned to the estimated fundamental frequency F0 and its integer multiples. The estimation of the fundamental frequency F0 is detailed in the subsection on fundamental frequency estimation based on the spectrum compression of the multi-scale product (SCMP). The use of this designed comb filter involves some implementation issues, manifested by the problem of the number of harmonics. The number of harmonics of the voiced speech is difficult to estimate. In general, this number depends on the speaker and the fundamental frequency estimation, but it is usually assumed fixed in advance, as in our approach. Therefore, if the number of harmonics is different from the chosen number of bands, either noise components will be captured or high-frequency speech harmonics will be eliminated.

Our method for estimating the fundamental frequency will be described in the following subsection.
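A sketch of the rectangular comb response of Eq. (3) on an rFFT grid follows; the pass/stop gains sp_max and sp_min are illustrative constants here, whereas the paper derives them from the required attenuation for the SNR.

```python
import numpy as np

def comb_filter_response(f0, fs=8000, nfft=8192, bd=10.0,
                         sp_max=1.0, sp_min=0.1):
    """Rectangular comb filter of Eq. (3) sampled on the rFFT grid.

    bd is the 10 Hz pass-band width; sp_max/sp_min are assumed gains.
    """
    f = np.fft.rfftfreq(nfft, d=1.0 / fs)        # grid over 0 <= f <= fs/2
    h = np.full_like(f, sp_min)
    for i in range(1, int(fs / (2 * f0)) + 1):   # i = 1 .. fix(fs / 2 F0)
        h[np.abs(f - i * f0) <= bd / 2.0] = sp_max  # pass band around i*F0
    return h

def enhance_voiced_frame(frame, f0, fs=8000, nfft=8192):
    """Multiply the spectrum by the comb response and invert, keeping the
    noisy phase, as described for Fig. 5."""
    spec = np.fft.rfft(frame, nfft)
    spec *= comb_filter_response(f0, fs, nfft)
    return np.fft.irfft(spec, nfft)[:len(frame)]
```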
Fundamental Frequency Estimation Based on SCMP

The comb filtering step requires an accurate estimation of the fundamental frequency. In this work, we use the spectrum compression (SC) of the multi-scale product (MP) to estimate F0, inspired by our work in [37]. It can be decomposed into three steps, as shown in Fig. 7. The first step consists of computing the MP of the voiced sound at successive scales. The second step consists of calculating the spectrum function (SF) of the obtained signal. These two steps were described in our work reported in [44]. In the third step, the spectrum of the MP is compressed by integer factors (R = 2, 3, 4) and the obtained functions are multiplied.

Fig. 7 Block diagram of the proposed method for fundamental frequency estimation: noisy speech frame → multi-scale product (MP) → FFT → product of the compressed spectra (SCMP) → pitch estimation

The first peak in the spectrum of the multi-scale product (SMP) coincides with the second peak in the SMP compressed by a factor of two, which coincides with the third peak in the SMP compressed by a factor of three. So, when
the various spectra are multiplied together, the harmonics line up and reinforce the fundamental frequency F0. The harmonic product measures the maximum coincidence of harmonics for each fast Fourier transform of the multi-scale product.

Figure 8 illustrates the SCMP method. The signals of Fig. 8a–c represent, respectively, the SMP compressed by a factor R = 2, 3 and 4 of the first voiced noisy speech region from Fig. 3b. The signal illustrated in Fig. 8d shows the multiplication of the functions issued from the compression of the SMP, with one peak giving the fundamental frequency estimation.

Fig. 8 SCMP of a voiced noisy speech. a Spectrum compression of the MP with R = 2. b Spectrum compression of the MP with R = 3. c Spectrum compression of the MP with R = 4. d Spectrum functions multiplication
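The compression-and-multiply step can be sketched as follows (the F0 search bounds fmin/fmax are our assumptions; the compression is implemented by decimating the magnitude spectrum, in the spirit of a harmonic product spectrum):

```python
import numpy as np

def scmp_pitch(mp_frame, fs=8000, nfft=8192, factors=(2, 3, 4),
               fmin=60.0, fmax=400.0):
    """F0 by spectral compression of the MP (illustrative sketch).

    The magnitude spectrum is decimated by each integer factor and the
    decimated copies are multiplied, so the harmonics line up at F0.
    """
    spec = np.abs(np.fft.rfft(mp_frame, nfft))
    prod = spec.copy()
    for r in factors:
        compressed = spec[::r]                 # compress the spectrum by r
        prod[:len(compressed)] *= compressed   # harmonics reinforce F0
    lo = int(fmin * nfft / fs)
    hi = int(fmax * nfft / fs)
    k = lo + np.argmax(prod[lo:hi])            # strongest bin gives F0
    return k * fs / nfft
```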
Enhancement of Unvoiced Frames by Spectral Subtraction

For unvoiced frames, we employ an improved method based on the psycho-acoustical model of spectral density subtraction, called the "geometric approach for spectral subtraction" (GA), proposed by [4], which is robust under various types of interferences. The unvoiced noisy speech spectrum UX(jωn) at frequency ωn is calculated as the summation of two complex-valued spectra. The spectral subtraction (SS) with GA is described in Fig. 9.

Fig. 9 Block diagram of the noisy unvoiced frame enhancement: the noisy unvoiced frame passes through the FFT and the smoothing estimation; the connected time–frequency speech region estimation gives the noise estimate, and the gain function, the enhanced spectrum and the IFFT give the enhanced unvoiced frame

Consider us(k) the unvoiced clean speech, n(k) the noise signal and ux(k) the noisy unvoiced speech frame:

ux(k) = us(k) + n(k)   (4)

First, we apply the FFT to compute the periodogram of the noisy speech, that is, PUX(j,k) = |UX(j,k)|^2. After computing the periodograms, they are smoothed. The smoothed periodograms are temporally minimum tracked and are used for the purpose of speech presence detection. This detection is utilized to attain low-biased noise power spectral density (PSD) estimates PN0(j,k) and noise periodogram estimates PN(j,k), which are equal to PUX(j,k) in the speech absence condition. But if speech is present, the noise periodogram estimate is equal to the noise PSD estimate. In the latter case, a recursive smoothed bias compensation parameter is applied to the minimum tracked values. The bias compensation factor is updated during the absence of speech in frames, while it remains unchanged during speech
presence. The noise magnitude periodogram estimate |N(j,k)| is computed from the noise PSD estimate, and on the basis of this information, the decision of speech presence is made and used in the speech enhancement method. The noisy speech periodograms PUX(j,k) are spectrally smoothed. The PUX(j,k) bands are composed of a weighted sum of 2X + 1 bands.

The spectrally smoothed periodograms are temporally smoothed recursively with a time–frequency varying smoothing factor to create the temporally and spectrally smoothed periodogram P(j,k).

The temporal minimum values Pmin(j,k) are computed from P(j,k) by tracking within a minimum search window.

Speech presence results in an increase of power in the temporally smoothed spectrum because of the additive noise at particular time–frequency regions. As a result, the ratio of the temporally smoothed spectrum to the noise PSD estimate becomes more robust for estimating the SNR and the noise-to-noise ratio at specific time–frequency regions.

The smoothing phenomenon ensures the speech presence detection even in conditions where the noisy speech power is unstable. As a result, connected speech presence and absence regions can be obtained.

Here we have computed two different noise estimates: one is the noise PSD estimate and the second is the noise spectrum estimate. The PSD estimate is used in the speech enhancement algorithm, while the noise spectrum estimate shows the properties of the residual noise from the speech enhancement algorithm. The speech enhancement algorithm for this noise estimation is spectral subtraction with the geometric approach. The block diagram for the connected time–frequency region noise estimation algorithm is illustrated in Fig. 10.

Fig. 10 Block diagram of the time–frequency connected speech regions estimation: the spectrum of the noisy unvoiced frame PUX(j,k) undergoes spectral smoothing and temporal smoothing to give P(j,k); maximum tracking and speech presence detection yield the noise estimate N(j,k), which feeds the speech enhancement

Then, we compute the gain function by transforming the equation above to polar form:

a_{UX} e^{j\varphi_{UX}} = a_{US} e^{j\varphi_{US}} + a_N e^{j\varphi_N}   (5)

where a_{UX}, \varphi_{UX}, a_{US}, \varphi_{US} and a_N, \varphi_N are, respectively, the magnitudes and phase angles of the unvoiced noisy speech, the unvoiced clean speech and the noise.

Unlike the traditional SS, the GA considers the phase during the computations. We apply the sine function:

G = a_{UX} \sin(\varphi_N - \varphi_{UX}) = a_{US} \sin(\varphi_N - \varphi_{US})   (6)

G^2 = a_{UX}^2 \sin^2(\varphi_N - \varphi_{UX}) = a_{US}^2 \sin^2(\varphi_N - \varphi_{US})   (7)

G^2 = a_{UX}^2 [1 - \cos^2(\varphi_N - \varphi_{UX})]   (8)

G^2 = a_{US}^2 [1 - \cos^2(\varphi_N - \varphi_{US})]   (9)

We obtain a new version of the gain function G:

G^2 = a_{US}^2 / a_{UX}^2   (10)

G = \sqrt{a_{US}^2 / a_{UX}^2} = \sqrt{\frac{1 - \cos^2(\varphi_N - \varphi_{UX})}{1 - \cos^2(\varphi_N - \varphi_{US})}}   (11)

Finally, the estimated signal spectrum is passed through the IFFT, which utilizes the phase of the noisy signal, and we obtain the enhanced speech.

The GA based on the power SS technique has the advantage of minimizing the musical noise and incorporating the cross terms implying the phase differences between the noisy speech and the noise.
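Equation (11) ties the gain to the phase-difference cosines; the sketch below keeps only the magnitude form of Eq. (10), approximating the clean magnitude by power subtraction (a deliberate simplification of the full geometric approach [4], which also exploits the phase terms):

```python
import numpy as np

def ga_style_gain(a_ux, a_n):
    """Simplified magnitude gain in the spirit of Eqs. (10)-(11).

    The clean-speech magnitude a_us is approximated here by power
    subtraction; the full GA [4] expresses G through the cosines of
    Eq. (11) instead.
    """
    a_us = np.sqrt(np.maximum(a_ux**2 - a_n**2, 0.0))  # power subtraction
    return a_us / np.maximum(a_ux, 1e-12)              # Eq. (10): G = a_us/a_ux

def enhance_unvoiced_frame(frame, noise_mag, nfft=8192):
    """Apply the gain to the noisy spectrum and keep the noisy phase,
    as in the final IFFT step of this section."""
    spec = np.fft.rfft(frame, nfft)
    g = ga_style_gain(np.abs(spec), noise_mag)
    return np.fft.irfft(g * spec, nfft)[:len(frame)]
```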
Experimental Results

In this section, the proposed approach for speech enhancement is evaluated using objective and subjective tests.


Simulation Conditions

To evaluate and compare the performance of our proposed approach, we carry out simulations with the real noisy speech sentences of the NOIZEUS database [45], which is composed of phonetically balanced utterances. The noisy speech database contains 30 sentences belonging to three male and three female speakers, sampled at 8 kHz and corrupted by four background noises (airport, car, train and babble) at four SNR levels (-5, 0, 5 and 10 dB). To simulate the receiving frequency characteristics of telephone handsets, the speech and noise signals were filtered by the modified intermediate reference system (IRS) filters used in [46] for the evaluation of the PESQ measure.

For the experiments, the length of the signal frames is set to 25.6 ms. The hop size between frames is set to 32 ms. In the blocks devoted to enhancing the voiced and unvoiced frames, windowing with a Hanning window is applied, and the order of the FFT is set to 8192 frequency bins. This FFT size provides a high enough resolution for the proper enhancement of low frequencies and was chosen empirically as a trade-off between achieved quality and complexity.

Evaluations of Fundamental Frequency Estimation and Voiced/Unvoiced Classification Algorithms

Here, we have only analysed the influence of the voicing detector on the enhancement process. In order to illustrate the impact of the F0 determination in the system, we measure the gross pitch error (GPE) of the SCMP algorithm and we compare it to the spectral multi-scale product method (SMP) [39] and the sawtooth waveform inspired pitch estimator (SWIPE) [47]. The SMP and SWIPE methods were configured with the same settings (frame size and hop size) as the proposed algorithm.

The GPE is defined as the percentage of the correctly classified voiced frames which have an incorrect fundamental frequency.

The V/UV error is defined as the percentage of voiced frames that are classified as unvoiced, and the UV/V error is defined as the percentage of unvoiced frames that are classified as voiced.

The reference F0 contours for the thirty voiced utterances of the NOIZEUS speech database and the reference voiced/unvoiced classification were built using our autocorrelation of the multi-scale product algorithm [48] on the clean speech and were manually corrected.

Table 1 presents the GPE of the SCMP, SMP and SWIPE methods in a noisy environment.

Table 1 Fundamental frequency estimation performance, GPE (%), in a noisy environment

Type of noise   SNR level (dB)   SCMP   SMP [39]   SWIPE [47]
Airport         10               2.28   2.45       2.56
                5                2.73   3.04       3.43
                0                2.86   3.17       3.45
                -5               3.05   3.34       3.78
Car             10               3.56   5.83       3.06
                5                4.30   7.11       3.94
                0                5.06   8.41       4.63
                -5               7.34   9.20       6.03
Train           10               1.09   1.43       1.17
                5                1.25   1.63       1.38
                0                1.57   1.98       1.79
                -5               2.19   3.17       2.88
Babble          10               0.86   1.37       1.21
                5                1.30   2.85       2.40
                0                1.81   3.27       2.73
                -5               2.93   4.58       3.46

Bold values indicate the highest improvements

As depicted in Table 1, when the SNR level decreases, the SCMP algorithm remains robust, even at -5 dB. As seen, the most problematic noise is the car interference, with a GPE of 7.34 % at -5 dB. In this case, the SWIPE method gives the best result because it considers only the highly voiced frames.

To evaluate our voicing decision algorithm, we determine the UV/V error, defined as the percentage of unvoiced frames considered as voiced, and the V/UV error, corresponding to the percentage of voiced frames considered as unvoiced.

Table 2 presents the performances of our voiced/unvoiced classification algorithm with the fuzzy expert system (proposed algo_MP + FES), without the FES (proposed algo_MP) and of the SWIPE method in a noisy environment.

Table 2 Performances of the voiced/unvoiced classification

                              V/UV (%)                                          UV/V (%)
Type of   SNR level   algo_MP + FES   algo_MP   SWIPE [47]   algo_MP + FES   algo_MP   SWIPE [47]
noise     (dB)
Airport   10          3.47            4.02      4.73         1.52            2.33      1.09
          5           4.09            4.56      5.25         1.44            1.89      0.71
          0           4.63            5.73      6.14         1.59            3.46      0.49
          -5          4.95            6.10      8.52         1.78            1.67      0.15
Car       10          5.21            7.35      8.37         1.70            3.51      1.03
          5           7.38            9.81      11.56        1.64            4.28      0.48
          0           9.86            11.57     13.89        1.03            5.43      0.24
          -5          10.70           13.93     15.56        1.28            7.98      0.09
Train     10          4.36            5.10      5.01         5.26            5.73      1.95
          5           3.28            5.42      6.29         6.91            7.27      1.55
          0           3.52            7.53      8.15         4.70            4.85      0.74
          -5          2.77            7.11      9.18         3.25            3.73      0.53
Babble    10          3.14            4.39      5.30         4.53            5.82      2.17
          5           2.08            7.80      8.52         4.39            7.83      1.68
          0           4.36            9.31      10.47        3.07            3.25      0.82
          -5          7.95            9.84      11.90        2.18            4.08      0.44

Bold values indicate the highest improvements

As depicted in Table 2, the proposed voiced/unvoiced decision algorithm that uses the fuzzy expert system with the MP appears to be the most robust algorithm with respect to V/UV errors. In the case of the UV/V error, the SWIPE method shows the lowest values. This can be explained by the fact that this algorithm does not consider weak voicing states, such as the beginning and the end of any voiced sound. However, our proposed algorithm has the highest performance in all the other cases, owing to the fuzzy logic expert system.

In Table 2, it can be noticed that the UV/V error decreases when the V/UV error increases. For example, under the car noise, when the UV/V error of our voiced/unvoiced classification algorithm (respectively, of SWIPE) decreases from 1.70 to 1.28 % (respectively, from 1.03 to 0.09 %), the V/UV error increases from 5.21 to 10.70 % (respectively, from 8.37 to 15.56 %). It is obvious that our proposed algorithm
based on fuzzy logic is more accurate than the other algorithms in the voiced classification.

Objective Evaluation Measures

To test the performance of the proposed speech enhancement system, the objective quality measurement tests signal-to-noise ratio (SNR), segmental signal-to-noise ratio (segSNR) and Itakura–Saito distance were used.

The following equation was computed for the evaluation of the SNR of the enhanced speech signals:

SNR_{output} = 10 \log_{10}(\sigma_s^2 / \sigma_n^2)   (12)

where \sigma_s^2 and \sigma_n^2 are the variances of the clean speech and of the noise, respectively. We assume that, due to the independence of speech and noise, the variance of the noisy speech signal is equal to the sum of the speech variance and the noise variance.

The frame-based segmental SNR is a reasonable measure of speech quality. It is formed by averaging frame-level estimates as follows:

segSNR = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=Nm}^{Nm+N-1} x^2(n)}{\sum_{n=Nm}^{Nm+N-1} [\tilde{x}(n) - x(n)]^2}   (13)

where x(n) denotes the original speech signal, \tilde{x}(n) the enhanced speech signal, M the number of frames and N the number of samples in each short-time frame.
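Equations (12) and (13) translate directly into code; the default frame length n below is the 25.6 ms frame at 8 kHz, an assumption carried over from the simulation conditions.

```python
import numpy as np

def snr_output(clean, noise):
    """Eq. (12): 10 log10 of the clean-speech to noise variance ratio."""
    return 10 * np.log10(np.var(clean) / np.var(noise))

def seg_snr(x, x_hat, n=204):
    """Eq. (13): frame-averaged segmental SNR with frame length n."""
    vals = []
    for start in range(0, len(x) - n + 1, n):
        num = np.sum(x[start:start + n] ** 2)
        den = np.sum((x_hat[start:start + n] - x[start:start + n]) ** 2)
        vals.append(10 * np.log10(num / max(den, 1e-12)))
    return float(np.mean(vals))
```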
The Itakura–Saito distance measure, based on the dissimilarity between the original and the enhanced speech, is computed between sets of linear prediction coefficients (LPC) estimated over synchronous frames. This measure is heavily influenced by spectral dissimilarity due to mismatches in formant locations, with little contribution from errors in matching spectral valleys. Such behaviour is desirable, since the auditory system is more sensitive to errors in formant location and bandwidth than to the spectral valleys between peaks.

In this paper, the average Itakura–Saito measure (as defined by the following formula) across all speech frames of the given sentence was computed to evaluate the speech enhancement algorithm:

ISd(a, b) = \frac{(a - b)^T R (a - b)}{a^T R a}   (14)

where a is the LPC vector of the original speech signal x(n), b is the LPC vector of the enhanced speech signal \tilde{x}(n), and R is the (Toeplitz) autocorrelation matrix.

Our approach is compared to two recent speech enhancement methods: the deep neural network (DNN) [24] and the a priori SNR estimator with Ephraim's MMSE-logSTSA [6].

For noisy speech signals at SNRs of -5, 0, 5 and 10 dB, the SNR_{output}, segSNR and Itakura–Saito distance (ISd) results of the enhanced speech signals are presented in Table 3. The objective quality measurement test results prove the superiority of the proposed speech enhancement system over the other popular methods, the DNN and MMSE-logSTSA estimators.
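Equation (14) can be sketched by building the Toeplitz matrix R from the autocorrelation sequence of the original frame; the LPC extraction itself is assumed to be done elsewhere.

```python
import numpy as np
from scipy.linalg import toeplitz

def isd(a, b, r):
    """Itakura-Saito-style distance of Eq. (14).

    a, b: LPC vectors of the original and enhanced frames;
    r: autocorrelation sequence of the original frame (len(r) >= len(a)),
    expanded into the Toeplitz matrix R.
    """
    R = toeplitz(r[:len(a)])
    d = np.asarray(a) - np.asarray(b)
    return float(d @ R @ d) / float(np.asarray(a) @ R @ np.asarray(a))
```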


Table 3 Performance comparison of different methods in terms of SNRoutput, segSNR and ISd

                            SNRoutput                              SegSNR                                ISd
Type of   SNR level   Prop-App   DNN     MMSE-          Prop-App   DNN     MMSE-          Prop-App   DNN     MMSE-
noise     (dB)                   [24]    logSTSA [6]               [24]    logSTSA [6]               [24]    logSTSA [6]
Airport   10          15.18      14.93   13.38          4.59       4.45    1.68           7.02       7.36    5.14
          5           12.33      11.28   11.65          3.12       3.09    0.24           3.90       3.94    3.71
          0           8.45       7.45    8.07           3.04       2.38    -1.45          2.59       2.47    2.50
          -5          5.60       4.52    4.98           2.36       2.27    -2.41          1.10       0.86    0.48
Car       10          14.83      14.46   13.76          6.78       6.67    2.93           5.83       5.59    4.93
          5           11.67      11.16   11.84          5.01       4.55    1.87           4.78       4.56    3.64
          0           8.34       6.44    8.27           4.85       4.09    1.87           4.65       4.10    2.29
          -5          5.23       4.07    5.09           2.14       2.33    -2.10          2.35       2.09    0.34
Train     10          13.52      13.81   13.29          7.94       8.18    4.74           5.09       5.35    4.48
          5           11.66      12.07   11.54          5.08       5.57    3.43           3.82       4.10    3.23
          0           7.75       7.14    7.02           3.71       3.69    2.93           3.54       3.23    2.68
          -5          3.42       2.93    2.86           2.32       1.45    0.38           1.89       1.65    1.00
Babble    10          15.59      12.77   14.93          7.28       6.09    3.43           6.02       5.80    5.05
          5           12.81      10.24   12.08          5.15       4.85    2.04           4.00       3.84    3.51
          0           8.68       6.62    8.45           3.86       3.07    1.09           2.73       2.50    2.25
          -5          6.22       4.02    5.83           1.33       0.76    -1.38          1.39       1.17    0.84

Bold values indicate the highest improvements

Table 4 Performance comparison of different methods in terms of PESQ, WSS and LLR measures

                            PESQ                                   WSS                                   LLR
Type of   SNR level   Prop-App   DNN     MMSE-          Prop-App   DNN     MMSE-          Prop-App   DNN     MMSE-
noise     (dB)                   [24]    logSTSA [6]               [24]    logSTSA [6]               [24]    logSTSA [6]
Airport   10          2.69       2.60    2.54           52.45      53.50   56.18          0.59       0.83    1.18
          5           2.39       2.34    2.33           63.79      67.33   71.02          1.13       1.25    1.36
          0           1.93       1.85    1.88           74.36      72.91   82.87          1.27       1.39    1.48
          -5          1.56       1.42    1.45           83.22      80.58   89.53          1.59       1.73    1.91
Car       10          2.51       2.46    2.53           45.18      47.61   51.35          0.56       0.53    0.61
          5           2.28       2.23    2.25           57.45      59.83   60.79          0.71       0.75    0.78
          0           1.98       1.85    1.97           65.28      70.75   73.63          0.88       0.93    1.01
          -5          1.76       1.61    1.71           79.52      82.61   87.45          1.12       1.21    1.23
Train     10          2.67       2.43    2.25           49.37      51.53   54.35          0.55       0.59    0.61
          5           2.48       2.39    1.94           67.28      69.40   71.22          0.67       0.72    0.79
          0           2.23       2.10    1.81           79.16      82.25   85.25          0.81       0.89    1.01
          -5          1.97       1.68    1.45           91.36      92.78   97.44          1.02       1.13    1.19
Babble    10          2.64       2.54    2.47           48.43      51.34   57.18          0.98       1.02    1.10
          5           2.31       2.27    2.09           66.45      69.57   72.55          1.16       1.29    1.31
          0           2.11       1.85    1.78           81.07      83.26   88.34          1.31       1.52    1.59
          -5          1.87       1.68    1.32           97.58      99.08   103.56         1.77       1.77    1.82

Bold values indicate the highest improvements

Subjective Evaluation Measures

To measure the enhancement quality of the noisy speech, we calculate three parameters:

• The perceptual evaluation of speech quality (PESQ) score is a mean opinion score, showing high correlations with subjective listening tests. It ranges from 1 to 5. A higher PESQ score indicates a
higher perceptual quality and weaker speech distortions.
• The weighted spectral slope (WSS) is found by calculating the weighted difference between the adjacent spectral magnitudes in each frequency band. The spectral slope is obtained as the difference between adjacent spectral magnitudes.
• The log-likelihood ratio (LLR) measure is based on the difference between the all-pole models of the enhanced and clean speech. It is mainly related to the similarity of the spectral envelope.

The selection of these measures is based on their correlation with the subjective quality assessments. The definition of all these measures can be found in [49].

The results of the PESQ scores, WSS and LLR values are detailed in Table 4.

As can be seen in Table 4, the proposed approach is characterized by the highest PESQ scores, showing that our approach has a better perceived quality in most cases. However, the MMSE-logSTSA method performs best in the case of stationary noise. This can be explained by the fact that the MMSE estimator can provide an accurate estimate of the noise.

Our approach presents the lowest WSS scores, confirming its performance.

Table 4 shows that our approach provides the best LLR values. This parameter confirms that the results of the two methods DNN and MMSE-logSTSA are quite similar at low input SNR values.

We can conclude again that our approach overcomes the disadvantages of the DNN, the MMSE or the spectral subtraction methods used alone. The three-parameter performance evaluation shows that our approach is globally the most suitable for noise elimination.

Informal listening tests were also realized, where a group of ten listeners (seven women and three men) were asked to perceptually evaluate six enhanced speech signals from the NOIZEUS database with three background noises (white, car and babble) at three SNR levels (0, 5 and 10 dB). The listeners used the mean opinion score (MOS), from 1 to 5, to evaluate the difference between the residual noise characteristics of the enhanced speech.

Table 5 presents the statistical results of the subjective evaluation for the compared speech enhancement methods.

Table 5 Subjective quality test (MOS) results

Type of noise   SNR level (dB)   Prop-App   DNN [24]   MMSE-logSTSA [6]
Airport         10               3.05       2.98       2.78
                5                2.39       2.27       2.11
                0                2.06       1.96       1.59
Car             10               3.44       3.35       3.06
                5                3.32       3.10       2.81
                0                2.56       2.43       2.17
Train           10               2.93       2.81       2.72
                5                2.56       2.39       2.21
                0                2.18       1.87       1.67
Babble          10               3.85       3.78       3.31
                5                3.68       3.49       3.04
                0                3.03       2.90       2.13

Bold values indicate the highest improvements

Conclusions

This paper concerns an approach for speech enhancement that relies on the harmonic structure of the underlying speech. The main idea is to apply a convenient transformation on voiced sounds and another specific processing on unvoiced sounds to eliminate the noise. Our V/UV decision algorithm is based essentially on the multi-scale product (MP) characteristics. Then, we apply a fuzzy expert system to determine the most suitable decision for each frame of speech. The voiced frames are filtered by a comb filter, and the unvoiced sounds undergo a spectral subtraction. The comb filter is designed with a controllable bandwidth and is tuned by the estimated fundamental frequency (F0). The F0 is determined in the voiced frames using our spectral compression multi-scale product method.

Simulation results show that the proposed approach yields better results, in terms of higher PESQ scores and lower WSS values, than the compared methods. Informal listening tests also confirm the efficiency of the proposed approach, which results in a better enhanced speech than that obtained by recent methods.


Future work concerns the addition of fuzzy logic rules to filter the noisy speech signal in a more cognitive manner.

Acknowledgments The authors would like to thank Dr. A. Abel for his help throughout the revision of the paper.

Compliance with Ethical Standards

Conflict of Interest M. A. Ben Messaoud, A. Bouzid and N. Ellouze declare that they have no conflict of interest.

Informed Consent All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2008 (5). Additional informed consent was obtained from all patients for which identifying information is included in this article.

Human and Animal Rights This article does not contain any studies with human or animal subjects performed by any of the authors.

References

1. Hussain A, Chetouani M, Squartini S, Bastari A, Piazza F. Nonlinear speech enhancement: an overview. In: Stylianou Y, Faundez-Zanuy M, Esposito A, editors. LNCS 4391. Berlin: Springer; 2007. p. 217–48.
2. Boll SF. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process. 1979;27:113–8.
3. Hu HT, Kuo FJ, Wang HJ. Supplementary schemes to spectral subtraction for speech enhancement. Speech Commun. 2002;36:205–14.
4. Lu Y, Loizou PC. A geometric approach to spectral subtraction. Speech Commun. 2008;50:453–514.
5. Cadore J, Valverde-Albacete FJ, Gallardo-Antolín A, Peláez-Moreno C. Auditory-inspired morphological processing of speech spectrograms: applications in automatic speech recognition and speech enhancement. Cognit Comput. 2013;5:426–516.
6. Hu Y, Loizou PC. Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Trans Speech Audio Process. 2004;12:59–69.
7. Ding GH, Huang T, Xu B. Suppression of additive noise using a power spectral density MMSE estimator. IEEE Signal Process Lett. 2004;11:585–604.
8. Cohen I. Speech enhancement using a noncausal a priori SNR estimator. IEEE Signal Process Lett. 2004;11:725–34.
9. Lee KY, Jung S. Time-domain approach using multiple Kalman filters and EM algorithm to speech enhancement with nonstationary noise. IEEE Trans Speech Audio Process. 2000;8:282–310.
10. Zavarehei E, Vaseghi S. Speech enhancement in temporal DFT trajectories using Kalman filters. In: Proceedings of Interspeech, Lisbon; 2005.
11. Huang F, Lee T, Kleijn WB. Transform-domain Wiener filter for speech periodicity. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2012. p. 4577–84.
12. Hu Y, Loizou PC. A subspace approach for enhancing speech corrupted by colored noise. IEEE Signal Process Lett. 2002;9:204–13.
13. Hardwick J, Yoo CD, Lim JS. Speech enhancement using the dual excitation model. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 1993. p. 367–74.
14. Dubost S, Cappe O. Enhancement of speech based on non-parametric estimation of a time varying harmonic representation. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2000. p. 1859–64.
15. Deisher ME, Spanias AS. HMM-based speech enhancement using harmonic modeling. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 1997. p. 1175–84.
16. Jensen J, Hansen JHL. Speech enhancement using a constrained iterative sinusoidal model. IEEE Trans Speech Audio Process. 2001;9:731–810.
17. Squartini S, Schuller B, Hussain A. Cognitive and emotional information processing for human–machine interaction. Cognit Comput. 2012;4(4):383–93.
18. Espinosa-Duro V, Faundez-Zanuy M, Mekyska J. Beyond cognitive signals. Cognit Comput. 2011;3(2):374–8.
19. Esposito A. The perceptual and cognitive role of visual and auditory channels in conveying emotional information. Cognit Comput. 2009;1(3):268–311.
20. Abel A, Hussain A. Novel two-stage audiovisual speech filtering in noisy environments. Cognit Comput. 2014;6:200–18.
21. Abel A, Hussain A. Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system. SpringerBriefs in Cognitive Computation. Springer International Publishing; 2015.
22. Rotili R, Principi E, Squartini S, Schuller B. A real-time speech enhancement framework in noisy and reverberated acoustic scenarios. Cognit Comput. 2013;5:504–13.
23. Narayanan A, Wang DL. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2013.
24. Xu Y, Du J, Dai L, Lee C. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process Lett. 2014;21:65–74.
25. Cho E, Smith JO, Widrow B. Exploiting the harmonic structure for speech enhancement. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2012.
26. George E, Smith M. Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model. IEEE Trans Speech Audio Process. 1997;5:389–418.
27. Nehorai A, Porat B. Adaptive comb filtering for harmonic signal enhancement. IEEE Trans Acoust Speech Signal Process. 1986;34:1124–215.
28. Chen JH, Gersho A. Adaptive postfiltering for quality enhancement of coded speech. IEEE Trans Speech Audio Process. 1995;3:59–113.
29. Grancharov V, Plasberg JH, Samuelsson J, Kleijn WB. Generalized postfilter for speech quality enhancement. IEEE Trans Audio Speech Lang Process. 2008;16:57–8.
30. Jin W, Liu X, Scordilis MS. Speech enhancement using harmonic emphasis and comb filtering. IEEE Trans Audio Speech Lang Process. 2010;18:356–413.
31. Ahmadi S, Spanias A. Cepstrum-based pitch detection using a new statistical V/UV classification algorithm. IEEE Trans Speech Audio Process. 1999;7:333–6.
32. Fisher E, Tabrikian J, Dubnov S. Generalized likelihood ratio test for voiced–unvoiced decision in noisy speech using the harmonic model. IEEE Trans Audio Speech Lang Process. 2006;14:502–9.
33. Nakatani T, Amano S, Irino T, Ishizuka K, Kondo T. A method for fundamental frequency estimation and voicing decision: application to infant utterances recorded in real acoustical environments. Speech Commun. 2008;50:203–12.


34. Talkin D. A robust algorithm for pitch tracking (RAPT). In: Speech coding and synthesis. Amsterdam: Elsevier; 1995. p. 495–518.
35. de Cheveigné A, Kawahara H. YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Am. 2002;111:1917–2014.
36. Beritelli F, Casale S, Russo S, Serrano S. Adaptive V/UV speech detection based on characterization of background noise. EURASIP J Audio Speech Music Process. 2009. doi:10.1155/2009/965436.
37. Ben Messaoud MA, Bouzid A, Ellouze N. Estimation du pitch et décision de voisement par compression spectrale de l'autocorrélation du produit multi-échelle. In: Proceedings of Journées d'Etudes de la Parole (JEP-TALN-RECITAL 2012); 2012. p. 201–8.
38. Bouzid A, Ellouze N. Electroglottographic measures based on GCI and GOI detection using multiscale product. Int J Comput Commun Control. 2008;3:21–32.
39. Ben Messaoud MA, Bouzid A, Ellouze N. Using multi-scale product spectrum for single and multi-pitch estimation. IET Signal Process. 2011;5:344–412.
40. Xu Y, Weaver JB, Healy DM, Lu J. Wavelet transform domain filters: a spatially selective noise filtration technique. IEEE Trans Image Process. 1994;3:747–812.
41. Sadler BM, Swami A. Analysis of multi-scale products for step detection and estimation. IEEE Trans Inf Theory. 1999;45:1043–9.
42. Mallat S. A wavelet tour of signal processing. 3rd ed. San Diego: Academic Press; 2008.
43. Touzi A, Ben Messaoud MA. New approach for conception and implementation of object oriented expert system using UML. Int Arab J Inf Technol. 2009;6:99–108.
44. Ben Messaoud MA, Bouzid A, Ellouze N. An efficient method for fundamental frequency determination of noisy speech. In: Drugman T, Dutoit T, editors. LNCS 7911. Berlin: Springer; 2013. p. 33–41.
45. Hu Y, Loizou PC. Subjective comparison and evaluation of speech enhancement algorithms. Speech Commun. 2007;49:588–614.
46. ITU-T P.862. Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Recommendation; 2000.
47. Camacho A, Harris JG. A sawtooth waveform inspired pitch estimator for speech and music. J Acoust Soc Am. 2008;124:1638–715.
48. Ben Messaoud MA, Bouzid A, Ellouze N. Autocorrelation of the speech multi-scale product for voicing decision and pitch estimation. Cognit Comput. 2010;2:151–9.
49. Loizou PC. Speech enhancement: theory and practice. Dallas: CRC Press; 2007.
