
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2020.2986877, IEEE/ACM Transactions on Audio, Speech, and Language Processing

Multi-level Single-Channel Speech Enhancement Using a Unified Framework for Estimating Magnitude and Phase Spectra

Lavanya T, Nagarajan T, Member, IEEE, and Vijayalakshmi P, Senior Member, IEEE

Abstract—Speech enhancement algorithms aim to improve the quality and intelligibility of noise-corrupted speech through spectral or temporal modifications. Most existing speech enhancement algorithms achieve this by modifying the magnitude spectrum alone, while keeping the phase spectrum intact. In the current work, both the phase and magnitude spectra are modified to enhance noisy speech using a multi-level speech enhancement technique. The proposed phase compensation (PC) function achieves first-level enhancement by modifying the phase spectrum alone. Second-level enhancement performs energy redistribution in the phase-compensated speech signal to make weak speech and non-speech regions highly contrastive. Energy redistribution from the energy-rich voiced to the weak unvoiced regions is carried out using an adaptive power law transformation (APLT) technique, optimizing its parameters under a total-energy constraint with the particle swarm optimization algorithm. A log MMSE technique with a novel speech presence uncertainty (SPU) estimation method is proposed for third-level enhancement. The compensated phase spectrum and the magnitude spectrum estimated using log MMSE with the proposed SPU estimation (log MMSE + proposed SPU) are used to reconstruct the enhanced speech signal. The proposed speech enhancement technique is compared with recent speech enhancement techniques that estimate both magnitude and phase, for various noise levels (-5 to +5 dB), in terms of objective and subjective measures. It is observed that the proposed technique improves signal quality and maintains or improves intelligibility under stationary and non-stationary noise conditions.

The authors are with the Speech Lab, SSN College of Engineering, Chennai (e-mail: lavanyat@ssn.edu.in, nagarajant@ssn.edu.in, vijayalakshmip@ssn.edu.in).

I. INTRODUCTION

Speech is a vital medium for communication between human and human, and between human and machine. However, speech tends to lose its naturalness when it is produced or delivered in a noisy environment, resulting in a lack of audibility and loss of information. Speech under noisy environments can be enhanced by means of speech enhancement (SE) algorithms, whose performance depends solely on the accurate estimation of the noise spectrum. Conventional speech enhancement algorithms can be broadly classified into the following types: (i) spectral subtraction algorithms that estimate the clean speech spectrum by subtracting the noise spectrum from the noisy speech spectrum [1] [2]; (ii) filtering-based approaches that employ a linear estimator to minimize the mean square error between clean and enhanced speech [3] [4] [5]; (iii) subspace-based approaches based on the decomposition of noisy speech into a signal subspace and a noise subspace, where the estimate of the clean signal is obtained by nullifying the noise subspace [6] [7]; (iv) statistical approaches that estimate the short-time spectral amplitude of speech, taking into account the uncertainty of signal presence in noisy observations [8] [9]; and (v) neural network-based approaches [10] [11]. The widely used statistical approaches offer conventional estimation techniques such as the minimum mean square error-based short-time spectral amplitude estimator (MMSE) [12], the log minimum mean square error-based short-time spectral amplitude estimator (log MMSE) [8], and log MMSE with speech presence uncertainty (SPU) estimation (log MMSE+baseline SPU) [9]. These statistical approaches estimate a gain function, based on the probability of speech absence or presence, to modify the noisy magnitude spectrum, which is combined with the unaltered phase spectrum during signal reconstruction. This sort of signal reconstruction using the estimated magnitude and unaltered noisy phase spectra may lead to an inconsistent spectrogram [13]. Therefore, in the current work, both the phase and magnitude spectra are modified to improve the intelligibility and quality of the noisy speech signal.

The importance of phase information has been explored in the reconstruction of images [14] and the enhancement of noisy speech signals [15]. The authors of [16] carried out phase randomization for frequencies that correspond to residual noise that is perceived even after modifying the magnitude spectrum of noisy speech. In the context of speech, it was found that the phase spectrum holds intelligibility information [15], and it has been demonstrated that a longer window size of 20-40 ms during phase spectrum estimation has a major impact on intelligibility improvement. Phase-based enhancement techniques combine the estimated phase spectrum with the existing or modified magnitude spectrum to reconstruct the enhanced speech signal [17]. Several methodologies to estimate the clean phase spectrum (from the noisy speech spectrum) include: (i) defining a gain function proportional to the ratio between the phase spectra of the noisy speech and the noise estimate [18]; (ii) estimating phase from the MSE-based complex cepstrum [19]; (iii) f0-based phase estimation techniques that estimate the phase spectrum only in voiced speech regions (the noisy phase is kept unmodified in unvoiced speech regions) [20] [21]; (iv) source-filter model-based phase decomposition [22] [23]; (v) DNN-based speech enhancement techniques that estimate both magnitude and phase [24]; and (vi) phase compensation using the conjugate symmetry property of the Fourier transform [17]. In

2329-9290 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

the current work, a modified phase compensation technique is proposed for the estimation of the phase spectrum.

In phase compensation-based enhancement techniques, researchers have analysed the effect of various window functions and window sizes on phase spectrum estimation and speech intelligibility improvement. The authors of [25], [26] explicitly suggested using a rectangular window for estimating the phase spectrum, with the tacit understanding that using a tapered window would cause loss of information. In [17], the authors studied the effect of mismatched windows (i.e., Hamming and Chebyshev windows for extracting the magnitude and phase spectrum, respectively) and the effective dynamic range (20-40 dB) of the Chebyshev window in estimating the phase spectrum. It was observed that the use of mismatched windows, and appropriately varying the attenuation factor of the Chebyshev window, resulted in improved speech quality.

In general, the phase compensation function estimates a clean phase spectrum which, when combined with the corresponding noisy magnitude spectrum, results in a partially clean speech spectrum. In the current work, mismatched windows are used to estimate the phase and magnitude spectra, where a tapered Chebyshev window with an attenuation factor of higher dynamic range (40 dB) is used for estimating the phase information using the proposed phase compensation (proposed PC) technique. Unlike the baseline phase compensation (baseline PC) function [17], which divides the noisy speech into two regions, namely high and low SNR regions, the proposed PC function divides the low SNR region further into unvoiced and speech-pause regions and treats them separately. This in turn provides scope to prevent the attenuation of weak speech regions.

Further, to obtain a more detailed clean speech spectrum estimate, the magnitude spectrum of the noisy speech is processed in the subsequent levels. Prior to magnitude spectrum estimation, in the second level, the phase-compensated signal energy is redistributed using adaptive power law transformation (APLT) [27] (an image enhancement technique). The parameters of APLT are optimized using the particle swarm optimization (PSO) algorithm [28], in such a way that the total energy of the signal is maintained. This kind of redistribution, prior to estimation of the clean magnitude spectrum, is expected to prevent attenuation of weak speech components.

Among the statistics-based magnitude spectrum estimation approaches discussed earlier, log MMSE with the baseline speech presence uncertainty (log MMSE+baseline SPU) estimation [9] is found to achieve effective noise reduction in the noisy speech signal. The performance of this technique has improved with frame-wise SPU estimation, rather than considering speech absence to be equiprobable with speech presence [29], or an empirically fixed value [8]. The speech presence uncertainty (SPU) estimator proposed in [9] assigns a large speech absence probability (SAP) value if the neighbouring frequency bins do not contain speech. Additionally, in order to avoid attenuation of weak speech components, even the noise-dominant frames are not given the maximum SAP value of 1. This results in assigning equal speech absence probability to weak speech and speech-pause regions under low SNR conditions. This equitable treatment of these regions may result in attenuation of weak speech regions during noise reduction.

In the current work, to retain the weak speech components even in low SNR conditions, an SPU estimation technique which estimates the SAP values as a function of the RMS value is proposed. Here, the SAP is estimated for the overlapping frames corresponding to the unprocessed noisy magnitude speech spectrum, and the gain factor obtained is used to modify the magnitude spectrum of the energy-redistributed signal.

Based on the above observations and intuitions, a multi-level speech enhancement technique is proposed as depicted in Fig. 1, where the phase spectrum is estimated using the proposed phase compensation technique and the magnitude spectrum is estimated using the proposed SPU estimation-based log MMSE technique (log MMSE+proposed SPU). The main objectives of the three levels are given below:
• In the first level, the noisy signal is subjected to phase compensation, where the amount of noise reduction is proportional to the signal energy. This is expected to reduce the noise and prevent the attenuation of weak speech components.
• In the second level, adaptive power law transformation along with a constrained particle swarm optimization algorithm is employed to achieve energy redistribution.
• In the third level, the log MMSE+proposed SPU estimation technique estimates the magnitude of the energy-redistributed, pre-processed signal.
The estimated magnitude and phase spectra are combined, and the enhanced signal is reconstructed using the IFFT operation and the overlap-add technique.

The overall performance of the proposed system is compared with recent techniques, [17], [20], [21] and [24], that also estimate both magnitude and phase for speech enhancement.

The organisation of this paper is as follows: Section II provides a detailed description of the proposed phase compensation (proposed PC) technique. Section III describes the energy redistribution using APLT with constrained optimization. Speech enhancement using log MMSE+proposed SPU is discussed in Section IV. The experimental setup and the performance of the proposed multi-level speech enhancement algorithm under various noise conditions, in comparison with recent techniques, are discussed in Section V and Section VI respectively. Finally, Section VII concludes the paper.

II. PROPOSED PHASE COMPENSATION TECHNIQUE

As discussed earlier, this work focusses on estimating both the magnitude and the phase spectrum, where a modified phase compensation function (proposed PC) is proposed to obtain the latter. The phase compensation technique [17] [30] modifies the conjugate symmetry property of the noisy complex speech spectrum using a compensation function, as explained below.

Let the Fourier transform (FT) derived noisy complex speech spectrum be Y(i, k) = X(i, k) + D(i, k), where


Fig. 1: Block diagram of the proposed multi-level speech enhancement technique

X(i, k) and D(i, k) are the speech and noise spectra, respectively. Here, i denotes the frame index with N FFT points, where 0 ≤ k ≤ N − 1. The symmetry property of the FT is exploited here to estimate the clean phase spectrum from Y(i, k) through phase compensation. The phase compensation function is an offset factor that is used to modify the angular spacing between complex conjugate spectral component pairs. The modified angular spacing is expected to be the estimate of the clean phase. Therefore, the estimated phase spectrum, when combined with the estimated magnitude spectrum, may result in an enhanced speech signal. Estimation of the phase spectrum using the proposed PC function is illustrated below:
• Let a + ib and a − ib be any conjugate vector pair in the noisy complex speech spectrum, Y(i, k).
• Let θ1 and θ2 be the angles associated with a + ib and a − ib, with reference to the real axis, where θ1 = tan⁻¹(b/a) and θ2 = tan⁻¹(−b/a).
• During the IFFT operation, the magnitude of the resultant vector, R, is computed:

R = √(2(a² + b²)(1 + cos(θ)))   (1)

Here, θ is the angular spacing between the vectors, given by θ = θ1 + θ2.
• As these conjugate spectral vectors are equidistant from the real axis due to the symmetry property, θ2 = −θ1, leading to a resultant vector with magnitude R = 2M, where

M = √(a² + b²)   (2)

• Let the phase compensation function for the conjugate vectors a ± ib be ±Λ.
• This function is added to the conjugate vectors to generate the modified complex vectors (a + Λ) + ib and (a − Λ) − ib.
• The new angles associated with the modified conjugate vector pairs are θ̂1 and θ̂2, with |θ̂1| ≠ |θ̂2|, where θ̂1 = tan⁻¹(b/(a + Λ)) and θ̂2 = tan⁻¹(−b/(a − Λ)).
• The angular spacing between the modified complex conjugate spectral vectors is given by θ̂ = θ̂1 + θ̂2.
• It is noticed that the modified angular spacing (θ̂) increases (i.e., θ̂ > θ) when the conjugate vectors (a ± ib) are offset by a factor ±Λ.
• Table I describes the variation in the resultant vector magnitude, R̂, with respect to θ̂.

TABLE I: Angular spacing (θ̂) vs. magnitude of the resultant vector (R̂)

θ̂       : 0°    45°     90°    135°     180°
cos(θ̂)  : 1     0.7071  0      −0.7071  −1
R̂       : 2M    1.85M   1.41M  0.77M    0

• From Table I it is clear that an increase in the angular spacing θ̂ causes a decrease in the magnitude of the resultant vector R̂ of the modified complex vectors, provided |Λ| > M, where M is the magnitude of the unmodified conjugate vector pairs.
• When |Λ| < M, the phase compensation function has little or no impact on the angular spacing; therefore, θ̂ ≈ θ.
• The modified angular spacing can take values in the interval [0, π], depending upon the value of Λ (refer to Table I).
• The magnitude of the resultant vector of the modified complex vectors, R̂, can have a maximum magnitude of 2M, with an angular spacing of 0°, when θ̂ = θ (i.e., when |Λ| < M) (refer to Table I).
• The magnitude of the resultant vector becomes zero when the angular spacing reaches 180° (refer to Table I). That


is, the magnitude of the resultant vector can be modified by tuning the angular spacing, θ̂, appropriately.

The baseline phase compensation (baseline PC) function [17], Λ1(i, k), is directly proportional to the noise magnitude spectrum, |D̂(i, k)|, and is given by

Λ1(i, k) = λ Ψ(k) |D̂(i, k)|   (3)

Here, λ is an empirically found constant value and Ψ(k) is an anti-symmetric function given by

Ψ(k) = { 1,  0 ≤ k ≤ N/2 − 1;   0,  k = N/2;   −1,  N/2 + 1 ≤ k ≤ N − 1 }   (4)

As Λ1(i, k) is directly proportional to |D̂(i, k)|, it attenuates noise regions to a greater extent than speech regions. This may attenuate the weak speech regions as well when the signal power is too weak, especially under low SNR conditions. Therefore, the proposed PC function, Λ2(i, k), is framed to modify the angular spacing in such a way that it attenuates the noise components and enhances the signal regions, especially the weak speech regions:

Λ2(i, k) = { 1/(Pnoisy(i, k) + Pnoise(i, k)),  0 ≤ k ≤ N/2 − 1;   0,  k = N/2;   −1/(Pnoisy(i, k) + Pnoise(i, k)),  N/2 + 1 ≤ k ≤ N − 1 }   (5)

Here, Pnoisy and Pnoise are the noisy signal power and the noise power, respectively. Unlike Λ1(i, k), which is directly proportional to the noise power, the proposed PC function is inversely proportional to the sum of the noisy signal power and the noise power. Here, Λ2(i, k) is made anti-symmetric to imitate the phase spectrum.

In the current work, the noisy speech is classified into three regions, namely (i) energy-rich voiced regions (R1), (ii) weak unvoiced regions (R2), and (iii) completely noise regions (R3). It is known that noise affects the weak unvoiced regions more than the voiced regions, since the energy of R1 > the energy of R2. At the same time, the energy of R2 ≃ the energy of R3, since they possess similar acoustic characteristics. Therefore, care must be taken to ensure that the weak speech components are retained to a greater extent during noise cancellation. The working of the proposed PC function varies across R1, R2 and R3, and is described in detail below.

In R1 regions: In the high SNR regions, R1, since Pnoise << Pnoisy, the proposed PC function is inversely proportional to the noisy signal energy. That is,

Λ2,R1(i, k) = { 1/Pnoisy(i, k),  0 ≤ k ≤ N/2 − 1;   0,  k = N/2;   −1/Pnoisy(i, k),  N/2 + 1 ≤ k ≤ N − 1 }   (6)

As Λ2,R1(i, k) takes a value close to zero, θ̂ exhibits only a very small variation and the spectral vectors remain nearly symmetric (R̂ ≈ 2M). This is in accordance with the previous findings in [31].

In R2 regions: In the low energy regions, R2, since both the noise and signal power influence the compensation function, Λ2,R2(i, k) = Λ2(i, k). However, as the R2 regions are weak unvoiced regions, Pnoisy,R2 << Pnoisy,R1, making the denominator (Pnoisy + Pnoise)R2 < Pnoisy,R1. Therefore, Λ2,R2(i, k) > Λ2,R1(i, k), 0° < θ̂ < 180°, and 0 < R̂ < 2M. Thus the phase compensation function is expected to attenuate noise components in these regions and preserve speech components.

In R3 regions: Since Pnoise = Pnoisy in the R3 regions, the phase compensation function reduces to

Λ2,R3(i, k) = { 1/(2 Pnoise(i, k)),  0 ≤ k ≤ N/2 − 1;   0,  k = N/2;   −1/(2 Pnoise(i, k)),  N/2 + 1 ≤ k ≤ N − 1 }   (7)

It is clear that Λ2,R3(i, k) > Λ2,R2(i, k). As discussed earlier, the higher the value of the compensation function, the higher the angular spacing and the smaller the resultant magnitude. Therefore, the R3 regions experience much attenuation in terms of the resultant magnitude, as expected. Following this description, it can be concluded that, for any noisy speech signal, Λ2,R1 < Λ2,R2 < Λ2,R3. Therefore, the proposed PC technique is expected to remove noise while also preserving the weak speech regions.

Further, it can be noted that in the R1 and R2 regions, Λ2(i, k) is inversely proportional to Pnoisy(i, k) and Pnoisy(i, k) + Pnoise(i, k), respectively. We know that Pnoisy(i, k) >> |D̂(i, k)| and Pnoisy(i, k) + Pnoise(i, k) > |D̂(i, k)| in the R1 and R2 regions, respectively. Therefore, Λ1(i, k) > Λ2(i, k) or Λ1(i, k) ≈ Λ2(i, k) in R1, and Λ1(i, k) >> Λ2(i, k) in the R2 regions. Since Pnoise(i, k) ≈ |D̂(i, k)| in the R3 regions, the following three cases can exist: (i) when Pnoise(i, k) > 1, Λ2(i, k) achieves less noise reduction compared to Λ1(i, k); (ii) when Pnoise(i, k) < 1, Λ2(i, k) achieves greater noise reduction compared to Λ1(i, k); (iii) when Pnoise(i, k) = 1, Λ1(i, k) and Λ2(i, k) attain the same amount of attenuation.

The significance of the proposed PC function, Λ2, in enhancing the noisy speech signal compared to the baseline PC function, Λ1, can be explained using vector diagrams (refer to Fig. 2), as given below.

In Fig. 2 (i), (ii) and (iii), Vact and Vact* are the actual complex conjugate spectral vectors of the noisy speech signal, and the resultant vector magnitude is given by Re[Vact + Vact*]. V1, V1* and V2, V2* are the modified vectors, generated by adding the baseline PC (Λ1) and the proposed PC (Λ2) functions, respectively, to the actual vectors.
• In the R1 region (refer to Fig. 2 (i)), since Λ1 ≥ Λ2, both the baseline PC and proposed PC functions mildly attenuate the vectors; therefore, both Re[V2 + V2*] and Re[V1 + V1*] are approximately equal to or slightly less than the actual magnitude of the noisy speech, Re[Vact + Vact*].
• In the R2 region (refer to Fig. 2 (ii)), since Λ1 >> Λ2, the attenuation provided by the baseline PC method is high compared to the proposed PC technique. Hence, in the R2


Fig. 2: Phasor diagrams representing phase compensation in (i) the energy-rich voiced region (R1), (ii) the unvoiced region (R2), and (iii) the completely noise region (R3)

Fig. 3: Consonant-Vowel-Consonant (CVC) transition segments, (I) /nd-aa-n/ and (II) /ch-a-th/, enhanced by the proposed PC technique, where the noisy signal is corrupted by white noise at the 0 dB SNR level: (a) clean speech segment, (b) noisy speech segment; noisy speech segment enhanced by (c) the baseline PC technique, and (d) the proposed PC technique


region, Re[Vact + Vact*] > Re[V2 + V2*] > Re[V1 + V1*]. Therefore, it is clear that the proposed PC enhances the weak speech regions, unlike the baseline PC.
• In the R3 region (refer to Fig. 2 (iii)), for Pnoise(i, k) < 1, Λ2 > Λ1 and Re[Vact + Vact*] > Re[V1 + V1*] > Re[V2 + V2*]. Therefore, in this case, the proposed method attenuates the noise regions to a greater extent than the baseline technique.

In the baseline PC method, |D̂(i, k)| (refer to equation 3) is estimated from the initial frames and is kept constant throughout the utterance. However, this may not be suitable for non-stationary noise conditions. To overcome this issue, in the current work, the adaptive (Martin's) noise estimation algorithm [32] is adopted for frame-level noise power computation. The magnitude and phase spectra of the noisy speech are derived using a Hamming window and a Chebyshev window (with a 40 dB attenuation factor), respectively. Signals at a sampling rate of 8 kHz (unless otherwise specified) are used throughout this work, and the frame size is chosen to be 32 ms [17] with 25% overlap.

The proposed PC function, Λ2(i, k), is added to the noisy complex speech spectrum, Y(i, k), to generate the modified complex spectrum, Ymod(i, k):

Ymod(i, k) = Y(i, k) + Λ2(i, k)   (8)

The phase spectrum of Ymod is expected to be the estimate of the clean phase spectrum, θ̂(i, k):

θ̂(i, k) = ∠Ymod(i, k)   (9)

The phase-compensated complex speech spectrum, Y1 (a partially clean spectrum), is obtained by combining the estimated phase spectrum with the unmodified noisy magnitude spectrum:

Y1(i, k) = |Y(i, k)| e^(j θ̂(i, k))   (10)

The phase-compensated time-domain signal, y1(n), is reconstructed from Y1(i, k) using the IFFT operation. Since Y1(i, k) exhibits no symmetry, the IFFT operation may yield a complex number. Therefore, to reconstruct the real-valued, time-domain signal, the imaginary part is discarded:

y1(n) = Re(IFFT(Y1(i, k)))   (11)

From Fig. 3, it is clear that the proposed PC technique achieved better enhancement, especially in the consonant-vowel-consonant (CVC) regions, than the baseline PC. Fig. 4 depicts the spectrographic representation of the noisy speech signal enhanced by the baseline PC and proposed PC techniques. This clearly shows that the proposed method attenuates noise and preserves the weak speech regions.

Fig. 4: Spectrogram of: (a) clean speech, (b) noisy speech signal corrupted by white noise at 0 dB, (c) speech enhanced by baseline PC, and (d) speech enhanced by proposed PC

Following the phase estimation using the proposed PC technique, estimation of the magnitude spectrum is carried out, which involves a pre-processing approach, and is discussed below.

III. ENERGY REDISTRIBUTION WITH CONSTRAINED OPTIMIZATION USING APLT AND PSO

Most speech enhancement algorithms depend solely on the estimation of the clean magnitude spectrum from the noisy speech spectrum [12] [8] [29]. Decreased energy-level contrast between the noise and the weak speech regions may cause them to be attenuated equally during noise cancellation. This may lead to decreased overall speech quality and intelligibility. A few consonant classes, such as unvoiced stops, fricatives, etc., are inherently low-energy sound units. Under noisy conditions, these units suffer further degradation in energy and become indistinguishable from the noise segments. Therefore, prior to magnitude estimation, the energy level of the weak speech regions is increased so that they become highly contrastive with the noise regions. Since speech is non-stationary, uniformly amplifying frames of different SNR levels may cause loudness to reach the level of discomfort [33] [34]. Therefore, the proposed technique redistributes energy between energy-rich (high SNR) and weak-energy (low SNR) regions, with a constraint that the total energy of the signal is maintained. Noise-only frames are found with the help of a threshold computed from the average noise energy estimated from the initial frames of the phase-compensated signal, such that energy redistribution to these frames is avoided.

In the current work, energy redistribution in the phase-compensated signal is carried out using the adaptive power law transformation (APLT) technique [27], which is discussed below. The phase-compensated signal, y1(n), is segmented with a frame size of 32 ms duration [30] with 8 ms overlap, using a rectangular window. The APLT function for energy redistribution is given below:

Enew(m) = c [1 + (k · d(m))] Eold(m)^([1 − (k · d(m))] γ)   (12)

Here, d(m) is the deviation between the old energy, Eold(m), and the mean energy value, µ, of y1(n):

d(m) = Eold(m) − µ   (13)

Eold(m) = Σ_{n=0}^{N−1} y1²(n)   (14)

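As a concrete illustration, the per-frame APLT mapping of equations (12)-(14) can be sketched in Python. This is a standalone sketch: the values of c, γ, µ and the frame energies below are illustrative choices of ours, not values from the paper.

```python
K = 0.001  # the empirically chosen constant k from the paper

def aplt_energy(E_old, mu, c, gamma):
    """Eq. (12): E_new(m) = c [1 + k d(m)] E_old(m)^([1 - k d(m)] gamma),
    with the deviation d(m) = E_old(m) - mu as in eq. (13)."""
    d = E_old - mu
    return c * (1.0 + K * d) * (E_old ** ((1.0 - K * d) * gamma))

# A frame whose energy lies above the mean is mapped to a larger value than
# a frame below the mean, i.e. the mapping compensates for the deviation d(m).
mu = 5.0
high = aplt_energy(9.0, mu, c=0.5, gamma=0.5)  # d(m) = +4
low = aplt_energy(1.0, mu, c=0.5, gamma=0.5)   # d(m) = -4
```

Note that c and γ only set the overall degree of change here; the paper estimates them per frame via constrained PSO, which this sketch does not attempt.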

Fig. 5: Flow diagram of proposed energy redistribution with constrained optimization

Fig. 6: An illustration of energy redistribution using APLT with constrained optimization

µ = (1/J) Σ_{m=1}^{J} Eold(m)   (15)

Here, m, n, J and N denote the frame index, the sample index, the total number of frames in a signal and the total number of samples in a frame, respectively. k is a constant which is set to 0.001, an empirically chosen value. The APLT function is framed in such a way that it compensates for the deviation, d(m), between Eold(m) and µ, by transforming Eold(m) to a new value, Enew(m) (refer to equation 12). When the deviation d(m) is high, the transformed energy value Enew(m) will be greater than Eold(m), and vice-versa. The energy-controlling parameters, c and γ (refer to equation 12), can be used to control the degree to which the energy Enew(m) is increased or decreased. Therefore, constrained optimization using particle swarm optimization (PSO) is carried out to estimate c and γ for every frame, in such a way that the total energy of the signal is maintained, as given below:

Σ_{m=1}^{J} Enew(m) = Σ_{m=1}^{J} Eold(m)   (16)

The c and γ values for every frame are initialized with values chosen in the interval [0, 1]. The energy-modified signal, y2(n), is given by

y2(n)|m = y1(n)|m · √(Enew(m)/Eold(m))   (17)
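The total-energy constraint of equation (16) and the per-frame amplitude scaling of equation (17) can be sketched as follows. As a simplification, the per-frame PSO search for c and γ is replaced here by fixed values followed by a global rescaling that enforces the constraint exactly; this stand-in is ours, not the authors' optimization procedure. The square root in the amplitude scaling reflects that frame energy is quadratic in amplitude.

```python
import math

K = 0.001  # empirically chosen constant k from the paper

def aplt_energy(E_old, mu, c=0.5, gamma=0.5):
    """Eq. (12) with d(m) = E_old(m) - mu (eq. (13)); c and gamma fixed here."""
    d = E_old - mu
    return c * (1.0 + K * d) * (E_old ** ((1.0 - K * d) * gamma))

def redistribute(energies):
    """Map per-frame energies through APLT, then rescale so that the total
    energy is unchanged, i.e. sum E_new = sum E_old as in eq. (16)."""
    mu = sum(energies) / len(energies)      # mean frame energy, eq. (15)
    raw = [aplt_energy(E, mu) for E in energies]
    scale = sum(energies) / sum(raw)        # stand-in for the PSO search
    return [E * scale for E in raw]

def scale_frame(frame, E_old, E_new):
    """Eq. (17): scale the frame's samples so its energy becomes E_new."""
    g = math.sqrt(E_new / E_old)
    return [g * x for x in frame]

E_old = [9.0, 1.0, 0.2, 6.0]                # toy per-frame energies (ours)
E_new = redistribute(E_old)
# Total energy is preserved while energy moves from strong to weak frames.
```

The noise-only-frame threshold described above is omitted; in the paper, frames below the threshold would instead be assigned the small fixed energy (0.0001) rather than redistributed energy.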

As discussed earlier, in order to restrict the energy distribution to noise-only regions, a noise-power based threshold value is used. In the noise-dominant frames (i.e., when the energy lies below the threshold), the energy is replaced with a small value (say, 0.0001). Subsequently, when the energy of the current frame is greater than the threshold value, it is replaced by the newly found energy value, E_new(m). Therefore, the proposed energy redistribution technique with constrained optimization redistributes energy from the high energy voiced to the low energy unvoiced regions, and assigns a fixed value (here, 0.0001) to the speech-pause regions. The flow diagram depicted in Fig. 5 describes the working of the proposed energy redistribution technique. Fig. 6 shows the signal energy redistribution carried out using the proposed technique, for a phase compensated signal, whereas Fig. 7 shows the signals before and after the proposed energy redistribution. In Fig. 7, a high SNR region (marked in red, using a rectangle) and a low SNR region (marked in green, using an ellipse), corresponding to the phase compensated and energy redistributed signals under the 0 dB white noise condition, are shown. It can be inferred from Fig. 7 that the amplitude of the marked low SNR region of the phase compensated signal is improved after energy redistribution, and vice-versa for the high SNR region. From the energy redistributed signal, y_2(n), magnitude estimation is carried out using the log MMSE with the proposed SPU estimation (log MMSE+proposed SPU), as explained in the following section.

IV. LOG MMSE WITH PROPOSED SPEECH PRESENCE UNCERTAINTY ESTIMATION TECHNIQUE

log MMSE, a well-established error minimization technique, is used in our current work, along with our proposed speech presence uncertainty (proposed SPU) estimation technique, to estimate the clean magnitude spectrum. The proposed technique (log MMSE+proposed SPU) estimates the magnitude spectrum from the pre-processed (phase compensated and energy redistributed) signal, y_2(n). Here, the magnitude spectrum is extracted using a modified Hanning window function [35].

Let Y_2(i,k) be the spectrum of the pre-processed signal. Since the noisy magnitude spectrum is used in reconstructing the phase compensated signal, the pre-processed signal will still have some amount of residual noise, i.e., Y_2(i,k) = X_2(i,k) + D_2(i,k), where X_2(i,k) and D_2(i,k) are the speech and noise spectra, respectively, of the pre-processed signal. The log MMSE+baseline SPU based spectral amplitude estimate [9] is given by

\hat{A}(i,k) \triangleq G(i,k)\,|Y_2(i,k)|    (18)

where G is the gain function for the estimator [9]. Here, i denotes the frame index and k denotes the frequency bin index, with N FFT points. The spectral gain function, G, is the product of hypothetical gains corresponding to the signal presence and absence probabilities (say, G_1 and G_2, respectively), which is given by G = G_1^{p(i,k)} \cdot G_2^{1-p(i,k)} [9]. Here, G_2 is a lower bound threshold value, assigned to prevent the gain function (G) from reducing to zero when the speech absence probability reaches its maximum.

The associated gain value for speech presence is given by

G_1 = \frac{\xi(i,k)}{1+\xi(i,k)} \exp\left\{ 0.5 \int_{v(i,k)}^{\infty} \frac{e^{-t}}{t}\,dt \right\}    (19)

where

v(i,k) \triangleq \frac{\gamma(i,k)\,\xi(i,k)}{1+\xi(i,k)}

The a priori SNR, \xi(i,k), and the a posteriori SNR, \gamma(i,k), are given in the following equations:

\xi(i,k) \triangleq \frac{\lambda_{X_2}(i,k)}{\lambda_{D_2}(i,k)}    (20)

\gamma(i,k) \triangleq \frac{|Y_2(i,k)|^2}{\lambda_{D_2}(i,k)}    (21)

To prevent attenuation of weak SNR speech regions during noise removal, in our proposed work, \xi_{min} is set to a maximum value of 1 (in [9], it is set to 0.1).

The probability of conditional speech presence, p(i,k), is given as

p(i,k) = \left\{ 1 + \frac{q(i,k)}{1-q(i,k)}\,(1+\xi(i,k))\,\exp(-v(i,k)) \right\}^{-1}    (22)

where q(i,k) is the speech absence probability.

In [9], it is stated that the conventional log MMSE with a modified speech presence uncertainty estimator (log MMSE+baseline SPU) achieved greater noise reduction, especially at low SNR levels. It estimates the speech absence probability q(i,k) of a frame by considering the contextual signal characteristics at the preceding, succeeding, and current frames. However, this may assign a maximum speech absence probability (SAP) to weak sound units that are surrounded either by silence, or by a weak sound unit, or by both. Therefore, the baseline technique may attenuate weak speech and speech-pause regions equally. To overcome this, in the current work, speech frames are classified into 4 classes based on the RMS level, namely, (i) very low (VL), (ii) low (L), (iii) mid (M) and (iv) high (H). Each level is assigned an empirically chosen weight (W), where the high regions are those for which
1) RMS ≥ 0 dB. Similarly, the RMS thresholds defined for the M, L, and VL regions are
2) −5 dB ≤ RMS < 0 dB,
3) −7 dB < RMS < −5 dB, and
4) RMS ≤ −7 dB, respectively.

The corresponding empirically chosen weights associated with the VL, L, M and H regions are 0, 0.35, 0.98 and 1, respectively. The proposed SPU estimation technique estimates the q(i,k) value for every i-th frame in the noise-corrupted speech signal as given below:

q(i,k) = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} |Y(i,k)|^2} \cdot W(i)    (23)

where |Y(i,k)| is the magnitude spectrum of the noisy speech signal, and W(i) is the weight factor associated with the i-th frame.
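The gain computation in equations (19)–(22) can be sketched in a few lines of plain Python. The exponential integral in equation (19) is approximated here by simple trapezoidal quadrature over a truncated interval; the function names, the quadrature, and the example parameter values are our own illustration, not the paper's code.

```python
import math

def expint_e1(v, span=40.0, steps=20000):
    """Approximate the integral of exp(-t)/t from v to infinity
    (equation 19) by the trapezoid rule; the tail beyond v + span
    is negligible."""
    a, b = v, v + span
    h = (b - a) / steps
    total = 0.5 * (math.exp(-a) / a + math.exp(-b) / b)
    for i in range(1, steps):
        t = a + i * h
        total += math.exp(-t) / t
    return total * h

def lsa_gain(xi, gamma_post, q, g2=0.1):
    """Spectral gain G = G1^p * G2^(1-p) for one time-frequency bin,
    following equations (19) and (22); g2 is the lower-bound gain."""
    v = gamma_post * xi / (1.0 + xi)                                # v(i,k)
    g1 = (xi / (1.0 + xi)) * math.exp(0.5 * expint_e1(v))           # eq. (19)
    p = 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * math.exp(-v))   # eq. (22)
    return (g1 ** p) * (g2 ** (1.0 - p))
```

For example, with xi = 1, gamma_post = 2 and q = 0.5, one gets v = 1, G1 ≈ 0.558 and p ≈ 0.576; the lower-bound gain g2 keeps G from collapsing to zero when p approaches 0.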
Fig. 7: Effect of energy redistribution at high SNR (indicated by rectangle) and low SNR (indicated by ellipse) regions:- (a)
Phase compensated signal, (b) Energy redistributed signal
Fig. 8: Speech absence probability computed for a noisy signal corrupted at -5 dB SNR white noise condition, using the proposed technique (indicated by dashed red line), and the baseline technique (indicated by solid green line)

In Fig. 8, the speech absence probability (SAP) for a speech signal corrupted by white noise at -5 dB SNR, computed by the baseline SPU and the proposed SPU estimation techniques, is plotted. For better understanding, the SAP values are plotted against the corresponding clean speech.

A comparison is made between the signals enhanced by the log MMSE+baseline SPU (directly applied on the noisy signal) and the log MMSE+proposed SPU estimation technique (applied on the pre-processed signal). Fig. 9 indicates that the speech enhanced by the proposed technique effectively recovers most of the sound units (especially the weak energy regions), when compared to the baseline technique. Here, the magnitude spectrum estimated using the log MMSE+proposed SPU estimation technique and the phase spectrum extracted from the phase compensated signal (again using the modified Hanning window function [35]) are combined to reconstruct the enhanced speech spectrum, \hat{X}(i,k):

\hat{X}(i,k) = \hat{A}(i,k) \cdot \exp(j\,\hat{\theta}(i,k))    (24)

The enhanced speech signal, \hat{x}(n), is reconstructed by taking the inverse fast Fourier transform (IFFT):

\hat{x}(n) = \mathrm{Re}(\mathrm{IFFT}(\hat{X}(i,k)))    (25)

A spectrographic representation, comparing the techniques involved in the proposed multi-level speech enhancement with their existing counterparts, is shown in Fig. 10. It is inferred that the proposed technique achieves greater noise reduction along with effective restoration of speech components.

V. EXPERIMENTAL SETUP

As discussed earlier, the overall performance of the proposed system is compared with the following techniques: (a) PASE-DNN [24], (b) PDA+MMSE-STSA [21], (c) PRVS+MMSE-STSA [20], and (d) Baseline PC+MMSE-STSA [17]. In techniques (b), (c) and (d), the conventional MMSE-STSA technique [12] is used for magnitude spectrum estimation. Here, PASE-DNN [24] employs a 3-layered feed-forward architecture, where the ideal amplitude mask (IAM) and the instantaneous frequency deviation (IFD) are set as the training targets to estimate both magnitude and phase from the noisy speech signal.

A. Speech corpus

The proposed speech enhancement algorithm, based on the estimate of phase and magnitude spectra, is evaluated on noisy speech signals in Tamil and English. Literary Tamil (LT) and Colloquial Tamil (CT) speech data are collected from native Tamil speakers, in the age group of 20-30 years, using a hand-held microphone in a laboratory environment. In order to evaluate the algorithm for English data, the IEEE sentence database [36], spoken by a native speaker of English, is used. Nasalized English sentences spoken by a non-native English speaker, recorded in a laboratory environment, are also used for evaluation. Data from various sources are considered for evaluation to check the effectiveness of the proposed algorithm in handling the linguistic variations under various noise conditions.
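Equations (24) and (25) simply recombine the estimated magnitude with the compensated phase and invert the transform. The sketch below illustrates this for a single frame using a naive DFT in plain Python; the helper names are our own, and a real implementation would use an FFT with windowing and overlap-add.

```python
import cmath

def dft(x):
    """Naive forward DFT of a real frame (illustration only)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def reconstruct_frame(magnitude, phase):
    """Combine magnitude and phase (eq. 24) and return the real part
    of the inverse DFT (eq. 25)."""
    N = len(magnitude)
    X = [a * cmath.exp(1j * th) for a, th in zip(magnitude, phase)]  # eq. (24)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]       # eq. (25)
```

Splitting a frame into |X| and angle(X) and recombining recovers the frame exactly; this is the identity the enhancement pipeline relies on when it swaps in the estimated magnitude Â and the compensated phase θ̂.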
Fig. 9: Comparing the baseline and proposed SPU estimation techniques:- (a) clean signal, (b) noisy signal at 0 dB white noise condition, (c) noisy signal enhanced using the log MMSE+baseline SPU technique, and (d) energy redistributed signal enhanced using the log MMSE+proposed SPU technique

10 sentences each for LT, CT, and the IEEE database, and 5 nasal sentences (35 sentences in total) are used for evaluating the algorithm. The speech signals considered here are originally sampled at 48 kHz and down-sampled to 8 kHz. For training PASE-DNN, a one-hour Tamil speech corpus and 30 minutes of the TIMIT speech corpus are used.

Fig. 10: Spectrographic representation of the proposed multi-level speech enhancement technique:- (a) clean signal, (b) noisy signal corrupted by white noise at 0 dB SNR level, (c) baseline PC, (d) proposed PC, (e) APLT-based energy redistribution with constrained optimization using PSO, (f) log MMSE+baseline SPU, (g) log MMSE+proposed SPU

B. Noisy speech data

Both stationary and non-stationary noise conditions are considered for evaluation. Stationary noises, such as white and pink noise, and non-stationary noises, including babble and buccaneer noises [32], are considered. These noises are used for corrupting the speech signals at -5, -4, -2, 0, 2, 4, and 5 dB SNR levels. In total, 980 sentences (35*7*4) are used for the objective and subjective evaluation of the algorithm.

VI. PERFORMANCE EVALUATION

A. Objective evaluation of quality and intelligibility

Perceptual evaluation of speech quality (PESQ) [29] and short-time objective intelligibility (STOI) [37] measures are used for evaluating the proposed algorithm and the other techniques considered for comparison, in terms of quality and intelligibility, respectively. The PESQ and STOI metrics provide scores in the ranges [-0.5, 4.5] and [0, 1], respectively; a higher PESQ or STOI score indicates better performance.

The PESQ and STOI scores obtained for the various noise conditions are tabulated in Table II, and the following inferences are made for all the SNR levels considered.

a) PESQ-based evaluation:
• Under both stationary and non-stationary noise conditions, the PESQ scores indicate that the proposed algorithm achieved a higher improvement in quality compared to all the techniques, under all the considered situations, except PASE-DNN.
• The PESQ scores obtained by PASE-DNN at high SNRs (> 0 dB) are only 0.05−0.15 higher than those of the proposed technique. However, the proposed technique leads PASE-DNN in terms of PESQ under all low SNR conditions (≤ 0 dB), for all noise conditions.

b) STOI-based evaluation:
• The STOI scores indicate that the proposed technique maintains the intelligibility to a greater extent, especially at low SNR conditions, except under the non-stationary noise conditions.
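The corruption of clean speech at a prescribed SNR, used to build the noisy test material described above, can be sketched as follows. This is a minimal plain-Python illustration under the usual global-SNR definition; the function name is our own, and the actual noise recordings are those cited as [32].

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(E_speech / E_noise) equals
    `snr_db`, then add it to the speech sample-by-sample."""
    e_speech = sum(s * s for s in speech)
    e_noise = sum(n * n for n in noise)
    # gain that brings the noise energy to e_speech / 10^(snr_db/10)
    gain = math.sqrt(e_speech / (10.0 ** (snr_db / 10.0)) / e_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```

Recomputing the residual (mixed minus clean) energy confirms that the achieved global SNR matches the requested level.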
TABLE II: Performance comparison of the proposed technique with the recent techniques

                                 White           Pink            Babble          Buccaneer
SNR   Methods                  PESQ    STOI    PESQ    STOI    PESQ    STOI    PESQ    STOI
-5    Noisy                    1.1953  0.4859  1.3886  0.5161  1.2931  0.5048  1.2587  0.4768
      Proposed method          1.3971  0.4864  1.5463  0.4935  1.3965  0.493   1.36    0.4123
      PASE-DNN                 1.2243  0.4827  1.4176  0.4989  1.2714  0.5025  1.2362  0.3629
      PDA+MMSE-STSA            1.1953  0.488   1.3405  0.516   1.2279  0.483   1.2618  0.4789
      PRVS+MMSE-STSA           1.1342  0.4718  1.3955  0.5147  1.2228  0.4609  1.2211  0.4732
      Baseline PC+MMSE-STSA    1.3116  0.4709  1.4519  0.5183  0.914   0.4457  1.3469  0.4693
-4    Noisy                    1.2165  0.5031  1.4314  0.5374  1.3708  0.5263  1.2868  0.4962
      Proposed method          1.4023  0.5024  1.624   0.5132  1.4799  0.5161  1.4971  0.4833
      PASE-DNN                 1.3396  0.4945  1.5545  0.5302  1.4306  0.5278  1.3142  0.4662
      PDA+MMSE-STSA            1.2195  0.5052  1.3722  0.5373  1.3343  0.4961  1.2893  0.4983
      PRVS+MMSE-STSA           1.158   0.4882  1.5012  0.5335  1.2766  0.486   1.333   0.4912
      Baseline PC+MMSE-STSA    1.3391  0.4905  1.5306  0.5143  1.0809  0.4723  1.391   0.4908
-2    Noisy                    1.2977  0.5392  1.5146  0.5817  1.5736  0.5703  1.3957  0.5372
      Proposed method          1.5438  0.5403  1.8165  0.5554  1.6478  0.5625  1.5961  0.5241
      PASE-DNN                 1.6056  0.5287  1.8225  0.5799  1.6079  0.5719  1.4124  0.5172
      PDA+MMSE-STSA            1.2979  0.5412  1.4594  0.5815  1.5311  0.5452  1.3913  0.5393
      PRVS+MMSE-STSA           1.3077  0.5246  1.6943  0.5715  1.4877  0.5357  1.5174  0.529
      Baseline PC+MMSE-STSA    1.3914  0.5302  1.6696  0.5855  1.3842  0.5265  1.5224  0.5332
 0    Noisy                    1.3564  0.5765  1.6107  0.6269  1.6859  0.6151  1.4743  0.5804
      Proposed method          1.6802  0.5791  2.0573  0.5997  1.849   0.6082  1.7515  0.568
      PASE-DNN                 1.6816  0.5646  2.0659  0.6124  1.8251  0.6167  1.4911  0.5506
      PDA+MMSE-STSA            1.3567  0.5784  1.5587  0.6266  1.6372  0.5952  1.4743  0.5825
      PRVS+MMSE-STSA           1.5082  0.5661  1.883   0.6119  1.6421  0.5834  1.7012  0.5695
      Baseline PC+MMSE-STSA    1.507   0.5692  1.8204  0.6222  1.5996  0.5786  1.6608  0.5279
 2    Noisy                    1.4231  0.6144  1.7168  0.6709  1.7601  0.6589  1.5647  0.6243
      Proposed method          1.8811  0.6053  2.2764  0.6461  1.9755  0.6516  1.9459  0.6134
      PASE-DNN                 1.7872  0.6003  2.2768  0.6502  2.1045  0.6603  1.9629  0.6102
      PDA+MMSE-STSA            1.4251  0.6061  1.6699  0.6706  1.7414  0.6399  1.5647  0.6263
      PRVS+MMSE-STSA           1.691   0.606   2.031   0.651   1.7658  0.628   1.8626  0.6133
      Baseline PC+MMSE-STSA    1.6423  0.6023  1.9329  0.6545  1.755   0.6299  1.7997  0.6113
 4    Noisy                    1.4978  0.6518  1.8343  0.7124  1.8864  0.7002  1.6669  0.6672
      Proposed method          2.1135  0.6533  2.4217  0.6885  2.089   0.6922  2.1472  0.6562
      PASE-DNN                 1.9242  0.6361  2.4865  0.6884  2.1932  0.7015  2.2123  0.6486
      PDA+MMSE-STSA            1.4968  0.6533  1.7925  0.712   1.8663  0.6796  1.6669  0.669
      PRVS+MMSE-STSA           1.8623  0.6412  2.1598  0.6859  1.8934  0.669   2.0032  0.6542
      Baseline PC+MMSE-STSA    1.7795  0.6311  2.0616  0.6862  1.919   0.6756  1.9448  0.6484
 5    Noisy                    1.5395  0.6701  1.8953  0.7317  1.8792  0.7094  1.7228  0.6878
      Proposed method          2.2263  0.6739  2.4581  0.7078  2.1451  0.7106  2.2338  0.6756
      PASE-DNN                 2.0783  0.6542  2.5391  0.7101  2.2583  0.721   2.3612  0.6599
      PDA+MMSE-STSA            1.5391  0.6715  1.8569  0.7313  1.9373  0.7032  1.7228  0.6895
      PRVS+MMSE-STSA           1.9383  0.6571  2.2164  0.7004  1.957   0.6873  2.0982  0.672
      Baseline PC+MMSE-STSA    1.8559  0.6459  2.1233  0.7005  2.1023  0.6891  2.0253  0.6661
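The relative improvements quoted later in the conclusion can be reproduced directly from Table II. As a small sanity check in plain Python (the helper name is our own): for white noise, the proposed method's PESQ gain over the noisy signal ranges from about 16.88% at -5 dB to about 44.61% at 5 dB.

```python
def pesq_improvement_pct(enhanced, noisy):
    """Relative PESQ improvement of an enhanced score over the noisy score."""
    return (enhanced - noisy) / noisy * 100.0

# White-noise PESQ scores from Table II (noisy vs. proposed method)
low = pesq_improvement_pct(1.3971, 1.1953)   # -5 dB SNR -> ~16.88 %
high = pesq_improvement_pct(2.2263, 1.5395)  #  5 dB SNR -> ~44.61 %
```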
Fig. 11: Percentage of correctly identified words derived from DMOS analysis

• On an average, the proposed approach comparatively shows a marginal decrement of only 0.01−0.03 under the buccaneer noise condition, whereas under the babble noise condition, a decrement of only 0.01 is found, in terms of STOI.

It should be noted that all the stages of the proposed algorithm are optimized to be robust to noise in low SNR speech regions (since high SNR speech regions are less affected by noise, their spectral consistency is maintained by keeping them less modified). The results also indicate that, for the proposed approach, the improvement in enhanced speech quality is high especially under low SNR conditions, which is of greater interest in the context of speech enhancement. The proposed algorithm has improved the quality with no further degradation to intelligibility in almost all the cases. However, as said earlier, a marginal decrement in STOI is evident under the buccaneer noise condition. It should also be noted that the proposed technique, and all the techniques considered for comparison, show an occasional increment in STOI with reference to the STOI of the noisy speech. This may be due to the fact that noise suppression has a minimal or even detrimental effect on intelligibility [38].

B. Subjective evaluation using DMOS

Subjective evaluation is carried out using the DMOS score, which evaluates intelligibility as the percentage of correctly identified words (out of the total number of words used for testing). The sentences used for objective evaluation are used for subjective evaluation as well, where the average duration of each sentence is 3 to 5 s. The sentences are played to 23 naive listeners (who are native speakers of Tamil in the age group of 21 to 30 years), in a laboratory environment using a loudspeaker system (a maximum of two times), and the listeners are asked to write down what they heard.

The results indicate that, for both stationary and non-stationary noise conditions, there exists a close overlap between the proposed method and the techniques considered for comparison in almost all the considered situations.

VII. CONCLUSION

The proposed multi-level speech enhancement technique involves estimation of the magnitude and phase spectra from the noisy complex speech spectrum. The proposed algorithm is compared with related techniques that also consider estimating both magnitude and phase spectra for improved signal estimation. Evaluation is carried out under both stationary and non-stationary noise conditions for various SNR levels. The objective PESQ scores indicate that the proposed algorithm achieves an improvement of (16.88−44.61)% and (11.36−30)% respectively for the white and pink noise conditions, and an improvement of (8−14.15)% and (8.04−30)% respectively for the babble and buccaneer noise conditions, with reference to the noisy speech. Similarly, the STOI scores indicate that the proposed algorithm maintains the intelligibility under all the considered situations except the buccaneer noise condition, where the STOI faces a marginal decrement of (1−3)%. The subjective ratings are highly correlated with the objective measures, and they also indicate that the proposed algorithm improves speech quality without any degradation in intelligibility. Despite the remark that neural network-based systems outperform established methods, they require an enormous amount of training data and complex models for improved system performance. Further, it should also be noted that our proposed technique has shown significant improvement in performance, compared to that of the PASE-DNN technique, with less computational complexity.

ACKNOWLEDGEMENT

The authors would like to thank Prof. Yukoh Wakabayashi, Tokyo Metropolitan University, Tokyo, Japan, for sharing their source code with us. The authors would also like to acknowledge the authors of [17], [20], [21] and [24], for their source codes downloaded from the GitHub and MathWorks repositories.

REFERENCES

[1] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[2] M. P. A. Jeeva, T. Nagarajan, and P. Vijayalakshmi, “Discrete cosine transform-derived spectrum-based speech enhancement algorithm using temporal-domain multiband filtering,” IET Signal Processing, vol. 10(8), pp. 965–980, 2016.
[3] ——, “Temporal domain filtering approach for multiband speech enhancement,” International Conference on Microwave, Optical and Communication Engineering (ICMOCE), pp. 385–388, 2015.
[4] ——, “Formant-filters based multi-band speech enhancement algorithm for intelligibility improvement,” National Conference on Communications (NCC), pp. 1–5, 2016.
[5] M. A. A. El-Fattah, M. I. Dessousky, A. M. Abbas, S. M. Diab, E. M. El-Rabaie, W. Al-Nuaimy, S. A. Alshebeili, and F. E. A. El-Samie, “Speech enhancement with an adaptive Wiener filter,” International Journal of Speech Technology, vol. 17(1), pp. 53–64, 2014.
[6] S. R. C. H. You and S. N. Koh, “Audible noise reduction in eigen domain for speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 15(6), pp. 1753–1765, 2007.
[7] F. Asano, S. Hayamizu, T. Yamada, and S. Nakamura, “Speech enhancement based on the subspace method,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 8(5), pp. 497–507, 2000.
[8] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
[9] I. Cohen, “Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator,” IEEE Signal Processing Letters, vol. 9, no. 4, pp. 113–116, 2002.
[10] Y. Xu, J. Du, L. Dai, and C. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23(1), pp. 7–19, 2015.
[11] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24(3), pp. 483–492, 2016.
[12] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[13] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, “Phase processing for single-channel speech enhancement: History and recent advances,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 55–66, 2015.
[14] A. V. Oppenheim and J. S. Lim, “The importance of phase in signals,” Proceedings of the IEEE, vol. 69, pp. 529–541, 1981.
[15] L. D. Alsteris and K. K. Paliwal, “Further intelligibility results from human listening tests using the short-time phase spectrum,” Speech Communication, vol. 48, no. 6, pp. 727–736, 2006.
[16] A. Sugiyama and R. Miyahara, “Phase randomization – a new paradigm for single-channel signal enhancement,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7487–7491, 2013.
[17] K. Paliwal, K. Wójcicki, and B. Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
[18] S. Singh, A. M. Mutawa, M. Gupta, M. Tripathy, and R. S. Anand, “Phase based single-channel speech enhancement using phase ratio,” IEEE 6th International Conference on Computer Applications In Electrical Engineering-Recent Advances (CERA), pp. 393–396, 2017.
[19] R. Maia and Y. Stylianou, “Iterative estimation of phase using complex cepstrum representation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4990–4994, 2016.
[20] M. Krawczyk-Becker and T. Gerkmann, “STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1931–1940, 2014.
[21] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura, and Y. Yamashita, “Single-channel speech enhancement with phase reconstruction based on phase distortion averaging,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 9, pp. 1559–1569, 2018.
[22] J. Kulmer and P. Mowlaee, “Phase estimation in single channel speech enhancement using phase decomposition,” IEEE Signal Processing Letters, vol. 22, no. 5, pp. 598–602, 2015.
[23] P. Mowlaee and J. Kulmer, “Phase estimation in single-channel speech enhancement: Limits-potential,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 8, pp. 1283–1294, 2015.
[24] N. Zheng and X. Zhang, “Phase-aware speech enhancement based on deep neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 1, pp. 63–76, 2019.
[25] N. Reddy and M. Swamy, “Derivative of phase spectrum of truncated autoregressive signals,” IEEE Transactions on Circuits and Systems, vol. 32, no. 6, pp. 616–618, 1985.
[26] L. D. Alsteris and K. K. Paliwal, “Importance of window shape for phase-only reconstruction of speech,” 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 573–576, 2004.
[27] C.-M. Tsai, “Adaptive local power-law transformation for color image enhancement,” Applied Mathematics & Information Sciences, vol. 7, no. 5, pp. 2019–2026, 2013.
[28] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN'95 - International Conference on Neural Networks, 1995, pp. 1942–1948.
[29] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008.
[30] K. Wójcicki, M. Milacic, A. Stark, J. Lyons, and K. Paliwal, “Exploiting conjugate symmetry of the short-time Fourier spectrum for speech enhancement,” IEEE Signal Processing Letters, vol. 15, pp. 461–464, 2008.
[31] P. Mowlaee and R. Saeidi, “Iterative closed-loop phase-aware single-channel speech enhancement,” IEEE Signal Processing Letters, vol. 20(12), pp. 1235–1239, 2013.
[32] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2nd edition, 2013.
[33] J. G. Harris and M. D. Skowronski, “Energy redistribution speech intelligibility enhancement, vocalic and transitional cues,” The Journal of the Acoustical Society of America, vol. 112, no. 5, pp. 2305–2305, 2002.
[34] C. Taal, R. C. Hendriks, and H. Richard, “Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure,” Computer Speech & Language, vol. 28, no. 4, pp. 858–872, 2014.
[35] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[36] “IEEE recommended practice for speech quality measurements,” IEEE Transactions on Audio and Electroacoustics, vol. 17, no. 3, pp. 225–246, 1969.
[37] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4214–4217, 2010.
[38] G. Hilkhuysen, N. Gaubitch, M. Brookes, and M. Huckvale, “Effects of noise suppression on intelligibility: Dependency on signal-to-noise ratios,” Journal of the Acoustical Society of America, vol. 131, no. 1, pp. 531–539, 2012.

T. Lavanya received her M.E. degree in Communication Systems from SSN College of Engineering, Chennai, in 2016. Currently, she is pursuing her Ph.D. degree at the Speech Lab, SSN College of Engineering, Chennai. Her research interests include speech enhancement and speech signal processing.

T. Nagarajan (Member, IEEE) earned his Ph.D. from the Department of Computer Science and Engineering, Indian Institute of Technology, Madras, in the year 2004. Subsequently, he worked as a post-doctoral fellow at INRS-EMT, Montreal, Canada, for two years. He is currently a Professor and heads the Department of Information Technology, SSN College of Engineering, Chennai, India. His areas of research include speech signal processing, continuous speech recognition, statistical parametric speech synthesis, speaker verification and identification, spoken language identification, segmentation of speech into sub-word units, discriminative training techniques, and improved acoustic modeling techniques.

P. Vijayalakshmi (IEEE Member, 2008–15; Senior Member, 2016 onwards), Member of the IEEE Signal Processing Society and Fellow of the IETE, received her M.E. degree in Communication Systems from the National Institute of Technology, Trichy, in the year 1999, and earned her Ph.D. from the Indian Institute of Technology, Madras, in 2007. She worked as a doctoral trainee for a year at INRS-EMT, Montreal, Canada. She is currently a Professor in the Department of Electronics and Communication Engineering, SSN College of Engineering, Chennai, India. Her areas of research include speech signal processing, speech-based assistive technology, speech recognition, speech synthesis and speech enhancement.