
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 28, 2020

Multi-Level Single-Channel Speech Enhancement Using a Unified Framework for Estimating Magnitude and Phase Spectra

Lavanya T, Nagarajan T, Member, IEEE, and Vijayalakshmi P, Senior Member, IEEE

Abstract—Speech enhancement algorithms aim to improve the quality and intelligibility of noise-corrupted speech through spectral or temporal modifications. Most of the existing speech enhancement algorithms achieve this by modifying the magnitude spectrum alone, while keeping the phase spectrum intact. In the current work, both phase and magnitude spectra are modified to enhance noisy speech using a multi-level speech enhancement technique. The proposed phase compensation (PC) function achieves first-level enhancement by modifying the phase spectrum alone. Second-level enhancement performs energy redistribution in the phase compensated speech signal to make weak speech and non-speech regions highly contrastive. Energy redistribution from the energy-rich voiced to the weak unvoiced regions is carried out using the adaptive power law transformation (APLT) technique, with the parameters optimized under a total-energy constraint using the particle swarm optimization algorithm. A log MMSE technique with a novel speech presence uncertainty (SPU) estimation method is proposed for third-level enhancement. The compensated phase spectrum and the magnitude spectrum estimated using log MMSE with the proposed SPU estimation (log MMSE + proposed SPU) are used to reconstruct the enhanced speech signal. The proposed speech enhancement technique is compared with recent speech enhancement techniques that estimate both magnitude and phase, for various noise levels (−5 to +5 dB), in terms of objective and subjective measures. It is observed that the proposed technique improves signal quality and maintains or improves intelligibility under stationary and non-stationary noise conditions.

Index Terms—Energy redistribution, phase compensation, single-channel speech enhancement, speech presence uncertainty.

Manuscript received September 18, 2019; revised January 25, 2020; accepted March 30, 2020. Date of publication April 13, 2020; date of current version May 7, 2020. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Andy W. H. Khong. (Corresponding author: Lavanya T.)
The authors are with the Speech Lab, SSN College of Engineering, Chennai 603110, India (e-mail: lavanyat@ssn.edu.in; nagarajant@ssn.edu.in; vijayalakshmip@ssn.edu.in).
Digital Object Identifier 10.1109/TASLP.2020.2986877
2329-9290 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

SPEECH is a vital medium for communication between human and human, and between human and machine and vice-versa. However, speech tends to lose its naturalness when it is produced or delivered in a noisy environment, resulting in a lack of audibility and a loss of information. Speech under a noisy environment can be enhanced by means of speech enhancement (SE) algorithms, whose performance depends solely on the accurate estimation of the noise spectrum. Conventional speech enhancement algorithms can be broadly classified into the following types: (i) spectral subtraction algorithms that estimate the clean speech spectrum by subtracting the noise spectrum from the noisy speech spectrum [1], [2], (ii) filtering-based approaches that employ a linear estimator to minimize the mean square error between clean and enhanced speech [3]–[5], (iii) subspace-based approaches based on the decomposition of noisy speech into a signal subspace and a noise subspace, where the estimate of the clean signal is obtained by nullifying the noise subspace [6], [7], (iv) statistical approaches that estimate the short-time spectral amplitude of speech, taking into account the uncertainty of signal presence in noisy observations [8], [9], and (v) neural network-based approaches [10], [11]. The widely used statistical approaches offer conventional estimation techniques such as the minimum mean square error-based short-time spectral amplitude estimator (MMSE) [12], the log minimum mean square error-based short-time spectral amplitude estimator (log MMSE) [8], and log MMSE with speech presence uncertainty (SPU) estimation (log MMSE+baseline SPU) [9]. These statistical approaches estimate a gain function, based on the probability of speech absence or presence, to modify the noisy magnitude spectrum, which is then combined with the unaltered phase spectrum during signal reconstruction. This sort of signal reconstruction, using an estimated magnitude and the unaltered noisy phase spectrum, may lead to an inconsistent spectrogram [13]. Therefore, in the current work, both the phase and magnitude spectra are modified to improve the intelligibility and quality of the noisy speech signal.

The importance of phase information has been explored in the reconstruction of images [14] and the enhancement of noisy speech signals [15]. The authors of [16] carried out phase randomization for frequencies that correspond to residual noise that is perceived even after modifying the magnitude spectrum of noisy speech. In the context of speech, it was found that the phase spectrum holds intelligibility information [15], and it has been demonstrated that a longer window size of 20–40 ms during phase spectrum estimation has a major impact on intelligibility improvement. Phase-based enhancement techniques combine the estimated phase spectrum with the existing or modified magnitude spectrum to reconstruct the enhanced speech signal [17]. Several methodologies to estimate the clean phase spectrum (from the noisy speech spectrum) include: (i) defining a gain function proportional to the ratio between the phase spectra of the noisy speech and the noise estimate [18], (ii) estimating phase from the MSE-based complex cepstrum [19], (iii) f0-based phase estimation techniques that estimate the phase spectrum only at voiced speech


areas (the noisy phase is kept unmodified at unvoiced speech areas) [20], [21], (iv) source-filter model-based phase decomposition [22], [23], (v) DNN-based speech enhancement techniques that estimate both magnitude and phase [24], and (vi) phase compensation using the conjugate symmetry property of the Fourier transform [17]. In the current work, a modified phase compensation technique is proposed for the estimation of the phase spectrum.

In phase compensation-based enhancement techniques, researchers have analysed the effect of various window functions and window sizes on phase spectrum estimation and speech intelligibility improvement. The authors of [25], [26] explicitly suggested using a rectangular window for estimating the phase spectrum, with the tacit understanding that using a tapered window would cause loss of information. In [17], the authors studied the effect of mismatched windows (i.e., Hamming and Chebyshev windows for extracting the magnitude and phase spectrum, respectively), and the effective dynamic range (20−40 dB) of the Chebyshev window, in estimating the phase spectrum. It was observed that the use of mismatched windows, and appropriately varying the attenuation factor of the Chebyshev window, resulted in improved speech quality.

In general, the phase compensation function estimates a clean phase spectrum, which, when combined with the corresponding noisy magnitude spectrum, results in a partially clean speech spectrum. In the current work, mismatched windows are used to estimate the phase and magnitude spectra, where a tapered Chebyshev window with an attenuation factor of higher dynamic range (40 dB) is used for estimating the phase information using the proposed phase compensation (proposed PC) technique. Unlike the baseline phase compensation (baseline PC) function [17], which divides the noisy speech into two regions, namely high and low SNR regions, the proposed PC function divides the low SNR region further into unvoiced and speech-pause regions, and treats them separately. This in turn provides scope to prevent the attenuation of weak speech regions.

Further, to obtain a more detailed clean speech spectrum estimate, the magnitude spectrum of noisy speech is processed in the subsequent levels. Prior to magnitude spectrum estimation, in the second level, the phase compensated signal energy is redistributed using the adaptive power law transformation (APLT) [27] (an image enhancement technique). The parameters of APLT are optimized using the particle swarm optimization (PSO) algorithm [28], in such a way that the total energy of the signal is maintained. This kind of redistribution, prior to estimation of the clean magnitude spectrum, is expected to prevent attenuation of weak speech components.

Among the statistics-based magnitude spectrum estimation approaches discussed earlier, log MMSE with the baseline speech presence uncertainty (log MMSE+baseline SPU) estimation [9] is found to achieve effective noise reduction in the noisy speech signal. The performance of this technique improved with frame-wise SPU estimation, rather than considering speech absence to be equiprobable with speech presence [29], or an empirically fixed value [8]. The speech presence uncertainty (SPU) estimator proposed in [9] assigns a large speech absence probability (SAP) value if the neighbouring frequency bins do not contain speech. Additionally, in order to avoid attenuation of weak speech components, even the noise-dominant frames are not given a maximum SAP value of 1. This results in assigning equal speech absence probability to weak speech and speech-pause regions under low SNR conditions. This equitable treatment of these regions may result in attenuation of weak speech regions during noise reduction.

In the current work, to retain the weak speech components even in low SNR conditions, an SPU estimation technique which estimates the SAP values as a function of the RMS value is proposed. Here, the SAP is estimated for the overlapping frames corresponding to the unprocessed noisy magnitude spectrum, and the gain factor obtained is used to modify the magnitude spectrum of the energy-redistributed signal.

Based on the above observations and intuitions, a multi-level speech enhancement technique is proposed as depicted in Fig. 1, where the phase spectrum is estimated using the proposed phase compensation technique, and the magnitude spectrum is estimated using the proposed SPU estimation-based log MMSE technique (log MMSE+proposed SPU). The main objectives of the three levels are given below:
• In the first level, the noisy signal is subjected to phase compensation, where the amount of noise reduction is proportional to the signal energy. This is expected to reduce the noise and prevent the attenuation of weak speech components.
• In the second level, adaptive power law transformation along with a constrained particle swarm optimization algorithm is employed to achieve energy redistribution.
• In the third level, the log MMSE+proposed SPU estimation technique estimates the magnitude of the energy-redistributed pre-processed signal.
The estimated magnitude and phase spectra are combined, and the enhanced signal is reconstructed using the IFFT operation and the overlap-add technique.

The overall performance of the proposed system is compared with recent techniques [17], [20], [21] and [24] that also estimate both magnitude and phase for speech enhancement.

The organisation of this paper is as follows: Section II provides a detailed description of the proposed phase compensation (proposed PC) technique. Section III describes energy redistribution using APLT with constrained optimization. Speech enhancement using log MMSE+proposed SPU is discussed in Section IV. The experimental setup and the performance of the proposed multi-level speech enhancement algorithm under various noise conditions, in comparison with recent techniques, are discussed in Section V and Section VI respectively. Finally, Section VII concludes the paper.

Fig. 1. Block diagram of the proposed multi-level speech enhancement technique.

Let the Fourier transform (FT) derived noisy complex speech spectrum be Y(i, k) = X(i, k) + D(i, k), where X(i, k) and D(i, k) are the speech and noise spectra, respectively. Here, i denotes the frame index with N FFT points, where 0 ≤ k ≤ N − 1. The symmetry property of the FT is exploited here to estimate the clean phase spectrum from Y(i, k) through phase compensation. The phase compensation function is an offset factor that is used to modify the angular spacing between complex conjugate spectral component pairs. The modified angular spacing is expected to be the estimate of the clean phase. Therefore, the estimated phase spectrum, when combined with the estimated magnitude spectrum, may result in an enhanced speech signal.

TABLE I
ANGULAR SPACING (θ̂) VS. MAGNITUDE OF RESULTANT VECTOR (R̂)

Estimation of the phase spectrum using the proposed PC function is illustrated below:
• Let a + ib and a − ib be any conjugate vector pair in the noisy complex speech spectrum, Y(i, k).
• Let θ1 and θ2 be the angles associated with a + ib and a − ib, with reference to the real axis, where θ1 = tan⁻¹(b/a) and θ2 = tan⁻¹(−b/a).
• During the IFFT operation, the magnitude of the resultant vector, R, is computed:

R = √(2(a² + b²)(1 + cos θ))    (1)

Here, θ is the angular spacing between the vectors, given by θ = θ1 + θ2.
• As these conjugate spectral vectors are equidistant from the real axis due to the symmetry property, θ2 = −θ1, leading to a resultant vector with magnitude R = 2M, where

M = √(a² + b²)    (2)

• Let the phase compensation function for the conjugate vectors a ± ib be ±Λ.
• This function is added to the conjugate vectors to generate the modified complex vectors (a + Λ) + ib and (a − Λ) − ib.
• The new angles associated with the modified conjugate vector pair are θ̂1 = tan⁻¹(b/(a + Λ)) and θ̂2 = tan⁻¹(−b/(a − Λ)).
• The angular spacing between the modified complex conjugate spectral vectors is given by θ̂ = θ̂1 + θ̂2.
• It is noticed that the modified angular spacing increases (i.e., θ̂ > θ) when the conjugate vectors a ± ib are offset by the factor ±Λ.
• Table I describes the variation in the resultant vector magnitude, R̂, with respect to θ̂.
• From Table I it is clear that the increase in angular spacing θ̂ causes a decrease in the magnitude R̂ of the resultant of the modified complex vectors, provided |Λ| > M, where M is the magnitude of the unmodified conjugate vector pair.
• When |Λ| < M, the phase compensation function has little or no impact on the angular spacing; therefore, θ̂ ≈ θ.
• The modified angular spacing can take values in the interval [0, π], depending upon the value of Λ (refer to Table I).
• The resultant of the modified complex vectors, R̂, can have a maximum magnitude of 2M, with an angular spacing of 0°, when θ̂ = θ (i.e., when |Λ| < M) (refer to Table I).
• The magnitude of the resultant vector becomes zero when the angular spacing reaches 180° (refer to Table I). That is, the magnitude of the resultant vector can be modified by tuning the angular spacing, θ̂, appropriately.
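The geometric effect described above can be checked numerically. The sketch below (illustrative values only) follows Eqs. (1) and (2) and the convention θ̂ = θ̂1 + θ̂2: a small offset (|Λ| < M) leaves the resultant near 2M, while a large offset (|Λ| > M) shrinks it.

```python
import math

def pair_resultant(a, b, lam):
    """Angular spacing and resultant magnitude for the conjugate pair
    a + ib, a - ib after adding the offsets +lam / -lam to the real parts."""
    theta1 = math.atan2(b, a + lam)            # angle of (a + lam) + ib
    theta2 = math.atan2(-b, a - lam)           # angle of (a - lam) - ib
    theta_hat = theta1 + theta2                # deviation from conjugate symmetry
    M = math.hypot(a, b)                       # Eq. (2)
    R_hat = math.sqrt(2 * M * M * (1 + math.cos(theta_hat)))   # Eq. (1)
    return theta_hat, R_hat

a, b = 3.0, 4.0                 # illustrative conjugate pair, M = 5
M = math.hypot(a, b)

_, R0 = pair_resultant(a, b, 0.0)           # no offset: R = 2M
_, R_small = pair_resultant(a, b, 0.1 * M)  # |Lambda| < M: R_hat stays near 2M
_, R_large = pair_resultant(a, b, 2.0 * M)  # |Lambda| > M: R_hat is attenuated
print(R0 / (2 * M), R_small / (2 * M), R_large / (2 * M))
```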


The baseline phase compensation (baseline PC) function [17], Λ1(i, k), is directly proportional to the noise magnitude spectrum, |D̂(i, k)|, and is given by

Λ1(i, k) = λ Ψ(k) |D̂(i, k)|    (3)

Here, λ is an empirically found constant value and Ψ(k) is an anti-symmetric function given by

Ψ(k) = 1 for 0 ≤ k ≤ N/2 − 1;  0 for k = N/2;  −1 for N/2 + 1 ≤ k ≤ N − 1    (4)

As Λ1(i, k) is directly proportional to |D̂(i, k)|, it attenuates noise regions to a greater extent compared to speech regions. This may attenuate the weak speech regions as well when the signal power is too weak, especially at low SNR conditions. Therefore, the proposed PC function, Λ2(i, k), is framed to modify the angular spacing in such a way that it attenuates the noise components and enhances the signal regions, especially the weak speech regions:

Λ2(i, k) = 1/(Pnoisy(i, k) + Pnoise(i, k)) for 0 ≤ k ≤ N/2 − 1;  0 for k = N/2;  −1/(Pnoisy(i, k) + Pnoise(i, k)) for N/2 + 1 ≤ k ≤ N − 1    (5)

Here, Pnoisy and Pnoise are the noisy signal power and the noise power, respectively. Unlike Λ1(i, k), which is directly proportional to the noise power, the proposed PC function is inversely proportional to the sum of the noisy signal power and the noise power. Λ2(i, k) is made anti-symmetric to imitate the phase spectrum.

In the current work, the noisy speech is classified into three regions, namely: (i) energy-rich voiced regions (R1), (ii) weak unvoiced regions (R2), and (iii) completely noise regions (R3). It is known that noise affects the weak unvoiced regions more than the voiced regions, since the energy of R1 > the energy of R2. At the same time, the energy of R2 ≈ the energy of R3, since they possess similar acoustic characteristics. Therefore, care must be taken to ensure that the weak speech components are retained to a greater extent during noise cancellation. The working of the proposed PC function varies across R1, R2 and R3, as described in detail below.

In R1 regions: In the high SNR regions, R1, as Pnoise << Pnoisy, the proposed PC function is inversely proportional to the noisy signal energy. That is,

Λ2,R1(i, k) = 1/Pnoisy(i, k) for 0 ≤ k ≤ N/2 − 1;  0 for k = N/2;  −1/Pnoisy(i, k) for N/2 + 1 ≤ k ≤ N − 1    (6)

As Λ2,R1(i, k) takes a value close to zero, there is very little variation in θ̂, and the spectral vectors remain nearly symmetric (R̂ ≈ 2M). This is in accordance with the previous findings in [31].

In R2 regions: In the low energy regions, R2, since both the noise and signal power influence the compensation function, Λ2,R2(i, k) = Λ2(i, k). However, as the R2 regions are weak unvoiced regions, Pnoisy,R2 << Pnoisy,R1, making the denominator (Pnoisy + Pnoise)R2 < Pnoisy,R1. Therefore, Λ2,R2(i, k) > Λ2,R1(i, k), 0° < θ̂ < 180°, and 0 < R̂ < 2M. Thus, the phase compensation function is expected to attenuate noise components in these regions and preserve speech components.

In R3 regions: Since Pnoise ≈ Pnoisy in the R3 regions, the phase compensation function reduces to

Λ2,R3(i, k) = 1/(2 Pnoise(i, k)) for 0 ≤ k ≤ N/2 − 1;  0 for k = N/2;  −1/(2 Pnoise(i, k)) for N/2 + 1 ≤ k ≤ N − 1    (7)

It is clear that Λ2,R3(i, k) > Λ2,R2(i, k). As discussed earlier, the higher the value of the compensation function, the higher the angular spacing and the smaller the resultant magnitude. Therefore, R3 regions experience much attenuation in terms of the resultant magnitude, as expected. Following this description, it can be concluded that, for any noisy speech signal, Λ2,R1 < Λ2,R2 < Λ2,R3. Therefore, the proposed PC technique is expected to remove noise and preserve the weak speech regions as well.

Further, it can be noted that in the R1 and R2 regions, Λ2(i, k) is inversely proportional to Pnoisy(i, k) and to Pnoisy(i, k) + Pnoise(i, k), respectively. We know that Pnoisy(i, k) >> |D̂(i, k)| and Pnoisy(i, k) + Pnoise(i, k) > |D̂(i, k)|, respectively, in the R1 and R2 regions. Therefore, Λ1(i, k) > Λ2(i, k) or Λ1(i, k) ≈ Λ2(i, k) in R1, and Λ1(i, k) >> Λ2(i, k) in R2 regions. Since Pnoise(i, k) ≈ |D̂(i, k)| in the R3 regions, the following three cases can exist: (i) when Pnoise(i, k) > 1, Λ2(i, k) achieves less noise reduction compared to Λ1(i, k); (ii) when Pnoise(i, k) < 1, Λ2(i, k) achieves greater noise reduction compared to Λ1(i, k); (iii) when Pnoise(i, k) = 1, Λ1(i, k) and Λ2(i, k) attain the same amount of attenuation.

The significance of the proposed PC function, Λ2, in enhancing the noisy speech signal compared to the baseline PC function, Λ1, can be explained using vector diagrams (refer to Fig. 2), as given below.

In Fig. 2-(i), (ii) and (iii), Vact and V∗act are the actual complex conjugate spectral vectors of the noisy speech signal, and the resultant vector magnitude is given by Re[Vact + V∗act]. V1, V1∗ and V2, V2∗ are the modified vectors, generated by adding the actual vectors to the baseline PC (Λ1) and the proposed PC (Λ2) functions, respectively.
• In the R1 region (refer to Fig. 2-(i)), since Λ1 ≥ Λ2, both the baseline PC and proposed PC functions mildly attenuate the vectors; therefore, both Re[V2 + V2∗] and Re[V1 + V1∗] are approximately equal to or slightly less than the actual magnitude of the noisy speech, Re[Vact + V∗act].
• In the R2 region (refer to Fig. 2-(ii)), since Λ1 >> Λ2, the attenuation provided by the baseline PC method is high compared to the proposed PC technique. Hence, in the R2 region, Re[Vact + V∗act] > Re[V2 + V2∗] > Re[V1 + V1∗].


Fig. 2. Phasor diagrams representing phase compensation at (i) energy-rich voiced region (R1 ), (ii) unvoiced region (R2 ), and (iii) completely noise region (R3 ).

Therefore, it is clear that the proposed PC enhances the weak speech regions, unlike the baseline PC.
• In the R3 region (refer to Fig. 2-(iii)) (for Pnoise(i, k) < 1), Λ2 > Λ1, and Re[Vact + V∗act] > Re[V1 + V1∗] > Re[V2 + V2∗]. Therefore, in this case, the proposed method attenuates the noise regions to a greater extent, compared to the baseline technique.

In the baseline PC method, |D̂(i, k)| (refer to Eq. (3)) is estimated from the initial frames and is kept constant throughout the utterance. However, this may not be suitable for non-stationary noise conditions. To overcome this issue, in the current work, the adaptive (Martin's) noise estimation algorithm [32] is adopted for frame-level noise power computation.

The magnitude and phase spectra of noisy speech are derived using a Hamming window and a Chebyshev window (with a 40 dB attenuation factor), respectively. Signals at a sampling rate of 8 kHz (unless otherwise specified) are used throughout this work, and the frame size is chosen to be 32 ms [17] with 25% overlap. The proposed PC function, Λ2(i, k), is added to the noisy complex speech spectrum, Y(i, k), to generate the modified complex spectrum, Ymod(i, k):

Ymod(i, k) = Y(i, k) + Λ2(i, k)    (8)

The phase spectrum of Ymod is expected to be the estimate of the clean phase spectrum, θ̂(i, k):

θ̂(i, k) = ∠Ymod(i, k)    (9)

The phase compensated complex speech spectrum, Y1 (a partially clean spectrum), is obtained by combining the estimated phase spectrum with the unmodified noisy magnitude spectrum:

Y1(i, k) = |Y(i, k)| e^(j θ̂(i, k))    (10)

The phase-compensated time-domain signal, y1(n), is reconstructed from Y1(i, k) using the IFFT operation. Since Y1(i, k) exhibits no symmetry, the IFFT operation may yield a complex number. Therefore, to reconstruct the real-valued time-domain signal, the imaginary part is discarded:

y1(n) = Re(IFFT(Y1(i, k)))    (11)

From Fig. 3, it is clear that the proposed PC technique achieves better enhancement than the baseline PC, especially in the consonant-vowel-consonant (CVC) regions. Fig. 4 depicts the spectrographic representation of the noisy speech signal enhanced by the baseline PC and proposed PC techniques. This clearly shows that the proposed method attenuates noise and preserves the weak speech regions.
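A minimal sketch of this first-level processing (Eqs. (5), (8)–(11)) for a single frame is given below, assuming a per-bin noise power estimate is already available (the paper obtains it with Martin's estimator [32]); function and variable names are illustrative.

```python
import numpy as np

def proposed_pc_frame(y_frame, noise_power, eps=1e-12):
    """One-frame sketch of the proposed PC technique (Eqs. (5), (8)-(11)).

    y_frame:     windowed time-domain frame (length N, N even)
    noise_power: per-bin noise power estimate (assumed given here)
    """
    N = len(y_frame)
    Y = np.fft.fft(y_frame)               # noisy complex spectrum Y(i, k)
    P_noisy = np.abs(Y) ** 2              # noisy signal power per bin

    # Eq. (5): anti-symmetric compensation, inversely proportional to
    # P_noisy + P_noise; eps guards against division by zero.
    mag = 1.0 / (P_noisy + noise_power + eps)
    psi = np.zeros(N)
    psi[: N // 2] = 1.0                   # bins 0 .. N/2 - 1
    psi[N // 2 + 1:] = -1.0               # bins N/2 + 1 .. N - 1 (k = N/2 stays 0)
    Y_mod = Y + psi * mag                 # Eq. (8)

    theta_hat = np.angle(Y_mod)           # Eq. (9): estimated clean phase
    Y1 = np.abs(Y) * np.exp(1j * theta_hat)   # Eq. (10): noisy magnitude, new phase
    return np.real(np.fft.ifft(Y1))       # Eq. (11): keep the real part

# Illustration on a random frame standing in for windowed noisy speech:
rng = np.random.default_rng(0)
y1 = proposed_pc_frame(rng.standard_normal(64), np.full(64, 0.5))
```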


Fig. 3. Consonant-Vowel-Consonant (CVC) transition segments, (I) /nd-aa-n/ and (II) /ch-a-th/, enhanced by the proposed PC technique, where the noisy signal is corrupted by white noise at 0 dB SNR level: (a) clean speech segment, (b) noisy speech segment; noisy speech segment enhanced by (c) the baseline PC technique, and (d) the proposed PC technique.

Fig. 4. Spectrogram of: (a) clean speech, (b) noisy speech signal corrupted by white noise at 0 dB, (c) speech enhanced by the baseline PC, and (d) speech enhanced by the proposed PC.

Following the phase estimation using the proposed PC technique, estimation of the magnitude spectrum is carried out, which involves a pre-processing approach discussed below.

III. ENERGY REDISTRIBUTION WITH CONSTRAINED OPTIMIZATION USING APLT AND PSO

Most speech enhancement algorithms solely depend on the estimation of the clean magnitude spectrum from the noisy speech spectrum [12], [8], [29]. Decreased energy-level contrast between the noise and the weak speech regions may cause them to get attenuated equally during noise cancellation. This may lead to decreased overall speech quality and intelligibility.

A few consonant classes, such as unvoiced stops, fricatives, etc., are inherently low-energy sound units. Under noisy conditions, these units suffer further degradation in energy and become indistinguishable from the noise segments. Therefore, prior to magnitude estimation, the energy level of weak speech regions is increased, so that they become highly contrastive with the noise regions. Since speech is non-stationary, uniformly amplifying frames of different SNR levels may cause loudness to reach the level of discomfort [33], [34]. Therefore, the proposed technique redistributes energy between energy-rich (high SNR) and weak-energy (low SNR) regions, with a constraint that the total energy of the signal is maintained. Noise-only frames are found with the help of a threshold computed from the average noise energy estimated from the initial frames of the phase compensated signal, such that energy redistribution to these frames is avoided.

In the current work, energy redistribution in the phase compensated signal is carried out using the adaptive power law transformation (APLT) technique [27], discussed below.

The phase compensated signal, y1(n), is segmented with a frame size of 32 ms duration [30] with 8 ms overlap using a rectangular window. The APLT function for energy redistribution is given below:

Enew(m) = c [1 + k d(m)] Eold(m)^([1 − k d(m)] γ)    (12)

Here, d(m) is the deviation between the old energy, Eold(m), and the mean energy value, μ, of y1(n):

d(m) = Eold(m) − μ    (13)

Eold(m) = Σ_{n=0}^{N−1} y1²(n)    (14)

μ = (1/J) Σ_{m=1}^{J} Eold(m)    (15)

Here, m, n, J and N denote the frame index, sample index, total number of frames in a signal, and total number of samples in a frame, respectively. k is a constant set to 0.001, an empirically chosen value. The APLT function is framed in such a way that it compensates for the deviation, d(m), between Eold(m) and μ, by transforming Eold(m) to a new value, Enew(m) (refer to Eq. (12)). When the deviation d(m) is high, the transformed energy value Enew(m) will be greater than Eold(m), and vice-versa. The energy-controlling parameters, c and γ (refer to Eq. (12)), can be used to control the degree to which the energy Enew(m) is increased or decreased. Therefore, constrained optimization using particle swarm optimization (PSO) is carried out to estimate c and γ for every frame, in such a way that the total energy of the signal is


Fig. 5. Flow diagram of proposed energy redistribution with constrained optimization.

Fig. 6. An illustration of energy redistribution using APLT with constrained optimization.

maintained, as given below:

Σ_{m=1}^{J} Enew(m) = Σ_{m=1}^{J} Eold(m)    (16)

The c and γ values for every frame are initialized with values chosen in the interval [0, 1]. The energy-modified signal, y2(n), is given by

[y2(n)]_m = [y1(n)]_m √(Enew(m)/Eold(m))    (17)

As discussed earlier, in order to prevent energy from being redistributed to noise-only regions, a noise-power-based threshold value is used. In the noise-dominant frames (i.e., when the energy lies below the threshold), the energy is replaced with a small value (say, 0.0001). Subsequently, when the energy of the current frame is greater than the threshold value, it is replaced by the newly found energy value, Enew(m). Therefore, the proposed energy redistribution technique with constrained optimization redistributes energy from the high-energy voiced to the low-energy unvoiced regions, and assigns a fixed value (here, 0.0001) to speech-pause regions. The flow diagram depicted in Fig. 5 describes the working of the proposed energy redistribution technique. Fig. 6 shows the signal
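The APLT update of Eqs. (12)–(17) can be sketched as below. For brevity, fixed c and γ plus a closing renormalization stand in for the paper's per-frame constrained PSO search, and the noise threshold is a hypothetical placeholder rather than the initial-frame noise estimate.

```python
import numpy as np

def aplt_redistribute(frames, c=1.0, gamma=0.9, k=0.001, floor=1e-4):
    """Sketch of APLT energy redistribution (Eqs. (12)-(17)).

    frames: 2-D array, one time-domain frame per row. Fixed c and gamma
    plus a final renormalization replace the per-frame PSO search.
    """
    E_old = np.sum(frames ** 2, axis=1)            # Eq. (14), per frame
    mu = E_old.mean()                              # Eq. (15): mean frame energy
    d = E_old - mu                                 # Eq. (13)
    E_new = c * (1 + k * d) * E_old ** ((1 - k * d) * gamma)   # Eq. (12)

    # Noise-dominant frames get a fixed small energy (the paper's 0.0001).
    threshold = 0.5 * mu                           # placeholder threshold
    E_new = np.where(E_old < threshold, floor, E_new)

    # Enforce the total-energy constraint (Eq. (16)) by renormalizing.
    E_new *= E_old.sum() / E_new.sum()

    scale = np.sqrt(E_new / np.maximum(E_old, 1e-12))   # Eq. (17)
    return frames * scale[:, None], E_new

rng = np.random.default_rng(1)
frames = rng.standard_normal((10, 64))
out, E_new = aplt_redistribute(frames)
```

Scaling each frame by the square root of the energy ratio makes the new frame energy exactly Enew(m), so the renormalization step preserves the signal's total energy.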


Fig. 7. Effect of energy redistribution at high SNR (indicated by rectangle) and low SNR (indicated by ellipse) regions:- (a) Phase compensated signal, (b) Energy
redistributed signal.

energy redistribution carried out using the proposed technique for a phase compensated signal, whereas Fig. 7 shows the signals before and after the proposed energy redistribution. In Fig. 7, a high SNR region (marked in red, using a rectangle) and a low SNR region (marked in green, using an ellipse), corresponding to the phase compensated and energy redistributed signals under the 0 dB white noise condition, are shown. It can be inferred from Fig. 7 that the energy of the marked low SNR region of the phase compensated signal is improved in terms of its amplitude after energy redistribution, and vice-versa for the high SNR region. From the energy redistributed signal, y2(n), magnitude estimation is carried out using log MMSE with the proposed SPU estimation (log MMSE+proposed SPU), as explained in the following section.

IV. LOG MMSE WITH PROPOSED SPEECH PRESENCE UNCERTAINTY ESTIMATION TECHNIQUE

log MMSE, a well-established error minimization technique, is used in the current work, along with the proposed speech presence uncertainty (proposed SPU) estimation technique, to estimate the clean magnitude spectrum. The proposed technique (log MMSE+proposed SPU) estimates the magnitude spectrum from the pre-processed (phase compensated and energy redistributed) signal, y2(n). Here, the magnitude spectrum is extracted using a modified Hanning window function [35].

Let Y2(i, k) be the spectrum of the pre-processed signal. Since the noisy magnitude spectrum is used in reconstructing the phase compensated signal, the pre-processed signal will still contain some amount of residual noise, i.e., Y2(i, k) = X2(i, k) + D2(i, k), where X2(i, k) and D2(i, k) are the speech and noise spectra, respectively, of the pre-processed signal. The log MMSE+baseline SPU based spectral amplitude estimate [9] is given by

Â(i, k) = G(i, k) |Y2(i, k)|    (18)

where G is the gain function for the estimator [9]. Here, i denotes the frame index and k denotes the frequency bin index with N FFT points. The spectral gain function, G, is the product of hypothetical gains corresponding to the signal presence and absence probabilities (say, G1 and G2 respectively), which is given by G = G1^p(i,k) · G2^(1−p(i,k)) [9]. Here, G2 is a lower-bound threshold value, assigned to prevent the gain function G from reducing to zero when the speech absence probability reaches its maximum. The gain value associated with speech presence is given by

G1 = (ξ(i, k)/(1 + ξ(i, k))) exp((1/2) ∫_{v(i,k)}^{∞} (e^{−t}/t) dt)    (19)

where

v(i, k) ≜ γ(i, k) ξ(i, k)/(1 + ξ(i, k))

The a priori SNR, ξ(i, k), and the a posteriori SNR, γ(i, k), are given by:

ξ(i, k) ≜ λX2(i, k)/λD2(i, k)    (20)

γ(i, k) ≜ |Y2(i, k)|²/λD2(i, k)    (21)

To prevent attenuation of weak-SNR speech regions during noise removal, in the proposed work, ξmin is set to a maximum value of 1 (in [9], it is set to 0.1).

The conditional speech presence probability, p(i, k), is given by

p(i, k) = {1 + (q(i, k)/(1 − q(i, k))) (1 + ξ(i, k)) exp(−v(i, k))}^{−1}    (22)

where q(i, k) is the speech absence probability.

In [9], it is stated that the conventional log MMSE with a modified speech presence uncertainty estimator (log MMSE+baseline


SPU) achieved greater noise reduction especially at low SNR levels. It estimates the speech absence probability q(i, k) of a frame by considering contextual signal characteristics at the preceding, succeeding, and current frames. However, this may assign a maximum speech absence probability (SAP) to weak sound units that are surrounded either by silence or by a weak sound unit, or by both. Therefore, the baseline technique may attenuate weak speech and the speech pause regions equally. To overcome this, in the current work, speech frames are classified into 4 classes based on the RMS level, namely, (i) very low (VL), (ii) low (L), (iii) mid (M) and (iv) high (H). Each level is assigned an empirically chosen weight (W). The RMS ranges defined for the four classes are:
1) H: RMS ≥ 0 dB,
2) M: −5 dB ≤ RMS < 0 dB,
3) L: −7 dB < RMS < −5 dB, and
4) VL: RMS ≤ −7 dB.
The corresponding empirically chosen weights associated with the VL, L, M and H regions are 0, 0.35, 0.98 and 1, respectively. The proposed SPU estimation technique estimates the q(i, k) value for every ith frame in the noise corrupted speech signal as given below:

q(i, k) = (1/N) Σ_{k=0}^{N−1} |Y(i, k)|² ∗ W(i)   (23)

where |Y(i, k)| is the magnitude spectrum of the noisy speech signal, and W(i) is the weight factor associated with the ith frame.
In Fig. 8, the speech absence probability (SAP), for a speech signal corrupted by white noise at −5 dB SNR, computed by the baseline SPU and the proposed SPU estimation techniques, is plotted. For better understanding, the SAP values are plotted against the corresponding clean speech.

Fig. 8. Speech absence probability computed for a noisy signal corrupted at −5 dB SNR white noise condition, using the proposed technique (indicated by dashed red line), and the baseline technique (indicated by solid green line).

Comparison is made between the signals enhanced by the log MMSE+baseline SPU (directly applied on the noisy signal) and the log MMSE+proposed SPU estimation technique (applied on the pre-processed signal). Fig. 9 indicates that the speech enhanced by the proposed technique effectively recovers most of the sound units (especially the weak energy regions), when compared to the baseline technique. Here, the magnitude spectrum estimated using the log MMSE+proposed SPU estimation technique, and the phase spectrum extracted from the phase compensated signal (again using the modified Hanning window function [35]), are combined to reconstruct the enhanced speech spectrum, X̂(i, k).

X̂(i, k) = Â(i, k) ∗ exp(j ∗ θ̂(i, k))   (24)

The enhanced speech signal, x̂(n), is reconstructed by taking the inverse fast Fourier transform (IFFT).

x̂(n) = Re(IFFT(X̂(i, k)))   (25)

Spectrographic representation, comparing the techniques involved in the proposed multi-level speech enhancement with their existing counterparts, is shown in Fig. 10. It is inferred that the proposed technique achieves greater noise reduction along with effective restoration of speech components.

V. EXPERIMENTAL SETUP

As discussed earlier, the overall performance of the proposed system is compared with the techniques namely, (a) PASE-DNN [24], (b) PDA+MMSE-STSA [21], (c) PRVS+MMSE-STSA [20], and (d) Baseline PC+MMSE-STSA [17]. In techniques (b), (c) and (d), for magnitude spectrum estimation, the conventional MMSE-STSA technique [12] is used. Here, PASE-DNN [24] employs a 3-layered feed-forward architecture, where the ideal amplitude mask (IAM) and instantaneous frequency deviation (IFD) are set to be the training targets to estimate both magnitude and phase from the noisy speech signal.

A. Speech Corpus

The proposed speech enhancement algorithm based on the estimate of phase and magnitude spectra is evaluated on noisy speech signals in Tamil and English. Literary Tamil (LT) and Colloquial Tamil (CT) speech data are collected from native Tamil speakers, in the age group of 20–30 years, using a hand-held microphone in a laboratory environment. In order to evaluate the algorithm for English data, the IEEE sentence database [36], spoken by a native speaker of English, is used. Nasalized English sentences spoken by a non-native English speaker, recorded in a laboratory environment, are also used for evaluation. Data from various sources are considered for evaluation to check the effectiveness of the proposed algorithm in handling the linguistic variations under various noise conditions. 10 sentences each, for the LT, CT, and IEEE databases, and 5 nasal sentences (35 sentences in total) are used for evaluating the algorithm. The speech signals considered here are originally sampled at 48 kHz, and down-sampled to 8 kHz. For training PASE-DNN, one hour of Tamil speech corpus and 30 minutes of the TIMIT speech corpus are used.

B. Noisy Speech Data

Both stationary and non-stationary noise conditions are considered for evaluation. Stationary noises such as white and pink noise, and non-stationary noises including babble and buccaneer noises [32], are considered. These noises are used for corrupting the speech signals at −5, −4, −2, 0, 2, 4,


Fig. 9. Comparing the baseline and proposed SPU estimation techniques:- (a) clean signal, (b) noisy signal at 0 dB white noise condition, (c) noisy signal
enhanced using log MMSE+baseline SPU technique, and (d) energy redistributed signal enhanced using log MMSE+proposed SPU technique.

and 5 dB SNR levels. In total, 980 sentences (35*7*4) are used for objective and subjective evaluation of the algorithm.

VI. PERFORMANCE EVALUATION

A. Objective Evaluation of Quality and Intelligibility

Perceptual evaluation of speech quality (PESQ) [29] and short-time objective intelligibility (STOI) [37] measures are used for evaluating the proposed algorithm and the other techniques used for comparison, in terms of quality and intelligibility respectively. The PESQ and STOI metrics provide a score in the range of [−0.5, 4.5] and [0, 1], respectively. A higher PESQ or STOI score indicates better performance.
The PESQ and STOI scores obtained for the various noise conditions are tabulated in Table II, and the following inferences are made for all the SNR levels considered.
1) PESQ-Based Evaluation:
• Both under stationary and non-stationary noise conditions, the PESQ scores indicate that the proposed algorithm achieved a higher improvement in quality compared to all the techniques under all the considered situations, except PASE-DNN.
• The PESQ scores obtained by PASE-DNN at high SNR levels (> 0 dB) are only 0.05−0.15 higher than those of the proposed technique. However, the proposed technique leads PASE-DNN in terms of PESQ under all low SNR conditions (≤ 0 dB), for all noise conditions.
2) STOI-Based Evaluation:
• The STOI scores indicate that the proposed technique maintains the intelligibility to a greater extent, especially at low SNR conditions, except under non-stationary noise conditions.
• On an average, the proposed approach shows a marginal decrement of only 0.01−0.03 under the buccaneer noise condition, whereas under the babble noise condition, a decrement of only 0.01 is found, in terms of STOI.
It should be noted that all the stages of the proposed algorithm are optimized to be robust to noise at low SNR speech regions (since high SNR speech regions are less affected by noise, their spectral consistency is maintained by keeping them less modified). The results also indicate that, for the proposed approach, the improvement in enhanced speech quality is high especially under low SNR conditions, which is of greater interest in the context of speech enhancement. The proposed algorithm has improved the quality with no further degradation to

Fig. 10. Spectrographic representation of proposed multi-level speech enhancement technique:- (a) clean signal, (b) noisy signal corrupted by white noise at 0 dB SNR level, (c) baseline PC, (d) proposed PC, (e) APLT-based energy redistribution with constrained optimization using PSO, (f) log MMSE+baseline SPU, (g) log MMSE+proposed SPU.
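The last stages compared in Fig. 10 end with the reconstruction step of eqs. (24) and (25): the estimated magnitude Â(i, k) is recombined with the phase θ̂(i, k) of the phase compensated signal and inverted. A minimal sketch for a single short frame, using made-up spectral values (a real implementation would use FFT-sized frames and overlap-add synthesis):

```python
import cmath
import math

def idft_real(X):
    """x̂(n) = Re(IDFT(X̂(k))), as in eq. (25); a direct O(N^2) inverse DFT."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

# Hypothetical estimates for one short frame (N = 4 bins):
A_hat = [1.0, 0.5, 0.25, 0.5]        # estimated magnitudes, Â(i, k)
theta_hat = [0.0, 0.3, -0.3, 1.1]    # compensated phases, θ̂(i, k)

# Eq. (24): recombine magnitude and phase into the enhanced spectrum.
X_hat = [a * cmath.exp(1j * t) for a, t in zip(A_hat, theta_hat)]

# Eq. (25): back to the time domain, keeping only the real part.
x_hat = idft_real(X_hat)
```

Taking only the real part, as eq. (25) does, discards any residual imaginary component introduced by estimating magnitude and phase independently.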


TABLE II
PERFORMANCE COMPARISON OF THE PROPOSED TECHNIQUE WITH THE RECENT TECHNIQUES

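Table II reports absolute PESQ and STOI scores; the relative improvements quoted in the conclusion (e.g. (16.88–44.61)% under white noise) follow from comparing an enhanced score against the corresponding noisy score. A small helper, with hypothetical score values for illustration:

```python
def pct_improvement(enhanced, noisy):
    """Relative improvement of an objective score over the noisy baseline, in %."""
    return 100.0 * (enhanced - noisy) / noisy

# Hypothetical PESQ scores for one noise condition and SNR level.
pesq_noisy, pesq_enhanced = 1.30, 1.85
print(f"PESQ improvement: {pct_improvement(pesq_enhanced, pesq_noisy):.2f}%")
# prints: PESQ improvement: 42.31%
```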

intelligibility in almost all the cases. However, as said earlier, a marginal decrement in STOI is evident under the buccaneer noise condition. It should also be noted that the proposed and all the techniques considered for comparison show occasional increments in STOI with reference to the STOI of the noisy speech. This may be due to the fact that noise suppression has a minimum or detrimental effect on intelligibility [38].

B. Subjective Evaluation Using DMOS

Subjective evaluation is carried out using the DMOS score, which evaluates intelligibility as the percentage of correctly identified words (out of the total number of words used for testing). The sentences used for objective evaluation are used for subjective evaluation as well, where the average duration of each sentence is 3 to 5 s. The sentences are played to 23 naive listeners (who are native speakers of Tamil in the age group of 21 to 30 years), in a laboratory environment using a loudspeaker system (for a maximum of two times), and the listeners are asked to write down what they heard.
The results indicate that, both for stationary and non-stationary noise conditions, there exists a close overlap between the proposed method and the techniques considered for comparison in almost all the considered situations.

Fig. 11. Percentage of correctly identified words derived from DMOS analysis.

VII. CONCLUSION

The proposed multi-level speech enhancement technique involves estimation of the magnitude and phase spectra from the noisy complex speech spectrum. The proposed algorithm is compared with related techniques that also consider estimating both magnitude and phase spectra for improved signal estimation. Evaluation is carried out under both stationary and non-stationary noise conditions for various SNR levels. The objective PESQ scores indicate that the proposed algorithm achieves an improvement of (16.88–44.61)% and (11.36–30)% respectively for white and pink noise conditions, and an improvement of (8–14.15)% and (8.04–30)% respectively for babble and buccaneer noise conditions, with reference to the noisy speech. Similarly, the STOI scores indicate that the proposed algorithm maintains the intelligibility under all the considered situations except under the buccaneer noise condition, where the STOI faces a marginal decrement of (1–3)%. The subjective ratings are highly correlated with the objective measures and also indicate that the proposed algorithm improves speech quality without any degradation in intelligibility. Despite the remark that neural network-based systems outperform established methods, they require an enormous amount of training data and complex models for improved system performance. Further, it should also be noted that our proposed technique has shown significant improvement in performance, compared to that of the PASE-DNN technique, with less computational complexity.

ACKNOWLEDGMENT

The authors would like to thank Prof. Yukoh Wakabayashi, Tokyo Metropolitan University, Tokyo, Japan, for sharing their source code with us. The authors would also like to acknowledge the authors of [17], [20], [21] and [24], for their source codes downloaded from the GitHub and MathWorks repositories.

REFERENCES

[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, Apr. 1979.
[2] M. P. A. Jeeva, T. Nagarajan, and P. Vijayalakshmi, "Discrete cosine transform-derived spectrum-based speech enhancement algorithm using temporal-domain multiband filtering," IET Signal Process., vol. 10, no. 8, pp. 965–980, 2016.
[3] M. P. A. Jeeva, T. Nagarajan, and P. Vijayalakshmi, "Temporal domain filtering approach for multiband speech enhancement," in Proc. Int. Conf. Microw., Opt. Commun. Eng., 2015, pp. 385–388.
[4] M. P. A. Jeeva, T. Nagarajan, and P. Vijayalakshmi, "Formant-filters based multi-band speech enhancement algorithm for intelligibility improvement," in Proc. Nat. Conf. Commun., 2016, pp. 1–5.
[5] M. A. A. El-Fattah et al., "Speech enhancement with an adaptive Wiener filter," Int. J. Speech Technol., vol. 17, no. 1, pp. 53–64, 2014.
[6] S. R. C. H. You and S. N. Koh, "Audible noise reduction in Eigen domain for speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1753–1765, Aug. 2007.
[7] F. Asano, S. Hayamizu, T. Yamada, and S. Nakamura, "Speech enhancement based on the subspace method," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 8, no. 5, pp. 497–507, Sep. 2000.
[8] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443–445, Apr. 1985.
[9] I. Cohen, "Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator," IEEE Signal Process. Lett., vol. 9, no. 4, pp. 113–116, Apr. 2002.
[10] Y. Xu, J. Du, L. Dai, and C. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 7–19, Jan. 2015.
[11] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 3, pp. 483–492, Mar. 2016.
[12] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[13] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, "Phase processing for single-channel speech enhancement: History and recent advances," IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, Mar. 2015.
[14] A. V. Oppenheim and J. S. Lim, "The importance of phase in signals," Proc. IEEE, vol. 69, no. 5, pp. 529–541, May 1981.

[15] L. D. Alsteris and K. K. Paliwal, "Further intelligibility results from human listening tests using the short-time phase spectrum," Speech Commun., vol. 48, no. 6, pp. 727–736, 2006.
[16] A. Sugiyama and R. Miyahara, "Phase randomization—A new paradigm for single-channel signal enhancement," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2013, pp. 7487–7491.
[17] K. Paliwal, K. Wójcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Commun., vol. 53, no. 4, pp. 465–494, 2011.
[18] S. Singh, A. M. Mutawa, M. Gupta, M. Tripathy, and R. S. Anand, "Phase based single-channel speech enhancement using phase ratio," in Proc. IEEE 6th Int. Conf. Comput. Appl. Elect. Eng.-Recent Adv., 2017, pp. 393–396.
[19] R. Maia and Y. Stylianou, "Iterative estimation of phase using complex cepstrum representation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 4990–4994.
[20] M. Krawczyk-Becker and T. Gerkmann, "STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1931–1940, Dec. 2014.
[21] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura, and Y. Yamashita, "Single-channel speech enhancement with phase reconstruction based on phase distortion averaging," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 9, pp. 1559–1569, Sep. 2018.
[22] J. Kulmer and P. Mowlaee, "Phase estimation in single channel speech enhancement using phase decomposition," IEEE Signal Process. Lett., vol. 22, no. 5, pp. 598–602, May 2015.
[23] P. Mowlaee and J. Kulmer, "Phase estimation in single-channel speech enhancement: Limits-potential," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 8, pp. 1283–1294, Aug. 2015.
[24] N. Zheng and X. Zhang, "Phase-aware speech enhancement based on deep neural networks," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 1, pp. 63–76, Jan. 2019.
[25] N. Reddy and M. Swamy, "Derivative of phase spectrum of truncated autoregressive signals," IEEE Trans. Circuits Syst., vol. 32, no. 6, pp. 616–618, Jun. 1985.
[26] L. D. Alsteris and K. K. Paliwal, "Importance of window shape for phase-only reconstruction of speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2004, pp. 573–576.
[27] C. M. Tsai, "Adaptive local power-law transformation for color image enhancement," Appl. Math. Inf. Sci., vol. 7, no. 5, pp. 2019–2026, 2013.
[28] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proc. ICNN'95 - Int. Conf. Neural Netw., 1995, pp. 1942–1948.
[29] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, Jan. 2008.
[30] K. Wójcicki, M. Milacic, A. Stark, J. Lyons, and K. Paliwal, "Exploiting conjugate symmetry of the short-time Fourier spectrum for speech enhancement," IEEE Signal Process. Lett., vol. 15, pp. 461–464, 2008.
[31] P. Mowlaee and R. Saeidi, "Iterative closed-loop phase-aware single-channel speech enhancement," IEEE Signal Process. Lett., vol. 20, no. 12, pp. 1235–1239, Dec. 2013.
[32] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, 2013.
[33] J. G. Harris and M. D. Skowronski, "Energy redistribution speech intelligibility enhancement, vocalic and transitional cues," J. Acoust. Soc. Amer., vol. 112, no. 5, pp. 2305–2305, 2002.
[34] C. Taal, R. C. Hendriks, and H. Richard, "Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure," Comput. Speech Lang., vol. 28, no. 4, pp. 858–872, 2014.
[35] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.
[36] "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoustics, vol. 17, no. 3, pp. 225–246, Sep. 1969.
[37] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2010, pp. 4214–4217.
[38] G. Hilkhuysen, N. Gaubitch, M. Brookes, and M. Huckvale, "Effects of noise suppression on intelligibility: Dependency on signal-to-noise ratios," J. Acoust. Soc. Amer., vol. 131, no. 1, pp. 531–539, 2012.

Lavanya T received the M.E. degree in communication systems from the SSN College of Engineering, Chennai, in 2016. Currently, she is pursuing the Ph.D. degree with the Speech Lab, SSN College of Engineering, Chennai. Her research interests include speech enhancement and speech signal processing.

Nagarajan T (Member, IEEE) received the Ph.D. from the Department of Computer Science and Engineering, Indian Institute of Technology, Madras in the year 2004. Subsequently, he worked as a Postdoctoral Fellow at INRS – EMT, Montreal, Canada, for two years. He is currently a Professor and heads the Department of Information Technology, SSN College of Engineering, Chennai, India. His areas of research include speech signal processing, continuous speech recognition, statistical parametric speech synthesis, speaker verification and identification, spoken language identification, segmentation of speech into sub-word units, discriminative training techniques, and improved acoustic modeling techniques.

Vijayalakshmi P (Senior Member, IEEE) received the M.E. degree in communication systems from the National Institute of Technology, Trichy in the year 1999 and the Ph.D. from the Indian Institute of Technology, Madras, in 2007. She worked as a Doctoral Trainee for a year at INRS – EMT, Montreal, Canada. She is currently a Professor with the Department of Electronics and Communication Engineering, SSN College of Engineering, Chennai, India. Her areas of research include speech signal processing, speech-based assistive technology, speech recognition, speech synthesis and speech enhancement.
