
356 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 2, FEBRUARY 2010

Speech Enhancement Using Harmonic Emphasis and Adaptive Comb Filtering

Wen Jin, Xin Liu, Michael S. Scordilis, Senior Member, IEEE, and Lu Han

Abstract—An enhancement method for single-channel speech degraded by additive noise is proposed. A spectral weighting function is derived by constrained optimization to suppress noise in the frequency domain. Two design parameters are included in the suppression gain, namely, the frequency-dependent noise-flooring parameter (FDNFP) and the gain factor. The FDNFP controls the level of admissible residual noise in the enhanced speech. Enhanced harmonic structures are incorporated into the FDNFP by time-domain processing of the linear prediction residuals of voiced speech. Further enhancement of the harmonics is achieved by adaptive comb filtering derived using the gain factor with a peak-picking algorithm. The performance of the enhancement method was evaluated by the modified bark spectral distance (MBSD), ITU-Perceptual Evaluation of Speech Quality (PESQ) scores, composite objective measures, and listening tests. Experimental results indicate that the proposed method outperforms spectral subtraction, a main signal subspace method applicable to both white and colored noise conditions, and a perceptually based enhancement method with a constant noise-flooring parameter, particularly at lower signal-to-noise ratio conditions. Our listening test indicated that 16 listeners on average preferred the proposed approach over any of the other three approaches about 73% of the time.

Index Terms—Constrained optimization, harmonic enhancement, speech enhancement.

Manuscript received June 24, 2008; revised July 02, 2009. First published July 31, 2009; current version published November 20, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Yariv Ephraim.
W. Jin is with Qualcomm, San Diego, CA 92121 USA (e-mail: wjin@qualcomm.com).
X. Liu and M. S. Scordilis are with the Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33146-0640 USA (e-mail: x.liu6@umiami.edu; m.scordilis@miami.edu).
L. Han is with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA (e-mail: lhan2@ncsu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2009.2028916
1558-7916/$26.00 © 2009 IEEE

I. INTRODUCTION

THE enhancement of single-channel speech degraded by additive noise has been extensively studied in the past and remains a challenging problem because only the noisy speech is available. Techniques have been proposed in the literature to exploit the harmonic structure of voiced speech for enhancing the speech quality [1]–[12]. In the work of [1] and [2], voiced speech is modeled as harmonic components plus noise-like components, and enhancement is performed by estimating the harmonic components while reducing the additive noise in the noise-like components. [3] extends the conventional hidden Markov model (HMM)-based minimum mean square error (MMSE) estimator to enhance the harmonics of voiced speech: it incorporates a ternary voicing state and applies it to a harmonic representation of voiced speech. The sinusoidal model is adopted in the speech enhancement context in the algorithms of [4]–[8].

The aforementioned algorithms rely on some underlying speech model to enhance the harmonics of voiced speech. An alternative strategy is to process the speech directly, either in the time-domain waveforms or in the frequency-domain magnitudes. [9] uses the fundamental frequency to narrow down the a priori probability distribution (PD) of the DFT amplitude, and consequently it improves the estimation of the DFT spectrum and enhances the harmonics of voiced speech. The spectral subtraction algorithm is used as a general and basic enhancement system in [13]. A set of metrics is introduced to measure the harmonicity of short-time pre-enhanced speech and serves as an indicator of whether further enhancement is necessary. The quality of voiced speech is improved by post-enhancing the harmonic structures with adaptive comb filtering [14], [15]. Ephraim introduced an MMSE estimator to enhance the speech in [16], [17]. [11] then exploits the correlation between frequency components to improve an MMSE estimator of the short-time complex spectrum. Kalman filtering has been used to produce a time-domain optimal estimate of the clean speech [18]–[20]. In [12], an artificial harmonic signal is synthesized by nonlinear processing of pre-enhanced speech. This artificial signal is then included in a suppression gain that modifies the spectral magnitudes of the noisy speech.

In this paper, we propose a new method that enhances the harmonics of voiced speech without ascribing to any underlying speech model. The harmonic speech structure obtained through short-time Fourier analysis is enhanced by applying a combination of time- and frequency-domain criteria, which are applicable for white as well as for colored additive noise conditions. While similar principles are shared by [21]–[23], our method addresses the voiced harmonics specifically, instead of using general vector subspace theory, typically formulated using Karhunen–Loeve transform (KLT) solutions [21], [22]. In contrast to many state-of-the-art approaches, the proposed algorithm allows an admissible level of residual noise in the enhanced speech. Since in many real-world applications a complete removal of the degrading noise is neither feasible nor desirable, retaining a low-level background noise actually yields better perceptual quality [24]. The proposed method improves speech quality by suppressing the noise in the frequency domain with the use of a spectral weighting function. Two design parameters are introduced into the proposed suppression
gain, namely the frequency-dependent noise-flooring parameter (FDNFP) and the gain factor. The FDNFP shapes the residual noise in the frequency domain such that the harmonic structure of clean speech is preserved. To further enhance the harmonics of voiced speech, adaptive comb filtering is performed using the gain factor by picking the harmonic peaks from the noisy speech spectrum. Therefore, the proposed algorithm extracts and enhances the harmonics by operating in both the time and frequency domains.

This paper is organized as follows. Section II describes the principles of the proposed enhancement method. Section III presents the techniques for enhancing the harmonic structures of voiced speech. In Section IV, the performance of the proposed method is evaluated. Finally, Section V draws conclusions.

II. PRINCIPLES OF THE PROPOSED ENHANCEMENT METHOD

Suppose we have a single channel of noisy speech degraded by additive noise. The noisy observation can be expressed as

    y = s + d    (1)

where y, s, and d are vectors representing the noisy speech, clean speech, and additive noise, respectively, and N is the number of samples in each analysis frame. The additive noise is assumed to be statistically uncorrelated with the clean speech. Denote by F the Fourier transform matrix, where (·)^H indicates the matrix Hermitian transpose. The N-point short-time Fourier transform (STFT) of the noisy speech is then given by

    Y = F y = S + D    (2)

where Y, S, and D are the Fourier transforms of the noisy speech, clean speech, and noise, respectively. Our enhancement task is to find a spectral-domain linear estimator H such that Ŝ = H Y produces a good approximation to the clean speech spectrum. Ideally, the enhanced signal spectrum Ŝ should be identical to the clean speech spectrum S. Many enhancement methods, e.g., [21], [25], have been proposed in the literature to minimize some error norm between the estimated and clean speech spectra. However, in practical systems there always exist residual distortions in the enhanced speech. Moreover, retaining a comfort level of residual noise in the enhanced speech will actually improve the perceived quality in many situations. For example, in a telephone application, keeping a low-level natural-sounding background noise will provide the far-end user a feeling of the near-end atmosphere and avoid the impression of an interrupted transmission [24]. As stated in the previous section, complete removal of the noise is neither feasible nor desirable. Therefore, we design our linear estimator H in such a way that the enhanced speech spectrum approaches

    S̃ = S + A D    (3)

where A is a diagonal matrix with real-valued diagonal elements α_k, and k is the frequency index. The parameters α_k "admit" a certain level of noise to appear at each frequency band in the enhanced speech. The values of α_k are bounded by 0 ≤ α_k ≤ 1. Because α_k varies with frequency and controls the level of residual noise at each frequency band k, we refer to α_k as the frequency-dependent noise-flooring parameter (FDNFP) in this paper. With (3) as our target of approximation, the estimation error is

    ε = Ŝ − S̃ = (H − I) S + (H − A) D = ε_s + ε_d    (4)

where ε_s = (H − I) S and ε_d = (H − A) D represent the speech distortion and residual noise, respectively. Let

    ε̄_s² = tr E[ε_s ε_s^H]    (5)

be the energy of the speech distortion, where E denotes the expectation and tr is the matrix trace. Similarly, let

    ε̄²_{d,k} = E[|e_k^H ε_d|²]    (6)

denote the energy of the residual noise in the kth frequency band. e_k is the kth spectral component selector, defined as the unit vector whose kth element is one and all other elements are zero.

We formulate the speech enhancement task as the following constrained optimization problem, originally proposed in [21]:

    min_H ε̄_s²    (7)
    subject to ε̄²_{d,k} ≤ δ_k,  k = 0, 1, …, N − 1    (8)

where δ_k is the threshold used to suppress noise at the kth spectral component. The estimator H that satisfies (7) and (8) can be found by following an optimization procedure similar to that used in [21]. Specifically, H is a stationary feasible point if it satisfies the gradient equation of the Lagrangian

    L(H, μ_0, …, μ_{N−1}) = ε̄_s² + Σ_k μ_k (ε̄²_{d,k} − δ_k)    (9)

and

    μ_k ≥ 0,  μ_k (ε̄²_{d,k} − δ_k) = 0,  k = 0, 1, …, N − 1    (10)

where μ_k is the Lagrange multiplier for the kth spectral component. Assume A is real and symmetric. From ∇_H L = 0, we obtain

    H F R_s F^H + Λ H F R_d F^H = F R_s F^H + Λ A F R_d F^H    (11)

where R_s and R_d are the time-domain autocorrelation matrices of the speech and the noise, respectively, and Λ = diag(μ_0, …, μ_{N−1}) is a diagonal matrix of the Lagrange multipliers. The optimal estimator can be obtained by solving the matrix equation (11). Now let us assume H is also diagonal. The simplification comes from the fact that the matrices F R_s F^H and F R_d F^H are asymptotically diagonal provided the matrices R_s and R_d are Toeplitz [26].
The diagonal elements of F R_s F^H and F R_d F^H are the power spectrum components P_s(k) and P_d(k) of the clean speech and the noise, respectively [25]. Therefore, we have the asymptotic diagonal solution

    g_k = [P_s(k) + μ_k α_k P_d(k)] / [P_s(k) + μ_k P_d(k)]    (12)

where g_k are the diagonal elements of H.

We now show how (10) can be satisfied with this diagonal solution. With H diagonal and using the asymptotic diagonalization, (6) can be rewritten as

    ε̄²_{d,k} = (g_k − α_k)² P_d(k)    (13)

Let the constraints in (8) be satisfied with equality; then by substituting (13) into (8), we get

    (g_k − α_k)² P_d(k) = δ_k    (14)

Substituting (12) into (14) and using the condition μ_k ≥ 0, we have

    μ_k = ξ_k [ (1 − α_k) √(P_d(k)/δ_k) − 1 ]    (15)

where ξ_k = P_s(k)/P_d(k) is the signal-to-noise ratio (SNR) for the kth spectral component. Substituting (15) into (12), we obtain the final solution that satisfies both the Lagrangian gradient (9) and the conditions (10):

    g_k = α_k + √(δ_k / P_d(k))    (16)

We can reduce the number of variables in (16) by setting the threshold δ_k to be a proportion of the noise power spectrum P_d(k). Let δ_k = β_k² P_d(k), where β_k is the proportionality factor and specifies the amount of attenuation of the noise power. Then (16) can be rewritten as

    g_k = α_k + β_k    (17)

Obviously, we now have the flexibility of balancing between the two design parameters in (17). The term α_k is introduced because our enhancement target is the "noise admitting" spectrum in (3). The value of α_k should be small in order to maintain a low level of residual noise in the enhanced speech. On the other hand, the parameter β_k dominates the value of the suppression gain; it can be deciphered as a conventional noise suppression function. In fact, if we let α_k = 0 for all k, then g_k = β_k and the second term "AD" on the right-hand side of (3) becomes zero. This means the enhanced speech spectrum will approach the clean speech spectrum S. Several choices for the design of β_k have been proposed in [21]. Actually, if we let

    β_k = ξ_k / (1 + ξ_k)  and  α_k = 0

then (17) reduces to the classical Wiener filter.

Therefore, (17) can be viewed as a combination of a gain factor and a small positive noise-flooring parameter. Clearly, the value of the gain β_k should lie within the range 0 < β_k < 1 for a nontrivial design. The offset α_k controls the level of admissible residual noise in the enhancement output, and the sum α_k + β_k determines the final suppression level of the noise. It is noteworthy that a similar design was proposed in [24]. Unlike the FDNFP in our method, the noise-flooring parameter in [24] was a scalar variable and not frequency-dependent. Furthermore, the gain factor in [24] was derived from an estimation of the masking thresholds. In the next section, we will show how the frequency-dependent parameters α_k and β_k can be utilized to enhance the harmonics of voiced speech.

III. HARMONIC ENHANCEMENT

A. Harmonic Enhancement by Noise Flooring

Because voiced speech is quasi-periodic in nature, its magnitude spectrum exhibits peaks and valleys separated by harmonics of the fundamental frequency. The harmonic structure of clean voiced speech is often corrupted by the additive noise spectrum. Many classical noise reduction methods that use multiplicative spectral gains, e.g., short-time spectral amplitude modification estimators, fail to recover the harmonic structure because they do not take advantage of this redundancy in voiced speech.

Now let us examine the second term "AD" on the right-hand side of (3). If we retrieved the harmonics of the clean voiced speech with some success, then we could incorporate this harmonic structure into the FDNFP α_k. As a consequence, the spectral envelope of the noise is shaped by the harmonics before being suppressed towards the noise floor. This way, the residual noise in the enhanced speech will be shaped to have the same harmonic structure as the clean speech, and the harmonics of the clean voiced speech can be recovered. Therefore, aside from suppressing the noise to a comfortable low level, the FDNFP α_k in (17) can be used to enforce a harmonic shaping of the residual noise spectrum in the enhanced speech.

In order to impose a harmonic envelope upon the FDNFP, we propose here an approach to extract the harmonic structure of voiced speech in the time domain. The motivation for time-domain processing is to preserve the correlation between both spectral amplitudes and phases when restoring the harmonics. Because the phase coherence in voiced speech is a significant source of correlation and corresponds to energy localization in the time domain [11], we retrieve the harmonic information from noisy speech by enhancing the excitation peaks in the linear prediction residuals. Fig. 1 depicts the steps for computing the FDNFP α_k.

Fig. 1. Computation of the frequency-dependent noise-flooring parameters.

For voiced speech, a linear prediction (LP) analysis is performed on the noisy speech. In our implementation, the classical autocorrelation method is used to derive the LP parameters, and the model order is set to 15. The LP residual signal is processed in parallel by two different methods to enhance the excitation peaks. The first method attenuates the signal amplitudes between excitation peaks by windowing the LP residual signal with a Kaiser window series. The duration of each window is set equal to the pitch period, and the centers (peaks) of the windows are aligned in time with the peaks of the excitation pulses. The purpose of the windowing is to enhance the amplitude contrast between the peaks and valleys of the excitation pulses.
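The windowing branch can be sketched as follows (a simplified illustration rather than the authors' code; the excitation-peak locations and pitch period are assumed to be supplied by the pitch detector, and the Kaiser shape parameter beta = 6.0 is an assumption):

```python
import numpy as np

def window_residual(residual, peak_locs, pitch_period, beta=6.0):
    """Attenuate the LP residual between excitation peaks by applying a
    Kaiser window of one pitch period centered on each excitation peak."""
    out = np.zeros_like(residual, dtype=float)
    win = np.kaiser(pitch_period, beta)
    half = pitch_period // 2
    for p in peak_locs:
        start = max(0, p - half)
        stop = min(len(residual), p - half + pitch_period)
        w = win[start - (p - half): stop - (p - half)]
        out[start:stop] = residual[start:stop] * w
    return out

# Toy example: an impulse train (excitation peaks) plus low-level noise
np.random.seed(0)
T = 80                                    # pitch period in samples
r = 0.05 * np.random.randn(4 * T)
peaks = np.arange(T // 2, 4 * T, T)
r[peaks] += 1.0
r_w = window_residual(r, peaks, T)        # peaks kept, valleys attenuated
```

Because the window is near unity at its center and small at its edges, the peak samples pass almost unchanged while the inter-peak samples are strongly attenuated, which is exactly the contrast enhancement described above.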
In the second method, the LP residuals are averaged over the pitch epochs:

    r̄(n) = (1/M) Σ_{m=0}^{M−1} r(n + mT),  n = 0, 1, …, T − 1    (18)

where r̄ and r are the averaged and noisy LP residuals, respectively, M is the largest integer number of pitch periods in the current analysis frame, T is the number of samples in one pitch period, n is the time sample index, and m is the pitch epoch index. From (18), it should be noted that the duration of r̄, the averaged LP residual, is exactly one pitch period. Then r̄ is repeated over the whole analysis frame. The motivation for this averaging is the fact that while the LP bursts of voiced speech are quasi-periodic, the additive noise tends to be random and uncorrelated. By averaging the LP residuals over several pitch periods, the periodic components will therefore be enhanced while the uncorrelated random components will be suppressed. In order to provide the necessary pitch information for the aforementioned windowing and averaging processes, a pitch detection algorithm is run in parallel to determine the pitch period of the current frame. Here we use the relatively simple SIFT (Simple Inverse Filter Tracking) method [27] for pitch determination. Although the performance of the optimal temporal similarity method in [28] is better, it is more complicated to implement and hence increases the computational load, so the enhanced SIFT method is chosen instead. The final processed LP residual with enhanced periodicity is obtained by

    r_e(n) = γ r_w(n) + (1 − γ) r_a(n)    (19)

where γ is a smoothing factor, r_w is the window-enhanced LP residual, and r_a is obtained by periodically extending r̄ in (18) over the entire duration of the analysis frame. r_e is the final LP residual with enhanced periodicity. Because the averaging-enhanced residuals may not be as accurate as the windowing-enhanced residuals, due to shimmer for example, the parameter γ is set to 0.8. r_e is then transformed to the frequency domain, and its magnitude spectrum is normalized to 0 dB by its maximum magnitude. Finally, the FDNFP α_k is obtained by scaling the normalized spectrum of r_e to some comfort noise level. In our implementation, the normalized spectrum of r_e is scaled down by 5 dB for strongly voiced speech. This figure was experimentally optimized so that the level of residual noise permitted in the enhanced signal is kept low. Figs. 2–4 demonstrate the process of obtaining the FDNFP.

Figs. 2 and 3 illustrate the harmonic enhancement of the LP residuals. Fig. 2 shows a frame of clean speech and its corresponding LP residual. Fig. 3(a) depicts the noisy speech obtained by degrading the clean speech with white Gaussian noise at an SNR of 5 dB; the noisy LP residual is shown in Fig. 3(b). The final periodicity-enhanced residual r_e of (19) is plotted in Fig. 3(c), where the noise suppression is clearly evident. In Fig. 4, the magnitude spectra of the clean, noisy, and periodicity-enhanced LP residuals are plotted in Fig. 4(a)–(c), respectively, and the FDNFP is shown in Fig. 4(d).

Fig. 2. Waveform of clean speech and its LP residual: (a) clean speech, (b) LP residual of clean speech.

Fig. 3. Harmonic enhancement in linear prediction residuals: (a) speech in Fig. 2(a) degraded by white Gaussian noise (SNR = 5 dB), (b) LP residual of noisy speech, (c) LP residual with enhanced periodicity.

Fig. 4. Spectrum of linear prediction residuals and FDNFP: (a) spectrum of clean LP residual, (b) spectrum of noisy LP residual (SNR = 5 dB), (c) spectrum of periodicity-enhanced LP residual, (d) FDNFP.

B. Adaptive Comb Filtering

From (17) we can see that it is also beneficial to incorporate harmonic structure into β_k for voiced speech, because β_k is the dominant term of the suppression gain while the values of α_k are usually small. This way, the level of residual noise can be more effectively suppressed, and further improvement of the perceptual quality can be achieved by imposing a harmonic envelope on β_k for voiced speech. However, unlike the α_k, which are relatively flat over the entire frequency range, the β_k should closely follow the spectral tilt as well as the formant peaks and valleys of the speech. Therefore, we implement the β_k as an adaptive comb filter by utilizing the spectral peak-picking algorithm proposed in [29].

The peak-picking method in [29] was proposed as part of a concatenative speech synthesis algorithm that uses the Harmonic plus Noise Model (HNM). Here it is used as a means to determine the frequency locations of the comb peaks. Because the spectral peaks in [29] were picked from the spectrum of clean speech, some modifications and postprocessing steps are introduced in this paper for more reliable performance on the spectrum of noisy speech. Specifically, the "harmonic test" is modified as follows: a peak that satisfies condition (20) or the decibel-domain condition (21) is further tested, and if both (22) and the tonality condition (23) hold, frequency f_i is declared voiced; otherwise f_i is declared unvoiced. The notations in (20), (21), and (22) are the same as defined in [29]: f_i denotes the frequency location of the peak under test, and f_j are the frequencies of the other peaks within the same search interval. f_0 is the initial fundamental frequency estimate obtained with an enhanced SIFT method [28]. A(f_i) and A(f_j) are the amplitudes at f_i and f_j, respectively, and Am(f_i) and Am(f_j) denote the cumulative amplitudes at f_i and f_j. The cumulative amplitude is defined as the non-normalized sum of the amplitudes of all of the samples from the previous valley to the following valley of the peak; the mean value of the cumulative amplitudes Am(f_j) is also used. l is the index of the harmonic nearest to f_i. Having classified frequency f_i as voiced or unvoiced, the next interval of width f_0 is searched for its largest peak and the same "harmonic test" is applied. The process is continued throughout the speech bandwidth. The measurements of (20), (21), and (22) were originally introduced in [29].

Fig. 5. Interpolation of a single harmonic peak and rejection of spurious peaks. Magnitude spectra of clean speech (solid line) and noisy speech (dotted line). Peaks picked by the modified peak-picking method (×). Peaks after postprocessing (○). White Gaussian noise, input SNR = 5 dB.

In this paper, we have added the tonality measure in (23) to the "harmonic test." The advantage of the tonality test is that it effectively removes spurious peaks caused by white noise. The quantity SFM in (23) denotes the spectral flatness measure as defined in [30]:

    SFM = 10 log_10 (G_m / A_m)    (24)

where G_m and A_m denote the geometric mean and the arithmetic mean of the power spectrum in the band under test, respectively. We used an SFM reference of −50 dB in our implementation; in other words, an SFM of −50 dB indicates that the signal is entirely tonelike.
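The flatness measure (24) is straightforward to compute. The sketch below is illustrative only (synthetic bands, not the paper's data); the mapping of the SFM to a tonality coefficient via the −50-dB reference follows the common psychoacoustic convention rather than the exact form of (23):

```python
import numpy as np

def sfm_db(power_spectrum):
    """Spectral flatness measure: geometric over arithmetic mean, in dB."""
    p = np.asarray(power_spectrum, dtype=float) + 1e-12   # guard against log(0)
    geo = np.exp(np.mean(np.log(p)))
    arith = np.mean(p)
    return 10.0 * np.log10(geo / arith)

# Flat (noise-like) band -> SFM near 0 dB; peaky (tone-like) band -> strongly negative
flat_band = np.ones(64)
tonal_band = np.full(64, 1e-6)
tonal_band[32] = 1.0

print(round(sfm_db(flat_band), 2))        # 0.0 for a perfectly flat band
# Tonality coefficient: min(SFM_dB / -50 dB, 1), where 1 means entirely tonelike
tonal = min(sfm_db(tonal_band) / -50.0, 1.0)
```

A white-noise band thus scores near 0 (rejected as non-tonal), while a band dominated by a single harmonic scores far below 0 dB and yields a tonality coefficient close to 1.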
Even though the peak-picking method is modified as above, some real harmonic peaks are still rejected and some spurious peaks accepted because of the distortion introduced by the additive noise. Moreover, the harmonics of clean speech in the spectral valleys are often submerged by the noise spectrum, and consequently these harmonic peaks cannot be picked by the peak tracking method. To overcome these problems, the following postprocessing steps are performed on the peaks picked by the modified algorithm.

1) Interpolation of a single harmonic peak. A local peak is declared a harmonic peak if both of the following conditions are true:
   — its frequency is within 15% of l f_0, the nearest harmonic frequency;
   — there are at least three peaks before and two peaks after it.
2) Rejection of isolated peaks. A harmonic peak is rejected if its distance to the nearest neighboring peaks falls outside an admissible range around the fundamental frequency f_0.
3) Recovery of multiple submerged intermediate peaks. Let l_a and l_b be positive integers with l_b > l_a + 1. Multiple harmonic peaks are interpolated based on the following tests:
   — there are no peaks picked in the frequency range between the l_a-th and the l_b-th harmonics;
   — there are at least three good harmonic peaks below that range and at least another three harmonics above it.
   If both of the above conditions are true, then harmonics are interpolated in that range. Assume the last harmonic before the gap is located at l_a f_0 and the first harmonic after the gap has a frequency of l_b f_0; then the interpolated harmonics have frequencies (l_a + 1) f_0, …, (l_b − 1) f_0.

Fig. 6. Recovery of multiple peaks submerged by noise. Magnitude spectra of clean speech (solid line) and noisy speech (dotted line). Peaks picked by the modified peak-picking method (×). Peaks after postprocessing (○). White Gaussian noise, input SNR = 5 dB.

Fig. 5 illustrates steps 1 and 2 of the postprocessing. In Fig. 5, the spectra of the clean and noisy speech are depicted in solid and dotted lines, respectively. The modified peak-picking method is applied to the spectrum of the noisy speech. The peaks picked by the modified peak-picking method are marked by crosses (×), and the final peaks after postprocessing are marked by circles (○). As shown in Fig. 5, the harmonic peak near 900 Hz is interpolated by step 1, and the spurious peaks above 1600 Hz are rejected according to step 2.

Fig. 6 depicts step 3 of the postprocessing. As can be seen from Fig. 6, the spectrum of the noisy speech is relatively flat in the range 800–1700 Hz because of the effects of the additive white Gaussian noise; the harmonics of the clean speech are submerged by the noise spectrum. Since the conditions of step 3 are satisfied, four peaks are interpolated in the range 800–1700 Hz. It should be noted that the spurious peak near 800 Hz is already eliminated by step 2 before the step 3 interpolation.
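Steps 2 and 3 of the postprocessing can be sketched as follows (a simplified reimplementation of the description above, not the authors' code; the rejection range of step 2 is assumed to be 0.5 f_0 to 1.5 f_0, and the neighbor-count conditions of step 3 are omitted for brevity):

```python
import numpy as np

def postprocess_peaks(peaks_hz, f0):
    """Reject isolated peaks (step 2), then interpolate runs of
    missing harmonics on the harmonic grid (step 3)."""
    peaks = sorted(peaks_hz)

    # Step 2: keep a peak only if at least one neighbor lies at a
    # plausible harmonic spacing (assumed range [0.5*f0, 1.5*f0]).
    kept = []
    for i, p in enumerate(peaks):
        gaps = []
        if i > 0:
            gaps.append(p - peaks[i - 1])
        if i + 1 < len(peaks):
            gaps.append(peaks[i + 1] - p)
        if any(0.5 * f0 <= g <= 1.5 * f0 for g in gaps):
            kept.append(p)

    # Step 3: if the gap between consecutive kept peaks spans several
    # harmonics, insert peaks at (l_a+1)*f0, ..., (l_b-1)*f0.
    recovered = []
    for a, b in zip(kept, kept[1:]):
        recovered.append(a)
        la, lb = round(a / f0), round(b / f0)
        for l in range(la + 1, lb):
            recovered.append(l * f0)
    if kept:
        recovered.append(kept[-1])
    return recovered

# Toy series at f0 = 200 Hz: harmonics 4..7 submerged (missing) and a
# spurious isolated peak at 3050 Hz
f0 = 200.0
picked = [200, 400, 600, 1600, 1800, 2000, 3050]
clean = postprocess_peaks(picked, f0)
# -> [200, 400, 600, 800.0, 1000.0, 1200.0, 1400.0, 1600, 1800, 2000]
```

The isolated 3050-Hz peak is rejected, and the four submerged harmonics between 600 and 1600 Hz are restored on the harmonic grid, mirroring the 800–1700-Hz example of Fig. 6.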
After finding as many additional frequency locations of harmonic peaks as possible, we are ready to design the gain factor β_k in (17) as an adaptive comb filter. In the first step, an initial comb filter is implemented in the frequency domain as

    W(f) = G_p Φ_Q(f − f_p),  |f − f_p| ≤ f_0/2
    W(f) = G_s(f),  otherwise    (25)

where f_p is a peak frequency as determined by the modified peak-picking method and postprocessing, W(f) is the frequency response of the initial comb filter at frequency f, and Φ_Q(·) denotes the comb lobe shape of [10], whose width is controlled by Q; Q is set to 2 in our implementation. The quantity G_p specifies the filter gain at the peak frequency f_p. Notice that in (25) the comb structures are only implemented within the vicinity of one fundamental frequency (pitch f_0) range centered at the peak frequency f_p. The value of G_s determines the filter response outside the frequency range [f_p − f_0/2, f_p + f_0/2]. Since there are many design choices for the gain factor β_k, the designs of G_p and G_s are also flexible. In this paper, we implemented G_p and G_s as Wiener-type gains

    G_p = P̂_s(f_p) / P_y(f_p)    (26)

and

    G_s(f) = P̂_s(f) / P_y(f)    (27)

where P̂_s is the estimated power spectrum of the clean speech and P_y is the power spectrum of the noisy speech. P_y can be computed directly from the noisy speech. Accurate estimation of the clean speech spectrum is crucial to the performance of the proposed harmonic enhancement method. We have used the classical spectral subtraction

    P̂_s(k) = max( P_y(k) − P̂_d(k), ε )    (28)

where ε is a zero-flooring parameter and P̂_d is the estimated noise spectrum. k is simply the index of frequency f; in the following text, f and k are used interchangeably. To obtain the estimated noise spectrum P̂_d, the minimum statistics tracking method in [31] is implemented.

Eventually, the gain factor β_k in (17) is obtained by

    β_k = max( W(k), −20 dB )    (29)

where −20 dB denotes an amplitude value of 0.01. Since the initial comb filter W has variable peak magnitudes G_p and peak frequencies f_p, and normally the peak magnitudes are larger than −20 dB, the gain factor in (29) can be referred to as an adaptive comb filter. The motivation for choosing G_p as in (26) is that the spectral gain in (17) reduces to the Wiener filter at the peak frequency when the estimation of the noise and clean speech spectra is perfectly accurate, i.e., P̂_s = P_s and P_y = P_s + P_d, with α_k = 0. The advantage of using G_s as in (27) is that lost harmonic peaks can be retrieved by accurate estimation of the clean speech spectrum, as will be shown in the following example of Fig. 7.

For the noisy spectrum shown in Fig. 6, the corresponding gain factor is plotted in Fig. 7. Clearly, the gain factor in Fig. 7 is an adaptive comb filter with variable peak magnitudes and peak frequencies. The advantage of using G_s in (27) is also obvious, since the harmonic peak near 2400 Hz was missed by the modified peak-picking method and postprocessing, but it is retrieved by the adaptive comb filter. The spectrum of the harmonic-enhanced speech is shown in Fig. 8. By comparing the spectrum of the noisy speech in Fig. 6 with the enhanced spectrum in Fig. 8, we can see that the four harmonic peaks in 800–1700 Hz are enhanced. The corresponding waveforms of the clean, noisy, and harmonic-enhanced speech are shown in Fig. 9.

Fig. 7. Adaptive comb filter for the noisy speech spectrum in Fig. 6.

Fig. 8. Spectrum of harmonic-enhanced speech; the corresponding spectra of clean and noisy speech, and the peak-picking, are shown in Fig. 6.

Fig. 9. Waveforms of clean, noisy, and harmonic-enhanced speech (peak-picking shown in Fig. 6): (a) clean speech, (b) noisy speech (white Gaussian noise, SNR = 5 dB), and (c) harmonic-enhanced speech.

Finally, the proposed harmonic enhancement method requires a voice activity detector (VAD) to classify the speech signal and a pitch determination algorithm (PDA) to find the pitch of voiced speech. Accurate PDA and robust VAD under noisy environments are well-studied topics in the research literature, and they are also crucial components for the success of the proposed harmonic enhancement algorithm. In our implementation, the enhanced SIFT method in [28] was used for pitch determination. The VAD is based on the short-time energy level, zero-crossing rate, loudness, and success of pitch detection. The short-time speech frames are classified as voiced, unvoiced, and silence. Harmonic enhancement is only applied to voiced frames. The Wiener-type gain G_s in (27) is used for unvoiced and silent frames, with a floor of −20 dB for unvoiced frames and −30 dB for silent frames, i.e., parameter values of 0.01 and 0.001, respectively.

Fig. 10. Waveform and spectrogram of clean and noisy speech (female speech, multitalker babble noise, SNR = 5 dB): (a) clean speech, (b) spectrogram of clean speech, (c) noisy speech, and (d) spectrogram of noisy speech.
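The construction of β_k can be sketched as below. This is a schematic reimplementation, not the paper's code: the exact comb-lobe shape of (25) is replaced by an assumed raised-cosine lobe, while the Wiener-type gains and the 0.01 floor follow the descriptions around (26), (27), and (29):

```python
import numpy as np

def comb_gain(freqs, peak_hz, f0, ps_est, py, floor=0.01):
    """Adaptive comb gain sketch: Wiener-type response ps/py away from the
    comb peaks (cf. (27)), a lobe of width f0 with Wiener-type peak gain
    around each picked harmonic (cf. (26)), floored at 0.01 = -20 dB (29)."""
    gs = ps_est / np.maximum(py, 1e-12)            # response away from comb peaks
    w = gs.copy()
    for fp in peak_hz:
        k = np.argmin(np.abs(freqs - fp))          # bin nearest the picked peak
        gp = ps_est[k] / max(py[k], 1e-12)         # gain at the peak frequency
        in_lobe = np.abs(freqs - fp) <= f0 / 2.0
        lobe = 0.5 * (1.0 + np.cos(2.0 * np.pi * (freqs - fp) / f0))  # assumed shape
        w[in_lobe] = np.maximum(w[in_lobe], gp * lobe[in_lobe])
    return np.maximum(w, floor)

# Toy spectra: strong estimated speech power near the 200/400/600/800-Hz
# harmonics, but 800 Hz is deliberately left out of the picked-peak list --
# the Wiener-type off-peak response still retrieves it, as described for (27).
freqs = np.linspace(0.0, 4000.0, 512)
py = np.ones_like(freqs)
ps = np.full_like(freqs, 0.05)
for h in (200.0, 400.0, 600.0, 800.0):
    ps[np.abs(freqs - h) < 10.0] = 0.9
beta = comb_gain(freqs, [200.0, 400.0, 600.0], 200.0, ps, py)
```

The resulting gain is high near the picked harmonics, high again near the "missed" 800-Hz harmonic thanks to the off-peak Wiener-type response, and small (but never below the floor) in the inter-harmonic valleys.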

IV. PERFORMANCE EVALUATION


The harmonic enhancement method was tested with 60 sentences (30 male, 30 female) with durations between 4–6 s taken from the TIMIT speech database. The sentences were downsampled to 8 kHz before noise samples were added. The noise sources were downloaded from the IEEE Signal Processing Information Base [32]. Two types of noise were used, namely white Gaussian noise and multitalker babble noise. The noise power level was scaled and added to the downsampled clean speech to generate noisy speech with SNR in the range of 0 to 20 dB in 5-dB steps. The enhancement was applied to 32-ms (256-sample) frames of noisy speech with a 50% overlap between adjacent frames, resulting in a frame shift of 16 ms. The FFT size was 512, and the enhancement output was obtained by the overlap-and-add method.

For comparison, we implemented and evaluated the spectral weighting gain method of [24], referred to as the Just Notable Distortion (JND) method, which is given by

(30)

where the masking threshold, the noise spectrum, and the JND weighting gain for the kth spectral component appear, together with a noise-flooring parameter. In our implementation, the masking threshold is estimated by the MPEG-4 psychoacoustical model [33], the noise spectrum is obtained by the method in [31], and the noise-flooring parameter is set to a fixed level in dB. Notice that the noise-flooring parameter in [24] is a constant for all frequencies.

For a more complete comparison, we also implemented the classical spectral subtraction speech enhancement method [13], the Ephraim and Van Trees signal subspace speech enhancement method [21], which was used in the white noise conditions, and the Lev–Ari and Ephraim subspace method for colored noise [23].

The enhancement algorithms were evaluated by the generally used ITU-PESQ (Perceptual Evaluation of Speech Quality) scores [34] as well as the Modified Bark Spectral Distortion (MBSD) measure [35]. ITU-PESQ (P.862) converts the disturbance parameters in speech to a MOS-like listening quality score, which ranges from 0.5 to 4.5; the higher the score, the better the perceptual quality [34]. The PESQ scores are claimed to have a 0.935 average correlation with subjective scores [36]. The MBSD measure [35] is an improvement of the Bark Spectral Distortion (BSD) objective measure [37]. Both the ITU-PESQ scores and the MBSD measure are objective measures that are claimed to be highly correlated to the subjective quality of speech.
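The frame-based analysis–synthesis configuration used in these experiments (32-ms / 256-sample frames, 50% overlap, 512-point FFT, overlap-and-add reconstruction) can be sketched as follows. This is a minimal Python sketch: the Hann window and the `gain_fn` hook are our assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def enhance_frames(noisy, frame_len=256, hop=128, nfft=512, gain_fn=None):
    """Frame-based spectral enhancement with 50% overlap-add.

    frame_len=256 and hop=128 correspond to 32-ms frames with a
    16-ms shift at 8 kHz, as in the experiments. The Hann window
    is an assumption; the paper does not specify the window here.
    """
    win = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * win
        spec = np.fft.rfft(frame, nfft)           # 512-point FFT
        if gain_fn is not None:
            spec *= gain_fn(np.abs(spec))         # spectral weighting gain
        synth = np.fft.irfft(spec, nfft)[:frame_len]
        out[start:start + frame_len] += synth     # overlap-and-add
    return out
```

With an identity gain (`gain_fn=None`) this pipeline approximately reconstructs the input away from the signal edges, which is a useful sanity check before plugging in any suppression gain.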

Fig. 11. Waveform and spectrogram of JND [24] and spectral subtraction [13] enhanced speech (female speech, multitalker babble noise, SNR = 5 dB): (a) JND-enhanced speech, (b) spectrogram of JND-enhanced speech, (c) spectral subtraction enhanced speech, and (d) spectrogram of spectral subtraction enhanced speech.

Fig. 12. Waveform and spectrogram of subspace [23] and proposed harmonic enhanced speech (female speech, multitalker babble noise, SNR = 5 dB): (a) subspace-enhanced speech, (b) spectrogram of subspace-enhanced speech, (c) harmonic enhanced speech, and (d) spectrogram of harmonic enhanced speech.

A comparison between the ITU-PESQ and MBSD measures can be found in [38].

We also use the new composite objective measures developed in [39], which are obtained by linearly combining existing objective measures to form new measures. Such measures aim to predict the quality of noisy speech enhanced by noise suppression algorithms. Three different composite measures were used.

1) C_sig: A composite measure for signal distortion (SIG) formed by linearly combining the log-likelihood ratio (LLR), PESQ, and weighted-slope spectral distance (WSS).

2) C_bak: A composite measure for background noise distortion (BAK) formed by linearly combining the segmental SNR (segSNR), PESQ, and WSS measures.

3) C_ovl: A composite measure for overall quality (OVL) formed by linearly combining the PESQ, LLR, and WSS measures.

Figs. 10–12 show an example of enhancement of noisy speech degraded by multitalker babble noise. The waveforms and spectrograms of the clean and noisy speech are depicted in Fig. 10. The clean speech is the sentence "Lori's costume needed black gloves to be completely elegant." spoken by a female speaker. The noisy speech is obtained by degrading the clean speech with multitalker babble noise at an SNR of 5 dB. The waveforms and spectrograms of the JND-enhanced and spectral subtraction enhanced speech are shown in Fig. 11, while subspace [23] enhanced speech and the proposed Harmonic Enhanced (HE) speech are shown in Fig. 12. It is evident that the HE speech preserves more harmonics and suppresses more high-frequency noise than the other three methods. Furthermore, the proposed method suppresses babble noise-related harmonics more effectively than the competing methods (e.g., around the time of 3 s).

Comprehensive test results are presented in Figs. 13 and 14. The average PESQ scores and MBSD measures are used as objective metrics for enhancement performance. Fig. 13 plots the average enhancement results of the 60 sentences degraded by white Gaussian noise. The average enhancement results of the same 60 sentences degraded by multitalker babble noise are shown in Fig. 14. The input SNRs of both noise conditions were set at 0, 5, 10, 15, and 20 dB.
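The composite measures are plain linear combinations of the basic objective measures, which can be made concrete with a short sketch. The weights below are hypothetical placeholders chosen only to show the structure; in [39] the actual coefficients are obtained by regression against subjective ratings.

```python
def composite_measures(llr, pesq, wss, seg_snr, w=None):
    """Composite quality measures as linear combinations of basic
    objective measures, in the style of [39]. The default weights are
    hypothetical placeholders, not the regression coefficients of [39].
    """
    w = w or {
        "sig": (3.0, -1.0, 0.6, -0.01),   # bias, LLR, PESQ, WSS
        "bak": (1.6, 0.5, -0.01, 0.06),   # bias, PESQ, WSS, segSNR
        "ovl": (1.6, 0.8, -0.5, -0.01),   # bias, PESQ, LLR, WSS
    }
    c_sig = w["sig"][0] + w["sig"][1] * llr + w["sig"][2] * pesq + w["sig"][3] * wss
    c_bak = w["bak"][0] + w["bak"][1] * pesq + w["bak"][2] * wss + w["bak"][3] * seg_snr
    c_ovl = w["ovl"][0] + w["ovl"][1] * pesq + w["ovl"][2] * llr + w["ovl"][3] * wss
    return c_sig, c_bak, c_ovl
```

The appeal of this construction is that each composite score inherits the complementary sensitivities of its constituent measures (e.g., segSNR for background noise, LLR for spectral envelope distortion).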

Fig. 13. Average PESQ scores and MBSD measures of 60 sentences of JND enhanced speech (dotted line), subspace enhanced speech (solid line), spectral
subtraction enhanced speech (dash-dot line), and harmonic enhanced speech (dashed line). The noise is white Gaussian at SNR of 0, 5, 10, 15, and 20 dB.

Fig. 14. Average PESQ scores and MBSD measures of 60 sentences of JND enhanced speech (dotted line), spectral subtraction enhanced speech (dash-dotted
line), subspace enhanced speech (solid line), and harmonic enhanced speech (dashed line). The noise is multitalker babble noise at SNR of 0, 5, 10, 15, and 20 dB.

In both Figs. 13 and 14, the objective measures (PESQ and MBSD) of JND-enhanced speech are marked by dotted lines and diamonds. The measurements of spectral subtraction are illustrated in dash-dotted lines and circles. The measurements of subspace enhancement for white and colored noise are shown by solid lines and plus signs. The performance of the proposed HE method is plotted in dashed lines and asterisks.

From Figs. 13 and 14, we can see that the proposed harmonic enhancement method outperforms the JND approach in [24] and the subspace approach [21], [23] at all SNR conditions for both the white Gaussian noise and babble noise cases. The performance improvement of the proposed HE method is more profound at low SNR. In the case of the spectral subtraction method, the PESQ scores suggest that it performs as well as the proposed harmonic enhancement method in the white noise case for SNR greater than 5 dB, and at SNR equal to 20 dB for babble noise. However, one should be aware that PESQ is generally deaf to the musical noise introduced by spectral subtraction, an observation also confirmed by subjective listening tests. In the case of average MBSD scores, the proposed method outperformed the other three methods by large margins for both types of noise, particularly for low SNR conditions, but as the input SNR increases the enhancement performances of all four methods tend to converge. This is because as the input SNR increases, the harmonics of clean voiced speech become less distorted in the noisy speech; therefore, the benefits of harmonic enhancement tend to diminish with increasing input SNR. Comparing Fig. 13 with Fig. 14, it can be observed that harmonic enhancement is particularly effective for multitalker babble noise, where voiced signals from background speakers often introduce unwanted harmonic regions that need to be suppressed. That is also evident in the spectrograms of Figs. 10–12. Statistical analysis of the PESQ and MBSD results was performed by examining Fisher's F-distribution computed via ANOVA. The results showed that the proposed harmonic enhancement method improved performance with greater than 99.9% certainty.

Table I lists the results using the composite performance measures, organized according to the three evaluated measures: C_sig, C_bak, and C_ovl. We can see that the harmonic enhanced method outperforms the other three methods under white Gaussian noise and multitalker babble noise, particularly for low SNR.
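The ANOVA-based significance testing mentioned above reduces to computing a one-way F-statistic over the per-sentence scores of each method. A minimal sketch, with the per-sentence PESQ/MBSD scores replaced by placeholder data since the actual scores are not reproduced here:

```python
import numpy as np

def one_way_anova_f(groups):
    """One-way ANOVA F-statistic: between-group mean square over
    within-group mean square. Each group would hold the per-sentence
    scores (e.g., PESQ) of one enhancement method."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

The resulting F-value is then compared against Fisher's F-distribution with (df_between, df_within) degrees of freedom to obtain the significance level.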

TABLE I
COMPOSITE MEASUREMENT COMPARISONS OF 60 SENTENCES OF JND ENHANCED SPEECH, SPECTRAL SUBTRACTION
ENHANCED SPEECH, SUBSPACE ENHANCED SPEECH, AND HARMONIC ENHANCED SPEECH

When SNR is greater than 10 dB, the subspace or spectral subtraction methods outperform the proposed method. This is because when the speech is clear enough, trying to enhance the harmonics may actually degrade the quality by introducing artifacts.

A subjective listening test was conducted to compare the performance of the proposed harmonic enhancement method against those of the three other techniques in A-B tests. The test was performed by a group of 16 listeners (four female, 12 male), aged between 18 and 25 years old, all students at the University of Miami. The authors were excluded from this test. The subjects were not familiar with the sentences used in the test. Three sentences were selected from the TIMIT database and used to generate babble and white noise-corrupted speech at SNRs of 0 and 10 dB, which was then processed with all four methods. The resulting total of 18 sentence pairs, plus six sentence pairs of original noisy speech for each SNR condition, were presented to each subject through headphones. For each SNR level, the order of the sentence pairs was randomized, and the structure of the test and the identity of each enhancement method were not revealed to the listeners. Each subject was asked to compare the audio clips in each pair and vote to indicate their quality preference. The test was designed to be short in order to avoid listener fatigue, although the subjects were allowed to listen to each sentence pair as many times as they needed to make a decision. On average, the subjects voted in favor of the harmonic approach for 72.66% of the audio clips, with a standard deviation of 2.35. ANOVA analysis showed that the results of the test had statistical significance at levels greater than 99.9%. Sound examples of noisy speech and its enhancement using the four methods implemented in this paper are located on our website: http://www.chronos.ece.miami.edu/dasp/harmonic_speech_enhancement.html.

V. CONCLUSION

In this paper, we have proposed a speech enhancement method which aims at emphasizing harmonics. The harmonics are enhanced by processing the degraded speech in both the time and frequency domains. In contrast to many other state-of-the-art methods, the proposed algorithm allows a low level of residual noise in the enhanced speech. The noisy speech is enhanced in the frequency domain by a spectral weighting function, which contains two design parameters. One of the design parameters, namely, the frequency-dependent noise-flooring parameter (FDNFP), is used to emphasize the harmonics of voiced speech as well as to control the frequency-dependent level of admissible residual noise. For voiced speech, the periodicity in the linear prediction residual signal was detected and enhanced and then transformed to the frequency domain to be used as the FDNFP. The magnitudes of the FDNFP are scaled to small values in order to suppress the level of residual noise in the enhanced speech. The other design parameter is the dominant term in the spectral gain function. For voiced frames, it enhances the harmonics by adaptive comb filtering. The comb filter is implemented in the frequency domain by utilizing a spectral peak-picking algorithm.
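As an illustration of the frequency-domain comb filtering idea, a minimal sketch follows: bins identified as spectral peaks retain full gain while the remaining bins are floored. The simple local-maximum test, the `boost` gain, and the `floor` value are hypothetical stand-ins; the paper's actual peak-picking and gain rules are more elaborate.

```python
import numpy as np

def comb_weights(mag, boost=1.0, floor=0.1):
    """Hypothetical sketch of a frequency-domain comb weighting built
    from picked spectral peaks: bins at local maxima of the magnitude
    spectrum keep (or boost) their gain, while all other bins are
    attenuated toward the floor value."""
    w = np.full(len(mag), floor)
    # interior bins exceeding both neighbors are treated as harmonic peaks
    peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    w[peaks] = boost
    return w
```

Applied to the magnitude spectrum of a voiced frame, such a weighting passes the harmonic peaks and suppresses the inter-harmonic valleys, which is the comb-filtering effect described above.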

For unvoiced and silent frames, the dominant weighting parameter reduces to a Wiener-type gain.

The enhancement algorithm was tested on 60 sentences degraded by white Gaussian and multitalker babble noise at various input SNRs. The enhancement performance was evaluated in terms of average ITU-PESQ scores, MBSD, and composite objective measures. Three other methods were implemented and their performance compared against that of the proposed method: the spectral subtraction method [13], the Ephraim and Van Trees signal subspace speech enhancement method [21] for white noise, the Lev–Ari and Ephraim subspace method for colored noise [23], and a perceptually based (JND) enhancement method which employs a constant noise-flooring parameter [24]. Experimental results indicate that the proposed Harmonic Enhancement (HE) method outperforms the other methods, particularly at low SNR conditions. In the spectrograms of the enhanced speech, the harmonics are more prominent and the overall noise is more suppressed in the HE speech than in the other methods. The subjective listening test also indicated that the proposed method is generally preferred. All obtained results were statistically significant at a very high confidence level.

REFERENCES

[1] J. Hardwick, C. D. Yoo, and J. S. Lim, "Speech enhancement using the dual excitation model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1993, pp. 367–370.

[2] S. Dubost and O. Cappe, "Enhancement of speech based on non-parametric estimation of a time varying harmonic representation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2000, pp. 1859–1862.

[3] M. E. Deisher and A. S. Spanias, "HMM-based speech enhancement using harmonic modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1997, pp. 1175–1178.

[4] M. E. Deisher and A. S. Spanias, "Speech enhancement using state-based estimation and sinusoidal modeling," J. Acoust. Soc. Amer., vol. 102, no. 2, pp. 1141–1148, Aug. 1997.

[5] J. Jensen and J. H. L. Hansen, "Speech enhancement using a constrained iterative sinusoidal model," IEEE Trans. Speech Audio Process., vol. 9, no. 7, pp. 731–740, Oct. 2001.

[6] D. V. Anderson and M. A. Clements, "Audio signal noise reduction using harmonic modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1999, pp. 805–808.

[7] D. Morgan, B. George, L. Lee, and S. M. Kay, "Cochannel speaker separation by harmonic enhancement and suppression," IEEE Trans. Speech Audio Process., vol. 5, no. 5, pp. 407–424, Sep. 1997.

[8] T. F. Quatieri and R. G. Danisewicz, "Cochannel speaker separation by harmonic enhancement and suppression," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 1, pp. 56–69, Jan. 1990.

[9] A. Erell and M. Weintraub, "Estimation of noise-corrupted speech DFT-spectrum using the pitch period," IEEE Trans. Speech Audio Process., vol. 2, pp. 1–8, Jan. 1994.

[10] A.-T. Yu and H.-C. Wang, "New speech harmonic structure measure and its application to post speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2004, pp. 729–732.

[11] C. Li and S. V. Anderson, "Inter-frequency dependency in MMSE speech enhancement," in Proc. 6th Nordic Signal Process. Symp., 2004, pp. 200–203.

[12] C. Plapous, C. Marro, and P. Scalart, "Speech enhancement using harmonic regeneration," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2005, pp. 157–160.

[13] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979.

[14] V. Grancharov, J. H. Plasberg, J. Samuelsson, and W. B. Kleijn, "Generalized postfilter for speech quality enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 57–64, Jan. 2008.

[15] J. H. Chen and A. Gersho, "Adaptive postfiltering for quality enhancement of coded speech," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 59–71, Jan. 1995.

[16] Y. Ephraim, "A minimum mean square error approach for speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1990, vol. 2, pp. 829–832.

[17] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443–445, Apr. 1985.

[18] K. K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1987, pp. 177–180.

[19] V. Grancharov, J. H. Plasberg, J. Samuelsson, and W. B. Kleijn, "Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementations," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 19–32, Jan. 2006.

[20] M. Gabrea, "Adaptive Kalman filtering-based speech enhancement algorithm," in Proc. Canadian Conf. Elect. Comput. Eng., Fredericton, AB, Canada, 2001, vol. 1, pp. 521–526.

[21] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech Audio Process., vol. 3, pp. 251–266, Jul. 1995.

[22] U. Mittal and N. Phamdo, "Signal/noise KLT based approach for enhancing speech degraded by colored noise," IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 159–167, Mar. 2000.

[23] H. Lev-Ari and Y. Ephraim, "Extension of the signal subspace speech enhancement approach to colored noise," IEEE Signal Process. Lett., vol. 10, no. 4, pp. 104–106, Apr. 2003.

[24] S. Gustafsson, P. Jax, and P. Vary, "A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1998, pp. 397–400.

[25] Y. Hu and P. C. Loizou, "Incorporating a psychoacoustical model in frequency domain speech enhancement," IEEE Signal Process. Lett., vol. 11, no. 2, pp. 270–273, Feb. 2004.

[26] R. Gray, "On the asymptotic eigenvalue distribution of Toeplitz matrices," IEEE Trans. Inf. Theory, vol. IT-18, no. 6, pp. 725–730, Nov. 1972.

[27] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, no. 5, pp. 367–377, Dec. 1972.

[28] P. Veprek and M. S. Scordilis, "Analysis, enhancement and evaluation of five pitch determination techniques," Speech Commun., vol. 37, pp. 249–270, Jul. 2002.

[29] Y. Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis," IEEE Trans. Speech Audio Process., vol. 9, no. 1, pp. 21–29, Jan. 2001.

[30] J. D. Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp. 314–323, Feb. 1988.

[31] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504–512, Jul. 2001.

[32] D. H. Johnson and P. N. Shami, "The signal processing information base," IEEE Signal Process. Mag., vol. 10, no. 4, pp. 36–42, Oct. 1993 [Online]. Available: http://www.spib.rice.edu/spib/select_noise.html

[33] Information Technology—Coding of Audio-Visual Objects—Part 3: Audio, ISO/IEC 14496-3:2005, 2005.

[34] "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," ITU-T Rec. P.862, Feb. 2001 [Online]. Available: http://www.itu.int/rec/T-REC-P.862-200102-I/en, accessed on Aug. 15, 2008.

[35] W. Yang, M. Benbouchta, and R. Yantorno, "Performance of the modified bark spectral distortion as an objective speech quality measure," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1998, pp. 541–544.

[36] J. G. Beerends, A. P. Hekstra, A. W. Rix, and M. P. Hollier, "Perceptual Evaluation of Speech Quality (PESQ): The new ITU standard for end-to-end speech quality assessment part II—Psychoacoustic model," J. Audio Eng. Soc., vol. 50, no. 10, pp. 765–778, Oct. 2002.

[37] S. Wang, A. Sekey, and A. Gersho, "An objective measure for predicting subjective quality of speech coders," IEEE J. Sel. Areas Commun., vol. 10, no. 5, pp. 819–828, Jun. 1992.

[38] W. Yang and R. Yantorno, "Comparison of two objective speech quality measures: MBSD and ITU-T recommendation P.861," in Proc. IEEE 2nd Workshop Multimedia Signal Process., Dec. 1998, pp. 426–431.

[39] Y. Hu and P. C. Loizou, "Evaluation of objective measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, Jan. 2008.

Wen Jin received the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Miami, Coral Gables, FL, in 2001 and 2006, respectively.
His research interests include the general area of audio and speech processing, especially audio and speech coding and single-channel speech enhancement. He is now with Qualcomm, Inc.

Xin Liu was born in Beijing, China, on April 21, 1983. She received the B.S. degree in electrical engineering from Beijing University of Chemical Technology (BUCT), Beijing, in 2005. She is currently pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL.
From 2005 to 2007, she was a Software Engineer (database administrator) with Beijing Guoxin Communication System Co., Ltd. (a subsidiary of China Telecom). Her research interests are in speech enhancement and speech and audio processing.

Michael S. Scordilis (SM'03) received the B.E. degree in communication engineering from the Royal Melbourne Institute of Technology, Melbourne, Australia, in 1984, and the M.S. degree in electrical engineering and the Ph.D. degree in engineering from Clemson University, Clemson, SC, in 1986 and 1990, respectively.
From 1990 to 1995, he was University Lecturer at the University of Melbourne, Melbourne, Australia. He has held visiting Senior Researcher positions at Bell Communications Research (Bellcore), Morristown, NJ, Sun Microsystems Labs, Chelmsford, MA, and the University of Patras, Patras, Greece. He is now Research Associate Professor of Electrical and Computer Engineering at the University of Miami, Coral Gables, FL. His current research interests include signal processing for speech, audio, signal recovery and enhancement, psychoacoustics, language processing, and multimedia signal processing. He is an active industry consultant in the areas of audio and speech analysis, recognition and compression, and multimedia services, and holds patents in those areas. He has published over 60 papers in major journals and conferences.
Dr. Scordilis received the 2003 "Eliahu I. Jury Award for Excellence in Research" of the College of Engineering, University of Miami. He is a member of the Technical Chamber of Greece.

Lu Han received the M.S. degree in electrical engineering from the Harbin Institute of Technology, Harbin, China, in 2007.
In August 2007, she joined the Digital Audio and Speech Processing Lab at the University of Miami, Coral Gables, FL, as a Research Assistant working on speech enhancement. In August 2008, she transferred to the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh. Her current research interests include image processing and computer vision.