Professional Documents
Culture Documents
Noise Power Spectral Density Estimation Based On Optimal Smoothing and Minimum Statistics
Noise Power Spectral Density Estimation Based On Optimal Smoothing and Minimum Statistics
Noise Power Spectral Density Estimation Based On Optimal Smoothing and Minimum Statistics
5, JULY 2001
Abstract—We describe a method to estimate the power spectral system. If the noise estimate is too low, unnatural residual noise
density of nonstationary noise when a noisy speech signal is given. will be perceived. If the estimate is too high, speech sounds
The method can be combined with any speech enhancement algo- will be muffled and intelligibility will be lost. The traditional
rithm which requires a noise power spectral density estimate. In
contrast to other methods, our approach does not use a voice ac- SNR based voice activity detectors (VAD) are difficult to
tivity detector. Instead it tracks spectral minima in each frequency tune and their application to low SNR speech results often
band without any distinction between speech activity and speech in clipped speech. Current research [4]–[6] aims therefore at
pause. By minimizing a conditional mean square estimation error incorporating soft-decision schemes which are also capable of
criterion in each time step we derive the optimal smoothing param- updating the noise psd during speech activity.
eter for recursive smoothing of the power spectral density of the
noisy speech signal. Based on the optimally smoothed power spec- In this paper, we present a novel noise estimation algorithm
tral density estimate and the analysis of the statistics of spectral which is based on an optimal signal psd smoothing method and
minima an unbiased noise estimator is developed. The estimator is on minimum statistics. The psd smoothing algorithm utilizes a
well suited for real time implementations. Furthermore, to improve first order recursive system with a time and frequency dependent
the performance in nonstationary noise we introduce a method to smoothing parameter. The smoothing parameter is optimized
speed up the tracking of the spectral minima. Finally, we evaluate
the proposed method in the context of speech enhancement and low for tracking nonstationary signals by minimizing a conditional
bit rate speech coding with various noise types. mean square error criterion.
Speech enhancement based on minimum statistics was pro-
Index Terms—Minimum statistics, spectral estimation, speech
enhancement. posed in [7] and modified in [8]. In contrast to other methods the
minimum statistics algorithm does not use any explicit threshold
to distinguish between speech activity and speech pause and
I. INTRODUCTION is therefore more closely related to soft-decision methods than
to the traditional voice activity detection methods. Similar to
W ITH the advent and wide dissemination of mobile com-
munications speech enhancement has found many new
applications. In turn the interest in practical and powerful speech
soft-decision methods it can also update the estimated noise psd
during speech activity. It was recently confirmed [9] that the
enhancement algorithms has grown considerably, and signifi- minimum statistics algorithm [7] performs well in nonstationary
cant progress has been made [1], [2]. Yet, speech processing noise.
under adverse conditions is still a challenge. When the signal to The minimum statistics method rests on two observations
noise ratio is low or the disturbing noise is nonstationary the re- namely that the speech and the disturbing noise are usually sta-
sults are plagued by speech distortions and unnatural sounding tistically independent and that the power of a noisy speech signal
or fluctuating residual background noises. frequently decays to the power level of the disturbing noise. It
Frequency domain speech enhancement systems typically is therefore possible to derive an accurate noise psd estimate by
consist of a spectral analysis/synthesis system, a spectral gain tracking the minimum of the noisy signal psd. Since the min-
computation method, and a background noise power spectral imum is smaller than (or in trivial cases equal to) the average
density (psd) estimation algorithm. While the former two are value the minimum tracking method requires a bias compen-
well understood [1]–[3] and easily implemented the noise sation. As we will show in the paper, the bias is a function of
estimator has frequently received less attention. The noise the variance of the smoothed signal psd and as such depends
estimator is, however, a very important component of the on the smoothing parameter of the psd estimator. In contrast to
overall system, especially if the algorithm should be capable of earlier work on minimum tracking [7] which utilizes a constant
handling nonstationary noise. In fact the noise estimator has a smoothing parameter and a constant minimum bias correction,
major impact on the overall quality of the speech enhancement the time and frequency dependent psd smoothing now also re-
quires a time and frequency dependent bias compensation. We
therefore analyze the underlying statistics and develop an ap-
Manuscript received March 31, 1999; revised February 28, 2001. This proximation to the bias of minimum power estimates which is
work was performed while the author was on leave at the Speech and Image well suited for real time implementations.
Processing Services Research Lab, AT&T Labs—Research, Florham Park, NJ
07932 USA. The associate editor coordinating the review of this manuscript The remainder of this paper is organized as follows. After
and approving it for publication was Dr. Shrikanth Narayanan. a brief introduction to noise estimation via minimum statistics
R. Martin is with Institute of Communication Systems and Data Processing, in Section II, we will derive the optimum smoothing parameter
Aachen University of Technology, D-52056 Aachen, Germany (e-mail:
martin@ind.rwth-aachen.de). and a heuristic error monitoring algorithm in Section III. In Sec-
Publisher Item Identifier S 1063-6676(01)04980-X. tion IV, we investigate the statistics of minimum (noise) power
1063–6676/01$10.00 © 2001 IEEE
MARTIN: NOISE POWER SPECTRAL DENSITY ESTIMATION 505
spectral density estimates. An algorithm for the compensation spectral density estimate of the noisy signal frequently decays
of the bias which is associated with minimum power spectral to values which are representative of the noise power level. The
density estimates is developed in Section V. Section IV presents method rests on the fundamental assumption that during speech
the algorithm for searching spectral minima. Special emphasis pause or within brief periods in between words and syllables the
is placed on a novel extension which significantly improves the speech energy is close or identical to zero. Thus, by tracking the
tracking of nonstationary noise. Finally, in Section VII we sum- minimum power within a finite window large enough to bridge
marize experimental results in terms of measurements and lis- high power speech segments the noise floor can be estimated.
tening tests. To highlight some of the obstacles which are encountered
when implementing such an approach we consider a recursively
II. PRINCIPLES OF MINIMUM STATISTICS NOISE ESTIMATION smoothed periodogram
A. Spectral Analysis (3)
In what follows we consider a bandlimited, sampled noisy
speech signal which is the sum of a clean speech signal and a simplified minimum tracking algorithm. Fig. 1 plots the
and a disturbing noise . denotes periodogram , the smoothed periodogram as
the sampling time index. We further assume that and an estimate of the signal psd, and the estimated noise power
are statistically independent and zero mean. The noisy signal which has not yet been compensated for bias as a
is transformed into the frequency domain by applying a function of the frame index and for a single frequency bin
window to a frame of consecutive samples of and . The noise in the noisy speech signal is nonstationary ve-
by computing the FFT of size on the windowed data. Before hicular noise with an overall SNR of approximately 10 dB. The
the next FFT computation the window is shifted by samples. window size is . The periodograms are recur-
This sliding window FFT analysis results in a set of frequency sively smoothed with an equivalent (rectangular) window length
domain signals which can be written as of seconds which represents a good compromise be-
tween smoothing the noise and tracking the speech signal. By
assuming independent periodograms and equating the variance
(1) of to the variance of a moving average estimator with
window length the smoothing parameter in (3) can be
where is the subsampled time index, , and is the fre- computed as .
quency bin index, , which is related to the The noise psd estimate is obtained by picking the min-
normalized center frequency by . Furthermore, imum value within a sliding window of 96 consecutive values
to facilitate our notation and to avoid unnecessary normalization of , regardless whether speech is present or not.
factors we assume . Typically, we use a sam- The minimum tracking provides a rough estimate of the noise
pling rate of Hz and . power. However, we note that to improve the method we have
We note that for all practical purposes and for to address the following issues.
the real and imaginary part of a Fourier transform coefficient • The smoothing with a fixed smoothing parameter
can be considered to be independent and can be mod- widens the peaks of speech activity of the smoothed
eled as zero mean Gaussian random variables [10].1 Under this psd estimate . This will lead to inaccurate
assumption each periodogram bin is an exponentially noise estimates as the sliding window for the minimum
distributed random variable [10] with probability density func- search might slip into broad peaks. Thus, we cannot use
tion (pdf) smoothing parameters close to one and, as a consequence,
the noise estimate will have a relatively large variance.
• The noise estimate as shown in Fig. 1 is biased toward
lower values.
(2) • In case of increasing noise power, the minimum tracking
lags behind.
where and The main themes of this paper are therefore to find a time
are the power spectral densities of the speech varying smoothing parameter such that the tracking ca-
and the noise signals, respectively. denotes the unit step pabilities of the smoothed periodogram and its variance
function, i.e., for and otherwise. are better balanced, to develop an algorithm for bias compensa-
Obviously, during speech pause, , the mean and tion, and to speed up the noise tracking in general.
the variance of are equal to and ,
respectively. III. OPTIMAL TIME VARYING SMOOTHING
B. Minimum Statistics Noise Estimation The smoothed signal psd estimate from which the
noise psd estimate is derived has to satisfy conflicting
The minimum statistics noise tracking method is based on the requirements. On one hand the variance should be as small as
observation that even during speech activity a short term power possible requiring the smoothing parameter in (3) to be close
1Strictly speaking, this assumption holds only when y (i) is stationary with a to one. On the other hand, the smoothed psd estimate has to
relatively small span of correlation and for a large frame size L !1 . track possibly nonstationary noise and, since we do not employ
506 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001
a voice activity detector, also has to follow the highly nonsta- on the right hand side of (7) is recognized as a smoothed version
tionary excursions of the speech signal. Especially when the of the a posteriori SNR [11]
input signal has a high dynamic range these requirements are
impossible to satisfy with a constant smoothing parameter . (8)
However, as we will see below, these problems can be circum-
vented with a time-varying and possibly frequency dependent Fig. 2 plots the optimal smoothing parameter for
smoothing parameter . . Since the optimal smoothing parameter is between zero
and one a stable and nonnegative noise power estimate
A. Derivation of the Smoothing Parameter is guaranteed.
To derive an optimal smoothing procedure we assume Having assumed speech pause in the above derivation does
speech pause and consider again the first order not pose any principal problems. The optimal smoothing pro-
smoothing equation for , now with a time and frequency cedure reacts to speech activity in the same way as to highly
dependent smoothing parameter nonstationary noise. In case of speech activity the smoothing
parameter is reduced to small values which enables the psd esti-
mate to closely follow the time varying psd of the noisy
(4) speech signal.
stance the frequency averaged periodogram. Our monitoring al- The bias can be computed analytically only if successive
gorithm therefore compares the average short-term psd estimate values of are
of the previous frame to the average independent, identically distributed (i.i.d.) random variables.
periodogram and thus detects deviations Unless the sequence of successive values is subsam-
of the short term psd estimate from the actual averaged peri- pled this is clearly not given. We therefore move directly to
odogram. The result of this comparison can be used to modify the case of correlated short term psd estimates and develop
the smoothing parameter in case of large deviations. an approximate solution. To simplify notations, we restrict
The comparison between the average smoothed psd estimate ourselves to the case of speech pause. All results carry over to
and the average actual periodogram is implemented by means the case of speech activity by replacing the noise variance by
of a “soft” characteristic the variance of the noisy speech signal.
The smoothing parameter in recursion (10) was chosen empir- For independent, exponentially and identically distributed pe-
ically. It does not appear to be a sensitive parameter. The mul- riodograms the characteristic function of the pdf of
tiplication of the correction factor with the optimal smoothing is then given by [12, Ch. 18]
parameter then yields the final smoothing parameter
(14)
(11)
Fig. 3. Mean of minimum of correlated short term noise psd estimates for Fig. 4. Normalized variance of minimum of correlated noise psd estimates for
= 1. = 1.
proximation of the first moment, , and the second operations per signal frame and frequency bin. The delay in re-
moment, , of sponse to a rising noise power is now only . For a sampling
rate of 8 kHz and an FFT length of samples we
typically use and .
(20) For less stationary noise the tracking can be improved by
looking in each subwindow for local minima with amplitudes
in the vicinity of the overall minimum. A minimum of a sub-
(21) window is considered to be local if its value was not obtained
(22) in the first or the last signal frame of this subwindow. Since we
now explicitly consider the minima of the subwindows we also
Good results are obtained by choosing the smoothing parameter have to compute a bias compensation for these shorter subwin-
and by limiting to values less or dows.
equal to 0.8. The new algorithm is summarized in Fig. 5. All computa-
Finally, is estimated by tions in Fig. 5 are embedded into loops over all frequency in-
dices and all time indices . Subwindow quantities are sub-
(23) scripted by . In the description of the algorithm we make ref-
erence to a subwindow counter which counts the signal
and this estimate is limited to a maximum of 0.5 corresponding frames within a subwindow and to the running minimum esti-
to . Since an increasing noise power can be tracked mate . At the startup of the program this counter
only with some delay the minimum statistics estimator has a is initialized to and is initialized
tendency to underestimate highly nonstationary noise. Fur- to a preset maximum value. The vector holds the
thermore, since the bias compensation (15) (or (16)) depends overall minimum of the length window. It is updated when-
on the estimated normalized variance the bias compensation ever , when the current minimum
factor is a random variable with a variance depending on the becomes smaller than , or when a local minimum
variance of . It is therefore advantageous to increase is detected.
the inverse bias by a factor proportional to The search range for local minima is
the normalized standard deviation of the short term estimate within 0.8 to 9 dB of the current overall minimum. It depends
with the average normal- on the average normalized variance of the short term
psd estimate. If the variance is small a local minimum very
ized variance and likely indicates the noise level. It can be therefore accepted
typically set to . This bias correction has an impact even if it is several dB larger than the current overall min-
only when the short term psd estimate and thus the estimated imum. An increasing noise level can be therefore tracked on
variance has a large variance. Without the bias correction the the subwindow level. If the variance is large fluctuations of
variations in would push the minimum local minima are not necessarily due to a rising noise floor.
to values which are too low. For stationary noise this factor is Therefore, only minima close to the overall minimum are
close to one. accepted. The functional dependence of the variance and the
search range for local minima was optimized by experiments.
VI. EFFICIENT IMPLEMENTATION OF THE MINIMUM SEARCH and are auxilliary vectors for
Our algorithm requires that we find the minimum of sub- keeping track of those frequency bins which might contain
sequent psd estimates . The computational complexity local minima. If the minimum of a subwindow was determined
as well as the delay inherent in this procedure depends on how as the first or the last value
often we update this minimum estimate. If we update the min- of this subwindow it is not accepted as a local minimum
imum in every time step we have compare operations for . If the minimum was obtained in
each time step and frequency bin. On the other hand, we might between the first or the last value of the subwindow it is
choose to update the minimum only after consecutive sam- marked as a local minimum . If a local
ples of have been computed. In this case, we need only minimum is larger than the overall minimum but still within
one compare operation per signal frame and frequency bin but the search range it replaces all previously
the worst case delay when responding to a rising noise power is stored subwindow minima and thus leads to an increased noise
now . Following the proposal in [7] we implemented a tree psd estimate.
search to balance the complexity and the update rate in a flex-
ible manner. VII. PERFORMANCE EVALUATION
We divide the window of samples into subwindows
of samples . This allows us to update the min- A. Qualitative Results
imum every samples while keeping the computational com- The noise estimation algorithm was evaluated in the context
plexity low. Whenever samples are read the minimum of the of speech enhancement with various noise types. We begin
current subwindow is determined and stored for later use. The our presentation of experimental results with a second look at
overall minimum is obtained as the minimum of all sub- the noisy speech file of Fig. 1. Fig. 6 plots the periodogram,
window minima. We therefore have compare smoothed periodogram, noise estimate, and time varying
510 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001
smoothing parameter for the same noisy speech file B. Quantitative Results
and the same frequency bin as in Fig. 1. We see that the time We measure the relative estimation error with respect to a ref-
varying smoothing parameter allows the estimated signal erence noise psd for computer generated white Gaussian noise,
power to closely follow the peaks of the speech signal while for vehicular noise, and for street noise without and with speech.
during speech pause the noise is well smoothed. Also, the While the white Gaussian noise is completely stationary, the ve-
bias compensation appears to work very well as the smoothed hicular noise has some fluctuations and the street noise is highly
power and the estimated noise power follow each other closely nonstationary. Speech (six male and six female speakers, no
during speech pause. We also note that the noise psd estimate pauses) was added at an SNR of 15 dB. In all cases the estima-
is updated during speech activity. This is a major advantage of tion error was averaged over three minutes of audio material.
the minimum statistics approach. As the true noise psd is not known for vehicular noise and for
Fig. 7 gives another example of the noise tracking abilities of street noise we used a first order recursive system as in (3) with
the algorithm. We now look at a speech sample which has high to compute the reference noise psd. The variance of
SNR speech ( dB) at its beginnning. After about 780 this estimator contributes to the variance which we observe for
clean speech frames computer generated white noise is added the noise psd estimation error.
to the speech. The response of the noise estimator is shown in Table I summarizes the results for speech pauses. Three dif-
Fig. 7. The noise jump is tracked with a delay of frames. ferent algorithms were tested: the minimum statistics approach
The small overshoot is a result of increasing the bias compensa- which was proposed in [7] and uses a fixed smoothing param-
tion factor by the variance dependent factor which is in eter and the new algorithms as described in Fig. 5
this situation at its upper limit. with the bias compensation according to (15) and (17). We also
MARTIN: NOISE POWER SPECTRAL DENSITY ESTIMATION 511
TABLE III [2] , “Speech enhancement using a minimum mean-square error log-
PARAMETERS FOR THE APPROXIMATION OF THE MEAN OF THE spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Pro-
MINIMUM (15) AND (17) cessing, vol. ASSP-33, pp. 443–445, Apr. 1985.
[3] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood
Cliffs, NJ: Prentice-Hall, 1993.
[4] H. G. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust
speech recognition,” Proc. IEEE Int. Conf. Acoustics, Speech, Signal
Processing, vol. 1, pp. 153–156, 1995.
[5] J. Sohn and W. Sung, “A voice activity detector employing soft deci-
sion based noise spectrum adaptation,” Proc. IEEE Int. Conf. Acoustics,
Speech, Signal Processing, vol. 1, pp. 365–368, 1998.
[6] D. Malah, R. V. Cox, and A. J. Accardi, “Tracking speech-presence un-
certainty to improve speech enhancement in nonstationary noise envi-
ronments,” Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing,
pp. 789–792, 1999.
APPENDIX I [7] R. Martin, “Spectral subtraction based on minimum statistics,” in Proc.
MEAN OF MINIMUM FOR Eur. Signal Processing Conf., 1994, pp. 1182–1185.
[8] G. Doblinger, “Computationally efficient speech enhancement by
The probability density of the minimum of i.i.d. spectral minima tracking in subbands,” in Proc. EUROSPEECH, vol. 2,
random variables , is given by 1995, pp. 1513–1516.
[9] J. Meyer, K. U. Simmer, and K. D. Kammeyer, “Comparison of one-
and two-channel noise-estimation techniques,” in Proc. Int. Workshop
(24) Acoustic Echo Control Noise Reduction, 1997, pp. 17–20.
[10] D. R. Brillinger, Time Series: Data Analysis and Theory. New York:
where denotes the probability distribution function Holden-Day, 1981.
of . For and the Gaussian assumption [11] R. J. McAulay and M. L. Malpass, “Speech enhancement using a soft-
decision noise suppression filter,” IEEE Trans. Acoust., Speech, Signal
is exponentially distributed and Processing, vol. 28, pp. 137–145, Dec. 1980.
[12] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate
Distributions: Wiley, 1994.
[13] H. A. David, Order Statistics. New York: Wiley, 1980.
[14] E. J. Gumbel, Statistics of Extremes. New York: Columbia Univ. Press,
1958.
[15] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Prod-
ucts, 5th ed. New York: Academic, 1994.
[16] A. McCree, K. Truong, E. B. George, T. P. Barnwell, and V.
Viswanathan, “A 2.4 KBIT/S MELP coder candidate for the new U.S.
(25) federal standard,” Proc. IEEE Int. Conf. Acoustics, Speech, Signal
Processing, pp. 200–203, 1996.
Therefore, for we obtain . [17] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objec-
tive Measures of Speech Quality. Englewood Cliffs, NJ: Prentice-Hall,
1988.
APPENDIX II [18] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech
APPROXIMATION OF THE MEAN corrupted by acoustic noise,” Proc. IEEE Int. Conf. Acoustics, Speech,
Signal Processing, pp. 208–211, 1979.
Table III lists values for and as a function of [19] P. Händel, “Low-distortion spectral subtraction for speech enhance-
ment,” in Proc. EUROSPEECH, 1995, pp. 1549–1552.
. Values in between can be obtained by linear interpolation.
ACKNOWLEDGMENT
The author would like to thank Dr. R. V. Cox for his sup-
port and Prof. David Malah for many interesting discussions
Rainer Martin (S’86–M’90–SM’00) received the
and for making his speech enhancement code available. Several Dipl.-Ing. and Dr.-Ing. degrees from Aachen Univer-
reviewers provided constructive criticism which helped to im- sity of Technology, Aachen, Germany, in 1988 and
prove the presentation of the algorithm. The author is especially 1996, respectively, and the M.S.E.E. degree from
Georgia Institute of Technology, Atlanta, in 1989.
grateful to one of the anonymous referees whose comments led Since 1996, he has been a Senior Research
to an improved statistical model. Engineer with the Institute of Communication
Systems and Data Processing, Aachen University
REFERENCES of Technology. From 1998 to 1999, he was with
the AT&T Speech and Image Processing Services
[1] Y. Ephraim and D. Malah, “Speech enhancement using a minimum Research Lab, Florham Park, NJ. His research inter-
mean-square error short-time spectral amplitude estimator,” IEEE ests are acoustic signal processing, such as noise reduction and acoustic echo
Trans. Acoust., Speech, Signal Processing, vol. 32, pp. 1109–1121, cancellation, and robustness issues in speech and audio signal transmission,
Dec. 1984. e.g., frame erasure concealment in packet networks.