2013 DEC - A Robust Voice Activity Detection Method Based On Speech Enhancement


A Robust Voice Activity Detection Method Based on Speech Enhancement

Xulei Bao, Jie Zhu, Ning Chen

Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200240, China

Abstract. In this paper, a robust voice activity detection (VAD) method based on the multiple observation likelihood ratio test (MOLRT) is proposed. First, we apply a Wiener filter to the observed signal in the time domain, which helps to mitigate some of the noise. The reason for using the Wiener filter is that VAD performance is always better in the high signal-to-noise ratio (SNR) range than in the low SNR range. Then, several ideas are proposed to improve the MOLRT-based VAD itself. A conventional MOLRT-based method comprises three modules: likelihood ratio (LR) estimation, threshold setting and a hangover technique. To improve the accuracy of the LR estimate, we adopt the unbiased minimum mean-square error (MMSE) algorithm to estimate the noise power spectrum in every frame. This is effective because the LR is a function of the a priori and a posteriori SNRs, and the unbiased MMSE algorithm is well suited to noise estimation. In addition, to make the VAD more robust, we propose a dynamic threshold that is tied to the minimum noise power spectrum, so that the threshold can be updated to a suitable level according to the denoised signal. Last but most important, a novel hangover algorithm is introduced in place of the conventional HMM-based hangover algorithm. In the novel algorithm, the decision for the current frame is determined by the statistics of the following speech/non-speech detections, each based on the likelihood ratio test. The evaluation results show that the proposed method significantly outperforms the LRT baseline in VAD accuracy under both noise variations and low SNR conditions.

1. Introduction

Voice activity detection (VAD), a scheme that automatically detects the presence of speech in observed signals, is a crucial component of a wide range of speech processing algorithms and applications, including speech coding, speech enhancement and automatic speech recognition (ASR). Generally, heuristic rule-based VAD and model-based VAD are considered the most important methods.
In the last decade, many heuristic rule-based VAD algorithms have been developed for different kinds of noise. They differ in the features used, such as linear predictive coding parameters, energy, formant shape, higher-order statistics and cepstral features. However, heuristic rule-based VAD has difficulty coping with all the kinds of noise observed in the real world. Recently, statistical model-based VAD has come to be considered a more attractive approach for noisy speech, and the likelihood ratio test (LRT)-based VAD algorithm is one of the useful methods.
The LRT-based VAD method was first proposed by Sohn [1] in 1999. In this method, the authors assumed that the probability density functions of both speech and noise were Gaussian, and a hidden Markov model (HMM) was used for the hang-over scheme. Later, Gorriz [2] incorporated contextual information in a multiple observation LRT (MOLRT) to cope with non-stationary noise. Tan [3] selected only the DFT bins containing harmonic spectral peaks for the likelihood ratio (LR) in speech segments, as opposed to non-speech segments.
Although many improved LRT-based methods have been proposed, the performance still drops rapidly when the signal-to-noise ratio (SNR) becomes low. In this paper, we further optimize the MOLRT-based method in four respects to obtain a more reliable speech/non-speech decision. First, a Wiener filter is applied to the noisy speech in the time domain, which mitigates the noise effectively. Second, the unbiased minimum mean-square error (MMSE) algorithm is adopted to track the noise power spectrum of each frame. Third, a threshold based on the minimum noise power spectrum is proposed, which makes our method more robust. Last, an effective hang-over scheme based on the statistical properties of the values of multiple MOLRTs is proposed to make a more precise speech/non-speech decision.
Section 2 explains the proposed method in detail. Section 3 describes preliminary evaluation experiments that show the advantage of the proposed method by comparing it with conventional methods. We give our conclusion in Section 4.

2. Proposed Method
We assume that the observed signals are recorded monaurally and that convolutional noise can be neglected. The additive noise, which is uncorrelated with the speech, may change dynamically:
$$ y(i) = s(i) + v(i), \tag{1} $$
where $y(i)$, $s(i)$ and $v(i)$ denote the discrete-time noisy speech, clean speech and noise at sample $i$, respectively. VAD performance is better in higher-SNR environments, and the Wiener filter is one of the most fundamental noise-reduction approaches [4]. Thus, we adopt the Wiener filter to filter the observed noisy speech in the time domain.
Define the error signal $n_s(i)$ between the clean speech sample $s(i)$ and the filter output $x(i)$ as
$$ n_s(i) \triangleq s(i) - x(i) = s(i) - \mathbf{w}^T \mathbf{y}(i), \tag{2} $$
where the superscript $T$ denotes the transpose of a vector or a matrix, $\mathbf{w} = [w_0 \; w_1 \; \cdots \; w_{L-1}]^T$ is an FIR filter of length $L$, and $\mathbf{y}(i) = [y(i) \; y(i-1) \; \cdots \; y(i-L+1)]^T$. The optimal filter, which forms the optimal estimate $x_o(i)$, is the Wiener filter, obtained as
$$ \mathbf{w}_o = \arg\min_{\mathbf{w}} J_s(\mathbf{w}) = \arg\min_{\mathbf{w}} E\{n_s^2(i)\}, \tag{3} $$
where the parameter $\mathbf{w}_o$ can be estimated by the minimum mean-square error (MMSE) algorithm [4]. For a given set of filter coefficients $\mathbf{w}_o$, (2) can be written as
$$ x(i) = \mathbf{w}_o^T \mathbf{y}(i) = s(i) - n_s(i). \tag{4} $$
Suppose that the error signal and the speech spectral coefficients have complex Gaussian distributions, so that the probability density functions conditioned on $H_0$ (speech pause) and $H_1$ (speech active) are given by $p(X_{m,k}|H_0)$ and $p(X_{m,k}|H_1)$ [1], where $k$ is the frequency band index and $X_{m,k}$ is the spectrum of the estimated speech in the $m$th frame. Then the log-likelihood ratio (LLR) of the $m$th frame can be expressed as
$$ l_m = \frac{1}{K}\sum_{k=0}^{K-1} \log \Lambda_{m,k} = \frac{1}{K}\sum_{k=0}^{K-1} \log \frac{p(X_{m,k}|H_1)}{p(X_{m,k}|H_0)} = \frac{1}{K}\sum_{k=0}^{K-1} \left( \gamma_{m,k} - 1 - \log \gamma_{m,k} \right), \tag{5} $$
where $K$ is the number of bands in the current frame and $\gamma_{m,k} = |X(m,k)|^2 / \sigma_N^2(m,k)$ is the a posteriori SNR. The last form in (5) holds because the a priori SNR $\xi_{m,k}$ can be derived from the a posteriori SNR $\gamma_{m,k}$ with the maximum likelihood (ML) estimator, $\xi_{m,k}^{ML} = \gamma_{m,k} - 1$. Thus, from (5), the LLR depends only on the a posteriori SNR, which means that the LLR is a function of the noise power spectrum $\sigma_N^2(m,k)$. How to estimate the unknown parameter $\sigma_N^2(m,k)$ precisely is therefore one of the key issues for LRT/MOLRT-based VAD.
To improve the accuracy of noise power spectrum estimation, many approaches have been proposed during the last decade [5-8]. In this paper, we follow the unbiased MMSE-based noise power estimator of [8]. Under the assumption, as above, that the noise and speech spectral coefficients have complex Gaussian distributions, the expectation of the noise power spectrum conditioned on the observation can be expressed as [9]:
$$ E\{|N(m,k)|^2 \,|\, X_{m,k}\} = P(H_0|X_{m,k}) \, E\{|N(m,k)|^2 \,|\, X_{m,k}, H_0\} + P(H_1|X_{m,k}) \, E\{|N(m,k)|^2 \,|\, X_{m,k}, H_1\}, \tag{6} $$
where
$$ E\{|N(m,k)|^2 \,|\, X_{m,k}, H_0\} = |X(m,k)|^2, $$
and
$$ E\{|N(m,k)|^2 \,|\, X_{m,k}, H_1\} = \frac{\xi_{m,k}}{1+\xi_{m,k}} \sigma_N^2(m,k) + \left( \frac{1}{1+\xi_{m,k}} \right)^2 |X(m,k)|^2. \tag{7} $$
The power spectrum of the noise is obtained by recursive smoothing of the estimated noise periodogram with parameter $\alpha_p = 0.8$ [8]:
$$ \sigma_N^2(m,k) = \alpha_p \, \sigma_N^2(m-1,k) + (1-\alpha_p) \, E\{|N(m,k)|^2 \,|\, X_{m,k}\}. \tag{8} $$
Given the observed signals, we can also obtain the a posteriori speech presence probability using Bayes' theorem:
$$ P(H_1|X_{m,k}) = \frac{P(H_1) \, p(X_{m,k}|H_1)}{P(H_0) \, p(X_{m,k}|H_0) + P(H_1) \, p(X_{m,k}|H_1)} \tag{9} $$
$$ = \left( 1 + \frac{P(H_0)}{P(H_1)} (1+\xi_{m,k}) \exp\left( -\frac{|X(m,k)|^2}{\sigma_N^2(m,k)} \, \frac{\xi_{m,k}}{1+\xi_{m,k}} \right) \right)^{-1}. \tag{10} $$
Suppose that the a posteriori SNR satisfies $\gamma_{m,k} \geq 1$, so that the a priori SNR $\xi_{m,k}$ can be derived with the ML algorithm. The estimator under speech presence can then be computed as
$$ E\{|N(m,k)|^2 \,|\, X_{m,k}, H_1, \xi_{m,k}^{ML} = \gamma_{m,k} - 1\} = \sigma_N^2(m,k). \tag{11} $$
Thus, with (7) and (11), equation (6) can be rewritten as
$$ E\{|N(m,k)|^2 \,|\, X_{m,k}\} = P(H_0|X_{m,k}) \, |X(m,k)|^2 + P(H_1|X_{m,k}) \, \sigma_N^2(m,k). \tag{12} $$
Here, $P(H_0|X_{m,k}) = 1 - P(H_1|X_{m,k})$, and the power spectrum of the noise is then obtained by recursive smoothing of $E\{|N(m,k)|^2 \,|\, X_{m,k}\}$ as given in (8).
Consider a collection of $2M+1$ sequential LLRs around the current frame $m$, denoted $\mathbf{l}_m = \{l_{m-M}, l_{m-M+1}, \ldots, l_m, l_{m+1}, \ldots, l_{m+M}\}$. The decision rule is established from the geometric mean of the likelihood ratios in $\mathbf{l}_m$, that is, from the arithmetic mean of the LLRs [2]:
$$ \bar{l}_m = \frac{1}{2M+1} \sum_{fr=m-M}^{m+M} l_{fr} \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta, \tag{13} $$
where $\eta$ is a threshold for the speech/non-speech decision.
In (13), the threshold is a constant. In a high-SNR environment the noise power spectrum is small, which makes the a posteriori SNR and the LLRs large. When the SNR becomes low, however, the LLR values become much smaller. It is therefore undesirable to use the same threshold for different SNR ranges.
Thus, we adjust the threshold based on the minimum noise power spectrum. This minimum noise power spectrum is estimated by optimal smoothing and minimum statistics (MS) [5].
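As a rough illustration of how (5), (8), (10), (12) and (13) fit together, the following Python sketch processes a sequence of power spectra frame by frame. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the equal priors $P(H_0) = P(H_1)$, the use of the floored ML a priori SNR inside (10), the flooring constant and the handling of frames at the signal edges are all assumptions introduced here for illustration. Only $\alpha_p = 0.8$ is taken from the text.

```python
import math

ALPHA_P = 0.8  # recursive smoothing parameter from the text, used in (8)


def speech_presence_prob(gamma, prior_ratio=1.0):
    """A posteriori speech presence probability (10).

    prior_ratio is P(H0)/P(H1); equal priors are an assumption of this sketch.
    The a priori SNR is the ML estimate gamma - 1, floored at 0 (assumption).
    """
    xi = max(gamma - 1.0, 0.0)
    return 1.0 / (1.0 + prior_ratio * (1.0 + xi) * math.exp(-gamma * xi / (1.0 + xi)))


def update_noise_psd(noise_psd, power_spec):
    """One frame of noise tracking: (10), then (12), then (8), per band."""
    updated = []
    for sigma2, x2 in zip(noise_psd, power_spec):
        p_h1 = speech_presence_prob(x2 / sigma2)
        noise_periodogram = (1.0 - p_h1) * x2 + p_h1 * sigma2  # (12)
        updated.append(ALPHA_P * sigma2 + (1.0 - ALPHA_P) * noise_periodogram)  # (8)
    return updated


def frame_llr(noise_psd, power_spec, gamma_floor=1e-6):
    """Frame LLR (5): mean over bands of gamma - 1 - log(gamma)."""
    total = 0.0
    for sigma2, x2 in zip(noise_psd, power_spec):
        gamma = max(x2 / sigma2, gamma_floor)  # floor avoids log(0); assumption
        total += gamma - 1.0 - math.log(gamma)
    return total / len(power_spec)


def molrt(llrs, m, M):
    """Multiple-observation statistic (13): mean of the LLRs in [m - M, m + M].

    Frames outside the signal are skipped (an assumption of this sketch).
    """
    window = llrs[max(0, m - M): m + M + 1]
    return sum(window) / len(window)


# Toy usage: four bands, a noise floor of 1.0 and two louder frames.
noise = [1.0] * 4
llrs = []
for spec in ([1.0] * 4, [1.2] * 4, [25.0] * 4, [30.0] * 4, [1.1] * 4):
    llrs.append(frame_llr(noise, spec))
    noise = update_noise_psd(noise, spec)
smoothed = [molrt(llrs, m, M=1) for m in range(len(llrs))]
```

In this sketch the LLR of each frame is computed with the noise estimate from the previous frames, and the estimate is updated afterwards. This ordering is a design choice of the example rather than something the text specifies.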
Suppose we have a recursively smoothed periodogram with a time- and frequency-dependent smoothing factor $\alpha(m,k)$; then the smoothed power spectrum $P(m,k)$ is given by
$$ P(m,k) = \alpha(m,k) \, P(m-1,k) + (1 - \alpha(m,k)) \, |X(m,k)|^2. \tag{14} $$
$P(m,k)$ should be as close as possible to the noise power spectrum during speech pauses, i.e. $dE\{(P(m,k) - \sigma_N^2(m,k))^2 \,|\, P(m-1,k)\} / d\alpha(m,k) = 0$. Under the additional assumptions that $E\{|X(m,k)|^2\} = \sigma_N^2(m,k)$ and $E\{|X(m,k)|^4\} = 2\sigma_N^4(m,k)$, the optimum value of $\alpha(m,k)$ is
$$ \alpha_{opt}(m,k) = \frac{1}{1 + \left( P(m-1,k)/\sigma_N^2(m,k) - 1 \right)^2}. \tag{15} $$
However, the estimated noise variance $\sigma_N^2(m,k)$ lags behind the noise power spectrum. Hence a correction factor $\alpha_c(m,k)$ is calculated from the ratio of the averaged smoothed periodogram to the noisy speech power spectrum. The final smoothing parameter, including the correction factor, is given as [5]
$$ \alpha_{opt}(m,k) = \frac{\alpha_{max} \, \alpha_c(m,k)}{1 + \left( P(m-1,k)/\sigma_N^2(m,k) - 1 \right)^2}. \tag{16} $$
Then, the unbiased noise estimate is obtained as
$$ \sigma_{N_{min}}^2(m,k) = B_{min}(m,k) \, P_{min}(m,k), \tag{17} $$
where $B_{min}$ is the bias correction factor and $P_{min}$ is the minimum of $D$ successive short-term power spectral density values $P(m,k)$, $m \in \{m_1, \ldots, m_1 - i, \ldots, m_1 - D + 1\}$.
Finally, the threshold $\eta_m$ is given as
$$ \eta_m = \beta \left( \frac{1}{K} \sum_{k=0}^{K-1} \sigma_{N_{min}}^2(m,k) \right)^{-1}, \tag{18} $$
where $\beta$ is a constant factor for the threshold $\eta_m$.
In classical VADs, the initial decision is modified to prevent abrupt switching (saltation) by considering the previous decision results [1]. We assume that the decision for the current state depends on the initial decisions (formula (20)) for a collection of 30 sequential LRTs starting from the current frame $m$, denoted $L(q_m = H_i \,|\, \mathbf{L}_m)$, which is given by
$$ L(q_m = H_i \,|\, \mathbf{L}_m) = \sum_{t=0}^{29} \delta(m+t) \; \underset{H_0}{\overset{H_1}{\gtrless}} \; N_j, \tag{19} $$
where $\mathbf{L}_m = \{l_m, l_{m+1}, \ldots, l_{m+29}\}$ is the collection of 30 sequential LRTs from the current frame, $N_j$ is a constant used for comparison, and $\delta(m)$ is the initial decision for the $m$th frame:
$$ \delta(m) = \begin{cases} 1, & l_m > \eta_m, \\ 0, & \text{otherwise}. \end{cases} \tag{20} $$
The procedure of the proposed method is summarized in Fig. 1, where $m$ is the frame index.

Figure 1. Flowchart of the proposed method. Blocks: noisy speech, Wiener filter, denoised speech, FFT, N_m(m), moLR(m), Thres(m), decision, hang-over scheme.

3. Experimental Results
In this section, we present experimental results that show the effectiveness of the developed system. For the VAD test, 2906 utterances that we recorded in a practical environment are considered. The sample rate is $f_s = 8$ kHz. Both stationary and non-stationary noises are added to the utterances manually. The non-stationary noises include car-passing noise and babble noise, downloaded from http://www.freesound.com and http://spib.rice.edu/spib/data/signals/noise/babble.html, respectively. We use a normalized Hamming window of length 200 for spectral analysis and synthesis, and the number of frequency bands is set to $K = 256$. The noise output smoothing factor is set to $\alpha_{max} = 0.96$ for updating the optimal smoothing parameter in formula (16). The number of sequential LLRs in the collection is set to $2M + 1 = 17$. The factor $\beta = 4.5$.

Figure 2. An example of different VAD results. Panels: (a) clean speech, (b) VAD by Sohn, (c) VAD by Tan, (d) VAD by the proposed method.

Fig. 2 gives an example of the results produced by the different VAD methods. From Fig. 2, we can see that the LRT-based method [1] produces many small speech/non-speech segments when it is used in a low-SNR environment. The method proposed by Tan [3] performs much better. However, these methods offer little help to an automatic speech recognition (ASR) system, since the small segments frequently interrupt the ASR.
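To make the effect of the hang-over vote in (19) and (20) on short spurious segments concrete, here is a minimal Python sketch. It assumes that a per-frame score and a per-frame threshold are already available. The function names, the vote constant `n_j = 15` and the shorter look-ahead window used for the last frames are assumptions for illustration, since the text does not state them. Only the window length of 30 frames is taken from the text.

```python
def initial_decisions(scores, thresholds):
    """Initial decision (20): 1 if the frame score exceeds its threshold, else 0."""
    return [1 if s > eta else 0 for s, eta in zip(scores, thresholds)]


def hangover(decisions, window=30, n_j=15):
    """Final decision (19): count the initial decisions of `window` frames
    starting at frame m and compare the count with n_j.

    Near the end of the signal the window is simply shorter (assumption).
    """
    final = []
    for m in range(len(decisions)):
        votes = sum(decisions[m:m + window])
        final.append(1 if votes > n_j else 0)
    return final


# A 3-frame burst inside silence is voted away, while a long run survives.
burst = hangover([0] * 50 + [1] * 3 + [0] * 50)
speech = hangover([0] * 20 + [1] * 60 + [0] * 20)
```

Because the vote looks ahead from the current frame, this sketch declares a speech onset some frames before the first frame whose initial decision is 1, and releases it correspondingly early. How far depends on the assumed value of `n_j`.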
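The segment-level score VAcc in (21) and the frame-level score Mcc in (22), which are used to report the results in Table 1, are simple functions of counts. The following is a minimal Python sketch, assuming that the counts ($a$, $b$, $Q$, $R$ and $TP$, $TN$, $FP$, $FN$) have already been obtained by comparing the VAD output with reference labels. The function names and the zero-denominator conventions are hypothetical, and the counting itself is not shown.

```python
import math


def vacc(a, b, q, r):
    """VAcc (21): harmonic mean of SBR = a/Q, EBR = b/Q and BP = (a + b)/(2R)."""
    sbr, ebr, bp = a / q, b / q, (a + b) / (2 * r)
    if min(sbr, ebr, bp) == 0:
        return 0.0  # convention of this sketch when any rate is zero
    return 3.0 / (1.0 / sbr + 1.0 / ebr + 1.0 / bp)


def mcc(tp, tn, fp, fn):
    """Mcc (22): correlation between frame decisions and reference labels."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0 by convention
```

Since VAcc is a harmonic mean, a single weak component (for example a low border precision caused by many spurious segments) pulls the score down more strongly than an arithmetic mean would.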
Define the total number of speech segments as $Q$, the number of correctly detected start borders as $a$, and the number of correctly detected end borders as $b$. The correct start border rate is then $SBR = a/Q$ and the correct end border rate is $EBR = b/Q$. If the VAD method finds $R$ speech segments, the correct border precision is defined as $BP = (a + b)/(2R)$. Finally, the correct rate VAcc is defined as
$$ VAcc = \frac{3}{1/SBR + 1/EBR + 1/BP}. \tag{21} $$
Furthermore, at the frame level, let $TP$ be the number of speech frames correctly classified as speech, $TN$ the number of silence frames correctly classified as silence, $FN$ the number of speech frames classified as silence, and $FP$ the number of silence frames classified as speech. The frame-level correction Mcc is then given as
$$ Mcc = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \tag{22} $$

Table 1. Experimental Results for Various Environmental Conditions (all values in %).

Noise   SNR   | Proposed             | Without Wiener       | Tan [3]              | Sohn [1]
              | Mcc    VAcc   SBR    | Mcc    VAcc   SBR    | Mcc    VAcc   SBR    | Mcc    VAcc   SBR
White   5dB   | 60.21  49.02  65.40  | 58.01  47.68  60.53  | 53.21  26.98  52.87  | 41.31  26.53  66.00
White   15dB  | 80.48  75.89  84.30  | 77.26  70.75  82.48  | 78.18  52.73  82.24  | 50.75  31.77  56.55
White   25dB  | 85.94  86.53  92.00  | 83.48  80.70  89.13  | 81.85  62.79  88.99  | 53.11  33.46  44.94
Car     5dB   | 73.65  62.21  92.27  | 76.20  65.47  93.38  | 64.43  41.37  77.18  | 19.95  21.46  36.77
Car     15dB  | 82.46  78.69  93.11  | 83.46  78.99  93.75  | 68.39  49.33  86.33  | 30.41  26.32  38.96
Car     25dB  | 86.32  87.55  92.61  | 84.33  82.20  91.29  | 76.38  59.00  89.74  | 46.48  30.60  37.51
Babble  5dB   | 76.53  68.03  76.54  | 72.06  59.78  74.71  | 73.44  42.48  73.36  | 44.91  29.93  35.13
Babble  15dB  | 86.32  86.62  92.98  | 83.86  79.43  91.59  | 81.37  60.30  87.71  | 50.39  343.57 43.33
Babble  25dB  | 86.24  87.21  92.64  | 83.86  81.37  90.45  | 81.82  63.80  89.74  | 52.41  34.08  39.77

Table 1 gives the detailed experimental results for various environmental conditions. Four methods are compared: the proposed method, the proposed method without the Wiener filter, Tan's method [3] and Sohn's method [1]. From Table 1, we can make the following observations:
1. The proposed method and the harmonic-MOLRT-based method have a similar SBR, while the proposed method has a much higher VAcc than the harmonic-MOLRT-based method, which illustrates that our proposed hang-over scheme greatly reduces the saltation.
2. The proposed method performs better than the method without speech enhancement in most environmental conditions, the exception being low-SNR non-stationary backgrounds, which means that the Wiener filter helps considerably in most conditions.
3. The proposed method performs similarly at 25 dB and 15 dB, which demonstrates that it is robust to various environmental conditions.
4. Conclusion
In this letter, a robust MOLRT VAD based on speech enhancement is developed, which performs much better than the other methods. A speech enhancement method is applied prior to the VAD, which markedly improves the correctness of the VAD. In addition, an adaptive threshold is proposed to obtain a more robust VAD, and a statistics-based hang-over scheme is proposed to greatly reduce the saltation.
References
[1] J. Sohn, N. S. Kim, and W. Sung, "A statistical model based voice activity detection," IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1-3, Jan. 1999.
[2] J. M. Gorriz, J. Ramirez, E. W. Lang, and C. G. Puntonet, "Jointly Gaussian PDF based likelihood ratio test for voice activity detection," IEEE Trans. on Audio, Speech and Lang. Process., vol. 16, no. 8, pp. 1565-1578, Nov. 2008.
[3] L. Tan, B. J. Borgstrom, and A. Alwan, "Voice activity detection using harmonic frequency components in likelihood ratio test," in ICASSP 2010, Mar. 2010, pp. 4466-4469.
[4] J. Chen, J. Benesty, A. Huang, and S. Doclo, "New insights into the noise reduction Wiener filter," IEEE Trans. on Audio, Speech and Lang. Process., vol. 14, no. 4, pp. 1218-1234, July 2006.
[5] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. on Speech and Audio Process., vol. 9, no. 5, pp. 504-512, July 2001.
[6] R. C. Hendriks, R. Heusdens, and J. Jensen, "MMSE based noise PSD tracking with low complexity," in ICASSP 2010, Mar. 2010, pp. 4266-4269.
[7] J. S. Erkelens and R. Heusdens, "Tracking of nonstationary noise based on data-driven recursive noise power estimation," IEEE Trans. on Audio, Speech and Lang. Process., vol. 16, no. 6, pp. 1112-1123, Aug. 2008.
[8] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Trans. on Audio, Speech and Lang. Process., vol. 20, no. 4, pp. 1383-1393, May 2012.
[9] J. W. Shin, H. J. Kwon, and N. S. Kim, "Voice activity detection based on conditional MAP criterion," IEEE Signal Process. Lett., vol. 15, pp. 257-260, Feb. 2008.