2013 DEC - A Robust Voice Activity Detection Method Based On Speech Enhancement
Abstract. In this paper, a robust voice activity detection (VAD) method based on the multiple observation likelihood ratio test (MOLRT) is proposed. First, we apply a Wiener filter to the observed signal in the time domain, which helps mitigate some of the noise. We use the Wiener filter because VAD performance is generally better at high signal-to-noise ratio (SNR) than at low SNR. Then, several ideas are proposed to improve the MOLRT-based VAD method. A conventional MOLRT-based method includes three modules: likelihood ratio (LR) estimation, threshold setting, and a hangover technique. To improve the accuracy of LR estimation, we adopt the unbiased minimum mean-square error (MMSE) algorithm to estimate the noise power spectrum in every frame. This is effective because the LR is a function of the a priori and a posteriori SNRs, both of which depend on the noise estimate. In addition, to make the VAD method more robust, we propose a dynamic threshold setting technique related to the minimum noise power spectrum, which helps update the threshold to a suitable level according to the denoised signal. Last but most important, a novel hangover algorithm is introduced in place of the conventional HMM-based hangover algorithm. In the novel hangover algorithm, the decision for the current frame is determined by the statistics of the following speech/non-speech detections based on the likelihood ratio test. The evaluation results show that the proposed method significantly outperforms the LRT baseline in VAD accuracy under both varying noise and low-SNR conditions.
1. Introduction
to-noise ratio (SNR) becomes low. In this paper, we further optimize the MOLRT-based method in four respects to obtain a more reliable speech/non-speech decision. First, a Wiener filter is applied to the noisy speech in the time domain, which mitigates the noise effectively. Second, the unbiased minimum mean-square error (MMSE) algorithm is adopted to track the noise power spectrum of each frame. Third, a threshold based on the minimum noise power spectrum is proposed, which makes our method more robust. Last, an effective hang-over scheme based on the statistical properties of multiple MOLRT values is proposed to make a more precise speech/non-speech decision.
Section 2 provides the detailed explanations of our proposed method. Section 3 describes preliminary evaluation
experiments that show the advantage of the proposed method
by comparing it with conventional methods. We give our conclusion in Section 4.
2. Proposed Method
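We assume the observed signals are recorded monaurally and that convolutional noise can be neglected. The additive noise, which is uncorrelated with the speech, may change dynamically:

$$ y(i) = s(i) + v(i), \tag{1} $$

where $y(i)$, $s(i)$, and $v(i)$ denote the observed signal, the clean speech, and the additive noise at sample $i$.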
one of the most fundamental noise-reduction approaches [4]. Thus, we adopt the Wiener filter to filter the observed noisy speech in the time domain.
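As a rough illustration of this time-domain filtering step, the following Python sketch estimates FIR Wiener coefficients from autocorrelation estimates. It assumes that a speech-free noise excerpt is available and that speech and noise are uncorrelated. The function names (`wiener_fir`, `apply_wiener`) and the filter order of 16 are hypothetical choices for this sketch and are not taken from the paper.

```python
import numpy as np

def autocorr(x, order):
    """Biased autocorrelation estimates r[0..order-1] of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order)])

def toeplitz_from(r):
    """Symmetric Toeplitz matrix built from the autocorrelation sequence r."""
    lags = np.arange(len(r))
    return r[np.abs(np.subtract.outer(lags, lags))]

def wiener_fir(y, noise_only, order=16, eps=1e-8):
    """FIR coefficients w minimizing E[(s(i) - w^T y(i))^2].

    Assumption of this sketch: speech and noise are uncorrelated, so the
    cross-correlation r_ys equals r_yy - r_vv, and `noise_only` is a
    speech-free excerpt used to estimate r_vv.
    """
    r_yy = autocorr(y, order)
    r_vv = autocorr(noise_only, order)
    R_y = toeplitz_from(r_yy) + eps * np.eye(order)  # small diagonal loading
    return np.linalg.solve(R_y, r_yy - r_vv)

def apply_wiener(y, w):
    """Causal filtering: x(i) = sum_k w[k] * y(i - k)."""
    return np.convolve(y, w)[: len(y)]
```

A typical call would be `x = apply_wiener(y, wiener_fir(y, noise_only))`, after which the VAD stages operate on `x` instead of `y`.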
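Define the error signal $n_s(i)$ between the clean speech sample $s(i)$ and its estimate $x(i)$ as

$$ n_s(i) = s(i) - x(i) = s(i) - \mathbf{w}^T \mathbf{y}(i), \tag{2} $$

where $\mathbf{w}$ is the vector of filter coefficients and $\mathbf{y}(i)$ collects the most recent observed samples.

Suppose that the error signal and speech spectral coefficients have a complex Gaussian distribution, and that the probability density functions conditioned on $H_0$ (speech pause) and $H_1$ (speech active) are given by $p(X_{m,k}|H_0)$ and $p(X_{m,k}|H_1)$ [1], where $k$ is the frequency band index and $X_{m,k}$ is the spectrum of the estimated speech in the $m$-th frame. Then the log-likelihood ratio (LLR) of the $m$-th frame can be expressed as

$$ l_m = \frac{1}{K}\sum_{k=0}^{K-1} \log \Lambda_{m,k} = \frac{1}{K}\sum_{k=0}^{K-1} \log \frac{p(X_{m,k}|H_1)}{p(X_{m,k}|H_0)} = \frac{1}{K}\sum_{k=0}^{K-1} \left\{ \gamma_{m,k} - 1 - \log \gamma_{m,k} \right\}, \tag{4} $$

where $\Lambda_{m,k}$ is the likelihood ratio in band $k$, $\gamma_{m,k} = |X(m,k)|^2 / \sigma_N^2(m,k)$ is the a posteriori SNR, and the last equality uses the ML estimate $\xi^{ML}_{m,k} = \gamma_{m,k} - 1$ of the a priori SNR.

The conditional expectations of the error power $|N(m,k)|^2$ under the two hypotheses are

$$ E\{|N(m,k)|^2 \,|\, X_{m,k}, H_0\} = |X(m,k)|^2, $$

and

$$ E\{|N(m,k)|^2 \,|\, X_{m,k}, H_1\} = \frac{\xi_{m,k}}{1+\xi_{m,k}}\, \sigma_N^2(m,k) + \left(\frac{1}{1+\xi_{m,k}}\right)^{2} |X(m,k)|^2, $$

where $\xi_{m,k}$ is the a priori SNR. The a posteriori speech presence probability is

$$ P(H_1|X_{m,k}) = \left( 1 + \frac{P(H_0)}{P(H_1)} (1+\xi_{m,k}) \exp\!\left( -\frac{|X(m,k)|^2}{\sigma_N^2(m,k)} \cdot \frac{\xi_{m,k}}{1+\xi_{m,k}} \right) \right)^{-1}. \tag{10} $$

Suppose that the a posteriori SNR $\gamma_{m,k}$ satisfies $\gamma_{m,k} \ge 1$. The a priori SNR $\xi_{m,k}$ can then be derived using the ML algorithm, and the estimator under speech presence can be computed as

$$ E\{|N(m,k)|^2 \,|\, X_{m,k}, H_1, \xi^{ML}_{m,k} = \gamma_{m,k} - 1\} = \sigma_N^2(m,k). \tag{11} $$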
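The frame LLR and the speech presence probability can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `x_pow` stands for $|X(m,k)|^2$, `noise_psd` for $\sigma_N^2(m,k)$, the a posteriori SNR is floored at 1 so that the ML a priori SNR stays non-negative, and the default prior `p_h1=0.5` is an arbitrary illustrative value. The function names are hypothetical.

```python
import numpy as np

def frame_llr(x_pow, noise_psd):
    """Frame LLR with the ML a priori SNR: mean over k of (gamma - 1 - log gamma)."""
    # A posteriori SNR per bin, floored at 1 (assumption of this sketch).
    gamma = np.maximum(x_pow / noise_psd, 1.0)
    return float(np.mean(gamma - 1.0 - np.log(gamma)))

def speech_presence_prob(x_pow, noise_psd, xi, p_h1=0.5):
    """Posterior P(H1 | X) per bin for a priori SNR xi and prior P(H1) = p_h1."""
    prior_ratio = (1.0 - p_h1) / p_h1          # P(H0) / P(H1)
    expo = -(x_pow / noise_psd) * xi / (1.0 + xi)
    return 1.0 / (1.0 + prior_ratio * (1.0 + xi) * np.exp(expo))
```

When the observed power equals the noise power in every band, `frame_llr` returns 0, and it grows as the a posteriori SNR rises above 1.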
Here, $P(H_0|X_{m,k}) = 1 - P(H_1|X_{m,k})$, and the power spectrum of the noise is then obtained by recursively smoothing $E\{|N(m,k)|^2 \,|\, X_{m,k}\}$ as given in (8).
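One way to read this update is sketched below in Python. It assumes that $E\{|N(m,k)|^2 | X_{m,k}\}$ combines the two conditional expectations weighted by the posterior probabilities of $H_0$ and $H_1$ (the law of total expectation), and that the recursion is first order with a fixed smoothing constant. The function name, the argument names, and the value `alpha=0.8` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def update_noise_psd(x_pow, noise_psd_prev, p_h1_post, xi, alpha=0.8):
    """One recursive-smoothing step of the noise power spectrum (sketch).

    x_pow          : |X(m, k)|^2 for the current frame
    noise_psd_prev : previous estimate sigma_N^2(m - 1, k)
    p_h1_post      : posterior P(H1 | X_{m,k}) per bin
    xi             : a priori SNR used under H1
    alpha          : smoothing constant (illustrative value)
    """
    e_h0 = x_pow  # E{|N|^2 | X, H0}
    e_h1 = (xi / (1.0 + xi)) * noise_psd_prev + x_pow / (1.0 + xi) ** 2
    # Assumed combination: weight each hypothesis by its posterior probability.
    e_n = (1.0 - p_h1_post) * e_h0 + p_h1_post * e_h1
    return alpha * noise_psd_prev + (1.0 - alpha) * e_n
```

With `p_h1_post` close to 1 and `xi` equal to the ML value, the update leaves the previous noise estimate essentially unchanged, which matches the behaviour described for the estimator under speech presence.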
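Suppose that a collection of $2M + 1$ sequential LLRs around the current frame $m$ is denoted as $\mathbf{l}_m = \{l_{m-M}, l_{m-M+1}, \ldots, l_m, l_{m+1}, \ldots, l_{m+M}\}$. The decision rule is established from the geometric mean of the corresponding likelihood ratios, i.e. the arithmetic mean of the LLRs in $\mathbf{l}_m$, which is given by [2]

$$ \bar{l}_m = \frac{1}{2M+1} \sum_{fr=m-M}^{m+M} l_{fr} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta_m, \tag{13} $$

where $\eta_m$ is the decision threshold.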
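The multiple-observation decision rule can be illustrated with a short Python sketch. The edge handling (using only the neighbours that exist near the start and end of the signal) and the function name `molrt_decisions` are choices made for this example. The threshold `eta` may be a scalar or one value per frame.

```python
import numpy as np

def molrt_decisions(llr, M, eta):
    """Average 2M+1 neighbouring frame LLRs and compare with the threshold.

    Returns the smoothed LLRs and a boolean array (True means H1, speech).
    """
    llr = np.asarray(llr, dtype=float)
    eta = np.broadcast_to(np.asarray(eta, dtype=float), llr.shape)
    smoothed = np.empty_like(llr)
    for m in range(len(llr)):
        lo, hi = max(0, m - M), min(len(llr), m + M + 1)
        smoothed[m] = llr[lo:hi].mean()  # window is truncated at the edges
    return smoothed, smoothed > eta
```

Because each decision averages over neighbouring frames, an isolated LLR spike in a single frame is less likely to cross the threshold than it would be in a single-observation test.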
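Suppose we have a recursively smoothed periodogram with a time-frequency dependent smoothing factor $\alpha(m,k)$; then the smoothed power spectrum $\Phi(m,k)$ is given by

$$ \Phi(m,k) = \alpha(m,k)\,\Phi(m-1,k) + \bigl(1 - \alpha(m,k)\bigr)\,|X(m,k)|^2. $$

The smoothing factor is based on

$$ \alpha_{opt}(m,k) = \frac{1}{1 + \bigl(\Phi(m-1,k)/\sigma_N^2(m,k) - 1\bigr)^2}, \tag{15} $$

and is computed with an upper limit $\alpha_{max}$ and a correction factor $\alpha_c(m,k)$ as

$$ \alpha(m,k) = \frac{\alpha_{max}\,\alpha_c(m,k)}{1 + \bigl(\Phi(m-1,k)/\sigma_N^2(m,k) - 1\bigr)^2}. \tag{16} $$

Figure 1. (Block diagram; labels: Noisy speech, Wiener Filter, Denoised speech, FFT, moLR(m), Thres(m), Hang-over Scheme, Decision.)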
H1
(m + t) H N j ,
t=0
where $B_{min}$ is the bias correction factor, and $\Phi_{min}$ is the minimum of $D$ successive short-term power spectral density values $\Phi(m,k)$, $m \in \{m_1, \ldots, m_1 - i, \ldots, m_1 - D + 1\}$.
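The minimum tracking described here can be sketched in Python as follows. The sketch assumes that the minimum noise power is the windowed minimum of the smoothed spectrum scaled by the bias correction factor, as in standard minimum-statistics noise tracking. The function names, the default `b_min=1.0` (no correction), and `alpha_max=0.96` are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def smoothing_factor(phi_prev, noise_psd, alpha_max=0.96, alpha_c=1.0):
    """Time-frequency dependent smoothing factor (illustrative constants)."""
    return alpha_max * alpha_c / (1.0 + (phi_prev / noise_psd - 1.0) ** 2)

def min_noise_psd(smoothed_psd, D, b_min=1.0):
    """Bias-corrected minimum over the last D frames, per frequency bin.

    smoothed_psd : array of shape (frames, bins)
    D            : window length in frames
    b_min        : bias correction factor (1.0 means no correction)
    """
    smoothed_psd = np.asarray(smoothed_psd, dtype=float)
    out = np.empty_like(smoothed_psd)
    for m in range(smoothed_psd.shape[0]):
        start = max(0, m - D + 1)  # fewer than D frames exist at the start
        out[m] = b_min * smoothed_psd[start : m + 1].min(axis=0)
    return out
```

A longer window `D` makes the minimum less sensitive to speech activity but slower to follow a rising noise floor, so its value is a trade-off that depends on the frame rate.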
Finally, the threshold m is given as
1
K1
1
2
m =
(18)
Nmin (m, k) ,
K k=0
L (qm = Hi |L m ) =
N_m(m)
3. Experimental Results
(19)
Figure 2.
Fig. 2 shows experimental results obtained with different VAD methods. From Fig. 2, we can see that the LRT-based method [1] produces many small speech/non-speech segments when it is used in a low-SNR environment, and that the method proposed by Tan [3] performs much better. However, these methods offer little help to an automatic speech recognition (ASR) system, since the small segments frequently interrupt the recognizer.
Table 1. Experimental Results for Various Environmental Conditions.

Environment      Proposed (%)           Without Wiener (%)     Tan (%)                Sohn (%)
Noise   SNR      Mcc    VAcc   SBR      Mcc    VAcc   SBR      VAcc   SBR    Mcc      VAcc   SBR    Mcc
White   5 dB     60.21  49.02  65.40    58.01  47.68  60.53    26.98  52.87  41.31    26.53  66.00  53.21
White   15 dB    80.48  75.89  84.30    77.26  70.75  82.48    52.73  82.24  50.75    31.77  56.55  78.18
White   25 dB    85.94  86.53  92.00    83.48  80.70  89.13    62.79  88.99  53.11    33.46  44.94  81.85
Car     5 dB     73.65  62.21  92.27    76.20  65.47  93.38    41.37  77.18  19.95    21.46  36.77  64.43
Car     15 dB    82.46  78.69  93.11    83.46  78.99  93.75    49.33  86.33  30.41    26.32  38.96  68.39
Car     25 dB    86.32  87.55  92.61    84.33  82.20  91.29    59.00  89.74  46.48    30.60  37.51  76.38
Babble  5 dB     76.53  68.03  76.54    72.06  59.78  74.71    42.48  73.36  44.91    29.93  35.13  73.44
Babble  15 dB    86.32  86.62  92.98    83.86  79.43  91.59    60.30  87.71  50.39    343.57 43.33  81.37
Babble  25 dB    86.24  87.21  92.64    83.86  81.37  90.45    63.80  89.74  52.41    34.08  39.77  81.82

Define the total number of speech segments as $Q$, the number of correct start borders as $a$, and the number of correct end borders as $b$. The correct start border rate is then $SBR = a/Q$ and the correct end border rate is $EBR = b/Q$. If the VAD method finds $R$ speech segments, the border precision is defined as $BP = (a + b)/(2R)$. Finally, the correct rate VAcc is defined as

$$ VAcc = \frac{3}{1/SBR + 1/EBR + 1/BP}. $$
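The border-based metrics defined above can be computed with a few lines of Python. VAcc is the harmonic mean of SBR, EBR, and BP, so it is pulled down by whichever of the three is worst. The function name `border_metrics` and the convention of returning 0 when any component is 0 are choices made for this sketch.

```python
def border_metrics(Q, a, b, R):
    """Return (SBR, EBR, BP, VAcc) as fractions in [0, 1].

    Q : total number of reference speech segments
    a : number of correctly detected start borders
    b : number of correctly detected end borders
    R : number of speech segments found by the VAD
    """
    sbr = a / Q
    ebr = b / Q
    bp = (a + b) / (2 * R)
    if min(sbr, ebr, bp) == 0:
        # Convention of this sketch: the harmonic mean is 0 if any term is 0.
        return sbr, ebr, bp, 0.0
    vacc = 3.0 / (1.0 / sbr + 1.0 / ebr + 1.0 / bp)
    return sbr, ebr, bp, vacc
```

For example, a detector that finds twice as many segments as the reference (`R = 2 * Q`) is penalized through BP even when every true border is detected, which is how over-segmentation shows up in VAcc.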
Table 1 gives the detailed experimental results for various environmental conditions. Four methods are compared: the proposed method, the proposed method without the Wiener filter, Tan's method [3], and Sohn's method [1]. From Table 1, we can make the following observations:
1. The proposed method and the harmonic-MOLRT based method have a similar SBR, while the proposed method has a much higher VAcc than the harmonic-MOLRT based method, which indicates that our proposed hang-over scheme considerably reduces abrupt switching between speech and non-speech decisions.
2. The proposed method performs better than the method without speech enhancement in most environmental conditions, the exception being low-SNR non-stationary backgrounds, which indicates that the Wiener filter helps in most conditions.
3. The proposed method performs similarly at 25 dB and 15 dB, which suggests that it is robust to varying environmental conditions.
4. Conclusion
In this letter, a robust MOLRT-based VAD that relies on speech enhancement is developed, which performs considerably better than the compared methods. In it, a speech enhancement method is used