Comparison of Speech Activity Detection Techniques For Speaker Recognition
I. INTRODUCTION
Speech activity detection (SAD) is an important task in most speech processing applications. The function of a SAD is to distinguish silence and non-speech frames from speech frames, and it has been found that the presence of non-speech frames considerably degrades system performance. Speech activity detection is also called voice activity detection; however, since voice and speech are not the same in speech processing terminology, we refer to the task of identifying speech frames as SAD throughout this work. SAD techniques are designed using various methods. Most of them use heuristically chosen statistical properties of speech parameters such as energy, pitch, and entropy. Consequently, different SADs perform differently, and their performance varies with the level and type of noise, i.e., with the signal-to-noise ratio (SNR). As a result, the performance of speech based systems is significantly sensitive to the employed SAD technique, and the SAD should therefore be chosen carefully when designing such a system. Speech activity detection has been rigorously studied for speech recognition, speech coding, etc. However, it has not yet been thoroughly studied for speaker recognition applications; only very recently has it drawn the attention of researchers in this field [1]-[3].
In current speaker recognition systems, energy based SADs are predominantly used. For example, Kinnunen et al. employed an energy based SAD which was found useful for NIST speech data [4]. The baseline speaker recognition system developed in [5] also uses a different energy based SAD. A bi-Gaussian model of the log-energy distribution of speech frames is suggested in [6]. More recently, different SADs have been used for different qualities of speech signal. For example, in [7], a Hungarian phoneme recognizer is used for speech frame selection.
The authors are with the Department of Electronics and Electrical
Communication Engineering, Indian Institute of Technology Kharagpur,
Kharagpur, West Bengal, 721302, India. e-mail: sahidullahmd@gmail.com;
gsaha@ece.iitkgp.ernet.in
[Figure: likelihoods of the two components of a bi-Gaussian model fitted to frame log energies. x-axis: Log Energy; y-axis: Likelihood (0 to 0.7). The lower-mean component is labelled "Silence/Background Noise Component" and the higher-mean component "Speech Component".]
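The bi-Gaussian idea illustrated above can be sketched in a few lines: fit a two-component Gaussian mixture to the frame log energies with EM and keep the frames assigned to the higher-mean (speech) component. The function below is a minimal numpy sketch; the quartile initialization and the 0.5 posterior threshold are our own choices, not taken from [6].

```python
import numpy as np

def bigaussian_sad(log_energy, n_iter=50):
    """Label frames as speech by fitting a 2-component 1-D GMM to the
    log-energy values with EM; the higher-mean component is taken as speech."""
    x = np.asarray(log_energy, dtype=float)
    mu = np.percentile(x, [25, 75]).astype(float)   # init means from quartiles
    var = np.array([x.var(), x.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each frame
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        n = r.sum(axis=0)
        w = n / n.sum()
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    speech = int(np.argmax(mu))        # high-energy (speech) component
    return r[:, speech] > 0.5          # boolean speech mask

# toy example: 200 "silence" frames around log-energy 2, 100 "speech" around 8
rng = np.random.default_rng(0)
e = np.concatenate([rng.normal(2, 0.5, 200), rng.normal(8, 0.7, 100)])
mask = bigaussian_sad(e)
```

With well-separated components, as in the figure, the posterior threshold cleanly splits the two groups of frames.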
1 http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context
2 http://www.fee.vutbr.cz/SPEECHDAT-E/public/Deliver/WP1/ED141v11-fin.doc
TABLE I
DATABASE DESCRIPTION (CORE TEST SECTION) FOR THE PERFORMANCE EVALUATION OF VARIOUS SPEECH ACTIVITY DETECTORS.

                     SRE 2001    SRE 2002
Target Models        74, 100     139, 191
Test Segments        2038        3570
Total Trials         22418       39270
True Trials          2038        2983
Impostor Trials      20380       36287
B. Feature Extraction
MFCC features are extracted using 20 filters linearly spaced on the mel scale, from 20 ms speech frames with 50% overlap between adjacent frames. The details of our feature extraction process can be found in [31]. We also show the performance of our previously proposed overlapped block transform coefficient (OBTC) feature, termed OBT-9-13, which is extracted by performing a distributed DCT on the mel filterbank output [31]. The dimensions of the MFCC and OBTC features are 38 and 40, respectively, after the static features are augmented with their velocity coefficients.
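As a rough illustration of the front-end described above (20 ms frames, 50% overlap, 20 filters linearly spaced on the mel scale), the following numpy sketch computes base MFCCs. It assumes 8 kHz telephone speech and omits the velocity coefficients and the OBTC variant; the exact pipeline is given in [31].

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, n_filters=20, n_ceps=13, frame_ms=20):
    """Base MFCCs from 20 ms frames with 50% overlap and 20 mel filters."""
    frame_len = int(sr * frame_ms / 1000)        # 160 samples at 8 kHz
    hop = frame_len // 2                         # 50% overlap
    n_fft = 1 << (frame_len - 1).bit_length()    # next power of two
    # frame and window the signal
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2           # power spectrum
    # triangular filters, linearly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(spec @ fbank.T + 1e-10)
    # DCT-II of the log filterbank energies
    k = np.arange(n_ceps)[:, None]
    n = 2 * np.arange(n_filters) + 1
    basis = np.cos(np.pi * k * n / (2 * n_filters))
    return logmel @ basis.T

# usage: 1 s of 8 kHz audio -> one feature vector per 10 ms hop
sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
feats = mfcc(sig)
```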
C. Classifier Description
All results in this paper are based on a GMM-UBM classifier in which the target models are created by adapting the UBM parameters [32]. First, a gender dependent UBM with 256 mixtures is trained with the expectation-maximization algorithm, initialized by split vector quantization, using data from the development section of NIST SRE 2001. Target models are then created by adapting the centres of the UBM with a relevance factor of 14. Finally, during score computation, only the top-5 Gaussians of the UBM per frame are considered.
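The MAP adaptation of the UBM centres described above (relevance factor 14) can be sketched as follows. The function name and the two-mixture toy data are our own; a full system adapts a 256-mixture, gender dependent UBM.

```python
import numpy as np

def map_adapt_means(ubm_w, ubm_mu, ubm_var, X, r=14.0):
    """MAP-adapt the UBM means to enrollment frames X (diagonal covariances).
    Only the means are adapted, as in the classifier described above."""
    # log-likelihood of each frame under each mixture
    ll = -0.5 * ((X[:, None, :] - ubm_mu) ** 2 / ubm_var
                 + np.log(2 * np.pi * ubm_var)).sum(-1)
    ll += np.log(ubm_w)
    post = np.exp(ll - ll.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)               # mixture posteriors
    n = post.sum(0)                                  # soft counts per mixture
    ex = post.T @ X / np.maximum(n[:, None], 1e-10)  # first-order statistics
    alpha = n / (n + r)                              # adaptation coefficient
    return alpha[:, None] * ex + (1 - alpha[:, None]) * ubm_mu

# toy 2-mixture UBM: enrollment data sits near the second centre,
# so only that centre moves noticeably
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
var = np.ones((2, 2))
rng = np.random.default_rng(1)
X = rng.normal(5.5, 0.3, size=(100, 2))
adapted = map_adapt_means(w, mu, var, X, r=14.0)
```

Mixtures that see few frames keep a small adaptation coefficient and stay close to the UBM, which is the point of the relevance factor.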
D. Performance Evaluation Metric
Speaker recognition performance is measured using the equal error rate (EER), the operating point on the detection error tradeoff plot where the probability of false rejection (FR) equals the probability of false acceptance (FA). Performance is also measured in terms of the minimum detection cost function (minDCF), in which a cost function is minimized after assigning unequal costs to FR and FA. Here, the costs of FR and FA are set at 10 and 1, respectively.
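A minimal sketch of both metrics: sweep a threshold over the trial scores, take the FR/FA crossing for the EER, and minimize the weighted detection cost for the minDCF. The target prior of 0.01 follows the usual NIST SRE setting and is an assumption; the section above does not state it.

```python
import numpy as np

def eer_mindcf(scores, labels, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """EER and minDCF from trial scores; labels are 1 for true trials,
    0 for impostor trials. Costs follow the section above."""
    s = np.asarray(scores, float)
    y = np.asarray(labels, int)
    thr = np.concatenate([np.unique(s), [np.inf]])
    p_fr = np.array([(s[y == 1] < t).mean() for t in thr])   # false rejection
    p_fa = np.array([(s[y == 0] >= t).mean() for t in thr])  # false acceptance
    i = np.argmin(np.abs(p_fr - p_fa))
    eer = (p_fr[i] + p_fa[i]) / 2.0                          # crossing point
    dcf = c_miss * p_fr * p_target + c_fa * p_fa * (1.0 - p_target)
    return eer, dcf.min()

# synthetic trials: true-trial scores are shifted up by two standard deviations
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 5000)])
labels = np.concatenate([np.ones(500, int), np.zeros(5000, int)])
eer, mindcf = eer_mindcf(scores, labels)
```

Note that with these costs and prior the detection cost of rejecting every trial is 0.1, which is why minDCF x 100 values in the tables below saturate at 10.00.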
IV. RESULTS AND DISCUSSIONS
Speaker recognition experiments are conducted on both databases using the MFCC and OBTC features. In all experiments, everything other than the SAD technique is kept fixed in order to observe the effect of the SAD. In this paper, the SAD techniques are abbreviated as follows:
G.729B: SAD used in the G.729B codec [9].
SMSAD: Statistical model based SAD proposed by Sohn et al. [10].
HNPNR: Hungarian phoneme recognizer based SAD [25].
MEBTS: Maximum energy based threshold selection [4].
AEBTS: Average energy based threshold selection [5].
UBGME: Utterance-wise bi-Gaussian modeling of log energy [28].
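For concreteness, an energy threshold rule of the MEBTS flavour can be sketched as below: keep frames whose energy lies within a fixed margin of the utterance's maximum frame energy. The 30 dB margin and the function name are our assumptions; the exact rules of [4] and [5] differ in detail.

```python
import numpy as np

def energy_sad(signal, sr=8000, frame_ms=20, margin_db=30.0):
    """Maximum-energy-based SAD sketch: keep frames whose energy lies
    within margin_db of the utterance's maximum frame energy."""
    flen = int(sr * frame_ms / 1000)
    hop = flen // 2                                   # 50% overlap
    n = 1 + (len(signal) - flen) // hop
    idx = np.arange(flen) + hop * np.arange(n)[:, None]
    e = 10 * np.log10((signal[idx] ** 2).sum(1) + 1e-12)  # frame energy in dB
    return e > e.max() - margin_db                    # boolean speech mask

# toy utterance: 1 s of low-level noise followed by 1 s of a loud tone
rng = np.random.default_rng(0)
sr = 8000
silence = rng.normal(0, 1e-3, sr)
tone = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)
mask = energy_sad(np.concatenate([silence, tone]), sr=sr)
```

An AEBTS-style rule would instead set the threshold relative to the average frame energy; the structure of the sketch is the same.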
TABLE II
SPEAKER VERIFICATION RESULTS IN TERMS OF EQUAL ERROR RATE AND MINIMUM DETECTION COST FUNCTION ON NIST SRE 2001 AND NIST SRE 2002 FOR DIFFERENT IMPLEMENTATIONS OF SPEECH ACTIVITY DETECTOR. [G.729B: SAD used in G.729 codec [9], SMSAD: statistical model based SAD [10], HNPNR: Hungarian phoneme recognizer [25], MEBTS: maximum energy based threshold selection [4], AEBTS: average energy based threshold selection [5], UBGME: utterance-wise bi-Gaussian modeling of log energy [28].]

                    NIST SRE 2001                     NIST SRE 2002
            EER (in %)    minDCF x 100        EER (in %)    minDCF x 100
Method      MFCC   OBTC   MFCC   OBTC         MFCC   OBTC   MFCC   OBTC
G.729B      9.32   8.59   4.11   3.85         10.16  9.83   4.50   4.38
SMSAD       8.97   8.11   3.92   3.74         9.86   9.85   4.51   4.29
HNPNR       8.54   7.76   3.74   3.48         9.09   8.78   4.43   4.35
MEBTS       8.24   7.27   3.58   3.44         9.09   8.58   4.45   4.10
AEBTS       7.96   7.21   3.61   3.49         9.45   8.59   4.44   4.22
UBGME       7.91   7.36   3.72   3.41         9.69   9.05   4.50   4.28
[Figure: curves for Speech and for White, Pink, Volvo, Babble and Factory noise; x-axis: Frequency (in Hz), 0 to 4000.]
Fig. 2. Average of short term magnitude responses for clean speech and noise samples. The speech signals are taken from NIST SRE 2001, while the noise samples are taken from the NOISEX-92 database. The averaging operation is performed over all available signal frames.
The UBGME based approach uses only log energy; however, bi-Gaussian modeling of other parameters, such as entropy or the spectral flatness measure, can be studied to extract robust speech frames.
Decision fusion [3] can be used to improve performance further by combining multiple speech activity detectors that carry supplementary information.
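One simple instance of such decision fusion is a frame-level majority vote over the binary outputs of several detectors; [3] may combine detectors differently, so this is only an illustrative sketch.

```python
import numpy as np

def fuse_sad(decisions):
    """Majority-vote fusion of frame-level SAD decisions.
    decisions has shape (n_detectors, n_frames), entries 0/1."""
    d = np.asarray(decisions, dtype=int)
    return d.sum(axis=0) * 2 > d.shape[0]   # strict majority per frame

# three detectors disagreeing on a few frames
a = np.array([1, 1, 0, 0, 1])
b = np.array([1, 0, 0, 1, 1])
c = np.array([1, 1, 1, 0, 1])
fused = fuse_sad([a, b, c])  # majority over the three detectors
```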
REFERENCES
TABLE III
SPEAKER VERIFICATION RESULTS ON NIST SRE 2001 IN PRESENCE OF VARIOUS ADDITIVE ENVIRONMENTAL NOISES. THE RESULTS ARE SHOWN FOR VARIOUS SPEECH ACTIVITY DETECTION TECHNIQUES.

                               EER (in %)                                minDCF x 100
                    MFCC                 OBTC                 MFCC                 OBTC
Method   Noise      20dB  10dB  0dB      20dB  10dB  0dB      20dB  10dB  0dB      20dB  10dB  0dB
G.729B   White      11.38 16.69 28.56    10.40 15.65 27.28    4.80  7.17  9.84     4.52  7.12  9.84
         Pink       10.70 13.64 23.46    9.52  12.90 22.77    4.59  6.45  9.37     4.32  6.05  9.04
         Volvo      11.30 11.43 13.14    10.60 10.50 12.07    4.96  5.03  6.21     4.73  4.71  5.22
         Babble     9.96  13.74 23.99    9.43  12.70 22.93    4.63  6.52  9.17     4.13  5.75  8.84
         Factory    10.55 13.35 20.96    9.57  12.41 19.47    4.66  6.33  8.96     4.36  5.84  8.58
SMSAD    White      11.24 15.75 27.09    9.91  14.78 27.49    4.79  7.00  9.38     4.39  6.98  9.61
         Pink       10.12 13.35 23.60    9.18  12.32 22.62    4.52  6.34  9.17     4.18  5.92  9.02
         Volvo      10.85 11.13 13.20    10.16 10.26 11.97    4.79  5.05  5.89     4.62  4.55  5.11
         Babble     10.16 13.82 24.92    8.92  11.83 22.87    4.62  6.44  9.22     4.08  5.55  8.86
         Factory    10.06 13.05 21.19    9.27  11.63 19.28    4.55  6.03  8.70     4.28  5.67  8.42
HNPNR    White      11.19 16.44 34.64    9.81  15.02 34.49    4.99  7.41  10.00    4.58  7.35  10.00
         Pink       9.87  13.15 29.94    8.83  12.41 28.70    4.63  6.46  10.00    4.30  6.21  10.00
         Volvo      9.67  9.86  12.32    8.93  9.33  10.84    4.62  4.65  5.80     4.16  4.21  4.98
         Babble     9.43  12.95 24.14    8.54  11.68 23.70    4.29  6.06  9.69     3.92  5.64  9.92
         Factory    9.62  12.37 26.36    9.18  11.89 26.10    4.54  6.26  10.00    4.25  5.88  10.00
MEBTS    White      13.64 20.85 33.57    12.31 20.36 34.20    5.98  9.02  10.00    5.50  8.63  9.98
         Pink       10.06 17.08 30.86    8.64  17.17 30.47    4.38  8.31  10.00    3.93  7.94  9.95
         Volvo      8.73  9.13  10.74    7.91  8.34  9.29     3.99  4.05  4.81     3.75  3.71  4.16
         Babble     8.64  13.63 29.63    7.60  13.20 29.64    3.87  6.65  9.88     3.47  6.18  9.81
         Factory    9.22  15.55 29.92    8.54  15.75 28.56    4.23  7.51  10.00    3.94  7.19  9.93
AEBTS    White      11.97 21.41 34.15    10.40 19.68 33.32    5.12  8.96  10.00    4.52  8.05  10.00
         Pink       9.09  17.47 31.45    7.90  16.29 29.78    4.20  8.06  10.00    3.79  7.15  9.98
         Volvo      8.58  8.99  10.30    7.89  8.25  9.18     4.00  3.95  4.62     3.69  3.66  4.10
         Babble     8.29  12.22 29.70    7.57  11.38 29.34    3.79  5.90  9.93     3.46  5.51  9.84
         Factory    8.88  14.62 30.57    8.10  13.94 28.60    4.08  6.80  10.00    3.76  6.26  9.98
UBGME    White      9.76  14.18 24.88    9.31  13.98 24.88    4.66  6.70  9.29     4.19  6.52  9.47
         Pink       8.89  12.03 21.44    8.69  11.29 20.80    4.17  5.71  8.79     3.80  5.41  8.85
         Volvo      8.73  9.37  11.19    8.30  8.82  9.96     4.16  4.09  4.78     3.89  3.80  4.26
         Babble     8.78  11.63 20.70    8.34  10.50 20.61    3.85  5.39  8.58     3.52  5.10  8.59
         Factory    9.03  11.39 21.20    8.59  11.14 20.50    4.05  5.53  8.62     3.76  5.11  8.60
TABLE IV
SPEAKER VERIFICATION RESULTS ON NIST SRE 2002 IN PRESENCE OF VARIOUS ADDITIVE ENVIRONMENTAL NOISES. THE RESULTS ARE SHOWN FOR VARIOUS SPEECH ACTIVITY DETECTION TECHNIQUES.

                               EER (in %)                                minDCF x 100
                    MFCC                 OBTC                 MFCC                 OBTC
Method   Noise      20dB  10dB  0dB      20dB  10dB  0dB      20dB  10dB  0dB      20dB  10dB  0dB
G.729B   White      12.38 17.60 29.47    11.96 17.93 28.83    5.47  7.40  9.84     5.30  7.46  9.75
         Pink       11.93 15.92 25.81    11.67 15.65 24.84    5.37  7.02  9.44     5.16  6.72  9.06
         Volvo      12.50 13.07 15.22    12.37 12.40 13.44    5.54  6.01  6.82     5.41  5.39  5.97
         Babble     12.07 16.66 26.86    11.26 15.15 25.18    5.63  7.34  9.55     5.01  6.73  9.36
         Factory    12.03 16.09 25.01    11.43 14.92 23.63    5.45  7.06  9.40     5.20  6.65  8.82
SMSAD    White      12.26 17.63 27.86    11.97 17.13 28.43    5.59  7.57  9.81     5.33  7.51  9.76
         Pink       11.90 15.93 25.41    11.46 15.38 24.81    5.30  7.00  9.47     5.08  6.69  9.18
         Volvo      12.48 13.21 15.05    12.54 12.30 13.44    5.53  5.93  6.75     5.36  5.39  6.01
         Babble     11.87 17.10 28.29    11.27 15.82 25.95    5.65  7.51  9.65     5.12  6.79  9.58
         Factory    11.83 15.66 24.04    11.47 14.95 22.83    5.34  6.93  9.15     5.11  6.55  8.54
HNPNR    White      12.44 17.90 35.20    11.70 17.40 35.40    5.85  8.08  10.00    5.62  7.96  10.00
         Pink       11.53 15.96 31.48    11.40 15.22 30.44    5.69  7.59  10.00    5.49  7.18  10.00
         Volvo      10.86 11.79 14.15    11.06 11.36 12.64    5.24  5.92  7.03     5.17  5.40  6.20
         Babble     11.20 15.69 26.28    10.56 14.21 25.27    5.61  7.38  10.00    5.31  6.74  10.00
         Factory    11.23 15.45 28.74    10.86 14.82 27.96    5.56  7.45  10.00    5.36  7.01  10.00
MEBTS    White      14.28 22.93 34.97    12.88 21.56 34.96    6.51  8.73  9.96     6.19  8.88  9.97
         Pink       11.20 20.26 33.90    10.29 19.01 33.02    5.15  8.38  9.88     4.87  8.41  9.86
         Volvo      10.12 11.19 12.71    9.82  10.06 11.20    4.82  5.13  6.01     4.49  4.69  5.24
         Babble     10.60 17.20 32.41    9.69  15.46 31.48    5.07  7.59  9.82     4.61  7.12  9.82
         Factory    11.00 17.67 32.12    9.86  17.36 31.14    5.07  7.97  9.78     4.70  7.84  9.78
AEBTS    White      12.37 23.17 35.17    11.93 21.11 35.43    5.96  9.10  10.00    5.57  9.00  9.95
         Pink       10.52 20.01 34.16    10.09 18.64 32.55    4.98  8.42  9.97     4.76  8.22  9.91
         Volvo      10.16 10.67 12.41    9.76  10.22 11.22    4.73  5.09  5.78     4.60  4.76  5.16
         Babble     10.56 15.52 32.35    9.86  14.21 31.41    4.88  7.07  9.91     4.59  6.59  9.82
         Factory    10.60 16.93 32.58    9.76  16.45 30.51    4.90  7.70  9.88     4.64  7.29  9.77
UBGME    White      11.70 16.06 26.92    11.10 15.99 26.95    5.29  7.02  9.71     5.05  7.04  9.61
         Pink       10.93 14.45 23.97    10.50 13.95 23.24    5.08  6.52  9.29     4.85  6.29  9.10
         Volvo      10.69 11.46 12.77    10.16 10.53 11.26    4.88  5.15  5.83     4.67  4.78  5.31
         Babble     10.76 14.29 23.10    10.16 13.21 23.13    4.99  6.43  9.16     4.60  6.13  9.05
         Factory    10.76 14.25 24.57    10.19 13.48 24.04    5.00  6.54  9.38     4.70  6.24  9.14
TABLE V
SPEAKER VERIFICATION RESULTS ON DISTORTED NIST SRE 2002 FOR DIFFERENT SPEECH ACTIVITY DETECTORS.

                    w/o t-norm                        with t-norm
            EER (in %)    minDCF x 100        EER (in %)    minDCF x 100
Method      MFCC   OBTC   MFCC   OBTC         MFCC   OBTC   MFCC   OBTC
G.729B      21.69  19.41  8.52   8.05         21.22  18.54  8.14   7.53
SMSAD       22.09  19.51  8.65   8.16         21.35  18.74  8.37   7.72
HNPNR       21.18  19.13  8.75   8.34         20.72  18.64  8.14   7.49
MEBTS       23.73  22.79  9.94   9.93         23.57  22.26  9.96   9.97
AEBTS       23.33  21.89  9.96   9.93         22.90  21.45  9.97   9.96
UBGME       19.14  17.53  8.32   8.03         18.34  16.43  7.88   7.38