KST-2017_paper_68
Abstract—The Automatic Captioned Relay Service is crucial for people with hearing disabilities or who are hard of hearing to communicate with others in real life. This service uses Automatic Speech Recognition (ASR) to transcribe speech to a caption. If the waiting time of non-streaming speech recognition can be reduced, the relay service will be able to support more users. In this paper, we propose a method for improving voice activity detection (VAD) that uses a short pause as an endpoint, based on the dual-threshold method. This method reduces the waiting time of the caption results. The experimental results show that the proposed method reduces waiting time by 19.10% without losing caption accuracy when compared with traditional VAD.

Keywords—voice activity detection, automatic captioned relay service, short pause, dual-threshold method

algorithm has been discussed. The experiment on the proposed work is presented in Section IV. The experimental results and discussion are described in Section V, and finally the conclusion of the work is presented in Section VI.

II. BASIC IDEA OF THE ALGORITHM

Previous research most commonly uses a silence of over 100 ms between words as the endpoint for better accuracy [8]. However, the waiting time increases accordingly. In this paper, we focus on using a short pause between words to reduce the waiting time of the caption results. A pause between words of more than 20 ms is defined as a short pause. Here are the two main questions of this paper.
Fig. 2. The flow diagram of the proposed VAD algorithm, which is used for determination of silence and short pause

\[ E_n = \sum_{m=0}^{N-1} [X_n(m)]^2 \qquad (2) \]
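As a minimal sketch of equation (2), the short-time energy of each frame can be computed as the sum of squared samples. The function name, the `frame_len` and `hop` parameters, and the plain rectangular framing without a window function are assumptions made for illustration; the framing and windowing actually used are not shown in this section.

```python
from typing import List, Sequence


def short_time_energy(signal: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Return E_n = sum_{m=0}^{N-1} x_n(m)^2 for every full frame n.

    frame_len plays the role of N in equation (2). hop is the step between
    frame starts (an assumed parameter). No window function is applied.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    energies = []
    # Only full frames are used; a trailing partial frame is ignored.
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies
```

For example, `short_time_energy([0.0, 0.0, 1.0, -1.0, 2.0, 0.0], frame_len=2, hop=2)` yields one energy value per two-sample frame, which could then be compared against an energy threshold.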
Fig. 3. Original speech signal

Fig. 5. Zero-crossing rate of speech signal in noisy environment
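The zero-crossing rate plotted in Fig. 5 is not given as a formula in this section. The sketch below assumes the common definition, a count of sign changes between consecutive samples within each frame. The function name, the `frame_len` and `hop` parameters, and the choice to treat a zero sample as non-negative are all illustrative assumptions.

```python
from typing import List, Sequence


def zero_crossing_counts(signal: Sequence[float], frame_len: int, hop: int) -> List[int]:
    """Count sign changes between consecutive samples in every full frame."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    counts = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # A crossing occurs when neighbouring samples lie on opposite sides
        # of zero; a sample equal to zero is treated as non-negative.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        counts.append(crossings)
    return counts
```

A frame of noise-like samples that alternate in sign produces a high count, while a frame with constant sign produces zero.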
C. Determine Silence and Short Pause

Section II shows that if a short pause between words can be used as an endpoint, the waiting time of the VAD can be reduced. In addition to reducing the waiting time, the accuracy of the results needs to be maintained.

Therefore, the proposed VAD algorithm consists of two main processes: (1) determining a normal silence of over 100 ms, which gives the highest likelihood of accurate speech recognition; and (2) validating a pause between words of more than 20 ms, which is chosen as the short pause and can be used to find an appropriate endpoint.

To locate the silence and short pause, we use the four-state transition diagram illustrated in Figure 6.

Fig. 6. State transition diagram for determining silence and short pause

As shown in Figure 6, the four states are silence, maybe-speech, speech, and leaving-speech. We assume the silence state is the start state and any state can be a final state. The transition conditions are on the edges between states, and the actions taken on each condition are in brackets. In the following discussion, F denotes the output values from feature extraction, T denotes the threshold values, and the outputs (the start point, the silence, and the list of short pauses) are represented by detected frame numbers. “count” is the number of speech frames detected, “pause” is the number of pause frames found in the speech, and “sp” is the minimum short pause length. Moreover, “speech” is the minimum speech length and “silence” is the minimum pause length that can be defined as a silence; both parameters are set to 100 ms.

To illustrate the two main processes in the determination of silence and short pause, we focus on the leaving-speech state, in which the algorithm detects a weak signal during the speech, which we call a “pause”. If the pause length is greater than the “silence” threshold, the algorithm decides that the pause range is a silence and moves to the silence state. Meanwhile, if the pause length is lower than “silence” but greater than “sp”, the pause is recorded as a short pause that can be used to find an appropriate endpoint. Finally, the state moves back to the speech state.

D. Endpoint Detection

After the previous process has found pauses that could be used as an endpoint, this process decides which detected pause to use. We define θ as the minimum waiting time allowed and ∆d as the actual processing time. The algorithm then selects either the silence or a short pause as the endpoint according to the rules below.

• If a silence can be found within θ, the algorithm uses the silence as the endpoint.
• If ∆d > (θ + θ2) and no silence segment has been found, we use the latest short pause as the endpoint, and 100 ms of silence is added before and after the endpoint for better accuracy. The performance with different lengths of silence is compared in the experimental results.

IV. EXPERIMENT

In the experiment, we divided the test into two types: the traditional dual-threshold method (TRAD) and the proposed algorithm (PROP). For the proposed algorithm, we studied how the length of the short pause affects the waiting time and the accuracy of captions. Moreover, a variant of the proposed algorithm with 100 ms of added silence (called PROP+) is included for comparison. To confirm the effectiveness of the proposed algorithm, the thresholds for short-time energy (∆E) and zero-crossing rate (∆Z) are set to the same values for all algorithms, as follows: ∆E = 0.5, ∆Z = 20.

The thresholds for short-time energy and zero-crossing rate were determined by observation, which is easier; automated threshold selection for VAD is planned as future work.

The dataset used in our experiment is from LOTUS [11], a large vocabulary continuous speech recognition (LVCSR) corpus. It consists of speech from 24 people, 12 women and 12 men. To test in a noisy environment, the office noise included in the corpus is used for measuring the performance of the algorithms.

To measure the effectiveness of each algorithm, we studied two factors: accuracy and waiting time. To measure accuracy, we adopted the Word Error Rate (WER) of the caption results from speech recognition. The WER is the number of substitution, deletion, and insertion errors over the number of words in the correct transcription, as in equation (4):

\[ WER = \frac{Substitution + Deletion + Insertion}{Substitution + Deletion + Correct} \qquad (4) \]

The waiting time is measured as the average time users wait for the caption results. It includes the detection time of the VAD and the recognition time, until the transcribed output is delivered to the user.
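Equation (4) can be illustrated with a short sketch that aligns a recognized word sequence against the correct transcription and counts the error types. The minimum-edit-distance alignment, the function names, and the assumption that both inputs are already segmented into word lists are choices made for this illustration; the scoring tool actually used in the experiment is not stated.

```python
from typing import List, Tuple


def align_counts(reference: List[str], hypothesis: List[str]) -> Tuple[int, int, int, int]:
    """Return (substitutions, deletions, insertions, correct) from one
    minimum-edit-distance alignment of two word sequences."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = edit distance between reference[:i] and hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + diff,  # match or substitution
                             cost[i - 1][j] + 1,         # deletion
                             cost[i][j - 1] + 1)         # insertion
    # Walk back from the end of both sequences to count each operation.
    sub = dele = ins = cor = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + diff:
                if diff:
                    sub += 1
                else:
                    cor += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return sub, dele, ins, cor


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (S + D + I) / (S + D + C), following equation (4)."""
    sub, dele, ins, cor = align_counts(reference, hypothesis)
    denominator = sub + dele + cor  # equals the number of reference words
    if denominator == 0:
        raise ValueError("reference transcription must not be empty")
    return (sub + dele + ins) / denominator
```

Several alignments can share the same minimum cost, so the individual counts may differ between tools, but the resulting WER is the same because the numerator is the edit distance and the denominator is the reference length.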
V. RESULTS AND DISCUSSION

TABLE I. COMPARISON RESULTS FOR PROPOSED AND TRADITIONAL VAD

Algorithm  Minimum Short  Added Silence  Average Word    Average Waiting  Waiting Time
           Pause (ms)     (ms)           Error Rate (%)  Time (s)         Reduction (%)
TRAD       NONE           NONE           11.09           9.00             -
PROP       40             NONE           11.77           7.22             19.77
PROP       80             NONE           12.03           7.18             20.29
PROP       120            NONE           11.72           7.50             16.70
PROP       160            NONE           10.62           7.92             12.00
PROP       200            NONE           10.85           8.33             7.52
PROP+      40             100            10.22           7.28             19.10
PROP+      80             100            10.38           7.43             17.51
PROP+      120            100            10.63           7.53             16.36
PROP+      160            100            10.21           7.75             13.96
PROP+      200            100            9.75            8.34             7.37

Table I shows that both PROP and PROP+ reduce the waiting time. Instead of using a long silence to determine the endpoint, a short pause of 40 ms or longer can be used. In terms of accuracy, however, PROP cannot perform better than TRAD: as the minimum short pause in PROP is reduced, the accuracy tends to decrease rapidly. Meanwhile, adding some silence in PROP+ improves accuracy significantly while still reducing the waiting time. We note that PROP+ performs better than both PROP and TRAD. The comparison of accuracy between PROP and PROP+ across short pause lengths is illustrated in Figure 7.

VI. CONCLUSIONS

In this paper, we propose a method that improves the traditional dual-threshold algorithm by reducing waiting time while maintaining efficiency and accuracy. Instead of using a long silence to determine the endpoint, the proposed algorithm can use a short pause as an endpoint. Moreover, we add silence to the short pause in the proposed algorithm to maintain accuracy. The results show that the proposed algorithm can reduce the waiting time by up to 19.10% without losing the accuracy of captions from speech recognition.

ACKNOWLEDGMENT

This work was supported by Speech and Audio Technology Laboratory, National Electronics and Computer Technology Center, Thailand.

REFERENCES

[1] P. K. Pal and S. Phadikar, “Modified Energy Based Method for Word Endpoints Detection of Continuous Speech Signal in Real World Environment,” 2015, pp. 381–385.
[2] J. Wu and X.-L. Zhang, “An efficient voice activity detection algorithm by combining statistical model and energy detection,” EURASIP Journal on Advances in Signal Processing, vol. 2011, p. 18, 2011.
[3] C. Jia and B. Xu, “An improved entropy-based endpoint detection algorithm,” International Symposium on Chinese Spoken, vol. 1, no. 1, pp. 1–4, 2002.
[4] A. Misra, “Speech / Nonspeech Segmentation in Web Videos,” Proceedings of InterSpeech 2012, 2012.
[5] T. Hughes and K. Mierle, “Recurrent Neural Networks for Voice Activity Detection,” Acoustics, Speech and Signal Processing . . . , pp. 7378–7382, 2013.
[6] F. Eyben, F. Weninger, S. Squartini, and B. Schuller, “Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2013, pp. 483–487.
[7] F. Bie, Z. Zhang, D. Wang, and T. F. Zheng, “DNN-based Voice Activity Detection for Speaker Recognition,” pp. 1–11, 2015.
[8] M. Moattar and M. Homayounpour, “A simple but efficient real-time voice activity detection algorithm,” European Signal Processing Conference (EUSIPCO), no. Eusipco, pp. 2549–2553, 2009.
[9] Q. Guo and N. Li, “A Improved Dual-threshold Speech Endpoint Detection Algorithm,” pp. 123–126, 2010.
[10] P. Podder, T. Khan, Zaman, and M. Haque Khan, “Comparative Performance Analysis of Hamming, Hanning and Blackman Window,” International Journal of Computer Applications, vol. 96, no. 18, pp. 1–7, 2014.
[11] P. Cotsomrong, T. Sunpetchniyom, S. Kasuriya, N. Thatphithakkul, and C. Wutiwiwatchai, “LOTUS: Large vocabulary Thai continuous speech recognition corpus,” NSTDA Annual Conference S&T in Thailand: Towards the Molecular Economy (NAC2005), 2005.