KST-2017_paper_68

Reducing Waiting Time in Automatic Captioned Relay Service using Short Pause in Voice Activity Detection

Abstract—The Automatic Captioned Relay Service is crucial for people with hearing disabilities or who are hard of hearing to communicate with others in daily life. This service uses Automatic Speech Recognition (ASR) to transcribe speech into captions. If the waiting time of non-streaming speech recognition can be reduced, the relay service will be able to support more users. In this paper, we propose a method for improving voice activity detection (VAD) that uses a short pause as an endpoint, based on the dual-threshold method. This method reduces the waiting time for caption results. The experimental results show that the proposed method reduces waiting time by 19.10% without losing caption accuracy when compared with the traditional VAD.

Keywords—voice activity detection, automatic captioned relay service, short pause, dual-threshold method

I. INTRODUCTION

Automatic Captioned Relay Service is a service that allows people with hearing disabilities or who are hard of hearing to use a special mobile application that enables the user to speak and simultaneously read captions of what others are saying. This service uses Automatic Speech Recognition (ASR) to transcribe speech into captions. The service then transmits the transcribed caption directly to the application display.

The goal of this paper is to improve how the relay service serves captions from non-streaming speech recognition, as close to real time as possible. To segment continuous speech in real time, the relay service uses Voice Activity Detection (VAD) to separate the signal into speech and non-speech portions. The speech portions are then transcribed by speech recognition. In continuous speech, traditional VADs produce long speech portions, which speech recognition takes a long time to transcribe. As a result, the user waits a long time for caption results.

Traditional VADs can be roughly divided into several categories: (1) time-domain methods such as short-time energy and zero-crossing rate [1] and log-energy [2]; (2) frequency-domain methods such as energy spectral entropy [3]; and (3) pattern recognition methods such as neural networks [4]–[7]. Some categories achieve high accuracy but require high computational cost, so they are not suitable for real-time applications.

In this paper, we propose a VAD algorithm that uses a short pause as an endpoint, based on a dual-threshold method, which reduces waiting time while maintaining the efficiency and accuracy of captions in non-streaming speech recognition.

Section II presents the basic idea of using a short pause as an endpoint in a VAD algorithm. Section III discusses the proposed VAD algorithm. The experiment is described in Section IV. Section V presents the experimental results and discussion, and Section VI concludes the work.

II. BASIC IDEA OF THE ALGORITHM

Previous research most commonly uses silence between words of over 100 ms for better accuracy [8]. However, the waiting time increases accordingly. In this paper, we focus on using a short pause between words to reduce the waiting time for caption results. A pause between words longer than 20 ms is defined as a short pause. The two main questions of this paper are:

1) How many short pauses can be found in a sentence?
2) Is it possible to use a short pause to reduce the waiting time while maintaining accuracy?

Fig. 1. Short pauses can usually be found in a sentence

In Figure 1, the dashed lines represent short pauses, which can usually be found in a sentence. To improve the traditional dual-threshold algorithm [9], a short pause is used in the endpoint determination of the proposed algorithm to reduce the waiting time for caption results.

III. A SHORT PAUSE VAD

In this section, we design a method that uses the short pause in VAD instead of using only silence segments. The block diagram of the proposed algorithm is presented and each process is described.

The traditional dual-threshold algorithm uses two features, short-time energy (STE) and zero-crossing rate (ZCR), to cope with noisy environments. To process the signal, we first make it approximately stationary by framing. A window is then applied to each frame, and the two features are calculated. The energy and ZCR values are then used to determine the startpoint and endpoint. The block diagram of the proposed VAD is shown in Figure 2.

Fig. 2. The flow diagram of the proposed VAD algorithm, which is used for the determination of silence and short pause

A. Framing

By nature, the speech signal is non-stationary and changes quite rapidly over time. However, it can be assumed that the speech signal is stationary over a short time range (10 ms to 30 ms). Framing the speech signal into small frames therefore satisfies the stationarity assumption. In our experiment, we fixed the frame size to 20 ms.

B. Features Extraction

1) Short-time energy

Short-time energy is the most common feature for speech and non-speech detection. It represents the changes in amplitude of the speech signal, which can be used to distinguish voice from silence. However, the accuracy of short-time energy decreases rapidly in noisy environments. Figure 3 shows the original speech signal, and Figure 4 shows the energy extracted from it.

Fig. 3. Original speech signal

Fig. 4. Energy waveform extracted using short-time energy

To obtain the short-time energy (E) of each frame, let the speech signal be y(m), let N represent the frame length, and let w(m) be the Hamming window [10]. X_n(m) is the n-th frame after windowing:

$$X_n(m) = w(m) \cdot y(m), \quad 0 \le m \le N-1 \tag{1}$$

The short-time energy of the n-th frame is then defined as:

$$E_n = \sum_{m=0}^{N-1} [X_n(m)]^2 \tag{2}$$

2) Zero-crossing rate

The zero-crossing rate (ZCR) is another popular characteristic of the speech signal. It represents the number of times consecutive sample points change sign:

$$Z_n = \frac{1}{2} \sum_{m=0}^{N-1} \left| \mathrm{sign}[X_n(m)] - \mathrm{sign}[X_n(m-1)] \right| \tag{3}$$

where

$$\mathrm{sign}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

Generally, ZCR is used as a secondary parameter to improve the accuracy of VAD when speech occurs in a noisy environment. The ZCR is higher in unvoiced speech or silence segments and lower in voiced segments. The ZCR extracted from the original speech signal is illustrated in Figure 5.

Fig. 5. Zero-crossing rate of speech signal in noisy environment
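The framing and feature extraction steps of equations (1) to (3) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the non-overlapping framing, and the choice to start the ZCR sum at m = 1 (so that X_n(m-1) is always defined) are assumptions made for illustration. Only the Python standard library is used.

```python
import math

def frame_signal(y, sample_rate, frame_ms=20):
    """Split samples into non-overlapping frames (assumption: no overlap)."""
    n = int(sample_rate * frame_ms / 1000)  # frame length N in samples
    return [y[i:i + n] for i in range(0, len(y) - n + 1, n)]

def hamming(n):
    """Hamming window w(m) for m = 0 .. n-1."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n - 1)) for m in range(n)]

def short_time_energy(frame):
    """Equation (2): sum of squared windowed samples X_n(m) = w(m) * y(m)."""
    w = hamming(len(frame))
    return sum((wm * ym) ** 2 for wm, ym in zip(w, frame))

def sign(x):
    """sign[x] as defined with equation (3): 1 for x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1

def zero_crossing_rate(frame):
    """Equation (3), summed from m = 1 (assumption) to avoid X_n(-1)."""
    w = hamming(len(frame))
    x = [wm * ym for wm, ym in zip(w, frame)]
    return sum(abs(sign(x[m]) - sign(x[m - 1])) for m in range(1, len(x))) / 2

# Usage: 100 ms at 16 kHz, silent first half, 440 Hz tone in the second half.
rate = 16000
signal = [0.0] * 800 + [math.sin(2 * math.pi * 440 * t / rate) for t in range(800)]
frames = frame_signal(signal, rate)
energies = [short_time_energy(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
```

With a 20 ms frame at an assumed 16 kHz sampling rate, each frame holds 320 samples. The silent frames have zero energy while the tone frames do not, which is the contrast the energy threshold relies on.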
C. Determine Silence and Short Pause

Section II suggests that a short pause between words can be used as an endpoint, which can reduce the waiting time of the VAD. In addition to reducing the waiting time, the accuracy of the results needs to be maintained.

Therefore, the proposed VAD algorithm consists of two main processes: (1) determining a normal silence of over 100 ms, which gives the most accurate speech recognition results; and (2) validating a pause between words of more than 20 ms, which is chosen as the short pause and can be used to find an appropriate endpoint.

To locate the silence and short pause, we use the four-state transition diagram illustrated in Figure 6.

Fig. 6. State transition diagram for determining silence and short pause

As shown in Figure 6, the four states are silence, maybe-speech, speech, and leaving-speech. We assume the silence state is the start state and any state can be a final state. The transition conditions are on the edges between states, and the actions for each condition are in brackets. In the following discussion, F denotes the output values from feature extraction and T denotes the threshold values; the outputs (startpoint, silence, and the list of short pauses) are given as detected frame numbers. "count" is the number of speech frames detected, "pause" is the number of pause frames found in the speech, and "sp" is the minimum short pause length. Moreover, "speech" is the minimum speech length and "silence" is the minimum pause length that can be defined as a silence; both parameters are set to 100 ms.

To illustrate the two main processes in the determination of silence and short pause, we focus on the leaving-speech state, in which the algorithm detects a weak signal during speech, which we call a "pause". If the pause length is greater than the "silence" threshold, the algorithm decides that the pause is a silence and moves to the silence state. If the pause length is lower than "silence" but greater than "sp", the pause is recorded as a short pause, which can be used to find an appropriate endpoint. Finally, the state moves back to the speech state.

D. Endpoint Detection

After the previous process finds a pause that could be used as an endpoint, this process decides whether to use it. Let θ be the minimum waiting time allowed and ∆d the actual processing time. The algorithm selects either the silence or the short pause as the endpoint as follows:

• If a silence is found within θ, the algorithm uses the silence as the endpoint.
• If ∆d > (θ + θ2 ) but no silence segment is found, the algorithm uses the latest short pause as the endpoint, and 100 ms of silence is added before and after the endpoint for better accuracy. The performance for different lengths of added silence is compared in the experimental results.

IV. EXPERIMENT

In the experiment, we divided the test into two types: the traditional dual-threshold method (TRAD) and the proposed algorithm (PROP). For the proposed algorithm, we studied how the length of the short pause affects the waiting time and accuracy of captions. Moreover, a variant of the proposed algorithm with 100 ms of added silence (called PROP+) was included for comparison. To confirm the effectiveness of the proposed algorithm, the thresholds for short-time energy (∆E) and zero-crossing rate (∆Z) were set to the same values for all algorithms: ∆E = 0.5, ∆Z = 20.

The thresholds for short-time energy and zero-crossing rate were chosen by observation, which is the easier approach. An automated threshold for VAD is planned as a future improvement.

The dataset used in our experiment is from LOTUS [11], a large vocabulary continuous speech recognition (LVCSR) corpus. It consists of speech from 24 people, 12 women and 12 men. To test in a noisy environment, office noise is included in the corpus for measuring the performance of the algorithms.

To measure the effectiveness of each algorithm, we studied two factors: accuracy and waiting time. For accuracy, we adopted the Word Error Rate (WER) to measure the accuracy of the caption results from speech recognition. The WER is the number of substitution, deletion, and insertion errors divided by the number of words in the correct transcription, as in equation (4):

$$\mathrm{WER} = \frac{\text{Substitution} + \text{Deletion} + \text{Insertion}}{\text{Substitution} + \text{Deletion} + \text{Correct}} \tag{4}$$

For waiting time, we measured the average time users wait for the caption results. This includes the VAD detection time and the recognition time until the transcribed output reaches the user.
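As a minimal sketch of equation (4), the WER can be computed from a word-level edit distance. The paper does not state how substitutions, deletions, and insertions were counted; the sketch assumes the usual minimum edit distance alignment between the reference and the recognized caption, and the function name `word_error_rate` is hypothetical. The denominator Substitution + Deletion + Correct equals the number of words in the reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N the number of reference words.

    S + D + I is taken as the minimum word-level edit distance
    (assumption: the scoring method is not specified in the paper).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "a x c" against the reference "a b c d" counts one substitution and one deletion over four reference words. Because insertions appear only in the numerator, the WER can exceed 100% when the hypothesis contains many extra words.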
V. RESULTS AND DISCUSSION

TABLE I. COMPARISON RESULTS FOR PROPOSED AND TRADITIONAL VAD

Algorithm  Minimum Short  Added Silence  Average Word    Average Waiting  Waiting Time
           Pause (ms)     (ms)           Error Rate (%)  Time (s)         Reduction (%)
TRAD       NONE           NONE           11.09           9.00             -
PROP       40             NONE           11.77           7.22             19.77
PROP       80             NONE           12.03           7.18             20.29
PROP       120            NONE           11.72           7.50             16.70
PROP       160            NONE           10.62           7.92             12.00
PROP       200            NONE           10.85           8.33             7.52
PROP+      40             100            10.22           7.28             19.10
PROP+      80             100            10.38           7.43             17.51
PROP+      120            100            10.63           7.53             16.36
PROP+      160            100            10.21           7.75             13.96
PROP+      200            100            9.75            8.34             7.37

Table I shows that both PROP and PROP+ reduce waiting time. Instead of using a long silence to determine the endpoint, a short pause of as little as 40 ms can be used. In terms of accuracy, however, PROP does not perform better than TRAD when the minimum short pause is small (40 ms to 120 ms in Table I): as the minimum short pause in PROP is reduced, the accuracy tends to decrease. Adding silence in PROP+ improves accuracy while still reducing waiting time. We note that PROP+ performs better than both PROP and TRAD. The comparison of accuracy between PROP and PROP+ for each short pause length is illustrated in Figure 7.

Fig. 7. Comparison of accuracy with and without added silence in short pause

Finally, we note that the dual-threshold method is simple and has a small computational cost for use in real-time VAD. However, this algorithm is not satisfactory in some respects. For use in noisy environments, the algorithm needs to be more robust to noise.

VI. CONCLUSIONS

In this paper, we propose a method that improves the traditional dual-threshold algorithm by reducing waiting time while maintaining efficiency and accuracy. Instead of using a long silence to determine the endpoint, the proposed algorithm can use a short pause as an endpoint. Moreover, we added silence around the short pause in the proposed algorithm to maintain accuracy. The results show that the proposed algorithm can reduce the waiting time by up to 19.10% without losing the accuracy of captions from speech recognition.

ACKNOWLEDGMENT

This work was supported by Speech and Audio Technology Laboratory, National Electronics and Computer Technology Center, Thailand.

REFERENCES

[1] P. K. Pal and S. Phadikar, “Modified Energy Based Method for Word Endpoints Detection of Continuous Speech Signal in Real World Environment,” 2015, pp. 381–385.
[2] J. Wu and X.-L. Zhang, “An efficient voice activity detection algorithm by combining statistical model and energy detection,” EURASIP Journal on Advances in Signal Processing, vol. 2011, p. 18, 2011.
[3] C. Jia and B. Xu, “An improved entropy-based endpoint detection algorithm,” International Symposium on Chinese Spoken, vol. 1, no. 1, pp. 1–4, 2002.
[4] A. Misra, “Speech / Nonspeech Segmentation in Web Videos,” Proceedings of InterSpeech 2012, 2012.
[5] T. Hughes and K. Mierle, “Recurrent Neural Networks for Voice Activity Detection,” Acoustics, Speech and Signal Processing . . . , pp. 7378–7382, 2013.
[6] F. Eyben, F. Weninger, S. Squartini, and B. Schuller, “Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2013, pp. 483–487.
[7] F. Bie, Z. Zhang, D. Wang, and T. F. Zheng, “DNN-based Voice Activity Detection for Speaker Recognition,” pp. 1–11, 2015.
[8] M. Moattar and M. Homayounpour, “A simple but efficient real-time voice activity detection algorithm,” European Signal Processing Conference (EUSIPCO), no. Eusipco, pp. 2549–2553, 2009.
[9] Q. Guo and N. Li, “A Improved Dual-threshold Speech Endpoint Detection Algorithm,” pp. 123–126, 2010.
[10] P. Podder, T. Khan, Zaman, and M. Haque Khan, “Comparative Performance Analysis of Hamming, Hanning and Blackman Window,” International Journal of Computer Applications, vol. 96, no. 18, pp. 1–7, 2014.
[11] P. Cotsomrong, T. Sunpetchniyom, S. Kasuriya, N. Thatphithakkul, and C. Wutiwiwatchai, “LOTUS: Large vocabulary Thai continuous speech recognition corpus,” NSTDA Annual Conference S&T in Thailand: Towards the Molecular Economy (NAC2005), 2005.
