Contextualized Automatic Speech Recognition With Dynamic Vocabulary
Yui Sudo¹, Yosuke Fukumoto¹, Muhammad Shakeel¹, Yifan Peng², Shinji Watanabe²
¹Honda Research Institute Japan Co., Ltd., Japan
²Carnegie Mellon University, USA
yui.sudo@jp.honda-ri.com
2.1. Audio encoder

The audio encoder converts the input audio feature sequence X into a sequence of hidden state vectors H as follows:

    H = AudioEnc(X).   (1)
2.2. Decoder
Given H generated by the audio encoder in Eq. (1) and the previously estimated token sequence y_{0:s-1} = [y_0, ..., y_{s-1}], the decoder estimates the next token y_s recursively as follows:

    P(y_s | y_{0:s-1}, X) = Decoder(y_{0:s-1}, H).   (2)

Specifically, the input token sequence y_{0:s-1} is first converted to the embedding vector sequence E_{0:s-1} = [e_0, ..., e_{s-1}] ∈ R^{d×s}:

    E_{0:s-1} = Embedding(y_{0:s-1}).   (3)

Then, E_{0:s-1} is input to the M_d transformer blocks with the hidden state vectors H in Eq. (1) to generate a hidden state vector u_s as follows:

    u_s = Transformer(E_{0:s-1}, H).   (4)

Subsequently, a token-wise score α^n = [α^n_1, ..., α^n_K] over the K normal tokens and the corresponding probability are calculated as follows:

    α^n = Linear(u_s),   (5)

    P(y_s | y_{0:s-1}, X) = Softmax(α^n).   (6)

By repeating these processes recursively, the posterior probability is formulated as follows:

    P(y_{0:S} | X) = ∏_{s=1}^{S} P(y_s | y_{0:s-1}, X),   (7)

where S is the total number of tokens. Model parameters are optimized by minimizing the negative log-likelihood:

    L = − log P(y_{0:S} | X).   (8)

The embedding and output layers are extended by the proposed biasing method described in Section 3.2.
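As a concrete illustration of Eqs. (3)-(8), the following PyTorch sketch implements the embedding, transformer blocks, output layer, and sequence-level negative log-likelihood; the module and variable names (AttnDecoder, num_blocks, the dummy shapes) are our own assumptions, not the authors' implementation.

```python
# A minimal sketch of the attention decoder in Eqs. (3)-(8); illustrative only.
import torch
import torch.nn as nn

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256, num_blocks: int = 6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d)               # Eq. (3)
        layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_blocks)     # Eq. (4)
        self.output = nn.Linear(d, vocab_size)                     # Eq. (5)

    def forward(self, y: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # y: (batch, s) token ids; h: (batch, T, d) audio encoder states H
        e = self.embedding(y)                                      # Eq. (3)
        mask = nn.Transformer.generate_square_subsequent_mask(y.size(1))
        u = self.blocks(e, h, tgt_mask=mask)                       # Eq. (4)
        return self.output(u)                                      # scores α^n

# Sequence-level NLL training objective, Eqs. (6)-(8).
decoder = AttnDecoder(vocab_size=5000)
h = torch.randn(2, 100, 256)            # dummy audio encoder output H
y = torch.randint(0, 5000, (2, 12))     # dummy reference tokens y_{0:S}
scores = decoder(y[:, :-1], h)          # predict y_1..y_S from y_0..y_{S-1}
loss = nn.CrossEntropyLoss()(scores.reshape(-1, 5000), y[:, 1:].reshape(-1))
```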
Figure 1: Overall architecture of the proposed method, including the audio encoder, bias encoder, and bias decoder with extended embedding and output layers.

3. Proposed method

Figure 1 shows the overall architecture of the proposed method, which comprises the audio encoder, bias encoder, and decoder with extended embedding and output layers. These components are described in the following subsections.

3.1. Bias encoder

Similar to [22], the bias encoder comprises an embedding layer, M_e transformer blocks, a mean pooling layer, and a bias list B = {b_1, ..., b_N}, where b_n ∈ V^n is the I-length subword token sequence of the n-th bias phrase (e.g., [<N>, <el>, <ly>]). After converting the bias list B into a matrix B ∈ R^{I_max×N} through zero padding based on the maximum token length I_max in B, the embedding layer and the M_e transformer blocks extract the high-level representation G ∈ R^{d×I_max×N} as follows:

    G = TransformerEnc(Embedding(B)).   (9)

Then, a mean pooling layer extracts phrase-level embedding vectors V = [v_1, ..., v_N] ∈ R^{d×N} as follows:

    V = MeanPool(G).   (10)
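The following is a minimal PyTorch sketch of Eqs. (9) and (10), assuming the bias phrases are right-zero-padded to I_max and that the mean pooling is taken over non-padded positions only; the class and parameter names (BiasEncoder, pad_id) are illustrative.

```python
# A minimal sketch of the bias encoder in Eqs. (9)-(10); illustrative only.
import torch
import torch.nn as nn

class BiasEncoder(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256, num_blocks: int = 6,
                 pad_id: int = 0):
        super().__init__()
        self.pad_id = pad_id
        self.embedding = nn.Embedding(vocab_size, d, padding_idx=pad_id)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_blocks)

    def forward(self, bias: torch.Tensor) -> torch.Tensor:
        # bias: (N, I_max) zero-padded token ids of the N bias phrases
        pad_mask = bias.eq(self.pad_id)                      # (N, I_max)
        g = self.blocks(self.embedding(bias),
                        src_key_padding_mask=pad_mask)       # Eq. (9)
        # Mean pooling over non-padded positions only, Eq. (10).
        keep = (~pad_mask).unsqueeze(-1).float()             # (N, I_max, 1)
        v = (g * keep).sum(1) / keep.sum(1).clamp(min=1.0)   # (N, d)
        return v                                             # V = [v_1..v_N]
```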
3.2. Bias decoder with dynamic vocabulary

To avoid the complexity of learning dependencies within the bias phrases, we introduce a dynamic vocabulary V^b = {<b_1>, ..., <b_N>}, where the phrase-level bias tokens represent the bias phrases as single entities for the N bias phrases in the bias list B. Unlike Eq. (2), the bias decoder estimates the next token y'_s ∈ {V^n ∪ V^b} given H, V in Eqs. (1) and (10), and y'_{0:s-1} as follows:

    P(y'_s | y'_{0:s-1}, X, B) = BiasDecoder(y'_{0:s-1}, H, V).   (11)

For example, if a bias phrase "Nelly" exists in the bias list, the bias decoder outputs the corresponding bias token [<Nelly>] rather than the normal token sequence [<N>, <el>, <ly>].

The bias decoder comprises an extended embedding layer, M_d transformer blocks, and an extended output layer. First, the input token sequence y'_{0:s-1} is converted to the embedding vector sequence E'_{0:s-1} = [e'_0, ..., e'_{s-1}] ∈ R^{d×s}. Unlike Eq. (3), if the input token y'_{s-1} is a bias token, the corresponding bias embedding v_n is extracted from V (Figure 2a); otherwise, the embedding layer is used with a linear layer as follows:

    e'_{s-1} = { Linear(Extract(V, y'_{s-1}))   (y'_{s-1} ∈ V^b)
               { Linear(Embedding(y'_{s-1}))    (y'_{s-1} ∈ V^n).   (12)
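A minimal sketch of the extended embedding layer in Eq. (12) follows, under the assumption that token ids below K index the normal vocabulary V^n and ids K, ..., K+N-1 index the dynamic bias tokens; whether the two branches share one linear layer is not specified here, so the sketch uses separate projections.

```python
# A minimal sketch of the extended embedding in Eq. (12); illustrative only.
import torch
import torch.nn as nn

class ExtendedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256):
        super().__init__()
        self.K = vocab_size                      # size of normal vocabulary V^n
        self.embedding = nn.Embedding(vocab_size, d)
        self.linear_n = nn.Linear(d, d)          # linear layer, normal branch
        self.linear_b = nn.Linear(d, d)          # linear layer, bias branch

    def forward(self, y: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # y: (batch, s) token ids over V^n ∪ V^b; v: (N, d) phrase vectors V
        is_bias = y >= self.K
        # Normal branch: Linear(Embedding(y)), lower case of Eq. (12).
        e_n = self.linear_n(self.embedding(y.clamp(max=self.K - 1)))
        # Bias branch: Linear(Extract(V, y)), upper case of Eq. (12).
        bias_idx = (y - self.K).clamp(min=0)
        e_b = self.linear_b(v[bias_idx])
        return torch.where(is_bias.unsqueeze(-1), e_b, e_n)
```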
Then, the M_d transformer blocks convert E'_{0:s-1} to the hidden state vector u'_s as in Eq. (4). In addition to the normal token score α^n in Eq. (5), the bias token score α^b = [α^b_1, ..., α^b_N] is calculated using an inner product with linear layers (Figure 2b):

    α^b = Linear(u'_s) Linear(V^T) / √d.   (13)

After α^b is concatenated with α^n, resulting in α = [α^n_1, ..., α^n_K, α^b_1, ..., α^b_N], the softmax function calculates the token-wise probability, including the bias tokens, as follows:

    P(y'_s | y'_{0:s-1}, X, B) = Softmax(Concat(α^n, α^b)).   (14)
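Eqs. (13) and (14) can be sketched in a few lines; the two linear projections applied to u'_s and V are our assumption about how the inner product is parameterized, and the shapes are illustrative.

```python
# A minimal sketch of the extended output layer in Eqs. (13)-(14): normal
# scores come from a static linear layer, bias scores from a scaled inner
# product against the phrase vectors V, so the output size tracks N.
import math
import torch
import torch.nn as nn

d, K, N = 256, 5000, 100
linear_out = nn.Linear(d, K)      # normal token scores α^n, Eq. (5)
linear_q = nn.Linear(d, d)        # query projection for Eq. (13)
linear_k = nn.Linear(d, d)        # key projection for Eq. (13)

u = torch.randn(1, d)             # decoder hidden state u'_s
v = torch.randn(N, d)             # phrase-level bias vectors V

alpha_n = linear_out(u)                                   # (1, K)
alpha_b = linear_q(u) @ linear_k(v).T / math.sqrt(d)      # (1, N), Eq. (13)
probs = torch.softmax(torch.cat([alpha_n, alpha_b], dim=-1), dim=-1)  # Eq. (14)
```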
Similar to Eqs. (7) and (8), the posterior probability and the loss function are formulated as follows:

    P(y'_{0:S'} | X, B) = ∏_{s=1}^{S'} P(y'_s | y'_{0:s-1}, X, B),   (15)

    L' = − log P(y'_{0:S'} | X, B),   (16)

where S' represents the total number of tokens based on the proposed bias tokens. Note that Eqs. (13) and (14) are flexible with respect to the bias list size N, and the proposed method is optimized using Eq. (16) without auxiliary losses.

The proposed method can be applied to other E2E-ASR architectures, such as CTC, RNN-T, and their hybrids, including streaming and multilingual systems [8–10, 33–37], because it only extends the embedding and output layers in addition to the bias encoder (Figure 3).

Figure 3: Various architectures utilized in the proposed method: (a) CTC-based biasing; (b) RNN-T-based biasing.
3.3. Training

During the training phase, a bias list B is created randomly from the reference transcriptions for each batch, where N_utt bias phrases are selected per utterance, each I tokens in length. This process yields a total of N bias phrases, calculated as N_utt × batch size. Once the bias list B is defined, the corresponding reference transcriptions are replaced with the bias tokens. For example, if "<N>, <el>, <ly>" is extracted from the complete utterance y = [<Hi>, <N>, <el>, <ly>], the reference transcription is modified to y' = [<Hi>, <Nelly>].
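The following hedged sketch mimics the bias-list construction and reference rewriting described above; the span-sampling details (uniform lengths, first-match replacement) and all names (make_bias_list, n_utt, max_len) are our assumptions.

```python
# A minimal sketch of the random bias-list construction in Section 3.3.
import random

def make_bias_list(batch_refs, n_utt=2, min_len=2, max_len=10):
    """Pick n_utt random subword spans per utterance as bias phrases."""
    bias_list = []
    for ref in batch_refs:  # ref: list of subword tokens
        for _ in range(n_utt):
            if len(ref) < min_len:
                continue
            length = random.randint(min_len, min(max_len, len(ref)))
            start = random.randint(0, len(ref) - length)
            bias_list.append(tuple(ref[start:start + length]))
    return bias_list

def apply_bias_tokens(ref, bias_list):
    """Replace matched spans with phrase-level bias tokens <b_n>."""
    out, i = [], 0
    while i < len(ref):
        hit = next((n for n, b in enumerate(bias_list)
                    if tuple(ref[i:i + len(b)]) == b), None)
        if hit is not None:
            out.append(f"<b{hit}>")            # single dynamic token
            i += len(bias_list[hit])
        else:
            out.append(ref[i])
            i += 1
    return out

refs = [["<Hi>", "<N>", "<el>", "<ly>"]]
bias = [("<N>", "<el>", "<ly>")]
print(apply_bias_tokens(refs[0], bias))        # ['<Hi>', '<b0>']
```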
3.4. Bias weight during inference

Considering practicality, we introduce a bias weight into Eq. (14) to avoid over- or under-biasing during inference as follows:

    WeightSoftmax_i(α, w) = w_i exp(α_i) / Σ_{j=1}^{K+N} w_j exp(α_j),   (17)

where w = [w_1, ..., w_{K+N}] and i represent a weight vector and its index for α = [α^n_1, ..., α^n_K, α^b_1, ..., α^b_N], respectively. The same bias weight µ is applied to all bias tokens as follows:

    w_i = { 1.0   (i ≤ K)
          { µ     (i > K).   (18)
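A direct implementation of Eqs. (17) and (18) might look as follows; the max-subtraction is only for numerical stability and does not change the result.

```python
# A minimal sketch of the weighted softmax in Eqs. (17)-(18): a single weight
# mu rescales the unnormalized probabilities of the N bias tokens at inference.
import torch

def weight_softmax(alpha: torch.Tensor, K: int, mu: float = 0.8) -> torch.Tensor:
    # alpha: (..., K + N) concatenated scores [α^n ; α^b] from Eq. (14)
    alpha = alpha - alpha.max(dim=-1, keepdim=True).values  # numerical stability
    w = torch.ones_like(alpha)
    w[..., K:] = mu                      # Eq. (18): bias tokens get weight mu
    num = w * torch.exp(alpha)           # w_i * exp(alpha_i)
    return num / num.sum(dim=-1, keepdim=True)   # Eq. (17)

scores = torch.randn(1, 5000 + 100)     # K = 5000 normal + N = 100 bias tokens
probs = weight_softmax(scores, K=5000, mu=0.8)
```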
4. Experiment

4.1. Experimental setup

To demonstrate the adaptability of the proposed method, we applied it to offline CTC/attention [8] and streaming RNN-T models. The input features are 80-dimensional Mel filterbanks with a window size of 512 samples and a hop length of 160 samples. Then, SpecAugment is applied. The audio encoder comprises two convolutional layers with a stride of two and a 256-dimensional linear projection layer, followed by 12 conformer layers with 1024 linear units and layer normalization. For the streaming RNN-T, the audio encoder is processed blockwise [38] with a block size of 800 ms and a look-ahead of 320 ms. The bias encoder has six transformer blocks with 1024 linear units. Regarding the bias decoder, the offline CTC/attention model has six transformer blocks with 2048 linear units, and the streaming RNN-T model has a single long short-term memory layer with a hidden size of 256 and a linear layer with a joint size of 320 for the prediction and joint networks. The attention layers in the audio encoder, the bias encoder, and the bias decoder are four-head multihead attentions with a dimension d of 256. The offline CTC/attention and streaming RNN-T models have 40.58 M and 31.38 M parameters, respectively, including the bias encoders. The bias weight µ (Section 3.4) is set to 0.8 (discussed further in Section 4.4). During training, a bias list B is created randomly for each batch with N_utt = [2-10] and I = [2-10] (Section 3.3). The proposed models are trained for 150 epochs at learning rates of 0.0025 and 0.002 for the CTC/attention-based and RNN-T-based systems, respectively.

The Librispeech-960 corpus [39] is used to evaluate the proposed method, with ESPnet as the E2E-ASR toolkit [40]. The proposed method is evaluated in terms of word error rate (WER), bias phrase WER (B-WER), and unbiased phrase WER (U-WER) [25]. Note that insertion errors are counted toward the B-WER if the inserted phrases are present in the bias list; otherwise, insertion errors are counted toward the U-WER.

4.2. Results on offline CTC/attention-based system

Table 1 shows the results of the offline CTC/attention-based systems obtained on the Librispeech-960 dataset for different bias list sizes N. With a bias list size N > 0, the proposed method improves the B-WER considerably despite a small increase in the U-WER, which results in a substantial improvement in the overall WER. While the B-WER and U-WER tend to deteriorate with larger N, the proposed method remains superior to other DB techniques across all bias list sizes. Although the proposed method does not perform as well as the baseline when N = 0, this is not a particularly limiting drawback considering that users typically register important keywords in the bias list.

Table 1: WER results of offline CTC/attention-based systems obtained on Librispeech-960 (U-WER/B-WER).

                            N = 0 (no-bias)           N = 100                   N = 500                   N = 1000
Model                       test-clean  test-other    test-clean  test-other    test-clean  test-other    test-clean  test-other
Baseline [8]                2.57        5.98          2.57        5.98          2.57        5.98          2.57        5.98
(CTC/attention)             (1.5/10.9)  (4.0/23.1)    (1.5/10.9)  (4.0/23.1)    (1.5/10.9)  (4.0/23.1)    (1.5/10.9)  (4.0/23.1)
CPPNet [18]                 4.29        9.16          3.40        7.77          3.68        8.31          3.81        8.75
                            (2.6/18.3)  (5.9/37.5)    (2.6/10.4)  (6.0/23.0)    (2.8/10.9)  (6.5/24.3)    (2.9/11.4)  (6.9/25.3)
Attention-based DB          5.05        8.81          2.75        5.60          3.21        6.28          3.47        7.34
+ BPB beam search [22]      (3.9/14.1)  (6.6/27.9)    (2.3/6.0)   (4.9/12.0)    (2.7/7.0)   (5.5/13.5)    (3.0/7.7)   (6.4/15.8)
Proposed                    3.16        6.95          1.80        4.63          1.92        4.81          2.01        4.97
                            (1.9/13.8)  (4.6/27.5)    (1.7/2.9)   (4.3/7.1)     (1.8/3.1)   (4.5/7.9)     (1.9/3.3)   (4.6/8.5)
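For clarity, the insertion-attribution rule for B-WER/U-WER described in Section 4.1 can be sketched as follows; the error-tuple format is a hypothetical simplification of an actual alignment-based scorer, and only the error-count numerators are shown.

```python
# A minimal sketch of the B-WER/U-WER attribution rule in Section 4.1.
def attribute_errors(errors, bias_words):
    """errors: list of (kind, word), kind in {'sub', 'del', 'ins'}.
    For 'ins', word is the inserted hypothesis word; otherwise it is the
    reference word the error occurred on."""
    b_err = u_err = 0
    for kind, word in errors:
        # Insertions count toward B-WER only if the inserted word is in the
        # bias list; substitutions/deletions follow the reference word.
        if word in bias_words:
            b_err += 1
        else:
            u_err += 1
    return b_err, u_err

print(attribute_errors([("ins", "nelly"), ("sub", "hi")], {"nelly"}))  # (1, 1)
```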