Contextualized Automatic Speech Recognition With Dynamic Vocabulary
Yui Sudo¹, Yosuke Fukumoto¹, Muhammad Shakeel¹, Yifan Peng², Shinji Watanabe²
¹Honda Research Institute Japan Co., Ltd., Japan
²Carnegie Mellon University, USA
yui.sudo@jp.honda-ri.com
2.1. Audio encoder

The audio encoder converts the input audio feature sequence X into a sequence of hidden state vectors H as follows:

    H = AudioEnc(X).   (1)
2.2. Decoder
Given H generated by the audio encoder in Eq. (1) and the previously estimated token sequence y_{0:s-1} = [y_0, ..., y_{s-1}], the decoder estimates the next token y_s recursively as follows:

    P(y_s | y_{0:s-1}, X) = Decoder(y_{0:s-1}, H).   (2)

Specifically, the input token sequence y_{0:s-1} is first converted to the embedding vector sequence E_{0:s-1} = [e_0, ..., e_{s-1}] ∈ R^{d×s}:

    E_{0:s-1} = Embedding(y_{0:s-1}).   (3)

Then, E_{0:s-1} is input to the M_d transformer blocks with the hidden state vectors H in Eq. (1) to generate a hidden state vector u_s as follows:

    u_s = Transformer(E_{0:s-1}, H).   (4)

Subsequently, a token-wise score α^n = [α^n_1, ..., α^n_K] over the K normal tokens and the corresponding probability are calculated as follows:

    α^n = Linear(u_s),   (5)

    P(y_s | y_{0:s-1}, X) = Softmax(α^n).   (6)

By repeating these processes recursively, the posterior probability is formulated as follows:

    P(y_{0:S} | X) = ∏_{s=1}^{S} P(y_s | y_{0:s-1}, X),   (7)

where S is the total number of tokens. Model parameters are optimized by minimizing the negative log-likelihood:

    L = − log P(y_{0:S} | X).   (8)

The embedding and output layers are extended by the proposed biasing method described in Section 3.2.
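As a concrete illustration of Eqs. (3)-(8), the following PyTorch sketch implements the embedding, transformer blocks, output layer, and sequence-level negative log-likelihood; the module and variable names (AttnDecoder, num_blocks, the dummy shapes) are our own assumptions, not the authors' implementation.

```python
# A minimal sketch of the attention decoder in Eqs. (3)-(8); illustrative only.
import torch
import torch.nn as nn

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256, num_blocks: int = 6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d)               # Eq. (3)
        layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_blocks)     # Eq. (4)
        self.output = nn.Linear(d, vocab_size)                     # Eq. (5)

    def forward(self, y: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # y: (batch, s) token ids; h: (batch, T, d) audio encoder states H
        e = self.embedding(y)                                      # Eq. (3)
        mask = nn.Transformer.generate_square_subsequent_mask(y.size(1))
        u = self.blocks(e, h, tgt_mask=mask)                       # Eq. (4)
        return self.output(u)                                      # scores α^n

# Sequence-level NLL training objective, Eqs. (6)-(8).
decoder = AttnDecoder(vocab_size=5000)
h = torch.randn(2, 100, 256)            # dummy audio encoder output H
y = torch.randint(0, 5000, (2, 12))     # dummy reference tokens y_{0:S}
scores = decoder(y[:, :-1], h)          # predict y_1..y_S from y_0..y_{S-1}
loss = nn.CrossEntropyLoss()(scores.reshape(-1, 5000), y[:, 1:].reshape(-1))
```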
Figure 1: Overall architecture of the proposed method, including the audio encoder, bias encoder, and bias decoder with extended embedding and output layers.

3. Proposed method

Figure 1 shows the overall architecture of the proposed method, which comprises the audio encoder, bias encoder, and decoder with extended embedding and output layers. These components are described in the following subsections.

3.1. Bias encoder

Similar to [22], the bias encoder comprises an embedding layer, M_e transformer blocks, a mean pooling layer, and a bias list B = {b_1, ..., b_N}, where b_n ∈ V^n is the I-length subword token sequence of the n-th bias phrase (e.g., [<N>, <el>, <ly>]). After converting the bias list B into a matrix B ∈ R^{I_max×N} through zero padding based on the maximum token length I_max in B, the embedding layer and the M_e transformer blocks extract the high-level representation G ∈ R^{d×I_max×N} as follows:

    G = TransformerEnc(Embedding(B)).   (9)

Then, a mean pooling layer extracts phrase-level embedding vectors V = [v_1, ..., v_N] ∈ R^{d×N} as follows:

    V = MeanPool(G).   (10)
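The following is a minimal PyTorch sketch of Eqs. (9) and (10), assuming the bias phrases are right-zero-padded to I_max and that the mean pooling is taken over non-padded positions only; the class and parameter names (BiasEncoder, pad_id) are illustrative.

```python
# A minimal sketch of the bias encoder in Eqs. (9)-(10); illustrative only.
import torch
import torch.nn as nn

class BiasEncoder(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256, num_blocks: int = 6,
                 pad_id: int = 0):
        super().__init__()
        self.pad_id = pad_id
        self.embedding = nn.Embedding(vocab_size, d, padding_idx=pad_id)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_blocks)

    def forward(self, bias: torch.Tensor) -> torch.Tensor:
        # bias: (N, I_max) zero-padded token ids of the N bias phrases
        pad_mask = bias.eq(self.pad_id)                      # (N, I_max)
        g = self.blocks(self.embedding(bias),
                        src_key_padding_mask=pad_mask)       # Eq. (9)
        # Mean pooling over non-padded positions only, Eq. (10).
        keep = (~pad_mask).unsqueeze(-1).float()             # (N, I_max, 1)
        v = (g * keep).sum(1) / keep.sum(1).clamp(min=1.0)   # (N, d)
        return v                                             # V = [v_1..v_N]
```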
3.2. Bias decoder with dynamic vocabulary

To avoid the complexity of learning dependencies within the bias phrases, we introduce a dynamic vocabulary V^b = {<b_1>, ..., <b_N>}, where the phrase-level bias tokens represent the bias phrases as single entities for the N bias phrases in the bias list B. Unlike Eq. (2), the bias decoder estimates the next token y'_s ∈ {V^n ∪ V^b} given H, V in Eqs. (1) and (10), and y'_{0:s-1} as follows:

    P(y'_s | y'_{0:s-1}, X, B) = BiasDecoder(y'_{0:s-1}, H, V).   (11)

For example, if a bias phrase "Nelly" exists in the bias list, the bias decoder outputs the corresponding bias token [<Nelly>] rather than the normal token sequence [<N>, <el>, <ly>].

The bias decoder comprises an extended embedding layer, M_d transformer blocks, and an extended output layer. First, the input token sequence y'_{0:s-1} is converted to the embedding vector sequence E'_{0:s-1} = [e'_0, ..., e'_{s-1}] ∈ R^{d×s}. Unlike Eq. (3), if the input token y'_{s-1} is a bias token, the corresponding bias embedding v_n is extracted from V (Figure 2a); otherwise, the embedding layer is used with a linear layer as follows:

    e'_{s-1} = { Linear(Extract(V, y'_{s-1}))   (y'_{s-1} ∈ V^b)
               { Linear(Embedding(y'_{s-1}))    (y'_{s-1} ∈ V^n).   (12)
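A minimal sketch of the extended embedding layer in Eq. (12) follows, under the assumption that token ids below K index the normal vocabulary V^n and ids K, ..., K+N-1 index the dynamic bias tokens; whether the two branches share one linear layer is not specified here, so the sketch uses separate projections.

```python
# A minimal sketch of the extended embedding in Eq. (12); illustrative only.
import torch
import torch.nn as nn

class ExtendedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256):
        super().__init__()
        self.K = vocab_size                      # size of normal vocabulary V^n
        self.embedding = nn.Embedding(vocab_size, d)
        self.linear_n = nn.Linear(d, d)          # linear layer, normal branch
        self.linear_b = nn.Linear(d, d)          # linear layer, bias branch

    def forward(self, y: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # y: (batch, s) token ids over V^n ∪ V^b; v: (N, d) phrase vectors V
        is_bias = y >= self.K
        # Normal branch: Linear(Embedding(y)), lower case of Eq. (12).
        e_n = self.linear_n(self.embedding(y.clamp(max=self.K - 1)))
        # Bias branch: Linear(Extract(V, y)), upper case of Eq. (12).
        bias_idx = (y - self.K).clamp(min=0)
        e_b = self.linear_b(v[bias_idx])
        return torch.where(is_bias.unsqueeze(-1), e_b, e_n)
```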
Then, the M_d transformer blocks convert E'_{0:s-1} to the hidden state vector u'_s as in Eq. (4). In addition to the normal token score α^n in Eq. (5), the bias token score α^b = [α^b_1, ..., α^b_N] is calculated using an inner product with linear layers (Figure 2b):

    α^b = Linear(u'_s) Linear(V^T) / √d.   (13)

After α^b is concatenated with α^n, resulting in α = [α^n_1, ..., α^n_K, α^b_1, ..., α^b_N], the softmax function calculates the token-wise probability, including the bias tokens, as follows:

    P(y'_s | y'_{0:s-1}, X, B) = Softmax(Concat(α^n, α^b)).   (14)
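Eqs. (13) and (14) can be sketched in a few lines; the two linear projections applied to u'_s and V are our assumption about how the inner product is parameterized, and the shapes are illustrative.

```python
# A minimal sketch of the extended output layer in Eqs. (13)-(14): normal
# scores come from a static linear layer, bias scores from a scaled inner
# product against the phrase vectors V, so the output size tracks N.
import math
import torch
import torch.nn as nn

d, K, N = 256, 5000, 100
linear_out = nn.Linear(d, K)      # normal token scores α^n, Eq. (5)
linear_q = nn.Linear(d, d)        # query projection for Eq. (13)
linear_k = nn.Linear(d, d)        # key projection for Eq. (13)

u = torch.randn(1, d)             # decoder hidden state u'_s
v = torch.randn(N, d)             # phrase-level bias vectors V

alpha_n = linear_out(u)                                   # (1, K)
alpha_b = linear_q(u) @ linear_k(v).T / math.sqrt(d)      # (1, N), Eq. (13)
probs = torch.softmax(torch.cat([alpha_n, alpha_b], dim=-1), dim=-1)  # Eq. (14)
```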
Similar to Eqs. (7) and (8), the posterior probability and the loss function are formulated as follows:

    P(y'_{0:S'} | X, B) = ∏_{s=1}^{S'} P(y'_s | y'_{0:s-1}, X, B),   (15)

    L' = − log P(y'_{0:S'} | X, B),   (16)

where S' represents the total number of tokens based on the proposed bias tokens. Note that Eqs. (13) and (14) are flexible with respect to the bias list size N, and the proposed method is optimized using Eq. (16) without auxiliary losses.

The proposed method can be applied to other E2E-ASR architectures, such as CTC, RNN-T, and their hybrids, including streaming and multilingual systems [8–10, 33–37], because it only extends the embedding and output layers in addition to the bias encoder (Figure 3).

Figure 3: Various architectures utilized in the proposed method: (a) CTC-based biasing; (b) RNN-T-based biasing.
3.3. Training

During the training phase, a bias list B is created randomly from the reference transcriptions for each batch, where N_utt bias phrases are selected per utterance, each I tokens in length. This process yields a total of N bias phrases, calculated as N_utt × batch size. Once the bias list B is defined, the corresponding reference transcriptions are replaced with the bias tokens. For example, if "<N>, <el>, <ly>" is extracted from the complete utterance y = [<Hi>, <N>, <el>, <ly>], the reference transcription is modified to y' = [<Hi>, <Nelly>].
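The following hedged sketch mimics the bias-list construction and reference rewriting described above; the span-sampling details (uniform lengths, first-match replacement) and all names (make_bias_list, n_utt, max_len) are our assumptions.

```python
# A minimal sketch of the random bias-list construction in Section 3.3.
import random

def make_bias_list(batch_refs, n_utt=2, min_len=2, max_len=10):
    """Pick n_utt random subword spans per utterance as bias phrases."""
    bias_list = []
    for ref in batch_refs:  # ref: list of subword tokens
        for _ in range(n_utt):
            if len(ref) < min_len:
                continue
            length = random.randint(min_len, min(max_len, len(ref)))
            start = random.randint(0, len(ref) - length)
            bias_list.append(tuple(ref[start:start + length]))
    return bias_list

def apply_bias_tokens(ref, bias_list):
    """Replace matched spans with phrase-level bias tokens <b_n>."""
    out, i = [], 0
    while i < len(ref):
        hit = next((n for n, b in enumerate(bias_list)
                    if tuple(ref[i:i + len(b)]) == b), None)
        if hit is not None:
            out.append(f"<b{hit}>")            # single dynamic token
            i += len(bias_list[hit])
        else:
            out.append(ref[i])
            i += 1
    return out

refs = [["<Hi>", "<N>", "<el>", "<ly>"]]
bias = [("<N>", "<el>", "<ly>")]
print(apply_bias_tokens(refs[0], bias))        # ['<Hi>', '<b0>']
```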
3.4. Bias weight during inference

Considering practicality, we introduce a bias weight into Eq. (14) to avoid over- or under-biasing during inference as follows:

    WeightSoftmax_i(α, w) = w_i exp(α_i) / Σ_{j=1}^{K+N} w_j exp(α_j),   (17)

where w = [w_1, ..., w_{K+N}] and i represent a weight vector and its index for α = [α^n_1, ..., α^n_K, α^b_1, ..., α^b_N], respectively. The same bias weight µ is applied to all bias tokens as follows:

    w_i = { 1.0   (i ≤ K)
          { µ     (i > K).   (18)
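A direct implementation of Eqs. (17) and (18) might look as follows; the max-subtraction is only for numerical stability and does not change the result.

```python
# A minimal sketch of the weighted softmax in Eqs. (17)-(18): a single weight
# mu rescales the unnormalized probabilities of the N bias tokens at inference.
import torch

def weight_softmax(alpha: torch.Tensor, K: int, mu: float = 0.8) -> torch.Tensor:
    # alpha: (..., K + N) concatenated scores [α^n ; α^b] from Eq. (14)
    alpha = alpha - alpha.max(dim=-1, keepdim=True).values  # numerical stability
    w = torch.ones_like(alpha)
    w[..., K:] = mu                      # Eq. (18): bias tokens get weight mu
    num = w * torch.exp(alpha)           # w_i * exp(alpha_i)
    return num / num.sum(dim=-1, keepdim=True)   # Eq. (17)

scores = torch.randn(1, 5000 + 100)     # K = 5000 normal + N = 100 bias tokens
probs = weight_softmax(scores, K=5000, mu=0.8)
```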
4. Experiment

4.1. Experimental setup

To demonstrate the adaptability of the proposed method, we applied it to offline CTC/attention [8] and streaming RNN-T models. The input features are 80-dimensional Mel filterbanks with a window size of 512 samples and a hop length of 160 samples. Then, SpecAugment is applied. The audio encoder comprises two convolutional layers with a stride of two and a 256-dimensional linear projection layer, followed by 12 conformer layers with 1024 linear units and layer normalization. For the streaming RNN-T, the audio encoder is processed blockwise [38] with a block size of 800 ms and a look-ahead of 320 ms. The bias encoder has six transformer blocks with 1024 linear units. Regarding the bias decoder, the offline CTC/attention model has six transformer blocks with 2048 linear units, and the streaming RNN-T model has a single long short-term memory layer with a hidden size of 256 and a linear layer with a joint size of 320 for the prediction and joint networks. The attention layers in the audio encoder, the bias encoder, and the bias decoder are four-head multihead attentions with a dimension d of 256. The offline CTC/attention and streaming RNN-T models have 40.58 M and 31.38 M parameters, respectively, including the bias encoders. The bias weight µ (Section 3.4) is set to 0.8 (discussed further in Section 4.4). During training, a bias list B is created randomly for each batch with N_utt = [2-10] and I = [2-10] (Section 3.3). The proposed models are trained for 150 epochs at learning rates of 0.0025 and 0.002 for the CTC/attention-based and RNN-T-based systems, respectively.

The Librispeech-960 corpus [39] is used to evaluate the proposed method, with ESPnet as the E2E-ASR toolkit [40]. The proposed method is evaluated in terms of word error rate (WER), bias phrase WER (B-WER), and unbiased phrase WER (U-WER) [25]. Note that insertion errors are counted toward the B-WER if the inserted phrases are present in the bias list; otherwise, insertion errors are counted toward the U-WER.

4.2. Results on offline CTC/attention-based system

Table 1 shows the results of the offline CTC/attention-based systems obtained on the Librispeech-960 dataset for different bias list sizes N. With a bias list size N > 0, the proposed method improves the B-WER considerably despite a small increase in the U-WER, which results in a substantial improvement in the overall WER. While the B-WER and U-WER tend to deteriorate with larger N, the proposed method remains superior to other DB techniques across all bias list sizes. Although the proposed method does not perform as well as the baseline when N = 0, this is not a particularly limiting drawback considering that users typically register important keywords in the bias list.

Table 1: WER results of offline CTC/attention-based systems obtained on Librispeech-960 (U-WER/B-WER).

                            N = 0 (no-bias)           N = 100                   N = 500                   N = 1000
Model                       test-clean  test-other    test-clean  test-other    test-clean  test-other    test-clean  test-other
Baseline [8]                2.57        5.98          2.57        5.98          2.57        5.98          2.57        5.98
(CTC/attention)             (1.5/10.9)  (4.0/23.1)    (1.5/10.9)  (4.0/23.1)    (1.5/10.9)  (4.0/23.1)    (1.5/10.9)  (4.0/23.1)
CPPNet [18]                 4.29        9.16          3.40        7.77          3.68        8.31          3.81        8.75
                            (2.6/18.3)  (5.9/37.5)    (2.6/10.4)  (6.0/23.0)    (2.8/10.9)  (6.5/24.3)    (2.9/11.4)  (6.9/25.3)
Attention-based DB          5.05        8.81          2.75        5.60          3.21        6.28          3.47        7.34
+ BPB beam search [22]      (3.9/14.1)  (6.6/27.9)    (2.3/6.0)   (4.9/12.0)    (2.7/7.0)   (5.5/13.5)    (3.0/7.7)   (6.4/15.8)
Proposed                    3.16        6.95          1.80        4.63          1.92        4.81          2.01        4.97
                            (1.9/13.8)  (4.6/27.5)    (1.7/2.9)   (4.3/7.1)     (1.8/3.1)   (4.5/7.9)     (1.9/3.3)   (4.6/8.5)
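For clarity, the insertion-attribution rule for B-WER/U-WER described in Section 4.1 can be sketched as follows; the error-tuple format is a hypothetical simplification of an actual alignment-based scorer, and only the error-count numerators are shown.

```python
# A minimal sketch of the B-WER/U-WER attribution rule in Section 4.1.
def attribute_errors(errors, bias_words):
    """errors: list of (kind, word), kind in {'sub', 'del', 'ins'}.
    For 'ins', word is the inserted hypothesis word; otherwise it is the
    reference word the error occurred on."""
    b_err = u_err = 0
    for kind, word in errors:
        # Insertions count toward B-WER only if the inserted word is in the
        # bias list; substitutions/deletions follow the reference word.
        if word in bias_words:
            b_err += 1
        else:
            u_err += 1
    return b_err, u_err

print(attribute_errors([("ins", "nelly"), ("sub", "hi")], {"nelly"}))  # (1, 1)
```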