Sādhanā © Indian Academy of Sciences
Initial Decoding with Minimally Augmented Language Model for Improved Lattice
Rescoring in Low Resource ASR

SAVITHA MURTHY1,* and DINKAR SITARAM2


1 Department of CSE, P.E.S. University
2 Cloud Computing Innovation Council of India

arXiv:2403.10937v1 [eess.AS] 16 Mar 2024

Abstract. Automatic speech recognition systems for low-resource languages typically have smaller corpora on which the language model is trained. Decoding with such a language model leads to a high word error rate due to the large number of out-of-vocabulary words in the test data. Larger language models can be used to rescore the lattices generated from initial decoding. This approach, however, gives only a marginal improvement. Decoding with a larger augmented language model, though helpful, is memory intensive and not feasible in a low-resource system setup. The objective of our research is to perform initial decoding with a minimally augmented language model. The lattices thus generated are then rescored with a larger language model. We thereby obtain a significant reduction in error for the low-resource Indic languages Kannada and Telugu.
This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages, where the baseline language model is not sufficient for generating inclusive lattices. We minimally augment the baseline language model with unigram counts of words that are present in a larger text corpus of the target language but absent in the baseline. The lattices generated after decoding with the minimally augmented baseline language model are more comprehensive for rescoring. We obtain 21.8% (for Telugu) and 41.8% (for Kannada) relative word error reduction with our proposed method. This reduction is comparable to the 21.5% (for Telugu) and 45.9% (for Kannada) relative word error reduction obtained by decoding with a language model augmented with the full Wikipedia text, while our approach consumes only 1/8th of the memory. We demonstrate that our method is comparable with various text-selection-based language model augmentation approaches and is also consistent across datasets of different sizes. Our approach is applicable for training speech recognition systems under low-resource conditions where speech data and compute resources are insufficient but a large text corpus is available in the target language. Our research addresses the issue of out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple yet computationally inexpensive.

Keywords. Indic languages, Telugu and Kannada ASR, low resource, out of vocabulary, language model augmentation, Automatic Speech Recognition
1 Introduction

There has been considerable interest in Automatic Speech Recognition (ASR) for low-resource languages for more than a decade [1, 2, 3]. Low resource indicates a scarcity of any of the language resources required to train a traditional ASR system, namely pronunciation dictionaries, text data to train a Language Model (LM), or audio data with corresponding transcriptions. Obtaining low Word Error Rates (WER) in low-resource ASR is a challenge. These ASR systems have a low-to-medium vocabulary of between a thousand and 50,000 words. This results in a high probability of Out-Of-Vocabulary (OOV) words being present in the test set, leading to high WER. Data augmentation has been a predominant approach to improving ASR performance for low-resource languages. The prevalence of enormous amounts of data on the World Wide Web has made it possible to leverage more data for data augmentation purposes [4, 5, 6, 7, 8].

Our study involves augmenting the language model using larger text corpora from the web in a traditional hybrid ASR setup for low-resource Indic languages, namely Telugu and Kannada. A common approach in traditional ASR systems is to train a transcript language model from the available speech transcripts of the training dataset. In the case of low-resource languages, language models trained on a speech corpus with only a few hours of training data may leave many words unknown, or OOV. Decoding with such small language models leads to high WER. This issue can be addressed by interpolating the baseline language model with a larger language model in the concerned language, to account for as many words as possible. A larger language model, though comprehensive, requires the construction of a decoding graph in a traditional hybrid ASR setup and is very memory intensive. In case of limited availability of computational resources, allocating the required memory may not be feasible. Alternatively, the availability of a larger text corpus can

*For correspondence: savithamurthy@pes.edu
be harnessed by performing initial decoding with a smaller language model to generate a lattice and then rescoring the lattice with the larger, well-trained language model [9]. However, in the case of low-resource languages, a small language model trained on the available speech text may not be sufficient to generate a comprehensive lattice. This leads to a large number of missing words in the lattices, which are not resolved even after rescoring with a large, augmented language model.

1.1 Our Contribution

In this paper, we focus on improving OOV recovery over the baseline, thus making lattice rescoring with a larger language model more effective for low-resource languages. We consider the baseline language model augmented with Wikipedia text for our experiments using a larger language model. We explore language model augmentation techniques for two Indian languages, namely (i) Kannada, a seed corpus with 4 hours of read speech, and (ii) Telugu, a 40-hour corpus of read and conversational speech. We train the baseline language model on the available speech transcripts. We train another language model on the unigram counts of the Out-of-Train (OOT) words from the larger text corpus, i.e., words that are not present in the baseline training transcripts. We augment the baseline language model by interpolating it with this word-unigram language model and perform initial decoding with the resulting minimally augmented language model. We separately train a language model on Wikipedia text and interpolate it with the baseline to obtain a larger language model. The lattices generated from initial decoding are then rescored with the larger language model. Our experiments include different text selection methods as well as different-sized datasets. Our method also eliminates the empirical need to determine the amount of text to be selected for language model augmentation.

Our empirical study of language model augmentation and lattice rescoring in low-resource languages shows the following:

1. Lattices generated after decoding with the baseline language model may not contain all the probable words (earlier OOVs). Lattice rescoring only adjusts the probabilities on the existing paths, and hence there is no significant improvement in OOV recovery or WER.

2. Initial decoding with a baseline language model augmented to include unigram counts of OOT words (words that are present in the bigger text corpus but not in the baseline vocabulary) improves OOV recovery and hence enables the inclusion of more words in the lattices.

3. Rescoring the lattices generated in 2 with the Wikipedia-augmented language model results in a significant reduction in WER. This reduction is comparable to that of decoding with a larger language model, while rescoring is more time and memory efficient.

4. In-Vocabulary (IV) recognition is not hurt and only improves with our proposed method.

This paper is organized as follows: Section 2 gives an overview of different data augmentation techniques used in ASR, both for Large Vocabulary Continuous Speech Recognition (LVCSR) and for low-resource languages; it also summarizes OOV recovery techniques. Section 3 describes the OOV problem in the context of the low-resource languages Telugu and Kannada. Section 4 briefly explains the concepts of decoding and lattice rescoring. Section 5 describes the datasets used for our experiments and the experimental setup. Section 6 gives an overview of our experiments. Section 7 discusses the results, followed by Section 8 on conclusions and future work.

2 Related Work

This section gives an overview of various data augmentation techniques that have been employed to improve recognition accuracy in ASR (section 2.1) and of the OOV detection and recovery literature (section 2.2).

2.1 Data Augmentation in ASR

Data augmentation is one of the research directions towards improving the performance of ASR systems. Approaches to data augmentation include (i) perturbation, (ii) use of mixed features, (iii) transfer learning, (iv) text augmentation, (v) leveraging multilingual features, and (vi) speech synthesis/generation. Perturbation is a technique where the training data is augmented with variations such as swapping blocks of time steps and frequencies [10], changing the speed of the audio signal [11, 12], and modifying vocal tract length [13, 14]. Chen et al. [15] employ different perturbations such as pitch, speed, tempo, volume, reverberation, and spectral augmentation to improve the accuracy of children's speech recognition. Mix-up is another technique for data augmentation wherein more training samples are generated using convex combinations of pairs of samples and their labels [16, 17, 18]. Transfer learning is a technique where model parameters are shared across multiple tasks and can be considered a method for data augmentation [19, 20, 21]. Speech synthesis/generation has recently been used to increase the amount of training data [22]. The text-to-speech system is trained on ASR training data to match the style of the training corpus for better recognition [23]. Rosenberg et al. [24] also train a Tacotron model on the Librispeech corpus and out-of-domain speech to explore the effect of acoustic diversity. Text augmentation has been used [25, 26] to improve language model scores and reduce WER. The aforementioned research has been adapted for languages with sufficient resources.

A few of these augmentation techniques have also been applied to low-resource ASR to improve accuracy. Sharma et al. [27] use the mixup technique to augment speech data with TTS audio, while Meng et al. [28] use mixed features (MFCC and mel-spectrograms) for data augmentation to train Listen, Attend and Spell (LAS) and Transformer

models. Rossenbach et al. [29] improve the performance of an End-to-End (E2E) ASR by augmenting the acoustic model with synthesized speech generated using a Text-To-Speech (TTS) system trained on the ASR speech corpus. The work by Lin et al. [30] employs synthesized speech to improve keyword spotting with limited data. Perturbation has been used to augment features for speech training [31, 32, 33]. A multilingual approach, where speech data from similar languages are leveraged to augment and train the acoustic model, is quite popular in low-resource ASR. Rosenberg et al. [34] applied a multilingual approach to improving keyword search in the IARPA BABEL OP3 languages. The ‘Low Resource Speech Recognition Challenge for Indian Languages - Interspeech 2018’ included 40 hours of speech corpora in the Telugu, Tamil and Gujarati languages. Multilingual training was adopted, wherein the acoustic model was trained on all three languages, leading to an improvement of approximately 5-8% in WER [35, 36, 37, 38, 39]. However, these methods reduce recognition errors in words already present in the ASR’s lexicon. A combination of methods, namely vocal tract length perturbation and multilingual features, is adapted for the IARPA Babel languages in the work by Tuske et al. [13]. Yılmaz et al. [40] improve ASR on code-switched speech for Dutch and Frisian. They augment the language model by generating text using a Recurrent Neural Network (RNN) LM trained on the transcripts from the code-switched speech data. They also enhance the audio from available high-resource data in code-switched speech.

2.2 OOV Detection and Recovery

There has been extensive research on OOV detection and recovery to improve ASR recognition accuracy. The challenge in handling OOVs is that the ASR system is unaware of the presence of an OOV. This, in turn, results in the hypothesis always containing words in the lexicon, leading to errors in recognition. There have been efforts to detect OOVs using filler models [41, 42, 43], where there are placeholders for OOVs which can then be replaced with the extended vocabulary for improved recognition. Confidence measures are another indication of OOV presence. Confidence scores are used to identify probable OOV candidates [44, 45, 46, 47, 48, 49]. Studies have tried to achieve open-vocabulary ASR by employing subword models, where hybrid language models - both word LMs and subsequence (phoneme, syllable or subword) LMs - are used together. In this case, the unknown words are replaced with subsequences to aid better recognition. The most recent work is by Zhang et al. [50], where a hybrid lexical model with phonemes and words is used to generate OOV candidates with phoneme constraints. Instead of hybrid subword models, parallel models have also been used for OOV detection and recovery, where both subsequence lattices and word lattices are used to determine OOVs [45, 51, 52, 53, 54]. OOV recovery follows OOV detection. P2G mapping is the most popular approach for OOV recovery, where the subsequences are stitched together to form words using P2G models [55, 50]. Another approach is the clustering or similarity-based approach, where words similar to reference or in-vocabulary words are used to determine OOV probabilities [56, 57, 58, 59, 60].

While data augmentation has been predominantly used to reduce WER in LVCSR, very few researchers adapt data augmentation to handle OOVs in ASR [61, 62, 63]. These studies address issues related to specific words such as proper nouns. Others select sentences using distance measures to ensure the proximity of the selected sentences to the training data. Naptali et al. [60] use web data to determine OOV probabilities after determining OOV candidates based on similarity measures. These approaches are effective when there is a sufficient amount of training data.

The approaches mentioned above to handle OOV detection and recovery may not be very effective for languages that are agglutinative and inflective, with every root word having several forms based on different contexts. It is all the more challenging to address OOV detection and recovery when the language concerned is low resource and lacks sufficient data to train an ASR. We explain the problem of OOV in such languages in the following section.

3 OOV Problem in Low Resource Agglutinative and Inflective Languages

Low-resource languages have high OOV rates due to limited vocabulary and language model size. This is more prominent in the case of agglutinative languages. Dravidian languages (languages spoken in southern India) like Telugu and Kannada are agglutinative, resulting in words of different lengths derived from the basic word forms. They are also highly inflective, for the following reasons: (a) there are 8 cases (called vibhaktis) in Telugu with 3 to 4 suffixes corresponding to each case, resulting in a total of 30 suffixes as per the vibhaktis of each noun form in Telugu; (b) there are different suffixes for the genders (called linga) in Telugu, which again differ based on number, singular or plural; (c) Telugu being highly inflective, nouns change form based on case, gender and number, resulting in 30 * 3 * 2 = 150 variations for each noun in the language; (d) there are 10 different sandhis in Telugu, where multiple words are joined together to form a single word, along with 7 sandhis that come from the Sanskrit language; (e) the suffix for a verb (all tenses) in a sentence is based on the gender, number and case of the noun it corresponds to, resulting in higher inflection. Kannada has similar inflections, with eight cases and three genders, and noun forms inflected based on case, gender, and tense. Also, sandhis in Kannada (3 in Kannada + 7 borrowed from Sanskrit) can again join multiple words to form single words.

The problem of OOV for such languages does not only include addressing named entities, which are always of concern in all languages, but also the need for a large corpus with a comprehensive vocabulary that accounts for all the agglutinative and inflective forms of nouns, verbs, adjectives and so on. For example, the word ‘ABHIMAANISTUNNAANANI’² in Telugu that occurs in the test data is not present in Wikipedia. The root form of this word is ‘ABHIMAANISTUNNA’, and a different inflection ‘ABHIMAANISTUNNAARU’ of the word is present. Defining a text corpus that includes all inflections of a word requires linguistic expertise, and obtaining corresponding audio is a tedious task. Hence, speech corpora for these languages tend to have high OOV rates. There has been work on improving ASR recognition accuracy in agglutinative languages by splitting words into morphemes. The focus of our experiments is to address OOV recovery and WER reduction for any low-resource language in general, and it is complementary to the technique of using morphemes as subwords [65, 66, 67, 68].

4 Concept of Decoding and Lattice Rescoring

4.1 Decoding

Decoding involves finding the most probable path of word sequences. A Weighted Finite State Transducer (WFST) [69] is used for this purpose. It is represented as given in Equation 1:

HCLG = min(det(H ∘ C ∘ L ∘ G))    (1)

where H represents the HMM transitions, C represents context dependencies among phones, L represents the lexicon, and G represents the grammar given by the language model.

While decoding with a bigger language model may be more effective in reducing WER, composing an HCLG graph, referred to as a decoding graph, with the grammar belonging to a large language model is more memory intensive, and decoding becomes computationally expensive because of a large search space. Hence, normal practice is to decode using the smaller language model and then perform lattice rescoring with a bigger language model.

4.2 Lattice Rescoring

A lattice represents alternate possible word sequences having higher scores than other possible word sequences. A lattice contains the paths with higher scores and is derived from the pruned subset of the decoding graph [69]. Lattices are generated as a result of decoding. Lattice rescoring is a process where the path probabilities are updated with the new language model probabilities while retaining the transition and pronunciation probabilities. Lattice rescoring can be performed using a larger language model. This also conserves computational resources, as there is no need to compose a decoding graph.

5 Datasets and Experimental Setup

5.1 Datasets

We consider (i) the Telugu speech corpus released by Microsoft as part of the “Speech Recognition for Indian Languages Challenge” by Srivastava et al. at Interspeech, 2018 [70], and (ii) a Kannada speech corpus recorded using the transcripts of the speech synthesis dataset from IIIT Hyderabad³, for our experiments. The Telugu speech corpus consists of 40 hours of read and conversational speech. Every audio recording in both datasets is 16 kHz mono. The Telugu training dataset has a vocabulary of 43,260 words, and the test dataset has an OOV rate of 12.04%. The high OOV rate in Telugu is due to the reasons listed in section 3. Addressing these challenges requires a notably large dataset. Hence, the Telugu dataset used can be considered low resource. The Kannada baseline vocabulary is 1754 words, and the test data has an OOV rate of 25.22%. The Kannada dataset, with 4 hours of speech, can be considered an extremely low resource corpus due to the reasons listed in section 3. Table 1 lists the baseline WER of the Telugu (25.51%) and Kannada (51.87%) ASR systems.

Table 1. Telugu and Kannada ASR baselines

Language   Duration   WER (%)   OOV rate (%)
Telugu     40 hours   25.51     12.04
Kannada    4 hours    51.87     31.58

− To avoid recognition errors due to agglutination, we concatenate consecutive words to form the longest word in the vocabulary.
− WER for Telugu as specified in the Microsoft release is 34.36%. We obtained a WER of 25.51% after test transcript corrections and accounting for agglutination.

5.2 Experimental Setup

The baseline ASR consists of a TDNN-F [71] acoustic model with 128 chunks, 7 hidden layers, 2048 hidden dimensions, an initial learning rate of 0.008 and 10 epochs, using a Kaldi ASR toolkit [72] recipe.

The language model for the baseline is trained on the speech transcripts in the training set. A trigram language model with Witten-Bell smoothing is employed. We use the ‘count merging’ method [73][74] of interpolation to augment the baseline language model. Count merging is given by Equation 2:

p_CM(w|h) = Σ_i β_i c_i(h) p_i(w|h) / Σ_j β_j c_j(h)    (2)

where i represents the domain or model, β_i is the model scaling factor for i, and h is the history for the ith model or domain. The interpolation weight for a given history in count merging is given as:

λ_i(h) = β_i c_i(h) / Σ_j β_j c_j(h)    (3)

The interpolation weight in count merging depends on the history counts pertaining to a particular domain instead

² The example words in Telugu are represented using IITM label set notation [64]
³ Accessed July 13, 2017. http://festvox.org/databases/iiit voices

of absolute counts. Thus, history counts that belong to a smaller corpus get more weight compared to the same history in a larger corpus. Count merging is stated to perform better than linear interpolation [73]. We compare HCLG graph decoding and baseline lattice rescoring methods with the augmented LM.

6 Experiments

This section describes the various experiments performed concerning language model augmentation. Section 6.1 describes the details of augmenting the language model with the full Wikipedia text for the Telugu and Kannada languages. Section 6.2 gives details of language model augmentation with a focus on OOT, in terms of memory requirements and the reason for choosing only unigram counts of OOT words for minimally augmenting the baseline language model. Section 6.3 describes the various text selection techniques from the literature that we employ for augmentation. Section 6.4 lists the various dataset sizes used in our experiments.

6.1 Language Model Augmentation with Full Wikipedia Text

We use Wikipedia text data in Telugu and Kannada to enhance the corresponding Language Models (LMs). The text data was pre-processed to remove non-alphabet characters and was normalized. The Telugu Wikipedia XML dump, consisting of 2525122 sentences, was used to augment the language model, which increased the vocabulary to 1815924 words from 43260 words in the baseline. The Kannada Wikipedia XML dump consists of 873339 sentences, increasing the vocabulary to 943729 words from an initial vocabulary of 1754 in the baseline.

6.2 OOT-Based Language Model Augmentation

The memory requirement for decoding graph construction with a full Wikipedia text augmented language model is approximately 32 gigabytes for Telugu and 18 gigabytes for Kannada, as listed in Table 2, which is quite large. Therefore, we consider only OOT-based enhancements to the baseline language model. OOT words, in this context, are those words that are present in Wikipedia but not in the baseline training text. Table 2 lists the different OOT-based enhancements employed and their memory requirements.

First, we consider Wikipedia lines containing OOT words for enhancing the baseline language model. In low-resource, agglutinative, and inflective languages like Telugu and Kannada, lines containing OOT words constitute more than 90% of the Wikipedia text. Augmenting the baseline language model with such a large subset requires approximately the same memory as full Wikipedia augmentation.

We then consider trigrams from Wikipedia containing OOT words. Again, because of the high OOT rates, this results in a bigger language model than the complete Wikipedia itself. As specified in Table 2, ‘baseline + Wikipedia OOT trigrams’ requires more memory than ‘baseline + full Wikipedia’ for both Telugu and Kannada.

After this, we consider augmenting the baseline language model with unigram counts of only the OOT words. As seen in Table 2, such an augmentation has a much smaller memory requirement, very close to that of the baseline language model. Therefore, we propose initial decoding with a graph composed from an LM augmented with OOT unigram probabilities from Wikipedia. The generated lattices are then rescored with a bigger augmented language model. We compare our method against different text-selection-based augmentation methods (described in section 6.3) as well as different data sizes (described in section 6.4).

6.3 Text Selection Based Language Model Augmentation

Along with the complete Wikipedia text augmentation mentioned in section 6.1, we adopt different text selection methods available in the literature, namely contrastive selection [75], change in likelihood [76], and entropy-based selection [77]. These selection methods attempt to benefit from the availability of large amounts of non-domain data to improve language model probabilities, eventually improving ASR performance. We select 50% of the highest-ranked sentences from Wikipedia for every text selection method used for language model augmentation.

6.3.1 Contrastive Selection

Chen et al. [75] augment the language model with a contrastive selection from a larger text corpus. They select sentences that are similar to the ASR training set. We instead select trigrams with higher probability given the interpolated LM (baseline and Wikipedia) compared to the probability given the Wikipedia LM, as given in Equation 4:

sentence score = Σ_t [log P(t|D) − log P(t|B)] / #(t)    (4)

where P(t|D) denotes the trigram probability with respect to D, the train-and-Wikipedia interpolated language model, and P(t|B) denotes the trigram probability with respect to B, the language model trained on Wikipedia. The sum of the differences in trigram probabilities is normalized by the number of trigrams #(t) in a sentence. This method selects sentences containing trigrams that are similar to the training text.

6.3.2 Delta Likelihood Based Selection

Klakow [76] uses the change in log-likelihood when a sentence is removed from the corpus and the language model is trained on it. The work claims an improvement in perplexity and OOV rate with this approach. We implement this technique for trigrams as represented in Equation 5:

ΔS_i = Σ_w N_target(u, v, w) log [P(w|uv) / P_{A_i}(w|uv)]    (5)

where target represents the train (baseline) corpus, u, v and w are words in a sentence, P(w|uv) is the trigram probability in the augmented LM, and A_i represents the augmented corpus with the ith sentence removed. This selection method selects the sentences, similar to the baseline corpus, that result in the maximum change in likelihood.

Table 2. Memory requirements for decoding graph construction

Language   Language Model                       Max Memory Required
Telugu     baseline                             ~2 GB
           baseline + full Wikipedia            ~32 GB
           baseline + Wikipedia OOT lines       <32 GB
           baseline + Wikipedia OOT trigrams    >32 GB
           baseline + Wikipedia OOT words       4 GB
Kannada    baseline                             ~1 GB
           baseline + full Wikipedia            ~18 GB
           baseline + Wikipedia OOT lines       ~18 GB
           baseline + Wikipedia OOT trigrams    ~26 GB
           baseline + Wikipedia OOT words       <2 GB

− GB: gigabytes
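The ‘OOT words’ rows in Table 2 correspond to augmenting with unigram counts alone. A minimal sketch of how such counts could be collected, assuming whitespace-tokenized input lines (`oot_unigram_counts` is an illustrative helper, not part of the paper's toolchain):

```python
from collections import Counter

def oot_unigram_counts(corpus_lines, baseline_vocab):
    """Unigram counts of Out-of-Train (OOT) words: words that occur in the
    larger corpus (e.g., Wikipedia) but are absent from the baseline
    training-transcript vocabulary. Only these counts feed the proposed
    minimal augmentation."""
    baseline_vocab = set(baseline_vocab)
    counts = Counter()
    for line in corpus_lines:
        for word in line.split():
            if word not in baseline_vocab:
                counts[word] += 1
    return counts
```

The resulting counts train a unigram LM that is interpolated with the baseline (e.g., via count merging, Equation 2). Because only unigrams are added, the composed decoding graph stays close to baseline size, e.g., 4 GB versus ~32 GB for Telugu in Table 2.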

6.3.3 Entropy Based Selection

Itoh et al. [77] employ highest-entropy-based selection of the N-best hypotheses for augmenting the acoustic model. We compute the sentence score based on the entropy of trigrams of Wikipedia with respect to the transcript language model, as given in Equation 6:

sentence score = Σ_w [−P(w|T) log P(w|T)] / #(w)    (6)

where w is the trigram and T is the baseline corpus.

6.3.4 Random Selection

The text selection methods for LM augmentation adopt various strategies with respect to the training text. In addition to these selection methods, we also conduct experiments with a random selection of sentences from the Wikipedia text corpora for Telugu and Kannada and augment the corresponding language models. We randomly select 50% of the Wikipedia text for augmenting the baseline language model.

6.4 Different Sized Data Subsets

We study the effect of our minimal language model augmentation approach, with initial decoding and later lattice rescoring, on different-sized datasets. We compare the performance of our method with full-Wikipedia-augmented decoding on 10-hour, 20-hour, and 30-hour subsets of the Telugu speech dataset, along with the 4-hour Kannada dataset and the 40-hour Telugu dataset.

7 Results and Discussion

We discuss the results for the different text selection methods in section 7.1, followed by section 7.2, where we discuss the results for the different dataset sizes. The results presented in both sections comprise the WER obtained and the effect of language model augmentation with the various techniques on OOV and IV recognition. The results are listed as the percentage of OOV and IV words recognized.

7.1 Text Selection Based Language Model Augmentation

This section presents the results of the different text selection methods for Telugu and Kannada ASR. We list the effect of language model augmentation on WER and on OOV and IV recognition for both Telugu and Kannada with the different text selection methods employed. The effect of applying our proposed method is shown under the column named ‘Rescore after OWALM Decode’.

Table 3 and Table 4 list the WERs for the different selection methods for Telugu and Kannada ASR, respectively. Different subsets of the Wikipedia text, based on the selection methods applied, are used to augment the baseline language model. The first column lists the selection method used. The second column lists the WERs obtained after decoding with the language models augmented using the corresponding selection methods. The third column lists the WER obtained when the lattices generated after initial decoding with the baseline language model are rescored with the augmented language model. Finally, the fourth column shows the effect of our method, which is to first augment the baseline language model with OOT words, perform initial decoding, and then rescore the lattices with the corresponding augmented language models. Likewise, Table 5 and Table 6 list the percentage of initial OOVs that were recognized, while Table 7 and Table 8 list the percentage of initial in-vocabulary words that were recognized for the different selection methods.

We consider the WER obtained by decoding with a language model augmented with the complete Wikipedia text as the reference for the analysis. The best WER reduction observed for Telugu ASR, trained on a 40-hour corpus with a baseline LM of 44882 sentences, is 5.75% absolute and 21.49% relative, using count-merging-based LM augmentation with the full Wikipedia text. On the other hand, Kannada ASR, which is trained on only 4 hours of speech with a baseline LM of 2647 sentences, achieves a best WER reduction of 23.89% absolute and 45.93% relative. This is more significant because of the very small baseline data.

It can be seen from Table 3 and Table 4 that selecting only 50 per cent of the Wikipedia text using either contrastive,
Minimal LM Augmentation for Effective Rescoring 7
Table 3. WER for Telugu for different text-selection-based LM augmentation

                             WER (%)
LM Augmentation              Decoding   Lattice Rescoring   Rescore after OWALM Decode (our method)
Baseline                     26.76      −                   −
Full Wiki (reference)        21.01      25.78               20.92
Contrastive Selection        21.12      26.20               21.11
Delta Likelihood Selection   21.69      26.06               21.53
Entropy-Based Selection      21.69      26.11               21.56
Random Selection             21.47      25.90               21.27

− Baseline WER is specified after including all the words from Wiki into the lexicon
− OWALM: OOT Words Augmented Language Model
Table 4. WER for Kannada for different text-selection-based LM augmentation

                             WER (%)
LM Augmentation              Decoding   Lattice Rescoring   Rescore after OWALM Decode (our method)
Baseline                     52.01      −                   −
Full Wiki (reference)        28.12      50.08               30.27
Contrastive Selection        29.82      50.95               31.67
Delta Likelihood Selection   28.42      50.79               31.48
Entropy-Based Selection      29.77      50.87               30.43
Random Selection             29.49      50.89               31.09

− Baseline WER is specified after including all the words from Wiki into the lexicon
− OWALM: OOT Words Augmented Language Model
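As a quick check, the absolute and relative reductions quoted in the discussion follow directly from the baseline and full-Wikipedia decoding WERs in Table 3 and Table 4 (the helper name here is ours):

```python
def wer_reduction(baseline_wer, new_wer):
    """Absolute and relative (%) WER reduction from a baseline."""
    absolute = baseline_wer - new_wer
    relative = 100 * absolute / baseline_wer
    return round(absolute, 2), round(relative, 2)

print(wer_reduction(26.76, 21.01))  # Telugu:  (5.75, 21.49)
print(wer_reduction(52.01, 28.12))  # Kannada: (23.89, 45.93)
```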
delta likelihood or entropy-based method for augmentation and decoding results in a WER reduction approximately the same as decoding with the complete Wikipedia-augmented language model. This is because the selection methods pick only meaningful sentences related to the baseline corpus; the Wikipedia text also contains garbage sentences, and these are eliminated through the selection. This helps in reducing the size of the language model used for decoding. Interestingly, similar improvements in accuracy are obtained when the baseline language model is augmented with a random selection of sentences from the Wikipedia text. This may be due to the high OOV rate: a randomly selected subset is highly likely to contain sufficient OOV words, thus improving performance. However, there is always uncertainty about how much text to select, irrespective of the selection technique. Our approach eliminates this uncertainty by augmenting the baseline language model with the unigram counts of all the OOT words from the larger corpus for initial decoding. The lattices generated can then be rescored with a language model augmented with the entire large text corpus (refer to the entry for Full Wiki LM augmentation in Table 3 and Table 4) to obtain an effective reduction in WER.

Rescoring the lattices generated by initial decoding with the baseline language model leads to only a marginal improvement in WER. This is because, in the case of low-resource languages, the lattices generated with baseline LM decoding may not contain many words, owing to the high OOV rate. As rescoring only updates the existing path probabilities based on the new LM, only sub-words, if any, present in the lattice are recognized, and no new words are added. Hence, there is only a marginal improvement in WER. For example, the word ‘AADEESHAALAMERAKU’ in Telugu ASR is correctly recognized after rescoring because the sub-words ‘AADEESHAALA’ and ‘MERAKU’ are both present after the baseline LM decode.

With our method, augmenting the baseline language model for decoding with, at a minimum, only the OOV words with respect to the baseline generates lattices that contain these words; rescoring such a lattice with a larger language model therefore results in a significant reduction in WER, as is evident from Table 3 and Table 4. The WER obtained for the different selection methods using our method is consistently very close to the reference WER: it is lower for Telugu and slightly higher for Kannada. Also, from Table 5 and Table 6, it is seen that the percentage of out-of-vocabulary words recognized using our method improves and is very close to the reference. From Table 7 and Table 8, we see that the percentage of in-vocabulary words recog-
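The point that rescoring can only re-rank the paths already in a lattice, and can never introduce a new word, can be illustrated with a toy sketch (the paths and scores below are invented for illustration):

```python
# Toy "lattice": alternative word sequences, each with an acoustic score.
# Rescoring swaps in new LM scores for existing paths; it cannot add paths.
lattice = [
    (["the", "cat", "sat"], -12.0),
    (["the", "cat", "set"], -11.5),
]

def rescore(paths, lm_score):
    """Pick the best existing path under acoustic + new LM score."""
    return max(paths, key=lambda p: p[1] + lm_score(p[0]))

def bigger_lm(words):
    # A larger LM that strongly prefers the word "sat".
    return 0.0 if "sat" in words else -2.0

print(rescore(lattice, bigger_lm)[0])  # ['the', 'cat', 'sat']
# A word absent from every path (say, 'sofa') can never be produced,
# no matter how much probability the new LM assigns to it.
```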
Table 5. OOV recovery in Telugu ASR for different text-selection-based LM augmentation

                             OOV Recognized (%)
LM Augmentation              Decoding   Lattice Rescoring   Rescore after OWALM Decode (our method)
Baseline                     −          −                   −
Full Wiki                    36.38      6.84                35.9
Contrastive Selection        35.41      6.1                 35.53
Delta Likelihood Selection   32.63      6.02                33
Entropy-Based Selection      31.72      5.72                31.87
Random Selection             33.41      6.69                32.20

− OWALM: OOT Words Augmented Language Model
Table 6. OOV recovery in Kannada ASR for different text-selection-based LM augmentation

                             OOV Recognized (%)
LM Augmentation              Decoding   Lattice Rescoring   Rescore after OWALM Decode (our method)
Baseline                     −          −                   −
Full Wiki                    62.28      6.72                56.54
Contrastive Selection        56.31      6.72                53.96
Delta Likelihood Selection   57.12      6.03                54.76
Entropy-Based Selection      59.53      6.37                55.74
Random Selection             58.55      6.43                54.13

− OWALM: OOT Words Augmented Language Model
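The minimal augmentation step behind the OWALM columns reduces to a vocabulary difference with counts. A simplified sketch, with our own helper and variable names standing in for the actual count-extraction pipeline:

```python
from collections import Counter

def oot_unigram_counts(baseline_corpus, larger_corpus):
    """Unigram counts of words seen in the larger corpus but never in
    the baseline training text (the out-of-train, OOT, words)."""
    baseline_vocab = {w for line in baseline_corpus for w in line.split()}
    counts = Counter(w for line in larger_corpus for w in line.split())
    return {w: c for w, c in counts.items() if w not in baseline_vocab}

# Toy stand-ins for the baseline transcripts and the Wikipedia dump.
baseline = ["the cat sat", "the dog ran"]
wiki = ["the cat chased the mouse", "a mouse ran away"]
print(oot_unigram_counts(baseline, wiki))
# {'chased': 1, 'mouse': 2, 'a': 1, 'away': 1}
```

Only these OOT unigram counts are added to the baseline LM before building the decoding graph, which is what keeps the augmented model small enough for first-pass decoding.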
Figure 1. WER for different selection methods in Telugu

Figure 2. WER for different selection methods in Kannada
nized also improves with respect to the baseline.

Figure 1 and Figure 2 depict the comparison between rescoring with the augmented language model on lattices generated from the baseline decode (blue bar), decoding with the augmented language model (brown bar) and our proposed method (green bar). The y-axis in each figure represents WER as a percentage. The WER obtained after initial baseline LM decoding followed by rescoring with a bigger language model shows only a marginal reduction from the baseline WER for both the Kannada (52.01%) and Telugu (26.76%) datasets. The results show that the rescoring method proposed in this paper, ‘rescore after OWALM decode’ (green bar), results in a WER that is comparable with decoding using a bigger language model (brown bar), and this is true for all selection methods. Our proposed method is more effective in leveraging the availability of a bigger text corpus for low-resource languages and also saves computational resources.

7.2 Datasets of Different Sizes

In this section, the results depict the effect of initial decoding with the baseline LM enhanced to include the OOT unigrams, followed by lattice rescoring with the LM augmented with the full Wikipedia text, on datasets of different sizes, as listed in Table 9. The relative improvement obtained after language model augmentation is more pronounced for smaller speech
Table 7. IV recognition in Telugu ASR for different text-selection-based LM augmentation

                             IV Recognized (%)
LM Augmentation              Decoding   Lattice Rescoring   Rescore after OWALM Decode (our method)
Baseline                     91.97      −                   −
Full Wiki                    94.59      6.84                35.9
Contrastive Selection        94.34      92.87               94.34
Delta Likelihood Selection   94.26      92.57               94.06
Entropy-Based Selection      94.28      92.28               94.28
Random Selection             94.17      92.87               92.85

− OWALM: OOT Words Augmented Language Model
Table 8. IV recognition in Kannada ASR for different text-selection-based LM augmentation

                             IV Recognized (%)
LM Augmentation              Decoding   Lattice Rescoring   Rescore after OWALM Decode (our method)
Baseline                     64.83      −                   −
Full Wiki                    78.99      66.12               76.98
Contrastive Selection        77.21      67.17               77.19
Delta Likelihood Selection   73.01      67.72               76.17
Entropy-Based Selection      78.84      66.78               78.4
Random Selection             77.59      67.65               75.08

− OWALM: OOT Words Augmented Language Model
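The OOV and IV percentages in these tables are token-level recognition rates. A minimal sketch of such a rate, ignoring the time alignment a real scoring pipeline would use (the words and function name here are invented for illustration):

```python
def recognition_rate(ref_words, hyp_words, word_set):
    """Percentage of reference tokens from word_set (e.g. the OOV or
    the IV list) that also occur in the hypothesis transcript."""
    targets = [w for w in ref_words if w in word_set]
    hits = sum(1 for w in targets if w in set(hyp_words))
    return 100 * hits / len(targets) if targets else 0.0

ref = ["raama", "mane", "geddanu", "bandanu"]   # reference transcript
hyp = ["raama", "mane", "bandanu"]              # ASR hypothesis
print(recognition_rate(ref, hyp, {"geddanu", "bandanu"}))  # 50.0
```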
Table 9. Results for different-sized datasets

Duration           Language Model (Decode / Rescore)   WER (%)   OOV Recognized (%)   IV Recognized (%)
4 hrs (Kannada)    Baseline                            52.01     −                    68.43
                   OWALM Decode                        38.79     35.94                69.30
                   Wiki Decode                         28.12     62.28                78.99
                   Wiki Rescore after OWALM Decode     30.27     56.54                76.98
10 hrs (Telugu)    Baseline                            43.43     −                    86.89
                   OWALM Decode                        33.81     28.75                85.2
                   Wiki Decode                         29.16     34.38                91.47
                   Wiki Rescore after OWALM Decode     29.44     34.32                90.77
20 hrs (Telugu)    Baseline                            29.97     −                    91.53
                   OWALM Decode                        24.52     31.33                91.06
                   Wiki Decode                         22.53     34.9                 93.74
                   Wiki Rescore after OWALM Decode     22.3      36.5                 94.06
30 hrs (Telugu)    Baseline                            27.46     −                    91.64
                   OWALM Decode                        22.86     32.14                92.18
                   Wiki Decode                         20.95     36.83                94.69
                   Wiki Rescore after OWALM Decode     20.89     36.5                 94.27
40 hrs (Telugu)    Baseline                            26.76     −                    91.97
                   OWALM Decode                        22.66     31.27                92.05
                   Wiki Decode                         21.01     36.38                94.59
                   Wiki Rescore after OWALM Decode     20.92     35.90                94.27

− OWALM: OOT Words Augmented Language Model
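The size trend is easy to verify from Table 9: the relative WER reduction of ‘Wiki rescore after OWALM decode’ over the baseline shrinks steadily as the amount of training speech grows:

```python
# (baseline WER, WER after Wiki rescore following OWALM decode), Table 9.
results = {"4 h Kannada": (52.01, 30.27), "10 h Telugu": (43.43, 29.44),
           "20 h Telugu": (29.97, 22.30), "30 h Telugu": (27.46, 20.89),
           "40 h Telugu": (26.76, 20.92)}
for name, (base, ours) in results.items():
    print(f"{name}: {100 * (base - ours) / base:.1f}% relative WER reduction")
# 4 h Kannada: 41.8%, 10 h Telugu: 32.2%, 20 h Telugu: 25.6%,
# 30 h Telugu: 23.9%, 40 h Telugu: 21.8% — monotonically decreasing.
```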
Figure 3. WER, OOV and IV percentages for the 4-hour dataset

Figure 4. WER, OOV and IV percentages for the 10-hour dataset

Figure 5. WER, OOV and IV percentages for the 20-hour dataset

Figure 6. WER, OOV and IV percentages for the 30-hour dataset
datasets and reduces with an increase in the size of the corpora. Nevertheless, the results obtained with our approach remain comparable with decoding using the full-Wikipedia-augmented language model across all dataset sizes, and our method gains on full-Wikipedia decoding as the dataset size grows.

Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7 depict the effect on WER, OOV and IV recognition for the different dataset sizes: 4 hours of Kannada and 10 hours, 20 hours, 30 hours and 40 hours of Telugu, respectively. The data points in each figure are the baseline (with no LM augmentation), OWALM decode, ‘Wiki full count merge decode’ (decoding with the LM augmented with the complete Wikipedia text using count merge) and ‘Wiki rescore after OWALM decode’ (rescoring with the count-merge-augmented LM after decoding with the OWALM). The data points for ‘oov wiki aug’ and ‘oov words wiki aug rescore’ are closer to each other than either is to the baseline. This indicates that first decoding with the baseline LM augmented with only the OOT unigrams increases the language model scores for these OOV words. This makes the later rescoring with a larger LM effective, and the accuracy is comparable with that obtained after decoding with a larger language model. Also, the IV recognition shows improvement with respect to the baseline. This is true across datasets of different sizes. The relative improvement, in terms of WER and OOV recovery, is, however, more pronounced for smaller datasets.

Figure 7. WER, OOV and IV percentages for the 40-hour dataset
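The ‘count merge’ augmentation referred to throughout combines the n-gram counts of the two corpora before a single LM is estimated. A minimal sketch, where the helper name and the weighting scheme are our own simplification rather than the exact recipe used in the experiments:

```python
from collections import Counter

def count_merge(base_counts, extra_counts, beta=1.0):
    """Merge n-gram counts from the baseline and the larger corpus;
    beta scales the contribution of the larger, out-of-domain text."""
    merged = Counter(base_counts)
    for ngram, count in extra_counts.items():
        merged[ngram] += beta * count
    return merged

base = Counter({("the", "cat"): 3, ("cat", "sat"): 2})
wiki = Counter({("the", "cat"): 10, ("cat", "ran"): 4})
merged = count_merge(base, wiki, beta=0.5)
print(merged[("the", "cat")], merged[("cat", "ran")])  # 8.0 2.0
```

Down-weighting the web counts (beta < 1) keeps the in-domain baseline statistics from being swamped by the much larger out-of-domain corpus.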
8 Conclusions and Future Work

Our experiments show that rescoring with a language model trained on a larger text may not be very effective in the case of low-resource languages, since the baseline language model is not sufficiently large to cover all the possible contexts for the words in the vocabulary. Hence, it is essential that we augment the baseline language model with the minimal data that comprises the enhanced vocabulary, obtain a lattice from the decoding graph built using the minimally enhanced grammar, and then rescore with a larger language model. Our method of building a decoding graph from a grammar augmented with only the out-of-train word unigrams from Wikipedia and then rescoring the obtained lattice with a larger language model is as effective (more effective in the case of Telugu and comparable in the case of Kannada) as decoding with an HCLG graph built using a language model augmented with the entire Wikipedia text. This method is applicable across different-sized datasets and for different text selection methods. We can thus leverage the availability of larger amounts of non-domain text corpora while, at the same time, reducing the computational overhead of decoding with a bigger language model. It would be interesting to investigate the application of our approach to other low-resource languages. Further, we also intend to explore our approach together with the complementary morpheme-based approach and with approaches for named entity recognition. We leave this for future work.

Acknowledgements

We thank Dr K.V. Subramanian, Head, Center for Cloud Computing and Big Data, PES University, Bangalore, for all the support.