Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Efficient MDI Adaptation for N-gram Language Models

Author Name1 , Co-author Name2


1
Author Affiliation
2
Co-author Affiliation
author@university.edu, coauthor@company.com

Abstract words W = w1 w2 . . . wL as (n − 1)-order Markov model.


n-gram model, the probability
More specifically, in Q of the word
This paper presents an efficient algorithm for n-gram language sequence p(W ) = L i QL i−1
i=1 p(wi |w1 ) ≈ i=1 p(wi |wi−n+1 ),
model adaptation under the minimum discrimination informa- i−1
where wi−n+1 = wi−n+1 wi−n+2 . . . wi is the length-(n − 1)
tion (MDI) principle, where a out-domain language model is
history hi of word wi , and we omit the index i when the context
adapted to satisfy the constraints of marginal distributions de-
is clear.
rived from in-domain corpus. The challenge for MDI language
The n-gram language model defines a probability distribu-
model adaptation is its computational complexity. By taking ad-
tion p(h, w), or the conditional probability p(w|h), over the
vantage of the backoff structure of n-gram model and the idea of
event space V n where V is the vocabulary. However, the space
hierarchical training method, originally proposed for maximum
V n is very large that not every n-gram hw is seen in the train-
entropy (ME) language models, we show that MDI adaptation
ing text corpus, known as the sparse data problem. Thus, there
can be computed in linear-time complexity to the inputs, same
have been smoothing techniques to estimate the probability
as computing ME models, although MDI is more general than
p(w|h) for the unseen n-grams. One of the popular techniques
ME. This makes MDI adaptation possible for large corpus and
is backing-off. The idea is to recursively estimate p(w|h) of
vocabulary. Experimental results show that ...
unseen n-grams based on the lower order (n − 1)-gram prob-
Index Terms: speech recognition, language model, n-gram,
abilities p(w|h0 ), where h0 = wi−n+2
i−1
, which may have been
adaptation
seen in the corpus. More specifically,

1. Introduction  ∗
p (w|h) hw is seen in corpus
p(w|h) = (1)
Related work: bow(h) · p(w|h0 ) otherwise

• N-gram adaptation [1]: where the discounted probability p∗ (w|h) and the backoff
weight bow(h) areP together to ensure the conditional probabil-
– simple interpolation: λLM1 + (1 − λ)LM2
ity sums to one: w∈V p(w|h) = 1. (Note that the model
– context-aware interpolation ratio [2]: λ(h, w) smoothed by interpolation is just a special case of backoff
– log-linear interpolation [3], [4]: λ log LM1 + (1 − model.)
λ) log LM2 In practice, the backoff n-gram language models are stored
in the ARPA format, which is a compact representation of the
• ME models and hierarchical training method: probability p(w|h) defined over V n . Simply speaking, for each
n-gram hw as well as lower order n0 -grams seen in the corpus,
– Jun Wu’s paper [5]
its entry is stored with its discounted log probability p∗ (w|h)
• MDI adaptation: (in log10 format) and optionally its backoff weight bow(h):

– first MDI language model [6], [7], first MDI adap- log10 p∗ (w|h) hw [log10 bow(h)]
tation [8], Sanjeev [9], [10], [11]
– efficient algorithm [12] and [13] only consider un- Note that the backoff weight is omitted when bow(h) = 1,
igram constraints where there exists closed-form except for the n-grams end with the end-of-a-sentence symbol
solutions “< /s >”, which obviously cannot be backed-off to from any
other n-grams.
We may compare our work with existing work in sev-
eral perspectives, such as perplexity, WER, computational cost 3. LM Adaptation and MDI Adaptation
(time, space, size of LMs) and number of parameters.
Justification: Why can we believe MDI or MaxEnt may be Consider a common scenario where we hope to compute a lan-
better than interpolation? In what cases can MDI be more ap- guage model from the task-specific (in-domain) corpus, which
propriate or better than simple interpolation? is small and too sparse to estimate a good LM. Fortunately, we
can do better by capitalizing on a large, general-domain back-
ground (out-domain) corpus. This motivates LM adaptation,
2. Backoff N-gram Language Models which is to estimate the in-domain LM based on both the em-
This sections introduce n-gram language models, backoff n- pirical statistics extracted from the in-domain corpus and the
gram models and the ARPA format. (larger and richer) out-domain data.
Language model (LM) is a probability distribution over The question is how to combine information from the two
word sequences. N-gram model is the most commonly used sources, e.g., two probability distributions, in a suitable man-
language modeling technique, which views the sequence of ner? We adopt the MDI framework. The idea is to compute
the target distribution pad such that (1) it matches all in-domain Algorithm 1 Generalized Iterative Scaling (GIS) Algorithm
statistics characterizing the in-domain distribution pin while (2) Require: pout (w3 |w12 ), ph (w1 , w2 ) and α1 , . . . , αK
remain “as close as possible” to the out-domain distribution (0) (0) (0)
1: Set λ1 = λ2 = . . . = λK = 0
pout . For simplicity, we take a trigram LM as an example to
2: n = 0
illustrate the idea in the rest of the paper.
3: while stopping criterion not met do
Formally, the in-domain statistics can be regarded as a set of 2 (n)
) exp ( K 3
P
(n) pout (w3 |w1 i=1 λi fi (w1 ))
linear equality constraints that pad must satisfy. The constraints 4: pad (w3 |w12 ) := 2 ,λ (n) (n)
Z(w1 ,...,λK )
are defined based on a set of feature functions fi : V n 7→ R as: 1
5: for j = 1, · · · , K do
Epad (w3 ) [fi (w13 )] = αi , i = 1, 2, . . . , K (2) 6: α̂j := Ep(n) (w3 ) [fj (w13 )]
1 1
ad
αj
The constant αi are estimated from the in-domain corpus. In 7: ∆j := log α̂j
particular, when fi is a binary function and corresponds to the 8: Update parameter: λj
(n+1)
:= λ(n) + η · ∆j
j
marginal distribution of pin , for example, a bigram w̄2 w̄3 : 9: end for

1 w2 = w̄2 and w3 = w̄3 10: for each history w1 w2 in training data do
fi (w13 ) =
0 otherwise 11: Update normalization term:
(n+1) (n+1)
Then, the constraint can be written as: Z(w12 , λ1 , . . . , λK )
(n+1)
:= w pout (w3 |w1 ) exp ( K
2
fi (w13 ))
P P
i=1 λi
X X 12: end for
pad (w13 ) = pin (w13 ) 13: n := n + 1
w2 w3 =w̄2 w̄3 w2 w3 =w̄2 w̄3
14: end while
(n)
⇔ pad (w̄2 w̄3 ) = pin (w̄2 w̄3 ) 15: return pad (w3 |w12 ) as p∗ad
On the other hand, the closeness of two distributions q and
p can be measured by Kullback-Liebler (KL) distance:
X q(w13 ) 4. Efficient Algorithm: The Hierarchical
D(q||p) = q(w13 ) log (3)
n
p(w13 ) Training Method
hw∈V

Therefore, the MDI adaptation problem can be stated as: The challenge for implementing the above GIS algorithm is its
computational complexity: in line 6 and 11, a naive implemen-
p∗ad = arg min D(pad ||pout ), subject to: (4) tation may loop over w13 ∈ V 3 space, which is astronomical
pad
when the vocabulary V is large.
Epad (w3 ) [fi (w13 )] = αi , i = 1, 2, . . . , K (5)
1

If the constraints are consistent, there exists a unique solu- 4.1. The Hierarchical Training Method for MDI Adapta-
tion [?]. Using Lagrange multipliers, the solution of this con- tion v.s. ME Models
strained optimization problem can be shown to be of the form: Fortunately, to overcome this challenge, the hierarchical train-
pout (w13 ) exp ( K
P 3 ing method [5] has been proposed for maximum entropy (ME)
i=1 λi fi (w1 ))
pad (w13 ) = models. The algorithm only requires linear time to its inputs,
Z(λ1 , . . . , λK )
instead of the V n space. The trick is based on the backoff struc-
where Z(λ1 , . . . , λK ) is the normalization term by summing ture of probability q (n) (w3 |w12 ) and the idea of shared and in-
up all the numerators. The parameter λi can be found by the cremental computation. As ME models can be view as a special
generalized iterative scaling (GIS) algorithm. Note that in the case of MDI adaptation where the out-domain distribution pout
case that pout is the uniform distribution, p∗ad is indeed a maxi- is the uniform distribution, now we show that similar algorith-
mum entropy model. mic trick can be applied to MDI adaptation without incurring
In language modeling, we are more interested in the con- extra computation complexity. The difference is we need to
ditional distribution pad (w3 |w12 ). We can formulate the prob- handle the non-uniform pout appropriately.
lem in terms of pad (w3 |w12 ) and compute pad (w3 |w12 ) directly, To illustrate the idea, let us start with a general language
once we know the distribution of histories ph (w12 ) and assume model p(w3 |w12 ) that has the backoff structure as in equation 1.
it is shared among pout and pad . So, the conditional distance Consider two general cases where we need to calculate the left-
measure becomes: aligned or right-aligned summation of the product of p(w3 |w12 )
X X q(w3 |w12 ) and a scaling factor c(w13 ):
D(q||p|ph ) = ph (w12 ) q(w3 |w12 ) log
p(w3 |w12 ) X
2w1 w∈V ΣL (w12 ) = p(w3 |w12 ) · c(w13 ) (7)
w3 ∈V
The adaptation problem becomes finding p∗ad (w3 |w12 ) that min- X
imizes D(pad ||pout |ph )) and satisfies the constraints in (5). As ΣR (w23 ) = p(w3 |w12 ) · c(w13 ) (8)
before, the solution is of the form: w1 ∈V

pout (w3 |w12 ) exp ( K 3


P
i=1 λi fi (w1 )) The scaling factor is defined as c(w13 )
pad (w3 |w12 ) = 2
(6) =
Z(w1 , λ1 , . . . , λK ) exp ( K
P 3
i=1 λi fi (w1 )). So, the normalization term Z in
The parameters can be found by GIS algorithm and ph is ap- line 11 is in fact a sort of ΣL , and the α̂j in line 6 corresponds
proximated by empirical values p̂h estimated from the corpus. to ΣR when the feature functions fi corresponds to marginal
We describe the GIS algorithm for determining parameters distribution. We will show that computing the two summations
in (6) as follows (copied from [9], except adding the learning above takes only linear time with regards to the number of
rate η, which is what we have implemented): entries in the ARPA format representing p(w3 |w12 ), i.e., the
seen n-grams, plus the number K of features. After that, we us see how to compute ΣR (w3 ):
will prove that pad , or q (n) (w3 |w12 ), has the backoff structure X
as in equation 1, and thus the proposed algorithm can be readily ΣR (w3 ) = p(w3 |w12 ) · c(w13 )
applied to it. 2 ∈V 2
w1
X
4.2. Computing ΣL (w12 ) for history w12 = ΣR (w23 )
w2 ∈V
Consider equation 7, where we compute ΣL (w12 ) by summing X
p∗ (w3 |w12 ) − bow(w12 ) · p(w3 |w2 ) · c(w13 )

=
over w3 ∈ V given history w12 . By definition, for every history 2
∈V 2
w12 , it costs |V | addition operations. However, this cost can be w1
3
∧seen(w1 )
reduced to the number of seen w3 given w12 , much less than |V |. X
As p(w3 |w12 ) is a backoff model, we rewrite equation 7 as: + p(w3 |w2 ) · c(w23 ) · g(w2 )
w2 ∈V
X
ΣL (w12 ) p(w3 |w12 ) c(w13 )
X
= · p(w3 |w2 ) · bow(w12 ) · c(w13 ) − c(w23 )

+
w3 ∈V 2
w1 ∈V 2
X ∗ 3 3
(w3 |w12 ) bow(w12 ) c(w13 ) ∧c(w1 )6=c(w2 )

= p − · p(w3 |w2 ) · X
p∗ (w3 |w12 ) − bow(w12 ) · p(w3 |w2 ) · c(w13 )

w3 ∈V =
3
∧seen(w1 )
2
X w1 ∈V 2
+ bow(w12 ) p(w3 |w2 ) · c(w23 ) ∧seen(w13
)
w3 ∈V
X
X + (p∗ (w3 |w2 ) − bow(w2 ) · p(w3 )) · c(w23 ) · g(w2 )
bow(w12 ) c(w13 ) c(w23 )

+ p(w3 |w2 ) · − w2 ∈V
3
∧seen(w2 )
w3 ∈V
3 3
∧c(w1 )6=c(w2 )
X
+ p(w3 ) · c(w3 ) bow(w2 ) · g(w2 )
w2 ∈V
For
P the second term above, we can define ΣL (w2 ) = X
p(w3 ) · bow(w2 ) · c(w23 ) − c(w3 )

3 +
w3 ∈V p(w 3 |w2 ) · c(w 2 ). This appears to be a dynamic
programming problem, where ΣL (w2 ) is shared among all w2 ∈V
3
∧c(w2 )6=c(w3 )
ΣL (w12 ) where w1 ∈ V , and the base case is ΣL (∅) = X
p(w3 |w2 ) · bow(w12 ) · c(w13 ) − c(w23 )
P
w3 ∈V p(w3 ) · c(w3 ), needed to compute only once.

+
Thus, ΣL (w12 ) for history w12 can be computed hierarchi- 2
w1 ∈V 2
3 3
cally and bottom-up from ΣL (∅) along the backoff structure ∧c(w1 )6=c(w2 )

of p(w3 |w12 ). The computation of ΣL (∅) requires the time of P


O(|V |). To compute any ΣL (w12 ), one only need to compute
We denote g(∅) = w2 ∈V bow(w2 ) · g(w2 ) =
P 2
ΣL for its sub-n-grams, ΣL (w2 ) and ΣL (∅) in this case, and 2
w ∈V 2 bow(w2 ) · bow(w 1 ). On the other hand, we see that:
1
the complexity is the number of seen n-grams along the way,
plus the number of distinct scaling factors K at most, as each ΣR (w13 ) = p(w3 |w12 ) · c(w13 )
positive c term is accessed only once. Recall in the ME case,
p(w3 |w12 ) is taken as 1, and the above equation can be simpli- By observing the marginalization of ΣR (w13 ), ΣR (w23 ) and
fied to contain only the scaling factor c. ΣR (w3 ) above, we can come up with an algorithm to compute
any ΣR . It enumerates all seen n-grams and distinct scaling
4.3. Computing ΣR (w23 ) as marginalization factors, when the n-gram’s suffix matches the parameter of ΣR ,
it is added to ΣR . The last problem is to compute g, which turns
Consider equation 8, where we compute ΣR (w23 ) by summing out to be also in linear time.
over w1 ∈ V given the suffix w23 . This is actually to marginalize
the quantity of w13 to w23 . Again, ΣR (w23 ) can be efficiently 4.4. The backoff structure of pad
computed as:
The adapted model can be written as:
X
ΣR (w23 ) = p(w3 |w12 ) · c(w13 ) pout (w3 |w12 ) · c(w13 )
pad (w3 |w12 ) =
w1 ∈V Z(w12 )
X
p∗ (w3 |w12 ) − bow(w12 ) · p(w3 |w2 ) · c(w13 )

= If pout is a backoff model as in equation 1, then so is pad :
w1 ∈V
3
 ∗
∧seen(w1 )  pad (w3 |w12 ) if w13 seen in pout
2
pad (w3 |w1 ) = or w13 is a constraint
X
+ p(w3 |w2 ) · c(w23 ) bow(w12 )
bowad (w12 ) · pad (w3 |w2 ) otherwise

w1 ∈V
X
p(w3 |w2 ) · bow(w12 ) · c(w13 ) − c(w23 )

+ where:
pout (w3 |w12 ) · c(w13 )
w1 ∈V
3 3 p∗ad (w3 |w12 ) =
∧c(w1 )6=c(w2 ) Z(w12 )
∗ Z(w2 )
This looks a bit different from section 4.2, as ΣR (w23 ) is not bowad (w12 ) = bowout (w12 )
Z(w12 )
decomposed to sub-problem ΣR (w3 ). We denote g(w2 ) =
P 2
w1 ∈V bow(w1 ) in the second term for simplicity. Now, let The lower order n-grams of pad are defined analogously.
4.5. Computing α for pin , Z and α̂ for pad [8] P. S. Rao, S. Dharanipragada, and S. Roukos, “Mdi adaptation of
language models across corpora,” in Fifth European Conference
As pin and pad are backoff models, their marginal distribution on Speech Communication and Technology, 1997.
α and α̂ can be computed using the trick above to compute ΣR ,
[9] M. Weintraub, Y. Aksu, S. Dharanipragada, S. Khudanpur,
and the normalization term Z according to ΣL . However, there H. Ney, J. Prange, A. Stolcke, F. Jelinek, and E. Shriberg, “Lm95
is minor difference when computing α or α̂ compared to ΣR . project report: Fast training and portability. rapport technique,”
By definition: Center for Language and Speech Processing, Johns Hopkins Uni-
X versity, Baltimore, Maryland, États-Unis, 1996.
ΣR (w23 ) = p(w3 |w12 ) · c(w13 ) [10] W. Reichl, “Language model adaptation using minimum discrim-
w1 ∈V ination information,” in Sixth European Conference on Speech
X Communication and Technology, 1999.
α(w23 ) = pin (w3 |w12 ) · ph (w12 ) · c(w13 )
w1 ∈V
[11] S. F. Chen, “Shrinking exponential language models,” in Proceed-
ings of Human Language Technologies: The 2009 Annual Confer-
X pout (w3 |w12 ) · ph (w12 ) · c(w13 ) ence of the North American Chapter of the Association for Com-
α̂(w23 ) = putational Linguistics. Association for Computational Linguis-
w ∈V
Z(w12 )
1 tics, 2009, pp. 468–476.
It can be shown that their computational complexity are the [12] M. Federico, “Efficient language model adaptation through mdi
estimation,” in Sixth European Conference on Speech Communi-
same. We only need to re-define the function g to contain extra
2
ph (w1 )
cation and Technology, 1999.
terms ph (w12 ) or Z(w1 2) , and computing g 0 is still in linear time. [13] R. Kneser, J. Peters, and D. Klakow, “Language model adapta-
tion using dynamic marginals,” in Fifth European Conference on
5. Implementation Issues Speech Communication and Technology, 1997.

(1) be cautious handling “< s >” and “< /s >”; (2) when pout
is given as conditional probability, we need to approximately
recover the joint probability, i.e., we need to come up with ph
and compute the marginal probabilities for the arpa LM; (3)
magnitude of lambdas, overflow, constraint selection, constraint
interaction; (4) the choice of vocabulary; (5) vectorization

6. Experimental Results
Settings: (1) comparing different constraints, (2) comparing
different datasets, formal or conversational, (3) using the in-
domain vocabulary vs. using larger-sized or out-domain vocab-
ulary.

7. Conclusion
8. Acknowledgements
9. References
[1] J. R. Bellegarda, “Statistical language model adaptation: review
and perspectives,” Speech communication, vol. 42, no. 1, pp. 93–
108, 2004.
[2] X. Liu, M. J. F. Gales, and P. C. Woodland, “Use of contexts in
language model interpolation and adaptation,” Computer Speech
& Language, vol. 27, no. 1, pp. 301–321, 2013.
[3] A. Gutkin, “Log-linear interpolation of language models.”
[4] K. Heafield, C. Geigle, S. Massung, and L. Schwartz, “Normal-
ized log-linear interpolation of backoff language models is effi-
cient,” in Proceedings of the 54th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long Papers),
2016, pp. 876–886.
[5] J. Wu and S. Khudanpur, “Efficient training methods for maxi-
mum entropy language modeling,” in Sixth International Confer-
ence on Spoken Language Processing, 2000.
[6] S. Della Pietra, V. Della Pietra, R. L. Mercer, and S. Roukos,
“Adaptive language modeling using minimum discriminant esti-
mation,” in [Proceedings] ICASSP-92: 1992 IEEE International
Conference on Acoustics, Speech, and Signal Processing, vol. 1.
IEEE, 1992, pp. 633–636.
[7] P. S. Rao, M. D. Monkowski, and S. Roukos, “Language model
adaptation via minimum discrimination information,” in 1995 In-
ternational Conference on Acoustics, Speech, and Signal Process-
ing, vol. 1. IEEE, 1995, pp. 161–164.

You might also like