Efficient MDI Adaptation For N-Gram Language Models
1. Introduction

Related work:

• N-gram adaptation [1]:
  – simple interpolation: λ LM1 + (1 − λ) LM2
  – context-aware interpolation ratio [2]: λ(h, w)
  – log-linear interpolation [3], [4]: λ log LM1 + (1 − λ) log LM2
• ME models and hierarchical training method:
  – Jun Wu's paper [5]
• MDI adaptation:
  – first MDI language model [6], [7], first MDI adaptation [8], Sanjeev [9], [10], [11]
  – efficient algorithms: [12] and [13] only consider unigram constraints, for which closed-form solutions exist

We may compare our work with existing work from several perspectives, such as perplexity, WER, computational cost (time, space, size of the LMs), and number of parameters.

Justification: Why can we believe that MDI or MaxEnt may be better than interpolation? In what cases can MDI be more appropriate or better than simple interpolation?

2. Backoff N-gram Language Models

This section introduces n-gram language models, backoff n-gram models, and the ARPA format.

A language model (LM) is a probability distribution over word sequences. The n-gram model is the most commonly used language modeling technique: it views the sequence of words as a Markov chain in which each word depends only on the preceding n − 1 words, the history h. Since many n-grams are unseen in training, the probability of an n-gram hw is smoothed by backing off to lower-order estimates:

    p(w|h) = \begin{cases} p^*(w|h) & \text{if } hw \text{ is seen in the corpus} \\ \mathrm{bow}(h) \cdot p(w|h') & \text{otherwise} \end{cases}    (1)

where the discounted probability p∗(w|h) and the backoff weight bow(h) are estimated together to ensure that the conditional probability sums to one: Σ_{w∈V} p(w|h) = 1. (Note that a model smoothed by interpolation is just a special case of a backoff model.)

In practice, backoff n-gram language models are stored in the ARPA format, which is a compact representation of the probability p(w|h) defined over V^n. Simply speaking, for each n-gram hw, as well as each lower-order n′-gram seen in the corpus, an entry is stored with its discounted log probability p∗(w|h) (in log10 format) and optionally its backoff weight bow(h):

    log10 p∗(w|h)    hw    [log10 bow(h)]

Note that the backoff weight is omitted when bow(h) = 1, as well as for n-grams ending with the end-of-sentence symbol "</s>", which obviously cannot serve as a backoff history for any other n-gram.
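To make the backoff recursion in equation (1) concrete, the following minimal Python sketch looks up p(w|h) in an ARPA-style table. The dictionaries logprob and backoff (keyed by n-gram tuples) and the score function are hypothetical names used only for this illustration, not part of any particular toolkit, and the toy entries are invented.

import math

# Hypothetical ARPA-style tables: n-gram tuple -> log10 probability / log10 backoff weight.
logprob = {
    ("<s>",): -99.0, ("the",): -1.2, ("cat",): -2.3, ("sat",): -2.5, ("</s>",): -1.5,
    ("the", "cat"): -0.7, ("cat", "sat"): -0.9, ("sat", "</s>"): -0.4,
}
backoff = {("<s>",): -0.3, ("the",): -0.5, ("cat",): -0.4, ("sat",): -0.6}

def score(history, word):
    """Return log10 p(word | history) following the backoff recursion of equation (1)."""
    ngram = tuple(history) + (word,)
    if ngram in logprob:                    # hw seen in the corpus: use the discounted p*(w|h)
        return logprob[ngram]
    if not history:                         # unseen unigram; in practice mapped to an <unk> entry
        return float("-inf")
    bow = backoff.get(tuple(history), 0.0)  # log10 bow(h); an omitted entry means bow(h) = 1
    return bow + score(history[1:], word)   # back off to the shorter history h'

if __name__ == "__main__":
    print(10 ** score(["the"], "cat"))      # explicit bigram entry
    print(10 ** score(["the"], "sat"))      # backs off: bow(the) + log10 p(sat)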
3. LM Adaptation and MDI Adaptation

Consider a common scenario where we hope to compute a language model for a task-specific (in-domain) corpus, which is small and too sparse to estimate a good LM. Fortunately, we can do better by capitalizing on a large, general-domain background (out-domain) corpus. This motivates LM adaptation, which is to estimate the in-domain LM based on both the empirical statistics extracted from the in-domain corpus and the (larger and richer) out-domain data.

The question is how to combine the information from the two sources, e.g., two probability distributions, in a suitable manner. We adopt the MDI framework. The idea is to compute the target distribution p_ad such that (1) it matches all in-domain statistics characterizing the in-domain distribution p_in, while (2) it remains "as close as possible" to the out-domain distribution p_out. For simplicity, we take a trigram LM as an example to illustrate the idea in the rest of the paper.

Formally, the in-domain statistics can be regarded as a set of linear equality constraints that p_ad must satisfy. The constraints are defined based on a set of feature functions f_i : V^n → R as:

    E_{p_{ad}(w_1^3)}[f_i(w_1^3)] = \alpha_i, \quad i = 1, 2, \ldots, K    (2)

The constants α_i are estimated from the in-domain corpus. In particular, f_i may be a binary function corresponding to a marginal distribution of p_in, for example, to the bigram w̄_2 w̄_3:

    f_i(w_1^3) = \begin{cases} 1 & w_2 = \bar{w}_2 \text{ and } w_3 = \bar{w}_3 \\ 0 & \text{otherwise} \end{cases}

Then, the constraint can be written as:

    \sum_{w_1^3 : w_2 w_3 = \bar{w}_2 \bar{w}_3} p_{ad}(w_1^3) = \sum_{w_1^3 : w_2 w_3 = \bar{w}_2 \bar{w}_3} p_{in}(w_1^3)
    \;\Leftrightarrow\; p_{ad}(\bar{w}_2 \bar{w}_3) = p_{in}(\bar{w}_2 \bar{w}_3)
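As a concrete illustration of such a bigram constraint, the sketch below builds the binary feature f_i for one target bigram and estimates its target value α_i as the bigram's empirical marginal in the in-domain corpus. The function names and the toy corpus are assumptions made only for this example.

from collections import Counter

def make_bigram_feature(target_w2, target_w3):
    """Binary feature f_i(w1, w2, w3) that fires when the trigram ends in the target bigram."""
    def f(w1, w2, w3):
        return 1.0 if (w2, w3) == (target_w2, target_w3) else 0.0
    return f

def estimate_alpha(in_domain_trigrams, feature):
    """alpha_i = E_{p_in}[f_i], estimated from empirical trigram counts of the in-domain corpus."""
    counts = Counter(in_domain_trigrams)
    total = sum(counts.values())
    return sum(c * feature(*tri) for tri, c in counts.items()) / total

if __name__ == "__main__":
    # A toy in-domain corpus represented as a list of trigrams (assumed for illustration).
    trigrams = [("we", "the", "people"), ("of", "the", "people"), ("the", "people", "choose")]
    f_i = make_bigram_feature("the", "people")
    print(estimate_alpha(trigrams, f_i))   # 2/3: the empirical marginal p_in(the, people)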
On the other hand, the closeness of two distributions q and p can be measured by the Kullback-Leibler (KL) distance:

    D(q \| p) = \sum_{w_1^3 \in V^n} q(w_1^3) \log \frac{q(w_1^3)}{p(w_1^3)}    (3)

Therefore, the MDI adaptation problem can be stated as:

    p_{ad}^* = \arg\min_{p_{ad}} D(p_{ad} \| p_{out}), \quad \text{subject to:}    (4)

    E_{p_{ad}(w_1^3)}[f_i(w_1^3)] = \alpha_i, \quad i = 1, 2, \ldots, K    (5)
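For intuition about the objective in (3) and (4), the short sketch below evaluates D(q‖p) for two toy distributions over a three-word vocabulary; the distributions are invented purely for illustration.

import math

def kl_divergence(q, p):
    """D(q || p) = sum_x q(x) * log(q(x) / p(x)); terms with q(x) = 0 contribute nothing."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0.0)

if __name__ == "__main__":
    q = {"cat": 0.5, "dog": 0.3, "fish": 0.2}   # e.g., a candidate adapted distribution
    p = {"cat": 0.4, "dog": 0.4, "fish": 0.2}   # e.g., the out-domain distribution
    print(kl_divergence(q, p))                  # a small positive number; zero iff q == p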
If the constraints are consistent, there exists a unique solution [?]. Using Lagrange multipliers, the solution of this constrained optimization problem can be shown to be of the form:

    p_{ad}(w_1^3) = \frac{p_{out}(w_1^3) \exp\!\big(\sum_{i=1}^{K} \lambda_i f_i(w_1^3)\big)}{Z(\lambda_1, \ldots, \lambda_K)}

where Z(λ_1, ..., λ_K) is the normalization term obtained by summing up all the numerators. The parameters λ_i can be found by the generalized iterative scaling (GIS) algorithm. Note that in the case where p_out is the uniform distribution, p*_ad is indeed a maximum entropy model.

In language modeling, we are more interested in the conditional distribution p_ad(w_3|w_1^2). We can formulate the problem in terms of p_ad(w_3|w_1^2) and compute p_ad(w_3|w_1^2) directly, once we know the distribution of histories p_h(w_1^2) and assume it is shared between p_out and p_ad. The conditional distance measure then becomes:

    D(q \| p \mid p_h) = \sum_{w_1^2} p_h(w_1^2) \sum_{w_3 \in V} q(w_3|w_1^2) \log \frac{q(w_3|w_1^2)}{p(w_3|w_1^2)}

The adaptation problem becomes finding p*_ad(w_3|w_1^2) that minimizes D(p_ad ‖ p_out | p_h) and satisfies the constraints in (5). As before, the solution is of the form:

    p_{ad}(w_3|w_1^2) = \frac{p_{out}(w_3|w_1^2) \exp\!\big(\sum_{i=1}^{K} \lambda_i f_i(w_1^3)\big)}{Z(w_1^2, \lambda_1, \ldots, \lambda_K)}    (6)

Algorithm 1  Generalized Iterative Scaling (GIS)
Require: p_out(w_3|w_1^2), p_h(w_1, w_2) and α_1, ..., α_K
 1: Set λ_1^(0) = λ_2^(0) = ... = λ_K^(0) = 0
 2: n := 0
 3: while stopping criterion not met do
 4:   p_ad^(n)(w_3|w_1^2) := p_out(w_3|w_1^2) exp(Σ_{i=1}^K λ_i^(n) f_i(w_1^3)) / Z(w_1^2, λ_1^(n), ..., λ_K^(n))
 5:   for j = 1, ..., K do
 6:     α̂_j := E_{p_ad^(n)(w_1^3)}[f_j(w_1^3)]
 7:     Δ_j := log(α_j / α̂_j)
 8:     Update parameter: λ_j^(n+1) := λ_j^(n) + η · Δ_j
 9:   end for
10:   for each history w_1 w_2 in the training data do
11:     Update normalization term: Z(w_1^2, λ_1^(n+1), ..., λ_K^(n+1)) := Σ_{w_3} p_out(w_3|w_1^2) exp(Σ_{i=1}^K λ_i^(n+1) f_i(w_1^3))
12:   end for
13:   n := n + 1
14: end while
15: return p_ad^(n)(w_3|w_1^2) as p*_ad
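For concreteness, here is a minimal, naive Python sketch of the GIS updates in Algorithm 1 on a toy three-word vocabulary with two constraints. It deliberately loops over every history and word (exactly the cost that Section 4 removes), uses a uniform p_out and uniform history distribution p_h, and the features, targets, iteration count, and step size eta are assumptions chosen only so that this toy example behaves well; it is a sketch, not the paper's implementation.

import math
from itertools import product

# --- Toy setup (all names and numbers below are illustrative assumptions) ---
V = ["a", "b", "c"]                                     # toy vocabulary
p_out = {(w1, w2, w3): 1.0 / len(V)                     # uniform out-domain conditional p_out(w3|w1,w2)
         for w1, w2, w3 in product(V, repeat=3)}
p_h = {(w1, w2): 1.0 / len(V) ** 2 for w1, w2 in product(V, repeat=2)}   # history distribution

features = [
    lambda w1, w2, w3: 1.0 if (w2, w3) == ("a", "b") else 0.0,   # a bigram constraint
    lambda w1, w2, w3: 1.0 if w3 == "c" else 0.0,                # a unigram constraint
]
alphas = [0.2, 0.5]                                     # in-domain target expectations, as in eq. (2)

def adapted_conditional(lambdas):
    """p_ad(w3|w1,w2) from eq. (6): rescale p_out by exp(sum_i lambda_i f_i) and renormalize."""
    p_ad = {}
    for w1, w2 in p_h:
        scores = {w3: p_out[(w1, w2, w3)] *
                      math.exp(sum(l * f(w1, w2, w3) for l, f in zip(lambdas, features)))
                  for w3 in V}
        z = sum(scores.values())                        # Z(w1^2, lambda_1, ..., lambda_K)
        for w3, s in scores.items():
            p_ad[(w1, w2, w3)] = s / z
    return p_ad

def expectations(p_ad):
    """alpha_hat_j = expectation of f_j under p_h(w1,w2) * p_ad(w3|w1,w2), as in line 6."""
    return [sum(p_h[(w1, w2)] * p_ad[(w1, w2, w3)] * f(w1, w2, w3)
                for w1, w2, w3 in p_ad)
            for f in features]

def gis(iterations=300, eta=0.5):
    """Damped GIS updates: lambda_j += eta * log(alpha_j / alpha_hat_j), as in lines 7-8."""
    lambdas = [0.0] * len(features)
    for _ in range(iterations):
        hats = expectations(adapted_conditional(lambdas))
        lambdas = [l + eta * math.log(a / h) for l, a, h in zip(lambdas, alphas, hats)]
    return lambdas

if __name__ == "__main__":
    lam = gis()
    print("lambdas:", lam)
    print("target vs. fitted:", list(zip(alphas, expectations(adapted_conditional(lam)))))

The damped step size (eta < 1) is used here only to keep the toy example numerically stable; it plays the role of η in line 8 of Algorithm 1.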
4. Efficient Algorithm: The Hierarchical Training Method

The challenge in implementing the above GIS algorithm is its computational complexity: in lines 6 and 11, a naive implementation may loop over the w_1^3 ∈ V^3 space, which is astronomical when the vocabulary V is large.

4.1. The Hierarchical Training Method for MDI Adaptation vs. ME Models

Fortunately, to overcome this challenge, the hierarchical training method [5] has been proposed for maximum entropy (ME) models. The algorithm only requires time linear in the size of its inputs, instead of the V^n space. The trick is based on the backoff structure of the probability q^(n)(w_3|w_1^2) and on the idea of shared and incremental computation. As ME models can be viewed as a special case of MDI adaptation in which the out-domain distribution p_out is the uniform distribution, we now show that a similar algorithmic trick can be applied to MDI adaptation without incurring extra computational complexity. The difference is that we need to handle the non-uniform p_out appropriately.

To illustrate the idea, let us start with a general language model p(w_3|w_1^2) that has the backoff structure of equation (1). Consider two general cases where we need to calculate the left-aligned or right-aligned summation of the product of p(w_3|w_1^2) and a scaling factor c(w_1^3):

    \Sigma_L(w_1^2) = \sum_{w_3 \in V} p(w_3|w_1^2) \cdot c(w_1^3)    (7)

    \Sigma_R(w_2^3) = \sum_{w_1 \in V} p(w_3|w_1^2) \cdot c(w_1^3)    (8)
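To give a flavor of how the backoff structure avoids summing over all of V, the sketch below computes Σ_L(w_1^2) in the simplified case where the scaling factor depends only on (w_2, w_3): only the explicitly stored trigrams of the history are visited, and the remaining mass is covered by the backoff weight times a lower-order sum S(w_2) that can be shared across all histories ending in w_2. The data structures, function names, and toy numbers are assumptions for this illustration, not the paper's implementation.

# A toy backoff LM assumed for illustration: explicit trigram entries p*(w3 | w1, w2),
# backoff weights bow(w1, w2), and a full bigram distribution p(w3 | w2).
V = ["a", "b", "c"]
tri_star = {("a", "b"): {"a": 0.5, "c": 0.2}}          # explicitly stored trigrams for history (a, b)
bow = {("a", "b"): 0.6}                                # bow(a, b); the unseen word "b" gets 0.6 * 0.5 = 0.3
bi = {"b": {"a": 0.2, "b": 0.5, "c": 0.3}}             # p(w3 | w2)

def p_tri(w1, w2, w3):
    """Backoff trigram probability, as in equation (1)."""
    if w3 in tri_star.get((w1, w2), {}):
        return tri_star[(w1, w2)][w3]
    return bow.get((w1, w2), 1.0) * bi[w2][w3]

def naive_sigma_L(w1, w2, c):
    """Direct evaluation of eq. (7): loops over the whole vocabulary."""
    return sum(p_tri(w1, w2, w3) * c(w2, w3) for w3 in V)

def fast_sigma_L(w1, w2, c, S):
    """Backoff decomposition: visit only stored trigrams, reuse the shared bigram-level sum S[w2]."""
    seen = tri_star.get((w1, w2), {})
    correction = sum((seen[w3] - bow.get((w1, w2), 1.0) * bi[w2][w3]) * c(w2, w3) for w3 in seen)
    return correction + bow.get((w1, w2), 1.0) * S[w2]

if __name__ == "__main__":
    c = lambda w2, w3: 2.0 if w3 == "c" else 1.0       # toy scaling factor depending only on (w2, w3)
    S = {w2: sum(bi[w2][w3] * c(w2, w3) for w3 in V) for w2 in bi}   # shared lower-order sums
    print(naive_sigma_L("a", "b", c), fast_sigma_L("a", "b", c, S))  # both print the same value

The point of the decomposition is that S(w_2) is computed once per bigram context and then reused by every history w_1 w_2 that backs off to it, which is the shared, incremental computation underlying the hierarchical training method.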
5. Implementation Notes

(1) Be cautious when handling "<s>" and "</s>"; (2) when p_out is given as a conditional probability, we need to approximately recover the joint probability, i.e., we need to come up with p_h and compute the marginal probabilities for the ARPA LM; (3) magnitude of the lambdas, overflow, constraint selection, constraint interaction; (4) the choice of vocabulary; (5) vectorization.
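As an illustration of point (2), the following sketch recovers an approximate joint trigram distribution from a conditional p_out and a history distribution p_h estimated from out-domain history counts, and then reads off a bigram marginal. The variable names and toy counts are assumptions for this example only.

from collections import Counter
from itertools import product

# Toy out-domain statistics assumed for illustration: history counts and a conditional p_out(w3|w1,w2).
V = ["a", "b"]
history_counts = Counter({("a", "a"): 4, ("a", "b"): 2, ("b", "a"): 3, ("b", "b"): 1})
p_out_cond = {h: {"a": 0.6, "b": 0.4} for h in history_counts}   # p_out(w3 | w1, w2)

# p_h(w1, w2): relative frequency of histories in the out-domain corpus.
total = sum(history_counts.values())
p_h = {h: c / total for h, c in history_counts.items()}

# Approximate joint: p_out(w1, w2, w3) ~= p_h(w1, w2) * p_out(w3 | w1, w2).
p_joint = {(w1, w2, w3): p_h[(w1, w2)] * p_out_cond[(w1, w2)][w3]
           for (w1, w2), w3 in product(p_h, V)}

# A marginal needed for the constraints / the ARPA LM, e.g. the bigram marginal p_out(w2, w3).
p_bigram = Counter()
for (w1, w2, w3), p in p_joint.items():
    p_bigram[(w2, w3)] += p

print(sum(p_joint.values()))        # 1.0: a proper joint distribution
print(p_bigram[("a", "a")])         # marginal probability of the bigram (a, a)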
6. Experimental Results
Settings: (1) comparing different constraints, (2) comparing different datasets, formal or conversational, (3) using the in-domain vocabulary vs. using larger-sized or out-domain vocabulary.
7. Conclusion
8. Acknowledgements
9. References
[1] J. R. Bellegarda, “Statistical language model adaptation: review
and perspectives,” Speech communication, vol. 42, no. 1, pp. 93–
108, 2004.
[2] X. Liu, M. J. F. Gales, and P. C. Woodland, “Use of contexts in
language model interpolation and adaptation,” Computer Speech
& Language, vol. 27, no. 1, pp. 301–321, 2013.
[3] A. Gutkin, “Log-linear interpolation of language models.”
[4] K. Heafield, C. Geigle, S. Massung, and L. Schwartz, “Normal-
ized log-linear interpolation of backoff language models is effi-
cient,” in Proceedings of the 54th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long Papers),
2016, pp. 876–886.
[5] J. Wu and S. Khudanpur, “Efficient training methods for maxi-
mum entropy language modeling,” in Sixth International Confer-
ence on Spoken Language Processing, 2000.
[6] S. Della Pietra, V. Della Pietra, R. L. Mercer, and S. Roukos,
“Adaptive language modeling using minimum discriminant esti-
mation,” in [Proceedings] ICASSP-92: 1992 IEEE International
Conference on Acoustics, Speech, and Signal Processing, vol. 1.
IEEE, 1992, pp. 633–636.
[7] P. S. Rao, M. D. Monkowski, and S. Roukos, “Language model
adaptation via minimum discrimination information,” in 1995 In-
ternational Conference on Acoustics, Speech, and Signal Process-
ing, vol. 1. IEEE, 1995, pp. 161–164.