Gaussian Process LSTM Recurrent Neural Network Language Models for Speech Recognition
Max W. Y. Lam⋆, Xie Chen†, Shoukang Hu⋆, Jianwei Yu⋆, Xunying Liu⋆, Helen Meng⋆

⋆ The Chinese University of Hong Kong, Hong Kong SAR, China
† Microsoft AI and Research, One Microsoft Way, Redmond, WA, USA

{wylam, skhu, jwyu, xyliu, hmmeng}@se.cuhk.edu.hk, xieche@microsoft.com
1. INTRODUCTION

Language models (LMs) play an important role in speech recognition to obtain the likelihood of a word sequence w_1^n = (w_1, \ldots, w_n):

    P(w_1^n) = P(w_1, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1^{t-1})    (1)

In general, there are two types of LMs: discrete-space LMs, e.g., n-grams (NGs) [1, 2], and continuous-space LMs, e.g., neural network language models (NNLMs) [3, 4, 5].

In particular, this paper focuses on recurrent neural network language models (RNNLMs) [5, 6, 7] with long short-term memory (LSTM) [8, 9] units, which in recent years have been shown to define the state-of-the-art performance and to give significant improvements over conventional back-off n-grams [10, 11, 12]. LSTM units recursively accumulate the sequential information in the complete word history and are well known for tracking longer history contexts, and are thus widely used in current RNNLMs. However, the fixed, deterministic parameter estimate in standard LSTM

(This research was partially funded by Research Grants Council of Hong Kong General Research Fund grant No. 14200218 and by Chinese University of Hong Kong (CUHK) grant No. 4055065.)

This paper focuses on the recurrent neural network language models (RNNLMs), which compute the word probability as

    P(w_t \mid w_1^{t-1}) \approx P(w_t \mid w_{t-1}, h_{t-1}),    (2)

where h_{t-1} is the hidden state that tries to encode (w_{t-2}, \ldots, w_1) into a D-dimensional vector, and D is the number of hidden nodes that we refer to throughout this paper. In NNLMs, a word w_t is often represented by an N-dimensional one-hot row vector \tilde{w}_t, where N is the number of words in the vocabulary. To deal with sparse text data, the one-hot vector is first projected into an M-dimensional continuous space [3], where M is the embedding size (usually M \ll N):

    x_t = U \tilde{w}_t^\top,    (3)

where U is a projection matrix that can be learned during training. Following the word embedding, we obtain the hidden state h_t by recursively applying a gating function h_t = g(h_{t-1}, x_{t-1}), a vector function that controls how much of the information carried by w_{t-1} is preserved in the "memory" h_t. The commonly used architecture of RNNLMs uses a single sigmoid-activation-based gating function parameterized by a weight matrix \Theta \in \mathbb{R}^{D \times (M+D+1)}:

    g_{\mathrm{sig}}(x_{t-1}, h_{t-1}) = \sigma(\Theta\,[x_{t-1}, h_{t-1}, 1]^\top),    (4)

where \sigma(u) = [\sigma(u_1), \ldots, \sigma(u_D)] for any u \in \mathbb{R}^D, with each \sigma(u_i) = (1 + e^{-u_i})^{-1}.
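To make the word-probability pipeline of Eqns. (1)-(4) concrete, below is a minimal NumPy sketch of one recurrence step: a one-hot word vector is projected into the embedding space (Eqn. (3)) and a sigmoid gate produces the next hidden state (Eqn. (4)). The sizes, the random initialization, and the toy word sequence are illustrative assumptions, not the paper's actual configuration.

    import numpy as np

    # Illustrative sizes (assumptions, not the paper's configuration)
    N, M, D = 10, 4, 5   # vocabulary size, embedding size, hidden nodes

    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(M, N))              # word projection matrix, Eqn (3)
    Theta = rng.normal(scale=0.1, size=(D, M + D + 1))  # gate weights, Eqn (4)

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def step(prev_word_id, h_prev):
        """One recurrence step: h_t = g_sig(x_{t-1}, h_{t-1})."""
        w_tilde = np.zeros(N)
        w_tilde[prev_word_id] = 1.0                   # one-hot row vector for w_{t-1}
        x_prev = U @ w_tilde                          # Eqn (3): x_{t-1} = U w~^T
        z = np.concatenate([x_prev, h_prev, [1.0]])   # [x_{t-1}, h_{t-1}, 1]
        return sigmoid(Theta @ z)                     # Eqn (4)

    h = np.zeros(D)
    for word_id in [3, 7, 1]:                         # a toy word-id sequence
        h = step(word_id, h)
    print(h)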
To align with the common input-output definition in neural networks, we let y_{t-1} \triangleq h_t be the output corresponding to input x_{t-1}. We can then compute the approximated probability using a softmax:

    P(w_t \mid w_{t-1}, h_{t-1}) = \frac{\exp(V y_{t-1}^\top)}{\sum \exp(V y_{t-1}^\top)},    (5)

where V is another learnable projection matrix that projects the D-dimensional output back to the N-dimensional vocabulary space, \exp(v) = [e^{v_1}, \ldots, e^{v_N}] for any v \in \mathbb{R}^N, and the sum in the denominator runs over the N vocabulary entries. All in all, the logic flow of RNNLMs is illustrated in Fig. 1.

where \otimes is the Hadamard product. An example LSTM architecture is shown in Fig. 2.

[Fig. 1. The logic flow of RNNLMs: one-hot word inputs pass through a projection layer, a recurrent layer, and a softmax layer.]

[Fig. 2. An example architecture of a LSTM cell, with its input, forget, cell and output gates.]

3. GAUSSIAN PROCESS LSTM RNNLMS

We first start with a formulation of the gates g(\cdot) in a more generalized fashion, such that the activation function a(\cdot) is not restricted to have the same form at each node, leading to

    g(z; \Theta) = a(\Theta z^\top) = [a_1(\theta_1 \cdot z), \ldots, a_D(\theta_D \cdot z)],    (12)

where \Theta = [\theta_1, \ldots, \theta_D]^\top \in \mathbb{R}^{D \times (M+D+1)} and z = [x_{t-1}, h_{t-1}, 1], so that the gates in Eqns. (4, 6-9) are just special cases of g(\cdot).
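To see how Eqn. (12) generalizes the single-activation gate of Eqn. (4), here is a minimal NumPy sketch of a gate whose nodes may each use a different activation function. The particular mix of tanh and sigmoid below is purely illustrative and is not the GP activation itself.

    import numpy as np

    def generalized_gate(z, Theta, activations):
        """g(z; Theta) = [a_1(theta_1 . z), ..., a_D(theta_D . z)], Eqn (12).

        z           : vector [x_{t-1}, h_{t-1}, 1] of length M+D+1
        Theta       : (D, M+D+1) weight matrix with rows theta_d
        activations : list of D scalar activation functions a_d
        """
        pre = Theta @ z                               # theta_d . z for every node d
        return np.array([a(u) for a, u in zip(activations, pre)])

    D, Z = 4, 7
    rng = np.random.default_rng(1)
    Theta = rng.normal(size=(D, Z))
    z = rng.normal(size=Z)

    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

    # Recovering the standard sigmoid gate of Eqn (4) as a special case:
    same_as_eqn4 = generalized_gate(z, Theta, [sigmoid] * D)

    # Mixing different activations per node is what Eqn (12) allows:
    mixed = generalized_gate(z, Theta, [np.tanh, sigmoid, np.tanh, sigmoid])
    print(same_as_eqn4, mixed)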
where the first term can be approximated by Monte-Carlo sampling of the CE loss, i.e., we draw S samples from q(\Theta) using a differentiable expression for the s-th sample,

    \Theta^{(s)}_{d,r} = \mu_{d,r} + \exp(\gamma_{d,r})\,\epsilon^{(s)}_{d,r}, \quad \epsilon^{(s)}_{d,r} \sim \mathcal{N}(0, 1),

and then average the CE losses produced by all samples for every training batch of n words:

    L_1 \approx -\frac{1}{S} \sum_{s=1}^{S} \sum_{t=2}^{n} \hat{P}(w_t \mid w_1^{t-1}) \cdot \log P(w_t \mid w_{t-1}, h_{t-1}; \Theta^{(s)}),    (19)

where \hat{P}(w_t \mid w_1^{t-1}) \in \mathbb{R}^N is the target word probability vector. It is worth pointing out that when minibatch training is employed, it suffices in practice to take only one sample (S = 1) per batch, so that no additional time cost is added over standard CE training. The second term can also be expressed in closed form:

    L_2 = \frac{1}{2} \sum_{d=1}^{D} \sum_{r=1}^{M+D+1} \left( \mu_{d,r}^2 + \exp(2\gamma_{d,r}) - 2\gamma_{d,r} - 1 \right)    (20)

4.1. Experiments on Penn Treebank Corpus

Because of practical limitations, in this paper we only consider replacing or adding a gate using GPact. To start with, we examine

Language Model                         PPL     PPL(+4G)
4-gram                                 141.7   -
Standard LSTM                          114.4   99.7
(P1) GPact as the forget gate          115.2   92.4
(P2) GPact as the input gate           115.1   91.7
(P3) GPact as the cell gate            111.9   88.3
(P4) GPact as the output gate          109.4   88.3
(P5) GPact as the c_t gate             111.2   88.2
(P6) GPact as a new gate for h_{t-1}   108.2   88.1
(P7) GPact as a new gate for x_t       112.0   90.0

Table 1. Perplexities (PPL) attained on the PTB test set by applying GPact at different positions inside a LSTM cell.
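Returning to the variational objective of Eqns. (19)-(20): the sketch below draws one reparameterized weight sample \Theta^{(s)} = \mu + \exp(\gamma)\epsilon and evaluates the closed-form KL term. The shapes and initial values are illustrative assumptions, and the cross-entropy part of the objective is only described in a comment.

    import numpy as np

    rng = np.random.default_rng(2)

    # Variational posterior q(Theta): independent Gaussians with mean mu and
    # log standard deviation gamma (so sigma = exp(gamma)), one per weight.
    D, R = 5, 10                       # illustrative shape of Theta: (D, M+D+1)
    mu = rng.normal(scale=0.1, size=(D, R))
    gamma = np.full((D, R), -3.0)      # small initial standard deviations

    def sample_theta():
        """Reparameterized draw Theta = mu + exp(gamma) * eps, eps ~ N(0, 1)."""
        eps = rng.standard_normal((D, R))
        return mu + np.exp(gamma) * eps

    def kl_term(mu, gamma):
        """Closed-form L2 of Eqn (20): KL between q(Theta) and a standard normal prior."""
        return 0.5 * np.sum(mu ** 2 + np.exp(2.0 * gamma) - 2.0 * gamma - 1.0)

    # With minibatch training a single sample (S = 1) per batch suffices in
    # practice, so the Monte-Carlo estimate of L1 adds no extra cost: the CE
    # loss of the batch is simply evaluated with this one sampled Theta.
    theta_s = sample_theta()
    print(theta_s.shape, kl_term(mu, gamma))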
and callhm, respectively, when linearly combined with both the 4-gram LM and the LSTM RNNLM.

[Figure: possible positions (P1-P7) for GPact inside a LSTM cell, covering the input, forget, cell and output gates.]

Language Model              PPL     dev     eval
4-gram                      111.1   30.4    31.0
LSTM                        83.4    29.4    30.0
GP-LSTM                     81.2    29.3    29.8
4-gram + LSTM               76.8    29.3    29.8
4-gram + GP-LSTM            74.2    29.0    29.4
4-gram + LSTM + GP-LSTM     71.2    28.7    29.3
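The combined entries above ("4-gram + LSTM", etc.) refer to the linear combination of the component LMs mentioned in the text. As a minimal illustration of such interpolation (the weight of 0.5 and the toy distributions are assumptions, not values taken from the paper):

    import numpy as np

    def interpolate(p_ngram, p_rnnlm, lam=0.5):
        """Linearly combine two LMs' next-word distributions:
        P(w) = lam * P_ngram(w) + (1 - lam) * P_rnnlm(w)."""
        return lam * p_ngram + (1.0 - lam) * p_rnnlm

    # Toy next-word distributions over a 4-word vocabulary (placeholders).
    p_ngram = np.array([0.50, 0.20, 0.20, 0.10])
    p_rnnlm = np.array([0.30, 0.40, 0.20, 0.10])
    p_mix = interpolate(p_ngram, p_rnnlm)
    print(p_mix, p_mix.sum())   # still a valid probability distribution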
[10] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, and J. Černocký, "Empirical evaluation and combination of advanced language modeling techniques," in Twelfth Annual Conference of the International Speech Communication Association, 2011.

[11] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, "One billion word benchmark for measuring progress in statistical language modeling," arXiv preprint arXiv:1312.3005, 2013.

[12] M. Sundermeyer, H. Ney, and R. Schlüter, "From feedforward to recurrent LSTM neural networks for language modeling," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 517–529, 2015.

[13] M. W. Y. Lam, S. Hu, X. Xie, S. Liu, J. Yu, R. Su, X. Liu, and H. Meng, "Gaussian process neural networks for speech recognition," Proc. Interspeech 2018, pp. 1778–1782, 2018.

[14] Y. Kom Samo and S. Roberts, "Generalized spectral kernels," arXiv preprint arXiv:1506.02236, 2015.

[15] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.

[16] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.

[26] T. Hain, L. Burget, J. Dines, P. N. Garner, F. Grézl, A. El Hannani, M. Huijbregts, M. Karafiat, M. Lincoln, and V. Wan, "Transcribing meetings with the AMIDA systems," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 486–498, 2012.

[27] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, et al., "The AMI meeting corpus: A pre-announcement," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.

[28] F. Grezl and P. Fousek, "Optimizing bottle-neck features for LVCSR," in ICASSP, 2008, vol. 8, pp. 4729–4732.

[29] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.

[30] H. Wang, A. Ragni, M. J. F. Gales, K. M. Knill, P. C. Woodland, and C. Zhang, "Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages," 2015.

[31] M. J. F. Gales et al., "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.