
Handling Rare Entities for Neural Sequence Labeling

Yangming Li♠,♥ , Han Li♥ , Kaisheng Yao♠ and Xiaolong Li♠



♠ Ant Financial Services Group, Alibaba Group
♥ Harbin Institute of Technology
{pangmao.lym,kaisheng.yao,xl.li}@antfin.com

Abstract

One great challenge in neural sequence labeling is the data sparsity problem for rare entity words and phrases. Most test set entities appear only a few times or are even unseen in the training corpus, yielding a large number of out-of-vocabulary (OOV) and low-frequency (LF) entities during evaluation. In this work, we propose approaches to address this problem. For OOV entities, we introduce local context reconstruction to implicitly incorporate contextual information into their representations. For LF entities, we present delexicalized entity identification to explicitly extract their frequency-agnostic and entity-type-specific representations. Extensive experiments on multiple benchmark datasets show that our model significantly outperforms all previous methods and achieves new state-of-the-art results. Notably, our methods surpass the model fine-tuned on pre-trained language models without external resources.

Frequency     Number   Percentage
= 0 (OOV)     1611     65.1%
= 1 (Low)     191      7.7%
< 10 (Low)    635      25.7%
> 20 (High)   117      4.7%
≥ 0 (Total)   2475     100.0%

Table 1: Number of occurrences of test set entities in the training set. OOV entities are those that have no occurrence (Frequency = 0) in the training set. Low-frequency entities are those with fewer than ten occurrences (Frequency < 10). Percentages of entity occurrences are also shown. Data source is CoNLL-03.

1 Introduction

In the context of natural language processing (NLP), the goal of sequence labeling is to assign a categorical label to each entity word or phrase in a text sequence. It is a fundamental area that underlies a range of applications including slot filling and named entity recognition. Traditional methods use statistical models. Recent approaches have been based on neural networks (Collobert et al., 2011; Mesnil et al., 2014; Ma and Hovy, 2016; Strubell et al., 2017; Li et al., 2018; Devlin et al., 2018; Liu et al., 2019a; Luo et al., 2020; Xin et al., 2018) and have made great progress in various sequence labeling tasks.

However, a great challenge to neural-network-based approaches is the data sparsity problem (Augenstein et al., 2017). Specifically in the context of sequence labeling, the majority of entities in the test dataset occur in the training corpus only a few times or are absent altogether. In this paper, we refer to this phenomenon as the rare entity problem. It is different from other types of data sparsity problems, such as the lack of training data for low-resource languages (Lin et al., 2018), in that the rare entity problem is more related to a mismatch of entity distributions between training and test than to the size of the training data. We present an example of the problem in Table 1. It shows that less than 5% of test set entities are frequently observed in the training set, and about 65% of test set entities are absent from the training set.

The rare entities can be categorized into two types: out-of-vocabulary (OOV) for those test set entities that are not observed in the training set, and low frequency (LF) for those entities with low-frequency (e.g., fewer than 10) occurrences in the training set. Without proper processing, rare entities can incur the following risks when building a neural network. Firstly, OOV terms may act as noise for inference, as they lack lexical information from the training set (Bazzi, 2002). Secondly, it is hard to obtain high-quality representations for LF entities (Gong et al., 2018). Lastly, high occurrences of OOV and LF entities expose a distribution discrepancy between training and test, which mostly leads to poor performance during test.
In general, there are two existing strategies attempting to mitigate the above issues: external resource and transfer learning. The external resource approach, for example (Huang et al., 2015; Li et al., 2018), uses external knowledge such as part-of-speech tags for NER or additional information from intent detection for slot filling. However, external knowledge such as part-of-speech tags is not always available in practical applications, and open-source taggers such as (Manning et al., 2014) may perform poorly for cross-domain annotations. Character or n-gram features are mainly designed to deal with morphologically similar OOV words. The transfer learning approach, such as using ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), fine-tunes pre-trained models on the downstream task (Liu et al., 2019a). Nevertheless, it does not directly address problems such as the entity distribution discrepancy between training and test. Moreover, our proposed methods surpass these methods without resorting to external resources or large pre-trained language models.

This paper proposes novel techniques that enable sequence labeling models to achieve state-of-the-art performance without using external resources or transfer learning. These are

• local context reconstruction (LCR), which is applied on OOV entities, and

• delexicalized entity identification (DEI), which is applied on LF entities.

Local context reconstruction enables OOV entities to be related to their contexts. One key point is applying a variational autoencoder to model this reconstruction process, which is typically a one-to-many generation process. Delexicalized entity identification aims at extracting frequency-agnostic and entity-type-specific representations, therefore reducing the reliance on high-frequency occurrences of entities¹. It uses a novel adversarial training technique to achieve this goal. Both methods use an effective random entity masking strategy.

We evaluate the methods on sequence labeling tasks on several benchmark datasets. Extensive experiments show that the proposed methods significantly outperform previous models by a large margin. Detailed analysis indicates that the proposed methods indeed alleviate the rare entity problem. Notably, without using any external knowledge or pre-trained models, the proposed methods surpass the model that uses fine-tuned BERT.

¹ This paper refers to slots in slot filling tasks as entities for brevity, although their definitions are not equivalent.
2 Background

Given an input sequence X = [x_1, x_2, ..., x_N] with N tokens, the sequence labeling task aims at learning a functional mapping to obtain a target label sequence Y = [y_1, y_2, ..., y_N] of equal length. In the following, we briefly introduce a typical method for sequence labeling and review related techniques we use in deriving our model.

2.1 Bidirectional RNN + CRF

The recurrent neural network (RNN) (Hochreiter and Schmidhuber, 1997) has been widely used for sequence labeling. The majority of high-performance models use a bidirectional RNN (Schuster and Paliwal, 1997) to encode the input sequence X and a conditional random field (CRF) (Lafferty et al., 2001) as a decoder to output Y.

The bidirectional RNN first embeds the observation x_i at each position i into a continuous space x_i. It then applies forward and backward operations over the whole sequence time-recursively as

  →h_i = →f(x_i, →h_{i−1}),   ←h_i = ←f(x_i, ←h_{i+1}).   (1)

The CRF computes the probability of a label sequence Y given X as

  log p(Y|X) ∝ Σ_i (g_i[y_i] + G[y_i, y_{i+1}]),   g_i = W ∗ (→h_i ⊕ ←h_i),   (2)

where ⊕ denotes the concatenation operation, and G and W are learnable matrices. The sequence with the maximum score is the output of the model, typically obtained using the Viterbi algorithm.

We use a bidirectional RNN + CRF model, in particular Bi-LSTM+CRF (Huang et al., 2015), as the baseline model in our framework; it is shown in the bottom part of Figure 1.
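To make the Bi-LSTM+CRF scoring concrete, the following PyTorch sketch computes the emission scores g_i and the unnormalized score of a label sequence from Eqs. (1) and (2). It is illustrative only; the class and parameter names are ours, and a complete CRF layer would additionally need the forward-algorithm partition function for normalization and Viterbi decoding for inference.

```python
import torch
import torch.nn as nn

class BiLSTMEmissions(nn.Module):
    """Minimal sketch of Eqs. (1)-(2): a Bi-LSTM encoder plus CRF scores."""

    def __init__(self, vocab_size, num_labels, emb_dim=300, hidden_dim=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)              # theta_emb
        self.rnn = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                           batch_first=True)                        # theta_rnn
        self.W = nn.Linear(2 * hidden_dim, num_labels)              # W in Eq. (2)
        self.G = nn.Parameter(torch.zeros(num_labels, num_labels))  # transitions G

    def score(self, token_ids, label_ids):
        """Unnormalized log-score of one label sequence per batch element."""
        h, _ = self.rnn(self.embed(token_ids))      # h_i = [fwd ; bwd] states
        g = self.W(h)                               # emission scores g_i
        emit = g.gather(-1, label_ids.unsqueeze(-1)).squeeze(-1).sum(-1)
        trans = self.G[label_ids[:, :-1], label_ids[:, 1:]].sum(-1)
        return emit + trans, h                      # h is reused by LCR and DEI
```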
2.2 Variational Autoencoder

The above model, together with other encoder-decoder models (Sutskever et al., 2014; Bahdanau et al., 2014), learns deterministic and discriminative functional mappings. The variational auto-encoder (VAE) (Kingma and Welling, 2015; Rezende et al., 2014; Bowman et al., 2015), on the other hand, is stochastic and generative.

[Figure 1: framework diagram; image not reproduced. The input "[SOS] list flights to indianapolis with fares on monday morning , please . [EOS]" is labeled by the Bi-LSTM + CRF as "[SOS] O O O B-FROMLOC O O O B-DATE I-DATE O O O [EOS]", with local context reconstruction and delexicalized entity identification modules attached on top.]

Figure 1: Overall framework to use local context reconstruction and delexicalized entity identification for neural sequence labeling. "[SOS]" and "[EOS]" are used for marking sequence beginning and sequence ending, respectively. The local context reconstruction is applied between any two successive entities, including the special entities. The delexicalized entity identification is applied for all entities except for the special entities.

Using VAE, we may assume a sequence x = [x_1, x_2, ..., x_N] is generated stochastically from a latent global variable z with a joint probability of

  p(x, z) = p(x|z) ∗ p(z),   (3)

where p(z) is the prior probability of z, generally a simple Gaussian distribution, to keep the model from generating x deterministically. p(x|z) represents a generation density, usually modeled with a conditional language model whose initial state is z.

Maximum likelihood training of a model for Eq. (3) involves a computationally intractable integration over z. To circumvent this, VAE uses variational inference with a variational distribution of z coming from a Gaussian density q(z|x) = N(µ, diag(σ²)), with vector mean µ and diagonal matrix variance diag(σ²) parameterized by neural networks. VAE also uses the reparameterization trick to obtain the latent variable z as follows:

  z = µ + σ ⊙ ε,   (4)

where ε is sampled from a standard Gaussian distribution and ⊙ denotes the element-wise product.

The evidence lower bound (ELBO) of the likelihood p(x) is obtained using Jensen's inequality E_{q(z|x)} log p(x, z) ≤ log p(x) as follows:

  L_vae(x) = −KL(q(z|x) || p(z)) − CE(q(z|x) | p(x|z)),   (5)

where KL(q||p) and CE(q|p) respectively denote the Kullback-Leibler divergence and the cross-entropy between distributions q and p. The ELBO can be optimized by alternating between optimizations of the parameters of q(z|x) and p(x|z).

We apply VAE for local context reconstruction from slot/entity tags in Figure 1. This is a generation process that is inherently one-to-many. We observe that VAE is superior to the deterministic model (Bahdanau et al., 2014) in learning representations of rare entities.
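A minimal sketch of Eqs. (4) and (5), assuming a standard Gaussian prior: the latent variable is sampled with the reparameterization trick, and the negative ELBO combines the closed-form Gaussian KL term with a reconstruction cross-entropy. Function names are illustrative; the paper's own closed-form KL parameterization appears later in Eq. (12).

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_sigma):
    """Eq. (4): z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(log_sigma) * eps

def negative_elbo(mu, log_sigma, decoder_logits, targets):
    """Negative of Eq. (5): KL(q(z|x) || N(0, I)) plus the reconstruction
    cross-entropy of the decoded sequence under p(x|z)."""
    sigma_sq = torch.exp(2.0 * log_sigma)
    kl = 0.5 * torch.sum(mu.pow(2) + sigma_sq - 1.0 - torch.log(sigma_sq))
    rec = F.cross_entropy(decoder_logits.view(-1, decoder_logits.size(-1)),
                          targets.view(-1), reduction="sum")
    return kl + rec
```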
2.3 Adversarial Training

Adversarial training (Goodfellow et al., 2014), originally proposed to improve robustness to noise in images, was later extended to NLP tasks such as text classification (Miyato et al., 2015, 2016) and learning word representations (Gong et al., 2018).

We apply adversarial training to learn better representations of low-frequency entities via delexicalized entity identification in Figure 1. It has a discriminator to differentiate the representations of the original low-frequency entities from the representations of the delexicalized entities. Training aims at obtaining representations that can fool the discriminator, therefore achieving frequency-agnosticism and entity-type-specificity.

3 The Model

We illustrate the overall framework of the proposed model in Figure 1. Its baseline sequence labeling module is described in Section 2.1. We describe the details of local context reconstruction in Sec. 3.1 and delexicalized entity identification in Sec. 3.2, together with an example to illustrate them in Figure 2. We denote the parameters in Sec. 2.1 as θ_rnn and θ_emb, respectively, for its RNN and its embedding matrix.

[Figure 2: diagram; image not reproduced. Panel (a): VAE for local context reconstruction. Panel (b): adversarial training (AT) for delexicalized entity identification.]

Figure 2: An example to illustrate local context reconstruction and delexicalized entity identification.

The parameters in Sec. 3.1 and Sec. 3.2 are denoted as θ_lcr and θ_D, respectively.

3.1 Local Context Reconstruction

Contrary to conventional methods that explicitly provide abundant lexical features from external knowledge, we implicitly enrich word representations with contextual information by training them to reconstruct their local contexts.

Masking  Every entity word x_i in the sequence X, defined as a word not associated with the non-entity label "O", is first randomly masked with the OOV symbol "[UNK]" as follows:

  x^u_i = "[UNK]"   if y_i ≠ "O" and ε > p
  x^u_i = x_i       otherwise,   (6)

where the constant p is a threshold and ε is uniformly sampled between 0 and 1.
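A sketch of the random masking in Eq. (6); substituting the label instead of "[UNK]" gives the delexicalization of Eq. (14) described later. The default threshold value below is a placeholder, not a value reported in the paper.

```python
import random

def mask_entities(tokens, labels, p=0.5, unk="[UNK]"):
    """Eq. (6): randomly replace entity words (label != "O") with the OOV
    symbol.  Replacing with labels[i] instead of `unk` yields the
    delexicalization of Eq. (14)."""
    masked = []
    for token, label in zip(tokens, labels):
        eps = random.random()                 # epsilon, uniform in [0, 1]
        if label != "O" and eps > p:
            masked.append(unk)
        else:
            masked.append(token)
    return masked
```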
Forward Reconstruction  In the forward reconstruction process, the forward pass of Eq. (1) is first applied on the sequence X^u = [x^u_1, x^u_2, ..., x^u_N] to obtain hidden states →h^u_i. Then, a forward span representation m^f_{jk} of the local context between positions j and k is obtained using the RNN-minus feature (Wang and Chang, 2016) as follows:

  m^f_{jk} = →h^u_k − →h^u_j.   (7)

To apply VAE to reconstruct the local context, the mean µ and log-variance log σ are first computed from the above representation as follows:

  µ^f_{jk} = W^µ_1 tanh(W^µ_0 m^f_{jk}),
  log σ^f_{jk} = W^σ_1 tanh(W^σ_0 m^f_{jk}),   (8)

where the W^*_* are all learnable matrices. Then, the reparameterization trick in Eq. (4) is applied on µ^f_{jk} and σ^f_{jk} = exp(log σ^f_{jk}) to obtain a global latent variable z^f_{jk} for the local context.

To generate the i-th word in the local context sequence [x_{j+1}, x_{j+2}, ..., x_{k−1}], we first apply an RNN-decoder, with its initial hidden state given by the latent variable z^f_{jk} and its first observation given by the embedding of the "[SOS]" symbol, to recursively obtain the hidden state →r^f_i as follows:

  →r^f_i = →f(x_i, →r^f_{i−1}).   (9)

This RNN-decoder shares its parameters with the forward-pass RNN-encoder in Eq. (1). We then use a softmax to compute the distribution of the word at position i as

  →P^vae_i = Softmax(W^f_g ∗ →r^f_i),   (10)

where W^f_g is a learnable matrix.

Lastly, we compute the KL distance and cross-entropy for a length-L local context sequence in Eq. (5) as follows:

  KL^f_{jk} = Σ_d ζ(µ^f_{jk}[d], σ^f_{jk}[d]),
  CE^f_{jk} = −(1/L) Σ_i log(→P^vae_i[x_i]),
  →L^vae_{jk} = −KL^f_{jk} − CE^f_{jk},   (11)

where d denotes the hidden dimension index and the closed-form KL divergence ζ is defined as

  ζ(µ, σ) = µ² + σ − (1 + log σ).   (12)

Backward Reconstruction  As with the forward reconstruction, the backward reconstruction is applied on non-adjacent successive entities. The backward pass of Eq. (1) is first applied on the entity-masked sequence X^u. Once the backward span representation m^b_{kj} of the local context between positions k and j is obtained as m^b_{kj} = ←h^u_j − ←h^u_k, the same procedures described for the forward reconstruction are conducted, except using the backward RNN-encoder ←f(·) in lieu of the forward RNN-encoder in Eq. (9).



the backward RNN-encoder f (·) in lieu of the Algorithm 1: Training Algorithm
forward RNN-encoder in Eq. (9). Input: Dataset S, θrnn , θemb , θlcr , θD .
The objective for local context reconstruction is 1 repeat
X→ − vae ← − 2 Sample a minibatch with pairs (X, Y ).
J vae (X; θlcr , θrnn ) = max L jk + L vae
jk ,
θlcr ,θrnn 3 Update θD by gradient descent
jk
(13) according to Eq. (17).
which is to maximize the ELBO w.r.t. parameters 4 Update θlcr and θrnn by gradient ascent
θlcr and θrnn . to joint maximization of J vae + J at
according to Eqs. (13) and (17).
3.2 Delexicalized Entity Identification 5 Update θrnn and θemb by gradient
For low-frequency entities, the delexicalized entity ascent according to Eq. (2).
identification aims at obtaining frequency-agnostic 6 until Convergence;
and entity-type-specific representations. Output: θrnn , θemb , θlcr , θD .

Delexicalization We first randomly substitute en-


tity words in input sequence X with their corre- Following the principle of adversarial training,
sponding labels as we develop the following minimax objective to
 train RNN model θrnn and the discriminator θD :
d yi if yi 6= “O” ∩  > p
xi = , (14)
xi otherwise J at (X, Y ; θD , θrnn ) = (17)
d
P
where p is a threshold and  is uniformly sampled minθD maxθrnn jk log(pjk ) + log(1 − pjk ) ,
from [0, 1]. We refer this to delexicalization (Wen
et al., 2015), but insert randomness in it. which aims at fooling a strong discriminator
θD with parameter θrnn optimized, leading to
Representation for Identification To obtain frequency-agnostics.
representation to identify whether an entity has
been delexicalized to its label, we first use forward 4 Training Algorithm
and backward RNN-encoders in Eq. (1) on the
sentence X d = [xd1 , xd2 , · · · , xdN ] and obtain hid- Notice that the model has three modules with their
←− →
− own objectives. We update their parameters jointly
den states h di and h di for each position i. Their

− →
− using Algorithm 1. The algorithm first improves
concatenation is hdi = h di ⊕ h di . For position i in discriminator θD to identify delexicalized items. It
the original sequence without delexicalization, its then updates θlcr and θrnn with joint optimization
←− →

concatenated hidden state hi = h i ⊕ h i . J vae and J at to improve θrnn to fool the discrimi-
For an entity with a span from position j to k, its nator. As VAE optimization of J vae has posterior
representation edjk is obtained from the following collapse problem, we adopt KL cost annealing strat-
average pooling egy and word dropout techniques (Bowman et al.,
1 X 2015). Finally, the algorithm updates both of θrnn
edjk = hdi . (15) and θemb in Bi-LSTM+CRF by gradient ascent ac-
k−j+1
i cording to Eq. (2). Note that θlcr shares the same
Average pooling is also applied on hi s to obtain parameters with θrnn and θemb .
ejk for the original entity with that span. During experiments, we also find it is beneficial
to have a few epochs of pretraining of parameters
Discriminator A multi-layer perceptron (MLP) θrnn and θemb with optimization of Eq. (2).
based discriminator with parameter θD is employed
to output a confidence score in [0, 1], indicating the 5 Experiments
probability of the delexicalization of an entity; i.e.,
This section compares the proposed model against
( d
pjk = σ(vdT tanh(Wd ∗ edjk )) state-of-the-art models on benchmark datasets.
, (16)
pjk = σ(vdT tanh(Wd ∗ ejk )) 5.1 Settings
where paramters vd and Wd are learnable and σ(x) Slot Filling We use available ATIS dataset (Tur
1
is Sigmoid function 1+exp(−x) . et al., 2010) and SNIPS dataset (Coucke et al.,

Models                                                   ATIS    SNIPS   CoNLL-03
Lample et al. (2016)  Bi-LSTM + CRF w/ char              95.17   93.71   90.94
Liu et al. (2018)     LM-LSTM-CRF                        95.33   94.07   91.24
Liu et al. (2019a)    GCDT                               95.98   95.03   91.96
Qin et al. (2019)     Stack-propagation†                 95.9    94.2    -
Liu et al. (2019b)    CM-Net†                            95.82   97.15   -
This Work             Bi-LSTM + CRF                      95.02   93.37   90.11
                      w/ external resources‡             95.67   94.76   91.04
                      w/ BERT fine-tuned embedding∗      95.94   96.15   92.53
                      w/ Proposed Methods                96.01   97.20   92.67

Table 2: Sequence labeling test results of baselines and the proposed model on benchmark datasets. ∗ refers to fine-tuning on pre-trained large models. † refers to using multi-task learning. ‡ refers to adopting external resources. The improvements over all prior methods are statistically significant with p < 0.01 under a t-test.

NER  We use the public CoNLL-03 dataset (Sang and Meulder, 2003) as in (Huang et al., 2015; Lample et al., 2016; Liu et al., 2019a). The dataset is tagged with four named entity types: PER, LOC, ORG, and MISC.

Baselines  We compare the proposed model with five types of methods: 1) a strong baseline (Lample et al., 2016) that uses character embeddings to improve the sequence tagger; 2) recent state-of-the-art models for slot filling (Qin et al., 2019; Liu et al., 2019b) that utilize multi-task learning to incorporate additional information from intent detection; 3) recent state-of-the-art models for NER, including Liu et al. (2018) and Liu et al. (2019a); 4) a Bi-LSTM + CRF model augmented with external resources (i.e., POS tagging using the Stanford Parser²); and 5) a Bi-LSTM + CRF model with word embeddings from fine-tuned BERT_LARGE (Devlin et al., 2018). Results are reported in F1 scores.

We follow most of the baseline performances reported in (Lample et al., 2016; Liu et al., 2019b; Qin et al., 2019; Liu et al., 2019a) and rerun the open-source toolkits NCRFpp³, LM-LSTM-CRF⁴, and GCDT⁵ on the slot filling tasks⁶.

Implementation Details  We use the same configuration setting for all datasets. The hidden dimensions are set to 500. We apply dropout to hidden states with a rate of 0.3. L2 regularization is set to 1 × 10⁻⁶ to avoid overfitting. Following (Liu et al., 2018, 2019a,b), we adopt the cased, 300d GloVe embeddings (Pennington et al., 2014) to initialize word embeddings. We utilize the Adam algorithm (Kingma and Ba, 2015) to optimize the models and adopt the suggested hyper-parameters.

² https://nlp.stanford.edu/software/lex-parser.shtml.
³ https://github.com/jiesutd/NCRFpp.
⁴ https://github.com/LiyuanLucasLiu/LM-LSTM-CRF.
⁵ https://github.com/Adaxry/GCDT.
⁶ A few results are not available for comparison, as Qin et al. (2019); Liu et al. (2019b) perform multi-task learning of intent detection and slot filling.
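For reference, the reported settings can be gathered into a single configuration; the dictionary below is a sketch (the key names are ours), with values taken from the paragraph above.

```python
# Hyper-parameters reported in Section 5.1; remaining Adam settings follow
# the defaults suggested by Kingma and Ba (2015).
CONFIG = {
    "hidden_dim": 500,        # Bi-LSTM hidden dimension
    "dropout": 0.3,           # applied to hidden states
    "l2": 1e-6,               # L2 regularization
    "word_embeddings": "GloVe, cased, 300d",
    "optimizer": "Adam",
}
```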
5.2 Main Results

The main results of the proposed model on ATIS, SNIPS, and CoNLL-03 are illustrated in Table 2. The proposed model outperforms all other models on all tasks by a substantial margin. On the slot filling tasks, the model obtains averaged improvements of 0.15 points on ATIS and 1.53 points on SNIPS over CM-Net and Stack-propagation, without using the extra information from jointly modeling slots and intents in these models. In comparison to the prior state-of-the-art model GCDT, the improvements are 0.03 points on ATIS, 2.17 points on SNIPS and 0.71 points on CoNLL-03.

Compared with the strong baseline (Lample et al., 2016) that utilizes character embeddings to improve Bi-LSTM + CRF, the gains are even larger. The model obtains improvements of 0.84 points on ATIS, 3.49 points on SNIPS and 1.73 points on CoNLL-03 over Bi-LSTM + CRF and LM-LSTM-CRF.

Finally, we have tried improving the baseline Bi-LSTM+CRF in our model with external resources of lexical information, including part-of-speech tags, chunk tags and character embeddings. However, their F1 scores are consistently below the proposed model by an average of 1.47 points. We also replace the word embeddings in Bi-LSTM+CRF with those from fine-tuned BERT_LARGE, but its results are worse than the proposed model, by 0.07 points, 1.05 points and 0.14 points, respectively, on ATIS, SNIPS, and CoNLL-03.

Method                             SNIPS
Bi-LSTM + CRF + LCR + DEI          97.20
w/o LCR                            94.37
w/o VAE, w/ LSTM-LM                96.02
w/o OOV masking                    95.63
w/o DEI                            95.82
w/o LCR, DEI (Bi-LSTM + CRF)       93.37

Table 3: Ablation experiments for local context reconstruction (LCR) and delexicalized entity identification (DEI). LCR includes VAE and OOV masking.

Method                             CoNLL-03 OOV   CoNLL-03 LF
LM-LSTM-CRF                        2049           1136
GCDT                               2073           1149
Bi-LSTM + CRF                      2041           1135
w/ external resource‡              2052           1143
w/ BERT fine-tuned embedding∗      2084           1153
Bi-LSTM + CRF + LCR                2112           1139
Bi-LSTM + CRF + DEI                2043           1169
Bi-LSTM + CRF + LCR + DEI          2124           1181
Total                              2509           1363

Table 4: Numbers of OOV and LF entities that are correctly labeled. ∗ refers to fine-tuning on pre-trained large models. ‡ refers to adopting external resources.
6 Analysis

It is noteworthy that the substantial improvements by the model are obtained without using external resources or large pre-trained models. The keys to its success are local context reconstruction and delexicalized entity identification. This section reports our analysis of these modules.

6.1 Ablation Study

Local Context Reconstruction (LCR)  We first examine the impact brought by the LCR process. In Table 3, we show that removing LCR (w/o LCR) hurts performance significantly on SNIPS. We then study whether constructing the local context in LCR with a traditional deterministic encoder-decoder can be as effective as using VAE. We make a good-faith attempt of using an LSTM-based language model (Sundermeyer et al., 2012) to generate the local context directly from the local context representation (w/o VAE, w/ LSTM-LM). This does improve results over not using LCR at all, indicating that the information from reconstructing local context is indeed useful. However, its F1 score is still far worse than that of using VAE. This confirms that VAE is superior to a deterministic model in dealing with the inherently one-to-many generation of local context from entities. Lastly, we examine the impact of OOV masking and observe that the F1 score without it (w/o OOV masking) drops about 1.6 points below the full model. We attribute this improvement from OOV masking to mitigating the entity distribution discrepancy between training and test.

Delexicalized Entity Identification (DEI)  Removing delexicalized entity identification (w/o DEI) performs worse than the full model, with a large drop of 1.38 points on SNIPS.

These results show that both local context reconstruction and delexicalized entity identification contribute greatly to the improved performance of the proposed model. Because both LCR and DEI share the same RNN-encoder as the baseline Bi-LSTM, the information from reconstructing local context and from fooling the discriminator of delexicalization is useful for the Bi-LSTM to better predict sequence labels.

6.2 Rare Entity Handling

In this section, we compare models specifically by the numbers of OOV and LF entities they can recall correctly. Such a comparison reveals the capability of each model in handling rare entities.

Results are presented in Table 4. Without using any external resources or pre-trained models, the proposed model recalls 3.66% more OOV entities and 3.96% more LF entities than LM-LSTM-CRF. This gain is similar when comparing against Bi-LSTM+CRF. Furthermore, the proposed model also recalls more rare entities than GCDT, a recent state-of-the-art model in NER. Separately using LCR or DEI improves performance over the baseline Bi-LSTM+CRF. Their gains are complementary, as the results show that jointly applying LCR and DEI obtains the best performance. These results demonstrate convincingly the capability of local context reconstruction and delexicalized entity identification in handling rare entities.

Importantly, the results in the last two rows reveal that large improvements can potentially be achieved, since there are still 15.34% of OOV entities and 13.35% of LF entities not recalled.

[Figure 3: t-SNE scatter plot; image not reproduced.]

Figure 3: Visualization of learned representations on the CoNLL-03 test dataset. Entity types are represented in different shapes with red for PER, blue for ORG, green for LOC and orange for MISC. Rare entities are represented using bigger points. The points with "X" are for the delexicalized entities.

6.3 Representation for Delexicalized Entity Identification

We visualize the learned representations of Eq. (15) using t-SNE (Maaten and Hinton, 2008) in Figure 3. It shows 2-dimensional projections of 800 randomly sampled entities from the CoNLL-03 dataset.

Figure 3 clearly shows separability of entities by their entity types but no separation between low-frequency and frequent entities. This observation is consistent with the minimax objective in Eq. (17) to learn entity-type-specific and frequency-agnostic representations.
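The visualization in Figure 3 can be reproduced along these lines, assuming the pooled span representations of Eq. (15) have been collected into an array with parallel entity-type annotations; this scikit-learn/matplotlib sketch is ours and distinguishes types by color rather than by shape.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_entity_space(spans, entity_types):
    """spans: (num_entities, dim) array of Eq. (15) representations;
    entity_types: list of strings such as "PER", "ORG", "LOC", "MISC"."""
    points = TSNE(n_components=2, random_state=0).fit_transform(spans)
    for etype in sorted(set(entity_types)):
        idx = np.array([t == etype for t in entity_types])
        plt.scatter(points[idx, 0], points[idx, 1], label=etype, s=10)
    plt.legend()
    plt.show()
```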
6.4 Handling Data Scarcity

This section investigates the proposed model under data scarcity. On ATIS, the percentage of training samples is reduced, in steps of 20%, down to 20% of the original size. This setting is challenging and few previous works have experimented with it. Results in Figure 4 show that the proposed model consistently outperforms other models, especially in low-resource conditions. Furthermore, the reductions in performance of the proposed model are much smaller in comparison to other models. For instance, at the 40% percentage, the proposed model only loses 1.17% of its best F1 score whereas GCDT loses 3.62% of its F1 score. This suggests that the proposed model is more robust to low-resource conditions than other models.

[Figure 4: line plot; image not reproduced.]

Figure 4: Comparisons with respect to different percentages of training data on ATIS.

7 Related Work

Neural sequence labeling has been an active field in NLP, and we briefly review recently proposed approaches related to our work.

Slot Filling and NER  Neural sequence labeling has been applied to slot filling (Mesnil et al., 2014; Zhang and Wang, 2016; Liu and Lane, 2016; Qin et al., 2019) and NER (Huang et al., 2015; Strubell et al., 2017; Liu et al., 2018; Devlin et al., 2018; Liu et al., 2019a). For slot filling, multi-task learning for joint slot filling and intent detection has been dominant in the recent literature, for example (Liu and Lane, 2016). The recent work in (Liu et al., 2019b) employs a collaborative memory network to further model the semantic correlations among words, slots and intents jointly. For NER, recent works use explicit architectures to incorporate information such as global context (Liu et al., 2019a) or conduct optimal architecture searches (Jiang et al., 2019). The best performing models have been using pre-training on large corpora (Baevski et al., 2019) or fine-tuning of existing pre-trained models (Liu et al., 2019a) such as BERT (Devlin et al., 2018).

External Resource  This approach to handling rare entities includes feature engineering methods such as incorporating extra knowledge from part-of-speech tags (Huang et al., 2015) or character embeddings (Li et al., 2018). Extra knowledge also includes tags from public taggers (Manning et al., 2014). Multi-task learning has been effective in incorporating additional label information through multiple objectives. Joint slot filling and intent detection have been used in (Zhang and Wang, 2016; Qin et al., 2019; Zhang et al., 2019).

Joint part-of-speech tagging and NER have been used in (Lin et al., 2018).

Transfer Learning  This approach refers to methods that transfer knowledge from high-resource to low-resource settings (Zhou et al., 2019) or use models pre-trained on large corpora to benefit downstream tasks (Devlin et al., 2018; Liu et al., 2019a). The most recent work in (Zhou et al., 2019) applies adversarial training that uses a resource-adversarial discriminator to improve performance on low-resource data.

8 Conclusion

We have presented local context reconstruction for OOV entities and delexicalized entity identification for low-frequency entities to address the rare entity problem. We adopt a variational autoencoder to learn a stochastic reconstructor for the reconstruction, and adversarial training to extract frequency-agnostic and entity-type-specific features. Extensive experiments have been conducted on both slot filling and NER tasks on three benchmark datasets, showing that sequence labeling using the proposed methods achieves new state-of-the-art performances. Importantly, without using external knowledge or fine-tuning of large pre-trained models, our methods enable a sequence labeling model to outperform models fine-tuned on BERT. Our analysis also indicates large potential for further performance improvements by exploiting OOV and LF entities.

Acknowledgments

This work was done while the first author did an internship at Ant Financial. We thank the anonymous reviewers for valuable suggestions.

References

Isabelle Augenstein, Leon Derczynski, and Kalina Bontcheva. 2017. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language, 44:61–83.

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Issam Bazzi. 2002. Modelling out-of-vocabulary words for robust speech recognition. Ph.D. thesis, Massachusetts Institute of Technology.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Pages 2493–2537.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, pages 4171–4186.

Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. FRAGE: Frequency-agnostic word representation. In Advances in Neural Information Processing Systems, pages 1334–1345.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 753–757.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Yufan Jiang, Chi Hu, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. 2019. Improved differentiable architecture search for language modeling and named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3576–3581.

Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Diederik P Kingma and Max Welling. 2015. Auto-encoding variational bayes. ICLR.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.

Changliang Li, Liang Li, and Ji Qi. 2018. A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3824–3833.

Ying Lin, Shengqi Yang, Veselin Stoyanov, and Heng Ji. 2018. A multi-lingual multi-task architecture for low-resource sequence labeling. ACL, pages 799–809.

Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.

Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Fangzheng Xu, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Thirty-Second AAAI Conference on Artificial Intelligence.

Yijin Liu, Fandong Meng, Jinchao Zhang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2019a. GCDT: A global context enhanced deep transition architecture for sequence labeling. ACL, pages 2431–2441.

Yijin Liu, Fandong Meng, Jinchao Zhang, Jie Zhou, Yufeng Chen, and Jinan Xu. 2019b. CM-Net: A novel collaborative memory network for spoken language understanding. arXiv preprint arXiv:1909.06937.

Ying Luo, Fengshun Xiao, and Hai Zhao. 2020. Hierarchical contextualized representation for named entity recognition.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. ACL, pages 1064–1074.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2014. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2015. Distributional smoothing with virtual adversarial training.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Libo Qin, Wanxiang Che, Yangming Li, Haoyang Wen, and Ting Liu. 2019. A stack-propagation framework with token-level intent detection for spoken language understanding. arXiv preprint arXiv:1909.02188.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. ICML.

Erik F Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673–2681.

Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate sequence labeling with iterated dilated convolutions. EMNLP, 138:2670–2680.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks.

Gokhan Tur, Dilek Hakkani-Tür, and Larry Heck. 2010. What is left to be understood in ATIS? IEEE Spoken Language Technology Workshop (SLT), pages 19–24.

Wenhui Wang and Baobao Chang. 2016. Graph-based dependency parsing with bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2306–2315.

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. arXiv preprint arXiv:1508.01755.

Yingwei Xin, Ethan Hart, Vibhuti Mahajan, and Jean-David Ruvini. 2018. Learning better internal structure of words for sequence labeling.

Chenwei Zhang, Yaliang Li, Nan Du, Wei Fan, and Philip S. Yu. 2019. Joint slot filling and intent detection via capsule neural networks. ACL, pages 5259–5267.

Xiaodong Zhang and Houfeng Wang. 2016. A joint model of intent determination and slot filling for spoken language understanding. In IJCAI, volume 16, pages 2993–2999.

Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu, Meng Fang, Rick Siow Mong Goh, and Kenneth Kwok. 2019. Dual adversarial neural transfer for low-resource named entity recognition. ACL, pages 3461–3471.
