Joint Extraction of Entities and Relations Based On A Novel Tagging Scheme
arXiv:1706.05075v1 [cs.CL] 7 Jun 2017

Abstract
Joint extraction of entities and relations is an important task in information extraction [...]

[Figure: example sentence "The [United States]E-loc President [Trump]E-per will visit the [Apple Inc]E-Org ." with its extracted results; relation labels shown: Country-President, None, None.]
Final Results: {United States, Country-President, Trump} {Apple Inc, Company-Founder, Steven Paul Jobs}
Figure 2: Gold standard annotation for an example sentence based on our tagging scheme, where “CP”
is short for “Country-President” and “CF” is short for “Company-Founder”.
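Read together with the "Final Results" line, the Figure 2 annotation suggests how a tag sequence maps to triplets. The following is a minimal sketch of that mapping, not the authors' implementation. It assumes each non-O tag has the form position-relation-role (position one of B, I, E, S; role 1 or 2), as in the tag names B-CP-1, E-CP-1 and S-CP-2, and that a role-1 entity is paired with the nearest unused role-2 entity of the same relation type. The function names `extract_entities` and `pair_entities` and the nearest-match pairing rule are assumptions made for illustration.

```python
# Abbreviations taken from the Figure 2 caption.
RELATION_NAMES = {"CP": "Country-President", "CF": "Company-Founder"}


def extract_entities(tokens, tags):
    """Collect (text, relation, role, start) spans from position-relation-role tags.

    Assumes relation abbreviations contain no hyphen, so each tag splits into 3 parts.
    """
    entities = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        position, relation, role = tag.split("-")
        if position in ("B", "S"):
            start = i  # a new entity begins here
        if position in ("E", "S") and start is not None:
            text = " ".join(tokens[start:i + 1])
            entities.append((text, relation, int(role), start))
            start = None
    return entities


def pair_entities(entities):
    """Pair each role-1 entity with the nearest unused role-2 entity of the same relation."""
    triplets = []
    used = set()
    for text1, rel1, role1, start1 in entities:
        if role1 != 1:
            continue
        candidates = [
            (abs(start2 - start1), idx)
            for idx, (_, rel2, role2, start2) in enumerate(entities)
            if role2 == 2 and rel2 == rel1 and idx not in used
        ]
        if candidates:
            _, idx = min(candidates)
            used.add(idx)
            triplets.append((text1, RELATION_NAMES.get(rel1, rel1), entities[idx][0]))
    return triplets


tokens = ["The", "United", "States", "President", "Trump"]
tags = ["O", "B-CP-1", "E-CP-1", "O", "S-CP-2"]
triplets = pair_entities(extract_entities(tokens, tags))
print(triplets)
```

On this five-word fragment the sketch yields the triplet {United States, Country-President, Trump}, matching the first entry of the final results. A role-1 entity with no role-2 partner of the same relation type produces no triplet.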
Figure 3: An illustration of our model. (a): The architecture of the end-to-end model, (b): The LSTM
memory block in Bi-LSTM encoding layer, (c): The LSTM memory block in LSTMd decoding layer.
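Panel (c) of Figure 3 corresponds to Equations (7)-(13) of this section, and the softmax layer to Equations (14)-(15). The following NumPy code is a minimal sketch of one LSTMd decoding step under stated assumptions, not the authors' code. It treats every W as a dense matrix, uses small random weights, and picks toy dimensions. The dictionary keys, the helper `init_params` and the dimension names `d_enc`, `d_dec`, `d_tag` are hypothetical names introduced here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d_enc, d_dec, d_tag, rng):
    """Random parameters for one LSTMd block (illustrative initialization only)."""
    p = {}
    for gate in ("i", "f", "c", "o"):
        p[f"W_w{gate}"] = rng.normal(0, 0.1, (d_dec, d_enc))  # acts on encoder output h_t
        p[f"W_h{gate}"] = rng.normal(0, 0.1, (d_dec, d_dec))  # acts on previous decoder state
        p[f"b_{gate}"] = np.zeros(d_dec)
    for gate in ("i", "f", "c"):
        p[f"W_t{gate}"] = rng.normal(0, 0.1, (d_dec, d_tag))  # acts on previous tag vector T_{t-1}
    p["W_co"] = rng.normal(0, 0.1, (d_dec, d_dec))            # acts on current cell c_t
    p["W_ts"] = rng.normal(0, 0.1, (d_tag, d_dec))
    p["b_ts"] = np.zeros(d_tag)
    return p


def lstmd_step(h_enc, h_prev, c_prev, T_prev, p):
    """One decoding step: Equations (7)-(13)."""
    i = sigmoid(p["W_wi"] @ h_enc + p["W_hi"] @ h_prev + p["W_ti"] @ T_prev + p["b_i"])
    f = sigmoid(p["W_wf"] @ h_enc + p["W_hf"] @ h_prev + p["W_tf"] @ T_prev + p["b_f"])
    z = np.tanh(p["W_wc"] @ h_enc + p["W_hc"] @ h_prev + p["W_tc"] @ T_prev + p["b_c"])
    c = f * c_prev + i * z
    o = sigmoid(p["W_wo"] @ h_enc + p["W_ho"] @ h_prev + p["W_co"] @ c + p["b_o"])
    h = o * np.tanh(c)
    T = p["W_ts"] @ h + p["b_ts"]
    return h, c, T


def tag_probabilities(T, W_y, b_y):
    """Softmax over tags: Equations (14)-(15), with the usual max-shift for stability."""
    y = W_y @ T + b_y
    e = np.exp(y - y.max())
    return e / e.sum()


rng = np.random.default_rng(0)
d_enc, d_dec, d_tag, n_tags = 6, 4, 3, 5  # toy sizes, not the paper's settings
params = init_params(d_enc, d_dec, d_tag, rng)
W_y, b_y = rng.normal(0, 0.1, (n_tags, d_tag)), np.zeros(n_tags)

h, c, T = np.zeros(d_dec), np.zeros(d_dec), np.zeros(d_tag)
for h_enc in rng.normal(size=(5, d_enc)):  # stand-in for five encoder outputs
    h, c, T = lstmd_step(h_enc, h, c, T, params)
    probs = tag_probabilities(T, W_y, b_y)
```

The tag vector T produced at one step is fed back as an input at the next step, which is how the decoder can model interactions between neighbouring tags.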
[Figure 3 (a) labels, bottom to top: Input Sentence ("The United States president Trump"); Embedding Layer (W1 to W5); Encoding Layer (Bi-LSTM, h1 to h5); Decoding Layer (LSTMd, T1 to T5); Softmax; Output (tags including O, B-CP-1, E-CP-1, S-CP-2).]

[...] corresponding to the t-th word in the sentence and n is the length of the given sentence. After the word embedding layer, there are two parallel LSTM layers: a forward LSTM layer and a backward LSTM layer. The LSTM architecture consists of a set of recurrently connected subnets, known as memory blocks. Each time-step is an LSTM memory block. The LSTM memory block in the Bi-LSTM encoding layer is used to compute the current hidden vector $h_t$ based on the previous hidden vector $h_{t-1}$, the previous cell vector $c_{t-1}$ and the current input word embedding $w_t$. Its structure diagram is shown in Figure 3 (b), and the detailed operations are defined as follows:

$$i_t = \delta(W_{wi} w_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \tag{1}$$
$$f_t = \delta(W_{wf} w_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f), \tag{2}$$
$$z_t = \tanh(W_{wc} w_t + W_{hc} h_{t-1} + b_c), \tag{3}$$
$$c_t = f_t c_{t-1} + i_t z_t, \tag{4}$$
$$o_t = \delta(W_{wo} w_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \tag{5}$$
$$h_t = o_t \tanh(c_t), \tag{6}$$

where i, f and o are the input gate, forget gate and output gate respectively, b is the bias term, c is the cell memory, and $W_{(.)}$ are the parameters. For each word $w_t$, the forward LSTM layer encodes $w_t$ by considering the contextual information from word $w_1$ to $w_t$, which is marked as $\overrightarrow{h_t}$. In a similar way, the backward LSTM layer encodes $w_t$ based on the contextual information from $w_n$ to $w_t$, which is marked as $\overleftarrow{h_t}$. Finally, we concatenate $\overleftarrow{h_t}$ and $\overrightarrow{h_t}$ to represent word t's encoding information, denoted as $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$.

The LSTM Decoding Layer. We also adopt an LSTM structure to produce the tag sequence. When detecting the tag of word $w_t$, the inputs of the decoding layer are: $h_t$ obtained from the Bi-LSTM encoding layer, the former predicted tag embedding $T_{t-1}$, the former cell value $c^{(2)}_{t-1}$, and the former hidden vector in the decoding layer $h^{(2)}_{t-1}$. The structure diagram of the memory block in LSTMd is shown in Figure 3 (c), and the detailed operations are defined as follows:

$$i^{(2)}_t = \delta(W^{(2)}_{wi} h_t + W^{(2)}_{hi} h^{(2)}_{t-1} + W_{ti} T_{t-1} + b^{(2)}_i), \tag{7}$$
$$f^{(2)}_t = \delta(W^{(2)}_{wf} h_t + W^{(2)}_{hf} h^{(2)}_{t-1} + W_{tf} T_{t-1} + b^{(2)}_f), \tag{8}$$
$$z^{(2)}_t = \tanh(W^{(2)}_{wc} h_t + W^{(2)}_{hc} h^{(2)}_{t-1} + W_{tc} T_{t-1} + b^{(2)}_c), \tag{9}$$
$$c^{(2)}_t = f^{(2)}_t c^{(2)}_{t-1} + i^{(2)}_t z^{(2)}_t, \tag{10}$$
$$o^{(2)}_t = \delta(W^{(2)}_{wo} h_t + W^{(2)}_{ho} h^{(2)}_{t-1} + W^{(2)}_{co} c^{(2)}_t + b^{(2)}_o), \tag{11}$$
$$h^{(2)}_t = o^{(2)}_t \tanh(c^{(2)}_t), \tag{12}$$
$$T_t = W_{ts} h^{(2)}_t + b_{ts}. \tag{13}$$

The final softmax layer computes normalized entity tag probabilities based on the tag predicted vector $T_t$:

$$y_t = W_y T_t + b_y, \tag{14}$$
$$p^i_t = \frac{\exp(y^i_t)}{\sum_{j=1}^{N_t} \exp(y^j_t)}, \tag{15}$$

where $W_y$ is the softmax matrix and $N_t$ is the total number of tags. Because T is similar to a tag embedding and LSTM is capable of learning long-term dependencies, this decoding manner can model tag interactions.

The Bias Objective Function. We train our model to maximize the log-likelihood of the data, and the optimization method we use is RMSprop, proposed by Hinton in (Tieleman and Hinton, 2012). The objective function can be defined as:

$$L = \max \sum_{j=1}^{|D|} \sum_{t=1}^{L_j} \Big( \log\big(p^{(j)}_t = y^{(j)}_t \mid x_j, \Theta\big) \cdot I(O) + \alpha \cdot \log\big(p^{(j)}_t = y^{(j)}_t \mid x_j, \Theta\big) \cdot (1 - I(O)) \Big),$$

where $|D|$ is the size of the training set, $L_j$ is the length of sentence $x_j$, $y^{(j)}_t$ is the label of word t in sentence $x_j$ and $p^{(j)}_t$ is the normalized probability of tags defined in Equation (15). Besides, $I(O)$ is a switching function that distinguishes the loss of tag 'O' from the loss of the relational tags that can indicate the results. It is defined as follows:

$$I(O) = \begin{cases} 1, & \text{if } tag = \text{'O'} \\ 0, & \text{if } tag \neq \text{'O'}. \end{cases}$$

α is the bias weight. The larger α is, the greater the influence of relational tags on the model.

4 Experiments

4.1 Experimental setting

Dataset. To evaluate the performance of our methods, we use the public dataset NYT [2], which is produced by a distant supervision method (Ren et al., 2017). A large amount of training data can be obtained by means of distant supervision methods without manual labeling, while the test set is manually labeled to ensure its quality. In total, the training data contains 353k triplets, and the test set contains 3,880 triplets. Besides, the size of the relation set is 24.

Evaluation. We adopt standard Precision (Prec), Recall (Rec) and F1 score to evaluate the results. Different from classical methods, our method can extract triplets without knowing the information of entity types. In other words, we did not use the label of entity types to train the model, therefore we do not need to consider the entity types in the evaluation. A triplet is regarded as correct when its relation type and the head offsets of the two corresponding entities are both correct. Besides, the ground-truth relation mentions are given and the "None" label is excluded, as in (Ren et al., 2017; Li and Ji, 2014; Miwa and Bansal, 2016). We create a validation set by randomly sampling 10% of the data from the test set and use the remaining data for evaluation, following the suggestion of (Ren et al., 2017). We run each experiment 10 times and report the average results and their standard deviation, as Table 1 shows.

Hyperparameters. Our model consists of a Bi-LSTM encoding layer and an LSTM decoding layer with the bias objective function. The word embeddings used in the encoding part are initialized by running word2vec [3] (Mikolov et al., 2013) on the NYT training corpus. The dimension of the word embeddings is d = 300. We regularize our network using dropout on the embedding layer, and the dropout ratio is 0.5. The number of LSTM units in the encoding layer is 300 and the number in the decoding layer is 600. The bias parameter α corresponding to the results in Table 1 is 10.

[2] The dataset can be downloaded at: https://github.com/shanzhenren/CoType. There are three data sets in the public resource and we only use the NYT dataset, because more than 50% of the data in BioInfer has overlapping relations, which is beyond the scope of this paper. As for the dataset Wiki-KBP, the number of relation types in the test set is more than that of the train set, which is also not suitable for a supervised training method. Details of the data can be found in (Ren et al., 2017).
[3] https://code.google.com/archive/p/word2vec/
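With the bias weight α set to 10 as in the hyperparameter paragraph, the bias objective can be illustrated for a single sentence. The following is a minimal NumPy sketch, assuming the tag probabilities have already been produced by the softmax layer. The function name `bias_log_likelihood`, the argument `o_index` (the index of tag 'O') and the toy probabilities are assumptions made for this illustration.

```python
import numpy as np


def bias_log_likelihood(probs, gold, o_index, alpha):
    """Bias objective for one sentence.

    probs: array of shape (L, N_t), row t holds the tag probabilities of word t.
    gold:  array of shape (L,), gold tag index of each word.
    Words tagged 'O' get weight 1 (I(O) = 1); relational tags get weight alpha.
    """
    probs = np.asarray(probs, dtype=float)
    gold = np.asarray(gold)
    log_p = np.log(probs[np.arange(len(gold)), gold])  # log-probability of each gold tag
    weights = np.where(gold == o_index, 1.0, alpha)
    return float(np.sum(weights * log_p))


# Toy sentence with three words and three tags; tag 0 plays the role of 'O'.
probs = [[0.50, 0.25, 0.25],
         [0.25, 0.50, 0.25],
         [0.25, 0.25, 0.50]]
gold = [0, 1, 2]

plain = bias_log_likelihood(probs, gold, o_index=0, alpha=1.0)    # ordinary log-likelihood
biased = bias_log_likelihood(probs, gold, o_index=0, alpha=10.0)  # relational tags count 10x
```

Summing this quantity over all training sentences gives the objective to maximize. In a gradient-based trainer one would typically minimize its negative. With α = 1 the expression reduces to the ordinary log-likelihood.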
Methods            Prec.            Rec.             F1
FCM                0.553            0.154            0.240
DS+logistic        0.258            0.393            0.311
LINE               0.335            0.329            0.332
MultiR             0.338            0.327            0.333
DS-Joint           0.574            0.256            0.354
CoType             0.423            0.511            0.463
LSTM-CRF           0.693 ± 0.008    0.310 ± 0.007    0.428 ± 0.008
LSTM-LSTM          0.682 ± 0.007    0.320 ± 0.006    0.436 ± 0.006
LSTM-LSTM-Bias     0.615 ± 0.008    0.414 ± 0.005    0.495 ± 0.006
Table 1: The predicted results of different methods on extracting both entities and their relations. The
first part (from row 1 to row 3) is the pipelined methods and the second part (row 4 to 6) is the jointly
extracting methods. Our tagging methods are shown in part three (row 7 to 9). In this part, we not only
report the results of precision, recall and F1, we also compute their standard deviation.
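The scores in Table 1 follow the evaluation criterion of this paper: a predicted triplet counts as correct only when its relation type and the head offsets of both entities match the gold standard. The following is a minimal sketch of that computation, assuming each triplet is stored as a tuple (sentence id, head offset of entity 1, relation type, head offset of entity 2). The tuple layout, the function names and the toy data are assumptions for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scoring over sets of (sentence_id, head1, relation, head2) tuples."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


gold = {(0, 1, "Country-President", 4), (0, 8, "Company-Founder", 12)}
predicted = {
    (0, 1, "Country-President", 4),   # correct: relation and both head offsets match
    (0, 8, "Country-President", 12),  # wrong: offsets match but the relation type does not
}
p, r, f1 = precision_recall_f1(predicted, gold)

# Consistency check against two rows of Table 1 (mean precision and recall).
f1_bias = round(f1_from(0.615, 0.414), 3)    # LSTM-LSTM-Bias
f1_cotype = round(f1_from(0.423, 0.511), 3)  # CoType
```

On the toy data one of two predictions is correct and one of two gold triplets is recovered, so precision, recall and F1 are all 0.5. The two table rows reproduce the reported F1 values of 0.495 and 0.463, a gap of about 0.03.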
Baselines. We compare our method with several classical triplet extraction methods, which can be divided into the following categories: the pipelined methods, the jointly extracting methods and the end-to-end methods based on our tagging scheme.

For the pipelined methods, we follow the settings of (Ren et al., 2017): the NER results are obtained by CoType (Ren et al., 2017), then several classical relation classification methods are applied to detect the relations. These methods are: (1) DS+logistic (Mintz et al., 2009) is a distant supervised and feature based method, which combines the advantages of supervised IE and unsupervised IE features; (2) LINE (Tang et al., 2015) is a network embedding method, which is suitable for arbitrary types of information networks; (3) FCM (Gormley et al., 2015) is a compositional model that combines lexicalized linguistic context and word embeddings for relation extraction.

The jointly extracting methods used in this paper are listed as follows: (4) DS-Joint (Li and Ji, 2014) is a supervised method, which jointly extracts entities and relations using structured perceptron on a human-annotated dataset; (5) MultiR (Hoffmann et al., 2011) is a typical distant supervised method based on multi-instance learning algorithms to combat the noisy training data; (6) CoType (Ren et al., 2017) is a domain independent framework that jointly embeds entity mentions, relation mentions, text features and type labels into meaningful representations.

In addition, we also compare our method with two classical end-to-end tagging models: LSTM-CRF (Lample et al., 2016) and LSTM-LSTM (Vaswani et al., 2016). LSTM-CRF is proposed for entity recognition by using a bidirectional LSTM to encode the input sentence and a conditional random field to predict the entity tag sequence. Different from LSTM-CRF, LSTM-LSTM uses an LSTM layer to decode the tag sequence instead of CRF. They are used for the first time to jointly extract entities and relations based on our tagging scheme.

4.2 Experimental Results

We report the results of different methods in Table 1. It can be seen that our method, LSTM-LSTM-Bias, outperforms all other methods in F1 score and achieves an improvement of about 3 points in F1 (0.495 versus 0.463) over the best existing method, CoType (Ren et al., 2017). This shows the effectiveness of our proposed method. Furthermore, Table 1 also shows that the jointly extracting methods are better than the pipelined methods, and the tagging methods are better than most of the jointly extracting methods. This also supports the validity of our tagging scheme for the task of jointly extracting entities and relations.

When compared with the traditional methods, the precision of the end-to-end models is significantly improved, but only LSTM-LSTM-Bias achieves a better balance between precision and recall. The reason may be that these end-to-end models all use a Bi-LSTM to encode the input sentence and different neural networks to decode the results. Methods based on neural networks can fit the data well. Therefore, they can learn the common features of the training set well, which may lead to lower expansibility, that is, weaker generalization. We also find that the LSTM-LSTM model is better than the LSTM-CRF model based on our tagging scheme. This is because LSTM is capable of learning long-term dependencies, whereas CRF (Lafferty et al., 2001) is good at capturing the joint probability of the entire sequence of labels, and the related tags may be far from each other. Hence, the LSTM decoding manner is slightly better than CRF. LSTM-LSTM-Bias adds a bias weight to enhance the effect of entity tags and weaken the effect of the invalid tag. Therefore, in this tagging scheme, our method can be better than the common LSTM-decoding methods.

5 Analysis and Discussion

5.1 Error Analysis

In this paper, we focus on extracting triplets composed of two entities and a relation. Table 1 shows the prediction results for this task. A triplet is treated as correct only when the relation type and the head offsets of the two corresponding entities are both correct. In order to find out the factors that affect the results of end-to-end models, we analyze the performance on predicting each element [...] predicted to be wrong because the relation type is predicted to be wrong.

Elements           E1                       E2                       (E1,E2)
PRF                Prec.   Rec.    F1       Prec.   Rec.    F1       Prec.   Rec.    F1
LSTM-CRF           0.596   0.325   0.420    0.605   0.325   0.423    0.724   0.341   0.465
LSTM-LSTM          0.593   0.342   0.434    0.619   0.334   0.434    0.705   0.340   0.458
LSTM-LSTM-Bias     0.590   0.479   0.529    0.597   0.451   0.514    0.645   0.437   0.520
Table 2: The predicted results of triplet's elements based on our tagging scheme.

5.2 Analysis of Biased Loss

Different from LSTM-CRF and LSTM-LSTM, our approach is biased towards relational labels to enhance links between entities. In order to further analyze the effect of the bias objective function, we visualize the ratio of predicted single entities for each end-to-end method in Figure 4. Single entities are those that cannot find their corresponding entities. Figure 4 shows that, whether for E1 or E2, our method obtains a relatively low ratio of single entities. It means that our method can effectively associate two entities, compared with LSTM-CRF and LSTM-LSTM, which pay little attention to the relational tags.

[Figure 4 plot: y-axis "The Ratio of Single E"; legend LSTM-CRF, LSTM-LSTM, LSTM-LSTM-Bias; visible bar labels include 0.178, 0.186 and 0.167.]
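The single-entity ratio discussed in Section 5.2 can be made concrete with a small computation. The paper does not spell out the exact formula, so the following is a minimal sketch under an assumed definition: a predicted entity with relation type r and role k is "single" if the same sentence contains no predicted entity with relation type r and the other role, and the ratio for role k is the number of single role-k entities divided by all predicted role-k entities. The function name and the input layout are assumptions for illustration.

```python
from collections import Counter


def single_entity_ratio(sentences):
    """Ratio of single entities per role.

    sentences: list of sentences, each a list of predicted (relation, role) pairs
    with role 1 for E1 and role 2 for E2.
    """
    total = Counter()
    single = Counter()
    for entities in sentences:
        present = set(entities)
        for relation, role in entities:
            total[role] += 1
            partner = (relation, 3 - role)  # same relation type, opposite role
            if partner not in present:
                single[role] += 1
    return {role: single[role] / total[role] for role in total}


predictions = [
    [("CP", 1), ("CP", 2)],  # both roles found: no single entity
    [("CF", 1)],             # E1 without a matching E2
    [("CP", 1), ("CF", 2)],  # relation types differ: both entities are single
]
ratios = single_entity_ratio(predictions)
```

In this toy input two of the three E1 predictions and one of the two E2 predictions lack a partner, giving ratios of 2/3 and 1/2. A lower ratio means the model more often predicts both sides of a relation.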
Table 3: Output from different models. Standard Si represents the gold standard of sentence i. The blue part is the correct result, and the red one is the wrong one. E1_CF in case '3' is short for E1_Company-Founder.
[Figure 5 plot: x-axis "Bias Parameter" (0 to 20); y-axis from 0.0 to 0.5; curves for Precision, Recall and F1.]
Figure 5: The results predicted by LSTM-LSTM-Bias on different bias parameter α.

5.3 Case Study

In this section, we observe the prediction results of end-to-end methods, and then select several representative examples to illustrate the advantages and disadvantages of the methods, as Table 3 shows. Each example contains three rows: the first row is the gold standard, and the second and third rows are the extracted results of the models LSTM-LSTM and LSTM-LSTM-Bias respectively.

S1 represents the situation that the distance between the two interrelated entities is far away [...] entities. Therefore, in this example, LSTM-LSTM-Bias can extract two related entities, while LSTM-LSTM can only extract one entity, "Florida", and cannot detect the entity "Panama City Beach".

S2 is a negative example that shows these methods may mistakenly predict one of the entities. There are no indicative words between the entities Nuremberg and Germany. Besides, the pattern "a * of *" between Germany and Middle Ages may easily mislead the models into predicting that there exists a relation of "Contains" between them. The problem can be solved by adding some samples of this kind of expression pattern to the training data.

S3 is a case in which the models predict the entities' head offsets correctly, but the relational role is wrong. LSTM-LSTM treats both "Stephen A. Schwarzman" and "Blackstone Group" as entity E1, and cannot find the corresponding E2. Although LSTM-LSTM-Bias can find the entity pair (E1, E2), it reverses the roles of "Stephen A. Schwarzman" and "Blackstone Group". This shows that LSTM-LSTM-Bias is better at predicting entity pairs, but it remains to be improved in distinguishing the relationship between the two entities.

6 Conclusion

In this paper, we propose a novel tagging scheme and investigate end-to-end models to jointly extract entities and relations. The experimental results show the effectiveness of our proposed method. However, it still has shortcomings in identifying overlapping relations. In future work, we will replace the softmax function in the output layer with multiple classifiers, so that a word can have multiple tags. In this way, a word can appear in multiple triplet results, which can solve the problem of overlapping relations. Although our model can enhance the effect of entity tags, the association between two corresponding entities still requires refinement in future work.

Acknowledgments

We thank Xiang Ren for dataset details and helpful discussions. This work is also supported by the National High Technology Research and Development Program of China (863 Program) (Grant No. 2015AA015402), the National Natural Science Foundation of China (No. 61602479) and the NSFC project 61501463.

References

Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI. volume 7, pages 2670–2676.

Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional lstm-cnns. In Proceedings of Transactions of the Association for Computational Linguistics.

Cícero Nogueira dos Santos et al. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd ACL international conference. volume 1, pages 626–634.

Matthew R Gormley, Mo Yu, and Mark Dredze. 2015. Improved relation extraction with feature-rich compositional embedding models. In Proceedings of the EMNLP.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 541–550.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP. volume 3, page 413.

Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the 43rd ACL international conference. page 22.

Arzoo Katiyar and Claire Cardie. 2016. Investigating lstms for joint extraction of opinion entities and relations. In Proceedings of the 54th ACL international conference.

John Lafferty, Andrew McCallum, Fernando Pereira, et al. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning, ICML. volume 1, pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the NAACL international conference.

Qi Li and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. pages 402–412.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Conference on Empirical Methods in Natural Language Processing. pages 879–888.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL. Association for Computational Linguistics, pages 1003–1011.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. pages 1858–1869.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1):3–26.

Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In International Conference on Computational Linguistics. pages 78–86.

Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R Voss, Heng Ji, Tarek F Abdelzaher, and Jiawei Han. 2017. Cotype: Joint extraction of typed entities and relations with knowledge bases. In Proceedings of the 26th WWW international conference.

Bryan Rink et al. 2010. Utd: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation. pages 256–259.

Xiaofeng Yu and Wai Lam. 2010. Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach. In Proceedings of the 21st COLING international conference. pages 1399–1407.

Daojian Zeng et al. 2014. Relation classification via convolutional deep neural network. In Proceedings of the 25th COLING international conference. pages 2335–2344.

Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017. Neural models for sequence chunking. In Proceedings of the AAAI international conference.

Suncong Zheng, Jiaming Xu, Peng Zhou, Hongyun Bao, Zhenyu Qi, and Bo Xu. 2016. A neural network framework for relation extraction: Learning entity semantic and relation pattern. Knowledge-Based Systems 114:12–23.