Effective Attention Modeling for Neural Relation Extraction

Tapas Nayak and Hwee Tou Ng


Department of Computer Science
National University of Singapore
nayakt@u.nus.edu, nght@comp.nus.edu.sg

Abstract

Relation extraction is the task of determining the relation between two entities in a sentence. Distantly-supervised models are popular for this task. However, sentences can be long and two entities can be located far from each other in a sentence. The pieces of evidence supporting the presence of a relation between two entities may not be very direct, since the entities may be connected via some indirect links such as a third entity or via co-reference. Relation extraction in such scenarios becomes more challenging as we need to capture the long-distance interactions among the entities and other words in the sentence. Also, the words in a sentence do not contribute equally in identifying the relation between the two entities. To address this issue, we propose a novel and effective attention model which incorporates syntactic information of the sentence and a multi-factor attention mechanism. Experiments on the New York Times corpus show that our proposed model outperforms prior state-of-the-art models.

1 Introduction

Relation extraction from unstructured text is an important task to build knowledge bases (KB) automatically. Banko et al. (2007) used open information extraction (Open IE) to extract relation triples from sentences where verbs were considered as the relation, whereas supervised information extraction systems extract a set of pre-defined relations from text. Mintz et al. (2009), Riedel et al. (2010), and Hoffmann et al. (2011) proposed distant supervision to generate the training data for sentence-level relation extraction, where relation tuples (two entities and the relation between them) from a knowledge base such as Freebase (Bollacker et al., 2008) were mapped to free text (Wikipedia articles or New York Times articles). The idea is that if a sentence contains both entities of a tuple, it is chosen as a training sentence of that tuple. Although this process can generate some noisy training instances, it can give a significant amount of training data which can be used to build supervised models for this task.

Mintz et al. (2009), Riedel et al. (2010), and Hoffmann et al. (2011) proposed feature-based learning models and used entity tokens and their nearby tokens, their part-of-speech tags, and other linguistic features to train their models. Recently, many neural network-based models have been proposed to avoid feature engineering. Zeng et al. (2014, 2015) used convolutional neural networks (CNN) with max-pooling to find the relation between two given entities. Though these models have been shown to perform reasonably well on distantly supervised data, they sometimes fail to find the relation when sentences are long and entities are located far from each other. CNN models with max-pooling have limitations in understanding the semantic similarity of words with the given entities, and they also fail to capture the long-distance dependencies among the words and entities, such as co-reference. In addition, all the words in a sentence may not be equally important in finding the relation, and this issue is more prominent in long sentences. Prior CNN-based models have limitations in identifying the multiple important factors to focus on in sentence-level relation extraction.

To address this issue, we propose a novel multi-factor attention model focusing on the syntactic structure of a sentence for relation extraction. We use a dependency parser to obtain the syntactic structure of a sentence. We use a linear form of attention to measure the semantic similarity of words with the given entities and combine it with the dependency distance of words from the given entities to measure their influence in identifying the relation. Also, single attention may not be able to capture all pieces of evidence for identifying the relation due to normalization of attention scores. Thus we use multi-factor attention in the proposed model. Experiments on the New York Times (NYT) corpus show that the proposed model outperforms prior work in terms of F1 scores on sentence-level relation extraction. The code and data of this paper can be found at https://github.com/nusnlp/MFA4RE.


2 Task Description

Sentence-level relation extraction is defined as follows: Given a sentence S and two entities {E1, E2} marked in the sentence, find the relation r(E1, E2) between these two entities in S from a pre-defined set of relations R ∪ {None}. None indicates that none of the relations in R holds between the two marked entities in the sentence. The relation between the entities is argument order-specific, i.e., r(E1, E2) and r(E2, E1) are not the same. Input to the system is a sentence S and two entities E1 and E2, and output is the relation r(E1, E2) ∈ R ∪ {None}.

3 Model Description

We use four types of embedding vectors in our model: (1) a word embedding vector w ∈ R^{d_w}; (2) an entity token indicator embedding vector z ∈ R^{d_z}, which indicates if a word belongs to entity 1, entity 2, or does not belong to any entity; (3) a positional embedding vector u1 ∈ R^{d_u} which represents the linear distance of a word from the start token of entity 1; and (4) another positional embedding vector u2 ∈ R^{d_u} which represents the linear distance of a word from the start token of entity 2.

We use a bi-directional long short-term memory (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) layer to capture the interaction among words in a sentence S = {w1, w2, ..., wn}, where n is the sentence length. The input to this layer is the concatenated vector x ∈ R^{d_w + d_z} of word embedding vector w and entity token indicator embedding vector z.

x_t = w_t || z_t
→h_t = LSTM(x_t, →h_{t-1})
←h_t = LSTM(x_t, ←h_{t+1})
h_t = →h_t || ←h_t

→h_t ∈ R^{d_w + d_z} and ←h_t ∈ R^{d_w + d_z} are the output at the t-th step of the forward LSTM and backward LSTM respectively. We concatenate them to obtain the t-th Bi-LSTM output h_t ∈ R^{2(d_w + d_z)}.

3.1 Global Feature Extraction

We use a convolutional neural network (CNN) to extract the sentence-level global features for relation extraction. We concatenate the positional embeddings u1 and u2 of words with the hidden representation of the Bi-LSTM layer and use the convolution operation with max-pooling on concatenated vectors to extract the global feature vector.

q_t = h_t || u1_t || u2_t
c_i = f^T (q_i || q_{i+1} || ... || q_{i+k-1})
c_max = max(c_1, c_2, ..., c_n)
v_g = [c_max^1, c_max^2, ..., c_max^{f_g}]

q_t ∈ R^{2(d_w + d_z + d_u)} is the concatenated vector for the t-th word and f is a convolutional filter vector of dimension 2k(d_w + d_z + d_u), where k is the filter width. The index i moves from 1 to n and produces a set of scalar values {c_1, c_2, ..., c_n}. The max-pooling operation chooses the maximum c_max from these values as a feature. With f_g number of filters, we get a global feature vector v_g ∈ R^{f_g}.

3.2 Attention Modeling

Figure 1 shows the architecture of our attention model. We use a linear form of attention to find the semantically meaningful words in a sentence with respect to the entities which provide the pieces of evidence for the relation between them. Our attention mechanism uses the entities as attention queries and their vector representation is very important for our model. Named entities mostly consist of multiple tokens and many of them may not be present in the training data or their frequency may be low. The nearby words of an entity can give significant information about the entity. Thus we use the tokens of an entity and its nearby tokens to obtain its vector representation. We use the convolution operation with max-pooling in the context of an entity to get its vector representation.

c_i = f^T (x_i || x_{i+1} || ... || x_{i+k-1})
c_max = max(c_b, c_{b+1}, ..., c_e)
v_e = [c_max^1, c_max^2, ..., c_max^{f_e}]
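As a concrete illustration of the encoder and the convolution-with-max-pooling feature extractor described above, the following PyTorch sketch shows one possible implementation. It is our own minimal example, not the authors' released code (which is linked above); the dimension settings follow the values reported later in Section 4.3 (d_w = 50, d_z = 10, 230 filters of width 3), while the class names, the padding choice, and the toy inputs are assumptions made only to keep the snippet runnable.

    import torch
    import torch.nn as nn

    class ConvMaxPool(nn.Module):
        """Convolution over a sequence of vectors followed by max-pooling over time,
        as used above for the global feature vector v_g and the entity vectors v_e."""
        def __init__(self, in_dim, num_filters=230, width=3):
            super().__init__()
            # Padding keeps short entity-context spans usable (our choice, not stated in the paper).
            self.conv = nn.Conv1d(in_dim, num_filters, kernel_size=width, padding=width // 2)

        def forward(self, seq):                               # seq: (batch, seq_len, in_dim)
            c = torch.tanh(self.conv(seq.transpose(1, 2)))    # (batch, num_filters, seq_len)
            return c.max(dim=2).values                        # max-pooling over time

    class BiLSTMEncoder(nn.Module):
        """x_t = w_t || z_t  ->  Bi-LSTM  ->  h_t of size 2*(d_w + d_z)."""
        def __init__(self, vocab_size, d_w=50, d_z=10):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, d_w)
            self.ind_emb = nn.Embedding(3, d_z)               # entity 1 / entity 2 / other
            self.lstm = nn.LSTM(d_w + d_z, d_w + d_z, batch_first=True, bidirectional=True)

        def forward(self, words, indicators):                 # both: (batch, seq_len) id tensors
            x = torch.cat([self.word_emb(words), self.ind_emb(indicators)], dim=-1)
            h, _ = self.lstm(x)                               # h: (batch, seq_len, 2*(d_w + d_z))
            return x, h

    # Toy usage with random ids (hypothetical vocabulary of size 1000, sentence length 20).
    enc = BiLSTMEncoder(vocab_size=1000)
    words = torch.randint(0, 1000, (2, 20))
    indicators = torch.randint(0, 3, (2, 20))
    x, h = enc(words, indicators)
    entity_cnn = ConvMaxPool(in_dim=x.size(-1))               # over x_t within an entity's context
    global_cnn = ConvMaxPool(in_dim=h.size(-1) + 2 * 5)       # over h_t || u1_t || u2_t with d_u = 5
    v_e1 = entity_cnn(x[:, 3:9, :])                           # a hypothetical entity span plus nearby words
    print(h.shape, v_e1.shape)                                # torch.Size([2, 20, 120]) torch.Size([2, 230])

In this sketch the same ConvMaxPool module is instantiated once for the global feature vector (its input would be h_t concatenated with the two positional embeddings) and once per entity; the pooled outputs play the roles of v_g, v_e1, and v_e2 in the following subsections.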

Figure 1: Architecture of our attention model with m = 1. We have not shown the CNN-based global feature
extraction here. FFN=feed-forward network.

f is a convolutional filter vector of size k(d_w + d_z), where k is the filter width and x is the concatenated vector of word embedding vector (w) and entity token indicator embedding vector (z). b and e are the start and end index of the sequence of words comprising an entity and its neighboring context in the sentence, where 1 ≤ b ≤ e ≤ n. The index i moves from b to e and produces a set of scalar values {c_b, c_{b+1}, ..., c_e}. The max-pooling operation chooses the maximum c_max from these values as a feature. With f_e number of filters, we get the entity vector v_e ∈ R^{f_e}. We do this for both entities and get v_e1 ∈ R^{f_e} and v_e2 ∈ R^{f_e} as their vector representation. We adopt a simple linear function as follows to measure the semantic similarity of words with the given entities:

f_score^1(h_i, v_e1) = h_i^T W_a1 v_e1
f_score^2(h_i, v_e2) = h_i^T W_a2 v_e2

h_i is the Bi-LSTM hidden representation of the i-th word. W_a1, W_a2 ∈ R^{2(d_w + d_z) × f_e} are trainable weight matrices. f_score^1(h_i, v_e1) and f_score^2(h_i, v_e2) represent the semantic similarity score of the i-th word and the two given entities.

Not all words in a sentence are equally important in finding the relation between the two entities. The words which are closer to the entities are generally more important. To address this issue, we propose to incorporate the syntactic structure of a sentence in our attention mechanism. The syntactic structure is obtained from the dependency parse tree of the sentence. Words which are closer to the entities in the dependency parse tree are more relevant to finding the relation. In our model, we define the dependency distance to every word from the head token (last token) of an entity as the number of edges along the dependency path (see Figure 2 for an example). We use a distance window size ws, and words whose dependency distance is within this window receive attention, while the other words are ignored. The details of our attention mechanism follow.

d_i^1 = exp(f_score^1(h_i, v_e1)) / 2^{l_i^1 - 1}    if l_i^1 ∈ [1, ws]
d_i^1 = exp(f_score^1(h_i, v_e1)) / 2^{ws}           otherwise

d_i^2 = exp(f_score^2(h_i, v_e2)) / 2^{l_i^2 - 1}    if l_i^2 ∈ [1, ws]
d_i^2 = exp(f_score^2(h_i, v_e2)) / 2^{ws}           otherwise

p_i^1 = d_i^1 / Σ_j d_j^1,    p_i^2 = d_i^2 / Σ_j d_j^2

d_i^1 and d_i^2 are un-normalized attention scores and p_i^1 and p_i^2 are the normalized attention scores for the i-th word with respect to entity 1 and entity 2 respectively. l_i^1 and l_i^2 are the dependency distances of the i-th word from the two entities.
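The scoring scheme above can be sketched in a few lines of PyTorch. This is our own illustrative example rather than the released implementation; the tensors h, v_e, W_a, and the dependency distances are assumed to come from the Bi-LSTM encoder, the entity-context CNN, and the dependency parser respectively, and the toy dimensions follow the settings reported later in Section 4.3.

    import torch

    def dep_scaled_attention(h, v_e, W_a, dep_dist, ws=5):
        """
        h:        (seq_len, hidden)  Bi-LSTM hidden states h_i
        v_e:      (f_e,)             entity vector from the entity-context CNN
        W_a:      (hidden, f_e)      trainable attention matrix (a plain tensor here)
        dep_dist: (seq_len,)         dependency distance l_i of each word from the entity head token
        Returns the normalized attention weights p_i.
        """
        # Linear similarity score: f_score(h_i, v_e) = h_i^T W_a v_e
        scores = h @ W_a @ v_e                                        # (seq_len,)
        # Scale by 1 / 2^(l_i - 1) inside the distance window, and by 1 / 2^ws otherwise.
        in_window = (dep_dist >= 1) & (dep_dist <= ws)
        scale = torch.where(in_window,
                            0.5 ** (dep_dist.float() - 1.0),
                            torch.full_like(dep_dist.float(), 0.5 ** ws))
        d = scale * torch.exp(scores)                                 # un-normalized scores d_i
        return d / d.sum()                                            # normalized scores p_i

    # Toy example with random tensors.
    torch.manual_seed(0)
    h = torch.randn(12, 120)            # 12 words, Bi-LSTM output size 2*(d_w + d_z) = 120
    v_e1 = torch.randn(230)             # f_e = 230
    W_a1 = torch.randn(120, 230)
    dep_dist = torch.randint(1, 9, (12,))
    p1 = dep_scaled_attention(h, v_e1, W_a1, dep_dist)
    v_a1 = p1 @ h                       # weighted sum of hidden states (the attention feature vector)
    print(round(float(p1.sum()), 4), v_a1.shape)

Calling the same function with W_a2, v_e2, and the distances to the second entity would give p_i^2; the weighted sums of the hidden states form the attention feature vectors introduced next.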
Figure 2: An example dependency tree. The two numbers indicate the distance of the word from the head token of the two entities respectively along the dependency tree path.

We mask those words whose average dependency distance from the two entities is larger than ws. We use the semantic meaning of the words and their dependency distance from the two entities together in our attention mechanism. The attention feature vectors v_a1 and v_a2 with respect to the two entities are determined as follows:

v_a1 = Σ_{i=1}^{n} p_i^1 h_i,    v_a2 = Σ_{i=1}^{n} p_i^2 h_i

3.3 Multi-Factor Attention

Two entities in a sentence, when located far from each other, can be linked via more than one co-reference chain or more than one important word. Due to the normalization of the attention scores as described above, single attention cannot capture all relevant information needed to find the relation between two entities. Thus we use a multi-factor attention mechanism, where the number of factors is a hyper-parameter, to gather all relevant information for identifying the relation. We replace the attention matrix W_a with an attention tensor W_a^{1:m} ∈ R^{m × 2(d_w + d_z) × 2f_e}, where m is the factor count. This gives us m attention vectors with respect to each entity. We concatenate all the feature vectors obtained using these attention vectors to get the multi-attentive feature vector v_ma ∈ R^{4m(d_w + d_z)}.

3.4 Relation Extraction

We concatenate v_g, v_ma, v_e1, and v_e2, and this concatenated feature vector is given to a feed-forward layer with softmax activation to predict the normalized probabilities for the relation labels.

r = softmax(W_r (v_g || v_ma || v_e1 || v_e2) + b_r)

W_r ∈ R^{(f_g + 2f_e + 4m(d_w + d_z)) × (|R| + 1)} is the weight matrix, b_r ∈ R^{|R| + 1} is the bias vector of the feed-forward layer for relation extraction, and r is the vector of normalized probabilities of relation labels.

3.5 Loss Function

We calculate the loss over each mini-batch of size B. We use the following negative log-likelihood as our objective function for relation extraction:

L = -(1/B) Σ_{i=1}^{B} log p(r_i | s_i, e_i^1, e_i^2, θ)

where p(r_i | s_i, e_i^1, e_i^2, θ) is the conditional probability of the true relation r_i when the sentence s_i, the two entities e_i^1 and e_i^2, and the model parameters θ are given.

4 Experiments

4.1 Datasets

We use the New York Times (NYT) corpus (Riedel et al., 2010) in our experiments. There are two versions of this corpus: (1) The original NYT corpus created by Riedel et al. (2010), which has 52 valid relations and a None relation. We name this dataset NYT10. The training dataset has 455,412 instances; 330,776 of the instances belong to the None relation and the remaining 124,636 instances have valid relations. The test dataset has 172,415 instances; 165,974 of the instances belong to the None relation and the remaining 6,441 instances have valid relations. Both the training and test datasets have been created by aligning Freebase (Bollacker et al., 2008) tuples to New York Times articles. (2) Another version created by Hoffmann et al. (2011), which has 24 valid relations and a None relation. We name this dataset NYT11. The corresponding statistics for NYT11 are given in Table 1. The training dataset is created by aligning Freebase tuples to NYT articles, but the test dataset is manually annotated.

4.2 Evaluation Metrics

We use precision, recall, and F1 scores to evaluate the performance of models on relation extraction after removing the None labels. We use a confidence threshold to decide if the relation of a test instance belongs to the set of relations R or None. If the network predicts None for a test instance, then it is considered as None only. But if the network predicts a relation from the set R and the corresponding softmax score is below the confidence threshold, then the final class is changed to None.
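As a small illustration, this decision rule can be written as follows. This is a sketch with hypothetical names (e.g., NONE_IDX); the threshold itself is selected on a validation set, as described next.

    import torch

    NONE_IDX = 0   # hypothetical index of the None label in the output distribution

    def predict_relation(probs, threshold):
        """probs: softmax output r over R ∪ {None} for one test instance."""
        pred = int(torch.argmax(probs))
        # A predicted valid relation whose softmax score falls below the
        # confidence threshold is changed to None.
        if pred != NONE_IDX and probs[pred].item() < threshold:
            return NONE_IDX
        return pred

    # Example: None + 3 relations, with a confidence threshold of 0.5.
    probs = torch.tensor([0.20, 0.45, 0.25, 0.10])
    print(predict_relation(probs, threshold=0.5))   # prints 0, i.e., falls back to None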

This confidence threshold is the one that achieves the highest F1 score on the validation dataset. We also include the precision-recall curves for all the models.

                                        NYT10     NYT11
      # relations                          53        25
Train # instances                     455,412   335,843
      # valid relation tuples         124,636   100,671
      # None relation tuples          330,776   235,172
      avg. sentence length               41.1      37.2
      avg. distance between entity pairs 12.8      12.2
Test  # instances                     172,415     1,450
      # valid relation tuples           6,441       520
      # None relation tuples          165,974       930
      avg. sentence length               41.7      39.7
      avg. distance between entity pairs 13.1      11.0

Table 1: Statistics of the NYT10 and NYT11 datasets.

4.3 Parameter Settings

We run word2vec (Mikolov et al., 2013) on the NYT corpus to obtain the initial word embeddings with dimension d_w = 50 and update the embeddings during training. We set the dimension of the entity token indicator embedding vector d_z = 10 and the positional embedding vector d_u = 5. The hidden layer dimension of the forward and backward LSTM is 60, which is the same as the dimension of the input word representation vector x. The dimension of the Bi-LSTM output is 120. We use f_g = f_e = 230 filters of width k = 3 for feature extraction whenever we apply the convolution operation. We use dropout in our network with a dropout rate of 0.5, and in convolutional layers, we use the tanh activation function. We use the sequence of tokens starting from 5 words before the entity to 5 words after the entity as its context. We train our models with a mini-batch size of 50 and optimize the network parameters using the Adagrad optimizer (Duchi et al., 2011). We use the dependency parser from spaCy (https://spacy.io/) to obtain the dependency distance of the words from the entities and use ws = 5 as the window size for dependency distance-based attention.

4.4 Comparison to Prior Work

We compare our proposed model with the following state-of-the-art models.

(1) CNN (Zeng et al., 2014): Words are represented using word embeddings and two positional embeddings. A convolutional neural network (CNN) with max-pooling is applied to extract the sentence-level feature vector. This feature vector is passed to a feed-forward layer with softmax to classify the relation.

(2) PCNN (Zeng et al., 2015): Words are represented using word embeddings and two positional embeddings. A convolutional neural network (CNN) is applied to the word representations. Rather than applying a global max-pooling operation on the entire sentence, three max-pooling operations are applied on three segments of the sentence based on the location of the two entities (hence the name Piecewise Convolutional Neural Network (PCNN)). The first max-pooling operation is applied from the beginning of the sentence to the end of the entity appearing first in the sentence. The second max-pooling operation is applied from the beginning of the entity appearing first in the sentence to the end of the entity appearing second in the sentence. The third max-pooling operation is applied from the beginning of the entity appearing second in the sentence to the end of the sentence. Max-pooled features are concatenated and passed to a feed-forward layer with softmax to determine the relation.

(3) Entity Attention (EA) (Shen and Huang, 2016): This is the combination of a CNN model and an attention model. Words are represented using word embeddings and two positional embeddings. A CNN with max-pooling is used to extract global features. Attention is applied with respect to the two entities separately. The vector representation of every word is concatenated with the word embedding of the last token of the entity. This concatenated representation is passed to a feed-forward layer with tanh activation and then another feed-forward layer to get a scalar attention score for every word. The original word representations are averaged based on the attention scores to get the attentive feature vectors. A CNN-extracted feature vector and two attentive feature vectors with respect to the two entities are concatenated and passed to a feed-forward layer with softmax to determine the relation.

(4) BiGRU Word Attention (BGWA) (Jat et al., 2017): Words are represented using word embeddings and two positional embeddings. They are passed to a bidirectional gated recurrent unit (BiGRU) (Cho et al., 2014) layer. Hidden vectors of the BiGRU layer are passed to a bilinear operator (a combination of two feed-forward layers) to compute a scalar attention score for each word.
Hidden vectors of the BiGRU layer are multiplied by their corresponding attention scores. A piecewise CNN is applied on the weighted hidden vectors to obtain the feature vector. This feature vector is passed to a feed-forward layer with softmax to determine the relation.

(5) BiLSTM-CNN: This is our own baseline. Words are represented using word embeddings and entity indicator embeddings. They are passed to a bidirectional LSTM. Hidden representations of the LSTMs are concatenated with two positional embeddings. We use CNN and max-pooling on the concatenated representations to extract the feature vector. Also, we use CNN and max-pooling on the word embeddings and entity indicator embeddings of the context words of entities to obtain entity-specific features. These features are concatenated and passed to a feed-forward layer to determine the relation.

                               NYT10                   NYT11
Model                          Prec.  Rec.   F1        Prec.  Rec.   F1
CNN (Zeng et al., 2014)        0.413  0.591  0.486     0.444  0.625  0.519
PCNN (Zeng et al., 2015)       0.380  0.642  0.477     0.446  0.679  0.538†
EA (Shen and Huang, 2016)      0.443  0.638  0.523†    0.419  0.677  0.517
BGWA (Jat et al., 2017)        0.364  0.632  0.462     0.417  0.692  0.521
BiLSTM-CNN                     0.490  0.507  0.498     0.473  0.606  0.531
Our model                      0.541  0.595  0.566*    0.507  0.652  0.571*

Table 2: Performance comparison of different models on the two datasets. * denotes a statistically significant improvement over the previous best state-of-the-art model with p < 0.01 under the bootstrap paired t-test. † denotes the previous best state-of-the-art model.

Figure 3: Precision-Recall curve for the NYT10 dataset.
Figure 4: Precision-Recall curve for the NYT11 dataset.

4.5 Experimental Results

We present the results of our final model on the relation extraction task on the two datasets in Table 2. Our model outperforms the previous state-of-the-art models on both datasets in terms of F1 score. On the NYT10 dataset, it achieves 4.3% higher F1 score compared to the previous best state-of-the-art model EA. Similarly, it achieves 3.3% higher F1 score compared to the previous best state-of-the-art model PCNN on the NYT11 dataset. Our model improves the precision scores on both datasets with good recall scores. This will help to build a cleaner knowledge base with fewer false positives. We also show the precision-recall curves for the NYT10 and NYT11 datasets in Figures 3 and 4 respectively. The goal of any relation extraction system is to extract as many relations as possible with minimal false positives. If the recall score becomes very low, the coverage of the KB will be poor.
From Figure 3, we observe that when the recall score is above 0.4, our model achieves higher precision than all the competing models on the NYT10 dataset. On the NYT11 dataset (Figure 4), when the recall score is above 0.6, our model achieves higher precision than the competing models. Achieving higher precision with high recall score helps to build a cleaner KB with good coverage.

5 Analysis and Discussion

5.1 Varying the number of factors (m)

We investigate the effects of the multi-factor count (m) in our final model on the test datasets in Table 3. We observe that for the NYT10 dataset, m = {1, 2, 3} gives good performance, with m = 1 achieving the highest F1 score. On the NYT11 dataset, m = 4 gives the best performance. These experiments show that the number of factors giving the best performance may vary depending on the underlying data distribution.

         NYT10                   NYT11
m    Prec.  Rec.   F1        Prec.  Rec.   F1
1    0.541  0.595  0.566     0.495  0.621  0.551
2    0.521  0.597  0.556     0.482  0.656  0.555
3    0.490  0.617  0.547     0.509  0.633  0.564
4    0.449  0.623  0.522     0.507  0.652  0.571
5    0.467  0.609  0.529     0.488  0.677  0.567

Table 3: Performance comparison of our model with different values of m on the two datasets.

5.2 Effectiveness of Model Components

We include the ablation results on the NYT11 dataset in Table 4. When we add multi-factor attention to the baseline BiLSTM-CNN model without the dependency distance-based weight factor in the attention mechanism, we get 0.8% F1 score improvement (A2−A1). Adding the dependency weight factor with a window size of 5 improves the F1 score by 3.2% (A3−A2). Increasing the window size to 10 reduces the F1 score marginally (A3−A4). Replacing the attention normalizing function with the softmax operation also reduces the F1 score marginally (A3−A5). In our model, we concatenate the features extracted by each attention layer. Rather than concatenating them, we can apply a max-pooling operation across the multiple attention scores to compute the final attention scores. These max-pooled attention scores are used to obtain the weighted average vector of Bi-LSTM hidden vectors. This affects the model performance negatively and the F1 score of the model decreases by 3.0% (A3−A6).

                              Prec.  Rec.   F1
(A1) BiLSTM-CNN               0.473  0.606  0.531
(A2) Standard attention       0.466  0.638  0.539
(A3) Window size (ws) = 5     0.507  0.652  0.571
(A4) Window size (ws) = 10    0.510  0.640  0.568
(A5) Softmax                  0.490  0.658  0.562
(A6) Max-pool                 0.492  0.600  0.541

Table 4: Effectiveness of model components (m = 4) on the NYT11 dataset.

Figure 5: Performance comparison across different sentence lengths on the NYT10 dataset.
Figure 6: Performance comparison across different sentence lengths on the NYT11 dataset.

5.3 Performance with Varying Sentence Length and Varying Entity Pair Distance

We analyze the effects of our attention model with different sentence lengths in the two datasets in Figures 5 and 6. We also analyze the effects of our attention model with different distances between the two entities in the two datasets in Figures 7 and 8.
Figure 7: Performance comparison across different distances between entities on the NYT10 dataset.
Figure 8: Performance comparison across different distances between entities on the NYT11 dataset.

We observe that with increasing sentence length and increasing distance between the two entities, the performance of all models drops. This shows that finding the relation between entities located far from each other is a more difficult task. Our multi-factor attention model with dependency-distance weight factor increases the F1 score in all configurations when compared to previous state-of-the-art models on both datasets.

6 Related Work

Relation extraction from a distantly supervised dataset is an important task and many researchers (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011) tried to solve this task using feature-based classification models. Recently, Zeng et al. (2014, 2015) used CNN models for this task which can extract features automatically. Shen and Huang (2016) and Jat et al. (2017) used attention mechanism in their model to improve performance. Surdeanu et al. (2012), Lin et al. (2016), Vashishth et al. (2018), Wu et al. (2019), and Ye and Ling (2019) used multiple sentences in a multi-instance relation extraction setting to capture the features located in multiple sentences for a pair of entities. In their evaluation setting, they evaluated model performance by considering multiple sentences having the same pair of entities as a single test instance. On the other hand, our model and the previous models that we compare to in this paper (Zeng et al., 2014, 2015; Shen and Huang, 2016; Jat et al., 2017) work on each sentence independently and are evaluated at the sentence level. Since there may not be multiple sentences that contain a pair of entities, it is important to improve the task performance at the sentence level. Future work can explore the integration of our sentence-level attention model in a multi-instance relation extraction framework.

Not much previous research has exploited the dependency structure of a sentence in different ways for relation extraction. Xu et al. (2015) and Miwa and Bansal (2016) used an LSTM network and the shortest dependency path between two entities to find the relation between them. Huang et al. (2017) used the dependency structure of a sentence for the slot-filling task which is close to the relation extraction task. Liu et al. (2015) exploited the shortest dependency path between two entities and the sub-trees attached to that path (augmented dependency path) for relation extraction. Zhang et al. (2018) and Guo et al. (2019) used graph convolution networks with pruned dependency tree structures for this task.
In this work, we have incorporated the dependency distance of the words in a sentence from the two entities in a multi-factor attention mechanism to improve sentence-level relation extraction.

Attention-based neural networks are quite successful for many other NLP tasks. Bahdanau et al. (2015) and Luong et al. (2015) used attention models for neural machine translation, and Seo et al. (2017) used an attention mechanism for answer span extraction. Vaswani et al. (2017) and Kundu and Ng (2018) used multi-head or multi-factor attention models for machine translation and answer span extraction respectively. He et al. (2018) used a dependency distance-focused word attention model for aspect-based sentiment analysis.

7 Conclusion

In this paper, we have proposed a multi-factor attention model utilizing syntactic structure for relation extraction. The syntactic structure component of our model helps to identify important words in a sentence, and the multi-factor component helps to gather different pieces of evidence present in a sentence. Together, these two components improve the performance of our model on this task, and our model outperforms previous state-of-the-art models when evaluated on the New York Times (NYT) corpus, achieving significantly higher F1 scores.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable and constructive comments on this work.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the International Joint Conference on Artificial Intelligence.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

Zhijiang Guo, Yan Zhang, and Wei Lu. 2019. Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2018. Effective attention modeling for aspect-level sentiment classification. In Proceedings of the 27th International Conference on Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.

Lifu Huang, Avirup Sil, Heng Ji, and Radu Florian. 2017. Improving slot filling performance with attentive neural networks on dependency structures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Sharmistha Jat, Siddhesh Khandelwal, and Partha Talukdar. 2017. Improving distantly supervised relation extraction using word and entity based attention. In Proceedings of the 6th Workshop on Automated Knowledge Base Construction.

Souvik Kundu and Hwee Tou Ng. 2018. A question-focused multi-factor attention network for question answering. In Proceedings of the Association for the Advancement of Artificial Intelligence.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations.

Yatian Shen and Xuanjing Huang. 2016. Attention-based convolutional neural network for semantic relation extraction. In Proceedings of the 26th International Conference on Computational Linguistics.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. Reside: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems.

Shanchan Wu, Kai Fan, and Qiong Zhang. 2019. Improving distantly supervised relation extraction with neural noise converter and conditional optimal selector. In Proceedings of the Association for the Advancement of Artificial Intelligence.

Yuning Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Zhi-Xiu Ye and Zhen-Hua Ling. 2019. Distant supervision relation extraction with intra-bag and inter-bag attentions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of the 25th International Conference on Computational Linguistics.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
