Learning Latent Forests for Medical Relation Extraction
Abstract

The goal of medical relation extraction is to detect relations among entities, such as genes, mutations and drugs, in medical texts. Dependency tree structures have been proven useful for this task. Existing approaches to such relation extraction leverage off-the-shelf dependency parsers to obtain a syntactic tree or forest for the text. However, for the medical domain, low parsing accuracy may lead to error propagation downstream in the relation extraction pipeline. In this work, we propose a novel model which treats the dependency structure as a latent variable and induces it from the unstructured text in an end-to-end fashion. Our model can be understood as composing task-specific dependency forests that capture non-local interactions for better relation extraction. Extensive results on four datasets show that our model is able to significantly outperform state-of-the-art systems without relying on any direct tree supervision or pre-training.

[Figure 1 omitted: three dependency structures over the span "analysis ... phenylalanine hydroxylase catalytic domain ... catechol inhibitors", with arc labels such as nsubj, nmod, dobj, amod and compound.]
Figure 1: (a) 1-best dependency tree; (b) manually labeled gold tree; (c) a dependency forest generated by the LF-GCN model, where the number on each arc indicates the weight of the edge in the forest. We omit some edges for simplicity. The phrases phenylalanine hydroxylase and catechol are a gene entity and a drug entity, respectively.

1 Introduction

With the significant growth of the medical literature, researchers in the area are still required to track it mostly manually. This is an opportunity to automate some of this tracking process, and indeed Natural Language Processing (NLP) techniques have been used to automatically extract knowledge from medical articles. Among these techniques, relation extraction plays a significant role as it facilitates the detection of relations among entities in the medical literature [Peng et al., 2017; Song et al., 2019]. For example, in Figure 1, the sub-clause "human phenylalanine hydroxylase catalytic domain with bound catechol inhibitors", drawn from the CPR dataset [Krallinger et al., 2017], contains two entities, namely phenylalanine hydroxylase and catechol. There is a "down regulator" relation between these two entities, denoted as "CPR:4".

Dependency structures are often used for relation extraction as they are able to capture non-local syntactic relations that are only implicit in the surface form alone [Zhang et al., 2018]. In the medical domain, early efforts leverage graph LSTM [Peng et al., 2017] or graph neural networks [Song et al., 2018; Guo et al., 2019a] to encode the 1-best dependency tree. However, dependency parsing accuracy is relatively low in the medical domain. Figure 1 shows the 1-best dependency tree obtained by Stanford CoreNLP [Manning et al., 2014], where the dependency tree contains an error. In particular, the entity phrase phenylalanine hydroxylase is broken, since the word hydroxylase is mistakenly considered the main verb. To mitigate error propagation when incorporating the dependency structure, Song et al. [2019] construct a dependency forest by adding additional edges with high confidence scores given by a dependency parser trained on the news domain, or by merging the K-best trees [Eisner, 1996] and combining identical dependency edges. Lifeng et al. [2020] jointly train a pre-trained dependency parser [Dozat and Manning, 2017] and a relation extraction model. The dependency forest generated by the parser is a 3-dimensional tensor, with each position representing the conditional probability of one word modifying another word with a relation, which encodes all possible dependency relations with their confidence scores.

Unlike previous research efforts that rely on dependency parsers trained on newswire text, our proposed model treats the dependency parse as a latent variable and induces it in an end-to-end fashion. We build our model based on the mechanism of structured attention [Kim et al., 2017; Liu and Lapata, 2018].

∗ Equally contributed. Work done while Zhijiang was at the University of Edinburgh.
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)
[Figure 2 omitted: the example sentence "Kinetics... human and rat dihydroorotate dehydrogenase by ... atovaquone ..." is fed through multi-head attention (N heads), the Matrix-Tree Theorem (producing N forests), graph neural networks, and finally a classifier.]
Figure 2: Overview of our proposed LF-GCN model. It is composed of M identical blocks, and each block has two components, a forest inducer and a forest encoder. The forest inducer consists of two sub-modules, where the first sub-module computes N attention matrices based on multi-head attention, and the second sub-module takes the N attention matrices as inputs to obtain N dependency forests based on the Matrix-Tree Theorem. Then, the forest encoder uses graph neural networks to encode the induced forests.
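The forest-inducer step in Figure 2 (attention scores turned into edge marginals via the Matrix-Tree Theorem, detailed in Section 2.1) can be illustrated in a few lines of NumPy. This is our sketch of the standard MTT marginal computation [Koo et al., 2007], not the released implementation; the function name and array shapes are assumptions, and indices are 0-based, so the paper's Kronecker delta against index 1 becomes a check against index 0.

```python
import numpy as np

def induce_forest(scores, root_scores):
    """Edge marginals of a non-projective dependency forest via the
    Matrix-Tree Theorem (Koo et al., 2007; Liu and Lapata, 2018).

    scores:      (n, n) unnormalized edge scores, scores[i, j] for head i -> modifier j
    root_scores: (n,)   unnormalized root scores
    Returns (edge marginals (n, n), root marginals (n,)).
    """
    n = scores.shape[0]
    # Eq. (4): non-negative edge weights, zero on the diagonal.
    P = np.exp(scores) * (1.0 - np.eye(n))
    # Eq. (5): Laplacian, column sums on the diagonal, -P off the diagonal.
    L = np.diag(P.sum(axis=0)) - P
    # Eq. (6): replace the first row with exponentiated root scores.
    L_hat = L.copy()
    L_hat[0, :] = np.exp(root_scores)
    inv = np.linalg.inv(L_hat)
    # Eq. (7): marginal probability of each dependency edge.
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = ((j != 0) * P[i, j] * inv[j, j]
                       - (i != 0) * P[i, j] * inv[j, i])
    # Marginal probability of each word being the root.
    root_marg = np.exp(root_scores) * inv[:, 0]
    return A, root_marg
```

Each column of the returned marginal matrix, together with the corresponding root marginal, sums to one: every word distributes exactly one unit of head-probability mass over all candidate heads and the root.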
Using a variant of the Matrix-Tree Theorem [Tutte, 1984; Koo et al., 2007], our model is able to generate task-specific non-projective dependency structures that capture non-local interactions between entities without recourse to any pre-trained parsers or tree supervision. We further construct multiple forests by projecting the representations of nodes to different representation subspaces, allowing the induction of a more informative latent structure for better relation prediction. We name our proposed model LF-GCN, where LF is the abbreviation of latent forests.

Experiments show that our LF-GCN model is able to achieve better performance on various relation extraction tasks. For the sentence-level tasks, our model surpasses the current state-of-the-art models on the CPR dataset [Krallinger et al., 2017] and the PGR dataset [Sousa et al., 2019] by 3.2% and 2.6% in terms of F1 score, respectively. For the cross-sentence tasks [Peng et al., 2017], our model is also consistently better than others, showing its effectiveness on long medical text. We release our code at https://github.com/Cartus/Latent-Forests.

2 Model

Here we present the proposed model, as shown in Figure 2.

2.1 Forest Inducer

Existing approaches leverage a dependency parser trained on newswire text [Song et al., 2019] or a parser fine-tuned for the medical domain [Lifeng et al., 2020] to generate a dependency forest, which is a fully-connected weighted graph. Unlike previous efforts, we treat the forest as a latent variable, which can be learned from the target dataset in an end-to-end manner. Inspired by Kim et al. [2017] and Liu and Lapata [2018], we use a variant of Kirchhoff's Matrix-Tree Theorem (MTT) [Tutte, 1984; Koo et al., 2007; Smith and Smith, 2007] to induce the latent structure of an input sentence. Such a latent structure can be viewed as multiple full dependency forests, which efficiently represent all possible dependency trees within a compact and dense structure.

Given a sentence s = w_1, ..., w_n, where w_i represents the i-th word, we define a graph G on n nodes, where each node refers to a word in s, and the edge (i, j) refers to the dependency between the i-th word (head) and the j-th word (modifier). We denote the contextual output of the sentence as h = h_1, ..., h_n ∈ R^{n×d}, where h_i ∈ R^d represents the hidden state of the i-th word with dimension d. We use a bidirectional LSTM to obtain the contextual representations of the sentence.

For the graph G, MTT takes the edge scores and root scores as inputs and then generates a latent forest by computing the marginal probability of each edge. Given the input h and a weight vector θ_k ∈ R^m of dependencies, where m represents the number of dependencies for the k-th (k ∈ [1, N]) latent structure, inducing the k-th latent forest for the input h amounts to finding the latent variables z^k_{ij}(h, θ_k) for all edges satisfying i ≠ j, together with the root node, whose index equals 0.

The k-th latent forest induced by MTT contains many non-projective dependency trees, which are denoted by T_k. Let P(y | h; θ_k) denote the conditional probability of a tree y over T_k. Following the formulation of Koo et al. [2007], the marginal probability of a dependency edge from the i-th word to the j-th word in the k-th forest can be expressed as:

    P(z^k_{ij} = 1) = Σ_{y ∈ T_k : (i,j) ∈ y} P(y | h; θ_k)    (1)

We derive the marginal probabilities expressed in Equation (1) in two steps.

Computing Attention Scores

Inspired by Vaswani et al. [2017], we calculate the edge scores with the multi-head attention mechanism, which allows the model to jointly attend to information from different representation subspaces. N attention matrices are fed into the MTT to obtain N latent forests, in order to capture different dependencies in different representation subspaces. The attention matrix for the k-th head is calculated as a function of the query Q with the corresponding key K. Here Q and K are both equal to the contextual representation h. We project Q and K to different representation subspaces in order to generate N attention matrices for calculating N latent forests. Formally, the attention matrix S^k for the k-th forest is given by:

    S^k = softmax((Q W^Q)(K W^K)^T / √d)    (2)
where W^Q ∈ R^{d×d} and W^K ∈ R^{d×d} are projection parameters. S^k_{ij} denotes the normalized attention score between the i-th token and the j-th token, computed from h_i and h_j. Then, we compute the root score r^k_i, which represents the normalized probability of the i-th node being selected as the root node of the k-th forest:

    r^k_i = W^k_r h_i    (3)

where W^k_r ∈ R^{1×d} is the weight for the projection.

Imposing Structural Constraints

Following Koo et al. [2007] and Smith and Smith [2007], we calculate the marginal probability of each dependency edge of the k-th latent forest by injecting a structural bias into S^k. We assign non-negative weights P^k ∈ R^{n×n} to the edges as:

    P^k_{ij} = 0 if i = j;  exp(S^k_{ij}) otherwise    (4)

where P^k_{ij} is the weight of the edge between the i-th and the j-th node. We define the Laplacian matrix L^k ∈ R^{n×n} of G in Equation (5), and its variant L̂^k ∈ R^{n×n} in Equation (6):

    L^k_{ij} = Σ_{i'=1..n} P^k_{i'j} if i = j;  −P^k_{ij} otherwise    (5)

    L̂^k_{ij} = exp(r^k_j) if i = 1;  L^k_{ij} if i > 1    (6)

We use A^k_{ij} to denote the marginal probability of the dependency edge between the i-th node and the j-th node. Then, A^k_{ij} can be derived as:

    A^k(z_{ij} = 1) = (1 − δ_{1,j}) P^k_{ij} [(L̂^k)^{−1}]_{jj} − (1 − δ_{i,1}) P^k_{ij} [(L̂^k)^{−1}]_{ji}    (7)

where δ is the Kronecker delta, and A^k ∈ R^{n×n} can be interpreted as a weighted adjacency matrix for the k-th forest. Now we can feed A ∈ R^{N×n×n} into the forest encoder to update the representations of the nodes in the latent structure.

Adaptive Pruning Strategy

Prior dependency-based models [Zhang et al., 2018] also propose rule-based methods that prune the dependency tree to further improve relation classification performance. However, our weighted adjacency matrix A is derived from a continuous relaxation, and the induced structures are not discrete, so the existing rule-based pruning methods are not applicable. Instead, we use α-entmax [Blondel et al., 2018; Correia et al., 2019] to remove irrelevant information by imposing sparsity constraints on the adjacency matrix. α-entmax is able to assign exactly zero weights; therefore, an unnecessary path in the induced latent forests will not be considered by the forest encoder. Our soft pruning strategy is described as:

    A^k = α-entmax(A^k)    (8)

where α is a parameter controlling the sparsity of each adjacency matrix. When α = 2, entmax recovers the sparsemax mapping [Martins and Astudillo, 2016]. When α = 1, it recovers the softmax mapping. Correia et al. [2019] propose an exact algorithm to learn α automatically. Here we apply α-entmax to each of the N latent forests, which enables the model to develop different pruning strategies for different latent forests.

2.2 Forest Encoder

Given the N latent forests generated by the forest inducer, we encode them using densely-connected graph convolutional networks [Kipf and Welling, 2017; Guo et al., 2019b]. Formally, given the k-th latent forest, represented by the adjacency matrix A^k, the convolution for the i-th node at the l-th layer takes the representation h^{l−1}_i from the previous layer as input and outputs the updated representation h^l_{ki}:

    h^l_{ki} = σ(Σ_{j=1..n} A^k_{ij} W^l_k h^{l−1}_j + b^l_k)    (9)

where W^l_k and b^l_k are the weight matrix and bias vector for the k-th latent forest in the l-th layer, respectively, and σ is an activation function. h^0_i ∈ R^d is the initial contextual representation of the i-th node. Then a linear combination layer is used to integrate the representations of the N latent forests:

    h_comb = W_comb h_out + b_comb    (10)

where h_out is obtained by concatenating the outputs of the N separate convolutional layers, i.e., h_out = [h^(1); ...; h^(N)] ∈ R^{d×N}. W_comb ∈ R^{(d×N)×d} is a weight matrix and b_comb is a bias vector for the linear transformation.

3 Experiments

3.1 Data

We evaluate our LF-GCN model on four datasets across two tasks, namely cross-sentence n-ary relation extraction and sentence-level relation extraction.

For cross-sentence n-ary relation extraction, we use two datasets generated by Peng et al. [2017], which contain 6,987 ternary relation instances and 6,087 binary relation instances extracted from PubMed. The relation label has five categories, e.g., "sensitivity", "resistance" and "none". Following Song et al. [2018], we define two sub-tasks for a more detailed evaluation, i.e., binary-class n-ary relation extraction and multi-class n-ary relation extraction. For binary-class extraction, we cluster the four relation classes as "Yes" and treat the label "None" as "No".

For sentence-level relation extraction, we follow the experimental settings of Lifeng et al. [2020] on BioCreative VI CPR (CPR) [Krallinger et al., 2017] and Phenotype-Gene Relation (PGR) [Sousa et al., 2019]. The CPR dataset contains relations between chemical compounds and human proteins. It has 16,107 training, 10,030 development and 14,269 testing instances, with five regular relations, such as "CPR:3" and "CPR:9", plus a "None" relation. PGR introduces relations between human phenotypes and human genes; it contains 11,780 training instances and 219 test instances, with the binary relation labels "Yes" and "No".
Syntax Type   Model                          Binary-class                    Multi-class
                                             Ternary        Binary          Ternary   Binary
                                             Single  Cross  Single  Cross   Cross     Cross
Full Tree     DAG LSTM [Peng et al., 2017]   77.9    80.7   74.3    76.5    -         -
              GRN [Song et al., 2018]        80.3    83.2   83.5    83.6    71.7      71.7
              GCN [Zhang et al., 2018]       84.3    84.8   84.2    83.6    77.5      74.3
Pruned Tree   GCN [Zhang et al., 2018]       85.8    85.8   83.8    83.7    78.1      73.6
Forest        AGGCN [Guo et al., 2019a]      87.1    87.0   85.2    85.6    79.7      77.4
              LF-GCN (Ours)                  88.0    88.4   86.7    87.1    81.5      79.3

Table 1: Average test accuracies on the Peng et al. [2017] dataset for binary-class and multi-class n-ary relation extraction. "Ternary" and "Binary" denote drug-gene-mutation tuples and drug-mutation pairs, respectively. "Single" and "Cross" indicate whether the entities of a relation reside in a single sentence or span multiple sentences.
We also use the SemEval-2010 Task 8 dataset [Hendrickx et al., 2009] from the news domain to evaluate the generalization capability of our model. It has 10,717 instances with 9 types of relations and a special "Other" relation.

3.2 Settings

We tune the hyper-parameters according to the results on the development sets. For the cross-sentence n-ary relation extraction task, we use the same data splits as Song et al. [2018], a stochastic gradient descent optimizer with a 0.9 decay rate, and 300-dimensional GloVe embeddings. The hidden sizes of both the BiLSTM and the GCNs are set to 300.

For the cross-sentence n-ary task, we report the test accuracy averaged over five cross-validation folds [Song et al., 2018]. For the sentence-level task, we report F1 scores [Lifeng et al., 2020].

3.3 Results on Cross-Sentence n-ary Task

To verify the effectiveness of the model in predicting inter-sentential relations, we compare LF-GCN against state-of-the-art systems on the cross-sentence n-ary relation extraction task [Peng et al., 2017], as shown in Table 1. Previous systems using the same syntax type are grouped together.

Full Tree: models that use the 1-best dependency graph constructed by connecting the roots of the dependency trees corresponding to the input sentences. DAG LSTM encodes the graph using a graph-structured LSTM, while GRN and GCN encode it using graph recurrent networks and graph convolutional networks, respectively.

Pruned Tree: a model with pruned trees as inputs, whose dependency nodes and edges are removed based on rules [Zhang et al., 2018]. A GCN is used to encode the resulting structure.

Forest: models that construct multiple fully-connected weighted graphs based on multi-head attention [Vaswani et al., 2017], where each graph can be viewed as a dependency forest.

Models with pruned trees as inputs tend to achieve higher results than models with full trees. Intuitively, longer sentences in the cross-sentence task correspond to more complex dependency structures, and using an out-of-domain parser may introduce more noise to the model. Removing the irrelevant nodes and edges of the parse tree enables the model to make better predictions. However, a rule-based pruning strategy [Zhang et al., 2018] may not yield optimal performance. In contrast, LF-GCN induces the dependency structure automatically, which can be viewed as a soft pruning strategy learned from the data. Compared to GCN models, our model obtains 2.2 and 2.6 points of improvement over the best-performing model with pruned trees for ternary relation extraction. For binary relation extraction, our model achieves accuracies of 86.7 and 87.1 under the Single and Cross settings, respectively, surpassing the state-of-the-art AGGCN model. We believe that our LF-GCN is able to distill relevant information and filter out noise from the representation for better prediction.

Syntax Type   Model                                  F1
None          Random-DDCNN [Lifeng et al., 2020]     45.4∗
              Att-GRU [Liu et al., 2017]             49.5∗
              Bran [Verga et al., 2018]              50.8∗
Tree          GCN [Zhang et al., 2018]               52.2∗
              Tree-DDCNN [Lifeng et al., 2020]       50.3∗
              Tree-GRN [Lifeng et al., 2020]         51.4∗
Forest        Edgewise-GRN [Song et al., 2019]       53.4∗
              KBest-GRN [Song et al., 2019]          52.4∗
              AGGCN [Guo et al., 2019a]              56.7∗
              ForestFT-DDCNN [Lifeng et al., 2020]   55.7∗
              LF-GCN (Ours)                          58.9∗

Table 2: Test results on the CPR dataset. Results for AGGCN and GCN are reproduced based on their released implementations.

3.4 Results on Sentence-Level Task

To examine LF-GCN on the sentence-level task, we compare LF-GCN with state-of-the-art models on two medical datasets, i.e., CPR [Krallinger et al., 2017] and PGR [Sousa et al., 2019]. These systems are grouped into three types based on the syntactic structure used, as shown in Table 2 and Table 3. Results labeled with "∗" are obtained from re-trained models using their released implementations, as published results are not available for the dataset.

None: models that do not use any pre-trained parser. Random-DDCNN uses a randomly initialized parser [Dozat and Manning, 2017] fine-tuned by the relation prediction loss. Att-GRU stacks a self-attention layer on top of gated recurrent units, and Bran relies on a bi-affine self-attention model to capture the interactions in the sentence. BioBERT is a pre-trained language representation model for biomedical text.

Tree: models that use the 1-best dependency tree. Full trees are encoded by GCN, GRN and DDCNN, respectively. BO-LSTM only encodes words on the shortest dependency path.
References

[Blondel et al., 2018] Mathieu Blondel, André F. T. Martins, and Vlad Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In AISTATS, 2018.

[Chu and Liu, 1965] Yoeng-Jin Chu and Tseng-Hong Liu. On the shortest arborescence of a directed graph. Scientia Sinica, 1965.

[Correia et al., 2019] Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. Adaptively sparse transformers. In EMNLP, 2019.

[Dozat and Manning, 2017] Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. In ICLR, 2017.

[Eisner, 1996] Jason M. Eisner. Three new probabilistic models for dependency parsing: An exploration. In COLING, 1996.

[Guo et al., 2019a] Zhijiang Guo, Yan Zhang, and Wei Lu. Attention guided graph convolutional networks for relation extraction. In ACL, 2019.

[Guo et al., 2019b] Zhijiang Guo, Yan Zhang, Zhiyang Teng, and Wei Lu. Densely connected graph convolutional networks for graph-to-sequence learning. TACL, 2019.

[Hendrickx et al., 2009] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In NAACL-HLT, 2009.

[Kim et al., 2017] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In ICLR, 2017.

[Kipf and Welling, 2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[Koo et al., 2007] Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. Structured prediction models via the matrix-tree theorem. In EMNLP, 2007.

[Krallinger et al., 2017] Martin Krallinger, Obdulia Rabal, Saber A. Akhondi, et al. Overview of the BioCreative VI chemical-protein interaction track. In BioCreative Challenge Evaluation Workshop, 2017.

[Lamurias et al., 2019] Andre Lamurias, Diana Sousa, Luka A. Clarke, and Francisco M. Couto. BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies. BMC Bioinformatics, 2019.

[Lee et al., 2019] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.

[Lifeng et al., 2020] Jin Lifeng, Linfeng Song, Yue Zhang, Kun Xu, Weiyun Ma, and Dong Yu. Relation extraction exploiting full dependency forests. In AAAI, 2020.

[Liu and Lapata, 2018] Yang Liu and Mirella Lapata. Learning structured text representations. TACL, 2018.

[Liu et al., 2017] Sijia Liu, Feichen Shen, Yanshan Wang, Majid Rastegar-Mojarad, Ravikumar Komandur Elayavilli, Vipin Chaundary, and Hongfang Liu. Attention-based neural networks for chemical protein relation extraction. In Proceedings of the BioCreative VI Workshop, 2017.

[Manning et al., 2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Demo@ACL, 2014.

[Martins and Astudillo, 2016] André F. T. Martins and Ramón Fernández Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In ICML, 2016.

[Nan et al., 2020] Guoshun Nan, Zhijiang Guo, Ivan Sekulic, and Wei Lu. Reasoning with latent structure refinement for document-level relation extraction. arXiv preprint, 2020.

[Niculae et al., 2018] Vlad Niculae, André F. T. Martins, and Claire Cardie. Towards dynamic computation graphs via sparse latent structure. In EMNLP, 2018.

[Peng et al., 2017] Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. Cross-sentence n-ary relation extraction with graph LSTMs. TACL, 2017.

[Smith and Smith, 2007] David A. Smith and Noah A. Smith. Probabilistic models of nonprojective dependency trees. In EMNLP, 2007.

[Song et al., 2018] Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. N-ary relation extraction using graph state LSTM. In EMNLP, 2018.

[Song et al., 2019] Linfeng Song, Yue Zhang, Daniel Gildea, Mo Yu, Zhiguo Wang, and Jinsong Su. Leveraging dependency forest for neural medical relation extraction. In EMNLP, 2019.

[Sousa et al., 2019] Diana Sousa, André Lamúrias, and Francisco M. Couto. A silver standard corpus of human phenotype-gene relations. In NAACL-HLT, 2019.

[Tutte, 1984] William Thomas Tutte. Graph Theory. 1984.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[Verga et al., 2018] Patrick Verga, Emma Strubell, and Andrew McCallum. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In NAACL-HLT, 2018.

[Yogatama et al., 2016] Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. Learning to compose words into sentences with reinforcement learning. In ICLR, 2016.

[Zhang et al., 2018] Yuhao Zhang, Peng Qi, and Christopher D. Manning. Graph convolution over pruned dependency trees improves relation extraction. In EMNLP, 2018.