Hume: Domain-Agnostic Extraction of Causal Analysis Graphs (AD1189441)
Report Date: 12-07-2022
Report Type: Final Report
Dates Covered: 1-Dec-2017 to 31-Mar-2022
Title: Final Report: Hume: Domain-Agnostic Extraction of Causal Analysis Graphs
Contract Number: W911NF-18-C-0003
Name of Responsible Person: Yee Seng Chan
RPPR Final Report as of 27-Jul-2022
Agency Code: 21XD
Major Goals: Report developed under contract W911NF-18-C-0003. Raytheon BBN developed Hume, a system
which builds qualitative, causal analysis graphs (CAGs) by reading text (textbooks, academic literature,
government/other reports, encyclopedias, news and online sources). Specifically, during this effort we:
• Developed tools that extract events and causal factors from text using Deep Neural Networks (DNN).
• Developed tools that extract causal relations from text using DNN. The causal relation types are intended to be
domain-agnostic.
• Developed clustering tools to aid analysts in constructing and enriching ontologies in new domains.
• Participated in internal evaluations, technology assessments, and collaborative experiments.
Accomplishments: Report developed under contract W911NF-18-C-0003. Raytheon BBN developed Hume, a
system which builds qualitative, causal analysis graphs (CAGs) by reading text (textbooks, academic literature,
government/other reports, encyclopedias, news and online sources). Specifically, during this effort we:
• Developed tools that extract events and causal factors from text using Deep Neural Networks (DNN).
• Developed tools that extract causal relations from text using DNN. The causal relation types are intended to be
domain-agnostic.
• Developed clustering tools to aid analysts in constructing and enriching ontologies in new domains.
• Participated in internal evaluations, technology assessments, and collaborative experiments.
PARTICIPANTS:
Partners
Jessica MacBride
Raytheon BBN Technologies
10 Moulton Street
Cambridge, MA, 02138
Prepared for:
Dr. Joshua Elliott
Report Date: 3/31/2022
Report Type: Final Report
Dates Covered: 12/1/2017 - 3/31/2022
Title: Hume: Domain-Agnostic Extraction of Causal Analysis Graphs
Contract Number: W911NF-18-C-0003
LIST OF FIGURES
Figure 1: Base ontology extraction using BBN's SERIF and TAC-KBP system components.
Figure 2: A CNN-based model for event trigger and event argument classification.
Figure 3: A user interface that allows a user to provide, expand, and filter event triggers for new types.
Figure 4: LearnIt workflows.
Figure 5: The web-based user interface of LearnIt.
Figure 6: The TransE+BiLSTM model for learning event and phrase embeddings.
Figure 7: Architecture of FCDNN.
LIST OF TABLES
ORGANIZATION), 30+ entity-entity relations (e.g. Employee-or-Member-of), and over 300 classes
of events (e.g. CONFLICT, DEMONSTRATE, PROVIDEAID). An example of TAC extractions con-
forming to the TAC ontology appears in Figure 1.
Figure 1: Base ontology extraction using BBN's SERIF and TAC-KBP system components.
These tools have been applied to formal and informal data for knowledge base population and as
input to forecasting tools. A knowledge base is created from text by labeling and co-referencing
entities, relations and events in a corpus (Min, Freedman, Meltzer, 2017). Where possible, entities
are linked to an external resource (e.g. Wikipedia as in Figure 1). When no external link is availa-
ble, entities are clustered and given a system provided global ID (i.e. the system assigns the same
ID to J.A. Smith as Jane Smith when the mentions reference the same person). In the resulting
knowledge base, each fact is associated with a confidence and set of text snippets that provide
evidence of the fact. A user (or downstream algorithm) can query the knowledge base for both the
ontologized fact and its justification(s).
1. https://www.nist.gov/tac/2016/KBP
2. https://en.wikipedia.org/wiki/Conflict_and_Mediation_Event_Observations
To adapt these strong baseline capabilities for the World Modelers domains, we developed heuris-
tic rules to map entities, relations, and events into the types in the World Modelers ontology. Large
portions of types in the existing ontologies (e.g. ACE, CAMEO and TAC KBP) have been mapped
into the World Modelers ontology. This provides a strong starting point for knowledge organiza-
tion.
Event argument extraction: Current argument examples, such as those defined in ACE, are event
type specific. For instance, the ACE corpus annotates Agent and Victim arguments for Injure
events, Attacker and Target arguments for Attack events, etc. To decode event arguments for new
event types, one needs to annotate new event-type-specific argument examples as training data.
We developed an approach to learn a generic event argument model to extract Actor, Place, and
Time arguments for any new event types, without annotating new examples. We defined Actor as
a coarse-grained event argument role, encompassing Agent-like and Patient-like event roles. We
mapped Actor-like argument roles in ACE to a common Actor role label, and used the Place and
Time arguments as they appear in ACE. The complete list of ACE event argument roles that we
mapped to Actor are:
• Agent, Artifact, Adjudicator, Victim, Buyer, Seller, Giver, Recipient, Org, Attacker, Target,
Entity, Defendant, Person, Plaintiff, Prosecutor
Using the above mapping approach, we trained a generic event argument classifier that can extract
Actor, Place, and Time arguments for any event type.
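To make the mapping concrete, a minimal sketch of the role-coarsening step follows (Python). The role inventory is taken from the list above; the function name and the handling of unmapped roles are our illustrative assumptions.

    # A minimal sketch of the role coarsening described above. The role
    # inventory comes from the list in the text; names are illustrative.
    ACTOR_ROLES = {
        "Agent", "Artifact", "Adjudicator", "Victim", "Buyer", "Seller",
        "Giver", "Recipient", "Org", "Attacker", "Target", "Entity",
        "Defendant", "Person", "Plaintiff", "Prosecutor",
    }

    def coarsen_role(ace_role):
        """Map a fine-grained ACE argument role to a generic role label."""
        if ace_role in ACTOR_ROLES:
            return "Actor"
        if ace_role in ("Place", "Time"):
            return ace_role  # Place and Time are used as they appear in ACE
        return None  # roles outside the generic inventory are not mapped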
each phrase, we obtain a ranked list (in terms of cosine similarity) of the top-k most similar phrases
as input to the clustering. To cluster the phrases, we employ an algorithm based on the CBC algo-
rithm (Clustering By Committee) (Pantel, Lin, 2002), which uses average link agglomerative clus-
tering (Schütze et al., 2008, Ch. 17) to recursively form cohesive clusters that are dissimilar to one
another. For each cluster c that is formed, the algorithm assigns a score: |c|× avgsim(c), where |c|
is the number of members of c and avgsim(c) is the average pairwise cosine similarity between
members. This score reflects a preference for larger, more cohesive clusters. We then rank the
clusters in decreasing order of their cluster scores, presenting the highest-scoring clusters to the
user first for selection and addition to the taxonomy.
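A minimal sketch of this scoring step, assuming phrase embeddings are available as NumPy vectors:

    import numpy as np
    from itertools import combinations

    def avgsim(vectors):
        """Average pairwise cosine similarity among cluster members."""
        unit = [v / np.linalg.norm(v) for v in vectors]
        sims = [float(u @ w) for u, w in combinations(unit, 2)]
        return sum(sims) / len(sims) if sims else 0.0

    def cluster_score(vectors):
        """|c| * avgsim(c): prefers clusters that are both large and cohesive."""
        return len(vectors) * avgsim(vectors)

    # Rank clusters (each a list of phrase embeddings) for presentation:
    # ranked = sorted(clusters, key=cluster_score, reverse=True)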
3.3.1 Grounding Clusters to an Ontology
Once the words/phrases have been clustered by CBC, our system attempts to ground each cluster
to the most similar concept in the user’s existing ontology. We compute similarity between a
cluster (essentially a list of words) and an ontology concept by generating a representative contextual
embedding for each cluster and ontology node and then computing pairwise cosine similarity over
the vector representations.
For each cluster c, we produce a representative contextual embedding by computing the mean of
the contextual embeddings of the member words of c. We compute the contextual embedding of each
individual cluster member ci by taking the mean contextual embedding of a sample of occurrences
of that word across the seed corpus.
For each ontology concept, we first attempt to generate a representation by averaging the contex-
tual embeddings of manually annotated instances of the concept. In the case where no manually
annotated instances exist, we instead build a representation based on example terms provided as
metadata in the World Modelers ontology format.
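The grounding step can be sketched as follows; the dictionary of per-concept instance embeddings is an illustrative stand-in for the annotated instances or example terms described above.

    import numpy as np

    def representative(vectors):
        """Representative embedding: the mean of member embeddings."""
        return np.mean(np.stack(vectors), axis=0)

    def ground_cluster(cluster_vectors, concept_vectors):
        """Return the ontology concept most cosine-similar to the cluster.
        `concept_vectors` maps each concept name to embeddings of its
        annotated instances (or of its example terms when none exist)."""
        c = representative(cluster_vectors)
        c = c / np.linalg.norm(c)
        best, best_sim = None, -1.0
        for concept, vecs in concept_vectors.items():
            k = representative(vecs)
            sim = float(c @ (k / np.linalg.norm(k)))
            if sim > best_sim:
                best, best_sim = concept, sim
        return best, best_sim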
3.3.2 Cluster Quality vs Efficiency Tradeoffs
In our initial clustering implementation, we leveraged BERT (Devlin et al., 2019) contextual em-
beddings. However, the interactive nature of a human-in-the-loop tool led us to explore using static
word embeddings or contextual embeddings with fewer parameters, as a way to improve the over-
all clustering runtime. We analyzed clustering outcomes using two varieties of static word embed-
dings (Baroni et al., 2014, Pennington et al, 2014), as well as contextual embeddings from Distil-
BERT (Sanh et al., 2020), and TinyBERT (Jiao et al., 2020). Our qualitative analysis indicated
that using DistilBERT provides a good balance between runtime efficiency and clustering perfor-
mance with little impact on the accuracy of our automatic mapping of clusters to ontology nodes
when compared to using BERT.
In experiments using GPUs we found that DistilBERT cut clustering runtime in half (vs using
BERT embeddings). We looked to further speed this up and make it feasible to run on more widely-
available CPUs, by experimenting with TinyBERT, quantization of weights, and using single
batches with no subword padding. Quantization refers to the technique of using 8-bit integers
instead of 32-bit floating-point numbers to represent network weights. No padding means that
instead of padding all input sentences to a pre-determined length, we perform inference on
unpadded inputs with a batch size of one.
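A minimal sketch of this CPU configuration, assuming the Hugging Face Transformers and PyTorch libraries; the specific checkpoint name and the mean-pooling step are our illustrative assumptions, not the report's stated setup.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased").eval()

    # Dynamic quantization: Linear-layer weights stored as 8-bit integers,
    # dequantized on the fly during matrix multiplication on CPU.
    model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    @torch.no_grad()
    def embed(sentence):
        # Batch size of one, tokenized to its natural length (no padding).
        encoded = tokenizer(sentence, return_tensors="pt")
        hidden = model(**encoded).last_hidden_state   # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0)          # mean-pooled embedding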
3. https://github.com/spotify/annoy
We conducted experiments along two dimensions:
1. Using GPU, CPU, CPU + quantization, or CPU + quantization + no padding
2. Using DistilBERT, or one of two variants of TinyBERT
The results of our experiments, shown in Table 4, demonstrate that although moving inference
from GPU to CPU results in a 13-fold increase in runtime, utilizing quantization reduces that to a
7.5-fold increase, and further removing padding reduces it to 1.5-fold. We could achieve further
time savings by using TinyBERT, but that change could impact cluster quality. We ultimately
decided to continue using DistilBERT, with quantization and no padding in place, for a very
reasonable 1.5-fold increase in inference time when using CPU versus GPU.
ture that connects the pair of events. For example, “verb:lead[sub=0] [to=1]” is the proposition
pattern counterpart of “0 lead to 1”.
LearnIt uses a large, unannotated corpus for development purpose. For experiments in this section,
we used 1.5 million documents from the English Gigaword corpus (Parker et al., 2011). For all
4. The left and right arguments of a relation are numbered 0 and 1, respectively. We focused on binary relations.
sentences in the development corpus, we ran the SERIF (Boschee et al., 2005) NLP stack to generate
predicate-argument structures of the form $p(r_1{:}a_1, r_2{:}a_2, \ldots, r_n{:}a_n)$, in which $p$ is the
predicate (part-of-speech + word) and $r_i$ is the role (e.g., subject sub or object obj) of the i-th argument
$a_i$. An argument is an event trigger word. Following Richer Event Description (O’Gorman et
al., 2016), we adopted a broad definition for Event: an event can be any occurrence, action, process
or event state. In practice, we tagged all predicate-like verbs and nominalizations as event triggers.
The basic workflows of LearnIt are summarized in Figure 4. The LearnIt system incorporates two
workflows, bootstrapping and iterative pattern/pair set expansion, in its iterative learning process.
LearnIt also allows flexible compositions of these two workflows, mediated by the user, to allow
more effective use of users’ effort. The learning process is guided by a small amount of user effort
provided via a User Interface (UI).
Workflow 1: Bootstrapping: LearnIt incorporates bootstrapping (Agichtein, Gravano, 2000; Yu,
Agichtein, 2003; Gupta et al., 2018) for relation extraction. The process works as follows: Given
a handful of initial event pairs that are known to express the target relation, LearnIt searches in a
development corpus to find instances (sentences with a pair of events) that match the known event
pairs. From these instances, LearnIt extracts relational patterns, ranks and presents them to the
user. The user then selects patterns that express the target relation. These patterns will be added
into the known pattern set. Similarly, given a set of known patterns, LearnIt again searches in the
corpus to find matched instances, from which it extracts additional event pairs, ranks and presents
them to the user. The user will select event pairs that express the target relation. The user can
perform multiple iterations of bootstrapping. A complete iteration is illustrated by the blue arrows
and blue text in Figure 4.
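The loop below sketches one such iteration over a simplified corpus representation; the (pair, pattern) tuples and the accept stub standing in for interactive user vetting are our assumptions, not LearnIt internals.

    # One bootstrapping iteration over a toy corpus representation: each
    # instance is a (pair, pattern) tuple recording which event pair a
    # relational pattern connected in some sentence. Ranking is omitted;
    # in LearnIt both selection steps are interactive.
    def bootstrap_iteration(instances, known_pairs, known_patterns, accept):
        """instances: iterable of (pair, pattern); accept: user-vetting stub."""
        # Pairs -> patterns: harvest patterns from instances matching known pairs.
        candidates = {pat for pair, pat in instances if pair in known_pairs}
        known_patterns |= {p for p in candidates if accept(p)}
        # Patterns -> pairs: the symmetric step.
        candidates = {pair for pair, pat in instances if pat in known_patterns}
        known_pairs |= {p for p in candidates if accept(p)}
        return known_pairs, known_patterns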
rejected and unknown patterns are shown in green, red and white backgrounds respectively.
• Pairs (right): Displays pairs, ranked by a scoring method chosen by the user. Accepted⁶,
rejected and unknown event pairs are shown in green, red and white backgrounds respectively.
• Instances (bottom): When a user clicks on a pattern or a pair, this pane displays a list of
instances matched by the pattern or the pair.
As described in the workflows, a user can
• Find new patterns/pairs through keyword search, by clicking on the ”ADD NEW” button
on the pattern/pair pane and then typing in keywords in an edit box. LearnIt will then return a
ranked list of matching patterns/pairs found in the corpus.
• Perform bootstrapping by clicking on the ”PROPOSE” button on the pattern/pair pane to
search for patterns from known pairs, or pairs from patterns, by exploring their shared in-
stances.
• Perform self-expansion of pattern/pair sets by clicking on the ”SIMILAR” button to search
for similar patterns given known patterns, or pairs given known pairs.
5. Accepted by the user as good patterns.
6. Accepted by the user as good pairs.
Figure 5: The web-based user interface of LearnIt.
Bootstrapping: A key ingredient to bootstrapping-based relation extraction systems is a scoring
function, which ranks unknown patterns and pairs given known pairs and patterns respectively.
An ideal function will rank patterns/pairs according to their true coverage and precision, leading
to minimal human effort required for bootstrapping. In practice, most ranking functions will end
up producing suboptimal results, given the limited amount of initial input from the user.
We allow the user to choose from a few scoring methods for patterns or pairs. A user can choose
a recall-driven scoring function to give the learning process a warm start, and then switch to a
precision-driven function so that he or she does not have to spend a lot of time reviewing a large
number of highly specific patterns.
We implemented the following well-known scoring metrics (Agichtein, Gravano, 2000; Yu,
Agichtein, 2003) for patterns. At any given iteration, let $P$, $N$, $U$ be the number of positive,
negative, and unknown pairs respectively. The system updates the following scores for a pattern:
• Frequency: defined as $P$, the number of known event pairs the pattern matches.
• Precision: $C_{prec} = P/(P + w_{neg} \cdot N + w_{unk} \cdot U)$. $w_{neg}$ and $w_{unk}$ are the relative weights
of negative and unknown examples. We set them to 0.5 and 0.1 respectively.
• Frequency-weighted precision: $C'_{prec} = C_{prec} \cdot \log_2 P$.
Similarly, the system updates the following scores for each pair given the current parameterization
of all other variables:
• Frequency: defined as $|Patterns|$, the number of known patterns matching the event pair.
• Precision: $C_{prec} = 1 - \prod_{i=0}^{|Patterns|} \left(1 - C_{prec}(pattern_i)\right)$, in which $pattern_i \in Patterns$, the set of
patterns extracted from the instances matched by the pair.
• Frequency-weighted precision: $C'_{pair} = C_{pair} \cdot \log_2 |Patterns|$.
LearnIt dynamically updates these scores given its latest view of known patterns and pairs in each
iteration. The user interface allows the user to choose one of the scoring functions to rank patterns
or pairs.
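A direct transcription of these scoring functions, assuming per-pattern match counts are already available:

    import math

    W_NEG, W_UNK = 0.5, 0.1  # relative weights used in the text

    def pattern_precision(p, n, u):
        """C_prec = P / (P + w_neg*N + w_unk*U) for a pattern matching
        P positive, N negative, and U unknown event pairs."""
        return p / (p + W_NEG * n + W_UNK * u)

    def freq_weighted_precision(precision, frequency):
        """C'_prec = C_prec * log2(frequency); assumes frequency >= 1."""
        return precision * math.log2(frequency)

    def pair_precision(matching_pattern_precisions):
        """Noisy-or combination: 1 - prod(1 - C_prec(pattern_i)) over the
        patterns extracted from the instances matched by the pair."""
        prod = 1.0
        for c in matching_pattern_precisions:
            prod *= 1.0 - c
        return 1.0 - prod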
Iterative Expansion of Pattern/Pair Sets: We applied TransE (Bordes et al., 2013), a relational
graph embedding model, to embed events and relational patterns into a continuous vector space.
Let $\langle e_i, r_k, e_j \rangle$ be a relational triple in which event i is related to event j through relational pattern k⁸.
The model scores a triple based on a distance function $d(e_i + r_k, e_j) = \lVert e_i + r_k - e_j \rVert$⁷ between
the left and right event after applying the relational pattern’s translation to the left.
TransE uses a ranking loss in which $d$ for relational triples should be smaller than $d$ for
corrupted triples (in which the relation is false). For training, the pairwise ranking loss described
above is used with margin $\gamma$ (a configurable hyperparameter). The loss across all training triples
$S$ and their corrupted counterparts $S'$ is
$$L = \sum_{(e_i, r_k, e_j) \in S} \; \sum_{(e_i', r_k, e_j') \in S'} \left[\gamma + d(e_i + r_k, e_j) - d(e_i' + r_k, e_j')\right]_+,$$
which is optimized by sampling positive and negative pairs in Stochastic Gradient Descent (SGD).
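A PyTorch sketch of the distance and margin ranking loss, assuming event and pattern embeddings are already available as tensors:

    import torch
    import torch.nn.functional as F

    def transe_distance(e_i, r_k, e_j):
        """d(e_i + r_k, e_j) = ||e_i + r_k - e_j|| (Euclidean norm)."""
        return torch.norm(e_i + r_k - e_j, dim=-1)

    def transe_margin_loss(pos, neg, gamma=1.0):
        """Pairwise ranking loss: true triples should score at least `gamma`
        lower than corrupted ones. `pos` and `neg` are (e_i, r_k, e_j)
        embedding tuples for true and corrupted triples respectively."""
        return F.relu(gamma + transe_distance(*pos) - transe_distance(*neg)).mean()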
TransE treats events and patterns as symbols. To further capture the text of events and patterns⁹,
we followed (Toutanova et al., 2015) and represented the text of events and patterns in the relational
embedding model. We modeled $e_i$, $r_k$, $e_j$ with three Bi-directional LSTMs (Hochreiter,
Schmidhuber, 1997) over the sequences of words in event i, pattern k, and event j respectively.
The revised TransE+BiLSTM model is shown in Figure 6. For events, we used the snippets where
the event triggers appear. For proposition patterns, we linearized them following (Zhang et al.,
2017). The joint model is trained with SGD with the same loss function L.
Figure 6: The TransE+BiLSTM model for learning event and phrase embeddings.
We illustrated the position of event trigger words with ”[]”.
Using $\langle event_1, pattern, event_2 \rangle$ triples generated from 1.5 million English Gigaword documents,
we trained TransE+BiLSTM to generate embeddings for events and patterns. To see if the
embeddings can capture paraphrases, we calculated pairwise cosine similarity for all patterns, and
then sampled some pairs with similarity > 0.9; these are shown in Table 4. The results show that the
embedding learning algorithm does project paraphrases to nearby areas in the embedding space.
7. We used Euclidean distance. Other distances can also be used, such as Manhattan, squared Euclidean, etc.
8. An example is <attack, led to, death>.
9. Knowing that “0 caused 1” and “1 is caused by 0” are similar in meaning.
Table 4: Paraphrase pairs learned with TransE+BiLSTM.
“V” is short for “verb”. *: Occasionally the model fails to capture directionality correctly.
Given a set of known patterns, we ranked all other patterns according to the cosine similarity of their
embeddings to the average embedding of the known pattern set. Similarly, given a set of known
event pairs, we ranked all other event pairs according to the cosine similarity of their embeddings
to the average embedding of the known pairs. The embedding of an event pair is the concatenation
of the embeddings of the two events in the pair.
Calculating pairwise similarity for millions of patterns is very time consuming. We applied an
efficient, approximate algorithm (Bachrach et al., 2014)¹⁰ to calculate all pairwise similarities and
cache the results. When a user searches for similar patterns or pairs, LearnIt returns results instantly
because it only needs to perform lookups.
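A sketch of this precomputation using the Annoy library cited in footnote 10; the dimensionality, tree count, neighbour count, and the stand-in random vectors are illustrative assumptions:

    import random
    from annoy import AnnoyIndex

    EMBEDDING_DIM = 100  # illustrative; must match the learned embeddings
    pattern_embeddings = [[random.random() for _ in range(EMBEDDING_DIM)]
                          for _ in range(1000)]  # stand-in for learned vectors

    index = AnnoyIndex(EMBEDDING_DIM, "angular")  # angular distance ~ cosine
    for item_id, vector in enumerate(pattern_embeddings):
        index.add_item(item_id, vector)
    index.build(50)  # number of random-projection trees

    # Precompute and cache each pattern's top-20 neighbours so that the UI
    # only performs lookups at query time.
    neighbours = {i: index.get_nns_by_item(i, 20)
                  for i in range(len(pattern_embeddings))}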
Evaluating LearnIt relation extraction: We randomly sampled 1.5 million documents from the
English Gigaword corpus (Parker et al., 2011) as our development corpus. We sampled another
500 documents from Gigaword as the test corpus. As a preprocessing step, we needed to tag the
development corpus with events. We ran the SERIF NLP system (Boschee et al., 2005) over both
corpora to generate eventive trigger words in the form of verb and nominalization predicates. We
then asked our annotation team to aggregate these words and prune those that are not eventive
according to Richer Event Description (O’Gorman et al., 2016). This results in 32.8 event mentions
(trigger words) per document on average.
For the test corpus, we asked our annotation team to annotate six types of temporal and causal
relations exhaustively for pairs of events appearing in the same sentence. The relations are defined in
Table 5. 100 instances were dual annotated, resulting in an inter-annotator agreement of 0.76.
The final annotation dataset contains 629 positive instances across the six types of relations.
10. Implemented in https://github.com/spotify/annoy
user effort per type, the relation extractors customized by LearnIt achieved good performance. The
extractors perform better on some types (e.g., Cause) than other types. A reason, as reported by
our annotators, is that it is easier to find patterns for Cause than for other types. We leave a
comparative study to future work.
intuitively important features including: causal cues (e.g., because), discourse connectives (e.g.,
therefore), syntactic/lexical connections between Causal Mentions (CMs), links to ontology clas-
ses, external semantic resources (e.g., lists of terrorist groups), distance between CM sentences,
etc.
To learn implicit relations between a pair of CMs, we harvested corpus-level causal associations:
we trained targeted causal embeddings with neural network models using bootstrapped cause-effect
pairs harvested by LearnIt from a large corpus.
Data sources: We used the following three types of causal relation instances to train the FCDNN:
• Causal relations annotated by LDC: We used the causal relation annotation dataset produced
by the Linguistic Data Consortium (LDC) under the CauseEx program. We concatenated all
instances of causal relation annotation from all LDC causal relation annotated datasets to
train the FCDNN.
• Instances labeled automatically by LearnIt: We applied LearnIt bootstrap learning to curate
relational patterns, and then manually reviewed the patterns for accuracy. This resulted in
dozens of propositional or lexical patterns per relation type. We then applied these patterns
over the Gigaword (Parker et al., 2011) corpus to produce 1.5 million training instances
automatically.
• Examples curated via crowdsourcing: We used Amazon Mechanical Turk (AMT) to
crowdsource examples that express one of the causal relation types. We curated about 1,500
causal relation instances via this process.
11. Context can be a single sentence, or the surrounding sentences.
We then trained two models: the first model is trained with the causal relations annotated by
LDC; the second model is trained with the instances labeled automatically by LearnIt. The two
models are combined in a pipeline in which we merge the instances predicted by both models,
to maximize recall. Due to the small size and lower quality of the crowdsourced instances,
we did not use the crowdsourced data to train the models.
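A minimal sketch of the recall-oriented merge; keying predictions by argument spans and relation type is our assumption about the instance representation:

    def merge_predictions(ldc_model_preds, learnit_model_preds):
        """Union the causal-relation instances predicted by the two models,
        keeping one copy of duplicates, to maximize recall."""
        merged = {}
        for pred in list(ldc_model_preds) + list(learnit_model_preds):
            key = (pred["left_span"], pred["right_span"], pred["relation"])
            merged.setdefault(key, pred)  # first occurrence wins
        return list(merged.values())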
4.0 RESULTS AND DISCUSSION
12. www.actionagainsthunger.org/sites/default/files/publications/REFANI-lit-review-2015 0.pdf
• Evaluation documents, which range from thorough external evaluations of intervention
operations¹³, to brief presentations of results in programme documents, and post-intervention
summary articles.
• Programme documents by implementers¹⁴.
13. E.g. bmcnutr.biomedcentral.com/articles/10.1186/s40795-016-0102-6
14. E.g. one.wfp.org/operations/current operations/project docs/200275.pdf
Intervention Type Count F1-score
anti-retroviral treatment 114 0.59
capacity building human rights 65 0.68
child friendly learning spaces 39 0.30
provision of goods and services 432 0.77
sexual violence management 131 0.45
therapeutic feeding or treating 49 0.52
vector control 146 0.67
Aggregate 976 0.68
Table 8: Intervention types with number of trigger examples and F1-scores.
Counts and scores based on 5-fold cross validation.
descriptors (“livestock feed”, “fodder”, “hay”, etc.) for the category livestock feed.
Then, when our coarse-grained trigger model predicts a trigger instance of provision of goods and
services in a sentence, we check the trigger’s surrounding context (a five-token window) for
mentions of livestock feed, farming tool, fishing tool, etc. We thus deterministically re-label
provision of goods and services into the appropriate finer-grained intervention type, depending on
which category of descriptor is present in the trigger’s context window. As shown in Table 8, the
coarse-grained F1-score of provision of goods and services is 0.77. After performing the
deterministic re-labeling into finer-grained intervention types, we obtain an aggregate F1-score of
0.56 when evaluating against our fine-grained trigger labels. Recall misses, such as those resulting
from incomplete descriptor lists, and precision misses, resulting from multiple descriptor categories
being present within a trigger’s surrounding context, contributed to the drop in F1-score.
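A sketch of this deterministic re-labeling; the abbreviated descriptor table and the function signature are illustrative assumptions:

    # The descriptor lists and five-token window come from the text; the
    # category inventory shown here is abbreviated and the names are ours.
    DESCRIPTORS = {
        "livestock feed": ["livestock feed", "fodder", "hay"],
        # ... one descriptor list per finer-grained intervention type
    }

    def refine_label(tokens, trigger_index, window=5):
        """Deterministically re-label a coarse-grained 'provision of goods
        and services' trigger when a category descriptor occurs nearby."""
        lo = max(0, trigger_index - window)
        hi = trigger_index + window + 1
        context = " ".join(tokens[lo:hi]).lower()
        for fine_type, terms in DESCRIPTORS.items():
            if any(term in context for term in terms):
                return fine_type
        return "provision of goods and services"  # keep the coarse label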
Extracting locations and time: We leverage the ACE corpus, which contains annotations of
Place and Time event arguments, to train an event type independent Place/Time argument classi-
fier, based on the neural architecture described in Section 3.2.1. In our evaluation, an argument is correctly
classified if its event type, event argument role, and offsets match any of the reference event argu-
ments.
African countries are often the focus sites of humanitarian programs and agencies, such as the
World Food Programme (WFP). Hence, to evaluate the performance of our argument model for
intervention events, we randomly selected 250 documents from around 6,000 documents collected
from allafrica.com.
We first apply our coarse-grained trigger classifier on these documents. We then ask annotators to
evaluate trigger predictions and retain only correct ones (188 triggers), which we subsequently use
to evaluate our argument classifier. We focus on using correct triggers to evaluate argument clas-
sification, to prevent error propagation (from erroneous trigger predictions) from muddling a fair
assessment of the argument classifier.
Our annotators assigned a total of 15 Time arguments and 77 Place arguments to the 188 event
triggers. Our argument classifier predicted a total of 12 Time arguments, giving a precision, recall,
and F1 of 0.92, 0.73, and 0.81 respectively. Our argument classifier predicted 30 Place arguments,
giving a precision, recall, and F1 of 0.93, 0.36, and 0.52 respectively.
15. These lists are available at https://github.com/BBN-E/mr-intervention
5.0 CONCLUSIONS
During this effort, we made technical progress in event and causal factors extraction from text,
tools to aid ontology development, causal relation extraction from text, and experimentation sup-
port. Here we provide a brief summary of the key results we accomplished to date:
Event extraction from text. We have developed event extraction techniques that achieved state-
of-the-art results on publicly available benchmark datasets. To support evolving operational
needs that rely on an evolving set of events and causal factors, we also developed the following
additional techniques. First, we developed on-demand rapid customization techniques, which
require only a small number of user interactions per new event type to build machine learning
models for the types of interest.
Ontology-In-A-Day. Our human-in-the-loop OIAD clustering service helps streamline the manual
process of constructing an ontology for each new domain and use case. This sort of human-
machine synergy enables IE’s massive reading ability to discover factors and relations and organize
them in a way that aligns with human understanding of complex problems, reducing overall
human effort.
Causal relation extraction from text. Our causal relation extraction techniques include a pattern-
based bootstrap learning approach for causal relation extraction and a deep neural network model
that improves recall while maintaining a similar level of precision. The approach achieved state-of-
the-art performance. However, causal relation extraction for new domains and implicit causal
relations is still a challenging problem. We believe more labeled data can help; however, annotating
causal relations is a difficult and time-consuming task. We developed crowdsourcing techniques
to reduce the effort required to curate a large training dataset and performed multiple rounds of
pilot studies to iterate on the annotation guidelines. We are optimistic that this approach can
overcome the data bottleneck given further research and development effort.
Experimentation and Evaluation. We supported multiple experiments and evaluation activities,
including embedded experiments with government transition partners. We also supported the
program-wide integration efforts between the BBN team and other performers.
6.0 REFERENCES
Agichtein E., Gravano L. (2000). Snowball: Extracting relations from large plain-text collec-
tions. In Proceedings of the fifth ACM conference on Digital libraries, pages 85–94. ACM.
Bachrach Y., Finkelstein Y., Gilad-Bachrach R., Katzir L., Koenigstein N., Nice N., Paquet U.
(2014). Speeding up the Xbox recommender system using a Euclidean transformation for inner-
product spaces. In Proceedings of the 8th ACM Conference on Recommender Systems, pages
257–264. ACM.
Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don’t count, predict! A systematic comparison
of context-counting vs. context-predicting semantic vectors. In ACL 2014, pages 238–247.
Bordes A., Usunier N., Garcia-Duran A., Weston J., Yakhnenko O. (2013). Translating embed-
dings for modeling multi-relational data. In Advances in neural information processing systems,
pages 2787–2795.
Boros E. (2018). Neural Methods for Event Extraction. Ph.D. thesis, Universite Paris-Saclay.
Boschee E., Weischedel R., Zamanian A. (2005). Automatic information extraction. In Proceed-
ings of the International Conference on Intelligence Analysis, volume 71. Citeseer.
Chen Y., Xu L., Liu K., Zeng D., Zhao J.. (2015). Event extraction via dynamic multi-pooling
convolutional neural networks. In ACL-IJCNLP 2015, pages 167–176.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirec-
tional transformers for language understanding. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Doddington G. R., Mitchell A., Przybocki M. A., Ramshaw L. A., Strassel S. M., Weischedel R.
M. (2004). The automatic content extraction (ACE) program - tasks, data, and evaluation. In
LREC.
Dunietz J., Levin, L., & Carbonell, J (2017). BECauSE Corpus 2.0: Annotating Causality and
Overlapping Relations, In Proceedings of the 11th Linguistic Annotation Workshop.
Freedman, M. and Gabbard, R. (2014). Overview of the event argument evaluation. In Proceed-
ings of TAC KBP 2014 Workshop, National Institute of Standards and Technology, pages 17–
18.
Grishman, R. and Sundheim, B. (1996). Message understanding conference-6: A brief history. In
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
Gupta P., Roth B., Schutze H. (2018). Joint bootstrapping machines for high confidence relation
extraction. arXiv preprint. arXiv:1805.00254.
Hochreiter S., Schmidhuber J. (1997). Long short-term memory. Neural computation, 9(8):1735–
1780.
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., ... & Liu, Q. (2019). TinyBERT: Distil-
ling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
Lin D., Pantel P. (2001). DIRT: Discovery of inference rules from text. In Proceedings of the sev-
enth ACM SIGKDD international conference on Knowledge discovery and data mining, pages
323–328. ACM.
Meyers, A., Kosaka, M., Xue, N., Ji, H., Sun, A., Liao, S., and Xu, W. (2009). Automatic recog-
nition of logical relations for English, Chinese and Japanese in the GLARF framework. In Pro-
ceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions
(SEW '09). Association for Computational Linguistics, USA, 146–154.
Min, B., Freedman, M., Meltzer T. (2017). Probabilistic Inference for Cold Start Knowledge
Base Population with Prior World Knowledge. In Proceedings of EACL 2017.
Mintz M., Bills S., Snow R., Jurafsky D. (2009). Distant supervision for relation extraction with-
out labeled data. In ACL-IJCNLP, pages 1003–1011.
O’Gorman T., Wright-Bettner K., Palmer M. (2016). Richer event description: Integrating event
coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop
on Computing News Storylines (CNS 2016), pages 47–56.
Pantel P., Crestan E., Borkovsky A., Popescu A., Vyas V. (2009). Web-scale distributional simi-
larity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing: Volume 2, pages 938–947. Association for Computational Lin-
guistics.
Pantel, P. and Lin, D. (2002). Discovering word senses from text. In Proceedings of the eighth
ACM SIGKDD international conference on Knowledge discovery and data mining (SIGKDD-
02), pages 613– 619.
Parker R., Graff D., Kong J., Chen K., Maeda K. (2011). English gigaword. Linguistic Data Con-
sortium.
Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word
representation. In Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP) (pp. 1532-1543).
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2020). DistilBERT, a distilled version of BERT:
smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). Introduction to information retrieval, vol-
ume 39. Cambridge University Press Cambridge.
Toutanova K., Chen D., Pantel P., Poon H., Choudhury P., Gamon M. (2015). Representing text
for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Em-
pirical Methods in Natural Language Processing, pages 1499–1509.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf,
R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le
Scao, T., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. (2020). Transformers: State-of-the-art
natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing: System Demonstrations, pages 38–45, Online, October. Associa-
tion for Computational Linguistics.
Yang, J., Han, S. C., & Poon, J. (2022). A survey on extraction of causal relations from natural
language text. Knowledge and Information Systems, 1-26.
Yu H., Agichtein E. (2003). Extracting synonymous gene and protein terms from biological liter-
ature. Bioinformatics, 19:i340–i349.
Zhang S., Duh K., Van Durme B. (2017). MT/IE: Cross-lingual open information extraction with
neural sequence-to-sequence models. In Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 64–70.
LIST OF SYMBOLS, ABBREVIATIONS, AND ACRONYMS