
REPORT DOCUMENTATION PAGE Form Approved OMB NO.

0704-0188
The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments
regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington
Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA, 22202-4302.
Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection
of information if it does not display a currently valid OMB control number.
PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.
1. REPORT DATE (DD-MM-YYYY) 2. REPORT TYPE 3. DATES COVERED (From - To)
12-07-2022 Final Report 1-Dec-2017 - 31-Mar-2022
4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER
Final Report: Hume: Domain-Agnostic Extraction of Causal
Analysis Graphs 5b. GRANT NUMBER
W911NF-18-C-0003
5c. PROGRAM ELEMENT NUMBER

6. AUTHORS 5d. PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

7. PERFORMING ORGANIZATION NAMES AND ADDRESSES 8. PERFORMING ORGANIZATION REPORT


Raytheon BBN Technologies Corp. NUMBER
10 Moulton Street

Cambridge, MA 02138 -1119


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS 10. SPONSOR/MONITOR'S ACRONYM(S)
(ES) ARO
U.S. Army Research Office 11. SPONSOR/MONITOR'S REPORT
P.O. Box 12211 NUMBER(S)
Research Triangle Park, NC 27709-2211 72287-CS-DRP.1
12. DISTRIBUTION AVAILABILITY STATEMENT
Approved for public release; distribution is unlimited.
13. SUPPLEMENTARY NOTES
The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official Department
of the Army position, policy or decision, unless so designated by other documentation.

14. ABSTRACT

15. SUBJECT TERMS

16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF RESPONSIBLE PERSON
a. REPORT b. ABSTRACT c. THIS PAGE ABSTRACT OF PAGES Yee Seng Chan
UU UU UU 19b. TELEPHONE NUMBER
UU
000-000-0000
Standard Form 298 (Rev 8/98)
Prescribed by ANSI Std. Z39.18
RPPR Final Report
as of 27-Jul-2022
Agency Code: 21XD

Proposal Number: 72287CSDRP Agreement Number: W911NF-18-C-0003


INVESTIGATOR(S):

Name: Yee Seng Chan


Email: yeeseng.chan@raytheon.com
Phone Number: 0000000000
Principal: Y

Organization: Raytheon BBN Technologies Corp.


Address: 10 Moulton Street, Cambridge, MA 021381119
Country: USA
DUNS Number: 146595736 EIN: 412126829
Report Date: 31-Mar-2022 Date Received: 12-Jul-2022
Final Report for Period Beginning 01-Dec-2017 and Ending 31-Mar-2022
Title: Hume: Domain-Agnostic Extraction of Causal Analysis Graphs
Begin Performance Period: 01-Dec-2017 End Performance Period: 31-Mar-2022
Report Term: 0-Other
Submitted By: Carmine Iantosca Email: carmine.iantosca@raytheon.com
Phone: (617) 873-4015
Distribution Statement: 1-Approved for public release; distribution is unlimited.

STEM Degrees: STEM Participants:

Major Goals: Report developed under contract W911NF-18-C-0003. Raytheon BBN developed Hume, a system
which builds qualitative, causal analysis graphs (CAGs) by reading text (textbooks, academic literature,
government/other reports, encyclopedias, news and online sources). Specifically, during this effort we:
• Developed tools that extract events and causal factors from text using Deep Neural Networks (DNN).
• Developed tools that extract causal relations from text using DNN. The causal relation types are intended to be
domain-agnostic.
• Developed clustering tools to aid analysts in constructing and enriching ontologies in new domains.
• Participated in internal evaluations, technology assessments, and collaborative experiments.

Accomplishments: Report developed under contract W911NF-18-C-0003. Raytheon BBN developed Hume, a
system which builds qualitative, causal analysis graphs (CAGs) by reading text (textbooks, academic literature,
government/other reports, encyclopedias, news and online sources). Specifically, during this effort we:
• Developed tools that extract events and causal factors from text using Deep Neural Networks (DNN).
• Developed tools that extract causal relations from text using DNN. The causal relation types are intended to be
domain-agnostic.
• Developed clustering tools to aid analysts in constructing and enriching ontologies in new domains.
• Participated in internal evaluations, technology assessments, and collaborative experiments.

Training Opportunities: Nothing to Report

Results Dissemination: Nothing to Report

Honors and Awards: Nothing to Report

Protocol Activity Status:

Technology Transfer: Nothing to Report

PARTICIPANTS:

Participant Type: PD/PI


Participant: Jessica Macbride
RPPR Final Report
as of 27-Jul-2022
Person Months Worked: 15.00 Funding Support:
Project Contribution:
National Academy Member: N

Participant Type: Co PD/PI


Participant: Yee Seng Chan
Person Months Worked: 15.00 Funding Support:
Project Contribution:
National Academy Member: N

Participant Type: Co PD/PI


Participant: Bonan Min
Person Months Worked: 15.00 Funding Support:
Project Contribution:
National Academy Member: N

Partners

I certify that the information in the report is complete and accurate:


Signature: carmine iantosca
Signature Date: 7/12/22 2:25PM
Hume: Domain-Agnostic Extraction of Causal Analysis
Graphs

DARPA World Modelers

Jessica MacBride
Raytheon BBN Technologies
10 Moulton Street
Cambridge, MA, 02138

March 31, 2022

FINAL TECHNICAL REPORT FOR PERIOD


December 1, 2017 – March 31, 2022

Prime Contract Number W911NF-18-C-0003

Prepared for:
Dr. Joshua Elliott

Distribution Statement A: Approved for public release; distribution is unlimited.


This document does not contain technology or technical data controlled under either the U.S. ITAR or the U.S. EAR.
The views and conclusions contained in this document are those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or
the US government.
Form Approved
REPORT DOCUMENTATION PAGE OMB No. 0704-0188

The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data
sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other
aspect of this collection of information, including suggestions for reducing the burden, to Department of Defense, Washington Headquarters Services, Directorate for Information
Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision
of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.
PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.
1. REPORT DATE (DD-MM-YYYY) 2. REPORT TYPE 3. DATES COVERED (From - To)
3/31/2022 Final Report 12/1/2017 - 3/31/2022
4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER
Hume: Domain-Agnostic Extraction of Causal Analysis Graphs W911NF-18-C-0003
5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

6. AUTHOR(S) 5d. PROJECT NUMBER


Jessica Macbride
5e. TASK NUMBER

5f. WORK UNIT NUMBER


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION
Raytheon BBN REPORT NUMBER
10 Moulton Street BBN Report-8627
Cambridge, MA 02138
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR'S ACRONYM(S)
U.S. Army Research Office ARO
P.O. Box 12211, Research 11. SPONSOR/MONITOR'S REPORT
NUMBER(S)
Triangle Park, NC 27709-2211

12. DISTRIBUTION/AVAILABILITY STATEMENT


Distribution Statement A: Approved for public release; distribution is unlimited
13. SUPPLEMENTARY NOTES
The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official
Department of the Army position, policy or decision, unless so designated by other documentation.
14. ABSTRACT
Report developed under contract W911NF-18-C-0003. Raytheon BBN developed Hume, a system which builds qualitative,
causal analysis graphs (CAGs) by reading text (textbooks, academic literature, government/other reports, encyclopedias,
news and online sources). Specifically, during this effort we:
• Developed tools that extract events and causal factors from text using Deep Neural Networks (DNN).
• Developed tools that extract causal relations from text using DNN. The causal relation types are intended to be domain-
agnostic.
• Developed clustering tools to aid analysts in constructing and enriching ontologies in new domains.
• Participated in internal evaluations, technology assessments, and collaborative experiments.
15. SUBJECT TERMS
DARPA World Modelers Program
16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER OF 19a. NAME OF RESPONSIBLE PERSON
a. REPORT b. ABSTRACT c. THIS PAGE ABSTRACT PAGES
19b. TELEPHONE NUMBER (Include area code)
Table of Contents
List of Figures ........................................................................................................................... iv
List of Tables ..............................................................................................................................v
1.0 SUMMARY .....................................................................................................................1
2.0 INTRODUCTION ............................................................................................................2
3.0 METHODS, ASSUMPTIONS, AND PROCEDURES .....................................................3
3.1 Baseline Capabilities for Knowledge Organization........................................................3
3.2 Event Extraction from Text ...........................................................................................4
3.2.1 Convolutional Neural Networks for Event Extraction.............................................4
3.2.2 Rapid Customization for Event Extraction .............................................................7
3.3 Ontology-In-A-Day Clustering ......................................................................................8
3.3.1 Grounding Clusters to an Ontology ........................................................................9
3.3.2 Cluster Quality vs Efficiency Tradeoffs .................................................................9
3.3.3 Cluster Ranking and Filtering ..............................................................................10
3.4 Causal Relation Extraction from Text..........................................................................11
3.4.1 LearnIt bootstrap learning for relation extraction ..................................................11
3.4.2 Deep Neural Networks for relation extraction ......................................................17
4.0 RESULTS AND DISCUSSION .....................................................................................19
4.1 Experimentation and Evaluation..................................................................................19
4.2 Event Customization Case Study: Intervention ............................................................19
4.2.1 Intervention Event Ontology ................................................................................19
4.2.2 An Intervention Corpus ........................................................................................19
4.2.3 Annotating Intervention Instances ........................................................................20
4.2.4 Evaluating Intervention Performance ...................................................................21
5.0 CONCLUSIONS ............................................................................................................23
6.0 REFERENCES ...............................................................................................................24
LIST OF SYMBOLS, ABBREVIATIONS, AND ACRONYMS ..............................................27
LIST OF FIGURES

Figure 1: Base ontology extraction using BBN's SERIF and TAC-KBP system components. ......3
Figure 2: A CNN based model for event trigger and event argument classification. .....................6
Figure 3: A user interface that allows a user to provide, expand, and filter event triggers for new
types. ...................................................................................................................................8
Figure 4: LearnIt workflows. .....................................................................................................12
Figure 5: The web-based user interface of LearnIt. ....................................................................14
Figure 6: The TransE+BiLSTM model for learning event and phrase embeddings.....................15
Figure 7: Architecture of FCDNN .............................................................................................17
LIST OF TABLES

Table 1: Inter-annotator agreement filtered by attribution. ...........................................................6


Table 2: Comparative clustering runtimes (mm:ss). ...................................................................10
Table 3: Precision@K for ground truth novelty of clusters. .......................................................11
Table 4: Paraphrase pairs learned with TransE+BiLSTM. .........................................................16
Table 5: Causal and temporal relations between event X and Y .................................................16
Table 6: Performance of LearnIt relation extractors ...................................................................17
Table 7: Types of interventions with example text snippets, where event triggers are italicized. 20
Table 8: Intervention types with number of trigger examples and F1-scores. .............................21
1.0 SUMMARY
In response to Technical Area (TA) 1 of the DARPA/I2O World Modelers BAA, Raytheon BBN
Technologies Inc. (BBN) developed Hume, a system which builds qualitative, causal analysis
graphs (CAGs) by reading text (textbooks, academic literature, government/other reports, ency-
clopedias, news and online sources).
Specifically, during this effort we:
• Developed tools that extract events and causal factors from text using Deep Neural Networks
(DNN).
• Developed tools that extract causal relations from text using DNN. The causal relation types
are intended to be domain-agnostic.
• Developed clustering tools to aid analysts in constructing and enriching ontologies in new do-
mains.
• Participated in internal evaluations, technology assessments, and collaborative experiments.

This document serves as the end-of-contract final report.


2.0 INTRODUCTION
The World Modelers program aims to develop technologies that will enable analysts to rapidly
build models to analyze questions relevant to national and global security. To improve the speed
and efficiency of model development, it is necessary to develop automatic Machine Reading tech-
niques that can extract concepts and causal relations between them from text sources.
This document presents the technical details of BBN's Hume system, which automatically detects
domain-relevant factors and causal relations from heterogeneous text and uses them to build qual-
itative, causal analysis graphs (CAGs). We will present our existing capabilities that have been
integrated into the Hume system, event extraction and causal relation extraction from text, and
human-in-the-loop ontology development aids. We will then present results, discussion, and our
recommendations for future work.
3.0 METHODS, ASSUMPTIONS, AND PROCEDURES

3.1 Baseline Capabilities for Knowledge Organization


We used BBN SERIF to provide Day-1 knowledge organization capabilities. Based on ontologies
such as TAC KBP 1 and CAMEO 2, the tools together extract 7 classes of entities (e.g. PERSON,
ORGANIZATION), 30+ entity-entity relations (e.g. Employee-or-Member-of), and over 300 classes
of events (e.g. CONFLICT, DEMONSTRATE, PROVIDEAID). An example of TAC extractions con-
forming to the TAC ontology appears in Figure 1.

Figure 1: Base ontology extraction using BBN's SERIF and TAC-KBP system components.
These tools have been applied to formal and informal data for knowledge base population and as
input to forecasting tools. A knowledge base is created from text by labeling and co-referencing
entities, relations and events in a corpus (Min, Freedman, Meltzer, 2017). Where possible, entities
are linked to an external resource (e.g. Wikipedia as in Figure 1). When no external link is availa-
ble, entities are clustered and given a system provided global ID (i.e. the system assigns the same
ID to J.A. Smith as Jane Smith when the mentions reference the same person). In the resulting
knowledge base, each fact is associated with a confidence and set of text snippets that provide
evidence of the fact. A user (or downstream algorithm) can query the knowledge base for both the ontologized fact and its justification(s).

1 https://www.nist.gov/tac/2016/KBP
2 https://en.wikipedia.org/wiki/Conflict_and_Mediation_Event_Observations
To adapt these strong baseline capabilities for the World Modelers domains, we developed heuris-
tic rules to map entities, relations, and events into the types in the World Modelers ontology. Large
portions of types in the existing ontologies (e.g. ACE, CAMEO and TAC KBP) have been mapped
into the World Modelers ontology. This provides a strong starting point for knowledge organiza-
tion.

3.2 Event Extraction from Text


Event extraction is the task of identifying events of interest with associated participating arguments
in text. For instance, given the following sentence:
S1: 21 people were wounded in Tuesday’s southern Philippines airport blast.
Event extraction aims to recognize the two events (Injury and Attack), triggered by the words
“wounded” and “blast” respectively. We also recognized that “21 people” and “airport” take on
the event argument roles Actor(s) involved and Place respectively.
Given an English sentence, we performed event extraction using a two-stage process:
• Stage 1: Trigger classification. Label words in the sentence with their predicted event types (if any). For instance, in sentence S1, the extraction system should label “wounded” as a trigger of an Injury event, and “blast” as a trigger of an Attack event.
• Stage 2: Argument classification. For each predicted event trigger t_i, pair t_i with all entity and time mentions m_i in the same sentence to generate candidate event arguments. Given a candidate event argument (t_i, m_i), the system predicts an associated event role (if any). For instance, given (“wounded”, “airport”), the system should predict the Place event role. A minimal sketch of this two-stage decoding process follows.
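As a point of reference, the two-stage decoding can be sketched as follows. This is a minimal illustration only: the trigger_model and argument_model objects and their predict() methods are hypothetical stand-ins, not the actual Hume interfaces.

```python
# Hypothetical sketch of the two-stage decoding loop; trigger_model and
# argument_model (and their predict() methods) are illustrative stand-ins.
def extract_events(sentence_tokens, entity_and_time_mentions,
                   trigger_model, argument_model):
    events = []
    # Stage 1: label each token with an event type (or NONE).
    for i, token in enumerate(sentence_tokens):
        event_type = trigger_model.predict(sentence_tokens, trigger_index=i)
        if event_type == "NONE":
            continue
        # Stage 2: pair the trigger with every entity/time mention in the
        # sentence and predict an argument role (or NONE) for each pair.
        arguments = []
        for mention in entity_and_time_mentions:
            role = argument_model.predict(sentence_tokens,
                                          trigger_index=i, mention=mention)
            if role != "NONE":
                arguments.append((role, mention))
        events.append({"trigger": token, "type": event_type,
                       "arguments": arguments})
    return events
```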
3.2.1 Convolutional Neural Networks for Event Extraction
For event extraction, we developed a series of DNNs (Deep Neural Networks). These methods
formulate both event trigger and argument extraction as supervised classification problems that
can be trained from a corpus of manually labeled examples that are specific to that ontology. For
instance, the popular Automatic Content Extraction (ACE) (Doddington et al., 2004) corpus con-
tains 599 documents manually annotated with examples for 33 event types, such as Attack and
Justice events.
We developed a Convolution Neural Network (CNN) model to perform event trigger classifica-
tion, and another CNN model for event argument classification used with our novel trigger and
argument example collection approaches. The two CNN models are very similar, with the argu-
ment model incorporating more features. Hence, we will describe the argument model in detail,
then provide a summary of the trigger model. As shown in Figure 2, the argument model consists
of (1) an embedding layer to encode words and word positions in the sentence, (2) a convolution
and max pooling layer to generate high-level features from the embedding representation of the
sentence, (3) a layer which concatenates the max pool layer and local context window around the
candidate trigger and argument, (4) followed by the SoftMax function for classifying the example
into one of the target classes.
For argument classification, the input is a sentence in which a trigger word and a candidate event
argument is identified, e.g. (“relief”, “states”) in Figure 2.
Embedding Layer encodes each word with:
• Word embeddings (WE): Given an input sentence x of length t, we first transformed each word into a real-valued vector of dimension d_1 by looking up a word embedding matrix W_1 ∈ R^{d_1×|V|}, where V is the vocabulary. We used word embeddings trained by Baroni et al., 2014, which achieved state-of-the-art results in a variety of NLP tasks.
• Position embeddings (PE): PE_t encodes the relative distance of each word to the trigger word as a real-valued vector of dimension d_2 via an embedding matrix W_2 ∈ R^{d_2×|D|}, where D is the set of relative distances in the dataset. W_2 is randomly initialized and learnt during training. We similarly used PE_a to encode relative distances to the candidate argument, defined by W_3 ∈ R^{d_3×|D|}.
The final embedding dimension for each token is n_1 = d_1 + d_2 + d_3. This layer produces an embedding representation x^(1) ∈ R^{n_1×t} when fed with an input sentence x.
Convolution and Max Pooling Layer: We used a set of filters with different window sizes to capture important n-gram features from an input sentence. Due to space constraints, we omitted the definitions of the convolution and max pool layers. We denoted the max pool layer using a fixed-sized feature vector x^(2) ∈ R^{n_2}, where n_2 is the total number of filters.
Concatenate Layer: We selected the word embeddings of the trigger, the candidate argument, and their local windows. We defined the window surrounding a word as the k = 3 tokens to the left and right of the word. We concatenated these embeddings to the max pool layer to obtain a concatenated vector x^(3).
Event Argument Classification: We have o = W^(3) x^(3) + b^(3), where W^(3) and b^(3) are parameters learnt in this layer. Here, o ∈ R^{n_3}, where n_3 is equal to the number of event argument roles, including the “NONE” label for candidate arguments which are not actual event arguments of the trigger. Given an input example x, our network with parameters θ outputs the vector o, where the i-th component contains the score for event role i. To obtain the conditional probability p(i | x, θ), we applied SoftMax:

p(i | x, θ) = e^{o_i} / Σ_j e^{o_j}
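For concreteness, a minimal sketch of the argument-classification CNN is shown below, assuming a PyTorch implementation; the framework, dimensions, and filter sizes are illustrative assumptions and are not specified in this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArgumentCNN(nn.Module):
    """Sketch of the CNN argument model: WE + PE_t + PE_a embeddings,
    convolution + max pooling, concatenation with the trigger/argument local
    windows, and a softmax over argument roles (including NONE)."""
    def __init__(self, vocab_size, num_roles, d_word=300, d_pos=50,
                 num_filters=150, window_sizes=(2, 3, 4), k=3, max_dist=200):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_word)          # WE
        self.pos_emb_t = nn.Embedding(2 * max_dist + 1, d_pos)    # PE_t
        self.pos_emb_a = nn.Embedding(2 * max_dist + 1, d_pos)    # PE_a
        n1 = d_word + 2 * d_pos
        self.convs = nn.ModuleList(
            [nn.Conv1d(n1, num_filters, w) for w in window_sizes])
        # max-pooled features + word embeddings of the trigger, the argument,
        # and their +/- k local windows
        n_concat = num_filters * len(window_sizes) + 2 * (2 * k + 1) * d_word
        self.out = nn.Linear(n_concat, num_roles)   # includes the NONE role

    def forward(self, words, dist_to_trigger, dist_to_arg,
                trigger_window, arg_window):
        # words, dist_*: (batch, seq_len); *_window: (batch, 2k+1)
        x = torch.cat([self.word_emb(words),
                       self.pos_emb_t(dist_to_trigger),
                       self.pos_emb_a(dist_to_arg)], dim=-1)    # (B, T, n1)
        x = x.transpose(1, 2)                                   # (B, n1, T)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        local = torch.cat([self.word_emb(trigger_window).flatten(1),
                           self.word_emb(arg_window).flatten(1)], dim=1)
        feats = torch.cat(pooled + [local], dim=1)
        return F.softmax(self.out(feats), dim=-1)               # p(role | x, θ)
```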
The CNN for trigger classification is largely the same as the above CNN for argument classification, omitting just the argument-associated features, i.e. PE_a and the argument window shown
at the bottom of Figure 2. The input is a sentence in which a word is the candidate trigger word,
e.g. “relief” in Figure 2. The output is a SoftMax function predicting one of the event types or
NONE, indicating the candidate word is not a valid trigger for any of the event types.
An illustration of the CNN model is in the following figure:
Figure 2: A CNN based model for event trigger and event argument classification.
For simplicity, only event argument classification is shown. WE is word embeddings. PE_t and PE_a are position embeddings, capturing a token’s distance to the candidate trigger and argument respectively. These position embeddings are randomly initialized and learned during training.
We first conducted experiments to verify that our CNN model implementation achieves compara-
ble performance to state-of-the-art CNN-based event extraction systems (Chen et al., 2015, Boros,
2018) to ensure that it is suitable for use in our rapid event customization approach. Following this
prior work, we used the ACE-2005 corpus, with the same sets of 529 training documents, 30 de-
velopment documents, and 40 test documents. We used the same criteria as this prior work to judge the
correctness of our event extractions: A trigger is correctly classified if its event subtype and offsets
match those of a reference trigger; an argument is correctly classified if its event subtype, event
argument role, and offsets match any of the reference event arguments.

Table 1: Inter-annotator agreement filtered by attribution.


Counts of articles and trigger examples in training corpora for distant supervision (C_ds), distant supervision followed by human adjudication (C_adj), and sampled distant supervision (C_ds'), as well as corpora for development (Dev) and test (Test).
Since the ratio of positive (valid) vs negative examples is relatively skewed (for instance, most
words in a sentence are not triggers), we tried different weights for the positive examples: 1, 3, 5,
or 10. We tuned this and other hyper-parameters (batch size, number of CNN filters, number of
epochs) on the development documents. We also followed (Chen et al., 2015) by using the
Adadelta update rule with parameters ρ = 0.95 and ε = 1e-6, and a dropout rate of 0.5. On the ACE
test data, our trigger model achieves an F1 score of 0.65, close to the scores of 0.66 and 0.68
reported in (Chen et al., 2015) and (Boros, 2018) respectively. Our argument model using gold
triggers achieves an F1 score of 0.53, close to the score of 0.55 reported in (Boros, 2018).
3.2.2 Rapid Customization for Event Extraction
We developed a system that facilitates rapid extension of extraction capabilities to a large number
of novel event types. We focused on the problem of rapid customization of event extractors for new
event types where we did not have a large amount of hand-labeled data available.
Event trigger classification: To extract triggers of a new event type, one needs to annotate a large
amount of training examples specific to that new event type, to enable training a supervised clas-
sification model for event extraction.
Our system enables rapid gathering of event trigger examples for new event types with minimal
human effort, aided by the UI shown in Figure 3, using this workflow:
• Given a new target event type, the user first provides some initial keywords. The UI (backed
by an unannotated text corpus) presents up to 3 text snippets (sentences) mentioning each trig-
ger.
• The user can then easily gather additional discriminative keywords using the UI via interactive
search. By clicking on the “Find similar” button in each pane, the system will suggest new
event keywords that are similar to the current set of keywords, displaying these suggested key-
words in the working pane on the left of the UI. Our system suggests new keywords using
WordNet hyponyms and cosine similarity in a word embedding space.
• The user can then repeat this process for additional event types. This can be seen in Figure 3,
where each pane (column) shows an event type name at the top, followed by event triggers (in
red) and text snippets (clickable to expand to full sentence) mentioning these triggers.
• The user can edit between event types by drag and drop, moving a trigger or snippet from one
event to another. The user can also click on “−” to remove an event, a trigger with its snippets,
or just a snippet. The user can also click on the “More” button to the right of each trigger, to
display additional text snippets containing the trigger.
• When the user is satisfied with the current set of keywords and associated text snippets, our
system then performs distant supervision (Mintz et al., 2009) by using the occurrences of these
keywords (their associated text snippets) as event trigger examples for the new event type.
We will show that, over the set of 67 new event types described, the user spent an average of 4.5 minutes to provide 8.6 initial triggers and associated text snippets, then another 5 minutes interacting with the UI to expand and filter the triggers, for a total of less than 10 minutes per event type.
Figure 3: A user interface that allows a user to provide, expand, and filter event triggers for new
types.
A demonstration video is available on github.com/BBN-E/Rapid-customization-events-acl19.
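The distant-supervision step above can be sketched as follows; the data structures and helper name are hypothetical, and the real system operates over SERIF-processed documents rather than raw token lists.

```python
# Sketch of distant-supervision harvesting: every occurrence of a curated
# keyword becomes a trigger training example for the new event type.
def harvest_trigger_examples(corpus_sentences, keywords, event_type):
    examples = []
    keyword_set = {k.lower() for k in keywords}
    for tokens in corpus_sentences:            # each sentence as a token list
        for i, tok in enumerate(tokens):
            if tok.lower() in keyword_set:
                examples.append({"tokens": tokens,
                                 "trigger_index": i,
                                 "label": event_type})
    return examples
```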

Event argument extraction: Current argument examples, such as those defined in ACE, are event
type specific. For instance, the ACE corpus annotates Agent and Victim arguments for Injure
events, Attacker and Target arguments for Attack events, etc. To decode event arguments for new
event types, one needs to annotate new event-type-specific argument examples as training data.
We developed an approach to learn a generic event argument model to extract Actor, Place, and
Time arguments for any new event types, without annotating new examples. We defined Actor as
a coarse-grained event argument role, encompassing Agent-like and Patient-like event roles. We
mapped Actor-like argument roles in ACE to a common Actor role label, and used the Place and Time arguments as they appear in ACE. The complete list of ACE event argument roles that we
mapped to Actor are:
• Agent, Artifact, Adjudicator, Victim, Buyer, Seller, Giver, Recipient, Org, Attacker, Target,
Entity, Defendant, Person, Plaintiff, Prosecutor
Using the above mapping approach, we trained a generic event argument classifier that can extract
Actor, Place, and Time arguments for any event type.
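The role-collapsing step can be expressed as a simple lookup; the mapping below mirrors the roles listed above, while the helper name is hypothetical.

```python
# Illustrative sketch of the ACE role collapsing described above.
ACTOR_ROLES = {
    "Agent", "Artifact", "Adjudicator", "Victim", "Buyer", "Seller", "Giver",
    "Recipient", "Org", "Attacker", "Target", "Entity", "Defendant",
    "Person", "Plaintiff", "Prosecutor",
}

def map_ace_role(ace_role):
    """Collapse ACE event argument roles into the generic Actor/Place/Time set."""
    if ace_role in ACTOR_ROLES:
        return "Actor"
    if ace_role in {"Place", "Time"}:
        return ace_role
    return None  # roles outside the generic schema are dropped
```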

3.3 Ontology-In-A-Day Clustering


One recurring challenge when applying Machine Reading tools to solve analytic problems is the
need to ground extractions to a relevant ontology or taxonomy, which may vary with each new
domain and use case. The task of manually creating and maintaining these ontological resources
is laborious and never complete. To help streamline the curation process, we developed an unsu-
pervised clustering service backed by the power and expressivity of contextualized word embed-
dings. The service works in concert with components developed by other performers to support an
Ontology-In-A-Day (OIAD) analyst tool.
After candidate phrase extraction, phrases are clustered into semantically cohesive groups to serve
as taxonomy node suggestions. We use Huggingface transformers (Wolf et al., 2020) to obtain
contextualized DistilBERT (Sanh et al., 2020) embeddings of each occurrence of each phrase. We
then use Annoy 3 to perform time-efficient nearest neighbor search over these embeddings. For
each phrase, we obtain a ranked list (in terms of cosine similarity) of the top-k most similar phrases
as input to the clustering. To cluster the phrases, we employ an algorithm based on the CBC algo-
rithm (Clustering By Committee) (Pantel, Lin, 2002), which uses average link agglomerative clus-
tering (Schütze et al., 2008, Ch. 17) to recursively form cohesive clusters that are dissimilar to one
another. For each cluster c that is formed, the algorithm assigns a score: |c|× avgsim(c), where |c|
is the number of members of c and avgsim(c) is the average pairwise cosine similarity between
members. This score reflects a preference for larger and cohesive clusters. We then rank the clus-
ters in decreasing order of their cluster scores, prioritizing the most effective and cohesive clusters
to the user for selection and addition to the taxonomy.
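A minimal sketch of this embedding, nearest-neighbor, and cluster-scoring pipeline is shown below, assuming Huggingface transformers and Annoy as described; the model checkpoint name, pooling choice, and parameter values are illustrative assumptions.

```python
# Illustrative pipeline: contextual phrase embeddings (DistilBERT), approximate
# nearest-neighbor search (Annoy), and the |c| * avgsim(c) cluster score.
import numpy as np
import torch
from annoy import AnnoyIndex
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

def embed_occurrence(sentence):
    """Mean-pooled contextual embedding of one phrase occurrence in context."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def build_knn_index(phrase_vectors, n_trees=50):
    index = AnnoyIndex(phrase_vectors.shape[1], "angular")
    for i, vec in enumerate(phrase_vectors):
        index.add_item(i, vec)
    index.build(n_trees)
    return index

def top_k_similar(index, phrase_id, k=20):
    """Ranked list of the top-k most similar phrases (by angular distance)."""
    return index.get_nns_by_item(phrase_id, k)

def cluster_score(member_vectors):
    """|c| * avgsim(c): prefers larger, more cohesive clusters."""
    vecs = np.stack(member_vectors)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(vecs)
    avgsim = (sims.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
    return n * avgsim
```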
3.3.1 Grounding Clusters to an Ontology
Once the words/phrases have been clustered by CBC, our system attempts to ground each cluster
to the most similar concept in the user’s existing ontology. We compute similarity between a clus-
ter (essentially a list of words) and an ontology concept by generating a representative contextual
embedding for each cluster and ontology node and then use pairwise cosine similarity over the
vector representations.
For each cluster c, we produce a representative contextual embedding by computing the mean of the contextual embeddings of the member words of c. We compute the contextual embedding of each
individual cluster member ci by taking the mean contextual embedding of a sampling of occur-
rences of that word across the seed corpus.
For each ontology concept, we first attempt to generate a representation by averaging the contex-
tual embeddings of manually annotated instances of the concept. In the case where no manually
annotated instances exist, we instead build a representation based on example terms provided as
metadata in the World Modelers ontology format.
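A sketch of this grounding computation follows, under the assumption that clusters and ontology concepts are both represented by mean contextual embeddings; all names are hypothetical.

```python
# Hypothetical grounding step: mean contextual embeddings for clusters, cosine
# similarity against per-concept embeddings of the existing ontology.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_embedding(member_occurrence_vectors):
    # mean over members; each member is itself the mean of sampled occurrences
    member_means = [np.mean(occs, axis=0) for occs in member_occurrence_vectors]
    return np.mean(member_means, axis=0)

def ground_cluster(cluster_vec, ontology_vectors):
    """Return (best_concept, similarity) for a cluster against the ontology."""
    best_concept = max(ontology_vectors,
                       key=lambda c: cosine(cluster_vec, ontology_vectors[c]))
    return best_concept, cosine(cluster_vec, ontology_vectors[best_concept])
```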
3.3.2 Cluster Quality vs Efficiency Tradeoffs
In our initial clustering implementation, we leveraged BERT (Devlin et al., 2019) contextual em-
beddings. However, the interactive nature of a human in the loop tool led us to explore using static
word embeddings or contextual embeddings with fewer parameters, as a way to improve the over-
all clustering runtime. We analyzed clustering outcomes using two varieties of static word embed-
dings (Baroni et al., 2014, Pennington et al, 2014), as well as contextual embeddings from Distil-
BERT (Sanh et al., 2020), and TinyBERT (Jiao et al., 2020). Our qualitative analysis indicated
that using DistilBERT provides a good balance between runtime efficiency and clustering perfor-
mance with little impact on the accuracy of our automatic mapping of clusters to ontology nodes
when compared to using BERT.
In experiments using GPUs we found that DistilBERT cut clustering runtime in half (vs using
BERT embeddings). We looked to further speed this up and make it feasible to run on more widely-
available CPUs, by experimenting with TinyBERT, quantization of weights, and using single
batches with no subword padding. Quantization refers to the technique of using 8-bit integers,
instead of 32-bit floating point numbers to represent network weights. No padding means that
instead of padding all input sentences to a pre-determined length, we use no padding and perform
inference with a batch size of one.
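The CPU speed-ups described above can be sketched as follows, assuming PyTorch dynamic quantization of the Linear layers and per-sentence (batch size one, no padding) inference; the exact settings used in Hume may differ.

```python
# Illustrative CPU recipe: dynamic int8 quantization of the Linear layers and
# batch-size-one inference with no subword padding.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

# Replace 32-bit float weights in Linear layers with 8-bit integer weights.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

def embed_one(sentence):
    # One sentence per call, padding disabled: no compute wasted on pad tokens.
    inputs = tokenizer(sentence, return_tensors="pt",
                       padding=False, truncation=True)
    with torch.no_grad():
        return quantized(**inputs).last_hidden_state.mean(dim=1)
```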

3 https://github.com/spotify/annoy
We conducted experiments along two dimensions:
1. Using GPU, CPU, CPU + quantization, or CPU + quantization + no padding
2. Using DistilBERT, or one of two variants of TinyBERT
The results of our experiments, shown in Table 2, demonstrate that although moving inference from GPU to CPU results in a 13-times increase in runtime, utilizing quantization reduces that to a 7.5-times increase, and further utilizing no padding reduces it to 1.5 times. We
could recoup further time savings by using Tiny-BERT, but that change could impact cluster qual-
ity. We ultimately decided to continue using DistilBERT, but with quantization and no padding in
place, for a very reasonable 1.5 times increase in inference time when using CPU versus GPU.

GPU CPU CPU + Quantization CPU + Quantization + No Padding


DistilBERT 01:17 16:21 09:34 01:59
TinyBERT 1 00:39 00:59 00:32 00:26
TinyBERT 2 00:56 03:53 02:10 00:43
Table 2: Comparative clustering runtimes (mm:ss).
Experiments performed on a sample of 29 documents using DistilBERT (Sanh et al., 2020), Ti-
nyBERT1 and TinyBERT2 (Jiao et al., 2020) contextual embeddings.
3.3.3 Cluster Ranking and Filtering
User feedback from the June 2021 embed experiment led us to implement a “cluster novelty”
metric to complement the original cluster cohesion score. The novelty metric aims to prioritize
user effort by identifying clusters that are absent or novel when compared with an existing ontol-
ogy. To predict cluster novelty we leveraged the same calculations used above to ground clusters
to ontology nodes. We consider clusters with minimum cosine similarity to nodes in the existing
ontology to be the most novel. To evaluate our metric, we manually annotated a set of sample
clusters with binary decisions about their ground truth novelty with respect to a sample ontology
and then measured Precision@K over a ranking of the clusters in terms of their predicted novelty
score.
The experimental results shown in Table 3 demonstrate that our novelty metric correlates well
with ground truth novelty. In order to maximize the utility of clusters presented to the user, we
decided to rank the clusters via a weighted average of cluster cohesion and cluster novelty. We
observed that this combination increased the overall precision over using cohesion alone, while
still taking into account the overall quality of each cluster.
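A sketch of the combined ranking, assuming novelty is one minus the best cosine similarity to any existing ontology node and an illustrative weight of 0.5, is shown below.

```python
# Illustrative ranking: novelty = 1 - best cosine similarity to any existing
# ontology node; final score = weighted average of cohesion and novelty.
def novelty_score(cluster_vec, ontology_vectors, cosine):
    best_sim = max(cosine(cluster_vec, v) for v in ontology_vectors.values())
    return 1.0 - best_sim

def ranking_score(cohesion, novelty, novelty_weight=0.5):
    return novelty_weight * novelty + (1.0 - novelty_weight) * cohesion

# Clusters are then presented best-first:
# ranked = sorted(clusters, key=lambda c: ranking_score(c.cohesion, c.novelty),
#                 reverse=True)
```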
In addition, we addressed a user suggestion to filter “uninteresting” clusters. These are clusters that might be cohesive, but would be deemed “uninteresting” from the program perspective, e.g. clusters of colors, person titles, etc. We addressed this filtering in a variety of ways, e.g. filtering clusters consisting primarily of very short words, checking against prior clusters that we had manually annotated as “uninteresting” when we were developing our clustering tool, etc.
K Novelty Cohesion Avg.
50 0.96 0.48 0.82
100 0.95 0.40 0.76
150 0.93 0.41 0.74
200 0.91 0.41 0.73
250 0.88 0.41 0.72
Table 3: Precision@K for ground truth novelty of clusters.
Clusters ranked according to predicted novelty score, cluster cohesion and a weighted average of
novelty and cohesion.

3.4 Causal Relation Extraction from Text


3.4.1 LearnIt bootstrap learning for relation extraction
Understanding relations (e.g., a flood caused migration of farmers) between real-world events is
very useful for situation awareness and decision-making. However, creating an event-event rela-
tion extractor often requires a significant amount of time and effort. For example, a developer may
need to write a large set of extraction rules by hand, or curate a large labeled data set to train a
classifier. Such approaches are not readily applicable to new relation types or to new genres of text different from the training data.
We developed Learning relation extractors Iteratively (LearnIt), a system for on-demand rapid
customization of event-event relation extractors with a user in the loop. It has the following key
features:
• First, LearnIt incorporates bootstrapping to iteratively learn event pairs from patterns, and pat-
terns from pairs, by leveraging an unannotated development corpus.
• Second, LearnIt incorporates iterative expansion of its relational pattern set through ranking
and selecting additional patterns that are similar to the known patterns. Similarly, it also sup-
ports iterative expansion of its set of known event pairs. To enable pairwise semantic similarity
calculation, we developed an unsupervised Neural Network model that learns embeddings for
phrases and events from a large corpus.
• Third, it involves a Human In the Loop (HIL) to prevent semantic drift for bootstrapping and
iterative expansion of patterns/event-pair sets. We developed a UI to allow the user to review/se-
lect examples and steer the customization process, with a small amount of effort.
LearnIt aims to learn a set of patterns that can be applied to text for extracting event-event relations.
A pattern is (1) a lexical pattern, which is a sequence of words between a pair of events, e.g., “0
lead to 1” 4 , or (2) a proposition pattern, which is the (possibly nested) predicate-argument struc-
ture that connects the pair of events. For example, “verb:lead[sub=0] [to=1]” is the proposition
pattern counterpart of “0 lead to 1”.
LearnIt uses a large, unannotated corpus for development purpose. For experiments in this section,
we used 1.5 million documents from the English Gigaword corpus (Parker et al., 2011). For all

4 The left and right arguments of a relation are numbered 0 and 1 respectively. We focused on binary relations.
sentences in the development corpus, we ran the SERIF (Boschee et al., 2005) NLP stack to gen-
erate predicate-argument structures in the form of p(r_1:a_1, r_2:a_2, ..., r_n:a_n), in which p is the predicate (Part-of-Speech + word), r_i is the role (e.g., subject sub and object obj) of the i-th argument a_i. An argument is an event trigger word. Following Richer Event Description (O’Gorman et
al., 2016), we adopted a broad definition for Event: an event can be any occurrence, action, process
or event state. In practice, we tagged all predicate-like verbs and nominalizations as event triggers.
The basic workflows of LearnIt are summarized in Figure 4. The LearnIt system incorporates two
workflows, bootstrapping and iterative pattern/pair set expansion, in its iterative learning process.
LearnIt also allows flexible compositions of these two workflows, mediated by the user, to allow
more effective use of users’ effort. The learning process is guided by a small amount of user effort
provided via a User Interface (UI).
Workflow 1: Bootstrapping: LearnIt incorporates bootstrapping (Agichtein, Gravano, 2000; Yu,
Agichtein, 2003; Gupta et al., 2018) for relation extraction. The process works as follows: Given
a handful of initial event pairs that are known to express the target relation, LearnIt searches in a
development corpus to find instances (sentences with a pair of events) that match the known event
pairs. From these instances, LearnIt extracts relational patterns, ranks and presents them to the
user. The user then selects patterns that express the target relation. These patterns will be added
into the known pattern set. Similarly, given a set of known patterns, LearnIt again searches in the
corpus to find matched instances, from which it extracts additional event pairs, ranks and presents
them to the user. The user will select event pairs that express the target relation. The user can
perform multiple iterations of bootstrapping. A complete iteration is illustrated by the blue arrows
and blue text in Figure 4.

Figure 4: LearnIt workflows.


Bootstrap learning is illustrated with the blue arrows and blue text. Iterative self-expansion of
patterns or event pairs is illustrated in orange arrows (self-loops) and the orange text
Workflow 2: Pattern/pair set expansion: This workflow incorporates key ideas from distribu-
tional-similarity-based paraphrase (Lin, Pantel, 2001) and entity set expansion (Pantel et al., 2009).
Given a set of seed patterns expressing the target relation, LearnIt ranks all other patterns based
on their similarity to the known patterns and presents a ranked list to the user. The user adds good
patterns into the known pattern list. The process repeats iteratively. This allows the user to itera-
tively expand the list of patterns for the target relation with a small amount of effort. Similarly, it
also allows the user to select additional event pairs that indicate the target relation, if event pairs
were provided as seeds. This workflow is illustrated with the two orange self-loops in Figure 4. A
continuous vector representation (a.k.a., “embeddings”) of patterns and event pairs are learned
with an unsupervised Neural Network model described later on in this section.
Workflow 3+: User-composed workflows (WF) combining WF1 and WF2: The two iterative
learning approaches mentioned above do not have to be performed one after the other, or in a
particular order. The user can compose many workflows in flexible ways: for example, a user
could perform one iteration of bootstrapping and then a few iterations of expansions of pattern or
pair sets, and then go back to bootstrapping to explore previously unreachable semantic space. This
illustrates one of the many possible combinations; the user has the flexibility to compose their own
workflow to maximize efficiency given a small amount of effort.
Web-based user interface (UI): LearnIt’s web-based UI is shown in Figure 5. There are three
panes in the UI:
• Patterns (left): Displays patterns, ranked by a scoring method chosen by the user. Accepted 5, rejected and unknown patterns are shown in green, red and white backgrounds respectively.
• Pairs (right): Displays pairs, ranked by a scoring method chosen by the user. Accepted 6, rejected and unknown event pairs are shown in green, red and white backgrounds respectively.
• Instances (bottom): When a user clicks on a pattern or a pair, this pane will display a list of
instances matched by the pattern or the pair.
As described in the workflows, a user can
• Find new pattern/pairs through keyword search, by clicking on the ”ADD NEW” button
on the pattern/pair pane and then typing in keywords in an edit box. LearnIt will then return a
ranked list of matching patterns/pairs found in the corpus.
• Perform bootstrapping by clicking on the ”PROPOSE” button on the pattern/pair pane to
search for patterns from known pairs, or pairs from patterns, by exploring their shared in-
stances.
• Perform self-expansion of pattern/pair sets by clicking on the ”SIMILAR” button to search
for similar patterns given known patterns, or pairs given known pairs.

5 Accepted by the user as good patterns.
6 Accepted by the user as good pairs.
Figure 5: The web-based user interface of LearnIt.
Bootstrapping: A key ingredient to bootstrapping-based relation extraction systems is a scoring
function, which ranks unknown patterns and pairs given known pairs and patterns respectively.
An ideal function will rank patterns/pairs according to their true coverage and precision, leading
to minimal human effort required for bootstrapping. In practice, most ranking functions will end
up producing suboptimal results, given the limited amount of initial input from the user.
We allow the user to choose from a few scoring methods for patterns or pairs. A user can choose
a recall-driven scoring function to give the learning process a warm start, and then switch to a
precision-driven function so that she/he does not have to spend a lot of time reviewing a large
amount of highly specific patterns.
We implemented the following well-known scoring metrics (Agichtein, Gravano, 2000; Yu,
Agichtein, 2003) for patterns. At any given iteration, let P, N, U be the number of positive, negative, and unknown pairs respectively. The system updates the following scores for a pattern (a small sketch of these scoring functions follows this list):
• Frequency: defined as P, the number of known event pairs the pattern matches.
• Precision: C_precision = P / (P + w_neg · N + w_unk · U), where w_neg and w_unk are the relative weights of negative and unknown examples. We set them to 0.5 and 0.1 respectively.
• Frequency-weighted precision: C_fw-precision = C_precision · log2 P.
Similarly, the system updates the following scores for each pair given the current parameterization of all other variables:
• Frequency: defined as P′, the number of known patterns matching the event pair.
• Precision: C_precision = 1 − ∏_{i=0}^{|Patterns|} (1 − C_precision(pattern_i)), in which pattern_i ∈ Patterns, the set of patterns extracted from the instances matched by the pair.
• Frequency-weighted precision: defined as C_fw-precision = C_precision · log2 P.
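A small sketch of these scoring functions, using the weights given above (w_neg = 0.5, w_unk = 0.1), is shown below; function names are hypothetical.

```python
# Sketch of the pattern/pair scoring functions described in the text.
import math

W_NEG, W_UNK = 0.5, 0.1

def pattern_precision(P, N, U):
    denom = P + W_NEG * N + W_UNK * U
    return P / denom if denom else 0.0

def freq_weighted_precision(precision, P):
    return precision * math.log2(P) if P > 0 else 0.0

def pair_precision(matching_pattern_precisions):
    """1 - prod(1 - C_precision(pattern_i)) over patterns matched by the pair."""
    prod = 1.0
    for c in matching_pattern_precisions:
        prod *= (1.0 - c)
    return 1.0 - prod
```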
LearnIt dynamically updates these scores given its latest view of known patterns and pairs in each
iteration. The user interface allows the user to choose one of the scoring functions to rank patterns
or pairs.
Iterative Expansion of Pattern/Pair Sets: We applied TransE (Bordes et al., 2013), a relational
graph embedding model, to embed events and relational patterns into a continuous vector space.
The model scores a triple based on a distance function d(e_i + r_k, e_j) = ||e_i + r_k − e_j|| 7 between the left and right event after applying the relational pattern’s translation to the left. Let <e_i, r_k, e_j> be a relational triple in which event i is related to event j through relational pattern k 8.
TransE uses a ranking loss in which d for true relational triples should be smaller than d for corrupted triples (in which the relation is false). For training, the pairwise ranking loss described above is used with margin γ (a configurable hyperparameter). The loss across all training triples S and their corrupted counterparts S′ is:

L = Σ_{(i,j,k)∈S} Σ_{(i′,j′,k)∈S′} [γ + d(e_i + r_k, e_j) − d(e_i′ + r_k, e_j′)]

which is optimized by sampling positive and negative pairs in Stochastic Gradient Descent (SGD).
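A sketch of this margin ranking loss, assuming a PyTorch implementation and the standard TransE hinge (positive part), is shown below; batching and the corruption strategy are simplified.

```python
# Sketch of the TransE margin ranking loss (PyTorch assumed); the hinge keeps
# only the positive part, as in standard TransE, and corruption is simplified.
import torch

def transe_loss(e_i, r_k, e_j, e_i_neg, e_j_neg, gamma=1.0):
    """All arguments are (batch, dim) embedding tensors."""
    d_pos = torch.norm(e_i + r_k - e_j, p=2, dim=1)          # true triples
    d_neg = torch.norm(e_i_neg + r_k - e_j_neg, p=2, dim=1)  # corrupted triples
    return torch.clamp(gamma + d_pos - d_neg, min=0.0).mean()
```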
TransE treats events and patterns as symbols. To further capture the text of events and patterns 9, we followed (Toutanova et al., 2015) and represented the text of events and patterns in the relational embedding model. We modeled e_i, r_k, e_j with three Bi-directional LSTMs (Hochreiter,
Schmidhuber, 1997) over the sequences of words in event i, pattern k, and event j respectively.
The revised TransE+BiLSTM model is shown in Figure 6. For events, we used the snippets where
the event triggers appear. For proposition patterns, we linearized them following (Zhang et al.,
2017). The joint model is trained with SGD with the same loss function L.

Figure 6: The TransE+BiLSTM model for learning event and phrase embeddings.
We illustrated the position of event trigger words with ”[]”.
Using <event_1, pattern, event_2> triples generated from 1.5 million English Gigaword documents, we trained TransE+BiLSTM to generate embeddings for events and patterns. To see if the
embeddings can capture paraphrases, we calculated pairwise cosine similarity for all patterns, and
then sampled some pairs with similarity > 0.9 and showed them in Table 4. It shows that the embedding learning algorithm does project paraphrases to nearby areas in the embedding space.

7 We used Euclidean distance. Other distances can also be used, such as Manhattan, squared Euclidean, etc.
8 An example is <attack, led to, death>.
9 Knowing that “0 caused 1” and “1 is caused by 0” are similar in meaning.
Table 4: Paraphrase pairs learned with TransE+BiLSTM.
“V” is short for “verb”. *: Occasionally the model fails to capture directionality correctly.
Given a set of known patterns, we ranked all other patterns according to the cosine similarity of their embeddings to the average embedding of the known pattern set. Similarly, given a set of known event pairs, we ranked all other event pairs according to the cosine similarity of their embeddings to the average embedding of the known pairs. The embedding of an event pair is the concatenation of the embeddings of the two events in the pair.
Calculating pairwise similarity for millions of patterns is very time consuming. We applied an
efficient, approximate algorithm (Bachrach et al., 2014) 10 to calculate all pairwise similarities and
cache the results. When a user searches for similar patterns or pairs, LearnIt returns results instantly
because it only needs to perform lookups.
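A sketch of this precomputation-and-lookup scheme with Annoy is shown below; index parameters and function names are illustrative.

```python
# Sketch of the cached similarity lookup: index pattern embeddings once with
# Annoy, then answer "find similar patterns" queries against the average
# embedding of the known-pattern set.
import numpy as np
from annoy import AnnoyIndex

def build_pattern_index(pattern_vectors, n_trees=100):
    index = AnnoyIndex(pattern_vectors.shape[1], "angular")
    for i, vec in enumerate(pattern_vectors):
        index.add_item(i, vec)
    index.build(n_trees)
    return index

def similar_patterns(index, known_vectors, top_k=50):
    # Angular distance in Annoy is monotone in cosine similarity.
    query = np.mean(known_vectors, axis=0)
    return index.get_nns_by_vector(query, top_k)
```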
Evaluating LearnIt relation extraction: We randomly sampled 1.5 million documents from the
English Gigaword corpus (Parker et al., 2011) as our development corpus. We sampled another
500 documents from Gigaword as the test corpus. As a preprocessing step, we needed to tag the development corpus with events. We ran the SERIF NLP system (Boschee et al., 2005) over both corpora to generate eventive trigger words in the form of verb and nominalization predicates. We then asked our annotation team to aggregate these words and prune words that are not eventive according to Richer Event Description (O’Gorman et al., 2016). This results in 32.8 event mentions
(trigger words) per document on average.
For the test corpus, we asked our annotation team to annotate six types of temporal and causal rela-
tions exhaustively for pairs of events appearing in the same sentence. The relations are defined in
Table 5. 100 instances were dual annotated. This results in an inter-annotator agreement of 0.76.
The final annotation dataset contains 629 positive instances across the six types of relations.

Table 5: Causal and temporal relations between event X and Y


Customizing event-event relation extractors with LearnIt: Our annotators used the LearnIt
system to construct relation extractors for the six types of relations, using the 1.5 million docu-
ments as the development corpus. On average, the annotators spent 18.7 minutes per relation type
and found 134 patterns per type. We applied these patterns to the testing corpus to extract instances
of the six types of relations. The performance is shown in Table 6. With just under 20 minutes of

10 Implemented in https://github.com/spotify/annoy
user effort per type, the relation extractors customized by LearnIt achieved good performance. The
extractors perform better on some types (e.g., Cause) than other types. A reason, as reported by
our annotators, is that it is easier to find patterns for Cause than other types. We leave a compara-
tive study as our future work.

Table 6: Performance of LearnIt relation extractors


3.4.2 Deep Neural Networks for relation extraction
To extract local causal relations (i.e., causal relations appearing within a sentence), we developed
a Feature-rich Context-aware Deep Neural Network (FCDNN).

Figure 7: Architecture of FCDNN


Figure 7 shows the architecture of our FCDNN, which combines a DNN with a rich set of causal-
indicative features for causal relation extraction. It concatenates the output of a CNN with hand-
crafted features, passed through multiple hidden layers to learn high level representations. A Soft-
Max layer predicts the existence of a causal relation.
The FCDNN models cause, effect, and textual context 11 with three CNNs. The FCDNN also uses
intuitively important features including: causal cues (e.g., because), discourse connectives (e.g.,
therefore), syntactic/lexical connections between Causal Mentions (CMs), links to ontology clas-
ses, external semantic resources (e.g., lists of terrorist groups), distance between CM sentences,
etc.
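At a high level, the FCDNN forward pass can be sketched as follows, assuming PyTorch; the three CNN encoders are treated as given modules, and all dimensions are illustrative assumptions.

```python
# High-level sketch of the FCDNN forward pass (PyTorch assumed): three CNN
# encoders plus hand-crafted features, concatenated and passed through hidden
# layers to a softmax over {causal, not causal}. Dimensions are illustrative.
import torch
import torch.nn as nn

class FCDNN(nn.Module):
    def __init__(self, cnn_cause, cnn_effect, cnn_context,
                 cnn_dim, n_handcrafted, hidden=256, n_classes=2):
        super().__init__()
        self.cnn_cause, self.cnn_effect, self.cnn_context = \
            cnn_cause, cnn_effect, cnn_context
        self.mlp = nn.Sequential(
            nn.Linear(3 * cnn_dim + n_handcrafted, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def forward(self, cause_tokens, effect_tokens, context_tokens, features):
        # features: (batch, n_handcrafted) hand-crafted causal indicators
        x = torch.cat([self.cnn_cause(cause_tokens),
                       self.cnn_effect(effect_tokens),
                       self.cnn_context(context_tokens),
                       features], dim=1)
        return torch.softmax(self.mlp(x), dim=-1)
```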
To learn implicit relations between a pair of CMs, we harvested corpus-level causal associations: we trained targeted causal embeddings with neural network models, using bootstrapped cause-effect pairs harvested by LearnIt from a large corpus.
Data sources: We used the following three types of causal relation instances to train the FCDNN:
• Causal relations annotated by LDC: We used the causal relation annotation dataset pro-
duced by Linguistic Data Consortium (LDC) under the CauseEx program. We concatenated

11 Context can be a single sentence, or the surrounding sentences.
all instances of causal relation annotation from all LDC causal relation annotated datasets to train FCDNN.
• Instances labeled automatically by LearnIt: We applied LearnIt bootstrap learning to cu-
rate relational patterns, and then manually reviewed the patterns for accuracy. This results in
dozens of propositional or lexical patterns per relation type. We then applied these patterns
over the Gigaword (Parker et al., 2011) corpus to produce 1.5 million training instances au-
tomatically.
• Examples curated via crowdsourcing: We used Amazon Mechanical Turk (AMT) to crowdsource examples that express one of the causal relation types. We curated about 1,500 causal relation instances via this process.
We then trained two models: the first model is trained with the causal relations annotated by LDC. The second model is trained with the instances labeled automatically by LearnIt. The two
models are combined in a pipeline in which we merged the predicted instances by both models,
to maximize recall. Due to the small size and the lower quality of the crowdsourced instances,
we did not use the crowdsourced data in training models.
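The recall-oriented merge of the two models' predictions can be sketched as follows; the prediction fields and deduplication key are assumptions.

```python
# Sketch of the recall-oriented merge: union the relation instances predicted
# by the LDC-trained and LearnIt-trained models. Field names are assumptions.
def merge_predictions(ldc_model_preds, learnit_model_preds):
    merged = {}
    for pred in list(ldc_model_preds) + list(learnit_model_preds):
        key = (pred["cause_span"], pred["effect_span"], pred["relation_type"])
        # keep the higher-confidence duplicate
        if key not in merged or pred["score"] > merged[key]["score"]:
            merged[key] = pred
    return list(merged.values())
```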
4.0 RESULTS AND DISCUSSION

4.1 Experimentation and Evaluation


Under this effort, we participated in a variety of evaluations, technical assessments and collabora-
tive experiments. We supported multiple experiments and evaluation activities, including three
separate embed experiments with government transition partners in June 2021, August 2021 and
April 2022. We also supported integration and collaboration efforts between BBN and other World
Modelers performers. The final embed experiment, in April 2022, will occur after the end of
BBN’s contracted period of performance and therefore any feedback from government transition
partners will be unavailable for inclusion in this report.
The April 2022 embed experiment will focus on two use cases: a WRI and DSMT-E team analyz-
ing a Food and Water Early Warning Hub use case and a Department of Defense/USAFRICOM
climate change use case. In preparation for the embed experiment, we exercised our LearnIt event
customization approach to make targeted improvements to Hume’s extraction of relevant concepts,
e.g. “food security” and “crop failure”. We also deployed our OIAD clustering service to aid users
in developing ontologies tailored to the embed use cases.

4.2 Event Customization Case Study: Intervention


In order to evaluate the effectiveness of our event extraction system in customizing extractors for
new event types, we conducted a case study on extracting interventions from humanitarian-assis-
tance program literature. In this section, we present the details of that case study.
4.2.1 Intervention Event Ontology
We focused on modeling interventions, or humanitarian assistance actions, that are meant to
alleviate mass suffering, improve socioeconomic conditions, and maintain human dignity. We list
our intervention ontology in Table 7. Types include promotion of anti-retroviral healthcare,
promoting respect for human rights, ensuring child-friendly learning spaces, management of
sexual violence, therapeutic feeding of the severely malnourished, vector control of insects and
pests, and provision of various humanitarian aid such as cash, food, etc.
4.2.2 An Intervention Corpus
Humanitarian assistance programs are associated with various types of documentation: project
proposals, guidance documents, progress reports, and evaluation reports on program execution.
These documents are ideal for mining intervention instances. We collected several hundred
documents from the following sources:
• Literature reviews, e.g., the REFANI review
(www.actionagainsthunger.org/sites/default/files/publications/REFANI-lit-review-2015 0.pdf),
which reviews Cash Transfer Programmes and their impact on malnutrition in humanitarian
contexts.
• Programme/project guidelines, e.g., the Sphere Handbook, which lists universal standards in
core areas of humanitarian response.
• Evaluation documents, which range from thorough external evaluations of intervention
operations (e.g., bmcnutr.biomedcentral.com/articles/10.1186/s40795-016-0102-6) to brief
presentations of results in programme documents and post-intervention summary articles.
• Programme documents by implementers (e.g., one.wfp.org/operations/current
operations/project docs/200275.pdf).
• Academic studies, such as quasi-experimental studies from ebrary.ifpri.org.

Intervention Type | Example Snippet
anti-retroviral treatment | postpartum ARV drugs may also be given to infants
capacity building human rights | mission personnel are also engaged in building the capacity of
national authorities to promote and respect human rights
child friendly learning spaces | promotes quality education for indigenous girls and boys through
child-friendly learning environments
provision of goods and services
• provide cash | cash distributions during emergencies
• provide delivery kit | distributing a home delivery kit to every pregnant woman
• provide education kit | developing and freely distributing education materials
• provide farming tool | the scope of the program encompasses provision of fertilizer
• provide fishing tool | restoration of livelihoods through provision of fishing boats and fishing
equipment
• provide food | food aid is often supplied in emergency situations together with seed aid
• provide hygiene tool | respond to humanitarian emergencies always aim to distribute soap routinely
• provide livestock feed | where they were provided with fodder
• provide seed | food aid is often supplied in emergency situations together with seed aid
• provide veterinary service | providing free or subsidized animal health services
sexual violence management | health professionals expected to provide post-rape care
therapeutic feeding or treating | therapeutic food provided in supplementary feeding centers
vector control | Malathion is commonly used to control mosquitoes
Table 7: Types of interventions with example text snippets, where event triggers are italicized.
4.2.3 Annotating Intervention Instances
We provided definitions and text examples for the intervention types to two annotators, and then
asked them to identify and annotate intervention instances for each document. Annotators were
provided with a User Interface (UI) similar to the one described in 3.2.2, which allowed them to
search for examples efficiently. 30 documents are annotated by two annotators, resulting in an
inter-annotator agreement of 0.83. A total of 976 intervention instances (triggers) were found for
the target intervention types. The “Count” column of Table 8 shows numbers of examples for each
intervention type.

Intervention Type Count F1-score
anti-retroviral treatment 114 0.59
capacity building human rights 65 0.68
child friendly learning spaces 39 0.30
provision of goods and services 432 0.77
sexual violence management 131 0.45
therapeutic feeding or treating 49 0.52
vector control 146 0.67
Aggregate 976 0.68
Table 8: Intervention types with number of trigger examples and F1-scores.
Counts and scores based on 5-fold cross validation.

4.2.4 Evaluating Intervention Performance


In this section, we first present experiments in extracting interventions (triggers), and then describe
early results on extracting locations and times for interventions. As shown in Table 7, a large
number of the intervention types (e.g., provide cash, provide delivery kit) have to do with provision
of goods and services. Although we kept the labeling of these interventions separate during the
annotation process, so that we could optionally perform fine-grained evaluation (and indeed we do
later in this section), we found that these interventions share common trigger words (e.g., “provide”,
“provision”, “distribute”) and rely on their arguments (e.g., “cash”, “fertilizer”, “fishing boats”) for
disambiguation. Hence, we perform two sets of trigger classification evaluations: coarse-grained
and fine-grained.
Coarse-grained trigger classification: Our annotated trigger examples are spread across 240
documents. We perform 5-fold cross-validation to evaluate trigger classification over the seven
intervention types shown in Table 8. In each fold, we use 20% of the documents as test data and
the remainder as training data. We performed minimal hyper-parameter tuning, using 30 epochs
and a batch size of 40; these values were found to achieve good performance in preliminary
experiments in which we further split the training data into training and development sets. We
follow Chen et al. (2015) for the values of the remaining hyper-parameters, e.g., a CNN filter size
of 3, position and event embeddings of length 5, etc. In our evaluation, a trigger is correctly
classified if its intervention event type and offsets match those of a reference trigger. We show the
coarse-grained trigger classification scores in the “F1-score” column of Table 8. We obtained a
micro-averaged F1-score of 0.68 from the cross-validation experiments.
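For concreteness, the following self-contained sketch implements the scoring rule just described:
a predicted trigger counts as correct only when its intervention type and offsets match a reference
trigger, and micro-averaged precision, recall, and F1 are computed over the pooled predictions.
Representing a trigger as a (type, start, end) tuple is an assumption made for this illustration.

# Sketch of the trigger scoring rule: correct iff the intervention event type
# and offsets match a reference trigger. Triggers are (event_type, start, end)
# tuples here, an illustrative assumption.
def score_triggers(predicted, gold):
    """Micro-averaged precision, recall, and F1 over pooled trigger predictions."""
    gold_set = set(gold)
    correct = sum(1 for trigger in predicted if trigger in gold_set)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1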
Analysis of the decoding results shows that examples vary greatly in difficulty. For example, for
“sexual violence management”, some triggers are phrases such as “post-rape care” that are
straightforward for a classifier to recognize, given sufficient training data. However, there is a
long tail of examples in which long-range dependencies need to be resolved in order to type them
correctly. For example, in “counseling ... sexual violence” and “clinical management ... sexual
abuse”, the trigger words are often more than 5 tokens away from the additional contextual clues
that indicate the target type. We leave modeling these cases as future work.
Fine-grained trigger classification: As mentioned earlier in this section, we propose that the
intervention type provision of goods and services relies on its event arguments (the artifacts
involved) for disambiguation into the finer-grained interventions listed in Table 7. To enable this,
we first need to detect mentions of different goods/services in text. We adopt a simple list-based
approach, in which we manually compiled lists of descriptors for each category (the lists are
available at https://github.com/BBN-E/mr-intervention). For instance, we use the descriptors
(“livestock feed”, “fodder”, “hay”, etc.) for the category livestock feed.
Then, when our coarse-grained trigger model predicts a trigger instance of provision of goods and
services in a sentence, we check the trigger’s surrounding context (a five-token window) for
mentions of livestock feed, farming tool, fishing tool, etc. We then deterministically re-label
provision of goods and services with the appropriate finer-grained intervention type, depending
on which category of descriptor is present in the trigger’s context window. As shown in Table 8,
the coarse-grained F1-score of provision of goods and services is 0.77. After performing the
deterministic re-labeling into finer-grained intervention types, we obtain an aggregate F1-score of
0.56 when evaluating against our fine-grained trigger labels. Recall misses, such as those resulting
from incomplete descriptor lists, and precision misses, resulting from multiple descriptor categories
being present within a trigger’s surrounding context, contributed to the drop in F1-score.
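The deterministic re-labeling step can be sketched as follows. The abbreviated descriptor lists,
the function name, the simple substring matching, and the handling of ambiguous contexts are
illustrative assumptions rather than the released resources or Hume's exact logic.

# Sketch of the deterministic re-labeling: if a descriptor from a goods/services
# category appears within a five-token window of a "provision of goods and
# services" trigger, re-label the trigger with the corresponding fine-grained
# type. Descriptor lists are abbreviated examples only.
DESCRIPTORS = {
    "provide livestock feed": ["livestock feed", "fodder", "hay"],
    "provide farming tool": ["fertilizer", "hoe", "plough"],
    "provide fishing tool": ["fishing boat", "fishing net"],
    "provide cash": ["cash"],
}

COARSE_TYPE = "provision of goods and services"

def relabel(tokens, trigger_index, window=5):
    """Re-label a coarse-grained trigger using descriptors found in its context window."""
    context = " ".join(tokens[max(0, trigger_index - window): trigger_index + window + 1])
    matches = [fine for fine, words in DESCRIPTORS.items()
               if any(w in context for w in words)]
    # When several descriptor categories appear in the window, this sketch just
    # takes the first match; as noted above, such cases can cause precision errors.
    return matches[0] if matches else COARSE_TYPE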
Extracting locations and times: We leverage the ACE corpus, which contains annotations of
Place and Time event arguments, to train an event-type-independent Place/Time argument
classifier, based on the neural architecture described in Section 3.2.1. In our evaluation, an
argument is correctly classified if its event type, event argument role, and offsets match any of
the reference event arguments.
African countries are often the focus sites of humanitarian programs and agencies, such as the
World Food Programme (WFP). Hence, to evaluate the performance of our argument model for
intervention events, we randomly selected 250 documents from around 6,000 documents collected
from allafrica.com.
We first apply our coarse-grained trigger classifier on these documents. We then ask annotators to
evaluate trigger predictions and retain only correct ones (188 triggers), which we subsequently use
to evaluate our argument classifier. We focus on using correct triggers to evaluate argument clas-
sification, to prevent error propagation (from erroneous trigger predictions) from muddling a fair
assessment of the argument classifier.
Our annotators assigned a total of 15 Time arguments and 77 Place arguments to the 188 event
triggers. Our argument classifier predicted a total of 12 Time arguments, giving a precision, recall,
and F1 of 0.92, 0.73, and 0.81 respectively. Our argument classifier predicted 30 Place arguments,
giving a precision, recall, and F1 of 0.93, 0.36, and 0.52 respectively.
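As a quick arithmetic check of the Place-argument scores, the reported precision implies that
roughly 28 of the 30 predicted Place arguments were correct (an inference from the rounded
figures above); the snippet below reproduces the reported values from those counts.

# Arithmetic check: ~28 correct of 30 predicted Place arguments, 77 gold.
correct, predicted, gold = 28, 30, 77
precision = correct / predicted                       # ~0.93
recall = correct / gold                               # ~0.36
f1 = 2 * precision * recall / (precision + recall)    # ~0.52
print(round(precision, 2), round(recall, 2), round(f1, 2))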

5.0 CONCLUSIONS
During this effort, we made technical progress in event and causal factor extraction from text,
tools to aid ontology development, causal relation extraction from text, and experimentation
support. Here we provide a brief summary of the key results we accomplished:
Event extraction from text. We developed event extraction techniques that achieved state-of-
the-art results on publicly available benchmark datasets. To support evolving operational needs
that rely on an evolving set of events and causal factors, we also developed two additional
techniques. First, we developed on-demand rapid customization techniques, which require only
a small number of user interactions per new event type to build machine learning models for the
types of interest.
Ontology-In-A-Day. Our human-in-the-loop OIAD clustering service helps streamline the
manual process of constructing an ontology for each new domain and use case. This sort of
human-machine synergy leverages IE’s massive reading ability to discover factors and relations
and organize them in a way that aligns with human understanding of complex problems, reducing
overall human effort.
Causal relation extraction from text. Our causal relation extraction techniques include a
pattern-based bootstrap learning approach and a deep neural network model that improves recall
while maintaining a similar level of precision. The approach achieved state-of-the-art
performance. However, causal relation extraction for new domains and for implicit causal
relations remains a challenging problem. We believe more labeled data can help; however,
annotating causal relations is a difficult and time-consuming task. We developed crowdsourcing
techniques to reduce the effort required to curate a large training dataset and performed multiple
rounds of pilot studies to iterate on the annotation guidelines. We are optimistic that this approach
can overcome the data bottleneck given additional research and development effort.
Experimentation and Evaluation. We supported multiple experiments and evaluation activities,
including embed experiments with government transition partners. We also supported the pro-
gram-wide integration efforts between the BBN team and other performers.
6.0 REFERENCES

Agichtein E., Gravano L. (2000). Snowball: Extracting relations from large plain-text collec-
tions. In Proceedings of the fifth ACM conference on Digital libraries, pages 85–94. ACM.
Bachrach Y., Finkelstein Y., Gilad-Bachrach R., Katzir L., Koenigstein N., Nice N., Paquet U.
(2014). Speeding up the Xbox recommender system using a Euclidean transformation for
inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender Systems,
pages 257–264. ACM.
Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don’t count, predict! a systematic comparison
of context-counting vs. context-predicting semantic vectors. In ACL-2014, pages 238–247.
Bordes A., Usunier N., Garcia-Duran A., Weston J., Yakhnenko O. (2013). Translating embed-
dings for modeling multi-relational data. In Advances in neural information processing systems,
pages 2787–2795.
Boros E. (2018). Neural Methods for Event Extraction. Ph.D. thesis, Universite Paris-Saclay.
Boschee E., Weischedel R., Zamanian A. (2005). Automatic information extraction. In
Proceedings of the International Conference on Intelligence Analysis, volume 71. Citeseer.
Chen Y., Xu L., Liu K., Zeng D., Zhao J. (2015). Event extraction via dynamic multi-pooling
convolutional neural networks. In ACL-IJCNLP 2015, pages 167–176.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirec-
tional transformers for language understanding. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Doddington G. R., Mitchell A., Przybocki M. A., Ramshaw L. A., Strassel S. M., Weischedel R.
M. (2004). The automatic content extraction (ace) program - tasks, data, and evaluation. In
LREC.

Dunietz J., Levin, L., & Carbonell, J (2017). BECauSE Corpus 2.0: Annotating Causality and
Overlapping Relations, In Proceedings of the 11th Linguistic Annotation Workshop.
Freedman, M. and Gabbard, R. (2014). Overview of the event argument evaluation. In Proceed-
ings of TAC KBP 2014 Workshop, National Institute of Standards and Technology, pages 17–
18.
Grishman, R. and Sundheim, B. (1996). Message understanding conference-6: A brief history. In
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
Gupta P., Roth B., Schutze H. (2018). Joint bootstrapping machines for high confidence relation
extraction. arXiv preprint. arXiv:1805.00254.
Hochreiter S., Schmidhuber J. (1997). Long short-term memory. Neural computation, 9(8):1735–
1780.
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., ... & Liu, Q. (2019). TinyBERT: Distil-
ling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
Lin D., Pantel P. (2001). Dirt: discovery of inference rules from text. In Proceedings of the sev-
enth ACM SIGKDD international conference on Knowledge discovery and data mining, pages
323–328. ACM.
Meyers, A., Kosaka, M., Xue, N., Ji, H., Sun, A., Liao, S., and Xu, W. (2009). Automatic recog-
nition of logical relations for English, Chinese and Japanese in the GLARF framework. In Pro-
ceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions
(SEW '09). Association for Computational Linguistics, USA, 146–154.
Min, B., Freedman, M., Meltzer T. (2017). Probabilistic Inference for Cold Start Knowledge
Base Population with Prior World Knowledge. In Proceedings of EACL 2017.
Mintz M., Bills S., Snow R., Jurafsky D. (2009). Distant supervision for relation extraction with-
out labeled data. In ACL-IJCNLP, pages 1003–1011.
O’Gorman T., Wright-Bettner K., Palmer M. (2016). Richer event description: Integrating event
coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop
on Computing News Storylines (CNS 2016), pages 47–56.
Pantel P., Crestan E., Borkovsky A., Popescu A., Vyas V. (2009). Web-scale distributional simi-
larity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing: Volume 2, pages 938–947. Association for Computational Lin-
guistics.
Pantel, P. and Lin, D. (2002). Discovering word senses from text. In Proceedings of the eighth
ACM SIGKDD international conference on Knowledge discovery and data mining (SIGKDD-
02), pages 613– 619.
Parker R., Graff D., Kong J., Chen K., Maeda K. (2011). English gigaword. Linguistic Data Con-
sortium.
Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word
representation. In Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP) (pp. 1532-1543).
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2020). DistilBERT, a distilled version of BERT:
smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). Introduction to information retrieval, vol-
ume 39. Cambridge University Press Cambridge.
Toutanova K., Chen D., Pantel P., Poon H., Choudhury P., Gamon M. (2015). Representing text
for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Em-
pirical Methods in Natural Language Processing, pages 1499–1509.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf,
R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le
Scao, T., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. (2020). Transformers: State-of-the-art
natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing: System Demonstrations, pages 38–45, Online, October. Associa-
tion for Computational Linguistics.
Yang, J., Han, S. C., & Poon, J. (2022). A survey on extraction of causal relations from natural
language text. Knowledge and Information Systems, 1-26.
Yu H., Agichtein E. (2003). Extracting synonymous gene and protein terms from biological liter-
ature. Bioinformatics, 19:i340–i349.
Zhang S., Duh K., Van Durme B. (2017). Mt/ie: Cross-lingual open information extraction with
neural sequence-to-sequence models. In Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 64–70.
LIST OF SYMBOLS, ABBREVIATIONS, AND ACRONYMS

(ACE) Automatic Content Extraction


(AMT) Amazon Mechanical Turk
(BBN) Bolt, Beranek, and Newman
(BERT) Bidirectional Encoder Representations from Transformers
(CAG) Causal Analysis Graph
(CBC) Clustering By Committee
(CM) Causal Mention
(CNN) Convolutional Neural Network
(DNN) Deep Neural Network
(EM) Event Mention
(FCDNN) Feature-rich Context-aware Deep Neural Network
(HIL) Human In the Loop
(LearnIt) Learning relation extractors Iteratively
(LDC) Linguistic Data Consortium
(ML) Machine Learning
(NLP) Natural Language Processing
(NN) Neural Network
(OIAD) Ontology In A Day
(TA) Technical Area
(TAC KBP) Text Analysis Conference Knowledge Base Population
(UI) User Interface
(WE) Word Embeddings
(WF) Workflows
