
Institute of Natural Language Processing

University of Stuttgart

Proposal for Master Thesis

Fact-Checking Biomedical COVID-19 Tweets

Isabelle Mohr

Time frame: June 2021 - November 2021

Supervisors
PD Dr. Roman Klinger
Amelie Wührl
1 Motivation
Over the course of the last 14 months, information about the COVID-19 pandemic and about government action to control the spread of the virus has been disseminated widely. Simultaneously, the COVID-19 crisis has elicited a global response in the scientific community, leading to a burst of new information regarding the pathophysiology of the virus, new treatments for infected patients as well as vaccines in production (Chahrour et al., 2020).
Besides traditional news media outlets, social media platforms are places where important news and biomedical articles are shared among many users (Haustein et al., 2014). Reaching large networks of individuals in a short amount of time is, of course, paramount to curbing the spread of a virus like SARS-CoV-2. However, the stream of information on social media sites also requires critical thought and filtering, since it contains not only truthful and useful information, but also false or misleading information which may be detrimental to efforts to control the pandemic.
Consider the event at the end of March 2020, when nearly 2,200 Iranians suffered methanol poisoning, with 900 patients admitted to intensive care units and 296 resulting fatalities (Soltaninejad, 2020). Patients reported that social media messages had suggested that consuming alcohol could prevent infection with SARS-CoV-2. This illustrates how content posted online can be a real danger in stressful times, such as during a pandemic, when individuals are fearful and more susceptible to taking drastic measures.
With this in mind, the term “infodemic” was coined by the World Health Organization (WHO) to refer to the global epidemic of misinformation surrounding COVID-19 (Zarocostas, 2020), which makes it difficult for users to find reliable sources of information. Gathering and internalizing false information from news media or social media without verifying its correctness may thus affect individuals’ mental health, daily lives, as well as behaviours in public.
Users on social media platforms may find it too great a task to verify each statement they come into contact with online. Thus, the infodemic poses a challenge as well as an opportunity for researchers to implement and apply new data-driven technologies that alleviate this burden. This can be done by automatically fact-checking information and presenting the user with, for example, a truthfulness rating, which can help the user decide whether or not to believe something.
Although datasets of fact-checked claims and databases containing known facts exist, at the time of writing this proposal there are no known resources that specifically address the truthfulness of biomedical information circulating on social media. The proposed project will attempt to contribute to this area of research by creating a corpus of fact-checked biomedical tweets and investigating multiple approaches to fact-checking to discover their feasibility, advantages and disadvantages in this domain.

2 Background
This section introduces important background information in the form of terminology, descriptions of tasks that are similar to but distinct from fact-checking, and what fact-checking has traditionally entailed as a task in journalism. Section 3 will then dive into related work in the field of automatic fact-checking.

2.1 Terminology and Related Tasks

Generally, the task of fact-checking is considered as assessing whether claims made in written or spoken language are true (Thorne and Vlachos, 2018). In contrast, some researchers have defined fake news detection as well as stance detection as the classification of whether an article body supports its headline or not (Pomerleau and Rao, 2017; Ferreira and Vlachos, 2016). This task differs somewhat from fact-checking in that it uses the body of the text to substantiate the headline, which is often itself a claim. Fact-checking instead usually sources external evidence that confirms or refutes the claim or statement.
Additionally, the term “fake news” has been used recently in the context of US presiden-
tial elections to refer to media organizations of opposing political parties and their stories,
which has little to do with the task of actually verifying or fact-checking information
(Thorne and Vlachos, 2018). In the current research, the term “fake news” will therefore
be avoided altogether.
Distinctions have also been made in journalism with regard to the words fact-checking and verification. Put simply, the task of verification relies more on retrieving the source, date and location of statements in order to ‘verify’ their origin (Silverman, 2016). In a complementary way, fact-checking then focuses on addressing the claim’s logic, coherence and context. Verification of a claim is therefore an important first step in the journalistic process of fact-checking, which helps to assess the trustworthiness of the source.
A further distinction can potentially be made between the terms misinformation and disinformation: the former will here be used to refer to information that is somehow incomplete or inaccurate given its context, while the latter carries the additional motive of purposefully deceiving the reader (Jowett and O’Donnell, 2006). An example may be the headline “People seek refuge in Islam by reading Quran to cure COVID-19”, which is categorized as misinformation, in that reading the Quran does not cure COVID-19. Simultaneously, it can be further categorized as disinformation in that it may be purposefully disseminated to deceive readers. As disinformation forms a subset of misinformation defined by intent, it is somewhat more difficult to detect and distinguish. The current research will therefore focus on misinformation as a general category, without attempting to differentiate between different kinds of misinformation.
Furthermore, fact-checking sites like Snopes and Politifact have designed their own classes and ratings for articles and claims, for example “Pants on Fire” (Politifact), which is defined as “...rated statements [that] are demonstrably false and make a ridiculous claim, major exaggeration or make fear-mongering statements with the intent to provoke a panic reaction” (Holan, 2020). Others make use of graded ratings like “Partially False” and “Mostly False”, where minor details of a claim are accurate, but the claim itself is still false, illustrating the complexity that some claims and arguments may have.

2.2 Journalism

The task of fact-checking has historically been performed by trained professionals, who
dive into literature, databases and reputable news sources in order to fact-check claims in
speeches, debates and news stories using published figures and known facts as evidence
for their verdict (Cohen et al., 2011). This process can take anywhere from a few minutes
for a simple claim, to a few days, when working on whole documents with complex claims
(Hassan et al., 2015).

As a whole, the journalism landscape has changed over the last decade, with fewer reporters gathering original material than in preceding generations (Cohen et al., 2011). Traditional news media has given way to online news organizations, many of which perform the role of aggregators, reading others’ blogs and news reports in order to reformulate their own version of the information. In this process, information often becomes distorted, as it might in a game of broken telephone. This stands in contrast to the investigative reporting seen in newspapers, which prides itself on obtaining truthful information through deep research into a topic of interest over long periods of time. Unfortunately, this form of journalism is expensive and time consuming, and has started falling out of fashion with the rise of online news organizations.
This societal change has resulted in the appearance of fact-checking websites, such as
Snopes.com and Politifact.com, which try to address the spread of misinformation online.
These sites recruit individuals, often using crowdsourcing platforms, who perform the
task of fact-checking on viral claims, assess their credibility and offer a verdict in the
form of a ‘truthfulness rating’ along with evidence (Holan, 2020). As already mentioned,
this effort is time-consuming and cannot address a meaningful fraction of the information
being spread online, especially not in real-time. With an extreme increase in information
made publicly available online every minute, it only makes sense to attempt automation
of this process. The journalism community has made calls for database researchers to
aid human fact-checkers in their work by developing tools that automate parts of this
process (Cohen et al., 2011). The following section will discuss semantic networks and
knowledge graphs, which have played a role in recent fact-checking approaches.

2.3 Semantic Networks and the Development of Knowledge Graphs

The term ‘knowledge graph’ was coined as early as 1972, when it was used to refer to the design of semantic networks with edges restricted to a set of relations in order to enable algebras on the graph (Kejriwal, 2020). For some context, semantic networks are generally knowledge bases that represent semantic relations between concepts. An example is WordNet, a lexical database of English that captures semantic relationships between words and meanings. Although the distinction between semantic networks and knowledge graphs is rather blurry, the term ‘knowledge graph’ was brought into common use in 2012, when Google introduced its Knowledge Graph.
Knowledge graphs have been particularly popular in web search and information retrieval, as they provide the architecture of an interconnected network and can thereby replace simple keyword queries that could only retrieve documents containing the keywords. Instead, information can be retrieved by querying the graph with a decomposed version of a string query, for example “How many students are enrolled in Winter 2020 in Emotion Analysis?”, which can be decomposed into entities (‘students’ and ‘Emotion Analysis’) and relations (‘enrolled in Winter 2020’), while constraining the answer to be numerical. Semantic Search, an example of this, is possibly the most influential milestone in modern knowledge graph research, used in the Google Knowledge Graph (Bizer et al., 2009) over the last decade.
Recent research shows that knowledge graphs may offer a stepping stone in the automation of fact-checking, as they have an inherently structured nature. Knowledge graphs contain three important types of objects: entities, attributes and relations (Hogan et al., 2021). Entities are the key objects of interest in the graph, while attributes are characteristics of the entities in the graph. Relations are directed edges connecting either two entities, or expressing the fact that an entity has an attribute. Further constraints are placed on all components in the graph, which are made explicit in a ‘domain ontology’. For example, in a ‘university’ graph, only a ‘student’ entity may have the attribute ‘name’, while a ‘course’ entity should not be allowed to have this attribute. More generally, knowledge graphs are inherently structured, allowing complex types of entities, attributes and their interrelationships to be declared with corresponding constraints. Because of this structure, knowledge graphs allow logical inference for retrieving implicit knowledge rather than only allowing queries that seek explicit knowledge (Hogan et al., 2021). Current work in COVID-19 related knowledge graphs, as well as some applications to fact-checking, is reviewed in Section 3.4.
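To make this structure concrete, the following toy sketch (purely illustrative, not part of the proposed system) stores a tiny ‘university’ graph as plain Python triples and answers the numeric query from above by explicit lookup; a real knowledge graph would additionally support inference over a domain ontology.

```python
# Toy knowledge graph: entities with types, attributes, and directed relations.
graph = {
    "entities": {"Alice": "Student", "Emotion Analysis": "Course"},
    "attributes": [("Alice", "name", "Alice Miller")],   # only Student entities may carry "name"
    "relations": [("Alice", "enrolled_in_winter_2020", "Emotion Analysis")],
}

def students_enrolled(graph, course):
    # Explicit lookup of matching relation edges; ontology-based inference would go beyond this sketch.
    return [s for s, rel, o in graph["relations"]
            if rel == "enrolled_in_winter_2020"
            and o == course
            and graph["entities"].get(s) == "Student"]

print(len(students_enrolled(graph, "Emotion Analysis")))  # numeric answer: 1
```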

3 Related Work
This section reviews work that has been done with regard to fact-checking from a computational perspective, first considering what input should be fact-checked (Section 3.1), then reviewing some approaches to modelling the fact-checking task (Section 3.2), and lastly looking at datasets (Section 3.3) as well as knowledge graphs (Section 3.4) that may be used to this end.

3.1 What should be fact-checked?

The first component in the automation of fact-checking is deciding what the content of
interest for this task should be (Thorne and Vlachos, 2018). The first consideration lies in
the format or granularity of the input, which in some applications is a textual claim, con-
sisting of a short sentence taken from a longer document. This is a common unit for this
task on fact-checking websites like the ones mentioned above. Sometimes, surrounding
sentences may also be available to the fact-checker, in order to present useful context.
Another important question to ask is what constitutes a claim and how these can be
automatically retrieved or extracted, particularly when dealing with whole documents.
Fact-checks are usually only done on content that is not deemed “subjective” and can
thus be verified objectively, avoiding any claims that are not inherently verifiable (Co-
hen et al., 2011). This falls into the tasks of subjectivity detection and claim detection,
where a claim is an assertion or statement that something is true, usually surrounded by
different supporting elements to formulate an argument (Toulmin, 2003). Here, claims
must first be identified and extracted and then fact-checked, increasing the complexity of
the automation of this task. In related work, claims have been extracted in the form of
triples by performing relation extraction (Vlachos and Riedel, 2015) and through super-
vised sentence-level classification (Hassan et al., 2015).

3.2 Computational Approaches to Fact-Checking

This section will discuss approaches that address automatic fact-checking from two per-
spectives, which are here termed unstructured and structured approaches.

3.2.1 Unstructured Approaches

An unstructured approach based on supervised learning, which does not consider external evidence to score a claim with a truthfulness rating, instead uses surface-level linguistic features of the claim itself to make a prediction (Rashkin et al., 2017). This stands in contrast to how journalists and fact-checkers have traditionally tackled the task. Instead, the approach relies entirely on the features of the claim, how it is presented and written, rather than considering true facts about the world. These features can be captured with bag-of-words (BOW) approaches, language models and lexicons. For example, approaches like LSTM-text (Rashkin et al., 2017) require substantial feature modelling and rich lexicons, which capture differences between credible and misleading language, in order to make an assessment. For instance, if a text contains many superlatives (e.g. “biggest”, “richest”) and adverbs (e.g. “shockingly”, “extremely”, “surprisingly”), it could be untruthful. Although simplistic, these approaches can achieve substantial results, with accuracy of up to 62% in distinguishing true from false claims (Rashkin et al., 2017).
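As a rough illustration of this idea, the sketch below counts superlatives and dramatising adverbs using tiny hand-made lexicons (placeholders, not the lexicons of Rashkin et al. (2017)) and feeds the counts to a linear classifier; the training examples are invented purely for demonstration.

```python
from sklearn.linear_model import LogisticRegression

# Tiny stand-in lexicons; a real system would use much richer resources.
SUPERLATIVES = {"biggest", "richest", "best", "worst"}
ADVERBS = {"shockingly", "extremely", "surprisingly"}

def features(claim):
    tokens = claim.lower().split()
    return [sum(t in SUPERLATIVES for t in tokens),   # superlative count
            sum(t in ADVERBS for t in tokens),        # dramatising adverb count
            len(tokens)]                              # claim length

# Invented toy training data: 0 = false claim, 1 = true claim.
X = [features("the biggest cover-up ever, shockingly ignored"),
     features("cases rose by two percent last week")]
y = [0, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([features("an extremely surprising and shocking statement")]))
```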
Another unstructured approach attempts to model the traditional process of fact-checking by combining a credibility assessment of the source of the claim with attempts at retrieving substantiating articles or resources that provide evidence for or against the claim. A model that attempts to mimic the entire journalistic process is DeClarE (Popat et al., 2018), which uses a neural architecture to judiciously aggregate information from external evidence articles retrieved from the internet, the language of these articles and the trustworthiness of their sources. Particular words in the retrieved article are focused on by means of an attention mechanism that considers both the claim and the retrieved article. An advantage of retrieving articles from the web is that they not only help to make informed decisions, but can also be used to provide the end-user with an explanation as to why a claim was deemed true or false.
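The numpy fragment below loosely sketches the claim-conditioned attention idea (it is not the DeClarE architecture itself): evidence-word vectors that are similar to the claim representation receive higher weights before the evidence is aggregated into a summary vector.

```python
import numpy as np

def attend(claim_vec, evidence_vecs):
    scores = evidence_vecs @ claim_vec        # one relevance score per evidence word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over evidence words
    return weights @ evidence_vecs            # attention-weighted evidence summary

rng = np.random.default_rng(0)                # random vectors stand in for learned embeddings
summary = attend(rng.normal(size=16), rng.normal(size=(40, 16)))
print(summary.shape)                          # (16,)
```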

3.2.2 Structured Approaches

The above approaches do not take advantage of the structured nature of information available in databases. Thorne and Vlachos (2018) observe two types of formulations that use this type of information as evidence.
The first approach is to retrieve the element in a database that provides the information that supports or refutes the claim under investigation. Vlachos and Riedel (2015) make use of this in order to fact-check claims with statistical properties, for example claims about population size or inflation rates, using the semi-structured knowledge base Freebase. In this approach, a claim is decomposed into (subject, predicate, object) triples, which are then checked against the database. If a relevant triple is found, a prediction is made on how likely this claim is to be true. This is done by computing an error score between the claimed values and the values retrieved from the database in a rule-based way.
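A heavily simplified sketch of this rule-based checking step is shown below; the toy knowledge base and the error threshold are illustrative assumptions, not values from Vlachos and Riedel (2015).

```python
# Toy knowledge base mapping (subject, predicate) pairs to stored numeric values.
KB = {("Germany", "population"): 83_200_000}

def check_numeric_claim(subject, predicate, claimed_value, tolerance=0.05):
    true_value = KB.get((subject, predicate))
    if true_value is None:
        return "Unverified"                       # nothing to compare against
    error = abs(claimed_value - true_value) / true_value
    return "True" if error <= tolerance else "False"

print(check_numeric_claim("Germany", "population", 83_000_000))   # True
print(check_numeric_claim("Germany", "population", 60_000_000))   # False
```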
A second approach uses a knowledge graph as a topology in order to predict how likely a claim, which is expressed as a node in a graph, is to be true (Ciampaglia et al., 2015). This approach relies on the ‘plausibility’ of claims being represented in the topology, and although implausible statements are unlikely to be true, this is not a deterministic way of negating a statement’s truthfulness. Additionally, improbable but believable claims have a greater probability of going viral and should therefore be assessed more rigorously by fact-checking systems (Thorne and Vlachos, 2018). A review of a selection of knowledge graphs can be found in Section 3.4.
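The fragment below sketches a strongly simplified version of this idea, scoring a claimed link by the shortest-path distance between its entities in a toy graph; Ciampaglia et al. (2015) use a more refined measure that also down-weights paths passing through high-degree, generic nodes.

```python
import networkx as nx

# Toy undirected graph of concepts; edges stand in for known relations.
g = nx.Graph()
g.add_edges_from([("vaccine", "immune response"),
                  ("immune response", "protection"),
                  ("vaccine", "injection site pain")])

def plausibility(graph, subj, obj):
    # Shorter paths between the claim's entities are read as higher plausibility.
    if subj not in graph or obj not in graph or not nx.has_path(graph, subj, obj):
        return 0.0
    return 1.0 / nx.shortest_path_length(graph, subj, obj)

print(plausibility(g, "vaccine", "protection"))   # 0.5
```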

3.3 Datasets

A different source of evidence that can possibly be used as part of a supervised learning approach is datasets of previously fact-checked claims. An example of this is a dataset of 221 labeled claims in the political domain that were manually fact-checked by Politifact and Channel4 (Vlachos and Riedel, 2014). As this dataset is rather small, a larger one in the same domain was later released with 12,800 labelled claims from Politifact (Wang, 2017). In this dataset, claims come with metadata concerning speaker affiliation and the context the claim appeared in (for example “speech” or “tweet”), and because it is much larger, it can be used to train and evaluate machine learning models for fact-checking.
Another interesting dataset is the FakeCovid dataset (Shahi and Nandini, 2020), which
consists of 5182 fact-checked news articles concerning COVID-19 collected from January
2020 till May 2020 from various fact-checking websites. The dataset contains articles in
40 languages from 105 different countries and potentially presents an interesting source
of information for unstructured approaches to fact-checking, which could use this infor-
mation to fine-tune language models.

3.4 Knowledge Graphs

Over the last decade, applications of domain-specific knowledge graphs have been explored, many of which aid scientific research (Kejriwal, 2020). In particular, when the COVID-19 pandemic started escalating, a team at Verizon Media started building a COVID-19 knowledge graph called YK-COVID-19 (Nagpal, 2020). The YK-COVID-19 dataset is updated multiple times a day, extracting statistics from hundreds of sources from around the world. Another effort, the ‘COVID-19 Knowledge Graph’, captures the pathophysiology of COVID-19 and consolidates knowledge from 145 academic research papers published on the topic (Domingo-Fernández et al., 2020).
Needless to say, there are many efforts to construct KGs for different COVID-related data sources, some based on scientific literature, others concerning confirmed case and mortality data, to mention only a few. These are separate resources that often follow various different ontologies and frameworks. This is a real challenge of KGs, which lack a consistent use of vocabulary for nodes and edges and lack standardized schemas, as they are often built as ‘silos’ for a particular use case (Kejriwal, 2020). To combat this problem, both the COVID-19 Knowledge Graph (Domingo-Fernández et al., 2020) and KG-COVID-19 (Reese et al., 2021) use the Biolink Model (Biolink) as their ontology. This is a high-level data model for representing biological and biomedical knowledge, consisting of entities, associations, predicates and properties. An example from this ontology is the entity Disease, which is of class DiseaseOrPhenotypicFeature and has the URI biolink:DiseaseOrPhenotypicFeature. The nature of this model allows it to be easily translated to RDF or Neo4j format to create graphs.
Although social media presents a useful and important source of information, there does not yet seem to be any concerted effort to make greater use of the COVID-related data circulating there. One effort which uses Twitter as a source of data is GeoCoV19 (Qazi et al., 2020), which consists of more than half a billion multilingual tweets, each with an inferred geolocation, collected between February and April 2020. However, there do not seem to be any current attempts at leveraging the large amounts of data produced on social media platforms for building knowledge graphs (Kejriwal, 2020).

3.5 Analyses of COVID-19 on Social Media

Although there are no knowledge bases taking advantage of information circulating on social media, some efforts have turned their attention to the spread of information online through large-scale analyses. An example is the large-scale analysis of the diffusion of information regarding COVID-19 on social media performed by Cinelli et al. (2020). They looked to Twitter, Instagram, YouTube, Reddit and Gab for discourse surrounding the pandemic, analysing the evolution and spread of information on each platform. Analogous to tracking the spread of the virus itself in a population, the authors model the information spreading with epidemic models characterized by the basic reproduction number R0. This basic reproduction number is the expected number of cases directly generated by one case in the population (Cinelli et al., 2020).
By analysing how many posts are created with COVID-19 content as well as users’ engagement with COVID-19 content (in the form of commenting, reposting, mentioning, hashtags, etc.), they find that each platform has unique interactional patterns, with the highest volume of interactions in terms of commenting and posting observed on YouTube and Twitter.
In real epidemics, R0 > 1 highlights the possibility of a pandemic. Cinelli et al. (2020) reformulate this to suit the case of information spread online, where R0 > 1 now suggests the possibility of an infodemic. They find that although each platform has its own R0, each one of them at the time of analysis (leading up to 14 Feb 2020) was shown to be at ‘supercritical’ levels (Cinelli et al., 2020).
Additionally, the authors conducted an analysis of the diffusion of different kinds of sources on each platform. Various sources were tagged as either ‘questionable’ or ‘reliable’ according to the independent fact-checking organization Media Bias/Fact Check, and the spread (sharing) and interaction (reactions, comments) of their content was tracked. The authors find that different volumes of misinformation circulate on different platforms, with Gab being the platform with the highest amount of reactions and comments on questionable content. However, there is no significant difference in the spreading patterns of content produced by reliable and questionable sources on the different platforms. The authors argue that their analysis shows that the spread of information, whether reliable or not, is at supercritical levels during times of crisis. This substantiates the use of the term ‘infodemic’ by WHO to denote the global epidemic of misinformation surrounding COVID-19.

4 Goal of Thesis
In order to address the problem of false information spreading on social media platforms, the goal of this research is to find useful methodologies to fact-check statements in an automatic way. This will be done by comparing two approaches, the first being an unstructured approach and the second being a structured approach. The source of the data is Twitter, restricting the domain to tweets concerning COVID-19 produced in April/May 2021. The research questions of this work are the following:

RQ1: Can a structured approach to fact-checking, using a Knowledge Graph, achieve better results than an unstructured approach that uses Language Models?

RQ2: How reliable is such a system/approach in its assessment of true and false claims?

RQ3: What are the properties of instances that can be found with only one of the two approaches?

Optional goals, which will be undertaken if time permits:

RQ4: How can the two fact-checking pipelines be improved through their various components (e.g. transfer learning for entity-relation extraction)?

The following sections outline the steps that will be taken in order to meet the goals formulated here.

5 Approach
In order to answer the research questions outlined in Section 4, the following steps of research and implementation will be carried out:

5.1 Dataset Creation

Tweets will be collected using the Twitter API, searching only through recent tweets (occurring up to a week prior). Only tweets that contain the search terms covid and cause will be collected, so as to select tweets that potentially contain a causal relationship. At least one of the following terms must also occur in the tweet: treatment, vaccine, side effect or symptom, which is an attempt at capturing biomedically themed tweets. No retweets will be collected. Each tweet will be collected with its corresponding ID and date, as shown in the example instance in Table 1 below; a sketch of the corresponding API query follows the table.

Date:  2021-04-27 16:05:08
ID:    13819732...
Tweet: @razprowess Pain at the injection site is the commonest side effect of covid vaccine...the cause of the pain is still largely unknown.

Table 1: Example biomedical tweet
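One way this collection step could be implemented is sketched below, using the Tweepy client for the Twitter API v2 recent-search endpoint (which covers roughly the last seven days); the bearer token is a placeholder and the exact query syntax may need adjusting.

```python
import tweepy

client = tweepy.Client(bearer_token="BEARER_TOKEN")   # placeholder credential

# covid AND cause, plus at least one biomedical term, excluding retweets.
query = 'covid cause (treatment OR vaccine OR "side effect" OR symptom) -is:retweet'

response = client.search_recent_tweets(
    query=query,
    tweet_fields=["created_at"],
    max_results=100,
)
for tweet in response.data or []:
    print(tweet.created_at, tweet.id, tweet.text)      # date, ID and text per tweet
```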

Since the collected tweets contain many non-biomedical tweets, the collection will
be filtered to result in a set of only biomedical tweets. This will be done by manually
selecting a starting set of 500 biomedical tweets, which will then be used to train a k-
means model (k=2) to filter the remaining collection of tweets into biomedical and non-
biomedical.
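One possible shape of this filtering step, assuming TF-IDF features (a choice not fixed in the proposal), is sketched below: k-means with k=2 is fit on the collected tweets together with the 500 seed tweets, and the cluster containing the majority of seeds is kept as biomedical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def filter_biomedical(collected_tweets, seed_biomedical):
    vectorizer = TfidfVectorizer(min_df=2, stop_words="english")
    X = vectorizer.fit_transform(collected_tweets + seed_biomedical)
    labels = KMeans(n_clusters=2, random_state=0).fit(X).labels_
    seed_labels = labels[len(collected_tweets):]
    # The cluster holding most of the manually selected seed tweets is taken to be "biomedical".
    bio_cluster = int(seed_labels.sum() * 2 >= len(seed_labels))
    return [t for t, l in zip(collected_tweets, labels[:len(collected_tweets)])
            if l == bio_cluster]
```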
The next step will be to annotate the biomedical tweets with entities and relations,
where entities are [Treatment, Vaccine, Side-effect, Disease, Symptom, Patient], and rela-
tions are either Cause or DoesNotCause.
Because this is a laborious task, named entities will be extracted using preexisting
models (for example, ScispaCy (Neumann et al., 2019) for biomedical named entity
recognition). Thereafter, relations between entities will be annotated manually.
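Pre-annotation with ScispaCy could look roughly as follows, assuming the en_ner_bc5cdr_md model (which tags diseases and chemicals) is installed; its label set would still have to be mapped onto the entity classes listed above.

```python
import spacy

nlp = spacy.load("en_ner_bc5cdr_md")   # ScispaCy NER model for diseases and chemicals

doc = nlp("Pain at the injection site is the commonest side effect of covid vaccine.")
for ent in doc.ents:
    print(ent.text, ent.label_)        # candidate entity spans to be corrected manually
```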
The final step in the dataset creation requires the instances to be manually fact-checked. To do this, the claim should be entered into Google Search to find a supporting research or news article. If one is found, the hyperlink to the article should be saved, and the claim labeled as True or False. If no supporting evidence can be found within 2 minutes, the claim may be labeled as Unverified.

5.2 Implementation of Unstructured Approach

The unstructured approach will use pretrained embeddings (e.g. BERT) and fine-tune these on the biomedical tweet domain by training only on instances labelled as True. In order to distinguish true from false instances, the probability of a sequence of words will be the determining factor, relying on the language model to score more probable sequences as True and less frequently seen sequences as False. The motivation behind this is to see whether, and to what degree, such an unstructured approach is able to make accurate predictions, as well as to see where this model fails.
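A minimal sketch of how such a score could be computed is given below, using the pseudo-log-likelihood of a masked language model from the transformers library; the base checkpoint and the decision threshold are placeholders, and the model fine-tuned on True tweets would be loaded instead.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")   # placeholder for the fine-tuned model
model.eval()

def pseudo_log_likelihood(text):
    """Sum the log-probability of each token when it is masked in turn."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Higher (less negative) scores would be read as more plausible, i.e. closer to True.
print(pseudo_log_likelihood("Pain at the injection site is a common side effect of the covid vaccine."))
```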

5.3 Implementation of Structured Approach

The structured approach will attempt to make use of the annotated entity-relation labels
by building a Knowledge Graph. This will be done using the Biolink ontology, and popu-
lating the graph with facts from the gathered biomedical tweets. The fact-check will then
be performed by searching the graph for a particular node that represents a claim, and
computing the probability that it is True or False (labels may still be subject to change).
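A minimal sketch of what the graph lookup could look like with rdflib is given below; the example namespace, the causes predicate and the handling of missing edges are illustrative assumptions rather than the final design.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/covid/")         # placeholder namespace for tweet-derived entities
BL = Namespace("https://w3id.org/biolink/vocab/")   # Biolink vocabulary namespace

g = Graph()
# One fact extracted from an annotated tweet: covid vaccine causes injection-site pain.
g.add((EX["covid_vaccine"], BL["causes"], EX["injection_site_pain"]))

def check(subject, obj):
    if (EX[subject], BL["causes"], EX[obj]) in g:
        return "True"
    # Absence of an edge is treated here as "Unverified" rather than "False".
    return "Unverified"

print(check("covid_vaccine", "injection_site_pain"))   # True
```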

5.4 Comparisons and Evaluation

Since the goal is ultimately to compare and contrast the effectiveness of the two approaches to the task of fact-checking, both approaches will be evaluated on the same gold testing data, to see which comes closer to human judgement. Additionally, an error analysis will be done to see which instances each pipeline succeeded on and which it struggled with the most.
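A small sketch of this shared evaluation is given below, assuming both pipelines emit one of True, False or Unverified for each gold-annotated test tweet; the predictions shown are invented placeholders.

```python
from sklearn.metrics import classification_report

gold = ["True", "False", "True", "Unverified"]
unstructured_pred = ["True", "True", "True", "Unverified"]       # placeholder outputs
structured_pred = ["True", "False", "Unverified", "Unverified"]  # placeholder outputs

for name, pred in [("unstructured", unstructured_pred), ("structured", structured_pred)]:
    print(name)
    print(classification_report(gold, pred, zero_division=0))
```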

6 Project Plan
The following table presents a sketch of the time line for the thesis with major milestones.
Dates represent the spans in which writing/research and implementations should be com-
pleted.

Time                Writing/Research                          Implementation
31 May - 06 June    Misinformation on Twitter                 Collect tweets
07 June - 04 July   Entity-Relation extraction                Annotate Entity-Relations (including distant labelling)
05 July - 22 Aug    Fact-checking using crowdsourcing         Crowdsource fact-checking of instances
23 Aug - 19 Sept    Unstructured approach to fact-checking    LM Baseline Fact-checking Architecture
20 Sept - 17 Oct    KGs and fact-checking                     Build KG
18 Oct - 14 Nov     (buffer)                                  KG Baseline Fact-checking Architecture
15 Nov - 29 Nov     Results, Comparison & Discussion          (improve baselines if time permits)

Table 2: Planned Time Table

References
Biolink. Biolink Model. https://biolink.github.io/biolink-model/. Accessed:
2021-05-05.
C. Bizer, Tom Heath, and T. Berners-Lee. 2009. Linked Data - The Story So Far. Int. J. Semantic
Web Inf. Syst., 5:1–22.
Mohamad Chahrour, Sahar Assi, Michael Bejjani, Ali A. Nasrallah, Hamza Salhab, Mohamad
Fares, and Hussein H. Khachfe. 2020. A Bibliometric Analysis of COVID-19 Research Activ-
ity: A Call for Increased Output. Cureus.
Giovanni Luca Ciampaglia, Prashant Shiralkar, Luis M. Rocha, Johan Bollen, Filippo Menczer,
and Alessandro Flammini. 2015. Computational Fact Checking from Knowledge Networks.
PLOS ONE, 10(6):1–13.
Matteo Cinelli, Walter Quattrociocchi, Alessandro Galeazzi, Carlo Michele Valensise, Emanuele
Brugnoli, Ana Lucia Schmidt, Paola Zola, Fabiana Zollo, and Antonio Scala. 2020. The
COVID-19 social media infodemic. Scientific Reports, 10.
Sarah Cohen, James Hamilton, and Fred Turner. 2011. Computational Journalism: A Call to Arms
to Database Researchers. Commun. ACM, 54:66–71.
Daniel Domingo-Fernández, Shounak Baksi, Bruce Schultz, Yojana Gadiya, Reagon Karki,
Tamara Raschka, Christian Ebeling, Martin Hofmann-Apitius, and Alpha Tom Kodamullil.
2020. COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge
model of COVID-19 pathophysiology. Bioinformatics. Btaa834.
William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classifica-
tion. In Proceedings of the 2016 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pages 1163–1168, San Diego,
California. Association for Computational Linguistics.
Naeemul Hassan, Bill Adair, James Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong
Yu. 2015. The Quest to Automate Fact-Checking. Proceedings of the 2015 Computation +
Journalism Symposium.
Stefanie Haustein, Isabella Peters, Cassidy R. Sugimoto, Mike Thelwall, and Vincent Larivière.
2014. Tweeting biomedicine: An analysis of tweets and citations in the biomedical literature.
Journal of the Association for Information Science and Technology, 65(4):656–669.
Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio
Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres,
Roberto Navigli, Axel-Cyrille Ngonga Ngomo, Sabbir M. Rashid, Anisa Rula, Lukas
Schmelzeisen, Juan Sequeda, Steffen Staab, and Antoine Zimmermann. 2021. Knowledge
Graphs.
Angie Drobnic Holan. 2020. The Principles of the Truth-O-Meter: PolitiFact’s methodology for
independent fact-checking. https://www.politifact.com/article/2018/feb/
12/principles-truth-o-meter-politifacts-methodology-i/. Accessed:
2021-04-03.
Garth S. Jowett and Victoria O’Donnell. 2006. What is propaganda, and how does it differ from
persuasion? In Propaganda and Misinformation, chapter 1. Sage Publishers.
Mayank Kejriwal. 2020. Knowledge Graphs and COVID-19: Opportunities, Challenges, and
Implementation. Harvard Data Science Review. https://hdsr.mitpress.mit.edu/pub/xl0yk6ux.
Amit Nagpal. 2020. Yahoo Knowledge Graph Announces COVID-19 Dataset, API, and
Dashboard with Source Attribution. https://developer.yahoo.com/blogs/
616566076523839488/. Accessed: 2021.04.09.
Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and Robust
Models for Biomedical Natural Language Processing. In Proceedings of the 18th BioNLP
Workshop and Shared Task, pages 319–327, Florence, Italy. Association for Computational
Linguistics.
Dean Pomerleau and Delip Rao. 2017. Fake News Challenge. http://
fakenewschallenge.org/. Accessed: 2021-04-03.
Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE:
Debunking Fake News and False Claims using Evidence-Aware Deep Learning.
Umair Qazi, Muhammad Imran, and Ferda Ofli. 2020. GeoCoV19: A Dataset of Hundreds of
Millions of Multilingual COVID-19 Tweets with Location Information. CoRR, abs/2005.11177.
Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of
varying shades: Analyzing language in fake news and political fact-checking. In Proceedings
of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP ’17,
pages 117–130.
Justin T. Reese, Deepak Unni, Tiffany J. Callahan, Luca Cappelletti, Vida Ravanmehr, Seth Car-
bon, Kent A. Shefchek, Benjamin M. Good, James P. Balhoff, Tommaso Fontana, Hannah
Blau, Nicolas Matentzoglu, Nomi L. Harris, Monica C. Munoz-Torres, Melissa A. Haendel,
Peter N. Robinson, Marcin P. Joachimiak, and Christopher J. Mungall. 2021. KG-COVID-19:
A Framework to Produce Customized Knowledge Graphs for COVID-19 Response. Patterns,
2(1):100155.
Gautam Kishore Shahi and Durgesh Nandini. 2020. FakeCovid- A Multilingual Cross-domain
Fact Check News Dataset for COVID-19. Association for the Advancement of Artificial Intelli-
gence.
Craig Silverman. 2016. Verification handbook: Additional materials. http://
verificationhandbook.com/additionalmaterial/. Accessed: 2021-04-09.
Kambiz Soltaninejad. 2020. Methanol mass poisoning outbreak, a consequence of COVID-19
pandemic and misleading messages on social media. Int J Occup Environ Med, pages 148–150.
James Thorne and Andreas Vlachos. 2018. Automated Fact Checking: Task Formulations, Meth-
ods and Future Directions. In Proceedings of the 27th International Conference on Computa-
tional Linguistics, pages 3346–3359. Association for Computational Linguistics.
Stephen E. Toulmin. 2003. The Uses of Argument, 2 edition. Cambridge University Press.
Andreas Vlachos and Sebastian Riedel. 2014. Fact Checking: Task definition and dataset construc-
tion. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational
Social Science, pages 18–22, Baltimore, MD, USA. Association for Computational Linguistics.
Andreas Vlachos and Sebastian Riedel. 2015. Identification and Verification of Simple Claims
about Statistical Properties. Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, page 2596–2601.
William Yang Wang. 2017. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News
Detection. CoRR, abs/1705.00648.
John Zarocostas. 2020. How to fight an infodemic. The Lancet, 395.
