
In recent years, many datasets related to fact checking have been publicly released. These datasets have been significant in advancing research in the domain of fact verification and have played a crucial role in the development of automated fact-checking systems. The available resources have grown in both quantity and variety, covering a range of languages, types of claims, and sources. Consequently, researchers can now draw on a wide variety of data that accurately reflects the complicated and nuanced nature of the fact-checking domain. These datasets have encouraged innovation by facilitating the training and assessment of complex machine learning models, natural language processing techniques, and information retrieval methods. This has strengthened fact-checking systems, which are of the highest priority in an era where the distribution of accurate information holds great significance. Fact-checking datasets can be broadly divided into two distinct categories, as described below.

4.1. Veracity Detection Datasets without Supporting Evidence

4.2. Veracity Detection Datasets with Supporting Evidence
4.2.1 Retrieving Supporting Evidence from the Web
4.2.2 Retrieving Supporting Evidence from Wikipedia Pages
4.2.3 Retrieving Supporting Evidence from a Local Repository
4.2.4 Supporting Evidence as Premise Articles

4.1. Veracity Detection Datasets without Supporting Evidence


4.1.1. LIAR Dataset


The LIAR dataset was introduced by Wang (2017) for detecting fake news. The dataset was systematically compiled by the author and comprises a total of 12,800 statements annotated directly from the POLITIFACT.COM website. In contrast to the FEVER dataset, which is constructed exclusively from Wikipedia entries, the LIAR dataset is specifically aimed at fake news detection, since it is composed exclusively of news content such as tweets, interviews, and Facebook posts. An instance consists of a statement, the individual who made it, and a short surrounding context of a word or a line, such as “presidential election” or “presidential announcement speech”. Each instance is labeled with one of six pre-established fine-grained classes (such as false, barely-true, etc.). The model proposed in that work is a hybrid framework integrating Convolutional Neural Networks (CNNs) with bidirectional Long Short-Term Memory (BiLSTM) networks; this combined architecture predicts the label from both the statement and the accompanying metadata.
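As an illustration, the sketch below parses LIAR's tab-separated release into the fields described above; the column layout shown is an assumption based on the commonly distributed TSV files and should be checked against the dataset's own README.

```python
import csv

# Assumed column layout of the LIAR TSV release (an assumption, not the
# official spec; verify against the README shipped with the dataset).
FIELDS = ["id", "label", "statement", "subject", "speaker", "job", "state",
          "party", "barely_true_ct", "false_ct", "half_true_ct",
          "mostly_true_ct", "pants_fire_ct", "context"]

def load_liar(path):
    """Yield one dict per claim: statement, speaker, context, and one of
    the six fine-grained labels (true, ..., pants-fire)."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield dict(zip(FIELDS, row))

for example in load_liar("train.tsv"):
    print(example["label"], "|", example["statement"][:60], "|", example["context"])
    break
```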
4.1.2. FakeNewsNet
FakeNewsNet is a multi-dimensional data repository that contains two datasets without supporting evidence. The datasets consist of news content, social context, and dynamic information. They contain a diverse range of features, which offers an opportunity to explore various methodologies for detecting fake news and to gain insight into how false information spreads across social networks, thereby enabling potential interventions. Furthermore, the dynamic information allows early-detection methods for fake news to be examined by generating synthetic user engagements based on the temporal user-engagement patterns in the dataset. A comprehensive examination of how false information disseminates can also be conducted by identifying sources and influencers and by formulating better tactics for intervening in the spread of fake news.

4.1.3. FacebookHoax
The dataset consists of data about posts from Facebook pages that are associated with scientific news
(verified) and conspiracy pages (unverified), which were gathered using the Facebook Graph API. The
dataset comprises a total of 15,500 posts extracted from 32 distinct pages without any supporting
evidence provided, consisting of 14 pages dedicated to conspiracy-related content and 18 pages focused
on scientific topics.

4.1.4. GossipCop
GossipCop, another name for Suggest [16], analyzes fake news from US entertainment and celebrity
stories that are posted on websites and in magazines. Each story is given a score between 0 and 10, which
indicates that the news is entirely factual, and 0 indicates that the rumor is entirely fake or fictitious.
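For studies that need a binary signal rather than the 0-10 scale, the score can be thresholded; the cut-off of 5 below is purely an illustrative assumption, not part of the dataset specification.

```python
def gossipcop_binary_label(score: float, threshold: float = 5.0) -> str:
    """Map GossipCop's 0-10 credibility score to a coarse binary label.

    0 means entirely fake and 10 entirely factual; the threshold of 5
    is an illustrative assumption, not part of the dataset itself.
    """
    return "real" if score >= threshold else "fake"

assert gossipcop_binary_label(8.5) == "real"
assert gossipcop_binary_label(1.0) == "fake"
```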

4.1.5 FakeCovid

FakeCovid is a multilingual, cross-domain dataset of 5,182 fact-checked news articles about COVID-19, collected from 92 fact-checking websites and covering 40 languages. Each article carries the verdict assigned by the original fact-checkers, without accompanying supporting evidence.

4.2. Veracity Detection Datasets with Supporting Evidence


4.2.1 Retrieving Supporting Evidence from the Web
4.2.1.1 Snopes Dataset

Popat et al. (2017) introduced a corpus containing a significantly larger number of verified claims than earlier resources. The dataset consists of 4,956 claims along with their corresponding verdicts, sourced from the Snopes website as well as Wikipedia's collections of confirmed hoaxes and fictitious people. For each claim, the researchers retrieved approximately 30 related documents from the web using the Google search engine, yielding a total collection of 136,085 documents.
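The retrieval step can be pictured as follows; `search_web` is a hypothetical stand-in for the search-engine wrapper (the original work used Google), and the sketch only mirrors the top-30-documents-per-claim setup described above.

```python
from typing import Callable, Dict, List

def collect_evidence(claims: List[str],
                     search_web: Callable[[str, int], List[str]],
                     k: int = 30) -> Dict[str, List[str]]:
    """For every claim, keep the top-k search results as candidate
    evidence documents. `search_web(query, k)` is a hypothetical
    wrapper around a search engine and must be supplied by the caller."""
    return {claim: search_web(claim, k) for claim in claims}
```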
4.2.1.2 MultiFC

In contrast to the FEVER dataset, whose claims are generated from Wikipedia, the MultiFC dataset proposed by Augenstein et al. [7] gathers information from 26 fact-checking websites. Specifically, the researchers compiled 34,918 claims, each paired with the top evidence pages retrieved from the web to verify it, together with its context and rich metadata. The authors performed a thorough analysis to identify characteristics of the dataset, such as the entities mentioned in the claims. Because the claims were extracted from diverse domains, each with its own distinct set of labels, the heterogeneous label space is a notable challenge of this dataset.
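One common way to cope with this heterogeneity is to map each site's verdicts onto a shared coarse scale before training; the mapping below is a minimal sketch with invented example entries, not MultiFC's official label set.

```python
# Hypothetical per-site verdicts mapped to a shared coarse scale.
# The entries are illustrative assumptions, not MultiFC's label set.
COARSE_MAP = {
    "pants on fire!": "false",
    "four pinocchios": "false",
    "half-true": "mixture",
    "mostly true": "true",
}

def normalize_label(site_label: str) -> str:
    """Collapse a site-specific verdict into a shared coarse label."""
    return COARSE_MAP.get(site_label.lower(), "unmapped")
```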

4.2.1.3 ClaimBuster


The dataset consists of sentences from US presidential election debates, annotated with reference to well-known fact-checking websites such as POLITIFACT.com. It comprises 23,533 sentences classified into three categories, i.e., check-worthy factual statements, unimportant factual statements, and non-factual statements. The dataset covers over 50 years of debate history, includes discussion of general issues, and was labeled by 101 participants over an interval of two years. Evidence for the check-worthy claims is found in a local repository that holds all the information related to each claim.
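A minimal sketch of the sentence-triage step is given below; the three category codes are shorthand for the classes named above and are assumptions about the released annotation format.

```python
# Shorthand codes for the three categories described above (assumed,
# not necessarily the dataset's official column values).
CHECK_WORTHY = "CFS"   # check-worthy factual statement
UNIMPORTANT = "UFS"    # unimportant factual statement
NON_FACTUAL = "NFS"    # non-factual statement

def select_check_worthy(sentences):
    """Keep only check-worthy sentences for downstream evidence lookup.
    `sentences` is an iterable of (text, category) pairs."""
    return [text for text, category in sentences if category == CHECK_WORTHY]
```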

4.2.2 Retrieving Supporting Evidence from Wikipedia Pages

4.2.2.1 FEVER Dataset


Thorne et al. (2018a) introduced the FEVER corpus, which is now regarded as the most comprehensive accessible corpus for fact-checking purposes. The corpus was built from roughly 50,000 popular Wikipedia documents. Annotators modified sentences from these pages to generate claims, labeled each claim, and marked additional sentences within the articles that either substantiate or contradict the claim, serving as evidence. A corpus of this kind enables models to be trained for three typical sub-tasks: document retrieval, evidence extraction, and claim validation.
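The three sub-tasks compose naturally into a pipeline; the skeleton below only fixes the interfaces, with the actual retrieval and classification components left as placeholders to be supplied by the system builder.

```python
from typing import Callable, List

def fever_pipeline(claim: str,
                   retrieve_docs: Callable[[str], List[str]],
                   extract_evidence: Callable[[str, List[str]], List[str]],
                   classify: Callable[[str, List[str]], str]) -> str:
    """Skeleton of the three FEVER sub-tasks; the three callables are
    placeholders for real retrieval, sentence-selection, and NLI models."""
    documents = retrieve_docs(claim)                # 1. document retrieval
    evidence = extract_evidence(claim, documents)   # 2. evidence extraction
    return classify(claim, evidence)                # 3. claim validation
```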
4.2.2.2. Climate-FEVER
The dataset consists of 1,535 real-world climate-related claims, with supporting evidence retrieved from Wikipedia. A clustering technique helped identify more than 20 distinct topics, such as claims concerning “sea-level rise”, “climate change in the Arctic”, and “climate change and global warming”. The evidence-labeling task produced a dataset of 1,535 claims, each with an annotated set of five evidence candidates. The distribution of the aggregate claim labels REFUTES, SUPPORTS, DISPUTED, and NOT_ENOUGH_INFO is 253 (16.5%), 655 (42.67%), 153 (9.97%), and 474 (30.88%), respectively.
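The aggregate labels can be understood through the rule sketched below, which derives a claim-level label from the five evidence annotations; treating the co-occurrence of supporting and refuting evidence as DISPUTED is one plausible reading of the aggregation, stated here as an assumption.

```python
def aggregate_claim_label(evidence_labels):
    """Derive a claim-level label from its evidence labels. The rule that
    mixed supporting and refuting evidence yields DISPUTED is an assumed
    reconstruction of the aggregation, not quoted from the paper."""
    has_support = "SUPPORTS" in evidence_labels
    has_refute = "REFUTES" in evidence_labels
    if has_support and has_refute:
        return "DISPUTED"
    if has_support:
        return "SUPPORTS"
    if has_refute:
        return "REFUTES"
    return "NOT_ENOUGH_INFO"

assert aggregate_claim_label(["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"]) == "DISPUTED"
```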

4.2.2.3 WikiFactCheck-English
WikiFactCheck-English is a dataset comprising more than 124k triples of claim, context, and evidence document, extracted from English Wikipedia articles and their citations, together with more than 34k handwritten claims that are contradicted by the evidence documents. The authors present it as the largest fact-checking dataset to date containing actual real-world claims and supporting evidence, and it is intended to facilitate the creation of fact-checking systems that can more effectively handle such claims and data.

4.2.3 Retrieving Supporting Evidence from a Local Repository


4.2.3.1 PolitiFact14
The PolitiFact14 dataset was constructed for examining the fact-checking problem, drawing on Channel 4's fact-checking blog and annotations from the PolitiFact website. The corpus contains claims, the evidence used by fact-checkers to validate them, metadata, and dates. As this was very early work in automated fact-checking, Vlachos and Riedel (2014) focused only on examining and analyzing the task. The corpus is also very small, containing only 106 claims.
4.2.3.2. PERSPECTRUM
The dataset comprises a total of 907 claims, 11,164 perspectives, and 8,092 evidence paragraphs. It was constructed by collecting initial data from online debate websites, which was then enriched with additional and paraphrased content to generalize it. Crowdsourcing was used to improve the data's quality and to eliminate annotation noise. The objective of the dataset is to support systems that can identify a diverse range of well-supported perspectives that take a stance with respect to a given claim.
4.2.3.3. SciFact
The dataset comprises 1,409 expert-written claims paired with a corpus of 5,183 abstracts that serve as evidence supporting or refuting the claims. The abstracts were drawn from S2ORC, a corpus of millions of scientific articles. The quality of the dataset was assessed by annotators who verified the claims: three NLP experts, five graduate students studying the life sciences, and five life-science undergraduates.
4.2.3.4 PUBHEALTH Dataset
The PUBHEALTH dataset comprises 11.8K claims, each paired with a journalist-approved, gold-standard explanation supporting the claim's fact-check label. The dataset is specific to health topics, including biomedical subjects (e.g., infections, diseases, and cells) along with government healthcare policies and other public-health stories. The claims were gathered from five reputable fact-checking websites (Snopes, PolitiFact, FactorFiction, FactCheck, and FullFact). Additionally, 9,023 claims were gathered from the health sections and health tags of the Associated Press and Reuters news websites, while 2,700 claims were sourced from the news review site Health News Review (HNR).

4.2.3.5 Emergent
The Emergent dataset for rumor debunking was derived from a digital journalism project. It includes 2,595 news articles related to 300 rumored claims. Each claim carries a veracity classification (true, false, or unverified), and each article is labeled with its stance toward the claim: supporting, refuting, or merely observing it.
4.2.3.6 FACTIFY
The FACTIFY dataset includes evidence documents linked to written claims and images, making it the first multi-modal fact verification dataset. Instances are labeled with three main categories: support, no-evidence, and refute. The dataset contains textual claims, reference textual resources, and images, and the task is to classify each claim according to the evidence provided. With 50,000 data points covering news from India and the US, FACTIFY is notable as the largest public multi-modal fact verification dataset. It was released for a shared task at the De-Factify workshop at AAAI-2022.
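A single FACTIFY instance can be pictured as the record below; the field names are illustrative assumptions chosen to match the description above, not the official column names.

```python
from dataclasses import dataclass

@dataclass
class FactifyInstance:
    """One multi-modal claim-document pair; field names are illustrative
    assumptions, not the dataset's official schema."""
    claim_text: str
    claim_image_path: str
    document_text: str        # reference textual resource
    document_image_path: str
    label: str                # "support", "no-evidence", or "refute"
```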

4.2.4 Supporting Evidence as Premise Articles



4.2.4.1 LIAR-PLUS


LIAR-PLUS is an extension of the LIAR dataset, which contains only claims and some metadata. In LIAR-PLUS, the premise article is extracted from the URL of each claim, and the sentences that state the verdict are filtered out of the article. The dataset assigns one of six labels (pants-on-fire, false, barely-true, half-true, mostly-true, and true) to each of its 12,836 short statements.
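The verdict-filtering step can be sketched as below; the list of verdict words is an assumption based on the description above, and real preprocessing would need a more careful filter.

```python
import re

# Words that give away the verdict; sentences containing them are dropped
# when building the justification text. This filter list is an assumption
# based on the description above, not the authors' exact procedure.
VERDICT_WORDS = {"true", "false", "pants on fire", "barely-true",
                 "half-true", "mostly-true"}

def extract_justification(article: str) -> str:
    """Split the premise article into sentences and drop any sentence
    that mentions a verdict word."""
    sentences = re.split(r"(?<=[.!?])\s+", article)
    kept = [s for s in sentences
            if not any(word in s.lower() for word in VERDICT_WORDS)]
    return " ".join(kept)
```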
4.2.4.2 WatClaimCheck
The WatClaimCheck dataset includes claims from interviews, speeches, social media, and news articles, together with review articles published by professional fact-checkers. The evidence corpus contains the premise articles written by those fact-checkers to support and verify the veracity of each claim. The claims were drawn from eight fact-checking sites: Snopes, PolitiFact, Alt News, USA Today, AFP Fact Check, Africa Check, Full Fact, and FactCheck.org.

4.2.4.3 HealthFC
The HealthFC dataset consists of 750 health-related claims in German and English. The evidence for these claims was gathered from clinical studies, and the data was collected from the web portal Medizin Transparent. Every claim has a single associated document. The dataset has three labels: supported, refuted, and NEI (not enough information).
