FACTOOL: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
Table 1: A comparison of published approaches for factuality detection in terms of the generated responses and claims to be verified based on collected evidence. "Scenario" denotes the task and domain on which the corresponding approach has been validated. "Sci." stands for "Scientific".
to have a more comprehensive factuality detection and verification framework that is similarly versatile.

Additionally, in the current literature, the task of factuality detection is usually simplified as either (i) given a claim, determining whether it is factually correct, or (ii) given evidence, determining whether the generated claim is supported. This task definition is not well suited to the writing tasks that users commonly engage in when interacting with generative models (e.g., ChatGPT), where we often need to validate the factuality of a long-form generation without explicit claims and evidence.

In this paper, we propose a task- and domain-agnostic framework, FACTOOL, which aims to detect factual errors in LLM-generated texts. We illustrate our framework in Fig. 1, where we connect the concept of "tool use" (Thoppilan et al., 2022; Gao et al., 2022b; Schick et al., 2023) with "factuality detection" and demonstrate that the ability of LLMs to use tools is crucial for factuality detection. Specifically, FACTOOL leverages various tools, including Google Search, Google Scholar, code interpreters, Python, or even LLMs themselves, to gather evidence about the factuality of the generated content. Moreover, our framework employs the reasoning abilities of LLMs to assess the factuality of the content given the evidence that has been gathered. We develop a benchmark and perform experiments across four tasks: knowledge-based QA, code generation, math problem solving, and scientific literature review writing.

In summary, our contributions are:

• We revisit the task of factuality detection and extend it in a way that allows for a better audit of current generative AI models.
• We connect the concept of "tool use" with "factuality detection", developing a unified and versatile framework for factuality detection across a variety of domains and tasks.
• We use FACTOOL to evaluate the factuality of modern chatbots and find that GPT-4 has the best factuality across almost all scenarios. Supervised fine-tuned chatbots (e.g., Vicuna-13B) have reasonably good factuality in KB-based QA but perform poorly in more challenging scenarios, including code generation, math problem solving, and scientific literature review writing.

2 Related Work

Factuality Detection in Natural Language Processing. Factuality detection was a topic of rigorous study even before the advent of generative AI. Existing works can be organized by their differences in terms of the "response" to be verified, the "claim" extracted from the response, and the supporting "evidence". As illustrated in Tab. 1, the creation of the FEVER dataset (Thorne et al., 2018a) spawned models (Zhong et al., 2020; Krishna et al., 2022) that determine whether a given fine-grained claim based on Wikipedia (https://www.wikipedia.org/) articles is correct. In this task setting, both the claim and the related evidence are given. FactCC (Kryscinski et al., 2020) and QAGS-based models (Wang et al., 2020) adopted a different task formulation to detect factual consistency: given the evidence text, determine whether the generated summaries or summary sentences are factually consistent with it. WICE-based methods (Kamoi et al., 2023) decide whether a fact from a Wikipedia sentence could be supported
by provided evidence. RARR (Gao et al., 2022a) proposed a new approach that directly prompts LLMs to generate queries, retrieve evidence, and determine factuality.

Existing works typically rely on either a given claim or given evidence and target a specific use case. In this paper, however, we introduce a more challenging yet practical task setting, i.e., factuality detection without explicit claims or evidence, and propose a framework capable of addressing this challenge in a variety of scenarios.

Tool Use in Large Pretrained Language Models. Language models store limited knowledge within their parameters. To overcome this limitation, various tools have been introduced to assist language models and further expand their capabilities. For example, Press et al. (2022) and Komeili et al. (2022) gathered information from the Internet to enhance question answering and dialog systems, respectively. Schick et al. (2023) trained a model capable of interacting with five tools, including a calculator and a translation system. Recently, Shen et al. (2023) introduced a framework that employs LLMs to connect various AI models from the machine learning community to tackle AI tasks. Furthermore, Liang et al. (2023) proposed a new AI ecosystem that connects LLMs with millions of existing APIs to accomplish tasks. In this work, we explore tool use in LLMs for the task of factuality detection.

3 Revisiting Factuality in Generative AI

3.1 Definition

Versatile Factuality. In most previous works, factuality has been defined as whether a claim in a text can be supported by evidence from a separate, trustworthy knowledge base, with applications in fact-checking (Thorne et al., 2018b), where the knowledge base is a large source like Wikipedia, and summarization (Kryscinski et al., 2020), where the knowledge base is an input document or documents. In this paper, we extend this definition to whether the claims made in generated signals (which could be text, code, mathematical expressions, and so on) can be supported by evidence under specific rules. These rules can range from consistency with a knowledge base derived from Wikipedia, to a verification rule specified within a Python library, to an operational rule derived from mathematics. By adopting this broader definition, we are able to establish a unified framework for addressing factuality issues in generative AI beyond just the textual domain.

Fine-grained Factuality. One can usually detect the factuality of a generated signal (e.g., text) at different levels of granularity, such as sentences and documents. A more granular assessment can be particularly valuable because it (1) allows users to pinpoint where inaccuracies occur (Liu et al., 2021) and (2) serves as a reward model for developers to refine their generative systems (Lightman et al., 2023).

However, implementing fine-grained factuality detection is challenging for two reasons: (1) specifying the desired granularity level without ambiguity, and (2) extracting claims in line with the predetermined granularity level. In this paper, we argue that by utilizing the powerful instruction-following ability and the natural language interface of LLMs, we can effectively address the challenge of defining and extracting fine-grained claims through claim definition-based few-shot prompting. More details can be found in §4.1.

Structurally speaking, given a prompt (e.g., a query or instruction) and the corresponding model-generated response, the fine-grained factuality detection task involves the following concepts:

Prompt (p): a query or instruction that users provide to the generative model.
Response (r): a piece of text (usually in long form) generated by the generative model.
Claim (c): a statement inferred from the model response, whose granularity is defined by a natural language description.
Evidence (e): the available information (e.g., a knowledge base, pre-defined rules) that supports or demonstrates the truth or validity of a claim.

3.2 Instantiations in Different Scenarios

Using the above task definition, we can define factuality in different application scenarios (see also Tab. 2).

Knowledge-based QA. Knowledge-based (KB) QA (Chen et al., 2017) aims to answer questions using a given knowledge base or open-domain data source (e.g., Wikipedia). In this task, we define factuality as how well each claim in the generated answer is supported by world knowledge. In this paper, we consider a more challenging scenario: open-domain QA that requires long-form answers, rather than short ones.
Tasks           | Prompt (p)          | Response (r)      | Claim (c)                          | Evidence (e)
KB-based QA     | Question            | Long-form answer  | Atomic component unit              | Web search results
Code Generation | Code query          | Executable code   | Code snippet                       | Python library
Math Problems   | Math problems       | Math solution     | Math calculation                   | Calculator
Sci. Lit Review | Scientific question | Long-form review  | Tuple (paper title, year, authors) | Google Scholar

Table 2: Factuality definition in different tasks. "Sci. Lit Review" represents scientific literature review.
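To make these four concepts concrete, the following is a minimal sketch (our own illustration, not code from FACTOOL) of the task structure as Python data classes; the response-level rule at the end, under which a response counts as factual only if every extracted claim does, is our assumption:

# A minimal sketch of the Section 3.1 concepts (our illustration, not
# FACTOOL's actual code): a prompt p, a response r, claims {c_i} inferred
# from r, and the evidence e gathered for each claim.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Claim:
    text: str                                     # statement inferred from the response
    evidence: list = field(default_factory=list)  # e.g., search snippets, exec results
    factual: Optional[bool] = None                # verdict; None = not yet verified

@dataclass
class FactualityExample:
    prompt: str        # query or instruction given to the generative model
    response: str      # (usually long-form) model-generated text
    claims: list = field(default_factory=list)    # list of Claim objects

def response_level_factual(example: FactualityExample) -> bool:
    # Assumption: a response is factual only if all of its claims are factual.
    return all(c.factual for c in example.claims)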
Code Generation. The code generation task (Yin and Neubig, 2017) involves generating executable code based on a given query. We define factuality in code generation as how well the generated code, as a whole, can be executed correctly within a specific programming language (e.g., Python) and fulfills the provided requirements. This definition is grounded in an execution-based approach to code evaluation, which measures the correctness of generated code by executing it against some test case inputs and comparing its output to the expected output.

Math Problem Solving. The math problem solving task involves the use of automated methods to address mathematical problems (Cobbe et al., 2021). At the claim level, factuality in math problem solving is defined as the extent to which the generated statements adhere to the calculation rules. At the response level, factuality in math problem solving is defined as how effectively the overall mathematical solution addresses the given problem.

Scientific Literature Review Writing. The scientific literature review writing task (Jha et al., 2015) aims to analyze and synthesize existing research on a specific topic in a field of study. In this task, we define factuality as whether the generated scientific literature review correctly cites existing scientific literature, including the correct mention of authors and publication years. (In this paper, our focus lies in examining the consistency of the relationship between the paper title, authors, and publication year; the task of determining whether the cited paper is the most appropriate choice is left for future investigation.)

4 Approach

We propose a tool-augmented framework for detecting factual errors that can apply a unified approach across various tasks. The motivation for using tools is twofold. On one hand, each tool embodies domain expertise, assisting us in the effective gathering of evidence that verifies the correctness of the claim. On the other hand, the ability of LLMs to utilize multiple tools paves the way for factuality detection augmented with multiple tools. For example, by directly using ChatGPT plugins (https://openai.com/blog/chatgpt-plugins), we can integrate multiple tools into a chatbot.

The framework is illustrated in Fig. 1 and consists of five main components: claim extraction, query generation, tool querying, evidence collection, and agreement verification. We elaborate on each component below.

4.1 Claim Extraction

Extracting claims from responses under various task settings is challenging due to the inconsistent definitions of claims across tasks and scenarios. This inconsistency hinders the development of applications such as text summarization evaluation and factuality detection. To tackle this, we propose an approach that treats claim extraction as a process guided by LLM prompts based on the specific definition of claims. This approach offers the following advantages:

(i) Leveraging the strong instruction-following capabilities of LLMs can significantly reduce the costs associated with data annotation and model training for claim extraction.

(ii) When developing a system or constructing a dataset for an application that relies on a definition of claims, one simply needs to provide a textual definition of the claim to a large model. This enables future researchers to effectively utilize these definitions as a foundation in their work.

(iii) Our experiments demonstrate that the claim extraction module, implemented with ChatGPT, exhibits strong performance in extracting claims (atomic component units). The detailed results of these experiments are discussed in Section 6.1.

Here, we employ ChatGPT as a base LLM and apply different textual definitions of claims across the four tasks. Our goal is to extract all verifiable claims within the generated text x, denoted as {c_i}_{i=1...n}. Detailed prompting instructions can be found in Appendix A.
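As a concrete illustration of this prompting-based extraction, below is a minimal sketch under our own assumptions: call_llm is a hypothetical stand-in for a chat-completion API call, and the claim definition and in-context example are paraphrases of ours, not the exact prompt from Appendix A.

import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API (e.g., gpt-3.5-turbo)."""
    raise NotImplementedError

CLAIM_DEFINITION = (
    "A claim is an atomic content unit: a single verifiable fact of at most "
    "15 words, with pronouns and ambiguous references resolved."
)

IN_CONTEXT_EXAMPLE = (
    'Response: "Jack Dorsey co-founded Twitter in 2006."\n'
    'Claims: ["Jack Dorsey co-founded Twitter.", "Twitter was co-founded in 2006."]'
)

def extract_claims(response: str) -> list:
    # Build a definition-guided few-shot prompt and parse the JSON list it returns.
    prompt = (
        f"{CLAIM_DEFINITION}\n\n{IN_CONTEXT_EXAMPLE}\n\n"
        f'Response: "{response}"\n'
        "Return all verifiable claims as a JSON list of strings.\nClaims:"
    )
    return json.loads(call_llm(prompt))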
Figure 2: Our proposed framework for factuality detection in four domains: knowledge-based QA, code generation, math problem solving, and scientific literature review writing.
KB-based QA. The claim is defined using the concept of atomic content units (ACUs) (Liu et al., 2022). Each ACU corresponds to a single atomic fact within a generated answer. In practice, we leverage ChatGPT (specifically the "gpt-3.5-turbo" version; we also explored other entailment-based models built on BERT, and the results were no better than ChatGPT) to extract claims based on two criteria: (i) each claim should not exceed 15 words, and (ii) it should clearly describe a fact. We also include two in-context examples from the RoSE dataset (Liu et al., 2022) in our prompt to obtain more fine-grained claims. Additionally, we ask ChatGPT to resolve any coreference or ambiguity, such as unclear pronouns and other related expressions, within the claims.

Code Generation. We consider each generated code snippet within the response as a single claim to be verified. We extract all such code snippets that are enclosed in brackets, in other words, within a code block.

Math Problems. We define each claim in a step-by-step math solution as an arithmetic operation performed between known real numbers. Each of these operations contains two parts: the calculation and the calculated answer. We prompt ChatGPT to extract all such claims.

Scientific Literature Review. Each claim within the generated review is defined as a tuple of "(paper title, year, authors)" extracted from the generated review. We then prompt ChatGPT to extract all such tuples within the generated review.

4.2 Query Generation

For each claim c_i, we convert it into a list of queries {q_ij}_{j=1...m} that can be used to query external tools such as search engines, the Python interpreter, or Google Scholar.

KB-based QA. We prompt ChatGPT or GPT-4 to generate two search engine queries from each claim c_i. These queries are intended to help humans verify the factuality of c_i. Detailed prompting instructions can be found in Appendix A.

Code Generation. For each claim c_i, we generate two different types of queries: simulated test case inputs, denoted as {qt_ij}_{j=1...m}, and potential solutions, denoted as {qs_ij}_{j=1...m}. Both types of queries are generated by ChatGPT or GPT-4. The simulated test case inputs are function calls generated for a given code snippet, while the potential solutions are solutions that ChatGPT repeatedly generates in response to the user prompt.
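To sketch how these two query types combine into a verdict for the code scenario, here is a simplified illustration of ours (the function names and the majority-output rule are our assumptions, based on the execution-based definition in §3.2 and the majority voting mentioned in §6.2.3), not FACTOOL's exact implementation:

from collections import Counter

def run(snippet: str, call: str):
    """Execute a generated function definition, then evaluate one test call."""
    env = {}
    exec(snippet, env)       # define the function (untrusted code: sandbox in practice)
    return eval(call, env)   # e.g., call = "string_sequence(4)"

def verify_code_claim(claim_code: str, solutions: list, test_inputs: list) -> bool:
    for call in test_inputs:
        # Take the majority output of the potential solutions as pseudo ground truth.
        outputs = Counter(repr(run(sol, call)) for sol in solutions)
        expected = outputs.most_common(1)[0][0]
        if repr(run(claim_code, call)) != expected:
            return False     # the claimed code disagrees on this test input
    return True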
Table 5: Experimental results of FACTOOL powered by ChatGPT and FACTOOL powered by GPT-4 on KB-based QA, Code Generation, Math Problems, and Scientific Literature Review.
that claim-level factuality is considered equivalent to response-level factuality in our experiment for code generation), a 98.97 claim-level F1 / 80.36 response-level F1 on math problems, and a 95.24 claim-level F1 / 94.74 response-level F1 on scientific literature review. Each of these figures is the highest for the respective task.

FACTOOL powered by GPT-4 outperforms all self-check baselines across all scenarios. Tab. 5 shows that FACTOOL powered by GPT-4 outperforms all self-check baselines across all scenarios. Comparing FACTOOL powered by GPT-4 against Self-Check (3) powered by GPT-4, we observe: 71.79 vs. 57.14 response-level F1 on KB-based QA, 92.11 vs. 85.26 response-level F1 on code generation, 80.36 vs. 70.59 response-level F1 on math problems, and 94.74 vs. 21.54 response-level F1 on scientific literature review.

FACTOOL powered by GPT-4 significantly outperforms all self-check baselines in scientific literature review. Tab. 5 shows that FACTOOL powered by GPT-4 significantly outperforms the self-check baselines in scientific literature review. Comparing FACTOOL powered by GPT-4 against Self-Check (3) powered by GPT-4, we observe 95.24 vs. 36.71 claim-level F1 and 94.74 vs. 21.54 response-level F1. Google Scholar proves highly robust at its specified task of finding citations, compared to the LLM itself.

FACTOOL powered by GPT-4 outperforms FACTOOL powered by ChatGPT. FACTOOL powered by GPT-4 outperforms FACTOOL powered by ChatGPT across all scenarios. This trend is especially significant in KB-QA, where query generation and agreement verification are harder for ChatGPT but relatively easier for GPT-4 (89.09 vs. 81.25 claim-level F1 and 71.79 vs. 52.63 response-level F1). On the other hand, in scenarios where query generation and agreement verification are relatively easy for both ChatGPT and GPT-4, the performance is similarly good.

Self-check models are prone to false positives and are thus less sensitive in detecting errors. From Tab. 5, we observe that self-check models have lower precision compared to FACTOOL.
Comparing Self-Check (3) powered by GPT-4 against FACTOOL powered by GPT-4, we observe: 63.16 vs. 87.50 response-level precision on KB-based QA, 76.43 vs. 89.74 response-level precision on code generation, 58.33 vs. 69.23 response-level precision on math problems, and 12.73 vs. 100.00 response-level precision on scientific literature review. These figures show that self-check models tend to classify claims as "True" considerably more frequently than FACTOOL, suggesting a lower sensitivity to errors.

Self-check models powered by ChatGPT outperform FACTOOL powered by ChatGPT on KB-QA. Tab. 5 shows that Self-Check (0) powered by ChatGPT outperforms FACTOOL powered by ChatGPT. Through examining specific cases, we found that reasoning errors are the main reason why FACTOOL powered by ChatGPT performs worse than the self-check baselines. Even when provided with sufficient evidence to determine whether the claim is factual, the agreement verification implemented by ChatGPT can become confused. For example, for the claim "The modern-day version of fortune cookies was invented in the United States.", the reasoning of FACTOOL powered by ChatGPT is self-contradictory: "The given text is not entirely factual. The modern-day version of fortune cookies was not invented in the United States. Most people nowadays believe that fortune cookies were created by a Japanese man named Makoto Hagiwara in 1914 in San Francisco. Hagiwara owned what is now called the Golden Gate Park Japanese Tea Garden, where he served tea and fortune cookies. This is supported by the provided evidences." Detailed examples can be found in Fig. 9 of Appendix B.

6.2.2 Performance Analysis

We take a closer look at performance in different scenarios by examining evaluated cases.

KB-based QA. The fact-checking capability of FACTOOL on KB-based QA is determined by several factors: whether the search engine can return the most relevant snippets for determining the factuality of the given claim, the quality of the generated search engine queries, and the LLM's ability to reason about the validity of the claim given the retrieved evidence. We found that FACTOOL powered by GPT-4 is especially capable in the following situations: (1) Fact-checking recent events, discoveries, or news: FACTOOL powered by GPT-4 successfully identifies false claims such as "Argentina has not won the World Cup since 1986" and "The most valuable NFT ever sold is a digital artwork called 'Everydays: The First 5000 Days'". (2) Fact-checking high-precision statistics: FACTOOL powered by GPT-4 successfully identifies false claims such as "Ireland has an obesity rate of 26.9%" and "'Everydays: The First 5000 Days' sold for 69 million". Detailed examples can be found in Fig. 10 of Appendix B.

Code Generation. The fact-checking capability of FACTOOL on code generation is determined by the LLM's capability to generate high-quality test cases and potential solutions. We demonstrate that, owing to GPT-4's exceptional ability to generate such high-quality test cases and potential solutions, FACTOOL powered by GPT-4 outperforms the other baselines. For example, on "HumanEval/36", GPT-4 consistently generates high-quality solutions, leading it to correctly identify the mistakes in the response, while ChatGPT fails to identify the mistake. Detailed examples can be found in Fig. 11 and Fig. 12 of Appendix B.

Math Problems. The fact-checking capability of FACTOOL on math problems is determined by the LLM's capability to generate accurate Python snippets that verify the correctness of the extracted mathematical calculations. Both FACTOOL powered by GPT-4 and FACTOOL powered by ChatGPT excel in this regard; for example, both correctly identify that 23 × 4319216 does not equal 99305768. Detailed examples can be found in Fig. 13 of Appendix B.
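A minimal sketch of this snippet-based check, under our own assumptions (the generated snippets vary; Fig. 13 shows forms such as print(calculation == answer) and tolerance checks like abs(calculation - answer) < 0.01):

def verify_math_claim(calculation: str, answer: str, tol: float = 0.01) -> bool:
    """Check an extracted (calculation, answer) pair by executing it in Python."""
    try:
        return abs(eval(calculation) - float(answer)) <= tol
    except Exception:
        return False   # claims that do not parse cannot be auto-verified

# e.g., verify_math_claim("23 * 4319216", "99305768") evaluates to False,
# since 23 * 4319216 = 99341968.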
Scientific Literature Review. The fact-checking capability of FACTOOL on scientific literature review is determined by the LLM's capability to identify whether the generated author list is consistent with the actual author list. Both FACTOOL powered by GPT-4 and FACTOOL powered by ChatGPT excel in this regard; for example, both correctly identify that the paper "The Impact of Artificial Intelligence on Employment" was not written by "Acemoglu and Restrepo". Detailed examples can be found in Fig. 14 of Appendix B.

6.2.3 Failure Analysis

To gain a comprehensive understanding of FACTOOL's performance, we analyze the cases where FACTOOL fails.

KB-based QA. We summarize the following sources of errors: (1) Reasoning error: although the evidence provided is sufficient and the LLM accurately finds the most relevant information, the model fails to reason about the relationship between the claim and the provided evidence. For example, for the claim "Jupiter is less dense than Saturn", FACTOOL powered by GPT-4 fails to reason about the relative relationship even though the evidence provided is sufficient. (2) Conflicting evidence: conflicts in the evidence can confuse the LLM, leading to incorrect decisions. For example, for the claim "Jupiter has a density of 1.33 grams per cubic centimeter", there is conflicting evidence claiming that the density is 1.326 or 1.33 g/cm3. (3) Ambiguity in the claim: ambiguous descriptions and subjective adjectives can lead to incorrect decisions. For example, the claim "Fortune cookies are enjoyed by people all over the world." is ambiguous and can have different answers based on different interpretations. Detailed examples can be found in Fig. 15 of Appendix B.

Code Generation. Errors in code generation mainly come from: (1) Limited variety in synthetic test cases: the synthetic test cases generated by LLMs may not be fully representative or sufficiently diverse. For example, in the "HumanEval/64" sample, all the inputs of the generated synthetic test cases are strings composed only of lowercase letters (without uppercase letters). (2) Potential errors in code generation: the generated potential solutions could contain errors or bugs. Despite implementing a majority voting system to lessen this issue, it cannot completely eliminate the chance of bugs in the code generation process. For example, in the "HumanEval/79" sample, all the generated solutions fail to correctly convert "decimal_to_binary(0)" to "db0db". Detailed examples can be found in Fig. 16 of Appendix B.

Math Problems. There are two major types of errors in factuality detection for math problems: (1) Round-off error: round-off errors can occur during numerical calculations in Python. For example, FACTOOL powered by GPT-4 incorrectly classifies the math calculation "60444034 / 12 = 5037002.83" as "False". (2) Reasoning error: since the claims extracted by FACTOOL only involve mathematical calculations, FACTOOL does not verify the reasoning process of the mathematical solution. For example, for the question "Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 5364765 glasses. How much does he need to pay for them?", the ChatGPT-generated response contains a reasoning error that incorrectly substitutes the total cost as "5,364,765 * 5". However, since FACTOOL only checks math calculation errors, FACTOOL powered by GPT-4 did not identify the reasoning error. Detailed examples can be found in Fig. 17 of Appendix B.

Scientific Literature Review. There are two major types of errors in factuality detection for scientific literature review: (1) Errors in title matching: title matching can sometimes be problematic due to abbreviations in the generated citations or in the retrieved title. For example, although the paper "MDMA-assisted psychotherapy for treatment of PTSD: study design and rationale for phase 3 trials based on pooled analysis of six phase 2 randomized controlled trials" exists, FACTOOL powered by GPT-4 identifies the paper title as incorrect. (2) Errors in author matching: the author matching process might sometimes not be robust. For example, although the authors of "Language Models are Unsupervised Multitask Learners" are indeed "Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever", FACTOOL powered by GPT-4 identifies the author list as incorrect. Detailed examples can be found in Fig. 18 of Appendix B.
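To make these two failure modes concrete, here is a minimal citation-matching sketch of our own construction (the normalization, the exact-title comparison, and the author-subset test are our assumptions, not FACTOOL's exact implementation); strict title equality is precisely the kind of rule that makes abbreviated titles fail:

import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so superficial differences are ignored."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def match_citation(claim: dict, retrieved: dict) -> list:
    errors = []
    if normalize(claim["title"]) != normalize(retrieved["title"]):
        errors.append("wrong_paper_title")      # brittle for abbreviated titles
    if str(claim["year"]) != str(retrieved["year"]):
        errors.append("wrong_paper_pub_year")
    claimed = {normalize(a) for a in claim["authors"]}
    actual = {normalize(a) for a in retrieved["authors"]}
    if not claimed <= actual:                   # generated authors must appear in the actual list
        errors.append("wrong_paper_author(s)")
    return errors                               # empty list = claim verified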
6.3 Exp-III: Using FACTOOL to Evaluate the Factuality of Modern Chatbots

The purpose of developing a factuality detector is to audit actual generative chatbots and assess the reliability of the responses they generate. To this end, we evaluate the factuality of modern chatbots, including GPT-4, ChatGPT, Claude-v1, Bard, and Vicuna-13B, using FACTOOL powered by GPT-4. It is important to note that in Exp-III we consider FACTOOL a golden evaluator, responsible for evaluating the factual accuracy of the responses generated by the different chatbots. For prompt selection, we follow the same intuition as Zhou et al. (2023): KB-QA is the most common scenario. Thus, we select 30 KB-QA prompts, 10 code prompts, 10 math prompts, and 10 scientific prompts (i.e., 3 times more KB-QA prompts than prompts from each other scenario) to carry out this factuality evaluation on chatbots. The KB-QA prompts are collected from Zhou et al. (2023), the code prompts from HumanEval (Chen et al., 2021), and the math prompts from Gao et al. (2022b), while the scientific prompts are written by us. Responses to these prompts are generated by each of the evaluated chatbots.

We report both the claim-level and response-level accuracy for each chatbot, as evaluated by FACTOOL powered by GPT-4. Given that KB-QA responses contain significantly more claims than responses from other scenarios, we report the weighted claim-level accuracy, where the weight is determined by the ratio of the number of prompts in each scenario. In other words,

weighted_claim_level_accuracy
  = (3/6) × claim_level_accuracy_{KB-QA}
  + (1/6) × claim_level_accuracy_{Code}
  + (1/6) × claim_level_accuracy_{Math}
  + (1/6) × claim_level_accuracy_{Scientific}
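Equivalently, as a small sketch of the computation (the accuracy values in the usage comment are placeholders, not numbers from our experiments):

WEIGHTS = {"kb_qa": 3 / 6, "code": 1 / 6, "math": 1 / 6, "scientific": 1 / 6}

def weighted_claim_level_accuracy(acc_by_scenario: dict) -> float:
    # Weight each scenario's claim-level accuracy by its share of prompts.
    return sum(w * acc_by_scenario[s] for s, w in WEIGHTS.items())

# e.g., weighted_claim_level_accuracy(
#     {"kb_qa": 0.80, "code": 0.70, "math": 0.90, "scientific": 0.75})
# returns 0.7916...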
Adopting the weighted claim-level accuracy helps us provide a more holistic and fair assessment of each chatbot's factual accuracy.

Results. Tab. 6 shows that GPT-4 has the best weighted claim-level factual accuracy and response-level accuracy compared to ChatGPT, Bard, Claude-v1, and Vicuna-13B. Fig. 4 and 5 demonstrate the fine-grained performance w.r.t. each scenario (KB-QA, code, math, scientific). We observe that (a) GPT-4 has the best claim-level and response-level accuracy in most of the scenarios, and (b) supervised fine-tuned chatbots like Vicuna-13B perform reasonably well in more common scenarios like KB-QA but less so in more challenging scenarios such as math, code, and scientific literature review.

LLMs       | WCL Acc. | RL Acc. | Avg. Resp. Len.
GPT-4      | 75.60    | 43.33   | 196.83
ChatGPT    | 68.63    | 36.67   | 144.05
Claude-v1  | 63.95    | 26.67   | 208.70
Bard       | 61.15    | 33.33   | 263.77
Vicuna-13B | 50.35    | 21.67   | 207.13

Table 6: Factual accuracy of different chatbots evaluated by FACTOOL. WCL Acc. stands for the weighted claim-level accuracy of each chatbot; RL Acc. for the response-level accuracy; Avg. Resp. Len. for the average response length. Note that we consider FACTOOL as the golden evaluator that evaluates the factuality of the responses generated by each chatbot.

7 Conclusion

We introduce FACTOOL, a task- and domain-agnostic framework designed to tackle the escalating challenge of factual error detection in generative AI. We expand the conventional definition of factuality, particularly focusing on auditing the capabilities of generative AI models. Observing that (1) the texts generated by LLMs tend to be lengthy and lack a clearly defined granularity for individual facts, and that (2) there is a scarcity of explicit evidence available during fact checking, we build FACTOOL as a five-step tool-augmented framework consisting of claim extraction, query generation, tool querying, evidence collection, and verification.

We demonstrate the potential of incorporating tools such as Google Search, Google Scholar, code interpreters, Python, and even LLMs themselves into factual error detection through experiments across diverse tasks: knowledge-based QA, code generation, math problem solving, and scientific literature review writing. We believe that our holistic and adaptable framework can be easily extended to more scenarios.
Figure 4: Claim-level accuracy across scenarios for GPT-4, ChatGPT, Bard, Claude-v1, and Vicuna-13B.

Figure 5: Response-level accuracy across scenarios for GPT-4, ChatGPT, Bard, Claude-v1, and Vicuna-13B.
Figure 10: Examples of KB-QA claims verified by FACTOOL powered by GPT-4. The queries generated, the evidence collected, the reasoning process, and the final decision of FACTOOL powered by GPT-4 are reported.
{
  "task_id": "HumanEval/2",
  "prompt": ...
}
{
  "task_id": "HumanEval/57",
  "prompt": ...,
  "multi_solutions": [
    def monotonic(l: list):
        if len(l) < 2:
            return True
        direction = l[1] - l[0]
        for i in range(2, len(l)):
            if direction == 0:
                direction = l[i] - l[i - 1]
                continue
            if (l[i] - l[i - 1]) * direction < 0:
                return False
        return True

    def monotonic(l: list):
        if len(l) <= 1:
            return True
        increasing = all(l[i] <= l[i + 1] for i in range(len(l) - 1))
        decreasing = all(l[i] >= l[i + 1] for i in range(len(l) - 1))
        return increasing or decreasing
  ],
  "exec_result": [[true, true, true, true], [true, true, true, true], [false, false, false, false]]
}
Figure 11: Examples of code claims verified by FACTOOL powered by GPT-4. The queries generated, the evidence collected, the reasoning process, and the final decision of FACTOOL powered by GPT-4 are reported.
[ChatGPT]
{
  "task_id": "HumanEval/36",
  "prompt":
    def fizz_buzz(n: int):
        """Return the number of times the digit 7 appears in
        integers less than n which are divisible by 11 or 13."""
  ,
  "entry_point": "fizz_buzz",
  "completion":
    def fizz_buzz(n: int):
        count = 0
        for i in range(n):
            if i % 11 == 0 or i % 13 == 0:
                if str(i).count('7') > 0:
                    count += 1
        return count
  ,
  "multi_solutions": [
    def fizz_buzz(n: int):
        count = 0
        for i in range(n):
            if i % 11 == 0 or i % 13 == 0:
                if '7' in str(i):
                    count += 1
        return count

    def fizz_buzz(n: int):
        count = 0
        for i in range(n):
            if i % 11 == 0 or i % 13 == 0:
                count += str(i).count('7')
        return count

    def fizz_buzz(n: int):
        count = 0
        for i in range(n):
            if i % 11 == 0 or i % 13 == 0:
                if str(i).count('7') > 0:
                    count += 1
        return count
  ]
}

[GPT-4]
{
  "task_id": "HumanEval/36",
  "prompt":
    def fizz_buzz(n: int):
        """Return the number of times the digit 7 appears in
        integers less than n which are divisible by 11 or 13."""
  ,
  "entry_point": "fizz_buzz",
  "completion":
    def fizz_buzz(n: int):
        count = 0
        for i in range(n):
            if i % 11 == 0 or i % 13 == 0:
                if str(i).count('7') > 0:
                    count += 1
        return count
  ,
  "multi_solutions": [
    def fizz_buzz(n: int):
        count = 0
        for i in range(1, n):
            if i % 11 == 0 or i % 13 == 0:
                count += str(i).count('7')
        return count

    def fizz_buzz(n: int):
        count = 0
        for i in range(1, n):
            if i % 11 == 0 or i % 13 == 0:
                count += str(i).count('7')
        return count

    def fizz_buzz(n: int):
        count = 0
        for i in range(n):
            if i % 11 == 0 or i % 13 == 0:
                if str(i).count('7') > 0:
                    count += 1
        return count
  ]
}
Figure 12: Examples of code claims verified by FACTOOL powered by ChatGPT and FACTOOL powered by GPT-4. The queries generated, the evidence collected, the reasoning process, and the final decisions of FACTOOL powered by ChatGPT and FACTOOL powered by GPT-4 are reported.
[ChatGPT]
{
  "math_calculation": "23 * 4319216",
  "calculated_answer": "99305768",
  "label": false,
  "query": {"python_snippet": "print(23 * 4319216 == 99305768)"},
  "exec_result": "False",
  "with_tool_classification": false
}
{
  "math_calculation": "20/100 x $10884297.00",
  "calculated_answer": "2176859.40",
  "label": true,
  "query": {"python_snippet": "print(abs((20/100) * 10884297.00 - 2176859.40) < 0.01)"},
  "exec_result": "True",
  "with_tool_classification": true
}

[GPT-4]
{
  "math_calculation": "23 * 4319216",
  "calculated_answer": "99305768",
  "label": false,
  "query": {"python_snippet": "calculation = 23 * 4319216; answer = 99305768; print(calculation == answer)"},
  "exec_result": "False",
  "with_tool_classification": false
}
{
  "math_calculation": "20/100 x $10884297.00",
  "calculated_answer": "2176859.40",
  "label": true,
  "query": {"python_snippet": "calculation = (20/100) * 10884297.00; answer = 2176859.40; print(abs(calculation - answer) < 0.01)"},
  "exec_result": "True",
  "with_tool_classification": true
}
Figure 13: Examples of math claims verified by FACTOOL powered by ChatGPT and GPT-4. The full responses of FACTOOL powered by ChatGPT and GPT-4 are reported, including the queries generated, the evidence collected, the reasoning process, and the final decisions.
[ChatGPT]
{
  "paper_title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
  "paper_author(s)": "Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova",
  "paper_pub_year": "2018",
  "label": true,
  "with_tool_classification": true,
  "error": []
}
{
  "paper_title": "The Impact of Artificial Intelligence on Employment",
  "paper_author(s)": "Acemoglu and Restrepo",
  "paper_pub_year": "2019",
  "label": false,
  "with_tool_classification": false,
  "error": ["wrong_paper_author(s)", "wrong_paper_pub_year"]
}

[GPT-4]
{
  "paper_title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
  "paper_author(s)": "Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova",
  "paper_pub_year": "2018",
  "label": true,
  "with_tool_classification": true,
  "error": []
}
{
  "paper_title": "The Impact of Artificial Intelligence on Employment",
  "paper_author(s)": "Acemoglu and Restrepo",
  "paper_pub_year": "2019",
  "label": false,
  "with_tool_classification": false,
  "error": ["wrong_paper_author(s)", "wrong_paper_pub_year"]
}
Figure 14: Examples of scientific claims verified by FACTOOL powered by ChatGPT and GPT-4. The full responses of FACTOOL powered by ChatGPT and GPT-4 are reported, including the queries generated, the evidence collected, the reasoning process, and the final decisions.
Figure 15: Some error cases of FACTOOL on KB-QA. The full response of FACTOOL powered by GPT-4 is reported, including the queries generated, the evidence collected, the reasoning process, and the final decision.
[Limited variety in synthetic test cases]
{
  "task_id": "HumanEval/64",
  "prompt": "Write a function vowels_count(s) which takes a string representing a word as input and returns the number of vowels in the string. Vowels in this case are 'a', 'e', 'i', 'o', 'u'. Here, 'y' is also a vowel, but only when it is at the end of the given word.",
  "entry_point": "vowels_count",
  "completion":
    def vowels_count(s):
        vowels = ['a', 'e', 'i', 'o', 'u']
        count = 0
        for i in range(len(s)):
            if s[i] in vowels:
                count += 1
            if i == len(s) - 1 and s[i] == 'y':
                count += 1
        return count
  ,
  "result": "failed: AssertionError: Test 2",
  "passed": false,
  "with_tool_classification": true,
  "testcases_input": ["vowels_count('encyclopedia')", "vowels_count('rhythm')", "vowels_count('beautiful')"],
  "multi_solutions": [
    def vowels_count(s):
        vowels = 'aeiou'
        count = 0
        for i, c in enumerate(s):
            if c.lower() in vowels:
                count += 1
            elif c.lower() == 'y' and i == len(s) - 1:
                count += 1
        return count

    def vowels_count(s):
        vowels = 'aeiou'
        count = 0
        for char in s:
            if char.lower() in vowels:
                count += 1
        if s[-1].lower() == 'y':
            count += 1
        return count

    def vowels_count(s):
        vowels = 'aeiou'
        count = 0
        for char in s:
            if char.lower() in vowels:
                count += 1
        if s[-1].lower() == 'y':
            count += 1
        return count

    def vowels_count(s):
        vowels = ['a', 'e', 'i', 'o', 'u']
        count = 0
        for i in range(len(s)):
            if s[i] in vowels:
                count += 1
            if i == len(s) - 1 and s[i] == 'y':
                count += 1
        return count
  ],
  "exec_result": [[5, 5, 5, 5], [0, 0, 0, 0], [5, 5, 5, 5]]
}

[Potential errors in code generation]
{
  "task_id": "HumanEval/79",
  "prompt":
    def decimal_to_binary(decimal):
        """You will be given a number in decimal form and your task is to
        convert it to binary format. The function should return a string,
        with each character representing a binary number. Each character
        in the string will be '0' or '1'. There will be an extra couple of
        characters 'db' at the beginning and at the end of the string.
        The extra characters are there to help with the format."""
  ,
  "entry_point": "decimal_to_binary",
  "completion":
    def decimal_to_binary(decimal):
        binary_list = []
        while decimal > 0:
            remainder = decimal % 2
            binary_list.append(str(remainder))
            decimal = decimal // 2
        binary_list.reverse()
        binary_string = ''.join(binary_list)
        return 'db' + binary_string + 'db'
  ,
  "result": "failed: AssertionError: ",
  "passed": false,
  "with_tool_classification": true,
  "testcases_input": ["decimal_to_binary(10)", "decimal_to_binary(255)", "decimal_to_binary(0)"],
  "multi_solutions": [
    def decimal_to_binary(decimal):
        binary_num = ''
        while decimal > 0:
            remainder = decimal % 2
            decimal = decimal // 2
            binary_num = str(remainder) + binary_num
        return 'db' + binary_num + 'db'

    def decimal_to_binary(decimal):
        binary = ''
        while decimal > 0:
            remainder = decimal % 2
            decimal = decimal // 2
            binary = str(remainder) + binary
        return 'db' + binary + 'db'

    def decimal_to_binary(decimal):
        binary = ''
        while decimal > 0:
            remainder = decimal % 2
            decimal = decimal // 2
            binary = str(remainder) + binary
        return 'db' + binary + 'db'

    def decimal_to_binary(decimal):
        binary_list = []
        while decimal > 0:
            remainder = decimal % 2
            binary_list.append(str(remainder))
            decimal = decimal // 2
        binary_list.reverse()
        binary_string = ''.join(binary_list)
        return 'db' + binary_string + 'db'
  ],
  "exec_result": [["db1010db", "db1010db", "db1010db", "db1010db"], ["db11111111db", "db11111111db", "db11111111db", "db11111111db"], ["dbdb", "dbdb", "dbdb", "dbdb"]]
}
Figure 16: Some error cases of FACTOOL on code. The full response of FACTOOL powered by GPT-4 is reported, including the queries generated, the evidence collected, the reasoning process, and the final decision.
[Round-off error]
{
  "math_calculation": "60444034 / 12",
  "calculated_answer": "5037002.83",
  "label": true,
  "query": {"python_snippet": "calculation = 60444034 / 12; answer = 5037002.83; print(abs(calculation - answer) < 1e-9)"},
  "exec_result": "False",
  "with_tool_classification": false
}

[Reasoning error]
{
  "input": "Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 5364765 glasses. How much does he need to pay for them?",
  "target": 21459061.0,
  "response": "Let's assume that Kylar buys x glasses at the full price of $5 each. Therefore, he will buy (5364765 - x) glasses at a discounted price of 60% of $5, which is $3. Now, we can set up the equation:
    5x + 3(5364765 - x) = total cost
    Simplifying the equation, we get:
    5x + 16094295 - 3x = total cost
    2x + 16094295 = total cost
    Therefore, the total cost of all the glasses will be 2x + 16094295 dollars. Now, we just need to find the value of x. We know that Kylar wants to buy 5364765 glasses, so we can substitute this value in the equation:
    2x + 16094295 = total cost
    2x + 16094295 = 5,364,765 * 5
    2x + 16094295 = 26823825
    2x = 10729530
    x = 5364765
    Therefore, Kylar needs to pay 2x + 16094295 dollars, which is: 2(5364765) + 16094295 = 26823825.
    Hence, Kylar needs to pay $26823825 for all the glasses.",
  "answer": "26823825",
  "annotation": false,
  "claims": [
    {"math_calculation": "5364765 - x", "calculated_answer": "5364765 - x", "label": true},
    {"math_calculation": "0.60 * 5", "calculated_answer": "3", "label": true},
    {"math_calculation": "5x", "calculated_answer": "5x", "label": true},
    {"math_calculation": "3(5364765 - x)", "calculated_answer": "16094295 - 3x", "label": true},
    {"math_calculation": "5x + 16094295 - 3x", "calculated_answer": "16094295 + 2x", "label": true},
    {"math_calculation": "2x + 16094295", "calculated_answer": "total cost", "label": "null"},
    {"math_calculation": "5364765 * 5", "calculated_answer": "26823825", "label": true},
    {"math_calculation": "2(5364765)", "calculated_answer": "10729530", "label": true},
    {"math_calculation": "2x", "calculated_answer": "10729530", "label": "null"}
  ]

  {"math_calculation": "5364765 - x", "calculated_answer": "5364765 - x", "label": true, "query": {"python_snippet": "x = 0; math_calculation = 5364765 - x; calculated_answer = 5364765 - x; print(math_calculation == calculated_answer)"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "0.60 * 5", "calculated_answer": "3", "label": true, "query": {"python_snippet": "calculation = 0.60 * 5; answer = 3; print(calculation == answer)"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "5x", "calculated_answer": "5x", "label": true, "query": {"python_snippet": "def validate_calculation(calculation, answer): return calculation == answer; print(validate_calculation('5x', '5x'))"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "3(5364765 - x)", "calculated_answer": "16094295 - 3x", "label": true, "query": {"python_snippet": "x = 1; result1 = 3 * (5364765 - x); result2 = 16094295 - 3 * x; print(result1 == result2)"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "5x + 16094295 - 3x", "calculated_answer": "16094295 + 2x", "label": true, "query": {"python_snippet": "x = 1; print((5 * x + 16094295 - 3 * x) == (16094295 + 2 * x))"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "2x + 16094295", "calculated_answer": "total cost", "label": "null", "query": {"python_snippet": "x = 5; math_calculation = 2 * x + 16094295; calculated_answer = 16094305; print(math_calculation == calculated_answer)"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "5364765 * 5", "calculated_answer": "26823825", "label": true, "query": {"python_snippet": "calculation = 5364765 * 5; answer = 26823825; print(calculation == answer)"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "2(5364765)", "calculated_answer": "10729530", "label": true, "query": {"python_snippet": "calculation = 2 * 5364765; answer = 10729530; print(calculation == answer)"}, "exec_result": "True", "with_tool_classification": true}
  {"math_calculation": "2x", "calculated_answer": "10729530", "label": "null", "query": {"python_snippet": "x = 5364765; print(2 * x == 10729530)"}, "exec_result": "True", "with_tool_classification": true}
}
Figure 17: Some error cases of FACTOOL on math. The full response of FACTOOL powered by GPT-4 is reported, including the queries generated, the evidence collected, the reasoning process, and the final decision.
Figure 18: Some error cases of FACTOOL on scientific literature review. The full response of FACTOOL powered by GPT-4 is reported, including the queries generated, the evidence collected, the reasoning process, and the final decision.