arXiv:2312.17249v1 [cs.CL] 28 Dec 2023
Do Androids Know They’re Only Dreaming of Electric Sheep?

Sky CH-Wang◦∗ Benjamin Van Durme• Jason Eisner• Chris Kedzie•



◦Department of Computer Science, Columbia University
•Microsoft Semantic Machines
skywang@cs.columbia.edu, chriskedzie@microsoft.com

Abstract

We design probes trained on the internal representations of a transformer language model that are predictive of its hallucinatory behavior on in-context generation tasks. To facilitate this detection, we create a span-annotated dataset of organic and synthetic hallucinations over several tasks. We find that probes trained on the force-decoded states of synthetic hallucinations are generally ecologically invalid in organic hallucination detection. Furthermore, hidden state information about hallucination appears to be task- and distribution-dependent. Intrinsic and extrinsic hallucination saliency varies across layers, hidden state types, and tasks; notably, extrinsic hallucinations tend to be more salient in a transformer's internal representations. Outperforming multiple contemporary baselines, we show that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.

[Figure 1: a heatmap titled "Token Hallucination Prediction by Single Layer Classifiers"; the x-axis runs over tokens in the generated response (first token to last token), the y-axis over transformer layers (shallow to deep), and cell shading ranges from grounded to hallucination.]

Figure 1: Prediction of hallucination in a generated response by probing the hidden states of a transformer during decoding. The true annotated span of hallucination within the response is boxed in white; rows represent probes trained to detect hallucination from layers within a transformer's hidden states, with columns representing their prediction for each token in a generated response.

1 Introduction
Do language models know when they're hallucinating?1 Detecting hallucinations in grounded generation is commonly framed as a textual entailment problem. Prior work has largely focused on creating secondary detection models trained on and applied to surface text (Ji et al., 2023). However, these secondary models are often of comparable complexity to the underlying language model that generated the error, and they ignore the information that was already computed during generation. In this work, we explore the degree to which probes on a decoder-only transformer language model's feed-forward and self-attention states can detect hallucinations in a variety of grounded generation tasks. In particular, as the language model generates a response, we probe its encoded tokens to determine whether the model has just begun to hallucinate or will eventually be found to have hallucinated. It is plausible that these encodings are "aware" of whether generation is retrieving information that is grounded in the prompt versus filling in plausible information that is stored in the model's weights. Related work has studied how transformers retrieve relevant memories as they encode text (Meng et al., 2022, 2023) and suggests that feed-forward layers act as associative memory stores (Geva et al., 2021). A probe for hallucination detection must consider whether subsequent generated text is entailed by information in the prompt, given the retrieved memories, or is merely one plausible extrapolation from the prompt (see also Monea et al., 2023).

∗ Work done as an intern at Microsoft Semantic Machines.
1 That is, do they know that their electric sheep are ungrounded?
We develop three probes of increasing complexity across three grounded generation tasks: abstractive summarization, knowledge-grounded dialogue generation, and data-to-text. For each task, we collect hallucinations in two ways: (1) from sampled responses, where we generate outputs from a large language model (LLM) conditioned on the inputs, or (2) by editing reference inputs or outputs to create discrepancies. We refer to these as organic and synthetic data, respectively. In both cases, we have human annotators mark hallucinations in the outputs. Method (2) produces hallucination annotations at a higher rate, though we find that the utility of these synthetic examples is lower as they do not come from the test distribution.

We summarize our contributions as follows:

• We produce a high-quality dataset of more than 15k utterances with hallucination annotations for organic and synthetic output texts across three grounded generation tasks.

• We propose three probe architectures for detecting hallucinations and demonstrate improvements over multiple contemporary baselines in hallucination detection.

• We analyze how probe accuracy is affected by annotation type (synthetic/organic), hallucination type (extrinsic/intrinsic), model size, and which part of the encoding is probed.

2 Related Work

Definitions. We center our study of hallucinations in the setting of in-context generation (Lewis et al., 2020), where grounding knowledge sources are provided within the prompt itself. This is in contrast to settings where a language model is expected to generate a response entirely from its world knowledge (i.e., grounded in the model's training data) (Azaria and Mitchell, 2023).

Though the term has often appeared in reference to factuality (Gabriel et al., 2021), hallucinations are not necessarily factually incorrect. In their survey, Ji et al. (2023) classify hallucinations as intrinsic, where generated responses directly contradict the knowledge sources, or extrinsic, where generated responses are neither entailed nor contradicted by the sources. We follow this taxonomy. For example, given the following task prompt for abstractive summarization,

. . . Deckard is now able to buy his wife Iran an authentic Nubian goat—from Animal Row, in Los Angeles—with his commission . . .

TL;DR:

a generated summary sentence of "Deckard can now purchase an animal for his wife" is faithful, "Deckard can purchase a goat in Iran" is an intrinsic hallucination, and "Sheep are usually cuter and fluffier than goats" is an extrinsic hallucination. While models may leverage latent knowledge to inform their responses, such as recognizing that a goat is indeed an animal, directly stating such knowledge in a response—i.e., only generating "Goats are animals" in the example above—is often considered an extrinsic hallucination, as it is neither entailed nor contradicted by a given knowledge source. Thus, non-hallucinatory generation in this setting involves balancing the language model's retrieval and manipulation of knowledge, both from the given context and from its own stored weights.

Automated Diagnosis. Lexical metrics commonly used in the evaluation of NLG fail to correlate with human judgments of faithfulness (Kryscinski et al., 2019). More useful are NLI approaches that either directly measure the degree of entailment between a generated response and a given source text (Hanselowski et al., 2018; Kryscinski et al., 2020; Goyal and Durrett, 2020), or else break down texts into single-sentence claims that are easily verifiable (Min et al., 2023; Semnani et al., 2023); these have been successfully used for hallucination detection (Laban et al., 2022; Bishop et al., 2023). Similarly, approaches that leverage question-answer models to test if claims made by a response can be supported by the source text, and vice versa (Wang et al., 2020; Durmus et al., 2020; Scialom et al., 2021), try to more explicitly identify the grounding sources of given generations in this task. Under the intuition that a model is usually not confident when it is hallucinating, hallucinatory behavior can also be reflected in metrics that measure generation uncertainty or importance (Malinin and Gales, 2021; Guerreiro et al., 2023; Ren et al., 2023; Ngu et al., 2023; Madsen et al., 2023), relative token contributions (Xu et al., 2023), and differences between the attention matrices and decoder stability (Lee et al., 2018) during generation.
Task: Abstractive Summarization
History: (none)
Knowledge: ...dialed down that initial forecast to predict 13 to 19 named storms, including six to nine hurricanes...
Instruction: TL;DR:
Response: The National Weather Service forecast called for 13 to 19 named storms including six to 11 hurricanes.

Task: KGDG
History: ...B: ok how can I get a turkey I think that may be hard to get though
Knowledge: Three consecutive strikes are known as a "turkey."
Instruction: Response that uses the information in knowledge:
Response: A: Indeed, you have to get four consecutive strikes to get one.

Task: Data-to-Text
History: (none)
Knowledge: name[The Wrestlers], priceRange[high]
Instruction: Restaurant Description:
Response: Located by the riverside, The Wrestlers is a restaurant with high prices, and near Raja Indian Cuisine.

Table 1: From top to bottom, examples of organic hallucinations in model responses for abstractive summarization, knowledge-grounded dialogue generation (KGDG), and data-to-text tasks. Minimum annotated hallucination spans are highlighted in cyan for intrinsic hallucinations and green for extrinsic hallucinations.

Predicting Transformer Behavior. In small transformer language models, it is possible to ascribe behaviors like in-context learning to specific layers and modules (Elhage et al., 2021). This becomes harder in larger language models, but Meng et al. (2022) provide evidence that the feed-forward modules function as key-value stores of factual associations and that it is possible to modify weights in these modules to edit facts associated with entities. Our probes, to be successful, must determine whether the recalled facts were deployed in accordance with the rules of grounding (i.e., to paraphrase facts expressed in the prompt, rather than to replace them or to add details that are merely probable). Transformer internal states have also been shown to be predictive of the likelihood that a model will answer a question correctly in the setting of closed-book QA (Mielke et al., 2022) and in the prediction of the truthfulness of a statement (Azaria and Mitchell, 2023), though these internal representations of truthfulness can disagree systematically with measures of model confidence (Liu et al., 2023). These works measure hidden states in the absence of supporting evidence (i.e., no passages are retrieved for either task), making them effectively probes of knowledge stored in model weights and not of the manipulation of novel or contingent information provided in a prompt. Monea et al. (2023) find that there are separate computational processes in transformer LLMs devoted to recall of associations from memory vs grounding data, and we show that we can detect whether associations from grounding data are used correctly across several distinct generation tasks. Slobodkin et al. (2023) train a probe to detect the (un)answerability of a question given a passage, which is similar in some respects to our task in Section 4; however, we detect token-level hallucination during answer generation, which could happen even in theoretically answerable questions.

3 Grounded Generation Tasks

We test hallucination probes for autoregressive grounded generation in three distinct tasks: abstractive summarization, knowledge-grounded dialogue generation, and data-to-text. We use the same high-level prompt structure across all tasks to organically generate model responses to tasks and to probe for hallucination. Each prompt contains an optional interaction history, a knowledge source, and an instruction resulting in the conditionally generated response. Task prompts and examples of organic hallucinations are shown in Table 1.

3.1 Organic Hallucinations

We use llama-2-13b (Touvron et al., 2023) as the response model to generate outputs for each task. To ensure that a non-negligible number of both hallucinations and grounded responses were present in the generations for each task, each model response is generated using top-k random sampling (Fan et al., 2018) with k = 2. The number of few-shot in-context learning examples for each task was chosen through a pilot exploration of several generation strategies, optimizing for the presence of a decent balance of hallucinations and grounded responses in all tasks, as determined via manual inspection by the authors. Full prompt templates are shown in Appendix C.
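The paper does not include generation code; the following is a minimal sketch of how such organic responses could be sampled with the stated decoding settings, assuming a HuggingFace transformers setup. The model name, the CDM-style TL;DR: prompt, and k = 2 come from the text; the article placeholder, max_new_tokens, and the float16/device settings are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"   # llama-2-13b response model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

article_text = "..."                        # placeholder for a CNN / DailyMail article
prompt = article_text + "\nTL;DR:"          # 0-shot CDM prompt: append TL;DR: after the article
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=True,
    top_k=2,              # top-k random sampling with k = 2 (Section 3.1)
    max_new_tokens=128,   # assumed budget; the paper keeps the first three sentences
)
response = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)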
Example 1 (Conv-FEVER)
Response: Yes, Elvis was also one of the most famous musicians of the 20th century.
Original Knowledge: Elvis Presley is regarded as one of the most significant cultural icons of the 20th century.
Changed Knowledge: Elvis Presley is regarded as one of the least well-known musicians of the 20th century.

Example 2 (E2E)
Response: Close to The Sorrento you can find Clowns pub.
Original Knowledge: name[Clowns], eatType[pub], near[The Sorrento]
Changed Knowledge: name[Clowns], near[The Sorrento]

Table 2: Examples of changed knowledge statements for our datasets of synthetic hallucinations. Changes made to Conv-FEVER (top) and E2E (bottom) are shown in bold. The changed statements make the top response an intrinsic predicate error and the bottom response an extrinsic entity error (highlighted as in Table 1).

Abstractive Summarization. The CNN / DailyMail (CDM) dataset (Hermann et al., 2015) is a standard benchmark in the setting of abstractive summarization (Lin and Ng, 2019). It consists of a set of news articles paired with reference summaries in the form of human-written lists of bullet points that were originally published alongside the articles. For automatic summarization, we follow the GPT-2 paper (Radford et al., 2019) and prompt our model to generate abstractive summaries of news articles by appending TL;DR: after the article content (a 0-shot prompt),2 taking the first three generated sentences as the summary.

Knowledge-Grounded Dialogue Generation. For a given conversation history and a provided knowledge sentence obtained via Wizard-of-Wikipedia, the Conv-FEVER (Santhanam et al., 2021) task (CF) asks a respondent to create a next-turn response to the conversation that is grounded in the provided knowledge sentence. Here, we prompt our response model to generate grounded responses under 2-shot in-context learning with example grounded responses.

Data-to-Text Generation. The E2E task (Novikova et al., 2017) asks respondents to write restaurant descriptions that are congruent with a provided set of attributes. Here, we prompt our response model to perform this task with 10-shot in-context learning.3

2 In Internet posts that appear in LLM pretraining data, this string often marks the start of a summary immediately following the main content of the post. Prior work (Radford et al., 2019) has found that appending this string after texts in prompts to autoregressive LLMs is capable of producing abstractive summaries of the former.

3 As the original E2E task had omitted or missing information in up to 40% of its data, we use the cleaned version by Dušek et al. (2019). As even the cleaned version had instances of omitted or missing information, 10 in-context examples were manually selected and manually verified for groundedness.

3.2 Synthetic Hallucinations

Recent work (Ma et al., 2023; Monea et al., 2023) has carefully created synthetic ungrounded datasets by hand to evaluate metrics used to measure unfaithfulness and models used to detect it. The BUMP dataset (Ma et al., 2023) is a version of CDM in which reference summaries have been manually edited to be ungrounded. It includes both freestyle edits and systematic coverage of four specific types of faithfulness errors: (1) predicate errors, where the predicate in a generation is inconsistent with the source knowledge; (2) entity errors, where the subject or object of a predicate is inconsistent; (3) circumstance errors, or errors in the time, duration, or location of an event of the predicate; and (4) coreference errors. Types (1)–(3) may be intrinsic or extrinsic.

We use BUMP in our CDM experiments. For our Conv-FEVER and E2E experiments, we similarly created faithfulness errors in a uniformly sampled subset of these datasets. However, rather than changing an example's reference response, we changed its provided knowledge sentence so that the existing reference response would be unfaithful to it. This simulates the common case of hallucination where a large language model generates statements that may be true, but which are not grounded in the source documents as required. For Conv-FEVER, we prompted ChatGPT to modify a given example's provided knowledge sentence to introduce intrinsic and extrinsic errors of types (1)–(4) above; we then manually judged these attempts and edited the unsuccessful ones.4 In this way, we created equal numbers of examples with intrinsic and extrinsic predicate, entity, circumstance, and coreference errors, as well as freestyle errors.
Similarly, for E2E, we modified a random subset of the provided grounding attributes. Specifically, for an example with n attributes, we drew k uniformly from [1, n − 1], uniformly selected one of the subsets of size k, and then flipped a fair coin to decide whether to remove or perturb—by randomly sampling an alternative attribute from the entire dataset—these k attributes. Finally, we manually verified that the reference response was unfaithful with respect to the modified attributes; if not, we further edited the attributes. Examples of changed knowledge statements are shown in Table 2; statistics of our synthetic datasets are shown in Table 3.
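As an illustration of the perturbation procedure just described, here is a minimal sketch. The draw of k, the subset choice, and the fair coin follow the text; representing attributes as strings and sampling replacements from a flat attribute_pool are assumptions about details the paper leaves unspecified.

import random

def perturb_attributes(attributes, attribute_pool, rng=random):
    """Remove or perturb a uniformly chosen size-k subset of an example's attributes."""
    n = len(attributes)                    # assumes n >= 2
    k = rng.randint(1, n - 1)              # k drawn uniformly from [1, n - 1]
    chosen = set(rng.sample(range(n), k))  # a uniformly selected subset of size k
    if rng.random() < 0.5:                 # fair coin: remove ...
        return [a for i, a in enumerate(attributes) if i not in chosen]
    return [rng.choice(attribute_pool) if i in chosen else a   # ... or perturb
            for i, a in enumerate(attributes)]

# e.g. perturb_attributes(["name[Clowns]", "eatType[pub]", "near[The Sorrento]"], pool)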
4 In hindsight, this workflow proved inefficient; significant human editing and validation were needed.

5 Obtained via BUMP (Ma et al., 2023).

Task  N-o   H-o   N-s    H-s
CDM   3399  23%   17785  50%
CF    924   64%   2846   50%
E2E   4975  57%   9950   50%

Table 3: Dataset statistics. N-o is the number of annotated organic generations, and H-o is the percentage of organic generations with at least one hallucination. N-s and H-s refer to the same for synthetic responses. Note that grounded synthetic responses were not generated by a model but come from gold reference data.

3.3 Annotation

We ask annotators to label the minimal hallucinatory spans in each response, if any. These are the smallest text spans that would need to be modified to transform an ungrounded response into a grounded one.

A team of 17 annotators completed a pilot annotation exercise of 50 annotations per task. These pilot annotations were checked and verified by the authors to pinpoint common sources of error and disagreement. This feedback was integrated into a revised set of annotation guidelines, shown in Appendix A, and the full dataset was then reannotated by the same team. Each example was annotated once; example annotations are shown in Table 1. The number of annotated examples is shown in Table 3. Note that for each dataset, over one-fifth of all annotated responses were annotated as containing at least one hallucinated span (columns H-o and H-s).

4 Probing

Probes (Alain and Bengio, 2017) are tools used to analyze a neural network's internal representations. Often taking the form of linear classifiers, they are applied to the internal representations of a network and are trained to discriminate between different types of inputs or outputs. Here, we are interested in detecting hallucinatory outputs during an LLM's generation steps when conditioned on a prompt that requires a response grounded in the prompt.

4.1 Design

Let u be a prompt as described at the start of Section 3. To determine whether a response x to a prompt u contains hallucinations, we will probe the hidden states of the transformer language model (i.e., decoder) as it generates x. Note that these hidden states depend on u and on the prefix of x that has been generated so far, and that they may be reconstructed from the string pair (u, x) via forced decoding. A single transformer layer actually consists of two sublayers: the first transforms a token's embedding by adding a vector predicted by attention, and the second transforms it again by adding a vector predicted by a feed-forward network. We train a separate probe for each of the 2L sublayers, where L is the number of layers.
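The probes below consume the per-token outputs of each attention out-projection and each feed-forward (MLP) sublayer under forced decoding. The paper does not show extraction code; the following sketch uses forward hooks on the Llama module names that Figure 3 refers to (self_attn.o_proj and mlp), assuming a HuggingFace transformers LlamaForCausalLM and that the prompt and response have already been concatenated and tokenized into input_ids on the model's device.

import torch

def capture_sublayer_states(model, input_ids):
    """Force-decode input_ids and return {(layer_idx, kind): tensor of [seq_len, d_model]}."""
    captured, hooks = {}, []
    for i, layer in enumerate(model.model.layers):
        for kind, module in (("self_attn.o_proj", layer.self_attn.o_proj),
                             ("mlp", layer.mlp)):
            def hook(_mod, _inp, out, key=(i, kind)):
                captured[key] = out.squeeze(0).detach()   # drop the batch dimension
            hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(input_ids)      # one forward pass over the (prompt, response) pair
    for h in hooks:
        h.remove()
    return captured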
Linear Probe. This probe is a linear classifier using a single hidden state vector (output by some sublayer) as input. Let x_i be the ith response token and let h_i ∈ R^n be the corresponding LLM hidden state at a fixed layer and component. We write y_i ∈ {0, 1} to indicate whether x_i falls within a minimum hallucination span. The linear probe predicts y_i by p(y_i = 1 | u, x_≤i) = σ(w⊺ h_i + b), where w ∈ R^n are learned weights, b ∈ R is a learned bias, and σ is the sigmoid function.

Attention-Pooling Probe. In order to directly consider the previous hidden states of the response as well, we also experiment with an attention-pooling probe over all hidden states so far,

p(y_i = 1 | u, x_≤i) = σ(w⊺ h̄_i),  where  h̄_i = Σ_{j=1}^{i} α_{i,j} h_j  and  α_{i,j} = exp(q⊺ h_j) / Σ_{k=1}^{i} exp(q⊺ h_k),

where q ∈ R^n is a learned query vector.
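As a concrete reading of these two probe definitions, here is a minimal PyTorch sketch; the masking details, initialization, and batch layout are assumptions rather than the authors' implementation, and only the functional forms follow the equations above.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """p(y_i = 1 | u, x_<=i) = sigmoid(w^T h_i + b) on one sublayer's hidden states."""
    def __init__(self, d_model):
        super().__init__()
        self.w = nn.Linear(d_model, 1)          # weights w and bias b

    def forward(self, h):                       # h: [seq_len, d_model]
        return torch.sigmoid(self.w(h)).squeeze(-1)

class AttentionPoolingProbe(nn.Module):
    """Pools h_1..h_i with a learned query q, then applies sigmoid(w^T h_bar_i)."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Parameter(torch.zeros(d_model))
        self.w = nn.Linear(d_model, 1, bias=False)

    def forward(self, h):                       # h: [seq_len, d_model]
        seq_len = h.size(0)
        scores = h @ self.q                     # q^T h_j for every position j
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=h.device))
        neg_inf = torch.tensor(float("-inf"), device=h.device)
        alpha = torch.where(mask, scores.unsqueeze(0), neg_inf).softmax(dim=-1)   # alpha[i, j]
        pooled = alpha @ h                      # h_bar_i for every position i
        return torch.sigmoid(self.w(pooled)).squeeze(-1)

Each of the 2L sublayers would get its own such probe, trained on the states captured by the forced-decoding sketch above.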
[Figure 2: diagram of the probing setup. The prompt (<INSTRUCTION> <KNOWLEDGE> <HISTORY>) and the generated response ("Indeed, you have to get four consecutive ...") pass through the transformer; single-layer classifiers are trained on the self-attention and MLP internal representations of layers 1 to L, and an ensemble classifier combines them into per-token grounded/hallucination predictions.]

Figure 2: General probe structure. Response tokens that were annotated as hallucinatory are shown in red. For each internal representation layer, and each type of representation (attention output and feed-forward output), a classifier is separately trained on the attention head out-projection and feed-forward hidden states for all representation layers (or just the final token) of a response from a prompted language model. Depending on the probe, these classifiers are then ensembled to make a final prediction of whether each individual token was, or otherwise if the response contained, a hallucination.

Ensemble Probe. Our final approach is a single feed-forward layer that combines the predictions from our 2L separately trained probes (either the linear probes or the attention-pooling probes):

p(y_i | u, x_≤i) = σ( Σ_{l ∈ {1,...,L}, m ∈ {attention, feed-forward}} β_{l,m} · p_{l,m}(y_i | u, x_≤i) ),

where β_{l,m} are learned weights. The individual probes are frozen before the β_{l,m} parameters are fit.
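A minimal sketch of that combination step, assuming the 2L frozen per-token probe probabilities have already been stacked into a single tensor:

import torch
import torch.nn as nn

class EnsembleProbe(nn.Module):
    def __init__(self, num_probes):             # num_probes = 2L frozen probes
        super().__init__()
        self.beta = nn.Linear(num_probes, 1, bias=False)   # learned weights beta_{l,m}

    def forward(self, probe_probs):              # probe_probs: [seq_len, 2L]
        return torch.sigmoid(self.beta(probe_probs)).squeeze(-1)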
4.2 Training Objectives

Token-level. To train a probe to predict token-level hallucination indicators y_i, we minimize negative log-likelihood on the annotations in our dataset D:

L(θ) ∝ − Σ_{(u,x,y)∈D} Σ_{i=1}^{|x|} log p(y_i | u, x_≤i; θ)

Response-level. We also try training a classifier to predict whether x contains any hallucinations at all; denote this indicator as y = ⋁_{i=1}^{|x|} y_i ∈ {0, 1}. For this, we simply use an attention-pooling probe over the entire response x. To fit the probe parameters θ, we again minimize negative log-likelihood:

L(θ) ∝ − Σ_{(u,x,y)∈D} log p(y | u, x_≤|x|; θ)

4.3 Metrics

Response-level Classification. F1 is the harmonic mean of recall and precision. We denote the F1 score of a response-level classifier by F1-R.

Token-level Classification. For token-level classifiers, it is possible to evaluate the F1 score of their per-token predictions; however, this would give more weight to annotated or predicted spans that are longer in length. Here, we are more interested in span-level F1: specifically, whether we predicted all of the annotated hallucinations (recall), and whether all of the predicted spans were annotated as actual hallucinations (precision). As requiring an exact match of the spans is too severe—Appendix B shows how our human annotators often disagreed on the exact boundaries of a hallucinatory span—we modify span-level precision and recall metrics to give partial credit, as follows.
Define a span to be a set of tokens, where tokens from different responses are considered distinct. Let A be the set of all annotated spans in the reference corpus, and let P be the set of all predicted spans on the same set of examples. We compute the recall r as the average coverage of each reference span, i.e., the fraction of its tokens that are in some predicted span. Conversely, we compute the precision p as the average coverage of each predicted span:

r = (1/|A|) Σ_{s∈A} |s ∩ (∪_{ŝ∈P} ŝ)| / |s|

p = (1/|P|) Σ_{ŝ∈P} |ŝ ∩ (∪_{s∈A} s)| / |ŝ|

where ∪, ∩, and |·| denote set union, intersection, and cardinality, respectively. We define F1-Sp (standing for "span, partial credit") as the harmonic mean of p and r. Note that F1-Sp micro-averages over examples, ignoring the boundaries between them. Thus, in contrast to F1-R, examples with more spans have more influence on the final F1-Sp metric achieved.
metric achieved. SYNTH indicates that the classifier was trained on syn-
thetic data and tested on organic hallucinations. The
5 Experiments Human row measures the performance of a human an-
Unless otherwise indicated, all evaluations are per- notator who is different from the reference annotator;
this was measured on a subset of 50 examples that were
formed on organic data only, even if their probes
triply annotated, averaging over all 6 (annotator, refer-
were trained on synthetic data. ence annotator) pairs.
5.1 Probe Hyperparameters and Training
Single-layer hidden state probes, described in Sec-
(2) off-the-shelf synthetically-trained sentence-
tion 4, are trained under float32 precision with
level claim classification FactCC (Kryscinski et al.,
unregularized log loss. A batch size of 100 and
2020); (3) linguistic feature-based dependency
Adam optimization (Kingma and Ba, 2014) with
arc entailment DAE (Goyal and Durrett, 2020);
a learning rate of 0.001 and β1 , β2 = (0.9, 0.999)
(4) QA-based metrics FEQA (Durmus et al., 2020);
were chosen after a small grid search on hyperpa-
(5) QuestEval (Scialom et al., 2021); (6) the NLI-
rameters. With 70/10/20% train, validation, and
based metric SummaC (Laban et al., 2022); and
test splits, each classifier was trained using early
ChatGPT classifiers (Luo et al., 2023) (7) with-
stopping with 10-epoch patience. With the pre-
out and (8) with chain-of-thought prompting, with
dictions of these single-layer probes as input, the
prompts shown in Appendix C.1.
ensemble classifier was trained with the same hy-
perparameters. Metrics (2)–(6) were not originally intended to
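A sketch of the probe-training recipe these hyperparameters describe; the data-loader format, the validation criterion, and the epoch cap are assumptions (the batch size of 100 would be set when building train_loader).

import torch
import torch.nn as nn

def train_probe(probe, train_loader, val_loader, patience=10, max_epochs=200):
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3, betas=(0.9, 0.999))
    bce = nn.BCELoss()                            # unregularized log loss on sigmoid outputs
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        probe.train()
        for hidden, labels in train_loader:       # hidden: [batch, d_model], labels in {0, 1}
            opt.zero_grad()
            bce(probe(hidden), labels.float()).backward()
            opt.step()
        probe.eval()
        with torch.no_grad():
            val = sum(bce(probe(h), y.float()).item() for h, y in val_loader)
        if val < best_val:                        # early stopping with 10-epoch patience
            best_val, best_state, bad_epochs = val, probe.state_dict(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        probe.load_state_dict(best_state)
    return probe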
5.2 Response-Level Baselines

We compare the performance of our probes against a set of contemporary methods used in faithfulness evaluation, measuring test-set hallucination classification F1. Specifically: (1) the model uncertainty-based metric of length-normalized sequence log-probability, Seq-Logprob (Guerreiro et al., 2023); (2) the off-the-shelf, synthetically trained sentence-level claim classifier FactCC (Kryscinski et al., 2020); (3) the linguistic-feature-based dependency arc entailment metric DAE (Goyal and Durrett, 2020); (4) the QA-based metrics FEQA (Durmus et al., 2020) and (5) QuestEval (Scialom et al., 2021); (6) the NLI-based metric SummaC (Laban et al., 2022); and ChatGPT classifiers (Luo et al., 2023) (7) without and (8) with chain-of-thought prompting, with prompts shown in Appendix C.1.

Metrics (2)–(6) were not originally intended to detect hallucinations in individual responses, but rather to evaluate a collection of responses for the sake of system comparisons. To adapt them to our setting, we threshold each of these continuous metrics to classify the individual responses, setting the threshold to maximize validation set F1, following Laban et al. (2022). For sentence-level optimized baselines such as FactCC, response-level evaluation is performed by logically OR-ing the sentence-level baseline predictions (obtained after thresholding the raw scores).

As a further reference baseline, we additionally report the performance achieved by randomly predicting a hallucination label with some probability p tuned to maximize F1 on the validation set, denoted here as Optimized Coin (OC).
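For reference, a minimal sketch of the Seq-Logprob baseline above, i.e., the mean log-probability the response model assigns to its own response tokens; averaging per token is our reading of "length-normalized", and the snippet assumes the prompt's tokenization is a prefix of the full sequence's tokenization.

import torch

def seq_logprob(model, tok, prompt, response):
    """Length-normalized log-probability of `response` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(dim=-1)
    # Each response token at position t is predicted from the hidden state at position t - 1.
    token_lps = [logprobs[0, t - 1, ids[0, t]] for t in range(prompt_len, ids.shape[1])]
    return torch.stack(token_lps).mean().item()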
Response-Level Classification, F1-R

Method            CDM    CF     E2E
OC                0.38   0.78   0.72
Seq-Logprob       0.44   0.81   0.75
FactCC            0.43   -      -
DAE               0.37   -      -
FEQA              0.35   -      -
QuestEval         0.41   -      -
SummaCZS          0.46   -      -
SummaCConv        0.42   -      -
ChatGPT           0.34   0.62   0.37
ChatGPT CoT       0.36   0.68   0.45
PoolingSL         0.43   0.86   0.86
PoolingE          0.45   0.88   0.87
PoolingSL-SYNTH   0.46∗  0.64   0.66
PoolingE-SYNTH    0.48∗  0.65   0.69
Human             0.71   0.87   0.87

Table 4: Response-level probe and baseline organic hallucination detection performance across tasks as measured via F1 scores (F1-R) achieved on the test set. SL denotes the best single-layer classifier (chosen on the validation set), E denotes the ensembled classifier, and SYNTH indicates that the classifier was trained on synthetic data and tested on organic hallucinations. The Human row measures the performance of a human annotator who is different from the reference annotator; this was measured on a subset of 50 examples that were triply annotated, averaging over all 6 (annotator, reference annotator) pairs.

5.3 Results

Response-level Results. On two datasets, we were actually able to match the interannotator agreement rate (the "Human" row). On these datasets, the probes trained on organic hallucinations worked best (Table 4). The exception was the CDM dataset, where the organic probes lost slightly to the synthetic probes, which in turn fell well short of human performance. On all datasets, we were able to outperform all baseline systems, showing that probing is a feasible (and efficient) alternative to evaluating LLM outputs when model states are available.

Relative to other baselines that also have some access to model internals—such as Seq-Logprob—we hypothesize that probing-based approaches perform better here due to being able to capture information about model uncertainty that is lost when reduced to the output token probability.

6 For CDM, the synthetic training dataset (from BUMP) is smaller than the organic training dataset, so the SYNTH results are at a disadvantage. For CF and E2E, however, we generated a synthetic training dataset that was matched in size.

Span-Level Results. These results appear in Table 5. Again, the "Human" row serves as a ceiling on how well we can expect our automated system to perform on these datasets, given that even humans do not fully agree. We observe that our ensemble attention-pooling probe nearly meets this ceiling for Conv-FEVER but lags behind for CDM and E2E. Still, even this level of performance is potentially useful for avoiding hallucinations, as we discuss in §6.

Span-level Classification, F1-Sp

Method     CDM    CF     E2E
OC         0.09   0.58   0.15
LinearSL   0.24   0.65   0.46
LinearE    0.27   0.66   0.47
PoolingSL  0.32   0.68   0.46
PoolingE   0.34   0.68   0.47
Human      0.61   0.69   0.58

Table 5: Span-level probe hallucination detection performance, as measured via span-level partial match scores (F1-Sp). The Human row is computed as in Table 4 and is relatively stable across tasks, unlike the systems.

Response-level Classification, F1-R
Base Model Size  CDM    CF     E2E
7B               0.45   0.88   0.87
13B              0.45   0.88   0.87
70B              0.45   0.89   0.87

Span-level Classification, F1-Sp
Base Model Size  CDM    CF     E2E
7B               0.65   0.77   0.46
13B              0.64   0.77   0.46
70B              0.64   0.76   0.46

Table 6: Response- and span-level F1 hallucination detection performance, as measured by F1-R and F1-Sp, respectively, of the PoolingE ensemble probe across llama-2 response model sizes and test sets across tasks. For each task and model size, a probe is trained on labeled responses of the same task that were generated organically from the 13B model.

Layers. Evaluating the performance of trained single-layer probes (Figure 3), we see that signals of hallucination are strongest in the model's middle layers during decoding. Probe performance shows a consistent and gradual increase from the zeroth layer up until layers 10 to 20, after which there is a consistent decline in hallucination classification performance for the subsequent layers.

Hidden State Type. Slight variations in grounding saliency—the strength of the hallucination signal that a probe is able to detect, measured by probe hallucination classification performance—are also present among hidden states of different types during decoding. As shown in Figure 3, feed-forward probes reach peak performance at slightly deeper layers than their attention head out-projection probe counterparts and exceed them in single-layer hallucination detection. Training ensemble classifiers on only the hidden states of the same type at every layer, we see that feed-forward hidden states possess a slight but statistically significant advantage in the saliency of model grounding behavior.
[Figure 3: three panels (CNN DailyMail, ConvFEVER, E2E) plotting test-set F1 against transformer layer (0–40) for single-layer probes, with separate curves for the two hidden state types, MLP and self_attn.o_proj.]

Figure 3: Grounding behavior saliency—as measured by probe response-level F1 on the test set—of single-layer probe classifiers trained on the attention-head out-projections and feed-forward states of llama-2-13b, stratified across tasks and layers.

Model Size. Training probes in the same manner on the force-decoded hidden states of responses in our dataset using llama-2-7b and llama-2-70b, we find no statistically significant difference (via Kolmogorov–Smirnov) in hallucination detection across the hidden states from base models of varying sizes (Table 6). These results show that we can still detect a hallucination even when it was originally generated from a different model, and that even the lower-dimensional hidden states of the 7B model still contain enough information to train a successful probe.
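The paper reports the Kolmogorov–Smirnov comparison without further detail; the sketch below shows the kind of two-sample test this implies, on hypothetical per-layer detection scores for probes trained on states from two base-model sizes (both the score lists and the choice of exactly which distributions were compared are assumptions).

from scipy.stats import ks_2samp

f1_13b = [0.82, 0.84, 0.86, 0.85, 0.83]   # hypothetical per-layer F1, 13B hidden states
f1_7b  = [0.81, 0.84, 0.85, 0.85, 0.82]   # hypothetical per-layer F1, 7B hidden states

stat, p_value = ks_2samp(f1_13b, f1_7b)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")   # a large p suggests no significant difference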

Hallucination Type. Next, we test for the saliency of different forms of hallucination behavior in model internal states. Here, with test set responses labeled by annotators according to whether they contained intrinsic or extrinsic hallucinations, we evaluate trained probe performance in correctly detecting the presence of hallucinations for instances of each hallucination type. We observe in Figure 4 that extrinsic hallucination behavior is more salient in the hidden states of shallower layers for both organic and synthetic hallucinations, with saliency noticeably peaking around layers 15 and 16 and decreasing in deeper layers. Overall, extrinsic hallucination remained more salient in model hidden states than intrinsic hallucination.

[Figure 4: two panels plotting test-set recall against transformer layer (0–40) for single-layer MLP probes (top) and single-layer self_attn.o_proj probes (bottom), with curves for extrinsic synthetic, intrinsic synthetic, extrinsic organic, and intrinsic organic hallucinations.]

Figure 4: Intrinsic and extrinsic hallucination saliency across layers for abstractive summarization as measured by response-level probe recall on hallucination types.

Synthetic Hallucinations. As Table 7 shows, probes achieve very high F1 in the detection of synthetically created hallucinations across all tasks. However, probe performance suffers a significant drop when trained and tested on different modalities, with probes generally performing better when trained on organic hallucinations and tested on synthetic ungrounded responses than the reverse; the notable exception is abstractive summarization on the CDM dataset. Prior work has shown that synthetic data generation approaches specifically designed for factuality evaluation (Ma et al., 2023) do not align with actual errors made by generation models in their lexical error distributions (Goyal and Durrett, 2021). Here, our results show this to extend to their hidden state signals as well, which the probe uses to detect unfaithfulness in the case of forced decoding on human-created content, relative to organic instances of hallucination.
Task  Train      Test       F1-R
CDM   Organic    Organic    0.45
      Synthetic  Organic    0.48
      Organic    Synthetic  0.38
      Synthetic  Synthetic  0.93
CF    Organic    Organic    0.88
      Synthetic  Organic    0.65
      Organic    Synthetic  0.83
      Synthetic  Synthetic  0.94
E2E   Organic    Organic    0.87
      Synthetic  Organic    0.69
      Organic    Synthetic  0.90
      Synthetic  Synthetic  0.98

Table 7: Response-level PoolingE probe performance, as measured by F1-R, in detecting hallucinations on organic or synthetic ungrounded responses. Performance is generally better when the training data resembles the test data.

Test  Train    F1-R
CDM   CF       0.44
      E2E      0.44
      CF+E2E   0.46
      CDM      0.45
CF    CDM      0.60
      E2E      0.80
      CDM+E2E  0.75
      CF       0.88
E2E   CDM      0.60
      CF       0.75
      CDM+CF   0.76
      E2E      0.87

Table 8: Response-level PoolingE cross-task probe performance, as measured by F1-R. All experiments use the same amount of training data (a random subset of the full training set). The "+" sign indicates an equal mixture of two training tasks.

Task. To test whether the saliency of grounding behavior persists consistently in similar hidden state locations across different tasks, we train our probe on each specific task and evaluate its performance when applied to the others. As shown in Table 8 and compared to same-task training, we find a lack of generalization in hallucination detection performance for our probes across different train and test tasks. This may mean that there is not a consistent signal of hallucination across tasks, or that our probes are not distinguishing this signal from its task- or dataset-specific confounds and are overfitting to the confounds.

6 Discussion

In this work, we examined how LLM internal representations reveal their hallucinatory behavior, via a case study of several in-context grounded generation tasks. Our analyses point to several nuances in the saliency of this behavior across model layers, hidden state types, model sizes, hallucination types, and contexts. Below, we highlight a few takeaways as well as limitations that naturally suggest directions for future work.

Efficiency and Access. Our work is motivated by the need for efficient LLM hallucination evaluation. Though probing is computationally efficient, one of its key limitations is the need for labeled in-domain data for probe training. Our results in Table 8 suggest poor generalization to out-of-domain tasks, although perhaps this would be mitigated by larger and more diverse training sets. Furthermore, it perhaps goes without saying that probing requires access to the hidden states themselves. If LLMs continue to move behind closed-source APIs, such hidden state probes will not be possible for those without privileged access.

Ecological Validity. The conclusions from an ecologically valid experiment should hold outside the context of that study (De Vries et al., 2020). Synthetic hallucinations are generally ecologically invalid for organic hallucination detection. Extending beyond surface-level error distributions (Goyal and Durrett, 2021), these differences in internal representations further suggest that language model behavior during hallucination fundamentally misaligns with current expectations of ungrounded generations. Training on the same modality—synthetic or organic—is imperative to maximize hallucination detection performance, and our results on the synthetic modality, especially on CNN / DailyMail, suggest probing to be a feasible method for complex multi-sentence entailment classification that can leverage the capability of LLMs.
Annotator Disagreements. Despite asking annotators to label what they deem to be the minimal spans of hallucination in a response, considerable variation still exists between annotators in their judgments of exact span boundaries. It remains an open question how to improve annotator agreement in this context.

Probe Design. Even if span boundaries were standardized, they may not align perfectly with hallucination signals in LLM internal representations. For instance, while a generation may first contradict a given knowledge source at a certain token, the ultimate signal that it will hallucinate may already be salient in the tokens preceding it. Alternatively, the signal might be computed only later, as the decoder encodes previously generated tokens and recognizes that it is currently pursuing an ungrounded train of thought. Thus, there may be probe architectures that could improve hallucination detection performance and mitigation. Nonlinear probes such as feed-forward networks are also worth considering: the hidden layer(s) of a feed-forward network could learn to detect different types of hallucinations.

Mitigation. Finally, once we can detect ungrounded generation behavior, we can try to learn to avoid it. This has great practical importance as retrieval-oriented generation becomes popular across a variety of consumer-facing applications. We have left this subject for future work. A simple method is to reject a response or prefix where hallucinations have been detected with high probability, and start sampling the response again (Guerreiro et al., 2023). Another approach would be to use the predicted hallucination probabilities to reweight or filter hypotheses during beam search. Finally, a detector can also be used as a proxy reward for fine-tuning the language model, using a reward-based method such as PPO (Schulman et al., 2017). In addition to learning from response-level negative rewards incurred at the end of the utterance, PPO can benefit from early token-level negative rewards, which serve as a more informative signal for reinforcement learning (i.e., reward shaping) and thus can reduce sample complexity.
reward shaping) and thus can reduce sample com- IJCNLP 2021, pages 478–487, Online. Association
plexity. for Computational Linguistics.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer


Acknowledgments Levy. 2021. Transformer feed-forward layers are key-
value memories. In Proceedings of the 2021 Confer-
We thank Val Ramirez, Patrick Xia, Harsh Jham- ence on Empirical Methods in Natural Language Pro-
tani, Hao Fang, Tongfei Chen, and Radhika cessing, pages 5484–5495, Online and Punta Cana,
Gaonkar for their helpful comments, thoughts, and Dominican Republic. Association for Computational
discussions, as well as the data specialists who con- Linguistics.
tributed and aided in the creation of this work. Tanya Goyal and Greg Durrett. 2020. Evaluating factu-
ality in generation with dependency-level entailment.
In Findings of the Association for Computational Lin- Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fan-
guistics: EMNLP 2020, pages 3592–3603, Online. njiang, and David Sussillo. 2018. Hallucinations in
Association for Computational Linguistics. neural machine translation.
Tanya Goyal and Greg Durrett. 2021. Annotating and Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
modeling fine-grained factuality in summarization. Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
In Proceedings of the 2021 Conference of the North rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
American Chapter of the Association for Computa- täschel, et al. 2020. Retrieval-augmented generation
tional Linguistics: Human Language Technologies, for knowledge-intensive NLP tasks. Advances in
pages 1449–1462, Online. Association for Computa- Neural Information Processing Systems, 33:9459–
tional Linguistics. 9474.
Nuno M. Guerreiro, Elena Voita, and André Martins.
2023. Looking for a needle in a haystack: A com- Hui Lin and Vincent Ng. 2019. Abstractive summariza-
prehensive study of hallucinations in neural machine tion: A survey of the state of the art. Proceedings
translation. In Proceedings of the 17th Conference of the AAAI Conference on Artificial Intelligence,
of the European Chapter of the Association for Com- 33(01):9815–9822.
putational Linguistics, pages 1059–1075, Dubrovnik,
Croatia. Association for Computational Linguistics. Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and
Jacob Andreas. 2023. Cognitive dissonance: Why do
Andreas Hanselowski, Hao Zhang, Zile Li, Daniil language model outputs disagree with internal rep-
Sorokin, Benjamin Schiller, Claudia Schulz, and resentations of truthfulness? In Proceedings of the
Iryna Gurevych. 2018. UKP-athene: Multi-sentence 2023 Conference on Empirical Methods in Natural
textual entailment for claim verification. In Proceed- Language Processing, pages 4791–4797, Singapore.
ings of the First Workshop on Fact Extraction and Association for Computational Linguistics.
VERification (FEVER), pages 103–108, Brussels, Bel-
gium. Association for Computational Linguistics. Zheheng Luo, Qianqian Xie, and Sophia Ananiadou.
2023. ChatGPT as a factual inconsistency evaluator
Karl Moritz Hermann, Tomas Kocisky, Edward Grefen- for abstractive text summarization. arXiv preprint
stette, Lasse Espeholt, Will Kay, Mustafa Suleyman, arXiv:2303.15621.
and Phil Blunsom. 2015. Teaching machines to read
and comprehend. Advances In Neural Information Liang Ma, Shuyang Cao, Robert L Logan IV, Di Lu,
Processing Systems, 28. Shihao Ran, Ke Zhang, Joel Tetreault, and Alejandro
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Jaimes. 2023. BUMP: A benchmark of unfaithful
Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea minimal pairs for meta-evaluation of faithfulness met-
Madotto, and Pascale Fung. 2023. Survey of halluci- rics. In Proceedings of the 61st Annual Meeting of
nation in natural language generation. ACM Comput- the Association for Computational Linguistics (Vol-
ing Surveys, 55(12):1–38. ume 1: Long Papers), pages 12788–12812, Toronto,
Canada. Association for Computational Linguistics.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint Andreas Madsen, Siva Reddy, and Sarath Chandar. 2023.
arXiv:1412.6980. Faithfulness measurable masked language models.
arXiv preprint arXiv:2310.07819.
Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc-
Cann, Caiming Xiong, and Richard Socher. 2019. Andrey Malinin and Mark Gales. 2021. Uncertainty
Neural text summarization: A critical evaluation. In estimation in autoregressive structured prediction. In
Proceedings of the 2019 Conference on Empirical International Conference on Learning Representa-
Methods in Natural Language Processing and the 9th tions.
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 540–551, Hong Kevin Meng, David Bau, Alex Andonian, and Yonatan
Kong, China. Association for Computational Linguis- Belinkov. 2022. Locating and editing factual asso-
tics. ciations in GPT. Advances in Neural Information
Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Processing Systems, 36.
and Richard Socher. 2020. Evaluating the factual
consistency of abstractive text summarization. In Kevin Meng, Arnab Sen Sharma, Alex J Andonian,
Proceedings of the 2020 Conference on Empirical Yonatan Belinkov, and David Bau. 2023. Mass-
Methods in Natural Language Processing (EMNLP), editing memory in a transformer. In The Eleventh
pages 9332–9346, Online. Association for Computa- International Conference on Learning Representa-
tional Linguistics. tions.

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-
Marti A. Hearst. 2022. SummaC: Re-visiting NLI- Lan Boureau. 2022. Reducing conversational agents’
based models for inconsistency detection in summa- overconfidence through linguistic calibration. Trans-
rization. Transactions of the Association for Compu- actions of the Association for Computational Linguis-
tational Linguistics, 10:163–177. tics, 10:857–872.
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike models. In Proceedings of the 2023 Conference on
Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Empirical Methods in Natural Language Processing,
Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. pages 3607–3625, Singapore. Association for Com-
FActScore: Fine-grained atomic evaluation of factual putational Linguistics.
precision in long-form text generation. In EMNLP.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
Giovanni Monea, Maxime Peyrard, Martin Josifoski, bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Vishrav Chaudhary, Jason Eisner, Emre Kıcıman, Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Hamid Palangi, Barun Patra, and Robert West. 2023. Bhosale, et al. 2023. Llama 2: Open founda-
A glitch in the Matrix? Locating and detecting tion and fine-tuned chat models. arXiv preprint
language model grounding with Fakepedia. arXiv arXiv:2307.09288.
preprint arXiv:2312.02073.
Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020.
Noel Ngu, Nathaniel Lee, and Paulo Shakarian. 2023. Asking and answering questions to evaluate the fac-
Diversity measures: Domain-independent proxies for tual consistency of summaries. In Proceedings of the
failure in language model queries. arXiv preprint 58th Annual Meeting of the Association for Compu-
arXiv:2308.11189. tational Linguistics, pages 5008–5020, Online. Asso-
ciation for Computational Linguistics.
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser.
2017. The E2E dataset: New challenges for end- Weijia Xu, Sweta Agrawal, Eleftheria Briakou, Mari-
to-end generation. In Proceedings of the 18th An- anna J. Martindale, and Marine Carpuat. 2023. Un-
nual SIGdial Meeting on Discourse and Dialogue, derstanding and detecting hallucinations in neural
pages 201–206, Saarbrücken, Germany. Association machine translation via model introspection. Trans-
for Computational Linguistics. actions of the Association for Computational Linguis-
tics, 11:546–564.
Alec Radford, Jeff Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners.
Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mo-
hammad Saleh, Balaji Lakshminarayanan, and Pe-
ter J Liu. 2023. Out-of-distribution detection and
selective generation for conditional language mod-
els. In The Eleventh International Conference on
Learning Representations.
Sashank Santhanam, Behnam Hedayatnia, Spandana
Gella, Aishwarya Padmakumar, Seokhwan Kim,
Yang Liu, and Dilek Hakkani-Tur. 2021. Rome was
built in 1776: A case study on factual correctness
in knowledge-grounded response generation. arXiv
preprint arXiv:2110.05456.
John Schulman, Filip Wolski, Prafulla Dhariwal,
Alec Radford, and Oleg Klimov. 2017. Proxi-
mal policy optimization algorithms. arXiv preprint
arXiv:1707.06347.
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier,
Benjamin Piwowarski, Jacopo Staiano, Alex Wang,
and Patrick Gallinari. 2021. QuestEval: Summariza-
tion asks for fact-based evaluation. In Proceedings of
the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, pages 6594–6604, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.
Sina J. Semnani, Violet Z. Yao, Heidi C. Zhang, and
Monica S. Lam. 2023. WikiChat: A few-shot LLM-
based chatbot grounded with Wikipedia. arXiv
preprint arXiv:2305.14292.
Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido
Dagan, and Shauli Ravfogel. 2023. The curious case
of hallucinatory (un)answerability: Finding truths in
the hidden states of over-confident large language
Appendix

Here, we provide additional information about annotation guidelines, annotator disagreement examples, prompts used to generate organic responses to our tasks, and ChatGPT baseline comparisons.

A Annotation Guidelines

The purpose of this annotation task is two-fold: (1) identify if a generated response is a hallucination relative to a given knowledge source, and (2) if so, identify and mark the location(s) of the minimal spans of text that are ungrounded against the knowledge source. A response is grounded in a knowledge source if that source entails the content of the response. In other words, the response should not (1) contradict the knowledge source or (2) draw conclusions that are not explicitly given in it.

A.1 CNN / DailyMail

Provided Information. (1) Original article, (2) summary of article.

Task. Identify the presence and location of hallucinations in the summary of the article, relative to the original article text. You are to only annotate—by copying the text in the summary and adding characters < and > in specific locations—around the minimal span(s) of the summary text relative to the original article; i.e., if changing an additional token wouldn't change whether or not the response was grounded, don't include it in the span you annotate. Note: You will be given summaries that contain a varying number of sentences. You are only to annotate the first three sentences in the summary, discarding the rest.

Example. (CNN) – The world's fastest man on a pair of skis just got faster. On Monday Italian Simone Origone broke his own speed skiing world record as he reached 252.4 kilometers per hour on the Chabrieres slopes in the French Alps – an achievement confirmed by organizers France Ski de Vitesse. With a 1,220 m slope that has a maximum gradient of 98% and an average of 52.5%, Chabrieres is not for the faint-hearted. Traveling at over 250 km/h is a privilege usually reserved for Formula One drivers, yet speed skier Origone was equipped with just an aerodynamic helmet to increase streamlining and a ski suit made from airtight latex to reduce wind resistance. Origone's one nod to safety was wearing a back protector in case of a crash as he threw himself down the one kilometer track. Origone has been the fastest speed skier on the globe since April 2006, having set a then-new world record of 251.4 km/h at Les Arcs. Bastien Montes of France was Origone's closest challenger on Monday, but even the Frenchman's new personal best of 248.105 km/h was someway short of the Italian's 2006 world record, let alone his latest one. "Simone Origone is the greatest champion of all time, he is the only person to hold the record for two ski speeds in France – Les Arcs and Vars Chabrieres. This is a historic day," technical director of Speed Masters Philippe Billy told Vars.com. The 34-year-old Origone – a ski instructor, mountain guide and rescuer by day – only took up the discipline in 2003, having given up downhill skiing in 1999. "Now that I have twice won the world record, I can say that I have made history," Origone told Vars.com. "It is important for me and for speed skiing."

Hypothetical Instance 1. The world's fastest man on skis just got faster.

• Everything is grounded, no need to mark anything here.

Annotation. The world's fastest man on skis just got faster.

Hypothetical Instance 2. The world's fastest man on skis just got faster. He has now won the world record twice. He has excelled at Chabrieres. He has been the fastest speed skier on the globe since April 2006. His speed record reached more than 260 kilometers per hour.

• This summary has more than 3 sentences, so the first thing to do is to only look at the first three and discard any sentences beyond that in your annotation.

• Even though the last sentence is ungrounded relative to the source text, it is not within the first three sentences, so we can ignore it here.

• Annotate as such, copying the first three sentences into the annotation box. No need to highlight anything, as everything in the first three sentences is grounded:

Annotation. The world's fastest man on skis just got faster. He has now won the world record twice. He has excelled at Chabrieres.
Hypothetical Instance 3. Simone Origone, now in his twenties, just got faster. He has now won the world record more than twice.

• “In his twenties” contradicts “The 34-year-old Origone” in the original article.

• “More than twice” contradicts the fact that he has so far won it exactly twice. As such, annotate:

Annotation. Simone Origone, now in his <twenties>, just got faster. He has now won the world record <more than> twice.

Special Points of Note. Below, we list a few points of special note to keep in mind during your annotations.

Concatenation. When you find yourself annotating something like “The city’s pollution levels have reached or <exceeded ><“very high”> levels for the last...,” do concatenate the two labeled spans; if there are no characters between them (for example, <exceeded ><“very high”>), concatenating the annotation bounds into (<exceeded “very high”>) is ideal. A code sketch of this merging step is given at the end of this subsection.

“Was”/“Is”. If the article says something like “Woods insisted he is ready to win the famous Claret...” and the summary says something like “said he was ready to win the famous Claret...”, “was” here would not be considered a hallucination: the summary is recounting what was said in the article, so the past tense does not contradict what was originally said.

When The Article Has “Contradicting” Facts. Articles may sometimes “correct” themselves; i.e., the article may originally state a claim and then correct it later on. For example, “The NWS – last May – initially forecast 13 to 20 “named storms,” including seven to 11 hurricanes. Then in August, it dialed down that initial forecast to predict 13 to 19 named storms, including six to nine hurricanes.”

In this case, if the summary only states “The National Weather Service forecast called for 13 to 20 named storms” (and doesn’t mention it was corrected later to 13 to 19), then this is a hallucination, as it is corrected later in the article.

If the summary states “The National Weather Service forecast initially called for 13 to 20 named storms” (and doesn’t mention it was corrected later to 13 to 19), then this is not a hallucination, as it qualifies the original prediction with “initially.”

If the summary states “The National Weather Service forecast called for 13 to 19 named storms,” then this is not a hallucination, as it is the final corrected fact.

Identifying the Summary. Sometimes the summarization system may produce things that are not a summary of the article. For example, “Woman dies while paragliding in Tenerife. Paragliding in Tenerife. Paragliding in Tenerife. Paragliding in Tenerife. Paragliding Tenerife Canary Islands.” Here, only the first sentence is the summary, and you can discard the rest. This happens because the summarization system tries to mimic how actual website summaries work, so it may also add things like topic tags and keywords.

Another example is “InvisiBra, £48, Lavalia, lavalia.co.uk; Strapless Stick On Adhesive Bra, £12, Marks and Spencer; Natural Stick On Bra, £20, Debenhams.” These are all just keyword tags; no summary of the article is present here. Try to use your best judgment, as it might not always be clear.

Where such generations do not look to be part of a summary of the article in terms of format (not content), discard them and everything after them. For the former example, you would annotate only “Woman dies while paragliding in Tenerife.” for hallucinations.
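The Concatenation rule above is mechanical, so it can be applied automatically when cleaning up collected annotations. The following sketch is illustrative only (it is not the annotation tooling used for the dataset) and assumes chevrons are the only mark-up characters in the string:

def merge_adjacent_spans(annotated):
    """Collapse back-to-back annotated spans with nothing between them,
    e.g. '<exceeded ><"very high">' becomes '<exceeded "very high">',
    per the Concatenation rule."""
    return annotated.replace("><", "")

assert (merge_adjacent_spans('reached or <exceeded ><"very high"> levels')
        == 'reached or <exceeded "very high"> levels')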
A.2 Conv-FEVER

Provided Information. (1) Conversational context, (2) Knowledge for next response, (3) Next response.

Task. Identify the presence and location of hallucinations in “Next response”, relative to “Knowledge for next response”. You are to annotate only the minimal ungrounded span(s) of the response relative to the knowledge, by copying the text in “Next Reply” and adding the characters < and > around those spans; i.e., if changing an additional token wouldn’t change whether or not the response was grounded, don’t include it in the span you annotate.

Example. Conversational Context: A: lets talk bowling a game where you throw a bowling ball towards a target. B: ok I love bowling I would like to know more A: for one it is one of the major forms of throwing sports and I think a lot of people enjoy it B: yes me too I use to bowl all the time with my family what about the strike? A: well a strike is when you knock down all pins on the first round B: ok how can I get a turkey I think that may be hard to get though Knowledge for Next Reply: Three consecutive strikes are known as a "turkey".

Hypothetical Instance 1. Next Reply: A: yea four consecutive strikes is known as a turkey

• Here, a contradiction is present in the underlined portions: four consecutive strikes in the response goes against what is provided in the knowledge, which says three.

• To make this reply grounded, one would have to change “four” to “three”.

• As such, in your annotation, you will highlight the minimum needed span of text to make this reply grounded, with chevrons <>. Here, the minimum needed span of text to change is “four”, so you would annotate as such:

Annotation. yea <four> consecutive strikes is known as a turkey

Hypothetical Instance 2. Next Reply: A: yea it is pretty hard, four consecutive strikes is known as a turkey

• Here, the contradiction is still present in the previously underlined portion: four is a contradiction with three.

• However, in addition to this contradiction, you see an expression of an opinion in this response: that getting a turkey is pretty hard (underlined).

• You are to ignore expressions of personal opinion when annotating minimal spans for this task, as we won’t consider conversational chitchat (if complemented with grounded information elsewhere in the reply) to be a hallucination for this task. As such, you would annotate this as:

Annotation. yea it is pretty hard, <four> consecutive strikes is known as a turkey

Hypothetical Instance 3. Next Reply: A: yea it is pretty hard.

• Here, the response is purely an expression of an opinion and does not use any information in the provided knowledge.

• As such, you would mark this entire response as a hallucination, annotating as:

Annotation. <yea it is pretty hard.>

Hypothetical Instance 4. Next Reply: A: Three strikes consecutively is a turkey.

• Here, the response is grounded. As such, you won’t mark anything in the response, annotating as such:

Annotation. Three strikes consecutively is a turkey.

Hypothetical Instance 5. Next Reply: A: four consecutive strikes is known as a turkey, also known as a strike.

• Two hallucinations are present in this response:

• The direct contradiction of four with three in the knowledge source, and

• The final part of the reply, saying that it’s also known as a strike, which is information not provided anywhere in the knowledge source. As such, you would mark this as:

Annotation. <four> consecutive strikes is known as a turkey, <also known as a strike>
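For token-level evaluation or probe training, these character-level spans can be projected onto tokens. The sketch below is illustrative only; it uses simple whitespace tokenization rather than the language model’s actual subword tokenizer, and the function name is hypothetical:

def token_hallucination_labels(clean_text, spans):
    """Label each whitespace token 1 if it overlaps any annotated span, else 0."""
    tokens, labels, offset = [], [], 0
    for token in clean_text.split():
        start = clean_text.index(token, offset)  # character offset of this token
        end = start + len(token)
        offset = end
        tokens.append(token)
        labels.append(int(any(s < end and start < e for s, e in spans)))
    return tokens, labels

# Using the annotation from Hypothetical Instance 1 above, where "four" is the
# hallucinated span at character offsets (4, 8):
tokens, labels = token_hallucination_labels(
    "yea four consecutive strikes is known as a turkey", [(4, 8)]
)
assert labels == [0, 1, 0, 0, 0, 0, 0, 0, 0]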
A.3 E2E

Provided Information. (1) Restaurant attributes, (2) Generated restaurant description.

Task. Identify the presence and location of hallucinations in the generated restaurant description. You are to annotate only the minimal ungrounded span(s) of the description relative to the provided attributes, by copying the text of the description and adding the characters < and > around those spans; i.e., if changing an additional token wouldn’t change whether the description was grounded or not, don’t include it in the span you annotate. Note that numerical ratings should be rendered as numerical ratings and text ratings should be rendered as non-numerical expressions. For example, “customer rating” can be “low”, “medium” and “high”, but also “1”, “3”, and “5”. If the customer rating is “1”, for instance, and the generated description says “. . . has a low customer rating”, that would be considered a hallucination. The “priceRange” attribute behaves the same way as “customer rating” in this respect. Finally, E2E specifies the "familyFriendly" attribute as akin to child-friendly in gold annotations: if a restaurant is family-friendly, then it is also child-friendly.

Hypothetical Instance 1. Restaurant Attributes: name[The Wrestlers], priceRange[less than £20], customer rating[low], familyFriendly[no]. Restaurant Description: The Wrestlers is an adults only restaurant with a customer rating of 1 out of 5 and a price range of less than £20.

• While the familyFriendly[no] attribute says that the restaurant is not family friendly, this does not mean that the restaurant is adults only.

• Having a low customer rating does not imply a 1 out of 5 customer rating. As such, annotate as:

Annotation. The Wrestlers is an <adults only> restaurant with a customer rating of <1 out of 5> and a price range of less than £20.

Hypothetical Instance 2. Restaurant Attributes: name[The Dumpling Tree], eatType[restaurant], food[Italian], priceRange[high]. Restaurant Description: The Dumpling Tree is an Italian restaurant with high prices.

• Everything is grounded, no need to mark anything.

Annotation. The Dumpling Tree is an Italian restaurant with high prices.

Hypothetical Instance 3. Restaurant Attributes: name[The Wrestlers], priceRange[less than £20], customer rating[low], familyFriendly[no]. Restaurant Description: Prices will never exceed £20 at The Wrestlers and they also don’t really rate well. They are not even family-friendly.

• Having a priceRange attribute of less than £20 does not mean that prices will never exceed twenty.

• As “will never exceed” is the minimal span you would need to change to make the description grounded relative to the provided attributes, annotate:

Annotation. Prices <will never exceed> £20 at The Wrestlers and they also don’t really rate well. They are not even family-friendly.

Hypothetical Instance 4. Restaurant Attributes: name[Midsummer House], food[French], customer rating[5 out of 5], near[Café Rouge]. Restaurant Description: A place for fine dining French food, Midsummer House is nearby Cafe Rouge.

• Nowhere in the attributes does it say that it’s “fine dining.” As such, annotate:

Annotation. A place for <fine dining> French food, Midsummer House is nearby Cafe Rouge.

Hypothetical Instance 5. Restaurant Attributes: name[Browns Cambridge], priceRange[cheap], priceRange[moderate], customer rating[1 out of 5]. Restaurant Description: The Brown’s chain of food restaurants is an 1 out of 5 customer rated restaurant, and have moderate and cheap pricing. They are childfree.

• The attributes do not mention that Browns Cambridge is a chain of food restaurants.

• Furthermore, they do not say that the restaurant is childfree. As such, annotate:

Annotation. The Brown’s <chain of food restaurants> is an 1 out of 5 customer rated restaurant, and have moderate and cheap pricing. <They are childfree.>

Hypothetical Instance 6. Restaurant Attributes: name[Browns Cambridge], priceRange[more than £30]. Restaurant Description: The Browns Cambridge is a restaurant with high prices.

• More than £30 does not necessarily mean “high prices.” As such, annotate:

Annotation. The Browns Cambridge is a restaurant with <high prices>.
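The E2E attribute strings follow a simple key[value] format, so they can be parsed into key–value pairs for automatic grounding checks. A minimal sketch, illustrative only and not part of the annotation protocol:

import re

def parse_e2e_attributes(attribute_string):
    """Parse an E2E meaning representation such as
    'name[The Wrestlers], priceRange[less than £20], customer rating[low]'
    into a list of (key, value) pairs. A list is used rather than a dict
    because a key (e.g. priceRange) may legitimately appear more than once."""
    return re.findall(r"(\w[\w ]*?)\[([^\]]*)\]", attribute_string)

attrs = parse_e2e_attributes(
    "name[Browns Cambridge], priceRange[cheap], priceRange[moderate], "
    "customer rating[1 out of 5]"
)
# attrs -> [('name', 'Browns Cambridge'), ('priceRange', 'cheap'),
#           ('priceRange', 'moderate'), ('customer rating', '1 out of 5')]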
Table 9: Annotator span-level annotation differences for hallucination, annotated in bold, in organic model responses for CNN / DailyMail (top), Conv-FEVER (middle), and E2E (bottom).

CNN / DailyMail:
Taxi driver dropped off a soldier on an unsafe road, and he was killed by a coach. The taxi driver was not at fault.
Taxi driver dropped off a soldier on an unsafe road, and he was killed by a coach. The taxi driver was not at fault.

Conv-FEVER:
Accounting is the language of business, and it measures the results of an organization’s economic activities.
Accounting is the language of business, and it measures the results of an organization’s economic activities.

E2E:
Alimentum is a high rated family-unfriendly restaurant with less than £20 per person.
Alimentum is a high rated family-unfriendly restaurant with less than £20 per person.

B Annotator Disagreements

Examples of annotator disagreements in the exact spans of hallucination are shown in Table 9. Qualitatively, we find that annotators generally agreed on the specific items of hallucination, i.e., the subject that was hallucinated (the taxi driver), the description of what was mentioned (the description of accounting), and the attribute that was hallucinated (the specific rating of the restaurant). However, where exactly those hallucinated spans began and ended was a source of reasonable variation among annotators.

C Prompts

Organic Model Responses. CNN / DailyMail generations take the form of a 0-shot prompt with [ARTICLE] TL;DR: [SUMMARY], whereafter the first 3 sentences generated in the summary are taken to be the summary of the article. For Conv-FEVER, we use a 2-shot in-context learning prompt in the form

## CONVERSATION HISTORY
[CONTEXT]

## KNOWLEDGE FOR NEXT RESPONSE
[KNOWLEDGE]

## RESPONSE THAT USES THE INFORMATION IN KNOWLEDGE
[RESPONSE]

whereafter everything generated in the response preceding a newline character is taken as the final generation. For E2E, we use a 10-shot in-context learning prompt in the form of

Restaurant Attributes:
[RESTAURANT ATTRIBUTES]

Restaurant Description:
[RESTAURANT DESCRIPTION]

with the same newline constraints as Conv-FEVER.
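These templates are simple to assemble programmatically. The sketch below is illustrative only: the function names are hypothetical, the exact whitespace is an assumption, and exemplar selection and decoding settings are not shown.

def cnn_dailymail_prompt(article):
    # 0-shot prompt; the model's continuation after "TL;DR:" is the summary,
    # of which only the first three generated sentences are kept.
    return f"{article} TL;DR:"

def conv_fever_block(context, knowledge, response=""):
    # One exemplar (or the final query, when `response` is left empty).
    return (
        "## CONVERSATION HISTORY\n"
        f"{context}\n\n"
        "## KNOWLEDGE FOR NEXT RESPONSE\n"
        f"{knowledge}\n\n"
        "## RESPONSE THAT USES THE INFORMATION IN KNOWLEDGE\n"
        f"{response}"
    )

def conv_fever_prompt(exemplars, context, knowledge):
    # `exemplars` holds (context, knowledge, response) triples (two for 2-shot);
    # the final block is left open for the model to complete, and generation is
    # cut at the first newline character.
    blocks = [conv_fever_block(*ex) for ex in exemplars]
    blocks.append(conv_fever_block(context, knowledge))
    return "\n\n".join(blocks)

The 10-shot E2E prompt can be built analogously from Restaurant Attributes / Restaurant Description blocks.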
C.1 ChatGPT Prompts

ChatGPT Classic. For the most basic zero-shot prompt classifier with no reasoning, we use the following prompt for CNN / DailyMail, following (Luo et al., 2023):

Decide if the following summary is consistent with the corresponding article. Note that consistency means all information in the summary is supported by the article.

Article: [Article]
Summary: [Summary]

Answer (yes or no):

For Conv-FEVER, our prompt is structured as follows:

Decide if the following sentence (SENTENCE) is consistent with the corresponding knowledge statement (KNOWLEDGE). Note that consistency means all information in the sentence (SENTENCE) is supported by the knowledge statement (KNOWLEDGE).

Knowledge Statement (KNOWLEDGE): [Knowledge]
Sentence (SENTENCE): [Response]

Answer (yes or no):

Finally, for E2E, our prompt is:

Decide if the following restaurant description (DESCRIPTION) is consistent with the provided restaurant attributes (ATTRIBUTES). Note that consistency means all information in the description is supported by the provided attributes.

Restaurant Attributes (ATTRIBUTES): [Attributes]
Restaurant Description (DESCRIPTION): [Response]

Answer (yes or no):

ChatGPT CoT. For zero-shot chain-of-thought reasoning prompts, we simply replace Answer (yes or no): in all prompts with Explain your reasoning step by step then answer (yes or no) the question:, following (Luo et al., 2023).
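For reference, here is an illustrative sketch of how the CNN / DailyMail classifier prompt and its chain-of-thought variant could be instantiated as strings; the template constant and function names are hypothetical, and the actual ChatGPT API call is omitted:

CNN_CLASSIFIER_TEMPLATE = (
    "Decide if the following summary is consistent with the corresponding "
    "article. Note that consistency means all information in the summary is "
    "supported by the article.\n\n"
    "Article: {article}\n"
    "Summary: {summary}\n\n"
    "Answer (yes or no):"
)

COT_CUE = "Explain your reasoning step by step then answer (yes or no) the question:"

def cnn_classifier_prompt(article, summary, chain_of_thought=False):
    prompt = CNN_CLASSIFIER_TEMPLATE.format(article=article, summary=summary)
    if chain_of_thought:
        # ChatGPT CoT: swap the final answer cue for the step-by-step cue.
        prompt = prompt.replace("Answer (yes or no):", COT_CUE)
    return prompt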
