Hallucinations 2312.17249
Abstract

[...] in-context generation tasks. To facilitate this detection, we create a span-annotated dataset of organic and synthetic hallucinations over several tasks. We find that probes trained on the force-decoded states of synthetic hallucinations are generally ecologically invalid in organic hallucination detection. Furthermore, hidden state information about hallucination appears to be task- and distribution-dependent. Intrinsic and extrinsic hallucination saliency varies across layers, hidden state types, and tasks; notably, extrinsic hallucinations tend to be more salient in a transformer's internal representations. Outperforming multiple contemporary baselines, we show that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.

[Figure 1 plot: a heatmap of per-token probe predictions over a generated response; y-axis: Transformer Layer (Shallow to Deep), x-axis: Token In Generated Response (First Token to Last Token), color scale from Grounded to Hallucination.]

Figure 1: Prediction of hallucination in a generated response by probing the hidden states of a transformer during decoding. The true annotated span of hallucination within the response is boxed in white; rows represent probes trained to detect hallucination from layers within a transformer's hidden states, with columns representing their prediction for each token in a generated response.

1 Introduction

Do language models know when they're hallucinating?[1] Detecting hallucinations in grounded generation is commonly framed as a textual entailment problem. Prior work has largely focused on creating secondary detection models trained on and applied to surface text (Ji et al., 2023). However, these secondary models are often of comparable complexity to the underlying language model that generated the error, and they ignore the information that was already computed during generation. In this work, we explore the degree to which probes on a decoder-only transformer language model's feed-forward and self-attention states can detect hallucinations in a variety of grounded generation tasks. In particular, as the language model generates a response, we probe its encoded tokens to determine whether the model has just begun to hallucinate or will eventually be found to have hallucinated. It is plausible that these encodings are "aware" of whether generation is retrieving information that is grounded in the prompt versus filling in plausible information that is stored in the model's weights. Related work has studied how transformers retrieve relevant memories as they encode text (Meng et al., 2022, 2023) and suggests that feed-forward layers act as associative memory stores (Geva et al., 2021). A probe for hallucination detection must consider whether subsequent generated text is entailed by information in the prompt, given the retrieved memories, or is merely one plausible extrapolation from the prompt (see also Monea et al., 2023).

* Work done as an intern at Microsoft Semantic Machines.
[1] That is, do they know that their electric sheep are ungrounded?
We develop three probes of increasing complexity across three grounded generation tasks: abstractive summarization, knowledge-grounded dialogue generation, and data-to-text. For each task, we collect hallucinations in two ways: (1) from sampled responses, where we generate outputs from a large language model (LLM) conditioned on the inputs, or (2) by editing reference inputs or outputs to create discrepancies. We refer to these as organic and synthetic data, respectively. In both cases, we have human annotators mark hallucinations in the outputs. Method (2) produces hallucination annotations at a higher rate, though we find that the utility of these synthetic examples is lower as they do not come from the test distribution.

We summarize our contributions as follows:

• We produce a high-quality dataset of more than 15k utterances with hallucination annotations for organic and synthetic output texts across three grounded generation tasks.

• We propose three probe architectures for detecting hallucinations and demonstrate improvements over multiple contemporary baselines in hallucination detection.

• We analyze how probe accuracy is affected by annotation type (synthetic/organic), hallucination type (extrinsic/intrinsic), model size, and which part of the encoding is probed.

2 Related Work

Definitions. We center our study of hallucinations in the setting of in-context generation (Lewis et al., 2020), where grounding knowledge sources are provided within the prompt itself. This is in contrast to settings where a language model is expected to generate a response entirely from its world knowledge (i.e., grounded in the model's training data) (Azaria and Mitchell, 2023).

Though the term has often appeared in reference to factuality (Gabriel et al., 2021), hallucinations are not necessarily factually incorrect. In their survey, Ji et al. (2023) classify hallucinations as intrinsic, where generated responses directly contradict the knowledge sources, or extrinsic, where generated responses are neither entailed nor contradicted by the sources. We follow this taxonomy. For example, given the following task prompt for abstractive summarization,

    . . . Deckard is now able to buy his wife Iran an authentic Nubian goat—from Animal Row, in Los Angeles—with his commission . . .
    TL;DR:

a generated summary sentence of "Deckard can now purchase an animal for his wife" is faithful, "Deckard can purchase a goat in Iran" is an intrinsic hallucination, and "Sheep are usually cuter and fluffier than goats" is an extrinsic hallucination. While models may leverage latent knowledge to inform their responses, such as recognizing that a goat is indeed an animal, directly stating such knowledge in a response—i.e., only generating "Goats are animals" in the example above—is often considered an extrinsic hallucination, as it is neither entailed nor contradicted by a given knowledge source. Thus, non-hallucinatory generation in this setting involves balancing the language model's retrieval and manipulation of knowledge, both from the given context and from its own stored weights.

Automated Diagnosis. Lexical metrics commonly used in the evaluation of NLG fail to correlate with human judgments of faithfulness (Kryscinski et al., 2019). More useful are NLI approaches that either directly measure the degree of entailment between a generated response and a given source text (Hanselowski et al., 2018; Kryscinski et al., 2020; Goyal and Durrett, 2020) or else break down texts into single-sentence claims that are easily verifiable (Min et al., 2023; Semnani et al., 2023); these have been successfully used for hallucination detection (Laban et al., 2022; Bishop et al., 2023). Similarly, approaches that leverage question-answer models to test whether claims made by a response can be supported by the source text, and vice versa (Wang et al., 2020; Durmus et al., 2020; Scialom et al., 2021), try to more explicitly identify the grounding sources of given generations. Under the intuition that a model is usually not confident when it is hallucinating, hallucinatory behavior can also be reflected in metrics that measure generation uncertainty or importance (Malinin and Gales, 2021; Guerreiro et al., 2023; Ren et al., 2023; Ngu et al., 2023; Madsen et al., 2023), relative token contributions (Xu et al., 2023), and differences between the attention matrices and decoder stability (Lee et al., 2018) during generation.

    Task: KGDG
    History: ...B: ok how can I get a turkey I think that may be hard to get though
    Knowledge: Three consecutive strikes are known as a "turkey."
    Instruction: Response that uses the information in knowledge:
    Response: A: Indeed, you have to get four consecutive strikes to get one.

Table 1: From top to bottom, examples of organic hallucinations in model responses for abstractive summarization, knowledge-grounded dialogue generation (KGDG), and data-to-text tasks. Minimum annotated hallucination spans are highlighted in cyan for intrinsic hallucinations and green for extrinsic hallucinations.
Predicting Transformer Behavior. In small transformer language models, it is possible to ascribe behaviors like in-context learning to specific layers and modules (Elhage et al., 2021). This becomes harder in larger language models, but Meng et al. (2022) provide evidence that the feed-forward modules function as key-value stores of factual associations and that it is possible to modify weights in these modules to edit facts associated with entities. Our probes, to be successful, must determine whether the recalled facts were deployed in accordance with the rules of grounding (i.e., to paraphrase facts expressed in the prompt, rather than to replace them or to add details that are merely probable). Transformer internal states have also been shown to be predictive of the likelihood that a model will answer a question correctly in the setting of closed-book QA (Mielke et al., 2022) and in the prediction of the truthfulness of a statement (Azaria and Mitchell, 2023), though these internal representations of truthfulness can disagree systematically with measures of model confidence (Liu et al., 2023). These works measure hidden states in the absence of supporting evidence (i.e., no passages are retrieved for either task), making them effectively probes of knowledge stored in model weights, not of the manipulation of novel or contingent information provided in a prompt. Monea et al. (2023) find that there are separate computational processes in transformer LLMs devoted to recall of associations from memory vs. grounding data, and we show that we can detect whether associations from grounding data are used correctly across several distinct generation tasks. Slobodkin et al. (2023) train a probe to detect the (un)answerability of a question given a passage, which is similar in some respects to our task in Section 4; however, we detect token-level hallucination during answer generation, which could happen even in theoretically answerable questions.

3 Grounded Generation Tasks

We test hallucination probes for autoregressive grounded generation in three distinct tasks: abstractive summarization, knowledge-grounded dialogue generation, and data-to-text. We use the same high-level prompt structure across all tasks to organically generate model responses and to probe for hallucination. Each prompt contains an optional interaction history, a knowledge source, and an instruction resulting in the conditionally generated response. Task prompts and examples of organic hallucinations are shown in Table 1.

3.1 Organic Hallucinations

We use llama-2-13b (Touvron et al., 2023) as the response model to generate outputs for each task. To ensure that a non-negligible number of both hallucinations and grounded responses were present in the generations for each task, each model response is generated using top-k random sampling (Fan et al., 2018) with k = 2. The number of few-shot in-context learning examples for each task was chosen through a pilot exploration of several generation strategies, optimizing for the presence of [...]
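The top-k (k = 2) decoding used to elicit organic responses can be sketched as follows. This is an illustrative stand-alone implementation, not the authors' generation code; the `logits` dictionary stands in for a real model's next-token distribution:

```python
import math
import random

def top_k_sample(logits, k=2, rng=random):
    """Sample a token id from the k highest-scoring candidates.

    `logits` maps tokens to unnormalized log-probabilities. Only the
    top-k entries are kept; their probabilities are renormalized before
    sampling, following top-k sampling (Fan et al., 2018).
    """
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    weights = [math.exp(v) for _, v in top]
    total = sum(weights)
    r, acc = rng.random(), 0.0
    for (tok, v), w in zip(top, weights):
        acc += w / total
        if r < acc:
            return tok
    return top[-1][0]  # guard against floating-point rounding
```

With k = 2 the sampler always emits one of the two most probable tokens, which injects enough randomness to surface hallucinations without degenerating into noise.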
    Response: Yes, Elvis was also one of the most famous musicians of the 20th century.
    Original Knowledge: Elvis Presley is regarded as one of the most significant cultural icons of the 20th century.
    Changed Knowledge: Elvis Presley is regarded as one of the least well-known musicians of the 20th century.

    Response: Close to The Sorrento you can find Clowns pub.
    Original Knowledge: name[Clowns], eatType[pub], near[The Sorrento]
    Changed Knowledge: name[Clowns], near[The Sorrento]

Table 2: Examples of changed knowledge statements for our datasets of synthetic hallucinations. Changes made to Conv-FEVER (top) and E2E (bottom) are shown in bold. The changed statements make the top response an intrinsic predicate error and the bottom response an extrinsic entity error (highlighted as in Table 1).
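The editing idea behind Table 2 — perturbing the grounding knowledge so that an originally faithful response becomes ungrounded — can be made concrete with a minimal sketch. The `perturb_knowledge` helper and its replacement map are hypothetical illustrations, not the authors' data-construction pipeline:

```python
def perturb_knowledge(knowledge, replacements):
    """Create a synthetic knowledge mismatch by swapping phrases in the
    grounding text. The original (still-unchanged) response then becomes
    an ungrounded response relative to the edited knowledge.
    """
    for old, new in replacements.items():
        knowledge = knowledge.replace(old, new)
    return knowledge

# The Conv-FEVER example from Table 2: an intrinsic predicate error.
original = ("Elvis Presley is regarded as one of the most significant "
            "cultural icons of the 20th century.")
changed = perturb_knowledge(
    original,
    {"most significant cultural icons": "least well-known musicians"},
)
```

Deleting a fact from the knowledge (as in the E2E row, where `eatType[pub]` is removed) instead turns the response into an extrinsic error, since the claim is no longer supported or contradicted.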
[Figure 2 schematic: single-layer classifiers attached to the self-attention and MLP outputs of each transformer layer (Layer 0, ..., final layer), feeding a hallucination prediction over the response tokens "Indeed, you have to get four consecutive ...".]

Figure 2: General probe structure. Response tokens that were annotated as hallucinatory are shown in red. For each internal representation layer, and each type of representation (attention output and feed-forward output), a classifier is separately trained on the attention head out-projection and feed-forward hidden states for all tokens (or just the final token) of a response from a prompted language model. Depending on the probe, these classifiers are then ensembled to make a final prediction of whether each individual token was a hallucination or, alternatively, whether the response contained one.
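The probe structure in Figure 2 can be illustrated with a toy, dependency-free sketch: a logistic classifier trained on frozen "hidden state" vectors (stand-ins for a layer's attention or feed-forward outputs), plus the sigmoid-of-weighted-sum ensembling over per-(layer, state-type) probes described in Section 4. This is illustrative only, not the authors' implementation:

```python
import math

def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Logistic-regression probe over fixed hidden-state vectors.

    `states`: list of d-dimensional vectors (one per token);
    `labels`: the 0/1 hallucination annotations. Returns (weights, bias).
    """
    w, b = [0.0] * len(states[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_predict(w, b, x):
    """Probability that the token represented by `x` is hallucinatory."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def ensemble_predict(probe_probs, betas):
    """Combine frozen per-(layer, state-type) probe outputs:
    p = sigma(sum_k beta_k * p_k), with learned ensemble weights beta."""
    z = sum(betas[k] * probe_probs[k] for k in probe_probs)
    return 1.0 / (1.0 + math.exp(-z))
```

A real probe would read the residual-stream activations captured during (force-)decoding; here the vectors are hand-made, and the ensemble weights would themselves be fit by gradient descent with the individual probes frozen.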
Ensemble Probe. Our final approach is a single feed-forward layer that combines the predictions from our 2L separately trained probes (either the linear probes or the attention-pooling probes):

    p(y_i \mid u, x_{\le i}) = \sigma\Big( \sum_{l \in \{1,\dots,L\}} \sum_{m \in \{\text{attention},\, \text{feed-forward}\}} \beta_{l,m} \cdot p_{l,m}(y_i \mid u, x_{\le i}) \Big)

where \beta_{l,m} are learned weights. The individual probes are frozen before the \beta_{l,m} parameters are fit.

4.2 Training Objectives

Token-level. To train a probe to predict token-level hallucination indicators y_i, we minimize negative log-likelihood on the annotations in our dataset D:

    \mathcal{L}(\theta) \propto - \sum_{(u,x,y) \in D} \sum_{i=1}^{|x|} \log p(y_i \mid u, x_{\le i}; \theta)

Response-level. We also try training a classifier to predict whether x contains any hallucinations at all; denote this indicator as y = \bigvee_{i=1}^{|x|} y_i \in \{0, 1\}. For this, we simply use an attention-pooling probe over the entire response x. To fit the probe parameters \theta, we again minimize negative log-likelihood:

    \mathcal{L}(\theta) \propto - \sum_{(u,x,y) \in D} \log p(y \mid u, x_{\le |x|}; \theta)

4.3 Metrics

Response-level Classification. F1 is the harmonic mean of recall and precision. We denote the F1 score of a response-level classifier by F1-R.

Token-level Classification. For token-level classifiers, it is possible to evaluate the F1 score of their per-token predictions; however, this would give more weight to annotated or predicted spans that are longer. Here, we are more interested in span-level F1: specifically, whether we predicted all of the annotated hallucinations (recall), and whether all of the predicted spans were annotated as actual hallucinations (precision). As requiring an exact match of the spans is too severe—Appendix B shows how our human annotators often disagreed on the exact boundaries of a hallucinatory span—we modify the span-level precision and recall metrics to give partial credit, as follows.
Define a span to be a set of tokens, where tokens from different responses are considered distinct. Let A be the set of all annotated spans in the reference corpus, and let P be the set of all predicted spans on the same set of examples. We compute the recall r as the average coverage of each reference span, i.e., the fraction of its tokens that are in some predicted span. Conversely, we compute the precision p as the average coverage of each predicted span:

    r = \frac{1}{|A|} \sum_{s \in A} \frac{\big| s \cap \bigcup_{\hat{s} \in P} \hat{s} \big|}{|s|}

    p = \frac{1}{|P|} \sum_{\hat{s} \in P} \frac{\big| \hat{s} \cap \bigcup_{s \in A} s \big|}{|\hat{s}|}

where \cup, \cap, and |\cdot| denote set union, intersection, and cardinality, respectively. We define F1-Sp (standing for "span, partial credit") as the harmonic mean of p and r. Note that F1-Sp micro-averages over examples, ignoring the boundaries between them. Thus, in contrast to F1-R, examples with more spans have more influence on the final F1-Sp metric.

    Response-Level Classification, F1-R

    Method            CDM    CF     E2E
    OC                0.38   0.78   0.72
    Seq-Logprob       0.44   0.81   0.75
    FactCC            0.43   -      -
    DAE               0.37   -      -
    FEQA              0.35   -      -
    QuestEval         0.41   -      -
    SummaC-ZS         0.46   -      -
    SummaC-Conv       0.42   -      -
    ChatGPT           0.34   0.62   0.37
    ChatGPT CoT       0.36   0.68   0.45
    PoolingSL         0.43   0.86   0.86
    PoolingE          0.45   0.88   0.87
    PoolingSL-SYNTH   0.46*  0.64   0.66
    PoolingE-SYNTH    0.48*  0.65   0.69
    Human             0.71   0.87   0.87

Table 4: Response-level probe and baseline organic hallucination detection performance across tasks, as measured via F1 scores (F1-R) achieved on the test set. SL denotes the best single-layer classifier (chosen on the validation set), E denotes the ensembled classifier, and SYNTH indicates that the classifier was trained on synthetic data and tested on organic hallucinations. The Human row measures the performance of a human annotator who is different from the reference annotator; this was measured on a subset of 50 examples that were triply annotated, averaging over all 6 (annotator, reference annotator) pairs.

5 Experiments

Unless otherwise indicated, all evaluations are performed on organic data only, even if the probes were trained on synthetic data.
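The span-level partial-credit metric F1-Sp defined in §4.3 can be implemented directly from the definitions of r and p. A sketch, where each span is a set of (response_id, token_index) pairs so that tokens from different responses stay distinct:

```python
def f1_span_partial(annotated, predicted):
    """Span-level partial-credit F1 (F1-Sp).

    `annotated` and `predicted` are lists of spans (sets of token ids).
    Recall averages, over annotated spans, the fraction of each span's
    tokens covered by the union of predicted spans; precision is the
    mirror image. F1-Sp is their harmonic mean.
    """
    pred_union = set().union(*predicted) if predicted else set()
    ann_union = set().union(*annotated) if annotated else set()
    r = (sum(len(s & pred_union) / len(s) for s in annotated) / len(annotated)
         if annotated else 0.0)
    p = (sum(len(s & ann_union) / len(s) for s in predicted) / len(predicted)
         if predicted else 0.0)
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, a predicted span that covers half of the only annotated span (and nothing else) earns r = p = 0.5 and thus F1-Sp = 0.5, rather than the 0 an exact-match criterion would assign.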
5.1 Probe Hyperparameters and Training

Single-layer hidden state probes, described in Section 4, are trained under float32 precision with unregularized log loss. A batch size of 100 and Adam optimization (Kingma and Ba, 2014) with a learning rate of 0.001 and (β1, β2) = (0.9, 0.999) were chosen after a small grid search over hyperparameters. With 70/10/20% train, validation, and test splits, each classifier was trained using early stopping with 10-epoch patience. With the predictions of these single-layer probes as input, the ensemble classifier was trained with the same hyperparameters.

5.2 Response-Level Baselines

We compare the performance of our probes against a set of contemporary methods used in faithfulness evaluation, measuring test set hallucination classification F1. Specifically: (1) the model-uncertainty-based metric of length-normalized sequence log-probability, Seq-Logprob (Guerreiro et al., 2023); (2) the off-the-shelf synthetically trained sentence-level claim classifier FactCC (Kryscinski et al., 2020); (3) the linguistic-feature-based dependency arc entailment metric DAE (Goyal and Durrett, 2020); the QA-based metrics (4) FEQA (Durmus et al., 2020) and (5) QuestEval (Scialom et al., 2021); (6) the NLI-based metric SummaC (Laban et al., 2022); and ChatGPT classifiers (Luo et al., 2023) (7) without and (8) with chain-of-thought prompting, with prompts shown in Appendix C.1.

Metrics (2)–(6) were not originally intended to detect hallucinations in individual responses, but rather to evaluate a collection of responses for the sake of system comparisons. To adapt them to our setting, however, we threshold each of these continuous metrics to classify individual responses, setting the threshold to maximize validation set F1, following Laban et al. (2022).
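The thresholding step for the continuous baselines can be sketched as follows — an illustrative implementation of the Laban et al. (2022) recipe that searches only over the observed validation scores:

```python
def tune_threshold(scores, labels):
    """Pick the decision threshold on a continuous metric that maximizes
    validation-set F1.

    `scores` are per-response metric values (oriented so that higher
    means more likely hallucinatory); `labels` are 0/1 hallucination
    indicators. Only observed scores are considered as candidate
    thresholds, since F1 only changes at those points.
    """
    def f1_at(t):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(scores, key=f1_at)
```

The tuned threshold is then applied unchanged on the test set to turn each baseline's raw score into a hallucination/no-hallucination decision.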
For sentence-level optimized baselines such as FactCC, response-level evaluation is performed by logically OR-ing the sentence-level baseline predictions (obtained after thresholding the raw scores).

As a further reference baseline, we additionally report the performance achieved by randomly predicting a hallucination label with some probability p tuned to maximize F1 on the validation set, denoted here as Optimized Coin (OC).

    Span-level Classification, F1-Sp

    Method     CDM    CF     E2E
    OC         0.09   0.58   0.15
    LinearSL   0.24   0.65   0.46
    LinearE    0.27   0.66   0.47
    PoolingSL  0.32   0.68   0.46
    PoolingE   0.34   0.68   0.47
    Human      0.61   0.69   0.58

Table 5: Span-level probe hallucination detection performance, as measured via span-level partial match scores (F1-Sp). The Human row is computed as in Table 4 and is relatively stable across tasks, unlike the systems.

    Response-level Classification, F1-R

    Base Model Size  CDM    CF     E2E
    7B               0.45   0.88   0.87
    13B              0.45   0.88   0.87
    70B              0.45   0.89   0.87

    Span-level Classification, F1-Sp

    Base Model Size  CDM    CF     E2E
    7B               0.65   0.77   0.46
    13B              0.64   0.77   0.46
    70B              0.64   0.76   0.46

Table 6: Response- and span-level F1 hallucination detection performance, as measured by F1-R and F1-Sp respectively, of the PoolingE ensemble probe across llama-2 response model sizes and test sets across tasks. For each task and model size, a probe is trained on labeled responses of the same task that were generated organically from the 13B model.

5.3 Results

Response-level Results. On two datasets, we were actually able to match the interannotator agreement rate (the "Human" row). On these datasets, the probes trained on organic hallucinations worked best (Table 4). The exception was the CDM dataset, where the organic probes lost slightly to the synthetic probes,* which in turn fell well short of human performance. On all datasets, we were able to outperform all baseline systems, showing that probing is a feasible (and efficient) alternative to evaluating LLM outputs when model states are available.

Relative to other baselines that also have some access to model internals—such as Seq-Logprob—we hypothesize that probing-based approaches perform better here because they capture information about model uncertainty that is lost when it is reduced to the output token probability.

* For CDM, the synthetic training dataset (from BUMP) is smaller than the organic training dataset, so the SYNTH results are at a disadvantage. For CF and E2E, however, we generated a synthetic training dataset that was matched in size.

Span-Level Results. These results appear in Table 5. Again, the "Human" row serves as a ceiling on how well we can expect our automated system to perform on these datasets, given that even humans do not fully agree. We observe that our ensemble attention-pooling probe nearly meets this ceiling for Conv-FEVER but lags behind for CDM and E2E. Still, even this level of performance is potentially useful for avoiding hallucinations, as we discuss in §6.

Layers. Evaluating the performance of trained single-layer probes (Figure 3), we see that signals of hallucination are strongest in the model's middle layers during decoding. Probe performance increases consistently and gradually from the zeroth layer up until layers 10 to 20, after which hallucination classification performance declines consistently in the subsequent layers.

Hidden State Type. Slight variations in grounding saliency—the strength of the signal of hallucination that a probe can detect, measured by probe hallucination classification performance—are also present among hidden states of different types during decoding. As shown in Figure 3, feed-forward probes reach peak performance at slightly deeper layers than their attention head out-projection probe counterparts and exceed them in single-layer hallucination detection. Training ensemble classifiers on only the hidden states of the same type at every layer, we see that feed-forward hidden states possess a slight but statistically significant greater saliency of model grounding behavior.
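As a back-of-the-envelope check on the Optimized Coin baseline of §5.2 (our own illustrative approximation, not from the paper): if a coin predicts "hallucination" with probability p independently of the input, its expected precision is simply the positive base rate and its expected recall is p, so expected F1 grows monotonically in p and a tuned coin drifts toward always predicting positive:

```python
def expected_coin_f1(base_rate, p):
    """Approximate expected F1 of a label-independent coin.

    Predicting positive with probability p, in expectation precision
    equals the positive base rate and recall equals p, so
    F1 = 2 * base_rate * p / (base_rate + p). This is a first-order
    approximation that ignores variance in the counts.
    """
    if base_rate + p == 0:
        return 0.0
    return 2 * base_rate * p / (base_rate + p)

# Expected F1 is increasing in p, so the optimizer lands on p = 1:
best_p = max((i / 100 for i in range(101)),
             key=lambda p: expected_coin_f1(0.3, p))
```

This is why OC's F1 tracks each dataset's hallucination base rate (e.g., it is far higher on CF and E2E than on CDM in Table 4), making it a useful sanity floor for the learned probes.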
[Figure 3 plot: three panels (CNN DailyMail, ConvFEVER, E2E) of test-set F1 versus Transformer Layer (0–40), with one curve per Hidden State Type: MLP and self_attn.o_proj.]

Figure 3: Grounding behavior saliency—as measured by probe response-level F1 on the test set—of single-layer probe classifiers trained on the attention-head out-projections and feed-forward states of llama-2-13b, stratified across tasks and layers.
Model Size. Training probes in the same manner on the force-decoded hidden states of responses on [...]

[...] behavior in model internal states. Here, with test-set responses labeled by annotators according to whether they contained intrinsic or extrinsic hallucinations, we evaluate trained probe performance in correctly detecting the presence of hallucinations for instances of each hallucination type. We observe in Figure 4 that extrinsic hallucination behavior is more salient in the hidden states of shallower layers for both organic and synthetic hallucinations, with saliency noticeably peaking around layers 15 and 16 and decreasing in deeper layers. Overall, extrinsic hallucination remained more salient in model hidden states than intrinsic hallucination.

[Figure 4 plot: recall of single-layer MLP probes versus Transformer Layer (0–40), with curves for Extrinsic Synthetic, Intrinsic Synthetic, Extrinsic Organic, and Intrinsic Organic hallucinations.]

Figure 4: Intrinsic and extrinsic hallucination saliency across layers for abstractive summarization, as measured by response-level probe recall on hallucination types.
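Stratifying probe recall by hallucination type, as in Figure 4, amounts to computing response-level recall separately over the responses annotated with each type. A minimal sketch with hypothetical inputs:

```python
def recall_by_type(pred, gold_presence, gold_type):
    """Response-level recall, stratified by annotated hallucination type.

    `pred`: 0/1 probe decisions per response; `gold_presence`: 0/1
    annotations of whether a hallucination is present; `gold_type`:
    the annotated type per response (e.g. 'intrinsic' or 'extrinsic').
    Returns, for each type, the fraction of positive responses of that
    type that the probe flags.
    """
    out = {}
    for t in set(gold_type):
        idx = [i for i, ty in enumerate(gold_type)
               if ty == t and gold_presence[i]]
        out[t] = sum(pred[i] for i in idx) / len(idx) if idx else 0.0
    return out
```

Responses containing both types would in practice be counted toward both strata; this sketch assumes one annotated type per response for simplicity.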
Synthetic Hallucinations. As Table 7 shows, probes achieve very high F1 in the detection of synthetically created hallucinations across all tasks. However, probe performance suffers a significant drop when trained and tested on different modalities, with probes generally performing better when trained on organic hallucinations and tested on synthetic ungrounded responses than the reverse; the notable exception is abstractive summarization on the CDM dataset. Prior work has shown that synthetic data generation approaches specifically designed for factuality evaluation (Ma et al., 2023) do not align with actual errors made by generation models in their lexical error distributions (Goyal and Durrett, 2021). Here, our results show this to extend to their hidden state signals as well, which are used by the probe to detect unfaithfulness in the case of forced decoding on human-created content, relative to organic instances of hallucination.

    Task  Train      Test       F1-R
    CDM   Organic    Organic    0.45
          Synthetic  Organic    0.48
          Organic    Synthetic  0.38
          Synthetic  Synthetic  0.93
    CF    Organic    Organic    0.88
          Synthetic  Organic    0.65
          Organic    Synthetic  0.83
          Synthetic  Synthetic  0.94
    E2E   Organic    Organic    0.87
          Synthetic  Organic    0.69
          Organic    Synthetic  0.90
          Synthetic  Synthetic  0.98

Table 7: Response-level PoolingE probe performance, as measured by F1-R, in detecting hallucinations on organic or synthetic ungrounded responses. Performance is generally better when the training data resembles the test data.

    Test  Train    F1-R
    CDM   CF       0.44
          E2E      0.44
          CF+E2E   0.46
          CDM      0.45
    CF    CDM      0.60
          E2E      0.80
          CDM+E2E  0.75
          CF       0.88
    E2E   CDM      0.60
          CF       0.75
          CDM+CF   0.76
          E2E      0.87

Table 8: Response-level PoolingE cross-task probe performance, as measured by F1-R. All experiments use the same amount of training data (a random subset of the full training set). The "+" sign indicates an equal mixture of two training tasks.
Task. To test whether the saliency of ground- ble 8 suggest poor generalization to out-of-domain
ing behavior persists consistently in similar hidden tasks, although perhaps this would be mitigated by
state locations across different tasks, we train our larger and more diverse training sets. Furthermore,
probe on each specific task and evaluate its perfor- it perhaps goes without saying that probing requires
mance when applied to the others. As shown in access to the hidden states themselves. If LLMs
Table 8 and compared to same-task training, we continue to move behind closed-source APIs, such
find a lack of generalization in hallucination detec- hidden state probes will not be possible for those
tion performance for our probes across different without privileged access.
train and test tasks. This may mean that there is
not a consistent signal of hallucination across tasks, Ecological Validity. The conclusions from an
or that our probes are not distinguishing this signal ecologically valid experiment should hold outside
from its task-or-dataset-specific confounds, and are the context of that study (De Vries et al., 2020).
overfitting to the confounds. Synthetic hallucinations are generally ecologically
invalid for organic hallucination detection. Extend-
6 Discussion ing beyond surface-level error distributions (Goyal
and Durrett, 2021), these differences in internal rep-
In this work, we examined how LLM internal rep- resentations further suggest that language model
resentations reveal their hallucinatory behavior, via behavior during hallucination fundamentally mis-
a case study of several in-context grounded genera- aligns with current expectations of ungrounded gen-
tion tasks. Our analyses point to several nuances in erations. Training on the same modality—synthetic
the saliency of this behavior across model layers, or organic—is imperative to maximize hallucina-
hidden state types, model sizes, hallucination types, tion detection performance, and our results on the
and contexts. Below, we highlight a few takeaways synthetic modality, especially on CNN / DailyMail,
as well as limitations that naturally suggest direc- suggest probing to be a feasible method for com-
tions for future work. plex multi-sentence entailment classification that
can leverage the capability of LLMs.
Efficiency and Access. Our work is motivated
by the need for efficient LLM hallucination evalua- Annotator Disagreements. Despite asking an-
tion. Though computationally efficient, one of the notators to label what they deem to be the minimal
key limitations of probing is in the need for labeled spans of hallucination in a response, considerable
in-domain data for probe training. Our results in Ta- variation still exists between annotators on judg-
ments of their exact span boundaries. It remains References
an open questions how to improve annotator agree- Guillaume Alain and Yoshua Bengio. 2017. Under-
ment in this context. standing intermediate layers using linear classifier
probes. In The Fifth International Conference on
Probe Design. Even if span boundaries were Learning Representations.
standardized, they may not align perfectly with
Amos Azaria and Tom Mitchell. 2023. The internal
hallucination signals in LLM internal representa- state of an LLM knows when it’s lying. arXiv
tions. For instance, while a generation may first preprint arXiv:2304.13734.
contradict a given knowledge source at a certain
token, the ultimate signal that it will hallucinate Jennifer A. Bishop, Qianqian Xie, and Sophia Anani-
adou. 2023. LongDocFACTScore: Evaluating the
may already be salient in the tokens preceding it. factuality of long document abstractive summarisa-
Alternatively, the signal might be computed only tion. arXiv preprint arXiv:2309.12455.
later, as the decoder encodes previously generated
Harm De Vries, Dzmitry Bahdanau, and Christopher
tokens and recognizes that it is currently pursuing an ungrounded train of thought. Thus, there may be probe architectures that could improve hallucination detection performance and mitigation. Nonlinear probes such as feed-forward networks are also worth considering: the hidden layer(s) of a feed-forward network could learn to detect different types of hallucinations.

Mitigation. Finally, once we can detect ungrounded generation behavior, we can try to learn to avoid it. This has great practical importance as retrieval-oriented generation becomes popular across a variety of consumer-facing applications. We have left this subject for future work. A simple method is to reject a response or prefix where hallucinations have been detected with high probability, and start sampling the response again (Guerreiro et al., 2023). Another approach would be to use the predicted hallucination probabilities to reweight or filter hypotheses during beam search. Finally, a detector can also be used as a proxy reward method for fine-tuning the language model, using a reward-based method such as PPO (Schulman et al., 2017). In addition to learning from response-level negative rewards incurred at the end of the utterance, PPO can benefit from early token-level negative rewards, which serve as a more informative signal for reinforcement learning (i.e., reward shaping) and thus can reduce sample complexity.

Manning. 2020. Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Ondřej Dušek, David M. Howcroft, and Verena Rieser. 2019. Semantic noise matters for neural natural language generation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 421–426, Tokyo, Japan. Association for Computational Linguistics.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao. 2021. GO FIGURE: A meta evaluation of factuality in summarization. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 478–487, Online. Association for Computational Linguistics.
Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177.

Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. Reducing conversational agents' overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long-form text generation. In EMNLP.

Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kıcıman, Hamid Palangi, Barun Patra, and Robert West. 2023. A glitch in the Matrix? Locating and detecting language model grounding with Fakepedia. arXiv preprint arXiv:2312.02073.

Noel Ngu, Nathaniel Lee, and Paulo Shakarian. 2023. Diversity measures: Domain-independent proxies for failure in language model queries. arXiv preprint arXiv:2308.11189.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J. Liu. 2023. Out-of-distribution detection and selective generation for conditional language models. In The Eleventh International Conference on Learning Representations.

Sashank Santhanam, Behnam Hedayatnia, Spandana Gella, Aishwarya Padmakumar, Seokhwan Kim, Yang Liu, and Dilek Hakkani-Tur. 2021. Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation. arXiv preprint arXiv:2110.05456.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Sina J. Semnani, Violet Z. Yao, Heidi C. Zhang, and Monica S. Lam. 2023. WikiChat: A few-shot LLM-based chatbot grounded with Wikipedia. arXiv preprint arXiv:2305.14292.

Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. 2023. The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3607–3625, Singapore. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.

Weijia Xu, Sweta Agrawal, Eleftheria Briakou, Marianna J. Martindale, and Marine Carpuat. 2023. Understanding and detecting hallucinations in neural machine translation via model introspection. Transactions of the Association for Computational Linguistics, 11:546–564.
Appendix

Here, we provide additional information about annotation guidelines, annotator disagreement examples, prompts used to generate organic responses to our tasks, and ChatGPT baseline comparisons.

A Annotation Guidelines

The purpose of this annotation task is two-fold: (1) identify if a generated response is a hallucination relative to a given knowledge source, and (2) if so, identify and mark the location(s) of the minimal spans of text that are ungrounded against the knowledge source. A response is grounded in a knowledge source if that source entails the content of the response. In other words, the response should not (1) contradict the knowledge source or (2) draw conclusions that are not explicitly given in it.

A.1 CNN / DailyMail

Provided Information. (1) Original article, (2) summary of article.

Task. Identify the presence and location of hallucinations in the summary of the article, relative to the original article text. You are to annotate—by copying the text of the summary and adding the characters < and > in specific locations—only the minimal span(s) of the summary text that are ungrounded relative to the original article; i.e., if changing an additional token wouldn't change whether or not the response was grounded, don't include it in the span you annotate. Note: You will be given summaries that contain a varying number of sentences. You are only to annotate the first three sentences in the summary, discarding the rest.

Example. (CNN) – The world's fastest man on a pair of skis just got faster. On Monday Italian Simone Origone broke his own speed skiing world record as he reached 252.4 kilometers per hour on the Chabrieres slopes in the French Alps – an achievement confirmed by organizers France Ski de Vitesse. With a 1,220 m slope that has a maximum gradient of 98% and an average of 52.5%, Chabrieres is not for the faint-hearted. Traveling at over 250 km/h is a privilege usually reserved for Formula One drivers, yet speed skier Origone was equipped with just an aerodynamic helmet to increase streamlining and a ski suit made from airtight latex to reduce wind resistance. Origone's one nod to safety was wearing a back protector in case of a crash as he threw himself down the one kilometer track. Origone has been the fastest speed skier on the globe since April 2006, having set a then-new world record of 251.4 km/h at Les Arcs. Bastien Montes of France was Origone's closest challenger on Monday, but even the Frenchman's new personal best of 248.105 km/h was some way short of the Italian's 2006 world record, let alone his latest one. "Simone Origone is the greatest champion of all time, he is the only person to hold the record for two ski speeds in France – Les Arcs and Vars Chabrieres. This is a historic day," technical director of Speed Masters Philippe Billy told Vars.com. The 34-year-old Origone – a ski instructor, mountain guide and rescuer by day – only took up the discipline in 2003, having given up downhill skiing in 1999. "Now that I have twice won the world record, I can say that I have made history," Origone told Vars.com. "It is important for me and for speed skiing."

Hypothetical Instance 1. The world's fastest man on skis just got faster.

• Everything is grounded, no need to mark anything here.

Annotation. The world's fastest man on skis just got faster.

Hypothetical Instance 2. The world's fastest man on skis just got faster. He has now won the world record twice. He has excelled at Chabrieres. He has been the fastest speed skier on the globe since April 2006. His speed record reached more than 260 kilometers per hour.

• This summary has more than 3 sentences, so the first thing to do is to only look at the first three and discard any sentences beyond that in your annotation.

• Even though the last sentence is ungrounded relative to the source text, it is not within the first three sentences, so we can ignore it here.

• Annotate as such, copying the first three sentences into the annotation box. No need to highlight anything, as everything in the first three sentences is grounded:

Annotation. The world's fastest man on skis just got faster. He has now won the world record twice. He has excelled at Chabrieres.
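The chevron format lends itself to mechanical post-processing. As an illustration (a sketch of our own, not part of any released annotation tooling; `extract_spans` is a hypothetical helper name), annotated text can be converted into the clean response plus character-level span offsets:

```python
def extract_spans(annotated: str):
    """Parse chevron-marked hallucination spans, e.g.
    'now in his <twenties>', into the clean (chevron-free)
    text plus (start, end) character offsets into that text."""
    clean, spans, start = [], [], None
    pos = 0  # current position in the clean text
    for ch in annotated:
        if ch == "<":
            start = pos  # span opens at the current clean position
        elif ch == ">":
            spans.append((start, pos))  # span closes here
            start = None
        else:
            clean.append(ch)
            pos += 1
    return "".join(clean), spans

text, spans = extract_spans("Simone Origone, now in his <twenties>, just got faster.")
assert [text[a:b] for a, b in spans] == ["twenties"]
```

Offsets computed this way index into the clean response, which is convenient for aligning annotated spans with model tokenizations downstream.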
Hypothetical Instance 3. Simone Origone, now in his twenties, just got faster. He has now won the world record more than twice.

• "In his twenties" contradicts "The 34-year-old Origone" in the original article.

• "More than twice" contradicts the fact that he has now won it exactly twice. As such, annotate:

Annotation. Simone Origone, now in his <twenties>, just got faster. He has now won the world record <more than> twice.

Special Points of Note. Below, we list a few points of special note to keep in mind during your annotations.

Concatenation. When finding yourself annotating something like "The city's pollution levels have reached or <exceeded ><"very high"> levels for the last...," do concatenate the two labeled spans; if there are no characters between them (for example, <exceeded ><"very high">), concatenating the annotation bounds into (<exceeded "very high">) is ideal.

"Was"/"Is". If the article says something like "Woods insisted he is ready to win the famous Claret..." and the summary says something like "said he was ready to win the famous Claret...", "was" here would not be considered a hallucination: the summary is recounting what was said in the article, so the past tense does not contradict what was originally said in the article.

When The Article Has "Contradicting" Facts. Articles may sometimes "correct" themselves; i.e., the article may originally state a claim but then correct it later on. For example, "The NWS – last May – initially forecast 13 to 20 "named storms," including seven to 11 hurricanes. Then in August, it dialed down that initial forecast to predict 13 to 19 named storms, including six to nine hurricanes."

In this case, if the summary only states "The National Weather Service forecast called for 13 to 20 named storms" (and doesn't mention it was corrected later to 13 to 19), then this is a hallucination, as it is corrected later in the article.

If the summary states "The National Weather Service forecast initially called for 13 to 20 named storms" (and doesn't mention it was corrected later to 13 to 19), then this is not a hallucination, as it qualifies the original prediction with "initially." If the summary states "The National Weather Service forecast called for 13 to 19 named storms," then this is not a hallucination, as it is the final corrected fact.

Identifying the Summary. Sometimes the summarization system may produce things that are not a summary of the article. For example, "Woman dies while paragliding in Tenerife. Paragliding in Tenerife. Paragliding in Tenerife. Paragliding in Tenerife. Paragliding Tenerife Canary Islands." Here, only the first sentence is the summary, and you can discard the rest. This is because the summarization system tries to mimic how actual website summaries work, so it might also add things like topic tags and keywords.

Another example is "InvisiBra, £48, Lavalia, lavalia.co.uk; Strapless Stick On Adhesive Bra, £12, Marks and Spencer; Natural Stick On Bra, £20, Debenhams." These are all just keyword tags; no summary of the article is present here. Try to use your best judgment, as it might not always be clear.

In the case where such things are generated that do not look to be part of a summary of the article in terms of format (not content), discard them and everything after them. For the former example, you would annotate only "Woman dies while paragliding in Tenerife." for hallucinations.

A.2 Conv-FEVER

Provided Information. (1) Conversational context, (2) Knowledge for next response, (3) Next response.

Task. Identify the presence and location of hallucinations in "Next response", relative to "Knowledge for next response". You are to annotate—by copying the text of "Next Reply" and adding the characters < and > in specific locations—only the minimal span(s) of an ungrounded response relative to the knowledge; i.e., if changing an additional token wouldn't change whether or not the response was grounded, don't include it in the span you annotate.

Example. Conversational Context: A: lets talk bowling a game where you throw a bowling ball towards a target. B: ok I love bowling I would
like to know more A: for one it is one of the major forms of throwing sports and I think a lot of people enjoy it B: yes me too I use to bowl all the time with my family what about the strike? A: well a strike is when you knock down all pins on the first round B: ok how can I get a turkey I think that may be hard to get though

Knowledge for Next Reply: Three consecutive strikes are known as a "turkey".

Hypothetical Instance 1. Next Reply: A: yea four consecutive strikes is known as a turkey

• Here, a contradiction is present in the underlined portions: four consecutive strikes in the response goes against what is provided in the knowledge, which says three.

• To make this reply grounded, one would have to change "four" to "three".

• As such, in your annotation, you will highlight the minimum needed span of text to make this reply grounded, with chevrons <>. Here, the minimum needed span of text to change is "four", so you would annotate as such:

Annotation. yea <four> consecutive strikes is known as a turkey

Hypothetical Instance 2. Next Reply: A: yea it is pretty hard, four consecutive strikes is known as a turkey

• Here, the contradiction is still present in the previously underlined portion: four is a contradiction with three.

• However, in addition to this contradiction, you see an expression of an opinion in this response: that getting a turkey is pretty hard (underlined).

• You are to ignore expressions of personal opinion during your annotation of minimal spans for this task, as we won't consider conversational chitchat—if complemented with grounded information elsewhere in the reply—to be a hallucination for this task. As such, you would annotate this as:

Annotation. yea it is pretty hard, <four> consecutive strikes is known as a turkey

Hypothetical Instance 3. Next Reply: A: yea it is pretty hard.

• Here, the response is purely an expression of an opinion and does not use any information in the provided knowledge.

• As such, you would mark this entire response to be a hallucination, annotating as:

Annotation. <yea it is pretty hard.>

Hypothetical Instance 4. Next Reply: A: Three strikes consecutively is a turkey.

• Here, the response is grounded. As such, you won't mark anything in the response, annotating as such:

Annotation. Three strikes consecutively is a turkey.

Hypothetical Instance 5. Next Reply: A: four consecutive strikes is known as a turkey, also known as a strike.

• Two hallucinations are present in this response:

• The direct contradiction of four with three in the knowledge source, and

• The final part of the reply, saying that it's also known as a strike, which is information not provided anywhere in the knowledge source. As such, you would mark this as:

Annotation. <four> consecutive strikes is known as a turkey, <also known as a strike>

A.3 E2E

Provided Information. (1) Restaurant attributes, (2) Generated restaurant description.

Task. Identify the presence and location of hallucinations in the generated restaurant description—by copying the text of the description and adding the characters < and > in specific locations—around the minimal span(s) of the generated restaurant description that are ungrounded relative to the provided attributes; i.e., if changing an additional token wouldn't change whether the description was grounded, don't include it in the span you annotate. Note that numerical ratings should be rendered as numerical ratings and text ratings should
be rendered as non-numerical expressions. For example, "customer rating" can be "low", "medium" and "high", but also "1", "3", and "5". If the customer rating is "1", for instance, and the generated description says ". . . has a low customer rating", that would be considered a hallucination. The "priceRange" attribute behaves this way as well, in addition to "customer rating." Finally, E2E specifies the "familyFriendly" attribute as akin to child-friendly in gold annotations: if a restaurant is family-friendly, then it is also child-friendly.

Hypothetical Instance 1. Restaurant Attributes: name[The Wrestlers], priceRange[less than £20], customer rating[low], familyFriendly[no]. Restaurant Description: The Wrestlers is an adults only restaurant with a customer rating of 1 out of 5 and a price range of less than £20.

• While the familyFriendly[no] attribute says that the restaurant is not family friendly, this does not mean that the restaurant is adults only.

• Having a low customer rating does not imply a 1 out of 5 customer rating. As such, annotate as:

Annotation. The Wrestlers is an <adults only> restaurant with a customer rating of <1 out of 5> and a price range of less than £20.

Hypothetical Instance 2. Restaurant Attributes: name[The Dumpling Tree], eatType[restaurant], food[Italian], priceRange[high]. Restaurant Description: The Dumpling Tree is an Italian restaurant with high prices.

• Everything is grounded, no need to mark anything.

Annotation. The Dumpling Tree is an Italian restaurant with high prices.

Hypothetical Instance 3. Restaurant Attributes: name[The Wrestlers], priceRange[less than £20], customer rating[low], familyFriendly[no]. Restaurant Description: Prices will never exceed £20 at The Wrestlers and they also don't really rate well. They are not even family-friendly.

• Having a priceRange attribute of less than £20 does not mean that prices will never exceed twenty.

• As "will never exceed" is the minimal span you would need to change to make the description grounded relative to the provided attributes, annotate:

Annotation. Prices <will never exceed> £20 at The Wrestlers and they also don't really rate well. They are not even family-friendly.

Hypothetical Instance 4. Restaurant Attributes: name[Midsummer House], food[French], customer rating[5 out of 5], near[Café Rouge]. Restaurant Description: A place for fine dining French food, Midsummer House is nearby Cafe Rouge.

• Nowhere in the attributes does it say that it's "fine dining." As such, annotate:

Annotation. A place for <fine dining> French food, Midsummer House is nearby Cafe Rouge.

Hypothetical Instance 5. Restaurant Attributes: name[Browns Cambridge], priceRange[cheap], priceRange[moderate], customer rating[1 out of 5]. Restaurant Description: The Brown's chain of food restaurants is an 1 out of 5 customer rated restaurant, and have moderate and cheap pricing. They are childfree.

• The attributes do not mention that Browns Cambridge is a chain of food restaurants.

• Furthermore, they do not say that the restaurant is childfree. As such, annotate:

Annotation. The Brown's <chain of food restaurants> is an 1 out of 5 customer rated restaurant, and have moderate and cheap pricing. <They are childfree.>

Hypothetical Instance 6. Restaurant Attributes: name[Browns Cambridge], priceRange[more than £30]. Restaurant Description: The Browns Cambridge is a restaurant with high prices.

• More than £30 does not necessarily mean "high prices." As such, annotate:

Annotation. The Browns Cambridge is a restaurant with <high prices>.
[KNOWLEDGE]
Taxi driver dropped off a soldier on an unsafe road, and he was killed by a coach. The taxi driver was not at fault.

Accounting is the language of business, and it measures the results of an organization's economic activities.

## RESPONSE THAT USES THE INFORMATION IN KNOWLEDGE
[RESPONSE]

whereafter everything generated in the response preceding a newline character is taken as the final generation. For E2E, we use a 10-shot in-context learning prompt in the form of