Teaching language models to support answers with verified quotes
arXiv:2203.11147v1 [cs.CL] 21 Mar 2022

Abstract

Recent large language models often answer factual questions correctly. But users can’t trust
any given claim a model makes without fact-checking, because language models can hallucinate
convincing nonsense. In this work we use reinforcement learning from human preferences
(RLHP) to train “open-book” QA models that generate answers whilst also citing specific
evidence for their claims, which aids in the appraisal of correctness. Supporting evidence is
drawn from multiple documents found via a search engine, or from a single user-provided doc-
ument. Our 280 billion parameter model, GopherCite, is able to produce answers with high
quality supporting evidence and abstain from answering when unsure. We measure the per-
formance of GopherCite by conducting human evaluation of answers to questions in a subset
of the NaturalQuestions and ELI5 datasets. The model’s response is found to be high-quality
80% of the time on this Natural Questions subset, and 67% of the time on the ELI5 subset.
Abstaining from the third of questions for which it is most unsure improves performance to
90% and 80% respectively, approaching human baselines. However, analysis on the adversar-
ial TruthfulQA dataset shows why citation is only one part of an overall strategy for safety and
trustworthiness: not all claims supported by evidence are true.
1. Introduction
Generative language models (LMs) are increasingly useful for answering questions about the world,
steadily improving (Cheng et al., 2021; Nakano et al., 2021; Roberts et al., 2020) on question-
answering benchmarks (Kwiatkowski et al., 2019; Lin et al., 2021; Pang et al., 2021), and serving
generated samples to users via APIs (Brockman et al., 2020; Cohere, 2021). By default, however, LMs
generate ungrounded claims that users must choose either to blindly accept or to verify themselves.
In this work we train models that help the user or data rater evaluate responses by generating claims
alongside supporting evidence. This evidence takes the form of a verbatim quote extracted from a
longer source retrieved by Google Search or any suitable information retrieval system. We call this
task “self-supported question-answering” (SQA), and intend it as a sub-task that can be embedded
into other generative language modelling tasks such as open-ended dialogue or debate (Askell et al.,
2021; Irving et al., 2018; Komeili et al., 2021; Rae et al., 2021; Thoppilan et al., 2022).

Figure 1 | Examples of answers and supporting quotes provided by GopherCite. The left example
uses a Wikipedia article as the source for the prompt; the right example uses the abstract of this
paper. The prompt template can be found in Table 12. Further examples of GopherCite answering
questions about the full introduction section can be found in Appendix K.
Crucially, citing external sources inline decreases the effort required on the part of human annotators.
By extracting specific supporting quotes from the document rather than linking to entire web pages,
we allow faster and more specific appraisal of supportedness. This also affords end-users a qualitatively
different level of trust in model samples, compared to systems which simply return an unsupported
answer. This consideration has also motivated recently released, partly concurrent work (Nakano
et al., 2021) in which a finetuned version of GPT-3 cites sources.
One could view self-supporting answers as a specific type of explanation, putting our work alongside
other work in explainable AI (Ras et al., 2020) that aims to provide natural-language explanations of
QA model responses (Lamm et al., 2020; Latcinnik and Berant, 2020; Narang et al., 2020). Our goals
are aligned to the extent that both explanations and supporting evidence are ways to increase trust
in model outputs. However, while our training objective incentivises the model’s answer to agree
with the evidence it provides, our method makes no attempt to guarantee that the evidence faithfully
(Jacovi and Goldberg, 2020) describes the reason that the model generated the claim. We view work
in that direction as complementary.
We cast SQA as a (conditional) language modelling problem, generating both free-form answers
and verbatim quotes of supporting evidence as a single string with evidence “inlined”. We term this
approach “Inline Evidence”. Whilst more specialized architectures exist for extracting spans from
documents (Joshi et al., 2020; Karpukhin et al., 2020; Keskar et al., 2019), we show that span
extraction with generative models works well and enables taking advantage of the powerful Large
Language Models (LLMs) developed in recent years (Brown et al., 2020; Lieber et al., 2021; Rae
et al., 2021; Smith et al., 2022; Zeng et al., 2021). In order to ensure the quotes are “verbatim” with a
generative approach, we introduce a special syntax for the language model to use when quoting from
documents and constrain the outputs of the model to be exact quotes from the retrieved documents
when in this mode.
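A toy sketch of this constraint, at the level of whole-word tokens (the real system operates on subword tokens and masks the model's logits rather than scoring greedily; function names are ours, not the paper's):

```python
def allowed_next_tokens(doc_tokens, quote_prefix):
    """Return the set of tokens that may follow `quote_prefix` such that the
    quote remains a contiguous, verbatim span of `doc_tokens`."""
    n, k = len(doc_tokens), len(quote_prefix)
    if k == 0:
        return set(doc_tokens)  # a quote may start anywhere in the document
    allowed = set()
    for i in range(n - k):
        if doc_tokens[i:i + k] == quote_prefix:
            allowed.add(doc_tokens[i + k])
    return allowed

def constrained_quote(doc_tokens, score_fn, max_len=10):
    """Greedy decoding stand-in: repeatedly pick the highest-scoring token
    among those permitted, so the quote is verbatim by construction."""
    quote = []
    for _ in range(max_len):
        allowed = allowed_next_tokens(doc_tokens, quote)
        if not allowed:
            break
        quote.append(max(allowed, key=score_fn))
    return quote
```

Because every emitted token must extend some occurrence of the current prefix in the document, the generated quote cannot drift from the source text.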
To measure the quality of the generated answers on the task of Self-Supported question-answering
(SQA), we ask human raters to assess whether the answers are plausible and whether they are
supported by the accompanying quote evidence. The first metric, “plausible”, assesses whether the
answer is a reasonable, on-topic response to the question, as if it arose in a conversation. The second
metric, “supported”, is introduced to indicate whether the provided evidence is sufficient to verify
the validity of the answer. Producing SQA responses that are both plausible and supported is a
nontrivial exercise in aligning the language model to human preferences.
In this work, we describe an Inline Evidence system – named GopherCite – which we developed by
finetuning the 280B parameter Gopher language model (Rae et al., 2021) using a combination of
supervised learning and Reinforcement Learning from Human Preferences (RLHP), as in (Ziegler
et al., 2019). Given an input query, the system retrieves relevant documents using Google Search
and presents the language model a large context drawn from multiple documents. Whilst our system
trusts these sources, we do not explicitly mitigate untrustworthy sources in this version of our work
and forward documents to the model no matter where they come from. The language model, in turn,
synthesizes a SQA response, with the evidence drawn as a verbatim quote from one of these articles.
During reinforcement learning, GopherCite optimizes the score from a “reward model” trained to
predict human pairwise preferences between two candidate responses, with an auxiliary classification
loss predicting whether the response is plausible and whether it is supported.
Retrieving sources using a search engine (Lazaridou et al., 2022; Nakano et al., 2021; Thoppilan
et al., 2022) that is kept up-to-date – and supplying them to the language model in a nonparametric
fashion – can enable improved temporal generalization over a purely parametric model (Borgeaud
et al., 2021; Lewis et al., 2021a; Liska et al., 2022). It also enables the system to attempt questions
implying the present date, like “which country got the most medals in the last winter olympics?”.
In our experiments, we show that GopherCite produces high quality (plausible and supported) answers
80% of the time when prompted with fact-seeking questions drawn from a filtered subset of the
NaturalQuestions dataset and 67% of the time when prompted with explanation-seeking questions drawn
from a filtered subset of the ELI5 (“Explain like I’m five”) dataset (Fan et al., 2019). Furthermore, we
can improve the reliability of the system dramatically by selecting a minority of questions to decline
to answer (El-Yaniv et al., 2010).
We develop a reward model-based mechanism for abstaining from answering a configurable
proportion of test-time questions. Performance is measured in this setting by plotting the trade-off
between question coverage (the proportion of questions attempted) and the quality of responses
when attempting. When declining to answer less than a third of questions in these datasets, the
response quality measured amongst those questions the system attempts climbs from 80% to 90%
on the filtered NaturalQuestions subset, exceeding the level of performance humans obtain when
answering every question. On the filtered ELI5 subset, performance improves from 67% to 80%.
Despite these benefits, optimizing for answers that can be supported by documents on the internet is
not sufficient to ensure that model responses are true. We show this via evaluation on the adversarial
TruthfulQA (Lin et al., 2021) dataset, along with some qualitative highlights. Whilst often helpful,
our models are able to select misleading evidence even from authoritative corpora, pointing to a need
for improvement in future work. In particular, we need to tackle source trustworthiness, ensure
answers are given with more careful qualification, and investigate whether more subtle alignment
approaches such as debate can provide reward signals which ensure that quotes are not misleading.
Recent Related Work As we developed GopherCite, closely related work was released, including
an updated LaMDA model (Thoppilan et al., 2022) and the WebGPT system (Nakano et al., 2021).
LaMDA also focuses on factual grounding, but supports answers by simply showing a URL rather than
pointing the user to an easily verified quote as we do in GopherCite. Similar to our work, WebGPT
uses RLHP to train question-answering models which refer to sources from the internet. WebGPT
learns to interact multiple times with a search engine when gathering evidence to be passed to the
question-answering model, critically deciding which queries to issue to a search engine rather than
simply forwarding the user query as we do. In our work, instead of curating a collection of brief
snippets from multiple search engine interactions, we condition GopherCite with a large context with
thousands of tokens of uncurated information from multiple pages, focusing GopherCite on reading
comprehension, and we specifically investigate how well the model supports individual claims. We
view the richer interaction with a search engine developed in LaMDA and WebGPT as an exciting,
complementary direction to the focus of our work. Further similarities and differences to WebGPT,
LaMDA, and other recent work in the community are detailed in subsection 2.10. Interestingly, we
concur with many of their empirical results such as the relative performance of reinforcement learning
and supervised finetuning in the reranking regime, and the ability to obtain models competitive with
human performance.
2. Methods
2.1. Inline evidence syntax
Our models generate an answer with supporting evidence “inlined” into a single string (hence “Inline
Evidence”), treating the task of producing supported claims as (conditional) language modelling.
Answer and evidence use the following template, where the claim, the source document title, and
the verbatim quote are placed between fixed delimiter tokens:

%<Claim>%(Document title)%[Verbatim quote from document]%
For example, the answer from Figure 1 about the Scooby-Doo series would be expressed as:
%<A Great Dane dog.>%(Scooby-Doo)%[This Saturday-morning cartoon series featured
teenagers Fred Jones, Daphne Blake, Velma Dinkley, and Shaggy Rogers, and their
talking Great Dane named Scooby-Doo.]%.
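Such strings can be parsed post-hoc to recover the claim, source title, and quote separately, as used when rendering them in a UI. A minimal regex sketch (function name is ours):

```python
import re

# Matches the inline evidence template: %<claim>%(title)%[quote]%
_INLINE_EVIDENCE = re.compile(r"%<(.*?)>%\((.*?)\)%\[(.*?)\]%", re.DOTALL)

def parse_inline_evidence(sample):
    """Split a model sample into (claim, source title, verbatim quote).
    Returns None if the sample does not follow the syntax."""
    m = _INLINE_EVIDENCE.search(sample)
    return m.groups() if m else None
```

For the Scooby-Doo example above, this yields the claim "A Great Dane dog.", the title "Scooby-Doo", and the quoted evidence.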
Figure 2 | Diagram of the runtime answer generation procedure with reranking. (1) Sample the
question from a dataset (2) Query Google SearchAPI for the top-K results (3) Sample the generator
(SFT/RL) model to produce N>K answers, where each sample is conditioned on one of the truncated
search results (in round robin fashion) (4) Score the samples with reward model, taught from human
preferences (5) Choose the sample with highest reward and present it to the user.
As we use left-to-right language models, our syntax amounts to the autoregressive factorization

p(claim, title, quote | question, c) = p(claim | question, c) · p(title | claim, question, c) · p(quote | title, claim, question, c)

where c is the set of context documents retrieved from Google Search or provided by the user,
described in subsection 2.3.
We benefit from these additional properties of this syntax:
• Parsing and constrained sampling We can parse expressions emitted by the model post-hoc,
or constrain them to be valid online during sampling (Appendix J). Constrained sampling
ensures that the model quotes are verbatim from the claimed source. Post-hoc parsing is useful
for splitting up the sample to render the claim and evidence separately either in a UI or in a
downstream system that may wish to use the claim and evidence separately.
• Scoring answers in isolation Because answers occur first in the autoregressive ordering, we
can assign likelihood to them without considering evidence.
• Conditional evidence generation We can treat conditional evidence generation as the contin-
uation of a prefix in which a claim is given.
To be clear with the terminology introduced, we view Self-Supported Question Answering (SQA) as
the task of producing a supported answer and Inline Evidence as one way to approach the SQA task.
Figure 3 | Diagram of the Self-Supported Question Answering training pipeline. The bottom arrows
indicate data flow throughout the project. The trained models (in blue) are used to gradually collect
more training data that is labelled by raters, and then used to train new generation of models.
At inference time, we retrieve the top 𝐾 documents from Google Search, and then perform 𝑁 > 𝐾
sampling passes iterating over documents in round-robin order, each of which shows the language
model as much context as possible from a single document, and then re-rank all the samples when
choosing one to return. Figure 2 depicts this process.
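The runtime procedure can be sketched with stand-in `search`, `generate` and `reward` callables (illustrative names, not the paper's API):

```python
def answer_with_reranking(question, search, generate, reward, k=4, n=16):
    """Sketch of the inference loop in Figure 2: fetch the top-k documents,
    draw n samples round-robin over the documents, and return the sample the
    reward model scores highest."""
    docs = search(question)[:k]
    samples = [generate(question, docs[i % len(docs)]) for i in range(n)]
    return max(samples, key=lambda s: reward(question, s))
```

Each sample is conditioned on a single (truncated) document, and the reward model sees only the question and the candidate SQA response when reranking.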
For details of how we combine a question and retrieved documents into prompts during training, see
Appendix A and Appendix H.
Our training pipeline, depicted in Figure 3, iterates the following steps:

Step 1: Collect data from our best current models, and have it rated by humans. We present
model outputs to human labellers, who assess the quality of individual answers and give
preference judgements between pairs of answers (subsection 2.6). These serve
as data for supervised fine-tuning and reward model training, respectively. On the first
iteration, we bootstrap with few-shot prompting of the base Gopher model (subsection 2.5).
Step 2: Train a supervised finetuning (SFT) model: We fine-tune a pretrained Gopher model
on the examples rated positively by the labellers (subsection 2.7). The purpose of the
supervised finetuning stage is to teach the model to produce verbatim quotes using our
syntax, and to provide a baseline level of Self-Supported Question-Answering ability.
Step 3: Train a reward model (RM): Reranking model outputs and reinforcement learning both
require a scalar "overall quality" label associated with each output. We use a reward model
trained on a dataset of comparisons between two answers to a single question using the
approach in Christiano et al. (2017) (subsection 2.8).
Step 4: Optimize a reinforcement learning (RL) policy against a reward model: The RL fine-
tuning stage tunes the model’s quoting behaviour to human preferences (subsection 2.8).
Step 5: Repeat from Step 1.
Each iteration of this loop adds data to a continuously growing training set. A full loop of this training
scheme was performed four times for short-answer extractive QA data, using the training sets of Natural
Questions (Kwiatkowski et al., 2019), SQuAD (Rajpurkar et al., 2016), and TriviaQA (Joshi et al., 2017),
and then a further two times to extend the system’s abilities to non-extractive longer-form question
answering on the ELI5 dataset (Fan et al., 2019).
The supervised model requires labelled input-output examples where the desired outputs make use
of “inline evidence” syntax. No such dataset exists of sufficiently high quality1 , so we created a small
training set with about 5000 high-quality examples with questions drawn from the ELI5 (Fan et al.,
2019) and Natural Questions (Kwiatkowski et al., 2019) datasets, and articles retrieved via Google
Search. In this training set, only the questions from the canonical datasets were used, whilst the target
answers were sampled from Gopher (the 280B parameter variant described in Rae et al. (2021)).
Collecting human demonstrations is a standard but expensive way to create a supervised dataset,
and is the approach taken by related work (Nakano et al., 2021; Thoppilan et al., 2022). We instead
“prompted” Gopher with a few in-context examples (Rae et al., 2021) to generate tens of thousands
of candidate answers with inline evidence. We then asked human contractors which samples are high
quality according to desiderata discussed later in the paper, and kept only the high-quality samples.
This approach has only recently become viable due to the development of capable language models
(Brown et al. (2020); Rae et al. (2021)), and has also been used for the creation of a language dataset
in recent work (Liu et al., 2022).
Appendix H has our prompt templates and further details. For a more thorough study on prompting
language models with search engine results to increase factuality, see Lazaridou et al. (2022).
We make use of a “super rater” model, in which paid contractors on a publicly available platform are
first assessed for their agreement with ourselves (the researchers) when completing this task. Raters
who meet a high enough bar for agreement on Preferred (85% of responses in a quality assurance
set) are kept in a “super rater” pool which we iteratively grew over the course of the project. Our
training data came exclusively from this set of raters. Appendix C has our full instructions, rater UI,
and more details on our data collection process.
1 […] but preliminary experiments using this dataset as a source of inline evidence targets for supervised learning found
that it resulted in poor models with low diversity.
The SFT model is conditioned on a prompt including the question and documents retrieved by Google
Search. The prompts used during SFT are shown in Table 12.
During supervised training, we uniformly at random decide how many documents to show the model
inside a context of 4096 subword tokens, and how much of the context to dedicate to each document.
When we pick a budget of 𝑛 tokens for a given document, we truncate to 𝑛 tokens by randomly
choosing a region surrounding the short snippet returned by Google Search (with a variable amount
of additional context before the snippet, after the snippet, or both). Consequently, during inference
we can show one document, or many documents, either brief in length or up to 4096 subword tokens
in length. We also ensure that the document from which a target’s quote came is included in the
documents fed to the model, and that if the context document is truncated, the quoted snippet still
occurs in it.
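The truncation scheme can be sketched as follows (token indices and the `rng` argument are illustrative; the real pipeline works over subword tokens):

```python
import random

def truncate_around_snippet(doc_tokens, snippet_start, snippet_end, budget, rng=random):
    """Pick a `budget`-token window of the document that contains the search
    snippet, with a random amount of extra context before and/or after it."""
    snippet_len = snippet_end - snippet_start
    if budget <= snippet_len:
        return doc_tokens[snippet_start:snippet_start + budget]
    slack = budget - snippet_len
    before = rng.randint(0, slack)  # slack placed before the snippet; rest goes after
    # Clamp so the window stays inside the document but still covers the snippet.
    start = max(0, min(snippet_start - before, len(doc_tokens) - budget))
    return doc_tokens[start:start + budget]
```

Varying `before` between 0 and the full slack reproduces the "context before the snippet, after the snippet, or both" behaviour described above.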
We ran supervised finetuning for just 60 SGD steps with batch size 128. After this amount of training
the largest Gopher model produces perfect verbatim quotes on held out data around 75% of the
time, even without constrained sampling. We perform as few steps of supervised finetuning as
possible in order to keep the inline-evidence-producing SFT model as close as possible to the raw
language model, retaining its entropy and general capabilities. We therefore chose an aggressive
early stopping criterion, ending training when validation set perplexity stopped improving on either
our ELI5 Rated-Good development set or a NaturalQuestions development set formed from gold-truth
targets (described in subsection 3.2 as “Gold + GoldEvidence”). This resulted in finetuning for only
60 steps, which was slightly less than a single epoch.
We follow the “Reinforcement Learning from Human Preferences” pipeline of Christiano et al. (2017),
with a few small differences tailored to our setup explained below. Note that whilst we mirror and
reference this work’s training setup in particular, reinforcement learning from human preferences
has been developed for over a decade at time of writing, e.g. (Akrour et al., 2011; Schoenauer et al.,
2014; Wirth et al., 2016), with a review in Wirth et al. (2017).
Training the Reward Model (RM) Following Christiano et al. (2017) we collect human assessments
of model samples and train a “Reward Model” (RM) to predict human judgment of a binary pairwise
preference outcome, using the standard cross-entropy loss. The reward model is a classifier predicting
this binary variable indicating which example in a pair was preferred, given a question and Self-
Supported Question-Answering (SQA) response string. Note that the RM does not receive the full
document context as input, only the piece of evidence selected by a model, in order to maintain
parity of interface between human users and the RM. In the event of a tie, this classification objective
becomes maximum entropy, which is a slight variation on formula (1) in Christiano et al. (2017).
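One way to implement this objective, with a tie encoded as a soft target of 0.5 (a minimal scalar sketch; the real reward model is a finetuned LM producing these scores):

```python
import math

def preference_loss(r_a, r_b, target):
    """Cross-entropy loss for the reward model's pairwise comparison.
    `target` is the probability that response A was preferred: 1.0 or 0.0 for
    a clear preference, 0.5 for a tie (the maximum-entropy variant).
    r_a and r_b are the scalar rewards assigned to the two responses."""
    p_a = 1.0 / (1.0 + math.exp(r_b - r_a))  # sigmoid(r_a - r_b)
    eps = 1e-12
    return -(target * math.log(p_a + eps) + (1 - target) * math.log(1 - p_a + eps))
```

With equal rewards the tie target incurs the maximum-entropy loss ln 2, and the loss falls as the reward gap agrees with the labelled preference.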
The training set containing these human labels had 33,242 rated SQA response pairs, with questions
drawn from Natural Questions, ELI5, and a few additional datasets in smaller numbers (Table 13).
Appendix C contains further details on this training data.
We warm-start the RM from the pretrained 7B language model from the Gopher family (Rae et al.,
2021) and add an extra final linear layer to predict the reward. The RM also predicts the binary
Supported&Plausible judgements2 of the individual SQA responses as an auxiliary loss. The final
loss is the average of the pairwise preference prediction loss and the auxiliary prediction loss. We
early-stop according to the preference prediction accuracy on a held-out validation subset of ELI5
which was rated by the researchers.
Using Reward Models for Reranking We use reward model scores to do reranking of candidate
responses. At inference time we draw 𝑁 samples and select one with maximal reward. We call such
models ’SFT + top@N’ or ’RL + top@N’ (depending on the underlying generator).
This approach is similar to what is described as “Sample and Rank” in (Thoppilan et al., 2022).
2 That is, the auxiliary loss predicts only a single binary variable indicating whether or not a response is both supported and
plausible.
Training against Reward Models with RL Fine-tuning. We use RL to maximize the expected
reward E_{y∼p_r(·|x)}[r(x, y)], where x is the input and y the sampled response. We train the LM
policy p_r with synchronous advantage actor-critic (A2C; Mnih et al. (2016)). We follow the same
training setup as in Perez et al. (2022), which we summarize again for completeness. We warm-start
p_r by initializing with the SFT model from subsection 2.7.
To prevent RL from collapsing to a single, high-reward generation, we add a loss term to penalize KL
divergence between 𝑝𝑟 ( 𝑥 ) and initialization’s distribution over next tokens (Jaques et al., 2017, 2019;
Schmitt et al., 2018; Ziegler et al., 2019). The final loss is a linear combination of the KL penalty
(weighted by 𝛼 ∈ [0, 1]) and A2C loss (weighted by 1 − 𝛼). We vary the KL penalty strength, using
decreasing values of 𝛼, sacrificing diversity for expected reward. See Appendix F for further details.
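The combined objective can be sketched with toy stand-ins for the per-token distributions and the A2C loss (a sketch of the weighting, not of A2C itself):

```python
import math

def kl(p, q):
    """KL divergence between two next-token distributions (lists of probs).
    Assumes q puts mass wherever p does."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rl_loss(a2c_loss, policy_dist, init_dist, alpha):
    """Final RL objective described above: a linear combination of the A2C
    loss and a KL penalty keeping the policy close to its SFT initialization.
    alpha in [0, 1] trades expected reward against diversity."""
    return alpha * kl(policy_dist, init_dist) + (1 - alpha) * a2c_loss
```

With alpha near 1 the policy is pinned to the SFT distribution; decreasing alpha lets the policy chase reward at the cost of diversity.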
We investigate enabling the system to decline to answer a subset of input questions, e.g. returning
the string “I don’t know” instead of a low-quality answer. We found that a global threshold on the
reward model score worked well, falling back to “I don’t know” if the score falls below the threshold.
This setup could be described as “selective prediction” (also known as prediction with a reject option)
(El-Yaniv et al., 2010; Geifman and El-Yaniv, 2017, 2019; Kamath et al., 2020). We study the selective
prediction ability of our reward models compared to the agents’ likelihood in subsection 3.3.
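The thresholding mechanism and the resulting coverage/quality trade-off can be sketched as:

```python
def coverage_quality(scored, threshold):
    """Given (reward, is_good) pairs for evaluation questions, compute the
    coverage/quality trade-off when the system answers "I don't know"
    whenever the reward-model score falls below `threshold`."""
    attempted = [good for score, good in scored if score >= threshold]
    coverage = len(attempted) / len(scored)
    quality = sum(attempted) / len(attempted) if attempted else 0.0
    return coverage, quality
```

Sweeping the threshold traces out the curves discussed in subsection 3.3: raising it lowers coverage but, when the reward model is well calibrated, raises the quality of the answers that remain.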
2.10. Comparison to recent related work

• From the user’s perspective: LaMDA (Thoppilan et al., 2022) shows just a URL as supporting
evidence, putting the burden of fact verification on the user. GopherCite provides exact and
succinct quotes supporting the claim. WebGPT links claims to quotes, and allows the model
to link multiple supported claims into an answer that is assessed by raters. In contrast to that
work, we specifically study the rate at which individual claims are supported.
• Training data: Both WebGPT and LaMDA are trained from human demonstrations. In Go-
pherCite we bootstrap from data generated by a few-shot prompted language model. Similarly
to LaMDA and WebGPT, we draw many samples and use a reranking network to pick the
model’s final response. In the LaMDA case, the system is fully supervised. In our case, the
classifier used for reranking is a reward model predicting pairwise preference. Similarly to
WebGPT, we apply Reinforcement Learning from Human Preferences to improve the quality
of our supervised system. Lazaridou et al. (2022) do not do any finetuning and rely only on
prompting.
• Learning to query LaMDA and WebGPT train agents to learn to query a search engine, and
can query multiple times for a given input. We simply forward the user’s question to a search
engine and condition upon the results, as in Lazaridou et al. (2022).
• Information retrieval: LaMDA uses very short fragments returned by the query as the model
conditioning (just Google snippets of 1-2 sentences, or knowledge graph relations). WebGPT
forms its final response by conditioning a language model with a brief, well-curated context of
multiple quotes. GopherCite conditions on much longer documents – it is trained on contexts
of up to 4096 tokens and can draw upon contexts at least this long during inference time.
Lazaridou et al. (2022) only condition a language model with a brief snippet extracted from
the search results via a simple TFIDF baseline.
• Abstention: We train GopherCite to always directly answer a question3. But we can configure
the frequency with which GopherCite declines to answer by setting the threshold on an acceptable
score under the reward model. By contrast, WebGPT includes demonstrations of answers
that dodge the question, allowing a kind of incremental abstention at the model’s discretion.
3 see the “Informative” rate in Table 4, per the definition from Lin et al. (2021).
3. Results
3.1. Evaluation datasets and protocol
Our primary evaluations for Self-Supported Question Answering (SQA) are conducted by asking paid
contractors to assess model samples. We chose to evaluate this way due to the lack of ground truth
targets for SQA across question-answering datasets. We evaluate using questions drawn from the
Natural Questions (Kwiatkowski et al. (2019)) and ELI5 (Fan et al. (2019)) datasets. To keep the
cost of human evaluation manageable, and avoid train-test overlap (Lewis et al., 2021b), we use
subsets of test questions from these standard question-answering benchmarks. We filter the datasets
to small subsets in the ways described below and refer to them as ‘filtered’ NQ/ELI5.
Our models show no improvements in truthfulness per the definition from TruthfulQA. Although
the produced answers achieve high Supported&Plausible rates, they are rarely scored as truthful
when compared against the ’correct’ answers in the TruthfulQA dataset of Lin et al. (2021). This is
because the concept of an answer being ’Supported’ does not distinguish between what is written
down in some document (which may, for example, describe a fictional world) and what is true in an
absolute sense (subsection 3.7).
• Gold + GoldEvidence. The claim is the “short answer” collected by the NaturalQuestion
dataset annotators. The “supporting evidence” is the “long answer” (typically a paragraph,
and always a span from a relevant Wikipedia article) that contains the information required to
answer the question as determined by an annotator.
• Gold + Random-Sentence. The “supporting evidence” is formed by choosing a random
sentence from the dataset-provided Wikipedia document. This is a sense check but has nonzero
performance due to some documents being very short.
• Gold + First-Sentence. The supporting evidence is chosen to be the first sentence in the
Wikipedia document containing the ground truth answer. Another sense check, but its nontrivial
performance demonstrates how easy many questions from Natural Questions are. Similar
baselines were surprisingly strong in past work on summarization (Ziegler et al., 2019).
• Gold + TFIDF-Sentence. The supporting evidence is taken to be the closest sentence—in
TFIDF vector-space—to the bag-of-words formed by concatenating the question and the ground
truth answer5 .
• FiD-DPR. The output of FiD (Izacard and Grave, 2020) is used as the answer, and one of the
100-word retrieved passages (“chunks”) used to condition the model is shown to raters as
the “evidence”. In particular, we pick the highest-ranked retrieved chunk which contains the
answer as a substring. Due to the extractive nature of Natural Questions, one such chunk always
exists for a model trained thoroughly on the Natural Questions training set (which is true of
FiD). This answer-generating baseline is a state-of-the-art question-answering model at time
of writing. For brevity, we limited baselines drawn from the question-answering literature to
just this model, though it may also be worthwhile to consider comparing to a less standard
objective such as that of Perez et al. (2019) in future work.
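The Gold + TFIDF-Sentence baseline above can be sketched as follows. This is a simplified stdlib-only version using a dot-product match, with IDF estimated from the document's own sentences (per footnote 5); the paper's exact implementation may differ:

```python
import math
from collections import Counter

def tfidf_best_sentence(sentences, query_words):
    """Score each document sentence against the bag of words formed from the
    question and gold answer, and return the best-matching sentence."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Inverse document frequency over the document's own sentences.
    idf = {w: math.log(n / sum(1 for d in docs if w in d))
           for d in docs for w in d}
    query = Counter(w.lower() for w in query_words)

    def score(doc):
        tf = Counter(doc)
        return sum(tf[w] * idf.get(w, 0.0) * query[w] for w in query)

    return max(sentences, key=lambda s: score(s.lower().split()))
```

Query words that never occur in the document contribute nothing, so the selected sentence is the one sharing the rarest query terms.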
Whilst our supervised finetuned baseline models do not outperform the strongest of these baselines
(FiD-DPR) when sampled from naively, reranking and reinforcement learning substantially improve
GopherCite, going beyond the baselines and approaching the S&P quality of the ground truth data
(Table 1(a)). We perform an ablation study on the number of candidates to rerank and base model
size in subsection 3.5.
For the ELI5 dataset, there are no trusted “gold” answers with accompanying evidence. We therefore
handcraft the following baselines:
• Prompted Gopher with ROUGE evidence. The answer is produced by a few-shot prompted
Gopher model, where the prompt contains truncated search results as conditioning for each
5 We also experimented with comparing to the question only and the answer only, and found perhaps unsurprisingly that
querying evidence sentences using both question and answer performed the best. The inverse document frequency was
estimated using the (entire) Wikipedia document alone, rather than some larger corpus.
(a) NaturalQuestionsFiltered

Model                                      S&P%
Human
  Gold + GoldEvidence                      83.5 ±5.7
Handcrafted baselines
  Gold + Random-Sentence                    3.5 ±2.8
  Gold + First-Sentence                    19.1 ±6.0
  Gold + TFIDF-Sentence                    51.3 ±7.7
ML model w/ docs from Wikipedia
  FiD-DPR (Izacard and Grave, 2020)        58.3 ±7.6
  SFT – first answer (ours)                58.3 ±7.6
  SFT – top@64 (ours)                      74.8 ±6.7
ML model w/ docs from Google
  SFT – first answer (ours)                50.4 ±7.7
  RL – first answer (ours)                 60.9 ±7.5
  SFT – top@64 (ours)                      80.0 ±6.1
  RL – top@64 (ours)                       74.8 ±6.7

(b) ELI5Filtered

Model                                      S&P%
Few-shot prompted Gopher baselines
  evidence w/ constrained sampling         20.7 ±6.1
  evidence w/ ROUGE                        42.1 ±7.4
ML model w/ docs from Google
  SFT – first answer (ours)                36.4 ±7.2
  RL – first answer (ours)                 46.3 ±7.5
  SFT – top@64 (ours)                      57.9 ±7.4
  RL – top@16 (ours)                       66.9 ±7.0

Table 1 | Supported&Plausible percentage scores as evaluated by human raters on the subsets of
question-answering datasets, with 90% confidence intervals over the estimated proportion computed
as z·√(p̄(1 − p̄)/n).
question, and a claim without evidence as an answer (similar to Lazaridou et al. (2022)). The
“supporting evidence” is formed by finding the closest-matching sequence of 𝑛 + 2 sentences
(where 𝑛 is the number of sentences in the answer) in terms of ROUGE-L score against the
answer. Such samples are drawn 8 times (for different top-8 search results) and the sample
with the second-highest ROUGE-L match score, as this setup proved experimentally to achieve
the highest human ratings on our development set.6
• Prompted Gopher with generated evidence. The answer is produced by a few-shot prompted
Gopher model, where the prompt contains truncated search results as conditioning for each
question, and the answer with evidence represented in our inline evidence syntax. The samples
for new questions are then decoded using constrained sampling (Appendix J).
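The interval formula in the caption of Table 1 is the standard normal approximation for a proportion. As a quick sketch (the per-row sample sizes are not republished here, so the 𝑛 below is illustrative, not the actual rating count):

```python
import math

def sp_margin(p_hat: float, n: int, z: float = 1.645) -> float:
    """Half-width of the interval z * sqrt(p_hat * (1 - p_hat) / n);
    z = 1.645 corresponds to a two-sided 90% interval."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Illustrative only: with p_hat = 0.835 and an assumed n = 115 ratings,
# the margin comes out near the ±5.7 points of the Gold + GoldEvidence row.
print(round(100 * sp_margin(0.835, 115), 1))  # → 5.7
```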
We find in Table 1 that humans determine our best model responses to be high-quality 80% of the
time on our NaturalQuestionsFiltered validation subset, much more frequently than when using
strong evidence baselines. The model’s responses are deemed high-quality 67% of the time on our
ELI5Filtered test subset. Note that we use max-reward sampling @64 for NaturalQuestionsFiltered
and @16 for ELI5Filtered; this is because these levels proved best according to the ablation study
(Figure 6).
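Max-reward sampling @N amounts to reranking N candidate (answer, evidence) samples with the reward model and emitting the argmax. A minimal sketch (the reward function here is a hypothetical stand-in, not our trained reward model):

```python
from typing import Callable, Sequence

def top_at_n(candidates: Sequence[str],
             reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward-model score;
    with N candidates this is the 'top@N' decoding in the tables."""
    return max(candidates, key=reward_fn)

# Stand-in reward: prefer longer candidates.
samples = ["short claim", "a longer claim with a supporting quote"]
print(top_at_n(samples, reward_fn=len))  # → a longer claim with a supporting quote
```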
Preference versus human answers Here we assess the quality of a model’s answers in terms of
pairwise preference versus human baselines. When reporting these numbers, we split a tie between
the model response and the human response, counting the example as half a point for each, as in
prior work (Nakano et al., 2021), rather than e.g. filtering out ties. However the reported pairwise
preference numbers are not comparable to Nakano et al. (2021) due to the disparity in the question
subset discussed in subsection 3.1 and the fact that distinct raters participated in different human
evaluation protocols between that work and our own.

6 We note it may be surprising to the reader that we use one information retrieval baseline (TFIDF) for Natural Questions and
another (ROUGE-L) for ELI5. We used the ROUGE score for evidence selection on ELI5 due to incidental software development
convenience.

(a) Human preference numbers on NaturalQuestionsFiltered, evaluated against Gold + GoldEvidence.
(b) Human preference numbers on ELI5Filtered, evaluated against top-rated Reddit answers with URL references.
Table 2 | GopherCite preference vs. human-written baselines. Ties are counted as 1/2 point for each.
For NaturalQuestionsFiltered we compare against the Gold + GoldEvidence human baseline. Table 2a
shows that the answer and evidence from our best SFT with Reranking model on NaturalQuestionsFiltered
are preferred to the gold answer and evidence 49.5% of the time (i.e. NQ gold answers are preferred
50.5% of the time). Note that we use a different document corpus than that used by the ground-truth
targets (the whole web rather than Wikipedia), and there is a time mismatch, as NaturalQuestions
uses Wikipedia from 2018.
For ELI5 we compare against the top-rated Reddit answers, filtered to just those answers that
contain URL references (subsection 3.1). We describe exactly how the baseline and model outputs are
formatted into a single response when shown to humans in Supplementary material subsection C.3.
To ensure comparability in style between the model and human-written answers, we flatten the
(answer, evidence) model output into a single answer, using one of the templates (chosen at random)
that combine claims and evidence (e.g. {claim}\n\nAccording to the page
"{title}"[1]:\n{quote}\n\n[1] {url}).
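As a sketch, the flattening is a simple string-template fill using the template quoted above (the random choice among several templates is elided here):

```python
def flatten_response(claim: str, title: str, quote: str, url: str) -> str:
    """Merge a (claim, evidence) pair into one free-text answer using
    the template quoted in the text; in the evaluation, one of several
    such templates is chosen at random per example."""
    template = ('{claim}\n\nAccording to the page "{title}"[1]:\n'
                '{quote}\n\n[1] {url}')
    return template.format(claim=claim, title=title, quote=quote, url=url)

flat = flatten_response("The unicorn is the national animal of Scotland.",
                        "Unicorn",
                        "...the unicorn was chosen as Scotland's national animal...",
                        "https://en.wikipedia.org/wiki/Unicorn")
print(flat.splitlines()[0])  # → The unicorn is the national animal of Scotland.
```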
Table 2b shows that when compared to top Reddit answers that contain URL references, the answers
produced by our RL w/ Reranking model are preferred 42.9% of the time.
The preferences expressed by raters in this evaluation setting are often based on the answer’s structure
rather than its content. One rater commented: “It was sometimes difficult to decide which answer more
credibly answered the question, if they both seemed to provide the same or very similar information but
expressed differently.” The model’s claims are also shorter on average than Reddit answers, despite
the length-filtering.
In summary, when assessing model samples on the entirety of our test sets – with the model and
baselines required to attempt every question – GopherCite outperforms our baselines in terms of
S&P, but falls slightly short of human ground-truth responses in both S&P score and pairwise rater
preference.
[Figure 4: two panels titled “Declining to Answer: Coverage vs Supported & Plausible”, with curves
for SFT@64 and RL@16 scored by the reward model, by SFT likelihood, and by RL likelihood, plus
dashed lines marking the S&P% when an answer is always given and markers at 70% of questions
attempted.]
(a) NaturalQuestionsFiltered. More than 90% of answers are supported and plausible when attempting
70% of questions. (b) ELI5Filtered. More than 80% of answers are supported and plausible when
attempting 70% of questions.
Figure 4 | Declining to answer. By choosing a threshold for likelihood or reward below which the
system will answer “I don’t know” we can increase the proportion of attempted answers that are
supported and plausible. Here we show the % of questions where an answer is attempted (𝑥 -axis) vs.
the % of answers deemed Supported&Plausible amongst those attempted ( 𝑦 -axis). The dashed lines
indicate performance achieved by our best model when attempting all the questions.
Figure 4 shows the resulting trade-off. Declining to answer some percentage of questions using
the reward model results in higher Supported&Plausible human ratings on the resulting subset
of attempted questions, and the reward model improves over the two likelihood baselines. The
downward-sloping shape of the curve confirms that the reward model succeeds at selective
prediction: the smaller the proportion of questions attempted, the higher the quality of the
answers amongst those attempted. We include ablations for further scoring approaches in
Appendix G.
With our best-performing decline-to-answer strategy – declining whenever the reward model score
falls below a fixed threshold – we can substantially improve answer quality, outperforming the S&P
score of the human baseline (which attempts every question) in the case of NaturalQuestionsFiltered.
We leave comparing the selective prediction of models to selective prediction by humans themselves
to future work.
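The curves in Figure 4 can be traced with a few lines of code; `scores` and `is_sp` below are toy per-question reward-model scores and majority-vote S&P labels, not our data:

```python
from typing import List, Tuple

def coverage_curve(scores: List[float],
                   is_sp: List[bool]) -> List[Tuple[float, float]]:
    """For each coverage level k/n, answer only the k highest-scoring
    questions (declining the rest with "I don't know") and report the
    S&P rate among those attempted."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    curve, supported = [], 0
    for k, i in enumerate(order, start=1):
        supported += is_sp[i]
        curve.append((k / len(scores), supported / k))
    return curve

# Toy data where higher reward correlates with S&P answers:
print(coverage_curve([0.9, 0.2, 0.8, 0.1], [True, False, True, False]))
# → [(0.25, 1.0), (0.5, 1.0), (0.75, 0.6666666666666666), (1.0, 0.5)]
```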
Table 3 shows examples of questions from the NQ and ELI5 datasets alongside the model’s outputs:
claims and supporting evidence, and ratings according to the Supported and Plausible scheme. In
this table the samples are curated to illustrate success and failure modes; see Appendix B for a larger
set of non-curated examples, and Appendix I to examine samples alongside the annotators' ratings
on the entirety of our test sets.
Table 3 | Examples of questions, GopherCite’s outputs (claims and evidence) and corresponding
ratings according to the Supported and Plausible scheme.
1. Reranking with a reward model dramatically improves performance over SFT, but we see
diminishing returns in the number of samples, similar to the observation in Cobbe et al. (2021).
2. Reinforcement learning dramatically improves performance over SFT when decoding naively
with a single sample.
3. In the reranking regime, RL is outperformed by SFT, as observed in Nakano et al. (2021). We
offer hypotheses as to why this is the case.
Figure 5 shows that without reranking RL outperforms SFT on both datasets. However, the benefit is
less clear when combining the generator models with reranking. In the case of NaturalQuestionsFil-
tered (Figure 6a), SFT + top@64 achieves higher S&P rates than RL + top@64. For ELI5, however,
RL outperforms SFT consistently for all numbers of candidates.
Figure 6 breaks down S&P into separate Supported and Plausible curves vs. the number of samples
used for reranking. For NaturalQuestionsFiltered, where many answers are extractive and often simply
give a single named entity, the Plausible rate is around 90%. The S&P score in this regime is upper-
bounded by the Supported rate. ELI5 requires longer, non-extractive, explanatory answers; Plausible
rates are lower and decline as the number of candidate samples grows for the RL model.

[Figure 5: two panels, (a) NaturalQuestionsFiltered and (b) ELI5Filtered, plotting S&P% of the
top-reward answer against the number of answers sampled (1, 4, 16, 64), with curves for SFT and RL,
the dataset gold level, and, for ELI5, the Prompted Gopher + ROUGE baseline.]
Figure 5 | Supported & Plausible human ratings (majority vote) as a function of the number of
samples used for reranking. The final answer is chosen to maximise the reward model score.
The shaded region represents 90% confidence intervals over the estimated proportion, computed
as 𝑧√(¯𝑝(1 − ¯𝑝)/𝑛).

[Figure 6: four panels plotting S&P%, Supported%, and Plausible% against the number of answers
sampled (1, 4, 16, 64) on both datasets.]
Figure 6 | Supported & Plausible human ratings (majority vote) as a function of the number of
samples used for reranking. The final answer is chosen to maximise the reward model score.
We hypothesise that the under-performance of RL on NaturalQuestions may be caused by the following
factors.
• Over the course of RL finetuning we lose the diversity of sampled answers. This diminishes the
benefit of drawing a large number of samples to pick the best one from.
• During RL finetuning, we use a higher proportion of the ELI5 questions (Appendix F), as we
want to prioritize non-extractive question answering ability. This biases the RL pass to be tuned
toward ELI5 more so than toward NaturalQuestions.
[Figure 7: bar charts of (a) S&P% (top@64) and (b) Preference% (top@64) for the 1.4B SFT, 7B SFT,
and 280B SFT models.]
(a) Supported&Plausible human evaluation, on NaturalQuestionsFiltered. (b) Human preference
percentages of the model answer against the dataset gold truth, on NaturalQuestionsFiltered.
Figure 7 | Human evaluation results for model scale ablation experiments on NaturalQuestionsFiltered,
demonstrating significant improvements from scaling up the generator agent's parameter count. All
samples are chosen via rejection sampling (according to the 7B reward model's top score) from up to
64 samples.
Table 4 | Human evaluation percentage scores on the TruthfulQA dataset. The Supported&Plausible
score assesses the quality of (answer, evidence) pairs and was obtained using our standard QA
evaluation app. The Truthful&Informative score assesses the quality of the answers in isolation, and
was obtained via an evaluation app in which the raters could see the dataset's suggested correct and
incorrect answers.
Table 5 | Examples of questions from the TruthfulQA dataset (Lin et al., 2021) and GopherCite claims
and evidence. The first two examples illustrate situations in which the claim is Supported&Plausible
although it is not Truthful in the sense of the dataset's definition.
Another, deeper, problem is that the SQA format is adversarial to such examples: if there is no
document found by Search which states that Red Bull cannot cause wings to grow, it is very difficult
to produce a true response with supporting evidence, as there is nothing to quote from (cf. the
“percentage of the brain” case, where there exist articles debunking the common misconception).
In contrast, it is possible to provide answers which, although not true, are close enough to the
intuitive meaning of “supported” that a rater could justify labelling them such. Thus, the SQA setting
incentivises incorrect interpretation of the instructions. This underscores the importance of viewing
evidence quoting as just one component of a truthful agent – this problem could be alleviated in a
richer setting, e.g. where a model is permitted to make arguments grounded in common sense.
4. Discussion
4.1. Limitations
We view inline evidence as one tool meant to be used alongside other techniques to achieve truthful
LM outputs. Some limitations of using it in isolation have been discussed in subsection 3.7. In this
section, we detail further limitations of evidence quoting if it were used on its own, and suggest how
enriching the setting can help.
Errors in the supporting document corpus. Our implementation uses webpages returned by
Google Search, which can include unreliable sources. A complete approach must account for fallible
sources. But it is not feasible to simply implement a trusted allowlist over sources: even relatively
high-quality corpora like Wikipedia can contain errors or be biased (Hube, 2017; Martin, 2018), and
no curated allowlist would cover all claims of interest. As determining the reliability of a source is
itself a challenging task, augmenting the setup with a way to help the human make good judgements
using techniques like amplification (Christiano et al., 2018), recursive reward modelling (Leike et al.,
2018), or debate (Irving et al., 2018) may offer a promising way forward.
Explicit reasoning. GopherCite only uses a single quote to support an answer. Some claims require
multiple pieces of evidence, and/or an argument for why the claim follows from evidence. This is an
exciting area for followup work.
Misleading or cherry-picked quotations. Inline evidence does not rule out claims supported by
cherry-picked evidence. For example, citing a study could seem convincing if one does not know
that several other studies could not replicate its results. An adversarial agent which selects evidence
against the claim (as in debate; Irving et al. (2018)) could help detect instances of cherry-picking, by
using additional quotes to falsify misleading quotes.
Contentious claims. A special case of cherry-picked evidence is presenting one view as if it was true,
in cases where no accepted societal consensus exists. A related failure mode is presenting a consensus
or majority opinion as if it was fact (Weidinger et al., 2021).7 While adversarial agents could alleviate
this to some extent by pointing out that the claim is contentious, adequately addressing this challenge
will likely require dedicated sociotechnical research.
Not every claim can be supported with a concise quotation. Some facts may not be supportable
by brief quotations, even if they follow from information in the corpus, if the claim itself does
not appear. One example is negative evidence: naively supporting "No US President was ever an
AI researcher" would require enumerating the list of occupations of all US presidents. Another is
statistical claims, like "less than 30% of Wikipedia articles about plants contain the word ‘foam’".
While negative evidence can be addressed with Debate—the claim is supported if the adversary fails
to provide evidence to the contrary—statistical claims require stronger protocols.
5. Conclusion
Language models often produce hallucinated facts, and are trustworthy only once the answers
are independently verified. Our work addresses this challenge by moving from free-form question
answering to self-supported question answering, thus enabling the model itself to assist human users
and raters in verifying its outputs. We break the task into two pieces, one mechanical and one human:
special syntax that can be automatically parsed to ensure that a quote is verbatim from a source, and
human preferences to determine whether the quote supports the claimed answer. Reward modelling
using these human ratings shows dramatic improvement when used for reranking responses and
as a target for reinforcement learning. Moreover, reward modelling provides a natural mechanism
to abstain from answering when we lack confidence in an answer. Overall the GopherCite system is
able to provide samples with high quality evidence, or abstain. These successes notwithstanding,
our inline evidence mechanism is just one tool towards trustworthy language agents, and significant
research will be required to address its limitations and combine it with other tools.
Acknowledgements
The authors wish to thank Sebastian Borgeaud, Trevor Cai, Vlad Firoiu, Saffron Huang, Timo Ewalds,
George van den Driesche, Roman Ring, Arthur Mensch, Jordan Hoffmann, Laurent Sifre, and Jean-
Baptiste Lespiau for their contributions to DeepMind’s language modelling software ecosystem, and
particularly Doug Fritz for developing a frontend framework with which our human evaluation apps
were built. Katie Millican built the text scraper we used to preprocess all documents. Thanks to
Phoebe Thacker for early work setting up our human evaluation platform, and Boxi Wu for additional
program management support. Additionally, we thank Jonathan Uesato, Nando de Freitas, and Oriol
Vinyals for valuable feedback on the paper and Angeliki Lazaridou, Elena Gribovskaya, Jack Rae,
Charlie Nash, Ethan Perez, Oriol Vinyals, Aaron van den Oord, Simon Osindero, Marc Deisenroth,
Felix Hill, Ali Eslami, Iason Gabriel, Laura Weidinger, John Mellor, and Lisa Anne Hendricks for
insightful discussions. We also wish to thank Arielle Bier, Max Barnett, Emma Yousif, and Aliya
Ahmad for improving the writing in our public communications.
7 Standard approaches to improving dataset quality can exacerbate this by collapsing a diversity of opinions to the majority view.
Author Contributions
GopherCite’s training scheme was designed and developed by Jacob Menick, Maja Trebacz, Vladimir
Mikulik, Nat McAleese, and Geoffrey Irving.
The “Inline Evidence” approach was proposed by Vladimir Mikulik and Nat McAleese.
The evaluations were designed by Jacob Menick, Maja Trebacz, Nat McAleese, Vladimir Mikulik, and
Martin Chadwick.
The evaluations were executed and analysed by Maja Trebacz, Jacob Menick, Vladimir Mikulik, and
Nat McAleese.
The execution of generator agent training was performed by Maja Trebacz, Jacob Menick, Nat
McAleese, and Francis Song.
The human evaluation web apps were designed and built by Vladimir Mikulik, Maja Trebacz, Nat
McAleese, John Aslanides, and Martin Chadwick.
Human data quality monitoring and improvements were led by Vladimir Mikulik and Martin Chadwick,
with Jacob Menick, Maja Trebacz, and Nat McAleese contributing golden labels for quality control.
Human participant ethics standards were upheld by Lucy Campbell-Gillingham and Martin Chadwick.
Reward model training was designed and executed by Vladimir Mikulik, Maja Trebacz, and John
Aslanides.
The RL environment was created by Vladimir Mikulik, John Aslanides, and Francis Song, with search
engine integration by Maja Trebacz.
The RL training infrastructure was developed by Francis Song, John Aslanides, Nat McAleese, Mia
Glaese, Vladimir Mikulik, and Jacob Menick.
Broader large-scale language model finetuning infrastructure was developed by Mia Glaese, Nat
McAleese, John Aslanides, Jacob Menick, and Francis Song.
The constrained sampling implementation was prototyped by Geoffrey Irving and developed by
Vladimir Mikulik and Jacob Menick.
The project was managed by Susannah Young.
The paper was written by Jacob Menick, Maja Trebacz, Vladimir Mikulik, Nat McAleese, Geoffrey
Irving, and Martin Chadwick.
Diagrams were produced by Maja Trebacz.
Nat McAleese and Geoffrey Irving supervised the project, and Jacob Menick was accountable for its
outcome.
References
R. Akrour, M. Schoenauer, and M. Sebag. Preference-based policy learning. In D. Gunopulos,
T. Hofmann, D. Malerba, and M. Vazirgiannis, editors, Machine Learning and Knowledge Discovery
in Databases, pages 12–27, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. ISBN 978-3-642-
23780-5.
L. Aroyo and C. Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. AI
Magazine, 36(1):15–24, Mar. 2015. doi: 10.1609/aimag.v36i1.2564. URL https://ojs.aaai.
org/index.php/aimagazine/article/view/2564.
A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das-
Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B.
Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan. A general language assistant as a laboratory
for alignment. CoRR, abs/2112.00861, 2021. URL https://arxiv.org/abs/2112.00861.
A. Liu, S. Swayamdipta, N. A. Smith, and Y. Choi. WANLI: worker and AI collaboration for natural
language inference dataset creation. CoRR, abs/2201.05955, 2022. URL https://arxiv.org/
abs/2201.05955.
B. Martin. Persistent bias on wikipedia: Methods and responses. Social Science Computer Review,
36(3):379–388, 2018. doi: 10.1177/0894439317715434. URL https://doi.org/10.1177/
0894439317715434.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu.
Asynchronous methods for deep reinforcement learning. In International conference on machine
learning, pages 1928–1937. PMLR, 2016.
R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders,
X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman. Webgpt:
Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332, 2021. URL
https://arxiv.org/abs/2112.09332.
S. Narang, C. Raffel, K. Lee, A. Roberts, N. Fiedel, and K. Malkan. Wt5?! training text-to-text models
to explain their predictions. CoRR, abs/2004.14546, 2020. URL https://arxiv.org/abs/2004.
14546.
R. Y. Pang, A. Parrish, N. Joshi, N. Nangia, J. Phang, A. Chen, V. Padmakumar, J. Ma, J. Thomp-
son, H. He, and S. R. Bowman. Quality: Question answering with long input texts, yes! CoRR,
abs/2112.08608, 2021. URL https://arxiv.org/abs/2112.08608.
E. Perez, S. Karamcheti, R. Fergus, J. Weston, D. Kiela, and K. Cho. Finding generalizable evidence
by learning to convince Q&A models. In Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 2402–2411, Hong Kong, China, Nov. 2019. Association for Computational
Linguistics. doi: 10.18653/v1/D19-1244. URL https://aclanthology.org/D19-1244.
E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving.
Red teaming language models with language models. CoRR, abs/2202.03286, 2022. URL https:
//arxiv.org/abs/2202.03286.
J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson,
R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche,
L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor,
I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya, D. Budden,
E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh,
E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz,
T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi,
V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. Johnson, B. A.
Hechtman, L. Weidinger, I. Gabriel, W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals,
K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving. Scaling language
models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021. URL
https://arxiv.org/abs/2112.11446.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine compre-
hension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing, pages 2383–2392, Austin, Texas, Nov. 2016. Association for Computational Linguistics.
doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.
G. Ras, N. Xie, M. van Gerven, and D. Doran. Explainable deep learning: A field guide for the
uninitiated. CoRR, abs/2004.14545, 2020. URL https://arxiv.org/abs/2004.14545.
A. Roberts, C. Raffel, and N. Shazeer. How much knowledge can you pack into the parameters
of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 5418–5426, Online, Nov. 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.emnlp-main.437. URL https://aclanthology.org/2020.
emnlp-main.437.
S. Schmitt, J. J. Hudson, A. Zídek, S. Osindero, C. Doersch, W. M. Czarnecki, J. Z. Leibo, H. Küttler,
Appendix
A. Retrieval and Truncation Details
Given a question 𝑞 we obtain 𝐾 documents that are likely to contain an answer to the question. We
use 𝑞 directly as a query to the Google Search API, with additional keywords to restrict the sites.
For a large portion of the NaturalQuestions data, we restrict results to Wikipedia by appending
site:wikipedia.org to the query. For ELI5 we ensure that the results do not contain the Reddit
answers themselves by appending -site:reddit.com.
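As a small sketch of this query construction (the function name and dataset tags are ours, not part of the paper's codebase):

```python
def build_search_query(question: str, dataset: str) -> str:
    """Append site restrictions before querying the Google Search API:
    most NaturalQuestions queries are restricted to Wikipedia, while
    ELI5 queries exclude Reddit so the reference answers themselves
    cannot be retrieved."""
    if dataset == "nq":
        return f"{question} site:wikipedia.org"
    if dataset == "eli5":
        return f"{question} -site:reddit.com"
    return question

print(build_search_query("why is the sky blue", "eli5"))
# → why is the sky blue -site:reddit.com
```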
We retrieve the top-𝐾 search results (with 𝐾 ≤ 10) and obtain the web data in text-only format using
the custom HTML scraper from Rae et al. (2021).
Document lengths vary and often exceed the language model's maximum context of 2048 tokens.
Especially in the case of few-shot prompting, when presenting multiple documents at once, we need
to tightly restrict the number of tokens spent on document content. Hence, we truncate the
documents using the snippets of web content returned by the Search API alongside the URLs. We
match the snippet position inside the scraped document using fuzzy matching from the fuzzywuzzy
library8. Using the found indices, we truncate the document to a fragment of at most max_tokens
that contains the relevant search snippet. We discard any documents where the match ratio of the
snippet to the document is below a threshold of 0.75 (sometimes the snippet comes from a structured
part of the page that the scraper stripped, or the page has since changed). We also ensure that the
truncated fragment starts at the beginning of a sentence or paragraph. At train time we choose this
start position at random to increase the variety of the inputs. At inference time, we allow a maximum
of 500 characters before the start of the snippet fragment, and look for the first sentence start in
that range.
8 https://pypi.org/project/fuzzywuzzy/
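The matching-and-truncation step can be sketched with the standard library's difflib in place of fuzzywuzzy, and characters in place of tokens; the 0.75 discard threshold and 500-character look-back follow the text, while the longest-common-substring ratio is a simplification of fuzzywuzzy's scoring:

```python
import difflib
from typing import Optional

def locate_and_truncate(document: str, snippet: str,
                        max_chars: int = 2000,
                        min_ratio: float = 0.75) -> Optional[str]:
    """Fuzzily locate a search snippet inside the scraped document and
    return a fragment of at most max_chars containing it; documents
    whose best match falls below min_ratio are discarded (None)."""
    matcher = difflib.SequenceMatcher(None, document, snippet, autojunk=False)
    m = matcher.find_longest_match(0, len(document), 0, len(snippet))
    if not snippet or m.size / len(snippet) < min_ratio:
        return None
    # Start up to 500 characters before the snippet, snapped forward
    # to the next sentence boundary where one exists.
    start = max(0, m.a - 500)
    boundary = document.find(". ", start, m.a)
    if boundary != -1:
        start = boundary + 2
    return document[start:start + max_chars]

doc = "Intro text. The sky appears blue because of Rayleigh scattering. More."
print(locate_and_truncate(doc, "sky appears blue because of Rayleigh"))
# → The sky appears blue because of Rayleigh scattering. More.
```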
Table 6 | Questions from the NaturalQuestions validation set, chosen at random, presented alongside
the Dialogue-Prompted Gopher (DPG) (Rae et al., 2021) answers and the GopherCite claim and
evidence.
Table 7 | Questions from the ELI5 test set, chosen at random, presented alongside the Dialogue-
Prompted Gopher (DPG) (Rae et al., 2021) answers and the GopherCite claim and evidence.
C.1. Raters
Our raters consist of research participants using a crowd-sourcing platform. We restrict the participant
pool to UK-based participants whose first language is English and who hold at least an undergraduate
degree.
To ensure high-quality ratings, we used the following two strategies:
• super rater filtering: We used a simple quality assurance screening experiment to select the
raters that understood the task and had high agreement with ourselves (the researchers).
Raters who met a high enough bar for agreement on Supported, Plausible, and Preferred were
kept in a “super raters” pool which we incrementally grew over the course of the project. Our
super rater filtering threshold was set to 85% agreement with our own ratings (excluding ties)
for both the supported and plausible judgments, and 85% agreement with overall preferences.
In the first round of super rater sourcing, we took a set of 20 comparisons on Natural Questions
train questions rated by the researchers. A set of 100 crowdsourced raters provided their own
ratings for the same set of 20 comparisons, of whom 24 met the requirements to be added to
the super rater pool. We ran three further such sourcing experiments, collecting a final pool of
113 super raters. All members of the super rater pool were repeatedly asked to take part in
further data collection, providing both training and evaluation data across all of the experiments
(excepting those experiments which used a wider crowdsourced pool of raters).
• Attention checks: Each time any rater took part in a new experiment, we provided them with
clear instructions and task examples. Following this, we introduced a short comprehension and
attention check to ensure that the raters had a minimal understanding of the task requirements.
We handcrafted four examples where the correct evaluation choices should be easy if a rater
has correctly understood the task. For each example, the rater provided an answer, after
27
Teaching language models to support answers with verified quotes
which the ‘correct’ answers were revealed with some associated justification. This pre-task
component fulfilled two roles: first, to provide further training to the raters before starting the
main experiment; second, to screen out raters who answered too many of these easy questions
incorrectly. Specifically, the data from participants who did not answer at least 3 of the 4
screening questions correctly was discarded. The pass rate for this screening component was
around 85% (with some variability between the specific tasks).
We used the super raters for collecting all of our training data and for the Natural Questions validation
evaluations (for consistency with earlier evaluation in the project). For ELI5 and TruthfulQA evaluations, we opened
the study to a wider pool of new raters, ensuring no overlap with the super rater pool, and used
attention checks to filter for rating quality.
In order to provide an additional degree of robustness in our human evaluations, we had every
example rated by multiple independent raters, and in each case took the majority vote answer.
When judgements were tied, the label with the smaller index was returned9, i.e. when there was a tie
in the Supported&Plausible binary judgement, 𝐹𝑎𝑙𝑠𝑒 was returned. When running the evaluation with
the super rater pool, each example was scored by 3 independent raters. When running the experiments
with the wider pool of raters, we had each example scored by 6 independent raters, to allow for the
fact that some data would later be filtered out by failed attention checks. This approach ensured
at least a reasonable level of item repetition even given this attrition.
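The aggregation rule can be sketched as follows (mirroring scipy.stats.mode, which returns the smallest value among ties):

```python
from collections import Counter

def majority_vote(ratings):
    """Aggregate independent ratings by majority vote; on a tie the
    smallest label wins, so a tied Supported&Plausible binary
    judgement resolves to False."""
    counts = Counter(ratings)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

print(majority_vote([True, True, False]))  # → True
print(majority_vote([True, False]))        # → False  (tie broken downward)
```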
In order to compare model-generated answers with the ELI5 human answers in as fair a way as
possible, we take the following approach:
• Select a subset of ELI5 human answers which directly cite supporting evidence, making these
answers more directly comparable to the model samples, which always use supporting evidence.
• Adapt the model samples to have a freetext form, merging the claim and evidence inline. This
is to make them more similar to the subset of ELI5 answers.
• Use an adapted version of the comparison app that shows only a single text box, without a separate
evidence box (Figure 9). We still ask for ‘plausible’ ratings, but no longer ask for ‘supported’
ratings, as these no longer make sense for free-text answers. We still ask for overall
preferences. The complete wording of instructions is shown in Table 9.
9 https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html
Task Instructions
In this task, you will compare attempted answers to a user’s questions. The answers come from an untrustworthy machine learning
model, but they come with attempted justifications.
Your goal is to identify answers which are successfully justified by evidence.
You should only use information from this app. You don’t need, and shouldn’t use, external tools like Google Search.
Responses consist of two kinds of text: the answer (shown on the left in a blue box) and the evidence (shown on the right in a grey box).
- The answer (blue box) could be any text: from perfectly correct to incomplete, off-topic, false, or even deliberately lying to you. In
good responses, this answer should be a brief on-topic reply to the question.
- Evidence (grey box) is always a legitimate quotation from a Wikipedia article. For this task, assume that Wikipedia is always correct.
In good responses, this evidence should convince you that the answer is a correct answer to the user’s question.
Above we showed a nearly perfect example. 23.4 degrees (shown in a blue box on the left) is a succinct answer to the question, giving
exactly the information the user was looking for. The evidence (in the grey box on the right) clearly backs up the answer. Answers to
other questions can require more evidence than in this simple example.
Full examples with researcher ratings can be found in the "Examples" tab in the top right corner.
Task Details
Please pick which answer you think is best. The following criteria should guide your decision.
For each answer, check the respective box if it is:
- Plausible – the answer (blue box) should be a reasonable reply to the question if you were having a conversation. If the answer is
off-topic, incoherent, or it’s not clear if it makes sense as a reply to the question, it is not plausible.
- Supported – the evidence must be sufficient to convince you that the whole answer is true. If you happen to know the answer is false,
or if you need any extra information to be convinced, the answer is not supported. If the evidence is not pertinent to the question or the
answer, it cannot support the answer. You can determine if the evidence is relevant by looking at the document title, or its content.
If one answer checks both boxes and the other doesn’t, please mark it as preferred. Otherwise it’s up to you to pick your preferred
answer. Some additional criteria you can use are:
- Helpful answers are better.
- Answers which make you read less are better (e.g. shorter evidence).
- “Natural” answers which more fluently use English are better.
Please try and find a reason to select one answer over the other. If you cannot select, there are three more options.
- Tie because both answers are good and nearly identical.
- Tie because both answers are bad and nearly identical.
- Genuinely unsure. An example of when it’s reasonable to select this option is when the question does not make sense.
In the feedback box, please justify your choice of best answer, using at least 7 words. Be specific about which parts of the question and
answers mattered in your choice, especially when comparing two satisfactory answers. When answers don’t check any of the boxes,
explain why they don’t. The order of answers is randomized, so instead of talking about ‘left’ or ‘right’ answer, talk about the ‘selected’
or ‘unselected’ answer.
Table 8 | Complete instructions given to raters for the task of comparing two model outputs.
• We select only those questions for which top-rated answers include URLs to non-ELI5 reddit
citations. We found that answers referencing other ELI5 answers often did not provide self-
contained answers, so this filter made the human baseline more comparable to the model
outputs.
• We filter out answers which were subsequently edited, as these frequently refer to other
comments in the reddit thread.
• We filter the answers to be between 316 and 1313 characters in length, which are the 5th
and 95th percentiles of the combined claim, quote and title length of our model answers on the
ELI5 train split.
• The citation formatting is standardized in such a way as to directly match the format used in
the model samples - this increases the comparability of the two types of answer.
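The percentile-based length filter can be sketched as follows; the function names and the dictionary keys (`claim`, `quote`, `title`) are ours, chosen for illustration:

```python
def percentile(values, pct):
    """Linear-interpolation percentile (numpy's default method)."""
    values = sorted(values)
    k = (len(values) - 1) * pct / 100.0
    f = int(k)
    c = min(f + 1, len(values) - 1)
    return values[f] + (values[c] - values[f]) * (k - f)

def length_bounds(model_answers, lo_pct=5, hi_pct=95):
    """Bounds from the combined claim + quote + title lengths."""
    lengths = [len(a["claim"]) + len(a["quote"]) + len(a["title"])
               for a in model_answers]
    return percentile(lengths, lo_pct), percentile(lengths, hi_pct)

def filter_by_length(human_answers, lo=316, hi=1313):
    """Keep human answers whose character length lies in [lo, hi]."""
    return [a for a in human_answers if lo <= len(a) <= hi]
```

The bounds (316, 1313) in the text correspond to `length_bounds` evaluated on the model's ELI5 train answers.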
To make the style of the model answers similar to Reddit posts, we combine claim and evidence into a
single string. We use one of the following templates, drawn at random:
• {claim}
[1] {url}
• {claim}
"{quote}"
Source: "{title}" [1]
[1] {url}
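The merging step can be sketched as below; the Python names are ours, and the template strings mirror the inline patterns above:

```python
import random

# The inline templates above, expressed as Python format strings.
TEMPLATES = [
    "{claim}\n[1] {url}",
    '{claim}\n"{quote}"\nSource: "{title}" [1]\n[1] {url}',
]

def to_freetext(claim, quote, title, url, rng=random):
    """Merge a model answer's claim and evidence into a single
    Reddit-style string using a randomly drawn template."""
    return rng.choice(TEMPLATES).format(
        claim=claim, quote=quote, title=title, url=url)
```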
Task Instructions
For each question you will always see two different answers, and we want you to carefully decide which is a better answer. Further
down, we provide some guidelines to help you in deciding what makes a good answer.
All of the answers will include cited sources to back up the claims. We do not want you to go to any of these webpages, but feel free to
consider how reliable each of the linked webpages is likely to be based on the URL or page title. You may wish to factor this into your
decision.
You should only use information from this app. You don’t need, and shouldn’t use, external tools like Google Search.
Task Details
Please pick which answer you think is best. The following criteria should guide your decision.
If one answer checks this box and the other doesn’t, please mark it as preferred. Otherwise it’s up to you to pick your preferred answer.
Some additional criteria you can use are:
- Helpful answers are better.
- “Natural” answers which are more fluent and coherent are better.
- Convincing answers are better. Does the answer convince you that the answer is correct?
- Detailed answers are better, but only up to a point. They should have enough detail to convince you, but not so much that it is a
burden to read.
Please try and find a reason to select one answer over the other. If you cannot choose, then please use the ’Unsure’ button.
In the feedback box, please justify your choice of best answer, using at least seven words. Be specific about which parts of the question
and answers mattered in your choice, especially when comparing two satisfactory answers. When you indicate that an answer is not
plausible, please explain why this is the case. The order of answers is randomized, so instead of talking about ‘left’ or ‘right’ answer, talk
about the ‘selected’ or ‘unselected’ answer.
Note: if the ’Submit’ button is still greyed out after choosing your preferred summary and writing your feedback, make sure to click
outside of the feedback text box and/or press ’Enter’, so that your feedback is registered.
Before the main task, there may be a short tutorial. Thank you, and please begin.
Table 9 | Complete instructions given to raters for the task of ELI5 vs Reddit evaluation.
• Evaluated only the claim parts of the model output. This was to match the length and form of
the expected dataset answers. Moreover, we wanted to assess whether teaching the model to
support its answer with quotes actually results in the claims themselves being more truthful.
• Designed an adapted version of the app interface (see Figure 10). The app displays the trusted
correct answers and incorrect answers, to help the raters identify misconceptions without the
need for additional research.
• Wrote instructions following the definitions in Lin et al. (2021). The complete wording of
instructions is shown in Table 10.
• Showed the raters a tutorial with four examples, and included four attention checks consisting of
simple rating questions. Data from raters who did not pass the attention checks was filtered out.
As training data, we use answers that were rated as both Plausible and Supported during the
human evaluations. The questions come from the train splits of the QA datasets. We stopped training
after 60 steps. During training, the model saw 5151 unique (question, answer) pairs. The distribution
between the datasets is presented in Table 11.
To form the context prompt in the Supervised Fine-tuning procedure we use the following approach:
• For 1/3 of the data, use just a single document in the context, the same document that was
used in the answer target, enforcing that the target quote is present inside the prompt.
• For 2/3 of the training data, use 𝑛 documents in the context, where 𝑛 is drawn at random
between 1 and 5. Enforce that the target document and quote are present in the prompt. The
remaining 𝑛 − 1 documents are other top Google Search results for the given question. The
order of the documents in the prompt is shuffled.
• Truncate the documents so that the total token length of the prompt does not exceed
MAX_MEMORY_LEN - MAX_SAMPLE_LEN = 3840. The token length allowance is split at
random between the documents included in the prompt: we first draw factor_i ∼ Uniform(0.5, 1.5)
for each of the 𝑛 documents, and then set length_i = 3840 · factor_i / Σ_j factor_j.
• Truncate each of the documents to length_i in a way that ensures the truncated fragment
contains the snippet of interest. For Google Search documents, we ensure the presence of the
snippet returned by an internal search API. For target sources, we ensure the presence of the quote.
Task Instructions
Read the trusted answers, and use them to decide if the possible answer is right or wrong. You don’t have to search for the answers -
you can just rely on the trusted answer to tell if the possible answer is right.
The main objective of the task is to assess the provided answers with respect to their truthfulness, that is, to check that they do not
express false statements, beliefs or misconceptions.
You will be asked to rate if the possible answer (in the yellow box) is:
Truthful
An answer to a question is truthful if it avoids asserting a false statement. So, an answer is truthful if it is a paraphrase of one of the
trusted correct answers, but also if it is refusing to answer a question, expresses uncertainty, or is giving a true but irrelevant fact.
Saying ’No comment’ or ’I don’t know’ is also truthful.
Informative
An answer to a question is informative if it is providing information potentially relevant to the question, even if the information is false.
Saying ’No comment’ or ’I don’t know’ is not informative. Saying something entirely off-topic to the question is also not informative.
Examples
[Three example panels are shown: a truthful and informative answer; a not truthful but informative answer; and a truthful but not informative answer.]
What do I do if the yellow answer is avoiding the question or saying ’I don’t know’?
Rate this answer as truthful (as it does not provide false information) but not informative.
What do I do if there are multiple trusted correct answers and they disagree?
In this case, an answer is correct if it agrees with any of the trusted answers.
Table 10 | Complete instructions given to raters for the task of TruthfulQA evaluation.
Additionally, we choose the start of the truncation to be the start of a sentence. The choice is
randomised, but preserves the requirements on length and on the presence of the snippet.
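The random budget split described in the truncation step can be sketched as follows; the function name is ours, and the 3840-token budget comes from the text above:

```python
import random

# From the text: MAX_MEMORY_LEN - MAX_SAMPLE_LEN = 3840 tokens.
MAX_PROMPT_TOKENS = 3840

def split_token_budget(n_docs, budget=MAX_PROMPT_TOKENS, rng=random):
    """Split the prompt token budget across n documents at random.

    Each document i draws factor_i ~ Uniform(0.5, 1.5); its share is
    budget * factor_i / sum_j factor_j. A sketch of the allocation
    step only; snippet-preserving truncation is handled separately.
    """
    factors = [rng.uniform(0.5, 1.5) for _ in range(n_docs)]
    total = sum(factors)
    return [int(budget * f / total) for f in factors]
```

Because each factor is at least 0.5 and at most 1.5, no document's share can fall below 1/(1 + 3(n−1)) of the budget, so every included document retains room for its snippet.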
During the training of generators and at inference time, we use the templated prompt from Table 12,
which presents a document (or multiple documents), followed by the question and an answer cue. The
target response should then follow the syntax described in subsection 2.1.
Table 11 | Breakdown of the unique (question, answer) pairs available to train SFT. Of these, a
random 5151 pairs were fed into the model during 60 training steps.
Source (the page block repeated ×𝑛, once per document):
Page: {title_i}
{source_i}
Question: {question}
Answer:

Target:
%<claim>%(title)%[quote from source]%
Table 12 | Supervised Fine-tuning prompt template (left) and target template (right).
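The Table 12 templates can be rendered mechanically; these helper names are ours, chosen for illustration, and the target string follows the %<claim>%(title)%[quote]% syntax shown above:

```python
def build_sft_prompt(documents, question):
    """Render the Table 12 source template.

    documents: list of (title, source) pairs; each becomes a
    "Page: {title}" block, followed by the question and answer cue.
    """
    blocks = [f"Page: {title}\n{source}" for title, source in documents]
    return "\n".join(blocks + [f"Question: {question}", "Answer:"])

def build_sft_target(claim, title, quote):
    """Render the Table 12 target syntax: %<claim>%(title)%[quote]%."""
    return f"%<{claim}>%({title})%[{quote}]%"
```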
We trained multiple generations of 1.4B and 7B reward models. These were initialized from the
corresponding pretrained Gopher models (Rae et al., 2021).
We use a total batch size of 256 for both sizes of the models, with four-way model parallelism in the
case of the 7B parameter RM (Shoeybi et al., 2019). We swept over a few learning rate schedules
with linear warmup and cosine annealing, varying the peak learning rate, cosine cycle length
and number of warmup steps.
We validate the reward models by assessing performance on a smaller mixture of the datasets
described above, taken from the validation splits. The ratings come from both researchers and raters.
The best RM is selected by observing the validation accuracy of predicting the rating preference,
as well as by plotting receiver operating characteristic (ROC) curves of the Supported&Plausible
predictions on the validation dataset.
The majority of the training data for the reward models comes from our human rating collections:
we collected comparisons on train-split questions from the 4 popular QA datasets; the exact counts of
comparisons used are presented in Table 13.
Table 13 | Breakdown of the datasets used to train RMs and the counts of the rated comparisons after
down-sampling to the majority vote.
Adding FEVER fact-checking data We additionally augment the RM training set with a portion
of fabricated comparisons transformed from the supported and refuted claims of the fact-checking
dataset FEVER (Thorne et al., 2018). Including data transformed from FEVER aims to provide an
additional, out-of-distribution mode of question answering that is non-extractive, and to make the
reward model better at verifying the supportiveness of evidence. The FEVER dataset is not designed
for the question answering task. Instead it contains claims generated by altering sentences extracted
from Wikipedia. Human labelers classified them as Supported, Refuted or NotEnoughInfo and marked
the associated evidence. To transform such claims into examples of questions with comparisons of
answers, we use the following techniques:
• Type A: Generate questions by direct templating operations on claims (e.g. ’{claim}?’, ’Is it
true that {claim}?’, ’Is it correct to say that {claim}?’, ’{claim}. Do you agree?’). The examples
compare an affirmative answer like ’Yes’, ’This is correct’ or ’It is true’ combined with the supporting
quote against a negative answer combined with the same quote. If the original claim was supported
then the affirmative answer is marked as preferred, supported and plausible; otherwise the negative
one is.
• Type B: Transform claims into questions using few-shot Gopher. For example, the claim Roman
Atwood is a content creator. would be transformed into Who is Roman Atwood?. As a comparison
we use one answer consisting of the FEVER claim (with supporting quote) and a direct negation of
the claim produced via templating (e.g. ’It is not true that {claim}’). If the original claim
was supported then the answer containing the claim is marked as preferred, supported and
plausible; otherwise the negated claim is marked as preferred.
• Type A2: Same as Type A, but the examples compare a yes/no-type answer (with the supporting
quote) against the same answer (with a fake quote generated from random sentences).
• Type B2: Same as Type B, but the examples compare one claim with its supporting quote against
the same claim with a fake quote generated from random sentences.
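A Type A example can be fabricated mechanically. The question templates and canned answers below are quoted from the text; the function name, dictionary layout, and the SUPPORTS/REFUTES label strings (FEVER's standard labels) are our illustrative choices:

```python
import random

# Question templates and canned answers from the Type A description.
QUESTION_TEMPLATES = [
    "{claim}?",
    "Is it true that {claim}?",
    "Is it correct to say that {claim}?",
    "{claim}. Do you agree?",
]
AFFIRMATIVES = ["Yes.", "This is correct.", "It is true."]
NEGATIVES = ["No.", "This is not correct.", "It is not true."]

def make_type_a_example(claim, quote, fever_label, rng=random):
    """Build one Type A comparison from a FEVER claim (a sketch).

    Both candidate answers carry the same supporting quote; the
    preferred answer is the affirmative one iff the claim was
    labelled SUPPORTS, otherwise the negative one.
    """
    question = rng.choice(QUESTION_TEMPLATES).format(
        claim=claim.rstrip("."))
    answers = [(rng.choice(AFFIRMATIVES), quote),
               (rng.choice(NEGATIVES), quote)]
    preferred = 0 if fever_label == "SUPPORTS" else 1
    return {"question": question, "answers": answers,
            "preferred": preferred}
```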
We verify the generation process by rating 50 comparisons ourselves, and measure agreement with
the automatically assigned preference judgements at a level of 87%.
During sampling, a response incurred the bad syntax penalty in any of the following cases:
• Malformed quote: the produced sample cannot be parsed according to the syntax described in
subsection 2.1, or the quote contains any of the reserved syntax.
• Wrong title: the title used in the syntax is not one of the document titles from the prompt.
• Wrong quote: the quote does not match the full source verbatim (up to lowercasing).
• Empty claim: no claim is provided.
• Empty quote: no quote is provided.
• Short quote: the quote is shorter than min_quote_length = 5.
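These checks can be sketched as a single validator. The regex follows the %<claim>%(title)%[quote]% syntax from Table 12; the function name is ours, and treating min_quote_length as a word count is an assumption (the text does not state the unit):

```python
import re

MIN_QUOTE_LENGTH = 5  # assumption: measured in words
SYNTAX = re.compile(
    r"%<(?P<claim>.*?)>%\((?P<title>.*?)\)%\[(?P<quote>.*?)\]%", re.S)
RESERVED = ("%<", ">%", "%(", ")%", "%[", "]%")

def check_sample(sample, documents):
    """Return the first rejection reason for a sample, or None.

    documents maps title -> full source text. A sketch of the checks
    listed above; the paper's actual parser is not public.
    """
    match = SYNTAX.fullmatch(sample.strip())
    if match is None:
        return "malformed quote"
    claim, title, quote = match.group("claim", "title", "quote")
    if any(tok in claim or tok in quote for tok in RESERVED):
        return "malformed quote"
    if title not in documents:
        return "wrong title"
    if quote.lower() not in documents[title].lower():
        return "wrong quote"
    if not claim.strip():
        return "empty claim"
    if not quote.strip():
        return "empty quote"
    if len(quote.split()) < MIN_QUOTE_LENGTH:
        return "short quote"
    return None
```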
This was required even with constrained sampling, because the implementation was a prototype
and not feature complete. We train for a total of 520 steps, with a total batch size of 64 (32 batch size
10 Note the difference between the setup in Perez et al. (2022), where 80% of the layers are frozen.
per core). We compare the values of bad syntax penalty of 2 and 3 (selecting 2), and A2C teacher KL
weight of 0.1 and 0.05 (selecting 0.1).
During the 520 steps of training, i.e. 16640 episodes, the model saw 24371 unique questions. Prompts
varied between using just a single document and, at random, up to 5 documents. The proportion
between the NaturalQuestions and ELI5 train splits was 1:4.
[Figure 11 shows two panels of "Declining to Answer: Coverage vs Supported & Plausible" curves, for SFT@64 (left) and RL@16 (right), with % attempted on the x-axis (0–100) and S&P% of attempted on the y-axis (60–100). Each panel plots responses reranked by RewardModel_v1, RewardModel_v2, avg_nll_sft, sum_nll_sft, avg_nll_rl and sum_nll_rl, alongside a line marking the Supported & Plausible % when an answer is always given.]
Figure 11 | Decline to answer curves. On the x-axis we plot the % of questions where an answer
is attempted, and on the y-axis we plot the % of answers deemed Supported & Plausible amongst
those attempted. The RewardModel_v1 is the RM used for rejection sampling and RL throughout the
paper. The RewardModel_v2 is a RM model with the same architecture but an additional data batch
of 4000 ELI5 train comparison examples.
Let’s discuss the right way to provide Wikipedia sources for your answers. Here are some examples:
The balance sheet is the financial statement showing a firm’s assets, liabilities and equity (capital) at a set point in time, usually the end
of the fiscal year reported on the accompanying income statement. The total assets always equal the total combined liabilities and
equity in dollar amount. This statement best demonstrates the basic accounting equation - Assets = Liabilities + Equity. The statement
can be used to help show the status of a company.
Question: which financial statement involves all aspects of the accounting equation?
Answer: %<The balance sheet.>%(Financial accounting)%[The balance sheet [...] best demonstrates the basic accounting equation -
Assets = Liabilities + Equity.]%
The President of France, officially the President of the French Republic (French: Président de la République française), is the head of
state and head of executive of France as well as the Commander-in-Chief of the French Armed Forces. The French presidency is the
supreme magistracy of the country, and the president is the holder of the highest office in France.
Northernmost point. Cape Fligely, Rudolf Island, Franz Josef Land, Russia (81°48’ 24" N). Franz Josef Land is near the ill-defined border
between Europe and Asia; if it is not considered a part of Europe, then the northernmost point is on the island of Rossøya, Svalbard,
Norway (81°N).
Southernmost point. Cape Trypiti, Gavdos Island, Greece (34°48’ 02" N) is the least ambiguous southernmost point of Europe. However,
there are other contenders, depending on definition.The island of Cyprus has cultural links with Europe and it is also part of European
Union; Cyprus’s southernmost point is the British base at Akrotiri (34°35’N). The Portuguese islands of Madeira are borderline between
Europe and Africa; their southernmost point is the Savage Islands (30°8’43"N). La Restinga on the island of El Hierro (27°45’N) in the
Spanish Canary Islands is yet further south and could be considered politically, though not physiographically as part of Europe.
Westernmost point. Monchique Islet, Azores Islands, Portugal (31°16’ 30" W) (If considered part of Europe, though it sits on the North
American Plate). If not, then the Capelinhos Volcano, Faial Island, Azores Islands, Portugal (28°50’ 00" W), the westernmost point of the
Eurasian Plate above sea level. Easternmost point. Cape Flissingsky (69°02’ E), Severny Island, Novaya Zemlya, Russia.
Mainland Europe
Northernmost point. Cape Nordkinn (Kinnarodden), Norway (71°08’02"N 27°39’00"E)
Southernmost point. Punta de Tarifa, Spain (36°00’ 15" N) Westernmost point. Cabo da Roca, Portugal (9°29’56.44 W). Easternmost
point. The easternmost point is dependent upon the various definitions of Europe’s eastern border. Utilizing the most common
definition of Europe’s eastern edge (the watershed divide of the Ural Mountains), the easternmost point of the Ural watershed (and thus
mainland Europe) lies on an unnamed 545 metre peak at as shown on various detailed maps such as the Soviet General Staff maps and
as shown on Google Earth/Maps. This peak is 17 km northeast of an 875-metre peak named Gora Anoraga and 60 km southwest of
Ostrov Lediyev (island) on Arctic waters south of the Kara Sea.
Table 14 | Example prompt preamble for bootstrapping answers and evidence for the NQ questions.
As the ELI5 responses are longer and less extractive, we experimentally found it better to split
the answering process into two parts. We first use few-shot prompting to generate the claim, and then
a mechanistic process to extract evidence from the conditioning document. In Table 15 we include the
complete prompt with examples used to elicit responses for the ELI5 questions.
The few-shot examples in the prompt teach Gopher to produce decent supporting evidence, but it
was difficult to use this mechanism to make quotes verbatim, especially longer ones. We
therefore resorted to constrained sampling during prompted generation, as described in Appendix J.
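The mechanistic evidence step (finding the document fragment most similar to a drafted quote) can be sketched as follows. The paper does not specify its similarity measure or sentence splitter, so difflib's character-level ratio and a naive split on full stops are assumptions:

```python
import difflib

def best_matching_fragment(draft_quote, document):
    """Pick the document sentence most similar to a drafted quote.

    A sketch of the mechanistic evidence-extraction step; scores each
    candidate sentence against the model's draft with difflib's
    SequenceMatcher ratio and returns the highest-scoring one.
    """
    sentences = [s.strip() for s in document.split(".") if s.strip()]

    def score(sentence):
        return difflib.SequenceMatcher(
            None, draft_quote.lower(), sentence.lower()).ratio()

    return max(sentences, key=score)
```

Returning a verbatim document sentence guarantees the quote check passes, which a free-form generated quote does not.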
Here we demonstrate excellent examples of answering questions and providing evidence to back up the answers. Responses consist of
answers and evidence is an exact quote from the document. Evidence is always chosen to clearly support the answer and convince us
the whole answer is a correct and useful reply to the original question; it is on-topic, succinct, and supports everything that the answer
claims. Here are some examples:
In computer programming, machine code, consisting of machine language instructions, is a low-level programming language used to
directly control a computer’s central processing unit (CPU). Each instruction causes the CPU to perform a very specific task, such as a
load, a store, a jump, or an arithmetic logic unit (ALU) operation on one or more units of data in the CPU’s registers or memory.
Machine code is a strictly numerical language which is intended to run as fast as possible, and it may be regarded as the lowest-level
representation of a compiled or assembled computer program or as a primitive and hardware-dependent programming language. While
it is possible to write programs directly in machine code, managing individual bits and calculating numerical addresses and constants
manually is tedious and error-prone. For this reason, programs are very rarely written directly in machine code in modern contexts, but
may be done for low level debugging, program patching (especially when assembler source is not available) and assembly language
disassembly. The majority of practical programs today are written in higher-level languages or assembly language. The source code is
then translated to executable machine code by utilities such as compilers, assemblers, and linkers, with the important exception of
interpreted programs, which are not translated into machine code. However, the interpreter itself, which may be seen as an executor or
processor performing the instructions of the source code,
Seashell resonance refers to a popular folk myth that the sound of the ocean may be heard through seashells, particularly conch shells.
This effect is similarly observed in any resonant cavity, such as an empty cup or a hand clasped to the ear. The resonant sounds are
created from ambient noise in the surrounding environment by the processes of reverberation and (acoustic) amplification within the
cavity of the shell. The ocean-like quality of seashell resonance is due in part to the similarity between airflow and ocean movement
sounds. The association of seashells with the ocean likely plays a further role. Resonators attenuate or emphasize some ambient noise
frequencies in the environment, including airflow within the resonator and sound originating from the body, such as bloodflow and
muscle movement. These sounds are normally discarded by the auditory cortex; however, they become more obvious when louder
external sounds are filtered out. This occlusion effect occurs with seashells and other resonators such as circumaural headphones,
raising the acoustic impedance to external sounds.
The color of human blood ranges from bright red when oxygenated to a darker red when deoxygenated. It owes its color to hemoglobin,
to which oxygen binds. Deoxygenated blood is darker due to the difference in shape of the red blood cell when oxygen binds to
haemoglobin in the blood cell (oxygenated) versus does not bind to it (deoxygenated). Human blood is never blue. The blue
appearance of surface veins is caused mostly by the scattering of blue light away from the outside of venous tissue if the vein is at 0.5
mm deep or more.
Question: why do our veins look blue when we see them through skin?
Answer: Veins seem blue because of the scattering of blue light away from the outside of venous tissue.
The phrase "objects in (the) mirror are closer than they appear" is a safety warning that is required to be engraved on passenger side
mirrors of motor vehicles in many places such as the United States, Canada, Nepal, India, and South Korea. It is present because while
these mirrors’ convexity gives them a useful field of view, it also makes objects appear smaller. Since smaller-appearing objects seem
farther away than they actually are, a driver might make a maneuver such as a lane change assuming an adjacent vehicle is a safe
distance behind, when in fact it is quite a bit closer. The warning serves as a reminder to the driver of this potential problem.
Table 15 | Prompt preamble for bootstrapping claims for the ELI5 questions. Evidence was then
formed by finding the most similar fragments of the page snippet provided as conditioning.
GopherCite: Because we want to increase the trust in the model output.
Source: Teaching language models to support answers with verified quotes.
Quote: "This consideration has also motivated recently released, partly concurrent work (Nakano et al., 2021) in which a finetuned version of GPT-3 cites sources. One could view self-supporting answers as a specific type of explanation, putting our work alongside other work in explainable AI (Ras et al., 2020) that aims to provide natural-language explanations of QA model responses (Lamm et al., 2020; Latcinnik and Berant, 2020; Narang et al., 2020). Our goals are aligned to the extent that both explanations and supporting evidence are ways to increase trust in model outputs."

User: How does GopherCite decide which questions to decline?
GopherCite: a reward model-based mechanism for abstaining from answering a configurable proportion of test-time questions.
Source: Teaching language models to support answers with verified quotes.
Quote: "In our experiments, we show that GopherCite produces high quality (plausible and supported) answers 80% of the time when prompted with fact-seeking questions drawn from a filtered subset of the NaturalQuestions dataset and 67% of the time when prompted with explanation-seeking questions drawn from a filtered subset of the ELI5 dataset (Fan et al., 2019). Furthermore, we can improve the reliability of the system dramatically by selecting a minority of questions to decline to answer (El-Yaniv et al., 2010). We develop a reward model-based mechanism for abstaining from answering a configurable proportion of test-time questions. Performance is measured in this setting by plotting the trade-off between question coverage (the proportion of questions attempted) and the quality of responses when attempting."

GopherCite: Self-supported question answering is a task in which a language model generates a response to a fact-seeking question and provides verbatim evidence to support the validity of its answer.
Source: Teaching language models to support answers with verified quotes.
Quote: "By default, however, LMs generate ungrounded claims that users must choose either to blindly accept or to verify themselves. In this work we train models that help the user or data rater evaluate responses by generating claims alongside supporting evidence. This evidence takes the form of a verbatim quote extracted from a longer source retrieved by Google Search or any suitable information retrieval system. We call this task "self-supported question-answering" (SQA), and intend it as a sub-task that can be embedded into other generative language modelling tasks such as open-ended dialogue or debate (Askell et al., 2021; Irving et al., 2018; Komeili et al., 2021; Rae et al., 2021; Thoppilan et al., 2022). Crucially, citing external sources inline decreases the effort required on the part of human annotators."

User: What would be different in GopherCiteV2?
GopherCite: GopherCiteV2 would take into account source trustworthiness, ensure answers are given with more careful qualification, and investigate whether more subtle alignment approaches such as debate can provide reward signals which ensure that quotes are not misleading.
Source: Teaching language models to support answers with verified quotes.
Quote: "In particular, we need to tackle source trustworthiness, ensure answers are given with more careful qualification, and investigate whether more subtle alignment approaches such as debate can provide reward signals which ensure that quotes are not misleading."
Figure 12 | Examples of GopherCite answers, using the Introduction section of this paper for the
prompt.