Training Language Models With Language Feedback
Jérémy Scheurer1  Jon Ander Campos1,2  Jun Shern Chan1  Angelica Chen1  Kyunghyun Cho1,3,4  Ethan Perez1
1New York University, 2University of the Basque Country, 3Genentech, 4CIFAR LMB
{jeremy.scheurer,perez}@nyu.edu
Abstract

Pretrained language models often do not perform tasks in line with human preferences, e.g., generating offensive text or factually incorrect summaries. Recent work approaches this issue by learning from a simple form of human evaluation: comparisons between pairs of model-generated task outputs. Comparison feedback conveys limited information about human preferences per human evaluation. Here, we propose to learn from natural language feedback, which conveys more information per human evaluation. We learn from language feedback on model outputs using a three-step learning algorithm. First, we condition the language model on the initial output and feedback to generate many refinements. Second, we choose the refinement with the highest similarity to the feedback. Third, we finetune a language model to maximize the likelihood of the chosen refinement given the input. In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements, finding that only large language models (175B parameters) do so. Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization ability.

[Figure 1: An overview of our algorithm for learning from natural language feedback. Step 1: condition the language model on the initial output and feedback to generate many refinements. Step 2: choose the refinement with the highest similarity with the feedback. Step 3: finetune a language model on the improved outputs.]
1 Introduction

Language Models (LMs) achieve strong performance across diverse NLP tasks, from summarization to question answering and conversational assistants (Radford and Narasimhan, 2018; Radford et al., 2019; Brown et al., 2020; Rae et al., 2021, inter alia). A key problem with LMs is that they generate text that violates human preferences, such as LM-generated misinformation (Lin et al., 2021), offensive language (Gehman et al., 2020), and factually incorrect outputs such as summaries (Stiennon et al., 2020). Current methods alleviate such issues by training LMs to generate text that scores highly according to human preferences, or a predictive model thereof (Ziegler et al., 2019; Stiennon et al., 2020; Nakano et al., 2021; Ouyang et al., 2022). In this line of work, human evaluators indicate their preferences by comparing text outputs. However, each comparison provides little information per evaluation about human preferences.

We propose to use natural language feedback, which contains more information per evaluation. We introduce a three-step learning algorithm, as shown in Figure 1. First, we condition an LM on an input, model-generated output, and human-written feedback to sample many possible refinements of the output. Second, we choose the refinement with the highest embedding-based similarity with the feedback. Third, we finetune an LM on the chosen refinements. Our algorithm departs from prior work, which uses reinforcement learning methods (Ziegler et al., 2019, inter alia) or auxiliary losses (Stacey et al., 2021) that cannot be straightforwardly generalized to using natural language feedback.
We validate our algorithm on a carefully-controlled synthetic task of removing offensive words from a sentence with GPT-3-based models (Brown et al., 2020; Ouyang et al., 2022). We find that only the largest GPT-3-based models (175B parameters) accurately refine outputs. Using the above insight, we use the largest GPT-3 models to test our algorithm on text summarization, following Stiennon et al. (2020). A model trained with our algorithm generates summaries that human evaluators prefer to human reference summaries ~51% of the time. We obtain these results when learning from only 100 samples of natural language feedback. Our analysis shows that LM-generated refinements typically incorporate the feedback, especially when choosing the refinement with the highest similarity with the feedback. Our results suggest that natural language feedback is a promising avenue for learning from human preferences.

2 Method

Here, we define our problem formulation more formally. Given an input x, we seek to generate an output y that is high quality according to human preference judgments. We assume access to natural language feedback f on an initial model-generated output y given the input x.
To tackle the above problem, we leverage the ability of pretrained LMs to follow instructions (Radford et al., 2019; Sanh et al., 2021; Wei et al., 2022; Ouyang et al., 2022). We assume access to an LM that takes an input (e.g., a task instruction) and produces a distribution over text completions (e.g., a task output). We instruct the LM to refine the initial output y given the input x and feedback f. We then sample N refinements y'_1, ..., y'_N from the LM. Refinements may vary in quality, so we introduce a function S that scores refinements for how effectively they incorporate feedback. We choose the refinement with the highest score from S and finetune a model on all chosen y' given x. We use the resulting model to generate outputs at test time.
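The three steps can be summarized in code. Below is a minimal sketch, not the authors' released implementation: sample_completions stands in for any instruction-following LM API, score is the scoring function S (instantiated with embedding similarity in Section 3.2), and the prompt wording and all names are illustrative.

from typing import Callable, List, Tuple

def generate_refinements(
    x: str,                 # input, e.g., the text to summarize
    y: str,                 # initial model-generated output
    f: str,                 # natural language feedback on y
    sample_completions: Callable[[str, int], List[str]],
    n: int = 20,
) -> List[str]:
    """Step 1: condition the LM on (x, y, f) and sample N candidate refinements."""
    prompt = (
        "Given an input, an initial output, and feedback on the initial output, "
        "write an improved output that incorporates the feedback.\n"
        f"Input: {x}\nOutput: {y}\nFeedback: {f}\nImproved output:"
    )
    return sample_completions(prompt, n)

def choose_refinement(refinements: List[str], f: str,
                      score: Callable[[str, str], float]) -> str:
    """Step 2: keep the refinement that best incorporates the feedback under S."""
    return max(refinements, key=lambda r: score(r, f))

def finetuning_pairs(examples: List[dict]) -> List[Tuple[str, str]]:
    """Step 3: finetune an LM to maximize the likelihood of the chosen
    refinement y' given the input x; here we only assemble the (x, y') pairs."""
    return [(ex["x"], ex["chosen_refinement"]) for ex in examples]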
3 Experiments

3.1 Can Language Models Use Feedback?

For our algorithm to work, LMs must be able to accurately incorporate feedback to generate refinements. Thus, we first validate our algorithm on a carefully-controlled synthetic task of removing specific offensive words from a given sentence. We examine how effective various models are at incorporating feedback, to determine what model to use for refining outputs.

Experimental Setup. We instruct an LM to refine an automatically-generated sentence with <= 10 offensive words by removing <= 3 specific words (see Appendix B for a detailed explanation and examples). We evaluate how often the generated refinement exactly matches the target sentence, which we also automatically generate. For our LMs, we use differently-sized GPT-3 models (Brown et al., 2020) and their finetuned InstructGPT counterparts (Ouyang et al., 2022), accessed via the OpenAI API (OpenAI does not disclose the size of the provided models, so we use estimates from Eleuther). We report all hyperparameters used in Appendix E. We report the mean and standard error for all results in our work.
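To make the setup concrete, here is a sketch of how such synthetic examples and exact-match targets could be constructed. The word lists, sentence construction, and sizes are illustrative stand-ins for the actual setup described in Appendix B.

import random

# Illustrative word lists; the paper's actual setup is described in its Appendix B.
NEUTRAL_WORDS = ["apple", "river", "letter", "cloud", "stone", "garden"]
OFFENSIVE_WORDS = ["jerk", "idiot", "loser", "dummy"]

def make_example(max_offensive: int = 10, max_remove: int = 3):
    """Build (sentence, words_to_remove, target). The target, which the model's
    refinement is compared against, is generated automatically by deleting the
    specified offensive words from the sentence."""
    words = random.choices(NEUTRAL_WORDS, k=6)
    words += random.choices(OFFENSIVE_WORDS, k=random.randint(1, max_offensive))
    random.shuffle(words)
    present = sorted(set(w for w in words if w in OFFENSIVE_WORDS))
    to_remove = random.sample(present, k=min(max_remove, len(present)))
    sentence = " ".join(words)
    target = " ".join(w for w in words if w not in to_remove)
    return sentence, to_remove, target

def exact_match(refinement: str, target: str) -> bool:
    """Evaluation: does the generated refinement exactly match the target?"""
    return refinement.strip() == target.strip()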
Results. Table 1 shows the results. We observe that only the largest GPT-3 and InstructGPT models (175B parameters) incorporate feedback a non-negligible amount of the time. Using this insight, we only use the 175B-parameter (Davinci) models in the rest of our experiments.

Table 1: Accuracy in % with the standard error on the task of removing offensive words from a sentence. Only large LMs incorporate feedback.

Model        | Ada (~350M) | Babbage (~1.3B) | Curie (~6.7B) | Davinci (175B)
GPT-3        | 1.0 ± 0.3   | 1.1 ± 0.3       | 8.7 ± 0.8     | 38.5 ± 1.3
InstructGPT  | 1.6 ± 0.3   | 2.5 ± 0.4       | 5.4 ± 0.6     | 35.6 ± 1.3
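The entries in Table 1 are means with standard errors over binary exact-match outcomes, which can be computed as in the sketch below (assuming, as seems natural here, the binomial standard error of the mean):

import math

def accuracy_with_stderr(matches: list[bool]) -> tuple[float, float]:
    """Exact-match accuracy in % with its standard error, assuming the
    binomial standard error sqrt(p * (1 - p) / n) for a binary outcome."""
    n = len(matches)
    p = sum(matches) / n
    return 100.0 * p, 100.0 * math.sqrt(p * (1.0 - p) / n)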
3.2 Text Summarization

3.2.1 Experimental Setup

Generating Refinements. We now evaluate our algorithm on the real-world task of text summarization. We follow prior work on learning from human preferences (Stiennon et al., 2020) and learn to summarize Reddit posts from Völske et al. (2017). We take 100 samples from the Reddit data subset used in Stiennon et al. (2020). We use InstructGPT (175B) to generate initial summaries and refinements. We embed the feedback and each of the N = 20 sampled refinements (accessing the embeddings via OpenAI's API) and then choose the refinement with the highest cosine similarity score with the feedback. We opted for high similarity with the feedback because feedback often describes what the ideal or improved text would look like. We refer to refinements generated with the above algorithm as REFINEMENT WITH FEEDBACK + BEST OF N.
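A sketch of this BEST OF N selection step, assuming a generic embed function (text to vector) backed by an embeddings API; this illustrates the selection rule rather than reproducing the authors' exact code.

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_of_n(refinements: list[str], feedback: str, embed) -> str:
    """Embed the feedback and each refinement, then return the refinement
    whose embedding has the highest cosine similarity with the feedback's."""
    f_vec = embed(feedback)
    return max(refinements, key=lambda r: cosine_similarity(embed(r), f_vec))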
Finetuning. We finetune GPT-3 (175B; Brown et al., 2020) on refinements generated by REFINEMENT WITH FEEDBACK + BEST OF N (InstructGPT cannot yet be finetuned via OpenAI's API). We compare against finetuning on INITIAL SUMMARIES generated with InstructGPT. We also compare against summaries generated directly by InstructGPT and GPT-3 (175B). We use the same instructions as for INITIAL SUMMARIES (in Appendix F) and provide the post and its title.
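As a sketch of the data preparation for this step: at the time, OpenAI's finetuning API consumed JSONL records with "prompt" and "completion" fields, so the chosen refinements can be written out as below. The prompt layout mirrors the INITIAL SUMMARY format from Appendix F; the field names of the example dicts are our own.

import json

INSTRUCTION = (
    "Write an excellent summary of a given text. An excellent summary is "
    "coherent, accurate, concise, and detailed, and it follows human "
    "preferences about summaries."
)

def write_finetuning_file(examples: list[dict], path: str = "train.jsonl") -> None:
    """Write (prompt, chosen refinement) pairs as JSONL, using the same
    instruction as for INITIAL SUMMARIES plus the post and its title."""
    with open(path, "w") as f:
        for ex in examples:
            record = {
                "prompt": f"{INSTRUCTION}\nTITLE: {ex['title']}\nText: {ex['text']}\nTL;DR:",
                "completion": " " + ex["chosen_refinement"],
            }
            f.write(json.dumps(record) + "\n")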
Evaluation. We test on 100 unseen Reddit posts from the same dataset and conduct human evaluations for all experiments (we plan to conduct larger-scale human evaluations in the future, to confirm our initial findings). Evaluators rank the summaries according to the rubric in Appendix C, with ties allowed. We show the win rate of an algorithm, counting ties as a half win, similar to Kendall rank correlation. We refer to Appendix C for a description of all human evaluation and feedback annotation procedures and to Appendix D for more details about the ranking scheme.
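For concreteness, the win rate with ties counted as half wins can be computed as in this small sketch (our reading of the ranking scheme described above):

def win_rate(outcomes: list[str]) -> float:
    """Win rate (in %) of an algorithm over a comparator, where each pairwise
    outcome is 'win', 'tie', or 'loss' and a tie counts as half a win."""
    score = sum({"win": 1.0, "tie": 0.5, "loss": 0.0}[o] for o in outcomes)
    return 100.0 * score / len(outcomes)

# e.g., win_rate(["win", "tie", "loss", "win"]) == 62.5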
3.2.2 Main Results

Fig. 2 reports the win rate of our learning algorithm over HUMAN SUMMARIES, and Appendix Fig. 5 reports the win rate over InstructGPT. Finetuning on REFINEMENT WITH FEEDBACK + BEST OF N generates summaries on par with human summaries, with a win rate of 51.0 ± 5.0% over human summaries. In contrast, all baselines underperform human summaries, with win rates of 19.0 ± 3.9% (GPT-3), 35.0 ± 4.8% (InstructGPT), and 44.0 ± 5.0% (finetuning on INITIAL SUMMARIES). In particular, our approach achieves a win rate of 57.0 ± 5.0% over the strongest baseline, finetuning on INITIAL SUMMARIES. Our result suggests that our learning algorithm produces higher-quality summaries by finetuning on the higher-quality targets (our refinements). Overall, we achieve strong results on summarization while learning from only 100 samples of human-written feedback.

[Figure 2: How often human evaluators prefer summaries from our learning algorithm and baselines to HUMAN SUMMARIES. Our proposed algorithm (leftmost bar) generates summaries of a similar quality to human summaries.]

3.2.3 Analysis

We now aim to examine the importance of various aspects of our algorithm for generating high-quality refinements (before finetuning). We evaluate REFINEMENT WITH FEEDBACK, which randomly chooses a refinement from y'_1, ..., y'_20. This ablation helps to evaluate the importance of choosing a refinement with our scoring function S. We also evaluate REFINEMENT WITHOUT FEEDBACK, which instructs the LM to refine the initial summary but without feedback. This ablation helps to evaluate the importance of using the feedback. Lastly, we evaluate HUMAN SUMMARIES, i.e., summaries written by Reddit users on their own posts, and INITIAL SUMMARIES, i.e., the initial summary y generated by the LM. See Appendix F for concrete examples of the instructions that we use.

[Figure 3: Left: How often human evaluators prefer summaries from each refinement method to the INITIAL SUMMARIES (from InstructGPT). REFINEMENT WITH FEEDBACK improves on the INITIAL SUMMARIES and outperforms human summaries with BEST OF N sampling. Right: Refining with feedback generally does incorporate specific point(s) mentioned in the feedback.]
Fig. 3 (left) shows the win rates of refinements from various methods against INITIAL SUMMARIES. REFINEMENT WITH FEEDBACK + BEST OF N improves over the INITIAL SUMMARIES, with our algorithm being preferred 67.0 ± 3.1% of the time. Our algorithm is preferred to human summaries 54.0 ± 3.5% of the time, while INITIAL SUMMARIES are significantly worse than human summaries, preferred only 39.3 ± 3.4% of the time. Appendix Fig. 4 shows win rates of refinements generated with various methods against HUMAN SUMMARIES, and Appendix Fig. 6 shows that refinements are more helpful when the initial summary is of lower quality. We also refer to Appendix G for 10 random examples of INITIAL SUMMARIES, feedback, and refinements from various methods. Overall, using feedback and scoring refinements are both important steps for generating high-quality refinements of the initial output.

Here, we examine whether refinements are of higher quality because they incorporate the feedback, rather than by improving the summary in other ways. To do so, we evaluate how often the refinements incorporate the human-written feedback. We evaluate (1) how often >= 1 point mentioned in the feedback is incorporated in the refinement, (2) how often > 1 point is incorporated, and (3) how often all of the feedback is incorporated. In Fig. 3 (right), we see that our algorithm incorporates >= 1 feedback point 72.0 ± 4.5% of the time, showing that LMs are able to incorporate feedback with high accuracy. REFINEMENT WITHOUT FEEDBACK incorporates at least one feedback point only 15.0 ± 3.6% of the time. Our results suggest that refinements are high-quality because they incorporate specific points in the feedback.
4 Additional Related Work

Existing work in NLP primarily investigates using explanations for labeled outputs to classification tasks. In contrast, we do not assume access to gold-labeled outputs, and we study the more general text generation setting, which classification tasks can be formulated as (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). Explanations describe why a labeled output is correct, while feedback describes how to improve a candidate output. Prior work explores ways of using explanations to train text classification models, with mixed results (Camburu et al., 2018; Stacey et al., 2021; Pruthi et al., 2021; Wiegreffe et al., 2021; Hase and Bansal, 2021; Lampinen et al., 2022, inter alia). A few prior works also learn from language feedback, for the purpose of ranking candidate outputs rather than generating outputs (Weston, 2016; Li et al., 2016; Hancock et al., 2019; Li et al., 2022). Matiana et al. (2021) learn text embeddings of language feedback, where improvements could benefit the refinement-scoring step of our algorithm.

Outside of text domains, there is abundant work in reinforcement learning that leverages language in various ways (see Luketina et al., 2019, for an overview). Prior work uses language to specify the task ("instruction following"; Chaplot et al., 2017; Mahmoudieh et al., 2022; Ouyang et al., 2022, inter alia), drive exploration (Tam et al., 2022), infer reward functions (Lin et al., 2022; Sumers et al., 2021; Fidler et al., 2017, inter alia), and train the model via strong supervision (Andreas et al., 2017; Kaplan et al., 2017), reward shaping (Goyal et al., 2019), or purely with language by providing descriptions of trajectories (Nguyen et al., 2021). In contrast, we use language to correct faulty behavior. Other work uses language feedback at test time to correct mistakes in a model's behavior, e.g., for image segmentation (Rupprecht et al., 2018) or code generation (Elgohary et al., 2020; Austin et al., 2021). In contrast, we use feedback to train models, and our approach does not require human intervention at test time.
5 Conclusion

In this work, we proposed an algorithm for training LMs to behave in line with human preferences by learning from natural language feedback. We validated our approach on a carefully-controlled word-removal task, showing that only large LMs (175B parameters) accurately incorporate feedback. Using this insight, we then tested our algorithm on the real-world task of text summarization. Our finetuning algorithm brought a GPT-3 model to roughly human-level summarization ability, using only 100 samples of human feedback. Language feedback is a natural form of communicating with models, which may make it easier for many people to provide informative, high-quality feedback. In the long run, our work suggests many exciting avenues for future work, e.g., in guiding models with language feedback in other domains, from code generation to conversational assistance.

6 Acknowledgements

We are grateful to Nat McAleese, Geoffrey Irving, Jeff Wu, Sam Bowman, Daniel Ziegler, Seraphina Nix, and Lennart Heim for helpful conversations and feedback. Jérémy Scheurer and Jun Shern Chan thank Open Philanthropy for funding that enabled this research. Ethan Perez thanks the National Science Foundation and Open Philanthropy for fellowship support. Jon Ander Campos is supported by a doctoral grant from the Spanish MECD. Angelica Chen and Kyunghyun Cho are supported by the NYU Center for Data Science National Science Foundation (Award 1922658). Kyunghyun Cho is also supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI). We also thank OpenAI for providing access and credits to their models via the API Academic Access Program.

References

Jacob Andreas, Dan Klein, and Sergey Levine. 2017. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 166–175. PMLR.

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program synthesis with large language models. CoRR, abs/2108.07732.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31.

Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. 2017. Gated-attention architectures for task-oriented language grounding. arXiv preprint arXiv:1706.07230.

Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. 2020. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2065–2077, Online. Association for Computational Linguistics.

Sanja Fidler et al. 2017. Teaching machines to describe images with natural language feedback. Advances in Neural Information Processing Systems, 30.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.

Prasoon Goyal, Scott Niekum, and Raymond J. Mooney. 2019. Using natural language for reward shaping in reinforcement learning.

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.
E Hyperparameters

E.1 Generating Refinements

For the targeted word removal experiments (§3.1), we use greedy decoding until 200 tokens or a newline ("\n") is generated. For all summarization experiments (§3.2), we sample up to 48 tokens (as in Stiennon et al., 2020) with nucleus sampling (Holtzman et al., 2019) with p = 0.9. We strip non-alphanumeric characters (e.g., newlines) from the beginning of sampled summaries. Due to the maximum token length, sampled summaries sometimes end with an incomplete sentence. Thus, we remove ending sentences that do not end in ".", "!", or "?".
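A sketch of this post-processing as we read the description above; the regex-based sentence splitting is our simplification.

import re

def postprocess_summary(text: str) -> str:
    """Strip non-alphanumeric characters from the beginning of a sampled
    summary, then drop a trailing incomplete sentence that does not end
    in '.', '!', or '?'."""
    text = re.sub(r"^[^A-Za-z0-9]+", "", text.strip())
    sentences = re.split(r"(?<=[.!?])\s+", text)
    if sentences and not sentences[-1].endswith((".", "!", "?")):
        sentences = sentences[:-1]
    return " ".join(sentences)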
E.2 Finetuning

We finetune GPT-3 with 175B parameters on REFINEMENT WITH FEEDBACK + BEST OF N and INITIAL SUMMARIES. We use a batch size of 1 and the default of 4 epochs, as recommended by the OpenAI API. For other hyperparameters, we conduct a hyperparameter search using 5-fold cross-validation on our train dataset of 100 examples.
F Instructions

Method: INITIAL SUMMARY
Format:
Write an excellent summary of a given text. An excellent summary is coherent, accurate, concise, and detailed, and it follows human preferences about summaries.
TITLE: {title}
Text: {text}
TL;DR:

Method: REFINEMENT WITH FEEDBACK
Format:
Given a text, an initial summary, and feedback on the initial summary, write an improved summary that incorporates the feedback on the initial summary and is better than the initial summary. The improved summary is coherent, accurate, concise, and detailed, and it follows human preferences about summaries. Most importantly, the improved summary incorporates the feedback.
TITLE: {title}
Text: {text}
Summary: {summary}
Feedback: {feedback}
TL;DR:

Method: REFINEMENT WITHOUT FEEDBACK
Format:
Given a text and an initial summary, write an improved summary that is better than the initial summary. The improved summary is coherent, accurate, concise, and detailed, and it follows human preferences about summaries.
TITLE: {title}
Text: {text}
Summary: {summary}
TL;DR:
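For illustration, the REFINEMENT WITH FEEDBACK format above can be assembled into a single prompt string as follows (a sketch; exact whitespace details are assumptions):

def refinement_with_feedback_prompt(title: str, text: str,
                                    summary: str, feedback: str) -> str:
    """Fill the REFINEMENT WITH FEEDBACK format with concrete values."""
    instruction = (
        "Given a text, an initial summary, and feedback on the initial summary, "
        "write an improved summary that incorporates the feedback on the initial "
        "summary and is better than the initial summary. The improved summary is "
        "coherent, accurate, concise, and detailed, and it follows human "
        "preferences about summaries. Most importantly, the improved summary "
        "incorporates the feedback."
    )
    return (f"{instruction}\n"
            f"TITLE: {title}\n"
            f"Text: {text}\n"
            f"Summary: {summary}\n"
            f"Feedback: {feedback}\n"
            f"TL;DR:")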
Example 1

Initial Summary
Should the author ask her committed boyfriend if he lied about visiting an old flame?
Feedback
The summary is good but should mention more details. Concretely, it should remove
the word "committed", since that isn’t emphasized in the text. The summary should also
mention that the author believes her boyfriend visited an old flame very early on in their
relationship, i.e. after 4 months. Further the summary should mention that her boyfriend
has already denied that this happened when the author asked him when they were both
drunk. It should also convey that the boyfriend has lied before. The author is asking if she
should mention this issue or not, since it would bring peace but she already had brougth
it up once.
Human Summary
Asked my bf once if he went to go meet a girl, but i think he lied. Should I ask if he lied?
Example 2
Initial Summary
my boyfriend invited me to spend Mother’s Day with his parents but I feel weird about it
because I don’t want to imply that I will become their daughter one day. Is this too soon
or am I overthinking it?
Feedback
The summary is mostly correct but it should mention that one of the reason why the
original poster feels weird about all this situation is because their relationship is just
starting and they haven’t talked about a future together yet. It is also important that the
original poster can’t spend they Mother’s day with her mom and this is one of the reasons
why her boyfriend has invited her.
Human Summary
my boyfriend and I have been together 8 months and he invited me to spend Mother’s
Day with his mom and dad, but I feel uncomfortable too soon?
Example 3
Initial Summary
Girl (F/19) doesn’t know what to do with guy (M/21) who is good and polite, but not
romantic and doesn’t want anything serious.
Feedback
The summary should make clear that the girl is already together with the guy for 7 months.
The summary should also point out that the guy is not passionate and doesn’t want sex.
Lastly the summary should convey that the author is frustrated by the fact that the guy
doesn’t want anything serious and says he doesn’t want to go fast, but that she also thinks
she’s in love.
Human Summary
He (21) is a good guy, but I’m afraid he doesn’t want anything serious with me (19). How
should I react?