REFINER: Reasoning Feedback on Intermediate Representations
Figure 2: An overview of the three tasks tackled in this paper, with examples of both valid and invalid intermediate reasoning steps, as well as their corresponding fine-grained error types. Notice that the Missing Steps error type in the second task actually encompasses two error types: reasoning misalignment, derived from not considering the or operation, and lack of implicit knowledge, where implicit knowledge is needed to match the existing rules.
Figure 3: Overview of the REFINER interaction loop. In each iteration, the generator generates multiple hypotheses. The critic randomly selects one hypothesis and provides feedback based on reasoning errors.

The warm-up phase is critical to ensure that, when the critic comes into the loop, the generator does not produce random answers that are likely to be very bad, given the size of the output space. Hence, we use a relatively small supervised dataset (10% of the training data) to fine-tune the model on the NLR task of interest. After the warm-up phase, we use the additional feedback f from the critic model and learn π_θ(z | x, z′, f).

Exploration. At each iteration t, the generator model generates multiple hypotheses (z^k) using nucleus sampling. The critic model randomly selects one hypothesis and provides feedback on that hypothesis. The exploration step aims at increasing the output variance such that the generator receives a wide range of feedback during training.

Learning. We update the GENERATOR model using the following cross-entropy loss:

L(θ) = − Σ_{t=1}^{T} log p_θ(z_t | x, z′_t, f_t(z′))    (3)

where T is the total number of iterations. Since the feedback contains the error types and hints, which are (latent) fine-grained and logical, it should allow the model to learn and update its generation by addressing the reasoning errors mentioned in the feedback.

Inference. We use the trained critic (see Eq. 2) along with the trained generator to generate a trajectory z_0, z_1, ..., z_T and stop when either f(z_t) is generated by the generator or "No hint" is generated by the critic. We also experimented with chain of thought prompting, where the generator

4 Experimental Setup

Datasets. We evaluate REFINER on three diverse tasks (examples in Fig. 2). We briefly describe the datasets used for each task below.7 Math Word Problem (MWP): We train our models on the MAWPS (Koncel-Kedziorski et al., 2016) dataset and evaluate our models on a challenging dataset, SVAMP (Patel et al., 2021). We evaluate our model on both the equation generation (z) and answer prediction (y) tasks. Similar to Ling et al. (2017) and Amini et al. (2019), for equation generation we replace the numeric values with variable names, for example, number0, number1, etc. For Synthetic Natural Language Reasoning (sNLR), we use the dataset from Liang et al. (2022) with the difficulty level set to hard. We evaluate our model on both inference rule generation (z) and consequent generation (y). For Moral Story (MS), we use the dataset from Emelin et al. (2021), where we evaluate our model on moral norm (z) and moral action (y) generation.

Training Details. For each task, we train a UnifiedQa-T5-base model (UQA-base) (Khashabi et al., 2020) as a critic (§3.1). For exploration (§3.2), we use nucleus sampling with p = 0.5. We select the hyper-parameters by validation loss: for both the generator and critic model, we use the Adam optimizer with a learning rate of 1e−4. Each model is trained for 20 epochs with early stopping based on validation loss. We trained all models on one A100 GPU. We run our models with 3 random seeds and report the average results. For the human study, we selected outputs from the best models (baselines and our model) according to automatic metrics. We train models with T = 3 iterations. At inference time, we use greedy decoding for the generator and critic model, with T = 1 for the automatic critic and T = 3 for the oracle critic. On the MWP and sNLR tasks, we use the exact match (EM) metric for intermediate steps (equation generation and inference rules) and accuracy (Acc) for the final answers. For MS, we conduct a manual evaluation study to assess the relevance of norms and moral actions.8 Further evaluation details are provided in Appendix F.

7 In Table 6 in the Appendix, we report the data statistics.
8 Automatic scores such as BLEU, ROUGE, etc. only account for word-level similarity between gold norms or actions and generated norms or actions.
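The inference procedure above (iterate generator and critic until the critic emits "No hint" or the iteration budget T is exhausted) can be sketched as follows. This is a minimal illustration of the control flow only; `toy_generator` and `toy_critic` are hypothetical stand-ins, not the paper's fine-tuned T5 models:

```python
def refine(generator, critic, x, z0, max_iters=3):
    """Iterative refinement: ask the critic for feedback on the current
    hypothesis and let the generator revise it, stopping early when the
    critic signals "No hint" (no reasoning error found)."""
    z = z0
    trajectory = [z]
    for _ in range(max_iters):
        feedback = critic(x, z)
        if feedback == "No hint":
            break
        z = generator(x, z, feedback)
        trajectory.append(z)
    return z, trajectory

# Hypothetical stand-ins for the trained models, for illustration only.
def toy_critic(x, z):
    return "No hint" if z == "add(number0, number1)" else "The operator in #0 is incorrect."

def toy_generator(x, z, feedback):
    # Pretend the generator acted on the feedback and fixed the equation.
    return "add(number0, number1)"

final, traj = refine(toy_generator, toy_critic, "question", "subtract(number0, number1)")
```

With the real models, `critic` would be the trained UQA-base critic and `generator` the fine-tuned generator, both decoded greedily at inference time as described above.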
Baselines. We compare our method with three different LMs as generator models: UQA-base, UQA-large (supervised setting), GPT-3.5-code-DaVinci-002 (few-shot setting) and GPT-3.5*-text-DaVinci-003. We also compare REFINER to the RL-based method Proximal Policy Optimization (PPO) (Schulman et al., 2017). We use the implementation of PPO from Ramamurthy et al. (2022). For GPT-3.5, the model is not finetuned but prompted with standard few-shot prompts (Brown et al., 2020), in which the LM is given in-context demonstrations of input-output pairs. We provide 2 demonstrations per class. We also experimented with chain of thought (CoT) prompting (Wei et al., 2022), where the model is prompted to first generate the intermediate steps (z) and then the final answer (y). Note that the sNLR task is a synthetic task where the model needs to perform either one-hop or two-hop reasoning. Clark et al. (2021) showed that fine-tuning large language models (354M parameter size) could achieve high performance (99% accuracy). Hence, we only compare our REFINER model with the UQA-base model (220M) (see Table 3). Since human annotation is expensive, we focus on comparing against the most meaningful baseline: UQA-large for the MS task (see Table 4). It is important to highlight that our proposed framework is general and one can use any other LMs as GENERATOR or CRITIC.

Model | Eq. (z) | Ans. (y)
UQA-base | 34.1 | –
UQA-base + PPO | 31.5 | –
REFINER base | 47.2 | –
REFINER base + Oracle (T=3) | 66.0 | –
UQA-large | 46.7 | –
UQA-large + PPO | 48.2 | –
REFINER large | 53.8 | –
REFINER large + Oracle (T=3) | 68.1 | –
GPT-3.5 + CoT | 59.3 | 63.5
GPT-3.5 + CoT + REFINERcritic | 62.3 | 66.4
GPT-3.5* + CoT | 64.1 | 67.1
GPT-3.5* + CoT + REFINERcritic | 67.3 | 70.6

Table 2: Results on MWP. Eq.: Equation, Ans.: Answer. Comparison of REFINER with baselines on the SVAMP dataset. GPT-3.5: code-DaVinci-002, GPT-3.5*: text-DaVinci-002. For models other than GPT-3.5, the answer can be obtained via symbolic execution of the equation and is thus a function of the validity of the equation. For GPT-3.5, the model is few-shot prompted to either generate the equation with variable names z, or generate the answer y.

Model | IR (z) | Con (y)
UQA-base | 90.6 ± 0.8 | 94.1
REFINER base | 93.5 ± 0.4 | 97.3
REFINER base + Oracle | 96.2 ± 0.9 | 98.9
GPT-3.5 + CoT | 17.4 ± 0.5 | 45.0
GPT-3.5 + CoT + REFINER | 26.8 ± 0.5 | 46.6
GPT-3.5* + CoT | 14.3 ± 0.9 | 40.6
GPT-3.5* + CoT + REFINER | 21.1 ± 1.2 | 42.1

Table 3: Results on sNLR task. IR: Inference Rules, Con: Consequent

5 Results
[Figure 4 panels, critic feedback excerpts: "The operator in #0 is incorrect, the second number in #0 is incorrect, the operator in #1 is incorrect."; "The first number in #0 is incorrect, the second number in #0 is incorrect."; "The second number in #0 is incorrect. The second number in #1 is incorrect." (one feedback marked Incomplete).]

Figure 4: Examples of REFINER on the MWP task. Three different scenarios are highlighted in the figure: (a) the CRITIC model provides correct feedback, and the GENERATOR model utilizes the feedback and fixes the incorrect equation; (b) the CRITIC model provides correct feedback; however, the GENERATOR model fails to fix the incorrect equation; and (c) the CRITIC model provides incomplete feedback, and the GENERATOR model partially fixes the incorrect equation.
[Figure residue: plots comparing UQA-T5-large and REFINER (Exact Match, values roughly 40–60), with an automatic vs. oracle critic used during inference; panel (a): noise-level of the critic used during training; a further axis reports number of errors.]
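The noisy-critic perturbation analyzed in the following Noise Sensitivity paragraph replaces the oracle feedback with random feedback with probability ε (the noise level). A minimal sketch, where the feedback pool is a hypothetical example rather than the paper's actual feedback set:

```python
import random

def noisy_critic(oracle_feedback: str, noise_level: float, feedback_pool, rng=random):
    """With probability `noise_level`, discard the oracle's feedback and
    return a randomly drawn feedback string instead; otherwise pass the
    oracle feedback through unchanged."""
    if rng.random() < noise_level:
        return rng.choice(feedback_pool)
    return oracle_feedback

# Hypothetical pool of feedback strings for illustration.
pool = [
    "The operator in #0 is incorrect.",
    "The first number in #0 is incorrect.",
    "No hint",
]
```

At noise level 0 this reduces to the oracle critic; at noise level 1 the feedback is entirely random, which is the regime where the generator learns to ignore the critic.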
Noise Sensitivity. To further understand the behaviour of the REFINER framework, we run variations with noisy critics. We replace the oracle critic used during training with a noisy critic (Fig. 6 (a)) to inspect how training with an imperfect critic impacts the generator. Also, we use a noisy critic at inference but keep the oracle critic during training (Fig. 6 (b)). This analysis is performed on the SVAMP dataset for MWP. The noisy critics are generated by random perturbations of the oracle critic; for a noise-level ε, the oracle feedback is replaced by random feedback with probability ε.

We observe, in Fig. 6 (a), that when training with a very noisy critic (> 75% noise), the generator LM learns to ignore the critic, as there is no difference between using the trained critic or the oracle during inference. Interestingly, training with a bit of noise (< 50%) does not seem to harm the model, as performances are not statistically different from training with the oracle (noise of 0%). Then, in Fig. 6 (b), we remark that the quality of the critic used at inference time has a huge impact. Having the oracle provide feedback is by far the best scenario. Already with 25% noise, the critic makes the generator perform worse than using our trained critic (REFINER). With more than 50% noise, the critic significantly harms the generator. Indeed, the generator, trained with an oracle critic, has learned to trust the critic and expects useful feedback.

6.2 Qualitative Analysis

To explain the findings in §6.1, we further manually analyze 100 instances for the MWP task. We observe two different scenarios when REFINER
failed to fix the outputs generated by the GENERATOR model: (a) the CRITIC model provides correct feedback; however, the GENERATOR model still generates an incorrect equation, and (b) the CRITIC model provides incomplete or partially correct feedback. The former case indicates that either the GENERATOR model makes mistakes in following the instruction from the CRITIC, or the feedback from the critic can be ambiguous. For example, in Fig. 4 (b), we observe a case where the critic is correct, but the feedback could result in an incorrect equation. The latter case indicates that our trained critic model generates incorrect feedback, which can result in incorrect or partially correct equations. We also observe that our CRITIC model failed to generate correct feedback when the GENERATOR model generated incorrect equations with multiple mistakes (see Fig. 4 (c)).
7 Conclusion

In this paper, we propose REFINER, a framework to improve the reasoning abilities of LMs through an iterative feedback loop between two models, a generator and a critic. Our evaluation of this framework on three reasoning tasks showed that structured and fine-grained feedback on intermediate reasoning errors results in significant performance gains, surpassing scalar-value feedback. Our trained critic model alone, even when noisy, can improve intermediate representations of LMs, showing that REFINER can significantly boost LMs' performance on reasoning tasks. The REFINER framework is very general and, in principle, might be applied to steer language models in performing different reasoning tasks. More specifically, the critic model can be seen as a tool for LLMs to refine their generation quality.
Acknowledgment

We would like to thank Martin Josifoski, Syrielle Montariol, and Zeming Chen for their helpful feedback on a draft version of the paper. We acknowledge the support of the ICT-48 Network of AI Research Excellence Center "TAILOR" (EU Horizon 2020, GA No 952215). West's lab is partly supported by grants from the Swiss National Science Foundation (200021_185043), Swiss Data Science Center (P22_08), H2020 (952215), Microsoft Swiss Joint Research Center, and Google, and by generous gifts from Facebook, Google, and Microsoft. Antoine Bosselut gratefully acknowledges the support of Innosuisse under PFFS-21-29, the EPFL Science Seed Fund, the EPFL Center for Imaging, Sony Group Corporation, and the Allen Institute for AI.
Limitations

Our REFINER framework could not be comprehensively evaluated on all applicable downstream reasoning tasks due to their sheer number. While deliberately distinct, we focused on only three different reasoning tasks in order to study how natural language reasoning feedback can impact downstream tasks. We believe this represents an initial but important step towards exploring automated natural language feedback on intermediate representations. In addition, the critic we presented here is specific to each task, while the ideal critic would be a general one, capable of providing feedback on a wide range of reasoning tasks. Similarly, we considered fine-grained reasoning errors specific to each reasoning task. Recent work has mentioned several other fine-grained reasoning errors (Golovneva et al., 2023), which cannot be fully covered by the reasoning tasks we considered. Generalizing both the critic and the fine-grained error types emerge as both the main limitations of this paper and the directions of future work.

Ethical Considerations

In this paper, we experiment with existing datasets which are, to the best of our knowledge, adequately cited. Our proposed framework REFINER is designed to improve the reasoning abilities of LMs. These LMs have been shown to encode biases about race, gender, and many other demographic attributes (Weidinger et al., 2021; Sheng et al., 2020). Since our framework does not offer a way to mitigate these biases, models improved using this framework could still reflect the same harmful behaviours normally exhibited by these models. We recommend that anyone deploying our model off-the-shelf first check whether the model is harmful towards any protected group, and that appropriate mitigation be taken. In addition, our MS task is based on a dataset of situations, intentions, and actions that heavily skews towards western culture and social norms (Emelin et al., 2021). Consequently, our human evaluation on the MS task was done with AMT workers based in the US who were paid adequately for the average time it took to solve the task.
References

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.

Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5947–5952, Hong Kong, China. Association for Computational Linguistics.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program synthesis with large language models.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI feedback.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences.

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2021. Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI'20.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.

Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. 2020. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2065–2077, Online. Association for Computational Linguistics.

Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos, and Ahmed Hassan Awadallah. 2021. NL-EDIT: Correcting semantic parse errors through natural language interaction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5599–5610, Online. Association for Computational Linguistics.

Denis Emelin, Ronan Le Bras, Jena D. Hwang, Maxwell Forbes, and Yejin Choi. 2021. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 698–718, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics.

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2022. ROSCOE: A suite of metrics for scoring step-by-step reasoning.

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. ROSCOE: A suite of metrics for scoring step-by-step reasoning. In The Eleventh International Conference on Learning Representations.

Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A survey on recent approaches for natural language processing in low-resource scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2545–2568, Online. Association for Computational Linguistics.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve.

Zhanming Jie, Jierui Li, and Wei Lu. 2022. Learning to reason deductively: Math word problem solving as complex relation extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5944–5955, Dublin, Ireland. Association for Computational Linguistics.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.

Niklas Kiehne, Hermann Kroll, and Wolf-Tilo Balke. 2022. Contextualizing language models for norms diverging from social majority. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Bugeun Kim, Kyung Seo Ki, Sangkyu Rhim, and Gahgene Gweon. 2022. EPT-X: An expression-pointer transformer model that generates eXplanations for numbers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4442–4458, Dublin, Ireland. Association for Computational Linguistics.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.

Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. Sage Publications.

Andrew K. Lampinen, Nicholas A. Roy, Ishita Dasgupta, Stephanie C. Y. Chan, Allison C. Tam, James L. McClelland, Chen Yan, Adam Santoro, Neil C. Rabinowitz, Jane X. Wang, et al. 2022. Tell me why! Explanations support learning relational and causal structure. In International Conference on Machine Learning.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback.

Ana Marasovic, Iz Beltagy, Doug Downey, and Matthew Peters. 2022. Few-shot self-rationalization with natural language prompts. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 410–424, Seattle, United States. Association for Computational Linguistics.

Alice Martin, Guillaume Quispe, Charles Ollion, Sylvain Le Corff, Florian Strub, and Olivier Pietquin. 2022. Learning natural language generation with truncated reinforcement learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 12–37, Seattle, United States. Association for Computational Linguistics.

Nikhil Mehta and Dan Goldwasser. 2019. Improving natural language interaction with robots using advice. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1962–1967, Minneapolis, Minnesota. Association for Computational Linguistics.

Khanh Nguyen, Dipendra Misra, Robert Schapire, Miro Dudík, and Patrick Shafto. 2021. Interactive learning from activity description.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

Debjit Paul and Anette Frank. 2021. COINS: Dynamically generating COntextualized inference rules for narrative story completion. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5086–5099, Online. Association for Computational Linguistics.

Piotr Piękos, Mateusz Malinowski, and Henryk Michalewski. 2021. Measuring and improving BERT's mathematical abilities by predicting the order of reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 383–394, Online. Association for Computational Linguistics.

Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. 2022. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484, Hong Kong, China. Association for Computational Linguistics.

Christian Rupprecht, Iro Laina, Nassir Navab, Gregory D. Hager, and Federico Tombari. 2018. Guide me: Interacting with deep networks. CoRR, abs/1803.11544.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators.

Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. 2022. Training language models with language feedback.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. ArXiv, abs/1707.06347.

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. 2020. Towards controllable biases in language generation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3239–3254, Online. Association for Computational Linguistics.

Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised commonsense question answering with self-talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4615–4629, Online. Association for Computational Linguistics.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. In Advances in Neural Information Processing Systems, volume 33, pages 20227–20237. Curran Associates, Inc.

Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. 2022. Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 339–352, Seattle, United States. Association for Computational Linguistics.

Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621, Brussels, Belgium. Association for Computational Linguistics.

Zhipeng Xie and Shichao Sun. 2019. A goal-driven tree-structured neural model for math word problems. In International Joint Conference on Artificial Intelligence.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3928–3937, Online. Association for Computational Linguistics.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences.
Iteration 1. Generator: #0: add(number0, number2) | #1: subtract(#0, number2)
Critic: The first number in #0 is incorrect. The operator in #1 is incorrect.

Iteration 2. Generator: #0: add(number1, number2) | #1: multiply(#0, number2)
Critic: The number in #1 is incorrect.

Iteration 3. Generator: #0: add(number0, number1) | #1: multiply(#0, number2)
Critic: No Hint
[Chart: Norm Evaluation. Distribution of human ratings (strongly disagree, disagree, unsure, agree, strongly agree; counts 0–50) for UQA-T5-large, UQA-T5-large + PPO, and REFINER.]
Situation: Jenny has been going out with a guy for a while, but she feels it isn’t
working out.
Intention: Jenny wants to end the relationship with the guy.
Moral Action: Jenny meets the guy and gently says that she wants to end the relationship.
Immoral Action: Jenny sends the guy a text message that informs him that the relation-
ship is over.
Plausible Hypothesis: It’s good to break up with someone face to face.
Contradictory Hypothesis: It’s bad to break up with someone face to face.
Implausible Hypothesis 1: You should not send a text message
Implausible Hypothesis 2: It’s bad to end the relationship with the guy.
Situation: Paula is leaving home to do some shopping and notices that the dog’s
water bowl is empty.
Intention: Paula wants to leave the dog at home.
Moral Action: Paula fills the bowl with water before leaving dog alone in the house.
Immoral Action: Paula decides to fill the bowl when she gets home and leaves the dog with
no water.
Plausible Hypothesis: You must always be kind to animals
Contradictory Hypothesis: It’s bad to be kind to animals.
Implausible Hypothesis 1: It’s rude to leave the dog at home.
Implausible Hypothesis 2: It’s important to do some shopping.
Table 11: Critic training examples from Moral Stories. Norm judgments are highlighted in light blue, original norm actions are colored in light gray, and norm actions extracted from the context are colored in light yellow.
Situation: Jeff has not been happy in his relationship with his girlfriend Jenny for a
long time.
Intention: Jeff wants to break up with Jenny.
Moral Action: Jeff goes to Jenny’s house and tells her that he wants to break up with her.
Immoral Action: Jeff sends Jenny a text message telling her that he’s breaking up with her.
UQA-T5-large Output: It’s wrong to break up with someone.
Jeff calls Jenny and tells her that he’s breaking up with her.
Situation: Jack and his friends are playing in his backyard on a sunny day.
Intention: Jack wants to have fun with his friends.
Moral Action: Jack sets up an inflatable pool and swims in the pool with his friends.
Immoral Action: Jack tosses a ball around with his friends, and he accidentally hit a window
of his neighbor.
UQA-T5-large Output: It’s wrong to hit someone.
Jack throws a ball with his friends and hits a window of his neighbor.
Table 12: Moral Stories generations. Norm outputs (z) are highlighted in light blue, moral action outputs (y) are colored in light green.
[Chart: Moral Action Evaluation. Distribution of human ratings (strongly disagree, disagree, unsure, agree, strongly agree; counts 0–40) for UQA-T5-large, UQA-T5-large + PPO, and REFINER.]
Figure 11: Excerpt from AMT HIT instructions: Norm Evaluation Task
Figure 12: Excerpt from AMT HIT instructions: Moral Action Evaluation Task
Figure 13: Excerpt from AMT HIT instructions: Norm Evaluation Task instructions
Figure 14: Excerpt from AMT HIT instructions: Norm Evaluation Task Dos and Don’ts
Figure 15: Excerpt from AMT HIT instructions: Norm Evaluation Task examples
Figure 16: Excerpt from AMT HIT instructions: Moral Action Evaluation Task instructions
Figure 17: Excerpt from AMT HIT instructions: Moral Action Evaluation Task Dos and Don’ts
Figure 18: Excerpt from AMT HIT instructions: Moral Action Evaluation Task examples
Figure 19: Excerpt from AMT HIT instructions: Acceptance and Privacy Policy