
Proposal: Reward ambiguity-aware reinforcement learning

Matthew Farrugia-Roberts & Usman Anwar

Background
A reward function captures a sequential decision-making task by assigning high numerical reward to
sequences of actions that solve the task. A reinforcement learning (RL) algorithm solves such a task
by sampling the reward function as a source of feedback for learning a policy—a function that takes
environment states as input and returns highly-rewarded actions as output.
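For illustration, the following minimal sketch shows this feedback loop in a hypothetical tabular task (the environment, the reward function, and all numbers below are illustrative assumptions, not part of this proposal): a standard Q-learning agent queries the reward function at visited state-action pairs and uses the returned values to learn a policy.

```python
import numpy as np

# Hypothetical tabular task: n_states states, n_actions actions (illustrative only).
n_states, n_actions = 10, 4
rng = np.random.default_rng(0)

def reward_fn(state, action):
    # Stand-in reward function: high reward for action 0 in even-numbered states.
    return 1.0 if (state % 2 == 0 and action == 0) else 0.0

def transition(state, action):
    # Stand-in dynamics: move to a uniformly random next state.
    return rng.integers(n_states)

# Tabular Q-learning: the reward function is sampled only at the visited
# state-action pairs and used as the feedback signal for learning.
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
state = 0
for _ in range(5000):
    if rng.random() < epsilon:
        action = rng.integers(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    reward = reward_fn(state, action)   # feedback sampled from the reward function
    next_state = transition(state, action)
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

# The learned (greedy) policy maps each state to a highly-rewarded action.
policy = Q.argmax(axis=1)
print(policy)
```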
In practice, learned policies might be subject to distributional shift. During training, the RL algorithm
might encounter only a subset of all possible environment states (training states). Later, the learned
policy might encounter states it never saw during training (unseen states). Successful generalisation in
RL means that the learned policy acts in unseen states in ways that would have been highly rewarded
by the training reward function.
In this setting, the problem of reward ambiguity arises. Since the RL algorithm never samples the reward
function in the unseen states, the ‘correct’ extrapolation of the task to the unseen states is fundamentally
ambiguous. Multiple proxy reward functions might incentivise the same actions as the training reward
function in the training states, while incentivising different actions in the unseen states.
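As a concrete toy example (all state indices and reward values below are illustrative assumptions), consider two reward functions that assign identical rewards on the training states but diverge on the unseen states; no amount of training-time feedback can distinguish them.

```python
import numpy as np

# Toy state space: states 0-4 are seen during training, 5-9 are unseen.
train_states = np.arange(0, 5)
unseen_states = np.arange(5, 10)

def training_reward(state):
    # Intended task (illustrative): reward grows with the state index.
    return float(state)

def proxy_reward(state):
    # Proxy: agrees with the intended reward on the training states,
    # but extrapolates differently to the unseen states.
    return float(state) if state < 5 else float(4 - (state - 5))

agree_on_train = all(training_reward(s) == proxy_reward(s) for s in train_states)
agree_on_unseen = all(training_reward(s) == proxy_reward(s) for s in unseen_states)
print(agree_on_train, agree_on_unseen)  # True False: the proxy is indistinguishable during training
```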
Sometimes, an RL algorithm learns a policy that generalises to unseen states in a competent and
reward-seeking manner, but towards a proxy reward function rather than the training reward function.
Such a policy might completely ignore the intended task in unseen states, or even actively antagonise it.
This is therefore a potentially dangerous failure mode of RL: it is an example of goal misgeneralisation,
and it is related to inner alignment failure.

Proposal
We propose a three-part project to better understand reward ambiguity and to mitigate risks from
proxy-based goal misgeneralisation in deep reinforcement learning.
(1) Automatically identifying proxy reward functions: Given the training reward function sampled on
the training states, use reward learning to identify a set of plausible proxy reward functions (reward
functions that incentivise the same behaviour on the training states, but different behaviour on a
particular set of unseen states). Progress in this direction would allow us to detect situations where
there are risks from reward ambiguity, and would serve as a foundation for the following two parts.
(2) Efficiently disambiguating reward functions: Given a set of proxy reward functions (including the
    unknown intended reward function), algorithmically identify a small set of unseen states such that
    knowing the true reward values in those states would remove the reward ambiguity and reveal the
    true reward function. Progress in this direction would give us a way to reduce reward ambiguity in
    situations where it is possible but expensive to gain additional ground-truth reward information.
(3) (Fine-tuning) RL under reward uncertainty: Given a set of proxy reward functions (including
    the unknown intended reward function), train a (pre-trained) policy in unseen states towards an
    objective that jointly considers all of the reward functions, thereby avoiding catastrophic failure
    modes. Progress in this direction would give us a way to reduce the impact of reward ambiguity
    when it cannot be eliminated. (A minimal illustration of directions (2) and (3) is sketched after
    this list.)
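The following sketch is one possible instantiation of directions (2) and (3), under illustrative assumptions: the candidate reward functions, the disagreement-based query rule, and the worst-case aggregation are placeholders for exposition, not a committed design. Given a set of candidate reward functions that agree on the training states, it selects the unseen state where the candidates disagree most (a simple heuristic for the most informative ground-truth query) and forms a pessimistic aggregate reward that jointly considers all candidates.

```python
import numpy as np

# Hypothetical setup: a small discrete state space and a set of candidate reward
# functions (proxies plus the unknown intended reward) that were all consistent
# with the training data. Everything below is illustrative.
unseen_states = np.arange(5, 10)

candidate_rewards = [
    lambda s: float(s),                    # candidate A
    lambda s: float(s) if s < 5 else 0.0,  # candidate B (a proxy)
    lambda s: float(min(s, 7)),            # candidate C (another proxy)
]

# Direction (2): pick the unseen state on which the candidates disagree most.
# Querying the ground-truth reward there removes the most ambiguity.
reward_matrix = np.array([[r(s) for s in unseen_states] for r in candidate_rewards])
disagreement = reward_matrix.max(axis=0) - reward_matrix.min(axis=0)
query_state = unseen_states[int(np.argmax(disagreement))]
print("most informative state to query:", query_state)

# Direction (3): when ambiguity cannot be removed, train against an objective that
# jointly considers all candidates, e.g. a pessimistic (worst-case) aggregate reward;
# other aggregations (mean, mean minus variance) are equally possible.
def robust_reward(state):
    values = [r(state) for r in candidate_rewards]
    return min(values)

print([robust_reward(s) for s in unseen_states])
```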
Participants in this research project will implement supervised learning and reinforcement learning
algorithms and conduct experiments investigating one or more of these directions.
References
Background on deep reinforcement learning:
• Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction. MIT Press,
second edition, 2018.
• Emma Brunskill, CS234: Reinforcement Learning. Course materials available via Stanford. Lecture
  recordings from 2019 available via YouTube.
Some definitions and examples of goal misgeneralization:
• Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger, “Goal
misgeneralization in deep reinforcement learning,” Proceedings of the 39th International Conference
on Machine Learning, PMLR 162:12004–12019, 2022. Available via PMLR.
• Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato,
Zac Kenton, “Goal misgeneralization: Why correct specifications aren’t enough for correct goals”,
2022. Preprint arXiv:2210.01790.
For some discussion of proxy goals and the related topic of inner misalignment:
• Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant, “Risks from
learned optimization in advanced machine learning systems,” 2019. Preprint arXiv:1906.01820.
These references are only a sample; identifying further references relevant to the motivation and to the
technical parts of the research project will be an ongoing component of the project.
