Professional Documents
Culture Documents
0.75
0.5
0.50
0.0
0.25
0.5 0.00
0.25
1.0
0.50
1.5
0.75
2.0 1.00
0 1000 2000 3000 4000 0.875 0.9375
Episode Discount
(a) (b)
1.00 1.00
0.75 0.75
0.50 0.50
0.25 0.25
0.00 0.00
0.25 0.25
Figure 2: For all gridworlds, the agent receives the observed reward of 1 for reaching the goal, and a (hidden) penalty of −2 for having a
negative side effect. Therefore, a performance score of 1 means the agent reached the goal with minimal impact, 0 means it stayed put, −1
means it reached the goal and had the negative impact, and −2 means it only had the negative impact. Performance averaged over 20 trials.
[Hadfield-Menell and Hadfield, 2018] Dylan Hadfield- side effects using relative reachability. arXiv:1806.01186
Menell and Gillian K. Hadfield. Incomplete contracting [cs, stat], June 2018. arXiv: 1806.01186.
and AI alignment. CoRR, abs/1804.04268, 2018. [Krakovna, 2018] Victoria Krakovna. Specification gaming
[Hadfield-Menell et al., 2017a] Dylan examples in ai, 2018.
Hadfield-Menell,
Anca Dragan, Pieter Abbeel, and Stuart Russell. The [Leech et al., 2018] Gavin Leech, Karol Kubicki, Jessica
off-switch game. In Proceedings of the Twenty-Sixth Cooper, and Tom McGrath. Preventing Side-effects in
International Joint Conference on Artificial Intelligence, Gridworlds, 2018.
IJCAI-17, pages 220–227, 2017. [Lehman et al., 2018] Joel Lehman, Jeff Clune, and Dusan
[Hadfield-Menell et al., 2017b] Dylan Misevic. The surprising creativity of digital evolution:
Hadfield-Menell,
A collection of anecdotes from the evolutionary compu-
Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca
tation and artificial life research communities. CoRR,
Dragan. Inverse reward design. In Advances in Neural
abs/1803.03453, 2018.
Information Processing Systems, pages 6765–6774, 2017.
[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria
[Krakovna et al., 2018] Victoria Krakovna, Laurent Orseau, Krakovna, Pedro A. Ortega, Tom Everitt, Andrew
Miljan Martic, and Shane Legg. Measuring and avoiding Lefrancq, Laurent Orseau, and Shane Legg. AI Safety
Gridworlds. arXiv:1711.09883 [cs], November 2017.
arXiv: 1711.09883.
[Manheim and Garrabrant, 2018] David Manheim and Scott
Garrabrant. Categorizing variants of goodhart’s law.
CoRR, abs/1803.04585, 2018.
[Regan and Boutilier, 2010] Kevin Regan and Craig
Boutilier. Robust policy computation in reward-uncertain
mdps using nondominated policies. In AAAI, 2010.
[Russell and Norvig, 2009] Stuart J Russell and Peter
Norvig. Artificial intelligence: a modern approach.
Pearson Education Limited, 2009.
[Salge et al., 2014] Christoph Salge, Cornelius Glackin, and
Daniel Polani. Empowerment–an introduction. In Guided
Self-Organization: Inception, pages 67–114. Springer,
2014.
[Soares et al., 2015] Nate Soares, Benja Fallenstein, Stuart
Armstrong, and Eliezer Yudkowsky. Corrigibility. AAAI
Workshops, 2015.
[Strathern, 1997] Marilyn Strathern. ‘improving ratings’:
audit in the british university system. European review,
5(3):305–321, 1997.
[Taylor et al., 2016] Jessica Taylor, Eliezer Yudkowsky,
Patrick LaVictoire, and Andrew Critch. Alignment for Ad-
vanced Machine Learning Systems. page 25, 2016.
[Taylor, 2016] Jessica Taylor. Quantilizers: A safer alterna-
tive to maximizers for limited optimization. In AAAI Work-
shop: AI, Ethics, and Society, 2016.