1 Related Works

Balduzzi and Ghifary [3] learn a G-function, which predicts the gradient of a Q-function, using a gradient perturbation trick. Fairbank et al. [4] train a G-function with DQN-style targets; however, computing the target gradients requires that their model and reward function be differentiable with respect to all inputs. Nguyen and Widrow [12] and Jordan and Jacobs [10] learn a model and differentiate through it. Prokhorov and Wunsch [14] give an overview of heuristic dynamic programming, dual heuristic programming, and globalized dual heuristic programming.
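As a rough illustration of the "learn a model and differentiate through it" idea above, the sketch below (PyTorch; the network sizes, the deterministic policy, and the training loop are our own assumptions rather than details taken from these papers) unrolls a learned differentiable model and backpropagates the predicted return into the policy. Fitting the model itself to data is omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2  # hypothetical dimensions

# learned dynamics/reward model: (obs, action) -> (next_obs, reward)
model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                      nn.Linear(64, obs_dim + 1))
# deterministic policy: obs -> action
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                       nn.Linear(64, act_dim), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_loss(obs, horizon=5):
    # Unroll the learned model and accumulate the predicted reward;
    # this only provides a policy gradient because the model and the
    # predicted reward are differentiable with respect to the actions.
    total = 0.0
    for _ in range(horizon):
        act = policy(obs)
        pred = model(torch.cat([obs, act], dim=-1))
        obs, reward = pred[..., :obs_dim], pred[..., obs_dim]
        total = total + reward.mean()
    return -total

obs = torch.randn(32, obs_dim)  # batch of start states; fitting the model to data is omitted
opt.zero_grad()
policy_loss(obs).backward()
opt.step()
```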
Bahdanau et al. [1] use RL to train a Q-function, which is then used to train an RNN; however, this assumes that the target labels are always available. Jaques et al. [9] train a DQN to rate how good the output of an RNN is. Heess et al. [8] train an RNN with DDPG, but do not explore how to apply these techniques to supervised learning tasks; as a result, their policy is very slow. Hausknecht and Stone [7] train an LSTM Q-function in a DQN setup; however, they assume that either (1) every rollout starts from the beginning of an episode, or (2) a rollout starts from a random point with a zero hidden state. Wierstra et al. [16] show how to compute the policy gradient for recurrent policies. Bakker [2] uses an LSTM to represent an advantage function and uses eligibility traces.
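The two rollout assumptions attributed to Hausknecht and Stone [7] above can be made concrete with a small sampling sketch (Python; the replay-buffer format and the function name are hypothetical):

```python
import random

def sample_sequence(episodes, seq_len, from_start=True):
    """episodes: list of episodes, each a list of transitions (assumed format).
    Strategy (1): replay from the start of an episode, so the recurrent state
    can be rebuilt exactly from the beginning of the sequence.
    Strategy (2): start at a random point and initialize the LSTM hidden state
    to zeros, which is cheap but makes the early hidden states inaccurate."""
    ep = random.choice(episodes)
    start = 0 if from_start else random.randint(0, max(0, len(ep) - seq_len))
    zero_hidden_state = None  # stands in for an all-zeros LSTM state
    return ep[start:start + seq_len], zero_hidden_state
```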
Gomez and Schmidhuber [5] train RNNs with evolutionary algorithms. Schmidhuber et al. [15] train RNNs by randomly guessing weights.
Hasinoff [6] provides a survey of earlier work on reinforcement learning with hidden state. Lin and Mitchell [11] explore three ways to incorporate recurrence into RL (sketched below): (1) feeding a history window to the Q-function, (2) making the Q-function itself recurrent, and (3) learning a recurrent model.
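A minimal sketch of these three options (PyTorch; all layer sizes and names are hypothetical, and only the architectures, not their training, are shown):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, window, hidden = 4, 3, 8, 32  # hypothetical sizes

# (1) feed a fixed window of history to a feedforward Q-function
q_window = nn.Sequential(nn.Linear(obs_dim * window, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))

# (2) make the Q-function itself recurrent
class RecurrentQ(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, state=None):
        out, state = self.rnn(obs_seq, state)
        return self.head(out), state

# (3) learn a recurrent model whose state summarizes the history,
#     and put a feedforward Q-function on top of that state
state_model = nn.LSTM(obs_dim, hidden, batch_first=True)
q_on_state = nn.Linear(hidden, n_actions)

obs_seq = torch.randn(1, window, obs_dim)   # one sequence of observations
q1 = q_window(obs_seq.reshape(1, -1))       # (1) history-window variant
q2, _ = RecurrentQ()(obs_seq)               # (2) recurrent-Q variant
summary, _ = state_model(obs_seq)
q3 = q_on_state(summary[:, -1])             # (3) model-state variant
```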
[?] augment the MDP to include memory states in the state, observation, and action spaces; however, this approach does not use BPTT information. Peshkin et al. [13] first introduced these memory states.
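A minimal sketch of this memory-state augmentation, assuming a Gym-style environment that returns flat NumPy observations (the wrapper class and its n_bits parameter are hypothetical):

```python
import numpy as np

class MemoryAugmentedEnv:
    """Wrap an environment so that the observation includes the memory
    contents and the action includes a memory write."""
    def __init__(self, env, n_bits=2):
        self.env = env
        self.memory = np.zeros(n_bits)

    def reset(self):
        self.memory[:] = 0.0
        return np.concatenate([self.env.reset(), self.memory])

    def step(self, env_action, memory_write):
        # the agent chooses the memory write as part of its augmented action
        self.memory = np.asarray(memory_write, dtype=float)
        obs, reward, done, info = self.env.step(env_action)
        return np.concatenate([obs, self.memory]), reward, done, info
```

A memoryless feedforward policy acting in such a wrapped environment can learn to read and write the external memory through ordinary RL, without backpropagation through time.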

References
[1] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan
Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An Actor-
Critic Algorithm for Sequence Prediction. arXiv:1607.07086v1 [cs.LG],
2016. URL http://arxiv.org/abs/1607.07086.
[2] Bram Bakker. Reinforcement Learning with Long Short-Term Memory. Advances in Neural Information Processing Systems 14, pages 1475–1482, 2002. ISSN 1049-5258.
[3] David Balduzzi and Muhammad Ghifary. Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies. pages 1–27, 2015. URL http://arxiv.org/abs/1509.03005.
[4] Michael Fairbank, Eduardo Alonso, and Danil Prokhorov. An equivalence between adaptive dynamic programming with a critic and backpropagation through time. IEEE Transactions on Neural Networks and Learning Systems, 24(12):2088–2100, 2013. ISSN 2162-237X. doi: 10.1109/TNNLS.2013.2271778.
[5] F. J. Gomez and J. Schmidhuber. Co-Evolving Recurrent Neurons Learn Deep Memory POMDPs. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2005. doi: 10.1145/1068009.1068092.
[6] Samuel W Hasinoff. Reinforcement Learning for Problems with Hidden State. Technical report, University of Toronto, pages 1–18, 2003.
[7] Matthew Hausknecht and Peter Stone. Deep Recurrent Q-Learning for Partially Observable MDPs. arXiv:1507.06527 [cs.LG], 2015.
[8] Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv:1512.04455, pages 1–11, 2015. URL http://arxiv.org/abs/1512.04455.
[9] Natasha Jaques, Shixiang Gu, Richard E Turner, and Douglas Eck. Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning. pages 1–11, 2016.
[10] Michael I Jordan and Robert A Jacobs. Learning to control an unstable system with forward modeling. Advances in Neural Information Processing Systems, 2:324–331, 1990.

[11] L.J. Lin and T.M. Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Technical report, Carnegie Mellon University, 1992. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.319.
[12] Derrick H. Nguyen and Bernard Widrow. Neural Networks for Self-Learning Control Systems. IEEE Control Systems Magazine, 1990. ISSN 0272-1708.

[13] Leonid Peshkin, Nicolas Meuleau, and Leslie Kaelbling. Learning Policies with External Memory. In Proceedings of the Sixteenth International Conference on Machine Learning, 1999. URL http://arxiv.org/abs/cs/0103003.

[14] Danil V. Prokhorov and Donald C. Wunsch. Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997–1007, 1997. ISSN 1045-9227. doi: 10.1109/72.623201.
[15] Jürgen Schmidhuber, Sepp Hochreiter, and Yoshua Bengio. Evaluating Long-term Dependency Benchmark Problems by Random Guessing. 1997.
[16] Daan Wierstra, Alexander Förster, Jan Peters, and Jürgen Schmidhuber. Recurrent Policy Gradients. May 2009.
