
Final Project Report - Learning in Sparse Reward Environment

Mariamma Antony, Jagriti Singh, Vivek Khandelwal, Ravi Raja
Abstract

Dealing with sparse rewards is one of the biggest challenges in reinforcement learning. The lack of intermediate rewards during an episode makes learning difficult for the agent: it may never reach the positively rewarded goal and therefore receives no feedback with which to improve its performance. Manually shaping the reward function can result in sub-optimal performance. Our project aims to solve this hard exploration problem by implementing techniques that can learn from binary rewards. We have also extended our work to deal with dynamic goals and to improve the search for an optimal policy using demonstrations.

Introduction

Reinforcement learning (RL) provides a powerful framework for an agent to perform a task based on observations and feedback from the environment. An agent interacts with the environment and receives a series of rewards while trying to reach one or more goal states, and it is expected to learn a policy that maximizes the expected return. In a sparse reward environment, the received reward is just a binary signal indicating success or failure, and most of the signals indicate failure. This makes it difficult for the agent to connect a long series of actions to a distant future reward, so broadening the application of RL to sparse reward environments is a challenging task. Solving sparse reward problems in RL is important because:

• Many tasks are natural to specify with a binary reward signal. For example, whether or not an object has been inserted into a box defines the achieved state as a success or a failure.

• Manually shaping a reward function can result in sub-optimal performance. It also requires domain knowledge, and the reward must be shaped in such a way that it avoids local optima. Since it is difficult to define accurate reward functions in complicated environments, reward shaping does not generalize well.

The aim of this project is to solve tasks in a sparse reward environment, where the agent is required to generalize by learning from limited feedback. We consider problems with single or multiple goals and binary reward signals, using the Hindsight Experience Replay (HER) algorithm. We have also extended our study to dynamic goal environments, in which the position of the goal changes over time. We highlight the merits and shortcomings of HER through demonstrations.

Literature Review

A significant work that combines RL (Silver, 2020) with artificial neural networks is the deep Q-network (DQN) (Mnih et al., 2015). While DQN solves problems with high-dimensional observation spaces, it can only handle low-dimensional, discrete action spaces. Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) was developed as an enhancement of DPG (Silver et al., 2014), inspired by DQN, which allows it to use neural network function approximation over large state and action spaces. Competitive Experience Replay (CER) (Liu et al., 2019) improves HER by efficiently supplementing the sparse reward, placing learning in the context of an exploration competition between a pair of agents. Learning by playing (Riedmiller et al., 2018) can be understood as an extension of HER. Deep RL approaches to sparse extrinsic rewards also include unsupervised auxiliary tasks (Jaderberg et al., 2016) and curiosity-driven exploration (Pathak et al., 2017). Trott et al. (2019) present an effective model-free method that learns a shaped reward towards the goal state in a sparse environment. Continuing from the mid-term report, we explain our approach below.

Our Approach: Hindsight Experience Replay (HER)

To enable an agent to learn an optimal policy in a sparse reward environment, we use the experience replay technique described in (Andrychowicz et al., 2017), known as HER. The highlights of this approach are as follows (a small relabeling sketch follows the list):

• To generalize over goals, learn a value function Q(s, a, g) that is conditioned on the goal g of each episode.
• If the attempted goal state g is not reached, assign the state that was actually reached as another goal g'. Then replay the episode with g' as the new goal. This provides positive rewards and learning signals.

• Use DQN and DDPG (Lillicrap et al., 2015) to handle off-policy data, for discrete and continuous control respectively. DDPG is based on an actor-critic framework in which one neural network acts as the actor that computes the action for a given state, and another acts as the critic that estimates the value of that action.
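The following is a minimal sketch of the goal-relabeling step in Python. It assumes episodes are stored as lists of dictionaries with `obs`, `action`, `next_obs`, `achieved_goal` (the goal actually reached after the transition) and `desired_goal` keys, and a sparse `compute_reward(achieved, desired)` returning 0 on success and -1 otherwise; these names and the 'future' relabeling strategy shown here are illustrative and not the exact data layout of the baselines implementation.

```python
import numpy as np

def her_relabel(episode, compute_reward, k_future=4, rng=np.random):
    """Augment one episode with hindsight transitions ('future' strategy)."""
    transitions = []
    T = len(episode)
    for t, step in enumerate(episode):
        # Store the original transition with the real (usually missed) goal.
        r = compute_reward(step["achieved_goal"], step["desired_goal"])
        transitions.append((step["obs"], step["action"], r,
                            step["next_obs"], step["desired_goal"]))
        # Pretend that goals achieved later in the same episode were the
        # intended goals; these relabeled transitions get reward 0.
        for f in rng.randint(t, T, size=k_future):
            new_goal = episode[f]["achieved_goal"]
            r = compute_reward(step["achieved_goal"], new_goal)
            transitions.append((step["obs"], step["action"], r,
                                step["next_obs"], new_goal))
    return transitions
```

The relabeled transitions carry a success reward whenever a later achieved goal is used, which is exactly the learning signal that is missing when only the original goal is replayed.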
What are the challenges of using HER?

• Task horizon or action dimensionality beyond a certain limit: Firstly, such environments contain tasks in which the agent must perform multiple actions to complete the task. Secondly, complex dynamics around contacts are difficult to model using HER, because these scenarios are sensitive to small errors while manipulating the objects. In these environments the agent rarely sees the reward, which prevents it from learning from random exploration. One approach to solving this is to use demonstrations along with HER.

• Dynamic goals: The dynamic goal moves by following some law unknown to the agent. In such environments HER does not work well, and it can even degrade the performance of off-policy RL algorithms. The solution here is DHER (Dynamic HER), which assembles successful experiences from two relevant failures and lets the agent learn an efficient policy in such environments.

What environments does HER not solve?

• Block stacking: This is the problem of stacking n blocks one on top of another. It involves a long horizon and complex contacts, and it also requires generalizing to each instance of the task.

• Multi-finger robot-arm manipulation: This has applications in in-hand manipulation of objects, complex grasping and tool use, e.g. rotating an object in the hand or opening a door. These tasks are challenging for a multi-finger robot arm because they involve complex interactions with objects and high-dimensional observations.

Solution to HER Challenges

HER with Demonstrations

The idea of combining demonstrations with RL is not new and has been implemented in DQN and DDPG from Demonstrations. Demonstrations consist of behavior examples from experts that the agent attempts to copy, and they can be collected in multiple ways, for example using virtual reality. There are three main concepts in applying demonstrations to RL (Nair et al., 2018; Rajeswaran et al., 2018); a sketch of the combined actor loss follows the list.

• Behavior Cloning (BC) gradient: BC is a form of learning from demonstrations. A BC gradient is added at every step of the policy update, so instead of only cloning from demonstrations at the beginning, at every step we take a mixture of the policy gradient and the BC gradient.

• Q-filter: Improve beyond the demonstrations. If our policy is already better than the demonstration policy, we do not clone it anymore.

• Reset from demonstration states: Take some random timestep in a demonstration and use it to initialize rollouts. This brings the agent closer to actual reward signals.

Using demonstrations in this way improves performance: it helps to speed up learning and to learn tasks that are difficult to solve using HER alone.
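As a concrete illustration of the first two points, the sketch below combines the DDPG actor objective with a behavior-cloning term gated by a Q-filter. It is written with PyTorch purely for readability; the actor/critic call signatures and batch keys are our own assumptions, not the interface of the baselines code.

```python
import torch

def actor_loss_with_bc(actor, critic, batch, demo_batch, bc_weight=1.0):
    """DDPG actor loss plus a Q-filtered behavior-cloning term (sketch)."""
    # Standard DDPG actor objective: maximize Q(s, pi(s, g), g).
    pi = actor(batch["obs"], batch["goal"])
    ddpg_loss = -critic(batch["obs"], pi, batch["goal"]).mean()

    # Behavior cloning on demonstration states.
    pi_demo = actor(demo_batch["obs"], demo_batch["goal"])
    q_pi = critic(demo_batch["obs"], pi_demo, demo_batch["goal"])
    q_demo = critic(demo_batch["obs"], demo_batch["action"], demo_batch["goal"])

    # Q-filter: only clone demonstration actions that the critic currently
    # rates higher than the policy's own action.
    mask = (q_demo.view(-1) > q_pi.view(-1)).float().detach()
    bc_err = (pi_demo - demo_batch["action"]).pow(2).sum(dim=-1)
    bc_loss = (mask * bc_err).mean()

    return ddpg_loss + bc_weight * bc_loss
```

With the mask removed this reduces to always cloning the demonstrations; the filter is what allows the policy to eventually surpass the demonstrator.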
Dynamic Hindsight Experience Replay (DHER)

DHER (Fang et al., 2019) is an extension of HER and succeeds in learning in dynamic goal environments mainly because of the following steps (a simplified assembly sketch follows the list):

• It takes the achieved-goal trajectory (the positions of the gripper) of a failed episode Ei and tries to find another failed episode Ej whose desired-goal trajectory (the positions of the moving target) matches it.

• It then assembles a new episode Ei' by combining the two episodes Ei and Ej, replacing the desired goals in Ei with the desired goals of Ej.

• The new episode Ei', with its imagined goal trajectory, is stored in the replay buffer.
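A much simplified sketch of the assembly step is given below. For brevity it matches achieved and desired goals at the same time index and returns the first match it finds, whereas DHER matches arbitrary time steps; the episode layout and the `goals_close` helper are our own assumptions for illustration.

```python
def assemble_dher_episode(failed_episodes, goals_close):
    """Try to build an imagined episode Ei' from two failed episodes."""
    for ep_i in failed_episodes:
        for ep_j in failed_episodes:
            if ep_i is ep_j:
                continue
            # Look for a step where the goal achieved in Ei coincides with
            # the goal that was desired in Ej.
            for t in range(min(len(ep_i), len(ep_j))):
                if goals_close(ep_i[t]["achieved_goal"],
                               ep_j[t]["desired_goal"]):
                    # Replace the desired-goal trajectory of Ei (up to t)
                    # with that of Ej to form the imagined episode Ei'.
                    new_ep = []
                    for k in range(t + 1):
                        step = dict(ep_i[k])
                        step["desired_goal"] = ep_j[k]["desired_goal"]
                        new_ep.append(step)
                    return new_ep
    return None  # no matching pair of failures found
```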
Experiments

1. Fetch environments: The Fetch environments are implemented using the OpenAI Gym and Baselines libraries (Dhariwal et al., 2017) and the MuJoCo physics engine (Todorov et al., 2012). We have analysed HER performance on four different Fetch environments (a short interaction sketch follows the list):
(i) FetchReach: move the robot arm from its initial location to a specific target location.
(ii) FetchPush: move a box from its initial location by pushing it with the robot arm until it reaches the desired target location.
(iii) FetchPickAndPlace: move an object from its initial location to a target location by gripping the object with the robot arm.
(iv) FetchSlide: the robot arm hits a puck so that it slides and comes to rest at the desired location.
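For reference, a minimal interaction loop with one of these goal-based environments is sketched below, using the older Gym API (four-value `step`, dictionary observations) that the Baselines HER code expects; the environment id and the need for a working MuJoCo installation depend on the installed gym version.

```python
import gym

# Goal-based Fetch task; requires the gym robotics environments and MuJoCo.
env = gym.make("FetchReach-v1")

obs = env.reset()
# Dict observation with the three goal-related fields HER relies on.
print(obs["observation"].shape, obs["achieved_goal"], obs["desired_goal"])

done = False
while not done:
    action = env.action_space.sample()          # random policy, for illustration
    obs, reward, done, info = env.step(action)  # reward is -1 or 0 (sparse)

# The sparse reward can be recomputed for any (achieved, desired) pair,
# which is exactly what HER exploits when relabeling transitions.
r = env.compute_reward(obs["achieved_goal"], obs["desired_goal"], info)
```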
In FetchReach, after the first few epochs the train success rate is greater than 0.9 and the test success rate is consistently close to 1. Similarly, for FetchPush the train success rate is close to 0.9 and the test success rate is close to 1 after a certain number of epochs.

Figure 1. FetchReach, FetchPush and FetchSlide with DDPG+HER.

However, in FetchSlide the success rate is low (less than 0.2) in most cases. But if a certain set of states (those encountered after the transition being replayed) is used as goal states in the replayed episodes, the success rate may increase to 0.4 (Andrychowicz et al., 2017). Using DDPG alone or DQN alone gives a success rate of less than 0.2 in all the Fetch environments (Andrychowicz et al., 2017).

Figure 2. FetchPickAndPlace with and without demonstrations.

In the FetchPickAndPlace experiment done in the HER paper, the success rate of HER is close to 1. This is because half of the rollouts are initialised with the robot arm already gripping the object. There is a significant increase in train and test success rate when FetchPickAndPlace is run with HER + demonstrations.

2. Dynamic goal environments:
(i) DySnake: A snake and a piece of food (the goal) are placed on a rectangular map. The task is to control the movement of the snake to eat the food. The food moves from one position to another with a fixed velocity.
(ii) DyReach: Move a robotic arm so that its gripper reaches the target position. The target moves along a straight line segment with fixed velocity.

Vanilla HER fails to perform in dynamic environments, while DHER shows significant success. In DySnake, which is a discrete control environment, DQN + DHER achieves an average median success rate of 89.5%.

Figure 3. Dynamic goal environments.

In the DyReach environment, DDPG + DHER achieves an average median success rate of 72.5%.

3. Bit Flip: The agent is given an initial binary state and a goal vector in S = {0, 1}^n. To get from the initial state to the goal state one bit has to be flipped at each step, which gives a state space of size 2^n (here n = 50). DQN fails to reach the goal, but HER combined with DQN is able to solve it (a minimal sketch of this environment follows the figure).

Figure 4. Bit-flip.
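To make the setup concrete, here is a minimal sketch of such a bit-flipping environment; the class name and the convention of ending the episode on success are our own choices for illustration.

```python
import numpy as np

class BitFlipEnv:
    """n-bit flipping environment with a sparse binary reward.

    State and goal are binary vectors of length n; an action flips one bit.
    The reward is 0 when the state equals the goal and -1 otherwise.
    """

    def __init__(self, n=50, seed=0):
        self.n = n
        self.rng = np.random.RandomState(seed)

    def reset(self):
        self.state = self.rng.randint(2, size=self.n)
        self.goal = self.rng.randint(2, size=self.n)
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        self.state[action] ^= 1                  # flip the chosen bit
        done = np.array_equal(self.state, self.goal)
        reward = 0.0 if done else -1.0           # sparse binary reward
        return self.state.copy(), reward, done
```

With n = 50, random exploration essentially never produces a non-negative reward, which is why DQN alone fails; relabeling the states actually reached as goals gives HER a useful learning signal in every episode.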
Conclusion

We have applied HER to sparse reward environments. Using demonstrations further improves its success rate, and a modification of the algorithm (DHER) enables it to solve dynamic goal environments. There are several areas in which the existing techniques can be improved to enable the agent to perform complex actions: current techniques for multi-object stacking have a low success rate when the number of objects grows beyond a certain limit. In the long term, we can also explore environments in which coordination between multiple agents is required to perform a task.
Acknowledgements

Special thanks to Shubham for guiding us in this project.

References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay, 2017.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Fang, M., Zhou, C., Shi, B., Gong, B., Xi, W., Wang, T., Xu, J., and Zhang, T. DHER: Hindsight experience replay for dynamic goals. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Byf5-30qFX.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016. URL http://arxiv.org/abs/1611.05397.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning, 2015.

Liu, H., Trott, A., Socher, R., and Xiong, C. Competitive experience replay. CoRR, abs/1902.00528, 2019. URL http://arxiv.org/abs/1902.00528.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 0028-0836. URL http://dx.doi.org/10.1038/nature14236.

Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299, 2018.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. CoRR, abs/1705.05363, 2017. URL http://arxiv.org/abs/1705.05363.

Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4344–4353, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/riedmiller18a.html.

Silver, D. UCL course on reinforcement learning, 2020. URL https://www.davidsilver.uk/teaching/.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 387–395, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/silver14.html.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.

Trott, A., Zheng, S., Xiong, C., and Socher, R. Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 10376–10386. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9225-keeping-your-distance-solving-sparse-reward-tasks-using-self-balancing-shaped-rewards.pdf.
