Potential-Based Shaping in Model-Based Reinforcement Learning
where α is a learning rate and “x ←α y” means that x is assigned the value (1 − α)x + αy. The Q function estimate is used to make an action decision in state s ∈ S, usually by choosing the a ∈ A such that Q̂(s, a) is maximized. Q-learning is actually a family of algorithms. A complete Q-learning specification includes rules for initializing Q̂, for changing α over time, and for selecting actions. The primary theoretical guarantee provided by Q-learning is that Q̂ converges to Q in the limit of infinite experience, if α is decreased at the right rate and all (s, a) pairs begin an infinite number of experience tuples.
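To make this concrete, here is a minimal sketch of one member of the family in Python; the zero initialization, constant α, and ε-greedy action selection are illustrative choices of ours, not settings prescribed above.

import random
from collections import defaultdict

def make_q_learner(actions, gamma=0.95, alpha=0.02, epsilon=0.1):
    """One complete Q-learning specification: zero-initialized Q-hat,
    constant step size, epsilon-greedy action selection."""
    q = defaultdict(float)  # Q-hat(s, a), zero until updated

    def act(s):
        # Usually: choose the a maximizing Q-hat(s, a); explore occasionally.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    def update(s, a, r, s_next):
        # "x <-_alpha y" means x is assigned the value (1 - alpha)x + alpha*y.
        target = r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target

    return act, update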
A prototypical model-based algorithm is Rmax (Brafman & Tennenholtz 2002), which creates “optimistic” estimates T̂ and R̂ of T and R. Specifically, Rmax hypothesizes a state s_max such that T̂(s_max, a, s_max) = 1 and R̂(s_max, a) = 0 for all a. It also initializes T̂(s, a, s_max) = 1 and R̂(s, a) = v_max for all s and a. (Unmentioned transitions have probability 0.) It keeps a transition count c(s, a, s′) and a reward total t(s, a), both initialized to zero. In its practical form, it also has an integer parameter m ≥ 0. Each experience tuple ⟨s, a, s′, r⟩ results in an update c(s, a, s′) ← c(s, a, s′) + 1 and t(s, a) ← t(s, a) + r. Then, if Σ_{s′} c(s, a, s′) = m, the algorithm modifies T̂(s, a, s′) = c(s, a, s′)/m and R̂(s, a) = t(s, a)/m. At this point, the algorithm computes

    Q̂(s, a) = R̂(s, a) + γ Σ_{s′} T̂(s, a, s′) max_{a′} Q̂(s′, a′)

for all state–action pairs. When in state s, the algorithm chooses the action that maximizes Q̂(s, a). Rmax guarantees, with high probability, that it will take near-optimal actions on all but a polynomial number of steps if m is set sufficiently high (Kakade 2003).
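The bookkeeping above is compact enough to sketch directly. The Python below uses our own (hypothetical) container choices and leaves the planning step (recomputing Q̂ once a pair becomes known) as a signal to the caller:

from collections import defaultdict

class RmaxModel:
    """Counts c(s, a, s'), reward totals t(s, a), and the estimates
    T-hat and R-hat that are frozen once a pair has been tried m times.
    Unknown pairs are planned for as if they reached the optimistic
    state s_max (value v_max) with certainty."""

    def __init__(self, m, v_max):
        self.m, self.v_max = m, v_max
        self.c = defaultdict(lambda: defaultdict(int))  # c[(s, a)][s']
        self.t = defaultdict(float)                     # t[(s, a)]
        self.T_hat = {}  # (s, a) -> {s': probability}, known pairs only
        self.R_hat = {}  # (s, a) -> mean reward, known pairs only

    def known(self, s, a):
        return (s, a) in self.T_hat

    def observe(self, s, a, s_next, r):
        sa = (s, a)
        self.c[sa][s_next] += 1
        self.t[sa] += r
        if sum(self.c[sa].values()) == self.m:
            # The pair just became known: adopt the empirical estimates.
            self.T_hat[sa] = {sp: n / self.m for sp, n in self.c[sa].items()}
            self.R_hat[sa] = self.t[sa] / self.m
            return True  # caller should now re-solve the model for Q-hat
        return False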
We refer to any state–action pair (s, a) such that Σ_{s′} c(s, a, s′) < m as unknown and otherwise known. Notice that, each time a state–action pair becomes known, Rmax performs a great deal of computation, solving an approximate MDP model. In contrast, Q-learning consistently has low per-experience computational complexity. The principal advantage of Rmax, then, is that it makes more efficient use of the experience it gathers, at the cost of increased computational complexity.
Shaping and model-based learning have both been shown to decrease experience complexity compared to a straightforward implementation of Q-learning. In this paper, we show that the benefits of these two ideas are orthogonal—their combination is more effective than either approach alone. The next section formalizes the effect of potential-based shaping in both model-free and model-based settings.

Comparing Definitions of Return

We next discuss several notions of how a trajectory is summarized by a single value, the return. We address the original discounted framework first. For a trajectory s̄ = s_0, s_1, . . . with rewards r_0, r_1, . . ., the return for this sequence is taken to be

    U(s̄) = Σ_{i=0}^∞ γ^i r_i.

Note that the Q function for any policy can be decomposed into an average of returns like this:

    Q(s, a) = Σ_{s̄} Pr(s̄ | s, a) U(s̄).

Here, the averaging weights Pr(s̄ | s, a) depend on the likelihood of the sequence as determined by the dynamics of the MDP and the agent’s policy.

Shaping Return

In potential-based shaping (Ng, Harada, & Russell 1999), the system designer provides the agent with a shaping function Φ(s), which maps each state to a real value. For shaping to be successful, Φ should be higher for states with higher return.

The shaping function is used to modify the reward for a transition from s to s′ by adding F(s, s′) = γΦ(s′) − Φ(s) to it. The Φ-shaped return for s̄ becomes

    UΦ(s̄) = Σ_{i=0}^∞ γ^i (r_i + F(s_i, s_{i+1}))
          = Σ_{i=0}^∞ γ^i (r_i + γΦ(s_{i+1}) − Φ(s_i))
          = Σ_{i=0}^∞ γ^i r_i + Σ_{i=0}^∞ γ^{i+1} Φ(s_{i+1}) − Σ_{i=0}^∞ γ^i Φ(s_i)
          = U(s̄) + Σ_{i=1}^∞ γ^i Φ(s_i) − Φ(s_0) − Σ_{i=1}^∞ γ^i Φ(s_i)
          = U(s̄) − Φ(s_0).

That is, the shaped return is the (unshaped) return minus the shaping function’s value for the starting state. Since this shift, −Φ(s_0), does not depend on the actions or any states past the initial one, the shaped Q function is similarly just a shift downward for each state: QΦ(s, a) = Q(s, a) − Φ(s). Therefore, the optimal policy for the shaped MDP (the MDP with shaping rewards added in) is precisely the same as that of the original MDP (Ng, Harada, & Russell 1999).
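The telescoping identity above is easy to check numerically. The sketch below (our own verification, not from the original text) compares shaped and unshaped returns on a random finite trajectory; truncating at length n leaves a γ^n Φ(s_n) remainder that vanishes as n grows:

import random

def check_shaped_return(gamma=0.9, n=200, seed=1):
    """Verify U_phi(s-bar) = U(s-bar) - Phi(s_0) + gamma^n * Phi(s_n)
    on a length-n trajectory; the last term is the truncation remainder."""
    rng = random.Random(seed)
    states = [rng.randrange(10) for _ in range(n + 1)]
    rewards = [rng.uniform(-1, 1) for _ in range(n)]
    phi = [rng.uniform(0, 5) for _ in range(10)]  # Phi(s) for 10 states

    F = lambda s, sp: gamma * phi[sp] - phi[s]    # shaping reward
    u = sum(gamma**i * rewards[i] for i in range(n))
    u_phi = sum(gamma**i * (rewards[i] + F(states[i], states[i + 1]))
                for i in range(n))

    remainder = gamma**n * phi[states[n]]
    assert abs(u_phi - (u - phi[states[0]] + remainder)) < 1e-9
    return u, u_phi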
Rmax Return

In the context of the Rmax algorithm, at an intermediate stage of learning, the algorithm has built a partial model of the environment. In the known states, accurate estimates of the transitions and rewards are available. When an unknown state is reached, the agent assumes its value is v_max, that is, an upper bound on the largest possible value.
For simplicity of analysis, we consider how the notion of return is adapted in the Rmax framework if all rewards take on their true values until an unknown state–action pair is reached. In this setting, the estimated return for a trajectory in Rmax is

    URmax(s̄) = Σ_{i=0}^{T−1} γ^i r_i + γ^T v_max,

where T is the time at which the first unknown state–action pair is encountered.
Writing UΦRmax for the analogous shaped Rmax return, the value estimate decomposes as

    Q̂(s, a) = Σ_{s̄} Pr(s̄ | s, a) UΦRmax(s̄)
            = Σ_{s̄} Pr(s̄ | s, a) (URmax(s̄) − γ^T v_max − Φ(s_0) + γ^T Φ(s_T))
            ≥ Σ_{s̄} Pr(s̄ | s, a) (U(s̄) − Φ(s_0))
            = Q(s, a) − Φ(s_0)
            = QΦ(s, a),

where the inequality holds when Φ is admissible, that is, when Φ(s) upper-bounds the return achievable from s. Thus, running Rmax with any admissible shaping function leads to a PAC-MDP learning algorithm.
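Read as an implementation recipe, the analysis suggests that Rmax with shaping uses Φ in place of the flat optimistic value at unknown pairs (under the v_max = 0 convention of the Figure 2(a) example below). The sketch reuses the hypothetical RmaxModel from earlier and reflects our reading, not the authors' code:

def shaped_rmax_backup(q, model, phi, states, actions, gamma):
    """One sweep of value iteration on the optimistic shaped model.
    Unknown (s, a) pairs take the heuristic value phi(s); known pairs
    are backed up through the learned estimates T-hat and R-hat."""
    for s in states:
        for a in actions:
            if not model.known(s, a):
                q[(s, a)] = phi(s)  # optimistic stand-in, replaces v_max
            else:
                q[(s, a)] = model.R_hat[(s, a)] + gamma * sum(
                    p * max(q.get((sp, b), 0.0) for b in actions)
                    for sp, p in model.T_hat[(s, a)].items())
    return q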
Note that we can achieve equivalent behavior to Rmax with shaping by only modifying the reward for unknown state–action pairs. That is, if we define the reward for the first transition from a known state s to an unknown state s′ as being incremented by Φ(s′), precisely the same exploration policy results. This equivalence between two views of model-based shaping approaches is reminiscent of the equivalence between potential-based shaping and value-function initialization in the model-free setting (Wiewiora 2003).

Figure 2(a) demonstrates why not using an admissible shaping function can cause problems for Rmax with shaping. Consider the case where s_0 and s_1 are known and s_2 is unknown. Rewards are zero and v_max = 0. So, Q̂(s_0, a_1) = V(s_1) and Q̂(s_0, a_2) = Φ(s_2). If V(s_1) > Φ(s_2), the agent will never feel the need to take action a_2 when in state s_0, since it believes that such a policy would be suboptimal. However, if V(s_2) > V(s_1), which can happen if Φ is not admissible, the agent will behave suboptimally indefinitely.

Figure 1: A 15x15 grid with a fixed start (S) and goal (G) location. Gray squares are episode-ending pits.

Figure 2: (a) Simple MDP to illustrate why inadmissible shaping functions can lead to learning suboptimal policies. (b) A 4x6 grid.
Experiments
We evaluated six algorithms to observe the effects shaping has on a model-free algorithm, Q-learning; a model-based algorithm, Rmax; and ARTDP (Barto, Bradtke, & Singh 1995), which has elements of both. We ran the algorithms in two novel test environments.¹

¹ The benchmark environments used by Ng, Harada, & Russell (1999) were not sufficiently challenging—Q-learning with shaping learned almost instantly in these problems.

RTDP (Real-Time Dynamic Programming) is an algorithm for solving MDPs with known connections to the theory of admissible heuristics. The RL version, Adaptive RTDP (ARTDP), combines learning models and heuristics and is therefore a natural point of comparison for Rmax with shaping. Lacking adequate space for a detailed comparison, we note that ARTDP is like Rmax with shaping except with (1) m = 1, (2) Boltzmann exploration, and (3) incremental planning. Difference (1) means ARTDP is not PAC-MDP. Difference (3) means its per-experience computational complexity is closer to that of Q-learning. In our tests, we ran ARTDP with exploration parameters T_max = 100, T_min = .01, and β = .996, which seemed to strike an acceptable balance between exploration and exploitation.
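We read these exploration parameters as Boltzmann (softmax) action selection with a temperature that starts at T_max, decays by the factor β each step, and is floored at T_min; this schedule is our interpretation, not a specification from Barto, Bradtke, & Singh (1995). A minimal sketch:

import math, random

def boltzmann_action(q_values, temperature):
    """Pick action a with probability proportional to exp(Q(s, a) / T)."""
    m = max(q_values.values())  # subtract max for numerical stability
    weights = [(a, math.exp((v - m) / temperature))
               for a, v in q_values.items()]
    r = random.random() * sum(w for _, w in weights)
    for a, w in weights:
        r -= w
        if r <= 0:
            return a
    return weights[-1][0]  # guard against floating-point leftovers

def next_temperature(t, t_min=0.01, beta=0.996):
    """Geometric decay toward the floor: T <- max(T_min, beta * T)."""
    return max(t_min, beta * t)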
We ran Q-learning with 0.10 greedy exploration (it chose the action with the highest estimated Q value with probability 0.90 and otherwise a random action) and α = 0.02. These previously published settings (Ng, Harada, & Russell 1999) were not optimized for the new environments. Rmax used a known-state parameter of m = 5, and all reported results were averaged over forty replications to reduce noise.

For experiments, we used maze environments with dynamics similar to those published elsewhere (Russell & Norvig 1994; Leffler, Littman, & Edmunds 2007). Our first environment consisted of a 15x15 grid (Figure 1). It has start (S) and goal (G) locations, gray squares representing pits, and dark lines representing impassable walls. The agent receives a reward of +1 for reaching the goal, −1 for falling into a pit, and a −0.001 step cost. Reaching the goal or a pit ends an episode, and γ = 1.
Actions move the agent in each of the 4 compass directions, but actions can fail with probability 0.2, resulting in a slip to one side or the other with equal probability. For example, if the agent chooses action “east” from the location right above the start state, it will go east with probability 0.8, south with probability 0.1, and north with probability 0.1. Movement that would take the agent through a wall results in no change of state.
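These dynamics are easy to sketch; the coordinate convention and the blocked(s, s′) wall test below are our own illustrative choices:

import random

MOVES = {'north': (-1, 0), 'south': (1, 0), 'east': (0, 1), 'west': (0, -1)}
SLIPS = {'north': ('west', 'east'), 'south': ('west', 'east'),
         'east': ('north', 'south'), 'west': ('north', 'south')}

def step(state, action, blocked, p_slip=0.2):
    """Execute the intended move with probability 0.8; otherwise slip to
    one side or the other with probability 0.1 each. Moves into a wall
    leave the state unchanged."""
    if random.random() < p_slip:
        action = random.choice(SLIPS[action])
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    return state if blocked(state, nxt) else nxt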
In the experiments presented, the shaping function was Φ(s) = C × D(s)/ρ + G, where C ≤ 0 is the per-step cost, D(s) is the shortest-path distance from s to the goal, ρ is the probability that the agent goes in the direction intended, and G is the reward for reaching the goal. D(s)/ρ is a lower bound on the expected number of steps to reach the goal, since every step the agent makes (using the optimal policy) has chance ρ of succeeding. The ignored chance of failure makes Φ admissible.
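Computing Φ reduces to a breadth-first search from the goal. A sketch, assuming a neighbors(s) function that respects the maze's walls (both names are ours):

from collections import deque

def goal_distances(goal, neighbors):
    """Breadth-first search from the goal: D(s) for every reachable s."""
    dist = {goal: 0}
    frontier = deque([goal])
    while frontier:
        s = frontier.popleft()
        for t in neighbors(s):
            if t not in dist:
                dist[t] = dist[s] + 1
                frontier.append(t)
    return dist

def make_phi(goal, neighbors, C=-0.001, rho=0.8, G=1.0):
    """Phi(s) = C * D(s) / rho + G. Since D(s) / rho lower-bounds the
    expected steps to the goal and C <= 0, Phi upper-bounds the value."""
    dist = goal_distances(goal, neighbors)
    return lambda s: C * dist[s] / rho + G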
Figure 3 shows the cumulative reward obtained by the six algorithms in 1500 learning episodes. In this environment, without shaping, the agents are reduced to exploring randomly from the start state, very often ending up in a pit before stumbling across the goal. ARTDP and Q-learning have a very difficult time locating the goal until around 1500 and 8000 episodes, respectively (not shown). Rmax, in contrast, explores more systematically and begins performing well after approximately 900 episodes.

The admissible shaping function leads the agents toward the goal more effectively. Shaping helps Q-learning master the problem in about half the time. Similarly, Rmax with shaping succeeds in about half the time of Rmax. Using the shaping function as a heuristic (initial Q values) helps ARTDP most of all; its aggressive learning pays off after about 300 episodes.

In this example, ARTDP outperformed Rmax using the same heuristic. To achieve its PAC-MDP property, Rmax is conservative. We constructed a modified maze, Figure 2(b), to demonstrate the benefit of this conservatism. We modified the environment by changing the step cost to −0.1 and the goal reward to +5, and adding a “reset” action that teleported the agent back to the start state, incurring the step cost. Otherwise, the structure is the same as the previous maze. The optimal policy is to walk the “bridge” lined by risky pits (akin to a combination lock).
Figure 4 presents the results of the six algorithms. The main difference from the 15x15 maze is that ARTDP is not able to find the optimal path reliably, even using the heuristic. The reason is that there is a high probability that it will fall into a pit the first time it walks on the bridge. It models this outcome as a dead end, resisting future attempts to sample that action. Increasing the exploration rate could help prevent this modeling error, but at the cost of the algorithm randomly resetting much more often, hurting its opportunity to explore.
Conclusions

We have defined a novel reinforcement-learning algorithm that extends both Rmax, a model-based approach, and potential-based shaping, a method for introducing “hints” to the learning agent. We argued that, for shaping functions that are “admissible”, this algorithm retains the formal guarantees of Rmax, a PAC-MDP algorithm. Finally, we showed that the resulting combination of model learning and reward shaping can learn near-optimal policies more quickly than either innovation in isolation.

In this work, we did not examine how shaping can best be combined with methods that generalize experience across states. However, our approach meshes naturally with existing Rmax enhancements along these lines, such as factored Rmax (Guestrin, Patrascu, & Schuurmans 2002) and RAM-Rmax (Leffler, Littman, & Edmunds 2007). Of particular interest in future work will be integrating these ideas with more powerful function approximation in the Q functions or the model to help the methods scale to larger state spaces.

Acknowledgments

The authors thank DARPA TL and NSF CISE for their generous support.

References

Barto, A. G.; Bradtke, S. J.; and Singh, S. P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72(1):81–138.

Brafman, R. I., and Tennenholtz, M. 2002. R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.

Guestrin, C.; Patrascu, R.; and Schuurmans, D. 2002. Algorithm-directed exploration for model-based reinforcement learning in factored MDPs. In Proceedings of the International Conference on Machine Learning, 235–242.

Hansen, E. A., and Zilberstein, S. 1999. Solving Markov decision problems using heuristic search. In Proceedings of the AAAI Spring Symposium on Search Techniques for Problem Solving Under Uncertainty and Incomplete Information, 42–47.

Kakade, S. M. 2003. On the Sample Complexity of Reinforcement Learning. Ph.D. Dissertation, Gatsby Computational Neuroscience Unit, University College London.

Leffler, B. R.; Littman, M. L.; and Edmunds, T. 2007. Efficient reinforcement learning with relocatable action models. In Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-07).

Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, 278–287.

Puterman, M. L. 1994. Markov Decision Processes—Discrete Stochastic Dynamic Programming. New York, NY: John Wiley & Sons, Inc.

Russell, S. J., and Norvig, P. 1994. Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall.

Strehl, A. L.; Li, L.; and Littman, M. L. 2006. PAC reinforcement learning bounds for RTDP and rand-RTDP. In AAAI 2006 Workshop on Learning For Search.

Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learning 8(3):279–292.

Wiewiora, E. 2003. Potential-based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research 19:205–208.
Figure 3: A cumulative plot of the reward received over 1500 learning episodes in the 15x15 grid example.
Figure 4: A cumulative plot of the reward received over the first 2000 learning episodes in the 4x6 grid example.