Convergence of Q-Learning
Francisco S. Melo
Institute for Systems and Robotics,
Instituto Superior Técnico,
Lisboa, PORTUGAL
fmelo@isr.ist.utl.pt
1 Preliminaries
We denote a Markov decision process as a tuple (X, A, P, r), where
• X is the (finite) state-space;
• A is the (finite) action-space;
• P represents the transition probabilities;
• r represents the reward function.
We denote elements of X as x and y, and elements of A as a and b. We consider
the general situation where the reward is defined over triplets (x, a, y), i.e., r is
a function

$$r : X \times A \times X \longrightarrow \mathbb{R}$$

assigning a reward r(x, a, y) every time a transition from x to y occurs due to
action a. We assume r to be a bounded, deterministic function.
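For concreteness, the following minimal sketch (illustrative only, not part of the original development; all sizes, probabilities and rewards are arbitrary choices) encodes such a finite MDP with numpy arrays. The later sketches in this note reuse these definitions.

import numpy as np

# A small finite MDP, encoded as arrays (sizes and values are
# illustrative choices, not taken from the text).
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, x, y] = P_a(x, y); each row P[a, x, :] is a probability
# distribution over successor states y.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# R[x, a, y] = r(x, a, y): a bounded, deterministic reward on triplets.
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, n_states))

gamma = 0.9  # discount factor, 0 <= gamma < 1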
The value of a state x is defined, for a sequence of controls {A_t}, as

$$J(x, \{A_t\}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(X_t, A_t) \,\Big|\, X_0 = x\right],$$

where 0 ≤ γ < 1 is a discount factor. The optimal value function is defined, for each state x, as

$$V^*(x) = \max_{\{A_t\}} J(x, \{A_t\})$$

and verifies

$$V^*(x) = \max_{a \in A} \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma V^*(y) \right].$$

From here, we define the optimal Q-function Q* as

$$Q^*(x, a) = \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma V^*(y) \right].$$
The optimal Q-function is a fixed point of a contraction operator H, defined
for a generic function q : X × A → R as

$$(Hq)(x, a) = \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma \max_{b \in A} q(y, b) \right].$$
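As an illustration (a sketch under the assumptions of the previous snippet, not part of the original note), the operator H translates directly into code on the arrays above. Iterating it from any initial q converges to the fixed point Q*, and V*(x) = max_a Q*(x, a) recovers the optimal value function.

def H(q):
    # (Hq)(x, a) = sum_y P_a(x, y) [ r(x, a, y) + gamma * max_b q(y, b) ]
    # q has shape (n_states, n_actions); P, R, gamma are defined above.
    target = R + gamma * q.max(axis=1)[None, None, :]  # indexed by (x, a, y)
    return np.einsum('axy,xay->xa', P, target)         # sum over y -> (x, a)

# H is a gamma-contraction, so iterating it converges to its fixed point Q*.
Q_star = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_star = H(Q_star)
V_star = Q_star.max(axis=1)  # V*(x) = max_a Q*(x, a)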
This operator is a contraction in the sup-norm, i.e.,

$$\| Hq_1 - Hq_2 \|_\infty \le \gamma \, \| q_1 - q_2 \|_\infty. \tag{1}$$

Indeed,

$$\begin{aligned}
\| Hq_1 - Hq_2 \|_\infty
&= \max_{x,a} \left| \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma \max_{b \in A} q_1(y, b) - r(x, a, y) - \gamma \max_{b \in A} q_2(y, b) \right] \right| \\
&= \max_{x,a} \gamma \left| \sum_{y \in X} P_a(x, y) \left[ \max_{b \in A} q_1(y, b) - \max_{b \in A} q_2(y, b) \right] \right| \\
&\le \max_{x,a} \gamma \sum_{y \in X} P_a(x, y) \left| \max_{b \in A} q_1(y, b) - \max_{b \in A} q_2(y, b) \right| \\
&\le \max_{x,a} \gamma \sum_{y \in X} P_a(x, y) \max_{z, b} \left| q_1(z, b) - q_2(z, b) \right| \\
&= \max_{x,a} \gamma \sum_{y \in X} P_a(x, y) \, \| q_1 - q_2 \|_\infty \\
&= \gamma \, \| q_1 - q_2 \|_\infty.
\end{aligned}$$
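A quick numerical sanity check of (1) on the running example (illustrative only, reusing H, P and gamma from the sketches above):

# Draw two arbitrary Q-functions and check (1) numerically.
q1 = rng.normal(size=(n_states, n_actions))
q2 = rng.normal(size=(n_states, n_actions))
lhs = np.abs(H(q1) - H(q2)).max()      # ||Hq1 - Hq2||_inf
rhs = gamma * np.abs(q1 - q2).max()    # gamma * ||q1 - q2||_inf
assert lhs <= rhs + 1e-12, (lhs, rhs)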
2 The Q-learning algorithm

The Q-learning algorithm determines the optimal Q-function using point samples. Let π be some random policy such that

$$P_\pi \left[ A_t = a \mid X_t = x \right] > 0$$

for all state-action pairs (x, a). Let {x_t} be a sequence of states obtained following policy π, {a_t} the sequence of corresponding actions and {r_t} the sequence of obtained rewards. Then, given any initial estimate Q_0, Q-learning uses the following update rule:

$$Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) - Q_t(x_t, a_t) \right],$$

where the step-sizes α_t(x, a) verify 0 ≤ α_t(x, a) ≤ 1. This means that, at the
(t + 1)th update, only the component (x_t, a_t) is updated.¹
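The update rule transcribes directly into code on the running example. In this sketch, the uniformly random behavior policy and the 1/n step-size schedule are choices made here for concreteness (they satisfy the conditions of the theorem below, but are not prescribed by the text).

def sample_step(x, a):
    # Draw y ~ P_a(x, .) and return the triplet reward r(x, a, y).
    y = rng.choice(n_states, p=P[a, x])
    return y, R[x, a, y]

Q = np.zeros((n_states, n_actions))       # initial estimate Q_0
visits = np.zeros((n_states, n_actions))  # n_t(x, a), for the step sizes
x = 0
for t in range(200_000):
    a = rng.integers(n_actions)  # random policy: P_pi[A_t = a | X_t = x] > 0
    y, r = sample_step(x, a)
    visits[x, a] += 1
    alpha = 1.0 / visits[x, a]   # alpha_t(x, a) in [0, 1]
    # Update rule (2): only the component (x_t, a_t) is updated.
    Q[x, a] += alpha * (r + gamma * Q[y].max() - Q[x, a])
    x = y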
This leads to the following result.
¹ There are variations of Q-learning that use a single transition tuple (x, a, y, r) to perform updates at multiple state-action pairs, as is the case of Q-learning with spreading [2].
Theorem 1. Given a finite MDP (X, A, P, r), the Q-learning algorithm, given
by the update rule

$$Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) - Q_t(x_t, a_t) \right], \tag{2}$$

converges w.p.1 to the optimal Q-function as long as

$$\sum_t \alpha_t(x, a) = \infty, \qquad \sum_t \alpha_t^2(x, a) < \infty$$

for all (x, a) ∈ X × A.

Notice that the first condition requires, in particular, that every state-action pair be visited infinitely often. A common choice verifying both conditions is α_t(x, a) = 1/n_t(x, a), where n_t(x, a) is the number of visits to (x, a) up to time t.

Proof. We start by rewriting (2) as

$$Q_{t+1}(x_t, a_t) = \left(1 - \alpha_t(x_t, a_t)\right) Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) \right].$$

Subtracting Q*(x_t, a_t) from both sides and letting Δ_t(x, a) = Q_t(x, a) − Q*(x, a) yields

$$\Delta_{t+1}(x_t, a_t) = \left(1 - \alpha_t(x_t, a_t)\right) \Delta_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) - Q^*(x_t, a_t) \right].$$

If we write

$$F_t(x, a) = r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) - Q^*(x, a),$$

where X(x, a) is a random sample state obtained from the Markov chain (X, P_a),
we have, with $\mathcal{F}_t$ denoting the past of the process up to step t,

$$\mathbb{E}\left[ F_t(x, a) \mid \mathcal{F}_t \right] = \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma \max_{b \in A} Q_t(y, b) \right] - Q^*(x, a) = (HQ_t)(x, a) - Q^*(x, a).$$

Using the fact that Q* = HQ*,

$$\mathbb{E}\left[ F_t(x, a) \mid \mathcal{F}_t \right] = (HQ_t)(x, a) - (HQ^*)(x, a),$$

and it follows from (1) that

$$\left\| \mathbb{E}\left[ F_t(x, a) \mid \mathcal{F}_t \right] \right\|_\infty \le \gamma \, \| Q_t - Q^* \|_\infty = \gamma \, \| \Delta_t \|_\infty.$$
Finally,

$$\begin{aligned}
\mathrm{var}\left[ F_t(x, a) \mid \mathcal{F}_t \right]
&= \mathbb{E}\left[ \left( r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) - Q^*(x, a) - (HQ_t)(x, a) + Q^*(x, a) \right)^2 \right] \\
&= \mathbb{E}\left[ \left( r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) - (HQ_t)(x, a) \right)^2 \right] \\
&= \mathrm{var}\left[ r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) \mid \mathcal{F}_t \right],
\end{aligned}$$

which, since r is bounded, verifies

$$\mathrm{var}\left[ F_t(x, a) \mid \mathcal{F}_t \right] \le C \left( 1 + \| \Delta_t \|_\infty^2 \right)$$

for some constant C. Then, by Theorem 1 of [1], Δ_t converges to zero w.p.1, i.e., Q_t converges to Q* w.p.1.
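Continuing the running example, a quick empirical check of the theorem compares the Q estimated by the Q-learning loop above with the fixed point Q_star obtained earlier by iterating H (illustrative only, not part of the proof):

# The learned Q should approach the fixed point Q* computed by iterating H.
print(np.abs(Q - Q_star).max())  # expected to shrink as the number of steps grows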
References
[1] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the con-
vergence of stochastic iterative dynamic programming algorithms. Neural
Computation, 6(6):1185–1201, 1994.
[2] Carlos Ribeiro and Csaba Szepesvári. Q-learning combined with spreading:
Convergence and results. In Proceedings of the ISRF-IEE International Con-
ference: Intelligent and Cognitive Systems (Neural Networks Symposium),
pages 32–36, 1996.