Valuetools Preprint
Vishesh Dhingra
Department of Mathematics
Indian Institute of Technology Delhi
New Delhi, India
vishesh.dhingra@gmail.com

Stéphane Gaubert
INRIA
Domaine de Voluceau
78153 Le Chesnay Cedex, France
Stephane.Gaubert@inria.fr
ABSTRACT
G.2.2 [Discrete Mathematics]: Graph Theory; G.4 [Mathematical software]: Algorithm design and analysis
General Terms
Algorithms, Performance
Keywords
Policy iteration, repeated games, graph algorithms, maxplus algebra, nonlinear harmonic functions
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Valuetools 06, October 11-13, 2006, Pisa, Italy
Copyright 2006 ACM 1-59593-504-5 ...$5.00.
1.2
g : R^p → R^n,  g_j(y) = min_{(j',i)∈E} (r_{j'i} + y_i),  y ∈ R^p,  (1)

and

g0 : R^n → R^p,  g0_i(x) = max_{(i,j')∈E} (r_{ij'} + x_j),  x ∈ R^n.  (2)

We set:

f = g ∘ g0.
Maps of this form are called min-max functions after the
work of Olsder [23] and Gunawardena [14].
For instance, the maps corresponding to the game of Fig. 1 are given by:

g(y) = ( (2 + y1) ∧ y3,  (1 + y1) ∧ (6 + y2),  (9 + y1) ∧ (−5 + y2),  (−3 + y1) ∧ (5 + y3) )^T,

g0(x) = ( (1 + x1) ∨ (7 + x2),  x3,  (2 + x4) ∨ (11 + x3) )^T,

where a ∧ b = min(a, b) and a ∨ b = max(a, b).
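To make the operators concrete, here is a minimal Python sketch of a min-max function of this shape. The arc weights below follow the example above (including the signs, which are our reading of it) and should be treated as illustrative:

```python
# Sketch: evaluating a min-max function f = g o g0 coordinate-wise.
# Weights are those of the running example, taken as illustrative.

def g(y):
    # Min's operator: one min-plus expression per node 1', ..., 4'
    return [
        min(2 + y[0], y[2]),
        min(1 + y[0], 6 + y[1]),
        min(9 + y[0], -5 + y[1]),
        min(-3 + y[0], 5 + y[2]),
    ]

def g0(x):
    # Max's operator: one max-plus expression per node 1, 2, 3
    return [
        max(1 + x[0], 7 + x[1]),
        x[2],
        max(2 + x[3], 11 + x[2]),
    ]

def f(x):
    return g(g0(x))

print(f([0, 0, 0, 0]))  # -> [9, 6, -5, 4]
```

Note that f(x + c) = f(x) + c for any constant c, as expected of a dynamic programming operator.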
The value of the game in finite horizon k is the vector v_k = f(v_{k−1}), with v_0 = 0, and the cycle time vector of f is the limit χ(f) = lim_{k→∞} v_k / k, when it exists.
1.3
Invariant half-lines

A half-line t ↦ u + tη, with u, η ∈ R^n, is invariant by f if

f(u + tη) = u + (t + 1)η, for all t large enough.  (3)

Then, χ(f) = η.
1.4
The proof of Kohlberg [21], or the proof of the more general results of Bewley and Kohlberg [2], rely in essence on
quantifier elimination arguments. Different methods must
be developed to obtain practicable algorithms.
Gaubert and Gunawardena [13], extending an earlier work
of Cochet and the same authors [8] in a special case, proposed an algorithm to compute χ(f). This algorithm uses
the idea of policy iteration, i.e., of iteration in the space of
strategies of one player.
Assume first that Min fixes a feedback stationary strategy σ, meaning that her decision depends only on the current state and is independent of the time. So, the strategy can be identified with a map from {1', . . . , n'} to {1, . . . , p}, that we still denote by σ, assigning to each of the nodes 1', . . . , n' one node σ(i') such that (i', σ(i')) ∈ E. Once Min's strategy is known, Max must solve a one-player game, the dynamic programming operator of which, denoted by f^(σ), is given by:

f^(σ)_i(x) = r_{i'σ(i')} + max_{(σ(i'),j')∈E} (r_{σ(i')j'} + x_j),  x ∈ R^n.  (4)
Maps of this form are called max-plus linear maps. Computing an invariant half-line of a max-plus linear map is a much
simpler problem [6], which essentially reduces to computing
the maximal circuit mean in a digraph.
In a nutshell, the policy iteration algorithm of [13] constructs a sequence of strategies and associated invariant half-lines with the property that at every step, the cycle times χ(f^(σ)) and χ(f^(σ')) of the old and new strategies σ and σ' satisfy:

χ(f^(σ')) ≤ χ(f^(σ)).  (5)
The algorithm stops when the invariant half-line of the current strategy is an invariant half-line of f . If the inequalities (5) are always strict, the same strategy cannot be selected twice, and since the total number of (feedback, stationary) strategies is finite, the algorithm must terminate.
The difficulty stems from degenerate iterations, in which
the equality holds in (5). Naive implementations of policy
iteration, in which degenerate iterations are not properly
handled, can lead the algorithm to cycle.
This difficulty is solved in [8, 13], where it is shown that
there exists a canonical choice of an invariant half-line of the
map f , which guarantees that the algorithm terminates.
This choice relies on max-plus spectral theory: at each degenerate iteration, the invariant half-line of f^(σ') is chosen to be the image of the invariant half-line of f^(σ) by the spectral projector of f^(σ'). The convergence proof relies on a discrete
max-plus analogue of the classical result of potential theory,
showing that a function which is harmonic in the usual sense
on a regular bounded domain of Rn is defined uniquely by
its trace on the boundary. The max-plus analogue of the
boundary is defined in terms of the nodes belonging to
circuits of maximal weight-to-length ratio (critical nodes).
One iteration of the algorithm of [8] takes a time O(nm),
where m is the number of arcs of G, plus the time needed to
apply a spectral projector to a given super-harmonic vector
if the iteration is degenerate. An explicit formula for the
spectral projector, due to Cohen, Dubois, Quadrat, and Viot
(see [9, 8]), is known. It would lead to an execution time of
O(n3 ) for every iteration of the algorithms of [8, 13], and it
1.5 Related works
To conclude this introduction, let us make additional comments on the history of the problem, and mention other references.
The problem of the existence of the cycle time is intimately related with the problem of the existence of the value for the associated infinite deterministic game with mean payoff, in which the payment associated to an infinite trajectory is the limsup of (r_{i_0 i_1} + ··· + r_{i_{2k−1} i_{2k}})/k as k tends to infinity. The value was shown to exist by Ehrenfeucht and
Mycielski [12]. In [15], Gurvich, Karzanov, and Khachiyan
gave an algorithm to compute the value of such games. They
mention the existence of a class of graphs for which the
number of iterations of their algorithm can be exponential,
although they report that the number of iterations grows
linearly with the dimension for random examples.
Another line of research concerns policy iteration. The
technique of policy iteration was invented by Howard [17],
to solve stochastic control problems. Hoffman and Karp [16]
generalised the idea to the case of stochastic games. Their
generalisation requires transition probabilities to be positive. Hence it does not apply to the case of deterministic games. To our knowledge, the first application of policy iteration techniques to the latter case appeared in [8,
13]. Another policy iteration algorithm, inspired by the
ones of [8, 13] has been proposed by Cheng and Zheng [5]:
their algorithm differs from the one of [8] in the way that it
computes inductively the different coordinates of the cycle
time, starting from the largest one. Policy iteration techniques for parity games (which reduce to a subclass of
mean payoff games) have been developed with motivations
from model checking, see Vöge and Jurdziński [24]. See
also Costan et al. [11] for an application of policy iteration
techniques to the static analysis of programs. Finally, we
note that the question of the complexity of policy iteration
for games has recently received a considerable attention, see
Björklund, Sandberg, and Vorobyov [3] and Jurdziński, Pa-
2.
2.1
z^(k) = A_NN z^(k−1) ⊕ b,

Example 1. Let

A = ( −ε  −1
      −5   0 ),

with 2 > ε > 0, and take u = (4, 2)^T. Since Au = ((−ε + 4) ∨ 1, 2)^T ≤ u, u is super-harmonic. The critical graph of A consists of the circuit (2, 2), so C = {2} and N = {1}. We have b = A_NC u_C = −1 + 2 = 1. Step 3 of the algorithm defines the sequence consisting only of the vector z^(0) = 1, since |N| − 1 = 0. Hence, the output of the algorithm is the vector v = (1, 2)^T, which is harmonic. This example shows
that Algorithm 1 can be arbitrarily faster than the more
natural fixed point iteration method, which would compute
the nonincreasing sequence u(0) = u, u(1) = Au(0) , . . .,
until u^(k+1) = u^(k). It is easy to see that the smallest k with this property is the smallest integer not less than 3/ε, and so, this alternative algorithm can be arbitrarily slow.
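The gap between the two methods is easy to reproduce numerically. The sketch below (our encoding) iterates u ← A ⊗ u in the max-plus sense for the matrix of the example and counts the iterations needed to reach the harmonic vector:

```python
# Max-plus fixed-point iteration u <- A (x) u for the 2x2 example
# A = ((-eps, -1), (-5, 0)), u = (4, 2): the number of iterations
# needed to reach the harmonic vector (1, 2) grows like 3/eps.

NEG = float("-inf")

def maxplus_matvec(A, u):
    return [max(a + x for a, x in zip(row, u) if a > NEG) for row in A]

def iterations_to_fixpoint(eps, u=(4.0, 2.0)):
    A = [[-eps, -1.0], [-5.0, 0.0]]
    u, k = list(u), 0
    while True:
        v = maxplus_matvec(A, u)
        if v == u:
            return k, u
        u, k = v, k + 1

print(iterations_to_fixpoint(0.5))  # -> (6, [1.0, 2.0])
```

Halving eps doubles the iteration count, while Algorithm 1 is unaffected.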
2.2
max j .
Aij 6=
(6)
3.1
Example 2. Let
0
0
B 1
B
A=@
7
2
4
1
0 C
C
4
A
3
The algorithm

For each strategy σ of Min, define g^(σ) by

g^(σ)_j(x) = r_{j'σ(j')} + x_{σ(j')},

for j = 1, . . . , n, so that f^(σ) = g^(σ) ∘ g0.

Algorithm 2 (Min-max policy iteration algorithm, see [13]).
Input: A weighted bipartite digraph, such that every node is the tail of at least one arc. We denote by f the associated dynamic programming operator, which is defined as in Section 1.2.
Output: An invariant half-line of f, that is, u ∈ G^n such that f(u) = u.
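Schematically, the loop of Algorithm 2 alternates between evaluating the current strategy and improving it. The toy instantiation below uses a game with a single Min node, so that an invariant half-line of f^(σ) collapses to a scalar cycle time; the helper names are ours, and the degenerate-iteration handling of [8, 13] is deliberately omitted:

```python
# Shape of the policy-iteration loop (Algorithm 2) on a deliberately
# tiny game with one Min node 1' and Max nodes {1, 2}.  Helper names
# are ours; degenerate iterations do not arise in this instance.

r_min = {1: 0, 2: 0}   # payoff of Min's arc (1', j)
r_max = {1: 5, 2: 2}   # payoff of Max's return arc (j, 1')

def cycle_time(sigma):
    # With one state, f^(sigma)(x) = r_min[sigma] + r_max[sigma] + x,
    # whose cycle time is the constant below.
    return r_min[sigma] + r_max[sigma]

def improve(sigma):
    # Min switches strategy only if this strictly lowers the cycle time.
    best = min(r_min, key=cycle_time)
    return best if cycle_time(best) < cycle_time(sigma) else sigma

def policy_iteration(sigma):
    while True:
        new = improve(sigma)
        if new == sigma:
            return sigma, cycle_time(sigma)
        sigma = new

print(policy_iteration(1))  # -> (2, 2)
```

Real instances replace cycle_time by the invariant half-line computation for max-plus linear maps and improve by the greedy choice of σ_{k+1} described below.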
Au = (3 + 4δ, 3 + 4δ, 1 + 4δ, 5 + 3δ)^T ≤ u = (6 + 4δ, 5 + 4δ, 1 + 4δ, 5 + 3δ)^T.

For instance, the first entry of Au is obtained from (Au)_1 = (0 + 2 + 4δ) ∨ (2 + 1 + 4δ) ∨ (−4 − 8 + 3δ) = 3 + 4δ.
The graph S consists of the arcs (i, j) in G(A), with i, j ∈ U1, or i = j = 4 ∈ U2. So,
0
1
0
2
B 1
1 C
C
AS = B
@ 7
4
A
3
and
1
4
2
B 3 5 C
C
A = B
@11
0
A
0
The strategy σ_{k+1} is chosen so that it keeps the values of σ_k whenever possible, meaning that σ_{k+1}(j') = σ_k(j') if f_j(u^(k)) = f_j^(σ_k)(u^(k)).
4 2
AN N =
,
AN C =
3
5
The vector z_N can be computed by applying steps 2 and 3 of Algorithm 1. We have

z^(0) = b = A_NC v_C = (−∞, 10)^T,
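Steps 2 and 3 of Algorithm 1 amount to iterating the max-plus affine map z ↦ A_NN z ⊕ b, |N| − 1 times, starting from z^(0) = b. A direct sketch (the matrices in the example call are our toy data, not those of the text):

```python
# Steps 2-3 of Algorithm 1 as a max-plus affine iteration:
# z(0) = b, z(k) = A_NN (x) z(k-1) (+) b, stopped after |N| - 1 steps.

NEG = float("-inf")

def maxplus_matvec(A, x):
    return [max((a + v for a, v in zip(row, x) if a > NEG and v > NEG),
                default=NEG) for row in A]

def affine_iterate(A_NN, b):
    z = list(b)
    for _ in range(len(b) - 1):
        Az = maxplus_matvec(A_NN, z)
        z = [max(p, q) for p, q in zip(Az, b)]  # A_NN z (+) b, entrywise
    return z

print(affine_iterate([[NEG, -1.0], [-2.0, NEG]], [0.0, 3.0]))  # -> [2.0, 3.0]
```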
3.
3.2 Illustration
(Figure: the game graph at iteration 1, with the current value at each node.)

f(u^(1)) = ( (2 + 15 + 8δ) ∧ (27 + 8δ),
             (1 + 15 + 8δ) ∧ (6 + 16 + 8δ),
             (9 + 15 + 8δ) ∧ (−5 + 16 + 8δ),
             (−3 + 15 + 8δ) ∧ (5 + 27 + 8δ) )^T
         = ( 17 + 8δ, 16 + 8δ, 11 + 8δ, 12 + 8δ )^T.
(Figures: the game graph at iterations 2-5, with the current value at each node.)
EXPERIMENTAL RESULTS
We now present experimental results. We considered random graphs, then a simple pursuit-evasion game, and finally,
a concrete problem arising in the control of discrete event
systems, which is taken from a work of Katz [20].
Let us now describe the details of implementation of Algorithm 2. Invariant half-lines of max-plus linear maps were
computed using the policy iteration algorithm of [6] (which
is experimentally more efficient than the method based on
Karp's algorithm [19]). We used floating point arithmetic, so the equality tests in Algorithm 2 and in Algorithm 4.4 of [6] have to be understood up to a fixed parameter, which was chosen to be ε := ε0 (max(M', M) + 1), where

M' = max_{(i,j')∈E} r_{ij'} − min_{(i,j')∈E} r_{ij'},   M = max_{(j',k)∈E} r_{j'k} − min_{(j',k)∈E} r_{j'k}.
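This precision parameter is computed in one pass over the two arc lists; a sketch, in which ε0 and the dict encoding of the arcs are our own choices rather than the paper's:

```python
# Precision parameter for the floating-point equality tests:
# eps = eps0 * (max(M_max, M_min) + 1), where M_max and M_min are the
# payoff ranges on Max's and Min's arcs.  eps0 is a small constant of
# our choosing; the value used in the experiments is not reproduced here.

def precision(max_arcs, min_arcs, eps0=1e-9):
    """max_arcs, min_arcs: dict mapping arc -> payoff."""
    r1 = list(max_arcs.values())
    r2 = list(min_arcs.values())
    M1 = max(r1) - min(r1)
    M2 = max(r2) - min(r2)
    return eps0 * (max(M1, M2) + 1)
```

Scaling the tolerance with the payoff range keeps the equality tests meaningful regardless of the magnitude of the instance's weights.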
4.1
v = (1 + 3δ, 6 + 5δ, 5 + 5δ, 3δ)^T.
4.
u^(1) = (9 + 8δ, 8 + 8δ, 16 + 8δ, 4 + 8δ)^T.
4.1.1
From   | To     | Step Size
2      | 5000   | 1
5100   | 50000  | 100
50500  | 200000 | 500
201000 | 400000 | 1000
Figure 4: CPU time (in seconds) taken by the algorithm as a function of n. Same conventions as in
Fig. 2.
4.1.2
4.2
shown in Fig. 4. We denote by Nspec the number of degenerate iterations, in which the routine computing the spectral
projector is called. Table 1 gives some values of the minimal, average, and maximal value of Nmin , Nglob , and Nspec ,
computed from the same set of experimental data as the one
used to produce the figures (in particular, the Min, Average,
In this section, we present experiments concerning a pursuit-evasion game. Suppose there is a cat and a mouse in a room. The following assumptions are made: 1) The mouse (Min) wishes to minimise the negative of its distance to the cat, and the cat (Max) tries to maximise the negative of its distance to the mouse. This is modelled by assuming that the payments occur only after the cat's actions. Then, the
(Figure: the room of the pursuit-evasion game, divided into numbered cells, with an obstacle.)
The value FD as a function of the cat's position C (rows) and the mouse's position M (columns):

C \ M |  1  2  3  4  5  6  7  8  9 10 11 12
  1   |  0  1  2  2  1  3  2  3  2  3  3  3
  2   |  0  0  1  1  1  2  2  3  2  3  3  3
  3   |  1  1  0  0  2  1  3  2  3  3  3  2
  4   |  2  2  1  0  3  1  3  2  3  3  3  2
4.3
The Ax = Bx problem
One motivation for the cycle time problem for min-max
functions is to solve systems of max-plus linear equations.
Indeed, if A and B are n × p matrices with entries in R ∪ {−∞}, the max-plus linear problem Ax = Bx can be rewritten equivalently as Ax ≤ Bx and Bx ≤ Ax, or

x_i ≤ min( min_{j : A_ji > −∞} ((Bx)_j − A_ji),  min_{j : B_ji > −∞} ((Ax)_j − B_ji) ).
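One naive way to exploit this rewriting directly (distinct from the game-theoretic method studied here) is to start from a finite vector and repeatedly tighten it with the right-hand side above; when a solution below the starting point exists, the iteration converges to the greatest one. A sketch, with -inf standing for −∞:

```python
# Greatest solution of Ax = Bx below a starting point, by iterating
# the bound x_i <= min over finite A_ji of (Bx)_j - A_ji, and
# symmetrically with the roles of A and B exchanged.

NEG = float("-inf")

def matvec(M, x):  # max-plus matrix-vector product
    return [max((m + v for m, v in zip(row, x) if m > NEG), default=NEG)
            for row in M]

def tighten(A, B, x):
    Ax, Bx = matvec(A, x), matvec(B, x)
    y = list(x)
    for i in range(len(x)):
        for j in range(len(A)):
            if A[j][i] > NEG:
                y[i] = min(y[i], Bx[j] - A[j][i])
            if B[j][i] > NEG:
                y[i] = min(y[i], Ax[j] - B[j][i])
    return y

def solve(A, B, x, rounds=100):
    # Capped: over the reals this need not terminate in finitely many steps.
    for _ in range(rounds):
        y = tighten(A, B, x)
        if y == x:
            return x
        x = y
    return x

A = [[0.0, NEG], [NEG, 0.0]]
B = [[NEG, 1.0], [-1.0, NEG]]
x = solve(A, B, [5.0, 5.0])
print(x, matvec(A, x) == matvec(B, x))  # -> [5.0, 4.0] True
```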
Acknowledgments

The first author was supported by the French embassy in New Delhi through the INRIA international internship programme. He thanks Ricardo Katz for his help. The second author thanks Thierry Bousch for having pointed out to him the work of Zwick and Paterson [25] and the works on parity games (see Vöge and Jurdziński [24]).
5. REFERENCES