Refinement of Soccer Agents' Positions Using Reinforcement Learning
Tomohito Andou
1 Introduction
Conventional reinforcement learning research assumes that the environment is
based on Markov decision processes (MDPs) and that the problem is small
enough for the entire state space to be enumerated and stored in memory. From
the practical viewpoint, however, reinforcement learning should be able to
approximate the optimal policy in complex or large problems with high
frequency, even if convergence to optimality is not guaranteed. Indeed,
recent research often tackles complex and/or large problems [1].
RoboCup (the World Cup Robotic Soccer) [3], whose first tournament was held
in August 1997, is a suitable problem for testing reinforcement learning
methods. In the RoboCup simulator league, software agents play a game on a
virtual field provided by Soccer-server [5]. An agent receives perceptual
information and executes actions through Soccer-server. Generally, this
problem poses several difficulties.
2 Andhill Agents
An agent normally receives a visual message every 300 milliseconds and can
execute an action every 100 milliseconds. Agents of other teams in RoboCup97
seem to execute a series of actions (turn-dash-dash, kick-kick-kick, and so
on) after receiving a message. Andhill instead uses timer interruption to
ensure that an action is executed in every step. Figure 1 shows a rough
sketch of agents' execution.
Fig. 1. A rough sketch of agents' execution: the agent repeatedly receives
perceptual information, updates its internal information, and executes an
action.

The internal information consists of the following elements:

- Current time
- Location of the agent
- Time-stamp of the agent's location
- Velocity of the agent
- Time-stamp of the agent's velocity
- Direction of the agent
- Stamina of the agent
- Location of the ball
- Time-stamp of the ball's location
- Velocity of the ball
- Time-stamp of the ball's velocity
- Expected future location of the ball
- Locations of agents in sight
- The field information
- Time-stamp of the field information
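As an illustration only, the sense-update-act loop of Figure 1 might be
sketched in Python as follows; the helper names and the queue standing in for
Soccer-server's message stream are hypothetical, not Andhill's actual code.

import threading
import queue

CYCLE = 0.1                   # Soccer-server accepts one action per 100 ms
perceptions = queue.Queue()   # stand-in for the server's message stream

def receive_perception():
    # Block until a perceptual message arrives; visual messages come
    # roughly every 300 ms, less often than actions are due.
    return perceptions.get()

def update_internal_info(state, msg):
    state.update(msg)         # refresh locations, velocities, time-stamps

def execute_action(state):
    pass                      # here: send one turn/dash/kick command

def agent_loop(state):
    # Timer interruption guarantees that exactly one action is executed
    # in every simulation step, as in Figure 1.
    def on_timer():
        execute_action(state)
        threading.Timer(CYCLE, on_timer).start()
    threading.Timer(CYCLE, on_timer).start()
    while True:
        update_internal_info(state, receive_perception())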
Team-mates share the ball's location and the field information using the
"say" command.
This section describes the details of the "execute an action" module in
Figure 1. The most important thing for a soccer agent that can kick the ball
is to send the ball to a safer place, so a definition of the safety of a
place is needed. The safety of a place seems to depend on the distribution of
agents and on the distances to both goals.
In Andhill agents, the safety of a place is determined by the safety
function, whose inputs are the distances from the place to both goals and to
the surrounding team-mates and opponents.
The safety function is the neural network shown in Figure 3. When the inputs
are the distances from the ball's location, the output indicates the safety
of the ball's location. The network is only hand-adjusted, but it represents,
for example, that a place closer to the agent's own goal is more dangerous,
and a place closer to a team-mate is safer.
Fig. 3. The safety function: Curves in hidden units denote the sigmoid functions.
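To make the shape of such a network concrete, here is a minimal hand-tuned
sketch in Python; the wiring and all weights are made up for illustration and
are not the values of Figure 3.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def safety(d_own_goal, d_opp_goal, d_nearest_teammate, d_nearest_opponent):
    # Inputs are distances from the evaluated place; a larger output
    # means a safer place.  Being far from the own goal and close to the
    # opponents' goal is good, as is being close to a team-mate and far
    # from an opponent.
    h1 = sigmoid(0.5 * d_own_goal - 0.5 * d_opp_goal)
    h2 = sigmoid(-1.0 * d_nearest_teammate + 1.0 * d_nearest_opponent)
    return 4.0 * h1 + 2.0 * h2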
Figure 4 shows an example of the output values of the safety function. The
circle symbols denote team-mates and the cross symbols denote opponents; the
circle team defends the left-side goal. The larger the number, the safer the
place, and the safer the place, the more desirable it is to kick to.
An agent that can kick the ball estimates the safety of every reachable place
and kicks the ball to the safest one. Dribbling is not prepared explicitly,
but the agent sometimes dribbles because the safest place is just in front of
it. The agent shoots when it is within shooting range.
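Concretely, the kick decision reduces to an argmax over the reachable places;
`reachable_places` and `safety_of` below are hypothetical names for
quantities computed from the internal information.

def choose_kick_target(reachable_places, safety_of):
    # Estimate the safety of every place the kick can reach and return
    # the safest one.  Dribbling is not explicit: it emerges whenever
    # the safest reachable place is just in front of the agent.
    return max(reachable_places, key=safety_of)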
2.5 Positions
What the agent should do when it is far away from the ball is one of the most
difficult tactics to plan. An agent using the fixed-position strategy does
not move from its home position. In real-world soccer games, however, a
player moves around the field incessantly because its optimal position
changes dynamically. The same should hold for RoboCup agents.
Fig. 4. The output values of the safety function: the real-valued outputs are
rounded to integers in this figure.
In a real-world soccer team, there are several strategies according to the
player's role.
- Goal-keeper: The goal-keeper stays in front of its own goal. There are no
privileged players in RoboCup97, but this strategy is good for capturing the
ball because opponents frequently kick the ball toward that place.
- Defender: The defender's role is mainly to capture the ball when an
opponent possesses it. So a defender should be in a place where the ball
tends to come while an opponent possesses it; therefore, a defender usually
keeps marking an opponent.
- Forward: The forward's role is mainly to score a goal. This task can be
carried out only while the player's team keeps the ball away from the
opponents, so a forward should stay far away from opponents.
These strategies also seem useful for RoboCup agents. That is to say,
elements of these kinds are necessary to define the position strategy. Note
that which team possesses the ball is represented by the safety of the ball's
current location.
Andhill uses the position function, with which the agent decides the position
to go to. The position function is a neural network that takes the five
elements mentioned above as inputs and outputs the position value of a place.
The network structure is shown in Figure 5. This neural network can represent
XOR-class problems; for example, the strategy on the right side of the field
and the strategy on the left side can be opposite.
Fig. 5. The structure of the position neural network: Curves in hidden units and an
output unit denote the sigmoid functions.
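As a sketch of the forward pass of such a network (the actual weights are
learned by the algorithm of Section 3, and the names here are hypothetical):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def position_value(inputs, w_hidden, w_out):
    # Figure-5-style network: five inputs describing a candidate place,
    # one layer of sigmoid hidden units, and a sigmoid output unit
    # giving the position value of that place.
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))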
In this research, how well an agent has worked is measured by the number of
times it kicks the ball. An Andhill agent kicks the ball to a safer place
(Section 2.4), so kicking the ball can be regarded as a profitable action for
the team.
3.3 Algorithm for Updating the Network
where $\gamma\;(0 \le \gamma < 1)$ denotes the discount factor and $w_i$ the
$i$-th component of $W$.

5. Calculate $\Delta w_i(t)$ as
$$\Delta w_i(t) = (r_t - b)\,D_i(t),$$
where $b$ denotes the reinforcement baseline.
The two output candidates are computed as
$$v_5 = g(w_{3,5}\,v_3 + w_{4,5}\,v_4), \qquad
  v_{10} = g(w_{3,5}\,v_8 + w_{4,5}\,v_9).$$
The eligibility of $w_{3,5}$ with respect to the probability of choosing
$v_5$ is
$$e_{3,5} = \frac{\partial}{\partial w_{3,5}} \ln\frac{v_5}{v_5 + v_{10}}
 = \frac{1}{v_5}\frac{\partial v_5}{\partial w_{3,5}}
 - \frac{1}{v_5 + v_{10}}\frac{\partial (v_5 + v_{10})}{\partial w_{3,5}}
 = \frac{v_3\,g'(in_5)}{v_5}
 - \frac{v_3\,g'(in_5) + v_8\,g'(in_{10})}{v_5 + v_{10}}.$$
Similarly, for an input-to-hidden weight,
$$e_{1,3} = \frac{\partial}{\partial w_{1,3}} \ln\frac{v_5}{v_5 + v_{10}}
 = \frac{1}{v_5}\frac{\partial v_5}{\partial w_{1,3}}
 - \frac{1}{v_5 + v_{10}}\frac{\partial (v_5 + v_{10})}{\partial w_{1,3}},
 \qquad
 \frac{\partial v_5}{\partial w_{1,3}} = w_{3,5}\,v_1\,g'(in_3)\,g'(in_5),$$
where $in$ denotes the weighted sum of a unit's inputs and $g'$ is the
derivative of the sigmoid function, $g' = g(1 - g)$.
The update rules above are too sensitive to the output values. For example,
if the output value $v_5$ is quite small, the eligibilities become very
large, which easily causes oscillations. Conversely, if all the output values
are equal ($v_5 = v_{10}$), the eligibilities become very small. This
research therefore uses modified eligibilities that avoid both extremes.
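A minimal sketch of one update step, under the assumption (consistent with
the discount factor $\gamma$, but not spelled out in the excerpt above) that
$D_i(t)$ is a discounted trace of the eligibilities, $D_i(t) = \gamma
D_i(t-1) + e_i(t)$; the learning rate alpha is likewise an assumption.

def reinforce_update(weights, traces, eligibilities, reward,
                     baseline=0.0, gamma=0.9, alpha=0.1):
    # Assumed step 4: accumulate discounted eligibility traces.
    # Step 5: delta_w_i(t) = (r_t - b) * D_i(t), scaled by alpha.
    for i, e in enumerate(eligibilities):
        traces[i] = gamma * traces[i] + e
        weights[i] += alpha * (reward - baseline) * traces[i]
    return weights, traces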
4 Experiments
As described in Section 3.2, moving to a position is a costly action. Even if
the output values of the position function suggest going somewhere distant,
such an action is not realistic. So an agent goes to a feasible place, one it
can reach with its remaining stamina.
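For instance, the feasibility constraint might be implemented by clipping the
suggested target to the radius the agent can still cover, under a
deliberately simplified, hypothetical stamina model:

import math

def feasible_target(pos, target, stamina, cost_per_unit=1.0):
    # Clip the position function's suggestion to the distance the agent
    # can afford with its remaining stamina (simplified model).
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    budget = stamina / cost_per_unit
    if dist <= budget:
        return target
    scale = budget / dist
    return (pos[0] + dx * scale, pos[1] + dy * scale)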
An agent can move to a position instantly before kick-off. Since the best
position of an agent depends on the other agents' positions, the agents take
turns moving to their positions instantly.
The learning team learns from a long game played against a team using the
fixed-position strategy.
Agents have to explore in order to find positions where they can kick the
ball frequently. On the other hand, agents also have to exploit their
policies, because too much exploration disturbs the other agents' learning.
In conventional reinforcement learning there is no good solution to this
trade-off. Fortunately, the trade-off can be eased in the soccer-position
problem because an agent can estimate the positions in which it could kick
the ball frequently: an agent can reinforce a hopeful position from
observation, without practical experience. When the ball goes somewhere far
from all team-mates, the agents reinforce that place. This section describes
the results of this observational reinforcement learning.
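The observational rule can be sketched as follows; the distance threshold and
the `reinforce_place` callback (which would apply the update of Section 3.3
toward that place) are hypothetical.

import math

def observational_reinforce(ball_pos, teammates, reinforce_place,
                            threshold=10.0):
    # If no team-mate is near the ball, reinforce the ball's place as a
    # hopeful position, without any agent actually having kicked there.
    nearest = min(math.hypot(ball_pos[0] - x, ball_pos[1] - y)
                  for x, y in teammates)
    if nearest > threshold:
        reinforce_place(ball_pos)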
Preliminary experiments made it clear that the absence of a goal-keeper is a
fatal defect in this learning. So this section adds a special strategy for
the goal-keeper: the goal-keeper has a fixed position and a fixed territory.
Figure 8 shows one situation in the middle of the learning. The formation of
the learning team seems very appropriate against the opponents': each agent
marks an opponent effectively, and the formation is extended moderately. The
learning team was stronger than the fixed-position team during this period of
the learning.
Unexpectedly, the learning team did not converge to this strategy. Figure 11
shows the transition of the behavior of an agent. This agent learned a
strategy in which it prefers places closer to its own goal and farther from
the other team-mates. The transition of the score rate is shown in Figure 12.
The learning team led the fixed-position team at around 40000, but after that
the learning team did not get stronger; rather, it got weaker. On the other
hand, the kick rate increased favorably. Since the kick rate is the rate of
rewards, the learning team certainly learned positions that are admirable in
terms of its own reward.
5 Conclusions
This paper described a RoboCup team, Andhill, and applied reinforcement
learning to a complex and large problem, the soccer-position problem. In
particular, it used observational reinforcement as a solution to the
exploration problem.

Andhill participated in the RoboCup97 tournament using the fixed-position
strategy and, fortunately, won second prize. A chance of beating the champion
team seemed to lie in position fitting, so the approach in this research
looks promising.
Fig. 9. The transition of the behavior of an agent in Section 4.1. This
figure omits the fifth input.
Fig. 10. The rates at which the learning team gets a goal, the opponents'
team gets a goal, and the agent kicks the ball in Section 4.1.
Fig. 11. The transition of the behavior of an agent in Section 4.2. This
figure omits the fifth input.
Fig. 12. The transition of the score rate: the rates at which the learning
team gets a goal, the opponents' team gets a goal, and the agent kicks the
ball.
References