Refinement of Soccer Agents' Positions Using Reinforcement Learning


Tomohito Andou

Department of Mathematical and Computing Sciences,


Tokyo Institute of Technology
andou@is.titech.ac.jp

Abstract. This paper describes the structure of the RoboCup team, Andhill, which won the second prize in the RoboCup97 tournament, and the results of reinforcement learning in which an agent receives a reward when it kicks the ball. In multi-agent reinforcement learning, the trade-off between exploration and exploitation is a serious problem. This research uses observational reinforcement to ease the exploration problem.

1 Introduction
Conventional reinforcement learning research assumes that the environment is based on Markov decision processes (MDPs) and that the problem is small enough for the entire state space to be enumerated and stored in memory. From a practical viewpoint, however, reinforcement learning should be able to approximate the optimal policy in complex or large problems with high frequency, even if convergence to optimality is not guaranteed. Indeed, recent research often tackles complex and/or large problems [1].
RoboCup (The World Cup Robotic Soccer) [3], whose first tournament was held in August 1997, is a suitable problem for testing reinforcement learning methods. In the RoboCup simulator league, software agents play a game in a virtual field provided by Soccer-server [5]. An agent receives perceptual information and executes actions through Soccer-server. In general, this problem includes the following difficulties:

- Multi-agent: Each team consists of eleven agents. Team play is required, but sharing memory is not permitted by the RoboCup simulator league rules, except for "saying" in the field.
- Incomplete perception: An agent cannot look over its shoulder, and Soccer-server gives sensor information that includes some errors. Agents must be robust.
- Real-time: A match is uninterrupted. Agents must choose actions within a short period.

Furthermore, the following difficulties should be overcome when reinforcement


learning is applied to this problem:

- Partially observable Markov decision processes: Incomplete perception causes POMDPs.
- Dynamic environment: The environment is not stationary and changes gradually because team-mates learn simultaneously.
- Large state space: The state space is too large to be stored in memory. A large state space requires generalization by function approximators.
- Distribution of reinforcement signals among team-mates: The aim of a team is to win the game, but reinforcement signals should be defined in more detail. Various kinds of reinforcement signals are possible.
- Trade-off between exploration and exploitation: How an agent behaves during learning is an important factor. An agent follows its best policy in exploitation-oriented learning and tries a wide range of policies in exploration-oriented learning. Multi-agent learning makes this problem more serious.
In this research, we have built a RoboCup team, Andhill, which improves agents with on-line reinforcement learning and refines positioning, which seems to be the essence of team play.

2 Andhill Agents

In the Pre-RoboCup tournament in 1996, there were two major types of strategies. One is the lump strategy, in which most agents keep chasing the ball throughout the game; there was no "stamina" element in the Pre-RoboCup rules. The other is the fixed position strategy: an agent reacts only when the ball is within its territory, and once the ball leaves the territory, the agent returns to its home position.
The champion team in Pre-RoboCup, Ogalets, adopted the fixed position strategy. An Ogalets agent has its own home position and knows its team-mates' home positions. An agent who possesses the ball kicks it to the home position of a team-mate without confirming that the team-mate is actually there. Other strong teams in Pre-RoboCup were also based on the fixed position strategy.
The fixed position strategy is successful because its agents behave more effectively than those of the lump strategy. In RoboCup97 the "stamina" element is introduced, so the fixed position strategy becomes even more advantageous because its agents do not need much stamina.
This research, however, dares to aim at a better strategy, the dynamic position strategy, in which an agent decides its position by itself. A dynamic position team can fit its formation to the opponent team's formation. The structure of Andhill agents is described in Sections 2.1 to 2.5.

2.1 Fundamental Structure

An agent normally receives a visual message every 300 milliseconds and can execute an action every 100 milliseconds. Agents of other teams in RoboCup97 seem to execute a series of actions (turn - dash - dash, kick - kick - kick, and so on) after receiving a message. Andhill uses timer interruption to ensure that an action is executed in every turn. Figure 1 shows a rough sketch of an agent's execution.
Fig. 1. The flow of an agent's execution (a loop of receiving perceptual information, updating internal information, and executing an action)
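The timer-driven loop can be summarized in code. The following is only a minimal sketch: the callbacks receive_perception, update_internal_info and execute_action are hypothetical, and the timing logic merely illustrates the idea of acting once per 100 ms cycle regardless of when sensor messages arrive.

import time

ACTION_CYCLE = 0.1  # Soccer-server allows one action per 100 ms

def agent_loop(receive_perception, update_internal_info, execute_action):
    # receive_perception(timeout) -> message or None   (hypothetical)
    # update_internal_info(message)                    (hypothetical)
    # execute_action()                                 (hypothetical)
    next_action_time = time.monotonic()
    while True:
        # Visual messages arrive roughly every 300 ms; wait for one,
        # but never past the next action deadline.
        timeout = max(0.0, next_action_time - time.monotonic())
        message = receive_perception(timeout)
        if message is not None:
            update_internal_info(message)
        # "Timer interruption": send an action every cycle even if no
        # fresh perception was received in this cycle.
        if time.monotonic() >= next_action_time:
            execute_action()
            next_action_time += ACTION_CYCLE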

2.2 Internal Information

An agent receives visual information about the relative distance, direction, distance-change and direction-change of each object from Soccer-server. The agent can estimate its own location from information about fixed objects, and the ball's future location from the ball's velocity.
Andhill agents interpret sensor information into internal information such as:

- Current time
- Location of the agent
- Time-stamp of the agent's location
- Velocity of the agent
- Time-stamp of the agent's velocity
- Direction of the agent
- Stamina of the agent
- Location of the ball
- Time-stamp of the ball's location
- Velocity of the ball
- Time-stamp of the ball's velocity
- Expected future location of the ball
- Locations of agents in sight
- The field information
- Time-stamp of the field information

Figure 2 shows an example of the field information. The field is represented by a 16 × 11 grid. Each unit indicates the existence of team-mates and opponents.

Fig. 2. An example of the field information

Agents share information about the ball's location and the field information among team-mates using the "say" command.

2.3 Action Flow

This section describes the details of the "execute an action" module in Figure 1.

1. Get sensor information if internal information is old: If the time-stamp of the agent's location or the ball's location is old, the agent turns around according to its visual range.
2. Kick the ball if possible: If the ball is within the agent's kick range, the agent passes, dribbles or shoots. See Section 2.4.
3. Chase the ball if the agent is the closest to the ball: If the agent is the closest of its team-mates to the expected future location of the ball, it runs to that place. To avoid all agents holding back, the agent also runs there if its distance is roughly equal to the closest one.
4. Otherwise return to the agent's home position: If the above conditions are not satisfied, the agent returns to its home position. See Section 2.5.

Additionally, an Andhill agent "says" its internal information to team-mates periodically.
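A rough sketch of this decision flow is given below. The constants STALE_LIMIT, KICK_RANGE and TIE_MARGIN and all helper methods are hypothetical; the sketch only mirrors the priority order of the four steps above.

STALE_LIMIT = 5    # cycles after which a location estimate counts as old (assumed)
KICK_RANGE = 2.0   # kickable distance in meters (assumed)
TIE_MARGIN = 1.0   # tolerance for "roughly equal" distances (assumed)

def choose_action(agent, info):
    # 1. Get sensor information if internal information is old.
    if (info.current_time - info.self_location_time > STALE_LIMIT or
            info.current_time - info.ball_location_time > STALE_LIMIT):
        return agent.turn_to_look_around()
    # 2. Kick the ball if possible (pass, dribble or shoot; Section 2.4).
    if agent.distance_to(info.ball_location) <= KICK_RANGE:
        return agent.kick_to_safest_place()
    # 3. Chase the ball if this agent is (roughly) the closest team-mate
    #    to the ball's expected future location.
    my_dist = agent.distance_to(info.ball_future_location)
    closest = min(agent.teammate_distances_to(info.ball_future_location))
    if my_dist <= closest + TIE_MARGIN:
        return agent.run_to(info.ball_future_location)
    # 4. Otherwise return to the home position (Section 2.5).
    return agent.run_to(agent.home_position())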

2.4 Kick Direction

The most important thing for a soccer agent when it can kick the ball is to send the ball to a safer place, so a definition of the safety of a place is needed. The safety of a place seems to depend on the distribution of agents and the distances to both goals.
In Andhill agents, the safety of a place is determined by the safety function, whose input values are the following elements:

1. Distance to the goal of the agent's side
2. Distance to the goal of the opponents' side
3. Distance to the closest team-mate
4. Distance to the closest opponent

The safety function is a neural network, shown in Figure 3. When the input values are the distances from the ball's location, the output value indicates the safety of the ball's location. The network is only hand-adjusted, but it represents, for example, that a place closer to the agent's own goal is more dangerous, a place closer to a team-mate is safer, and so on.

Fig. 3. The safety function: Curves in hidden units denote the sigmoid functions.

Figure 4 shows an example of the output values of the safety function. The circle symbols denote team-mates and the cross symbols denote opponents. The circle team defends the left-side goal. A larger number means a safer place, and a safer place is a more desirable place to kick to.
An agent that can kick the ball estimates the safety of all reachable places and kicks the ball to the safest place. Dribbling is not prepared explicitly, but the agent sometimes dribbles because the safest place is just in front of it. The agent shoots if it is within shooting range.
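The kick decision can be sketched as follows. The hidden-unit weights of Figure 3 are not legible in this copy, so the weights below are placeholders; only the structure (four distance inputs, sigmoid hidden units, a weighted sum as output) and the "kick to the safest reachable place" rule follow the text.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Placeholder weights: one row per hidden unit over the four inputs
# (own goal, opponents' goal, closest team-mate, closest opponent).
W_HIDDEN = [(-0.5, 0.0, 0.0, 0.0),
            (0.0, 0.3, 0.0, 0.0),
            (0.0, 0.0, -1.0, 0.0),
            (0.0, 0.0, 0.0, 1.0)]
W_OUT = [-1.0, 1.0, 1.0, -1.0]

def safety(d_own_goal, d_opp_goal, d_teammate, d_opponent):
    # Safety value of a place, computed from the four distances listed above.
    x = (d_own_goal, d_opp_goal, d_teammate, d_opponent)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]
    return sum(w * h for w, h in zip(W_OUT, hidden))

def best_kick_target(reachable_places, distances_of):
    # distances_of(place) returns the four distances for that place (assumed helper).
    return max(reachable_places, key=lambda p: safety(*distances_of(p)))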

2.5 Positions

What an agent should do when it is far away from the ball is one of the most difficult tactics to plan. An agent using the fixed position strategy doesn't move from its home position. In real-world soccer games, however, a player moves around the field incessantly because its optimal position changes dynamically. This should also be applicable to RoboCup agents.

Fig. 4. The output values of the safety function: real values are rounded to integers in this figure.

There are several strategies according to the player's role in a real-world soccer team.

- Goal-keeper: The goal-keeper stays in front of the goal of its own side. There are no privileged players in RoboCup97, but this strategy is good for capturing the ball because opponents kick to that place frequently.
- Defender: The defender's role is mainly to capture the ball when an opponent possesses it. A defender should therefore be in a place where the ball tends to come when it is possessed by an opponent, so a defender usually keeps marking an opponent.
- Forward: The forward's role is mainly to score a goal. This is possible only when the player's team keeps the ball away from the opponents, so a forward should be far away from opponents.

These strategies seem useful for RoboCup agents as well. That is to say, the following elements are necessary to define a strategy about position.

1. Distance to the goal of the agent's side
2. Distance to the goal of the opponents' side
3. Distance to the closest team-mate
4. Distance to the closest opponent
5. Safety of the current location of the ball

Note that which team possesses the ball is represented by the safety of the current location of the ball.
Andhill uses the position function, with which the agent decides the position to go to. The position function is a neural network that takes the above-mentioned five elements as inputs and outputs the position value of the place. The network structure is shown in Figure 5. This neural network can represent functions of the XOR class; for example, the strategy on the right side of the field and the strategy on the left side can be opposite.

Fig. 5. The structure of the position neural network: Curves in hidden units and an
output unit denote the sigmoid functions.
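A minimal sketch of such a position network, and of evaluating it over the grid squares, is shown below. The number of hidden units, the initialization, and the helper features_of are assumptions; the per-square outputs are turned into a probability distribution in the way described in Section 3.3.

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class PositionNet:
    # Five inputs, sigmoid hidden units and a sigmoid output unit,
    # as in Figure 5; the hidden-layer size here is an assumption.
    def __init__(self, n_hidden=4):
        self.w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(5)]
                         for _ in range(n_hidden)]
        self.w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def value(self, features):
        hidden = [sigmoid(sum(w * f for w, f in zip(row, features)))
                  for row in self.w_hidden]
        return sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)))

def choose_square(net, squares, features_of):
    # Evaluate the same network on every grid square and sample one
    # with probability proportional to its output (Section 3.3).
    outputs = [net.value(features_of(s)) for s in squares]
    r = random.uniform(0.0, sum(outputs))
    acc = 0.0
    for square, o in zip(squares, outputs):
        acc += o
        if r <= acc:
            return square
    return squares[-1]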

3 Using Reinforcement Learning to Refine the Position Function

Reinforcement learning is machine learning in which an agent seeks to adapt to an environment by reinforcing the actions that give a greater reward (reinforcement signal). Reinforcement learning differs from supervised learning especially in that it does not need ideal input/output pairs. It does require that the problem be performed on-line, but the RoboCup problem satisfies this requirement.
This research applies reinforcement learning to refine the position function (Section 2.5).

3.1 Local Reinforcement Signals


Reinforcement signals need to be predefined. This task is difficult but very important.
The most natural definition of reinforcement signals in the RoboCup problem may be that a team receives a reward when it scores a goal and receives a punishment when it concedes one. This definition, however, leaves the credit assignment problem: how should the reward be distributed among team-mates? For example, if the reward is distributed equally among all team-mates, an agent that does not work well still reinforces its useless actions. If most of the reward goes to the agent that scores, the goal-keeper receives only a little reward regardless of its important role.

In this research, which agent has worked well is defined by the number of times it kicks the ball. An Andhill agent kicks the ball to a safer place (Section 2.4), so kicking the ball can be said to be a profitable action for the team.

3.2 Exploration and Exploitation


Effective sample data can be collected only when an agent behaves deliberately for the team play, because other agents also learn simultaneously (exploitation). On the other hand, an agent must try various kinds of policies frequently during the learning process because the environment is not stationary and other agents' policies change gradually (exploration). There is a difficult trade-off between exploitation and exploration.
Moreover, moving the position is a high-level action. Even if the position function indicates a place to go to, the agent cannot always get there in time. Furthermore, moving the position costs stamina.
An Andhill agent therefore chooses the best possible policy, and exploration is performed only incidentally because trials are costly.

3.3 Algorithm for Updating the Network

Q-learning [6] is one of the most promising reinforcement learning methods. Unfortunately, Q-learning assumes that the environment is stationary and completely observable. This assumption does not hold in the RoboCup problem.
Littman [4] pointed out that Q-learning can be applied not only in MDPs but also in the framework of Markov games, in which the environment is supposed to include opponents. He used a simple soccer game as an example, but the state space was quite small and the environment was completely observable.
Kimura [2] proposed the stochastic gradient ascent (SGA) method for POMDPs. A large state space is permitted in the SGA method. The general form of the SGA algorithm is the following:
1. Observe X_t in the environment.
2. Execute action a_t with probability π(a_t, W, X_t).
3. Receive the immediate reward r_t.
4. Calculate e_i(t) and D_i(t) as

   e_i(t) = ∂/∂w_i ln(π(a_t, W, X_t)),
   D_i(t) = e_i(t) + γ D_i(t - 1),

   where γ (0 ≤ γ < 1) denotes the discount factor, and w_i denotes the i-th component of W.
5. Calculate Δw_i(t) as

   Δw_i(t) = (r_t - b) D_i(t),

   where b denotes the reinforcement baseline.
6. Policy improvement: update W as

   ΔW(t) = (Δw_1(t), Δw_2(t), ..., Δw_i(t), ...),
   W ← W + α(1 - γ) ΔW(t),

   where α is a nonnegative learning rate factor.
7. Move to the time step t + 1, and go to step 1.

This research adopts the SGA method.
In this research, W is the set of weights of the neural network in Figure 5. The action probability is defined as π(a_t, W, X_t) = o_t / Σ_{s∈S} o_s, where o_t is the output value when the input values are X_t, and S is the set of squares of the grid in Figure 2.
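The following sketch shows the SGA bookkeeping of steps 4 to 6 for a flat weight vector. To keep the example self-contained, the eligibility e_i(t) = ∂ ln π(a_t, W, X_t)/∂w_i is approximated by finite differences of a user-supplied log-probability function instead of the back-propagated form derived below; the default parameter values are assumptions.

def sga_update(weights, traces, log_prob, reward,
               gamma=0.9, alpha=0.01, baseline=0.0, eps=1e-4):
    # weights : list of w_i (the components of W)
    # traces  : list of D_i(t-1)
    # log_prob: function W -> ln pi(a_t, W, X_t) for the action just taken
    base = log_prob(weights)
    new_weights, new_traces = [], []
    for i, (w_i, d_prev) in enumerate(zip(weights, traces)):
        bumped = list(weights)
        bumped[i] += eps
        e_i = (log_prob(bumped) - base) / eps      # e_i(t) = d ln pi / d w_i
        d_i = e_i + gamma * d_prev                 # D_i(t) = e_i(t) + gamma D_i(t-1)
        dw_i = (reward - baseline) * d_i           # Delta w_i(t) = (r_t - b) D_i(t)
        new_weights.append(w_i + alpha * (1.0 - gamma) * dw_i)  # W <- W + alpha(1-gamma)DeltaW
        new_traces.append(d_i)
    return new_weights, new_traces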

Fig. 6. The neural network is to be reused repeatedly.

The network is applied to all squares s ∈ S. In other words, the network is reused repeatedly while the action probability π(a_t, W, X_t) is being calculated. Figure 6 shows a simple neural network for explaining the calculation of the eligibility e_i. The networks in Figure 6 are identical except for their activation values; therefore the activation values v are:

   v_5 = g(w_{3,5} v_3 + w_{4,5} v_4),
   v_10 = g(w_{3,5} v_8 + w_{4,5} v_9),

where g is the sigmoid function. When action a_5 is selected, the eligibility e_i is calculated as follows:

   e_{3,5} = ∂/∂w_{3,5} ln( v_5 / (v_5 + v_10) )
           = (1/v_5) ∂v_5/∂w_{3,5} - (1/(v_5 + v_10)) ∂(v_5 + v_10)/∂w_{3,5}
           = v_3 g'(in_5) / v_5 - (v_3 g'(in_5) + v_8 g'(in_10)) / (v_5 + v_10),

   e_{1,3} = ∂/∂w_{1,3} ln( v_5 / (v_5 + v_10) )
           = (1/v_5) ∂v_5/∂w_{1,3} - (1/(v_5 + v_10)) ∂(v_5 + v_10)/∂w_{1,3}
           = w_{3,5} v_1 g'(in_3) g'(in_5) / v_5
             - (w_{3,5} v_1 g'(in_3) g'(in_5) + w_{3,5} v_6 g'(in_8) g'(in_10)) / (v_5 + v_10),

where in is the sum of the weighted input values of the unit, and g' is the derivative of the sigmoid function, g' = g(1 - g).
The above update rules are too sensitive to the output values. For example, if the output value v_5 is quite small, the eligibilities become very large, which easily causes oscillations. Conversely, if all the output values are equal (v_5 = v_10), the eligibilities become very small. This research therefore uses modified eligibilities as follows:

   e_{3,5} = v_3 (0.1 - v_5) g'(in_5) +
             v_8 (0.1 - v_10) g'(in_10) +
             v_3 (0.9 - v_5) g'(in_5) × 2,

   e_{1,3} = w_{3,5} v_1 g'(in_3) (0.1 - v_5) g'(in_5) +
             w_{3,5} v_6 g'(in_8) (0.1 - v_10) g'(in_10) +
             w_{3,5} v_1 g'(in_3) (0.9 - v_5) g'(in_5) × 2.

These equations are equivalent to those of complementary back-propagation.

4 Experiments

The update module needs about 10 milliseconds of computation. Eleven agents' update modules amount to about 110 milliseconds, so the module cannot be executed in every turn (100 milliseconds). The module is executed every 1100 milliseconds so as to operate safely.
An agent often kicks the ball two or three times in one catch. An agent receives a reward when it kicks the ball, but it should be given only one reward per catch, so an agent is limited to receiving only one reward per second.
If agents crowd around the ball, the agents in the lump often receive many rewards because the ball doesn't get out. Therefore an agent can receive a reward only when its distance to other agents is over 5 meters.
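The two reward filters described above can be written compactly; the function below is only a sketch, and the argument names are assumptions.

def rewarded_kick(current_time, last_reward_time, dist_to_nearest_agent,
                  min_interval=1.0, min_spacing=5.0):
    # At most one reward per second (roughly one reward per "catch"), and
    # no reward when other agents are within 5 meters (the lump case).
    if current_time - last_reward_time < min_interval:
        return 0.0, last_reward_time
    if dist_to_nearest_agent <= min_spacing:
        return 0.0, last_reward_time
    return 1.0, current_time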

As described in Section 3.2, moving the position is a costly action. Even if the output values of the position function indicate going somewhere distant, that action is not realistic, so an agent goes to a feasible place that it can reach with its remaining stamina.
An agent can move its position instantly before kick-off. The best position of an agent is decided by the other agents' positions; therefore, agents move their positions instantly in turn.
The learning team learns from a long game played against a team using the fixed position strategy.

4.1 Incidental Exploration


This section describes the results of learning with the incidental exploration of Section 3.2.
The learning team converged to the two-lumps strategy, in which agents tend to crowd around other team-mates and form the two lumps shown in Figure 7. One lump is formed in front of the team's goal and the other at the center of the field.
Figure 9 shows the transition of the behavior of an agent of the learning team. The horizontal axis is the learning time in units of 100 milliseconds. The vertical axis is the regularized feature value of behaviors; each element corresponds to an input value of the position function in Section 2.5, and a lower value means a closer distance. The agent therefore prefers a place closer to its own goal, farther from the opponents' goal, closer to a team-mate, and farther from opponents. The figure has been smoothed for clarity.
Figure 10 shows the transition of the learning team's score rate, the opponent team's score rate, and the rate of rewarded kicks of an agent. These rates are counted every 10 seconds and smoothed. None of the three rates is increasing or decreasing in this figure; in other words, the learning team did not learn well after all.
There are several reasons why this learning converged to the two-lumps strategy:
- If an agent prefers a place closer to another team-mate, the agent often leads other agents to the lump strategy. The lump strategy is easily reinforced.
- Usually, some agents rush to the ball's location, so the lump strategy is rarely weakened.
- If the learning team is not very strong, agents tend to reinforce places closer to the team's own goal.
- The center of the field is also apt to be reinforced because the ball frequently passes through there.

4.2 Observational Reinforcement


The incidental exploration reinforcement learning explores too rarely. In order to get out of the locally optimal policy, agents must explore policies more frequently. On the other hand, agents also have to exploit policies, because too much exploration disturbs other agents' learning. In conventional reinforcement learning there is no good solution for this trade-off. Fortunately, the trade-off can be eased in the soccer-position problem because an agent can estimate the positions in which it could kick the ball frequently. An agent can reinforce a promising position from observation, without practical experience: when the ball goes somewhere far from team-mates, agents reinforce that place. This section describes the results of this observational reinforcement learning.
Preliminary experiments made it clear that the absence of a goal-keeper is a fatal defect in this learning, so this section adds a special strategy for the goal-keeper: the goal-keeper has a fixed position and a fixed territory.
Figure 8 shows one situation in the middle of the learning. The formation of the learning team seems very appropriate against the opponents': each agent marks an opponent effectively, and the formation is moderately spread out. The learning team was stronger than the fixed position team during this period of the learning.
Unexpectedly, the learning team did not converge to this strategy. Figure 11 shows the transition of the behavior of an agent. This agent learned a strategy in which it prefers places closer to its own goal and farther from other team-mates. The transition of the score rate is shown in Figure 12. The learning team led the fixed position team at around 40000, but after that the learning team did not get stronger; rather, it got weaker. On the other hand, the kick rate increased favorably. The kick rate is the rate of rewards, so the learning team certainly did learn the rewarded positioning.

5 Conclusions

This paper explained the RoboCup team Andhill and applied reinforcement learning to a complex and large problem, the soccer-position problem. In particular, this paper used observational reinforcement as a solution to the exploration problem.
Andhill participated in the RoboCup97 tournament using the fixed position strategy and, fortunately, won the second prize. A chance of beating the champion team seemed to lie in position fitting; the approach taken in this research therefore seems promising.

Fig. 7. The two-lumps strategy in Section 4.1

Fig. 8. One situation in the middle of the learning in Section 4.2

Fig. 9. The transition of the behavior of an agent in Section 4.1: This figure omits the fifth input.

Fig. 10. The transition of the score rate in Section 4.1 (the rates at which the learning team scores, the opponent team scores, and the agent kicks the ball)



Fig. 11. The transition of the behavior of an agent in Section 4.2: This figure omits the fifth input.

Fig. 12. The transition of the score rate in Section 4.2 (the rates at which the learning team scores, the opponent team scores, and the agent kicks the ball)



References

1. Kaelbling, L.P., Littman, M.L. and Moore, A.W.: "Reinforcement Learning: A Survey", Journal of Artificial Intelligence Research 4, pp. 237-285 (1996).
2. Kimura, H., Yamamura, M. and Kobayashi, S.: "Reinforcement Learning in Partially Observable Markov Decision Processes", Journal of Japanese Society for Artificial Intelligence, Vol. 11, No. 5, pp. 761-768 (1996, in Japanese).
3. Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I. and Osawa, E.: "RoboCup: The Robot World Cup Initiative", Proceedings of IJCAI-95 Workshop on Entertainment and AI/Alife (1995).
4. Littman, M.L.: "Markov games as a framework for multi-agent reinforcement learning", In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157-163 (1994).
5. Noda, I.: "Soccer Server: a simulator of RoboCup", JSAI AI-Symposium 95: Special Session on RoboCup (1995).
6. Watkins, C.J.C.H. and Dayan, P.: "Technical Note: Q-Learning", Machine Learning 8, pp. 279-292 (1992).
