Use of Neural Networks As Decision Makers in Strategic Situations
prefer to go outside or stay inside depending on good or bad weather, and you know what you are aspiring to. Thus, as an intelligent agent, you are able to make a decision and choose the optimal strategy, i.e. the strategy that corresponds to your will. Therefore, as the objective is to design a decision maker that can make decisions as good as a human's, the HLANN will follow the same reasoning as a human: it will receive two sorts of inputs, the conditions and the possible choices, and it will be made up of two main parts: one part that makes the decision, and another part that expresses the feeling and the will of the intelligent agent.

The objective of the HLANN is to find and choose the strategy (output) that maximizes the Will-signal. Fig. 1 represents this whole Human-Like Neural Network.
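To make the two-part structure concrete, the following sketch lays out one way such an agent could be organized: a Decision Maker network that maps the inputs to a strategy, and a Feeling-Will network that scores the resulting payoff. The class names, layer sizes, and the rescaling of the output to [-1, 1] are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoLayerNet:
    # A small feed-forward network with logistic units (sizes are illustrative).
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.w1 = rng.normal(scale=0.5, size=(n_hidden, n_in))
        self.w2 = rng.normal(scale=0.5, size=(n_out, n_hidden))

    def forward(self, x):
        return logistic(self.w2 @ logistic(self.w1 @ x))

class HLANNSketch:
    # Two parts, as described above: a Decision Maker that turns the inputs
    # (conditions and possible choices) into a strategy, and a Feeling-Will
    # network that turns the resulting payoff into a Will-signal.
    def __init__(self, rng):
        self.decision_maker = TwoLayerNet(n_in=2, n_hidden=4, n_out=1, rng=rng)
        self.feeling_will = TwoLayerNet(n_in=1, n_hidden=4, n_out=1, rng=rng)

    def decide(self, inputs):
        # Logistic output in [0, 1], rescaled to a strategy in [-1, 1]
        # (the rescaling is an assumption; Section 4 uses -1 for defection
        # and 1 for cooperation).
        return float(2.0 * self.decision_maker.forward(inputs)[0] - 1.0)

rng = np.random.default_rng(0)
agent = HLANNSketch(rng)
print(agent.decide(np.array([1.0, -1.0])))  # the two possible strategies as inputs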
reinforcement signal, as represented in Fig. 1. We recall that r belongs to [-1, 1] and that the training aims to make it equal to 1 (the value for which the chosen strategy is the most wanted one). The training of the network will update the weights of each layer k of the network (w^k_{ij}) in order to minimize F(r). Thus, by applying the steepest descent algorithm, one finds:

\Delta w^k_{ij} = -\alpha \frac{\partial F}{\partial w^k_{ij}} = 2\alpha (1 - r) \frac{\partial r}{\partial w^k_{ij}}    (2)

where \alpha is the learning rate. Now, let us propagate this expression through the HLANN until we reach the output layer of the Decision Maker network:

\frac{\partial r}{\partial w^k_{ij}} = \frac{\partial r}{\partial g} \frac{\partial g}{\partial a^M} \frac{\partial a^M}{\partial n^M} \frac{\partial n^M}{\partial w^k_{ij}}    (3)
where r, g, and a are defined in Fig. 2, M is the number of layers of the Decision Maker, and n^M is the net input before its last layer's transfer function f^M. As we will need it below, we already define this function f^M as the logistic function. Now, let us calculate every member of the right part of (3):

\frac{\partial n^M}{\partial w^k_{ij}} = a^{M-1}    (4)

where a^{M-1} is the output of the (M-1)th layer of the Decision Maker network. Then, we have:

\frac{\partial a^M}{\partial n^M} = \frac{\partial f^M}{\partial n^M} = a^M (1 - a^M)    (5)

where a^n is the output of the nth layer. For the other members of (3), we will use a method similar to the one used in the Stochastic Real-Valued Neuron (SRV) of Gullapalli [9]: we cannot calculate the derivative of the payoff because the intelligent agent does not know this function, so we express it as follows:

\frac{\partial g}{\partial a}(a_{chosen}) \approx \frac{g(a_{chosen} + v_{environment} + v_{noise}) - g(a_{chosen})}{noise}    (6)

where the vector v_{environment} corresponds to the direction of the evolution of the other variables on which the payoff depends. The vector v_{noise} is a random vector that represents a variation of the choice of the intelligent agent. Its norm is |noise|, a random number selected according to a mean-zero Gaussian distribution with very small standard deviation (but noise ≠ 0). So g(a) is the real payoff, given by the combination of the chosen strategy a and the Environment (the other players' strategies), while g(a + v_{environment} + v_{noise}) is the expected payoff of a strategy that would correspond to the output a + v_{noise}. As the intelligent agent's strategy is a and not a + v_{noise}, the value of g(a + v_{environment} + v_{noise}) is not provided by the environment itself but by another neural network that approximates the Environment, and that is trained by a supervised-learning process, such as the back-propagation algorithm, during an exploration phase. The training set of this Environment Approximation network is constituted by the outputs of the Decision Maker network during the exploration phase, and the targets are given by the corresponding responses of the environment.

Then, the last member of (3) is expressed by the same method:

\frac{\partial r}{\partial g} \approx \frac{r(g + noise) - r(g)}{noise}    (7)

where r is the Will-signal, which only depends on the payoff. The noise is also selected according to a mean-zero Gaussian distribution with very small standard deviation (but noise ≠ 0).
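As an illustration of (6) and (7), the following sketch estimates both derivatives by finite differences, using toy stand-in functions for the Environment Approximation network g(·) and the Will-signal r(·); the functions and constants are assumptions chosen only to make the sketch runnable.

import numpy as np

rng = np.random.default_rng(0)

def env_approx(a):
    # Stand-in for the Environment Approximation network g(a) (an assumption):
    # maps a strategy in [-1, 1] to a payoff in [0, 3].
    return 3.0 * (a + 1.0) / 2.0

def will(g):
    # Stand-in for the Will-signal r(g), increasing with the payoff (an assumption).
    return float(np.tanh(g - 1.5))

a_chosen = 0.2                            # current output of the Decision Maker
v_env = 0.05                              # drift of the other players' strategies (illustrative)
noise = rng.normal(scale=1e-3) or 1e-3    # small, non-zero Gaussian perturbation

# Equation (6): perturb the chosen strategy and query the approximated payoff.
dg_da = (env_approx(a_chosen + v_env + noise) - env_approx(a_chosen)) / noise

# Equation (7): perturb the payoff and query the Will-signal.
g = env_approx(a_chosen)
dr_dg = (will(g + noise) - will(g)) / noise

print(dg_da, dr_dg)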
Thus, by replacing these expressions in (2), we obtain:
\Delta w^M = -\alpha S^M a^{M-1}    (8)

where S^M is the sensitivity of the Mth layer, given by:

S^M = -2(1 - r(g)) \, \frac{r(g + noise) - r(g)}{noise} \, \frac{g(a + v_{env} + v_{noise'}) - g(a)}{noise'} \, a^M (1 - a^M)    (9)

Then, as in the back-propagation algorithm, the training of the Decision Maker network is completed by back-propagating this sensitivity through its layers.
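Putting (5)-(9) together, the sketch below computes the output-layer sensitivity S^M and the corresponding weight update Δw^M for a single move. The stand-in payoff and Will functions, the layer activations, and the learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

def g(a):
    # Stand-in for the approximated payoff of strategy a (an assumption).
    return 3.0 * (a + 1.0) / 2.0

def r(payoff):
    # Stand-in for the Will-signal, increasing with the payoff (an assumption).
    return float(np.tanh(payoff - 1.5))

alpha = 0.1                              # learning rate
a_prev = np.array([0.3, 0.7, 0.1])       # a^{M-1}: output of layer M-1 (illustrative)
a_M = 0.6                                # a^M: logistic output of the last layer
noise = rng.normal(scale=1e-3) or 1e-3   # payoff-side perturbation, non-zero
noise2 = rng.normal(scale=1e-3) or 1e-3  # strategy-side perturbation, non-zero
v_env = 0.0                              # drift of the other players' strategies

payoff = g(a_M)

# Equation (9): sensitivity of the output layer, combining (5), (6) and (7).
S_M = (-2.0 * (1.0 - r(payoff))
       * (r(payoff + noise) - r(payoff)) / noise
       * (g(a_M + v_env + noise2) - g(a_M)) / noise2
       * a_M * (1.0 - a_M))

# Equation (8): steepest-descent update of the last layer's weights.
delta_w_M = -alpha * S_M * a_prev
print(delta_w_M)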
3.2. Evolution of the Feeling-Will network

The Feeling-Will network represents the objectives of the intelligent agent: its output is a Will-signal r(payoff) that is maximal (equal to 1) for the most desired payoff and minimal for the least desired payoff. In the Prisoner's Dilemma, the goal of every player is obvious: earn as much money as possible. Thus, the mapping between payoff and feeling will be as follows: the payoff $5 is awarded the maximum feeling (equal to 1), while the payoff $0 receives a feeling of -1. The Will-signal will initially just copy the feeling signal, in order to allow the intelligent agent to get the best feeling (a feeling of 1). For example, for the Prisoner's Dilemma, the training set given by the environment would be:

P = [0  1  3  5]
T = [-1  -0.5  0.5  1]    (10)

where 0, 1, 3 and 5 are the possible payoffs, and T gives their associated feelings.
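The pairs in (10) define only four payoff-feeling points; a minimal sketch of the resulting mapping, assuming linear interpolation between the listed payoffs (an interpolation choice that the paper does not specify), is:

import numpy as np

P = np.array([0.0, 1.0, 3.0, 5.0])      # possible payoffs
T = np.array([-1.0, -0.5, 0.5, 1.0])    # associated feelings / initial Will targets

def feeling(payoff):
    # Piecewise-linear interpolation between the training pairs of (10).
    return float(np.interp(payoff, P, T))

print(feeling(3.0))   # 0.5, the mutual-cooperation payoff
print(feeling(5.0))   # 1.0, initially the most desired payoff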
However, in the Iterated Prisoner's Dilemma, the goal of every player is not to earn as much money as possible on the current move, but to earn as much money as possible in the long term, which is different. To follow this objective, the Will-network has to evolve, in order to allow the intelligent agent to choose the strategy that maximizes its long-term feeling. As said before, short-term objectives and long-term objectives can call for different strategies, because some payoffs give large immediate feelings but then always lead to bad situations, so the Will-signal has to take this into consideration. To do so, the Will-network has to play the role of an adaptive critic unit, as introduced in [10]. It produces a reinforcement signal that represents a weighted sum of the actual feeling and the expected future feelings of each situation (each payoff). To produce this adaptive critic signal, the Will-network is trained after each play (move) by a supervised training algorithm (such as back-propagation) where the training set is constituted by the payoffs and their updated targets: the weighted sum of their immediate feeling and the expected future feeling from this situation (from this payoff). The immediate feeling is directly provided by the Feeling network, but the future reinforcement is given by the following discounted sum:

E_{Payoff}(t) = E\left\{\sum_{k=0}^{\infty} \gamma^k \, feeling(t + k)\right\}    (11)

Of course, the intelligent agent cannot predict the values of future feelings (feeling(t+k)), but it can approximate them by summing the weighted preceding feelings obtained from a similar situation in the past. Then, the output target of the Will-network for each situation (payoff) is updated after each move by the formula:

T_{Payoff} = \alpha \cdot feeling_{Payoff}(t) + (1 - \alpha) \cdot E_{Payoff}    (12)

where \alpha is a coefficient that indicates whether the intelligent agent prefers a large immediate feeling (\alpha = 1) or a good long-term feeling (\alpha = 0). This target vector is updated after each play in order to detect whether a situation indeed leads to a good or a bad long-term feeling. Note that the Feeling network is used only to train the Will-network. It thus allows the HLANN to update its Will-network in order to produce a Will-signal (a reinforcement signal) that will lead it to the strategy that optimizes its long-term or immediate feeling. This tendency toward long-term decisions is undoubtedly evidence of intelligent behavior.
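The following sketch illustrates the target update of (11) and (12) for the Prisoner's Dilemma payoffs, approximating the discounted expectation E_Payoff by the discounted sum of the feelings observed after each payoff; this bookkeeping is one possible reading of the "weighted preceding feelings" described above, not the paper's exact procedure.

gamma = 0.9     # discount factor of (11)
alpha = 0.3     # trade-off of (12): 1 = purely immediate, 0 = purely long-term

feeling_of = {0: -1.0, 1: -0.5, 3: 0.5, 5: 1.0}   # the training pairs of (10)
E_payoff = dict(feeling_of)                        # expectation, initialised to the feeling
target = dict(feeling_of)                          # Will-network targets T_Payoff

def update_target(payoff, observed_payoffs):
    # Equation (11): discounted sum of the feelings observed from this payoff on
    # (k = 0 is the immediate feeling); truncated to the payoffs seen so far.
    E_payoff[payoff] = sum(gamma ** k * feeling_of[p]
                           for k, p in enumerate(observed_payoffs))
    # Equation (12): mix the immediate feeling with the expected long-term feeling.
    # (In practice the result might be rescaled to keep the target in [-1, 1].)
    target[payoff] = alpha * feeling_of[payoff] + (1 - alpha) * E_payoff[payoff]

# Defecting against Tit for Tat: one payoff of 5, then mutual defection (payoff 1).
update_target(5, [5, 1, 1, 1, 1])
# Cooperating against Tit for Tat: a steady stream of payoffs of 3.
update_target(3, [3, 3, 3, 3, 3])
print(target[5], target[3])   # the target for 3 overtakes the target for 5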
4. Experiments and Results

In the previous sections, we introduced a new HLANN and its training algorithm. In this part, we apply this ANN to a well-known problem of Game Theory: the Iterated Prisoner's Dilemma, introduced in the first part of this paper. This dilemma captures well the problematics of human decisions in economics and will allow us to see how the HLANN can actually make decisions like humans. A mathematical description of this dilemma was first introduced in [4] and is displayed in Table 1.

Table 1. The payoff function of the Prisoner's Dilemma: the amount of money that each player receives, according to the strategies of both players.

                        Player B
                        C          D
Player A     C       (3,3)      (0,5)
             D       (5,0)      (1,1)
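For reference, Table 1 can be transcribed directly as a lookup function; the dictionary encoding below is just one convenient representation.

# Each entry gives (payoff to Player A, payoff to Player B).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def payoffs(strategy_a, strategy_b):
    return PAYOFF[(strategy_a, strategy_b)]

print(payoffs("D", "C"))   # (5, 0): A defects against a cooperator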
To test the HLANN, we make it participate in an IPD tournament against different well-known fixed strategies that have been studied and used in [3] and [4]. In this paper, we present the results of the HLANN against the following strategies: Always Cooperate; Always Defect; Tit for Tat (cooperate on the first move, and then, on every subsequent move, mimic the strategy played by the other player on the previous move, which is the demonstrated winning strategy according to [3] and [4]); and Soft Majority (cooperate as long as the opponent has cooperated more than he has defected, otherwise defect).
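For concreteness, these fixed opponents can be sketched as functions of the two move histories (most recent move last); the tie-breaking of Soft Majority when cooperations and defections are equal is an assumption.

def always_cooperate(my_moves, opp_moves):
    return "C"

def always_defect(my_moves, opp_moves):
    return "D"

def tit_for_tat(my_moves, opp_moves):
    # Cooperate on the first move, then repeat the opponent's previous move.
    return "C" if not opp_moves else opp_moves[-1]

def soft_majority(my_moves, opp_moves):
    # Cooperate as long as the opponent has not defected more than cooperated.
    return "C" if opp_moves.count("D") <= opp_moves.count("C") else "D"

# A few moves of Tit for Tat against an opponent that defects on its second move:
tft_history, opp_history = [], []
for opp_move in ["C", "D", "C"]:
    tft_history.append(tit_for_tat(tft_history, opp_history))
    opp_history.append(opp_move)
print(tft_history)   # ['C', 'C', 'D']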
For each game, the HLANN receives two inputs: the two possible strategies, namely Cooperation, mathematically represented by 1, and Defection, represented by -1. Its output is, as in [6], a continuous signal ranging between -1 (complete defection) and 1 (complete cooperation). Then, we give the HLANN its Feeling and Will networks' training sets (10), which are the same for both at the beginning, because the initial goal is to maximize the immediate feeling. The HLANN then begins with an exploration phase, in order to discover the payoff function, during which it maps the strategies of the two opponents to the payoff of the HLANN, as shown in Fig. 3.

[...] sum of the immediate feeling and the expectation of the future feeling for the payoff of 3 has become larger than that for the payoff of 5. Thus, when the payoff is 5, for example, the HLANN will change its weights in order to increase the Will-signal, as expressed by (1), which corresponds to a reduction of the payoff (it follows the curve of Fig. 3 from abscissa 5 to abscissa 3). According to Fig. 3, a reduction of the payoff corresponds to an increase in the output signal, i.e. a change of strategy from strategy -1 (Defection) to strategy 1 (Cooperation). So the HLANN changes its weights in order to increase its output until the Will-signal reaches the equilibrium, i.e. for a payoff of 3. Then, the output does not change anymore. As another example, if the opponent does not cooperate, the HLANN will never reach the payoffs of 3 or 5. Its abscissa will then be between 0 and 3. Thus, according to the derivative of the curve of Fig. 3 in this area, the HLANN has to increase the payoff (in order to increase its critic signal during the next move), which means, according to Fig. 3, a reduction of the output of the HLANN. So it will change its weights in order to adopt the strategy -1 (Defection).
References