
Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, Baoding, 12-15 July 2009

USE OF NEURAL NETWORKS AS DECISION MAKERS IN STRATEGIC SITUATIONS

BENOIT COURAUD, PEILIN LIU

Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China

E-MAIL: benoit.couraud@sjtu.edu.cn, liupeilin@sjtu.edu.cn

Abstract:

Intelligence consists of the ability to make the right decisions in a given situation in order to achieve a certain goal. Game Theory provides mathematical models of real-world situations for studying intelligent behavior. Most of the time, effective decision-making in strategic situations (such as competitive situations) requires a nonlinear mapping between stimulus and response. This sort of mapping can be provided by Artificial Neural Networks. This paper describes the use of a human-like Artificial Neural Network to find the optimal strategy in strategic situations without injecting expert knowledge. In order to train such a Neural Network, an unsupervised reinforcement-learning rule using Back-Propagation is introduced. Unlike most reinforcement learning systems, this learning rule can operate with continuous outputs, which makes it suitable for many different applications. Finally, this decision maker is used to find the optimal strategy in the well-known Iterated Prisoner's Dilemma, in order to demonstrate that this human-like Artificial Neural Network can be used to design machines that are also capable of intelligent behavior.

Keywords:

Neural Networks; Artificial Intelligence; Reinforcement training; Back-Propagation; Game Theory; Iterated Prisoner's Dilemma

1. Introduction

The main goal of Artificial Intelligence is to design machines that are capable of intelligent behavior. This intelligent behavior is usually described in terms of stimulus-response: the intelligent agent chooses one strategy (one action) in response to a particular situation. This situation can be one of the numerous real-world situations described by Game Theory.

The fundamentals of game theory were first laid out in [1]. Game theory attempts to mathematically study optimal behavior in strategic situations in which one's success depends on the choices of the other players. These strategic situations describe real-world situations such as those that occur in political science and economics [1], in evolution [2], or in social dilemmas [3]. In these situations, two or more intelligent agents (players) choose between different strategies that give each of them a payoff that also depends on the other players' strategies.

One of the most famous and useful games in economics is the Prisoner's Dilemma: consider the case where you, player A, play a game against an opponent, B, with whom you cannot communicate. You each have two plays: C (cooperate) or D (defect). If you both play C, you will both receive 3$. If you play C and they play D, you receive nothing. Alternatively, if you play D and they play C, you receive 5$. But if you both play D, you will each receive 1$.

The one-shot Prisoner's Dilemma (each player has just one move) has a single dominant strategy: Defect. But in the case of the Iterated Prisoner's Dilemma (IPD), the optimal strategy seems to be "Tit-for-Tat" (cooperate on the first move, and then, on every other move, mimic the strategy played by the other player on the previous move), as shown by Axelrod's analysis in [3] and [4].

This paper first introduces a Human-Like Artificial Neural Network (HLANN) and its own reinforcement-learning rule using Back-Propagation. It then exhibits one example of its use as an intelligent agent that is able to make the right decision in a strategic situation such as the Iterated Prisoner's Dilemma.

2. Human-like Artificial Neural Network

2.1. Human decisions

In order to design an ANN that is capable of intelligent behaviors such as those of humans, we first have to study the process that leads to human decision-making. In a given situation, humans know the conditions of the situation, the possible choices they have, how much they like each choice, and what they want. Let us study a simple situation: every day, you know the weather outside (this is a condition), and you have two choices: stay inside or go outside (these are the possibilities).
Furthermore, you know whether you usually prefer to go outside or to stay inside when the weather is good or bad, and you know what you are aspiring to. Thus, as an intelligent agent, you are able to make a decision and choose the optimal strategy, i.e. the strategy that corresponds to your will. Thereby, as the objective is to design a decision maker that can make decisions as good as a human's, the HLANN will follow the same reasoning as a human: it will receive two sorts of inputs, the conditions and the possible choices, and it will be made up of two main parts: one part that takes the decision, and another part that expresses the feeling and the will of the intelligent agent.

2.2. Artificial Neural Network

As described above, the HLANN receives two kinds of inputs: the conditions, as in [5], and the possible strategies that the intelligent agent can choose.

The main part of this HLANN is the Decision Maker network: an ANN (a multi- or single-layer perceptron) that maps the inputs of the HLANN to its output, the chosen strategy, modulo the addition of a small random variable in order to explore all the possibilities. Once the choice is made, the output of this network is combined with the environment variables to give the payoff of the intelligent agent (the payoff is the result of the chosen strategy, combined with the other players' strategies in Game Theory). If we stopped the design of our HLANN here, we would have the same ANN as the one that Harrald and Fogel used in [6] to solve the Iterated Prisoner's Dilemma: a multi-layer feed-forward perceptron with 6 condition-inputs that correspond to the previous moves of the two intelligent agents.

But, as described in the previous part, the HLANN is also made up of another part that expresses the Feeling and the Will of the intelligent agent. This part maps the payoff to a Feeling signal and to a Will signal. If a payoff is very lucrative for the intelligent agent, then the Feeling signal will be high, and inversely if the payoff is bad. To perform these mappings, we use two ANNs, multi- or single-layer perceptrons, which map each possible payoff first to a "Feeling signal", ranging between -1 (for the worst payoff) and 1 (for the best payoff), and then to a Will signal. The Will signal also ranges between -1 and 1 and acts as a critic of the strategies: it expresses whether a strategy is wanted by the intelligent agent (Will of 1) or not (Will of -1). The Will signal can be exactly the same as the Feeling signal, but this is not necessarily the case: if the payoff that produces the best immediate Feeling always leads to a situation with a very bad Feeling, it may not be preferred to a situation that produces a good average long-term Feeling.

Thereby, given a situation (conditions and possibilities), the objective of the HLANN is to find and choose the strategy (output) that maximizes the Will signal. Fig. 1 represents this whole Human-Like Neural Network.

Figure 1. Architecture of the Human-Like Artificial Neural Network.
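As a concrete illustration, the Python sketch below outlines one possible realization of the two parts just described: a Decision Maker perceptron that maps conditions and candidate strategies to a chosen strategy (plus a small exploration perturbation), and Feeling and Will perceptrons that map a payoff to signals in [-1, 1]. The layer sizes, the tanh activations and the class names are assumptions made here for readability, not the authors' exact implementation (the paper specifies a logistic transfer function at the last layer of the Decision Maker; see eq. (5) below).

```python
import numpy as np

class MLP:
    """Minimal multi-layer perceptron (illustrative sizes and activations)."""
    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0.0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(m) for m in sizes[1:]]

    def forward(self, x):
        a = np.asarray(x, dtype=float)
        self.activations = [a]              # kept for back-propagation
        for W, b in zip(self.W, self.b):
            a = np.tanh(W @ a + b)          # tanh keeps signals in [-1, 1];
            self.activations.append(a)      # the paper uses a logistic f^M instead
        return a

class HLANN:
    """Human-Like ANN: a Decision Maker network plus Feeling and Will networks."""
    def __init__(self, n_conditions, n_strategies):
        self.decision_maker = MLP([n_conditions + n_strategies, 8, 1])
        self.feeling_net = MLP([1, 4, 1])   # payoff -> immediate Feeling signal
        self.will_net = MLP([1, 4, 1])      # payoff -> Will (critic) signal

    def choose_strategy(self, conditions, possible_strategies, explore=0.1):
        x = np.concatenate([np.ravel(conditions), np.ravel(possible_strategies)])
        a = self.decision_maker.forward(x)
        # small random perturbation so that all possibilities get explored
        return np.clip(a + explore * np.random.randn(*a.shape), -1.0, 1.0)
```

In this sketch the Feeling and Will networks are kept separate, mirroring Fig. 1, even though they may share the same targets at the start of training (see Section 3.2).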
3. Learning Rules of the human-like ANN

The HLANN is constituted by two different neural networks that have to be trained, and the training algorithms of these two parts are not the same. The Decision Maker network is trained by reinforcement training, where the reinforcement signal is the output of the Feeling-Will network. If the intelligent agent has just two possible strategies (i.e. the Decision Maker network has the choice between two output strategies), the learning rule of the Decision Maker network could be one of the Selective Bootstrap [7] or Associative Reward-Penalty Neuron [8] rules. However, in most strategic situations, the intelligent agent has to choose between more than two strategies. Furthermore, as already said, the mapping between the situations and the strategy to choose is often nonlinear, so the Decision Maker network has to be a multilayer network. In these cases, we use reinforcement learning with Back-Propagation, as described below, in which the reinforcement signal produced by the Feeling-Will network is back-propagated to correct the weights of the Decision Maker network. Then, to complete the training of the HLANN, the Feeling-Will network is also updated, using the back-propagation algorithm, in order to make the Will of the intelligent agent evolve.

3.1. Reinforcement Learning using Back-Propagation

In this part, we introduce an adaptation of the Back-Propagation algorithm. In order to train the HLANN, we want to minimize the function

    F(r) = (1 - r)^2                                                    (1)

where r is the output of the Feeling-Will network, i.e. the reinforcement signal, as represented in Fig. 1.
We recall that r belongs to [-1, 1] and should be equal to 1 (the value for which the chosen strategy is the most wanted strategy). The training of the network will update the weights of each layer k of the network (w^k_ij) in order to minimize F(r). Thus, by applying the steepest-descent algorithm, one finds:

    Δw^k_ij = -α ∂F/∂w^k_ij = 2α (1 - r) ∂r/∂w^k_ij                     (2)

where α is the learning rate. Now, let us propagate this expression through the HLANN until we reach the output layer of the Decision Maker network:

    ∂r/∂w^k_ij = (∂r/∂g) (∂g/∂a^M) (∂a^M/∂n^M) (∂n^M/∂w^k_ij)           (3)

where r, g, and a are defined in Fig. 2, M is the number of layers of the Decision Maker, and n^M is the net input before its last layer's transfer function f^M. As we will need it below, we already define this function f^M as the logistic function. Now, let us calculate every member of the right-hand side of (3):

    ∂n^M/∂w^k_ij = a^(M-1)                                              (4)

where a^(M-1) is the output of the (M-1)th layer of the Decision Maker network. Then, we have:

    ∂a^M/∂n^M = ∂f^M/∂n^M = a^M (1 - a^M)                               (5)

where a^n is the output of the nth layer. For the other members of (3), we will use a method similar to the one used in the Stochastic Real-Valued (SRV) neuron of Gullapalli [9]: we cannot calculate the derivative of the payoff because the intelligent agent does not know this function, so we express it as follows:

    ∂g/∂a (a_chosen) ≈ [g(a_chosen + v_environment + v_noise) - g(a_chosen)] / noise    (6)

where the vector v_environment corresponds to the direction of evolution of the other variables on which the payoff depends. The vector v_noise is a random vector that represents a variation of the choice of the intelligent agent. Its norm is |noise|, where noise is a random number drawn from a zero-mean Gaussian distribution with very small standard deviation (but noise ≠ 0). So g(a) is the real payoff, given by the combination of the chosen strategy a and the Environment (the other players' strategies), while g(a + v_environment + v_noise) is the expected payoff of a strategy that would correspond to the output a + v_noise. As the intelligent agent's strategy is a and not a + v_noise, the value of g(a + v_environment + v_noise) is not provided by the environment itself but by another neural network that approximates the Environment, and that is trained by a supervised-learning process, such as the back-propagation algorithm, during an exploration phase. The training set of this Environment Approximation network is constituted by the outputs of the Decision Maker network during the exploration phase, and the targets are given by the corresponding responses of the environment.
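As an illustration, the finite-difference estimate of (6) and the exploration-phase training of the Environment Approximation network could look like the following Python sketch. The function and attribute names (estimate_dg_da, env_net.fit_step), the noise scale, and the one-sample training loop are assumptions introduced here for readability, not the authors' implementation.

```python
import numpy as np

def estimate_dg_da(env_approx, a_chosen, g_real, v_environment, noise_std=0.05):
    """Finite-difference estimate of dg/da, following eq. (6).

    env_approx    : callable Environment Approximation network
    a_chosen      : strategy actually played (Decision Maker output)
    g_real        : real payoff returned by the environment for a_chosen
    v_environment : estimated direction of evolution of the other variables
    """
    noise = 0.0
    while noise == 0.0:                      # the paper requires noise != 0
        noise = np.random.normal(0.0, noise_std)
    direction = np.random.randn(*np.shape(a_chosen))
    v_noise = noise * direction / (np.linalg.norm(direction) + 1e-12)  # norm |noise|
    g_expected = env_approx(a_chosen + v_environment + v_noise)
    return (g_expected - g_real) / noise

def train_env_approx(env_net, explored_strategies, observed_payoffs, epochs=100):
    """Supervised training of the Environment Approximation network on the
    (Decision Maker output, environment response) pairs collected during the
    exploration phase. fit_step is a hypothetical one-sample supervised update
    (e.g. one back-propagation step on a regression loss)."""
    for _ in range(epochs):
        for a, g in zip(explored_strategies, observed_payoffs):
            env_net.fit_step(a, g)
```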
Then, the last member of (3) is expressed by the same method:

    ∂r/∂g ≈ [r(g + noise) - r(g)] / noise                               (7)

where r is the Will signal, which depends only on the payoff. Here too, noise is drawn from a zero-mean Gaussian distribution with very small standard deviation (but noise ≠ 0).

Figure 2. Architecture of the HLANN learning algorithm.

Thus, by replacing these expressions in (2), we obtain:
    Δw^M = -α S^M a^(M-1)                                               (8)

where S^M is the sensitivity of the Mth layer, given by:

    S^M = -2 (1 - r(g)) · [r(g + noise) - r(g)] / noise
          · [g(a + v_env + v_noise') - g(a)] / noise' · a^M (1 - a^M)   (9)

Then, as in the back-propagation algorithm, the training of the Decision Maker network is completed by back-propagating this sensitivity through its layers.
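For concreteness, the output-layer update of (8)-(9) can be sketched in Python as follows, reusing the finite-difference estimates of (6) and (7). The helper names, the scalar treatment of the second noise term (for a one-dimensional strategy) and the learning-rate value are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def gaussian_noise(noise_std=0.05):
    """Zero-mean Gaussian noise, redrawn until non-zero, as the paper requires."""
    n = 0.0
    while n == 0.0:
        n = np.random.normal(0.0, noise_std)
    return n

def output_layer_update(a_M, a_prev, will_net, env_approx, a_chosen,
                        g_real, v_environment, alpha=0.05):
    """Reinforcement update of the Decision Maker's last layer, eqs. (8)-(9).

    a_M        : output of the last (logistic) layer
    a_prev     : output of layer M-1
    will_net   : callable Feeling-Will network, r = will_net(payoff)
    env_approx : callable Environment Approximation network
    g_real     : payoff actually received for the chosen strategy a_chosen
    """
    noise, noise2 = gaussian_noise(), gaussian_noise()
    r_real = will_net(g_real)
    dr_dg = (will_net(g_real + noise) - r_real) / noise                 # eq. (7)
    # for a vector strategy, noise2 would be replaced by a random vector of
    # norm |noise'| as in the previous sketch
    dg_da = (env_approx(a_chosen + v_environment + noise2)
             - g_real) / noise2                                         # eq. (6)
    S_M = -2.0 * (1.0 - r_real) * dr_dg * dg_da * a_M * (1.0 - a_M)     # eq. (9)
    delta_W_M = -alpha * np.outer(S_M, a_prev)                          # eq. (8)
    return delta_W_M, S_M
```

The sensitivity S_M returned here plays the same role as the usual output-layer sensitivity in back-propagation, so the remaining layers can be updated with the standard backward pass.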
3.2. Evolution of the Feeling-Will network

The Feeling-Will network represents the objectives of the intelligent agent: its output is a Will signal r(payoff) that is maximal (equal to 1) for the most desired payoff and minimal for the least desired payoff. In the Prisoner's Dilemma, the goal of every player is obvious: earn as much money as possible. Thus, the mapping between payoff and feeling will be as follows: the payoff '5$' is awarded the maximum feeling (equal to 1), while the payoff '0$' receives a feeling of -1. The Will signal will initially just copy the feeling signal, in order to allow the intelligent agent to get the best Feeling (a feeling of 1). For example, for the Prisoner's Dilemma, the training set given by the environment would be:

    P = [ 0    1    3    5 ]                                            (10)
    T = [-1  -0.5  0.5   1 ]

where 0, 1, 3 and 5 are the possible payoffs, and T gives their associated feelings.

However, in the Iterated Prisoner's Dilemma, the goal of every player is not to earn as much money as possible on a single move, but to earn as much money as possible in the long term, which is different. To follow this objective, the Will-network has to evolve, in order to allow the intelligent agent to choose the strategy that maximizes its long-term feeling. As said before, short-term objectives and long-term objectives can call for different strategies, because some payoffs give large immediate feelings but then always lead to bad situations, so the Will signal has to take this into consideration. To do so, the Will-network has to play the role of an adaptive critic unit, as introduced in [10]. It produces a reinforcement signal that represents a weighted sum of the actual feeling and the expected future feelings of each situation (each payoff). To produce this adaptive critic signal, the Will-network is trained after each play (move) by a supervised training algorithm (such as back-propagation) where the training set is constituted by the payoffs and their updated targets: the weighted sum of their immediate feeling and the expected future feeling from this situation (from this payoff). The immediate feeling is directly provided by the Feeling network, but the future reinforcement is given by the following discounted sum:

    E_Payoff(t) = E{ Σ_{k=0}^{∞} γ^k feeling(t + k) }                   (11)

Of course, the intelligent agent cannot predict the values of future feelings (feeling(t + k)), but it can approximate them by summing the weighted preceding feelings obtained from a similar situation in the past. Then, the output target of the Will-network for each situation (payoff) is updated after each move by the formula:

    T_Payoff = α · feeling_Payoff(t) + (1 - α) · E_Payoff               (12)

where α is a coefficient that indicates whether the intelligent agent prefers a large immediate feeling (α = 1) or a good long-term feeling (α = 0). This target vector is updated after each play in order to detect whether a situation indeed leads to a good or a bad long-term feeling. Note that the Feeling network is used solely to train the Will-network. It thus allows the HLANN to update its Will-network in order to produce a Will signal (a reinforcement signal) that will lead it to the strategy that optimizes its long-term or immediate feeling. This tendency towards long-term decisions is undoubtedly evidence of intelligent behavior.
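The adaptive-critic update of (11)-(12) can be sketched as follows. The expectation in (11) is approximated here by a discounted sum of the feelings actually observed after each occurrence of a payoff, which is one possible reading of "summing the weighted preceding feelings obtained from a similar situation in the past"; the class name and this bookkeeping are assumptions of the sketch.

```python
class WillTargets:
    """Per-payoff targets for the Will-network, updated after every move."""

    def __init__(self, payoffs, initial_feelings, gamma=0.9, alpha=0.5):
        # alpha is the blending coefficient of eq. (12), not a learning rate
        self.T = dict(zip(payoffs, initial_feelings))  # targets start as feelings
        self.E = dict(zip(payoffs, initial_feelings))  # expected future feeling
        self.gamma, self.alpha = gamma, alpha

    def update(self, payoff, immediate_feeling, subsequent_feelings):
        """immediate_feeling: Feeling-network output for this payoff;
        subsequent_feelings: feelings observed on the moves that followed it,
        used to approximate the discounted sum of eq. (11)."""
        self.E[payoff] = sum(self.gamma ** k * f
                             for k, f in enumerate(subsequent_feelings))
        self.T[payoff] = (self.alpha * immediate_feeling
                          + (1 - self.alpha) * self.E[payoff])          # eq. (12)
        return self.T[payoff]
```

The Will-network is then retrained by ordinary supervised back-propagation on the resulting (payoff, target) pairs, as stated above.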
4. Experiments and Results

In the previous sections, we introduced a new HLANN and its training algorithm. In this part, we apply this ANN to a well-known problem of Game Theory: the Iterated Prisoner's Dilemma, introduced in the first part of this paper. This dilemma describes well the problem of human decision-making in economics and will allow us to see how the HLANN can actually make decisions like humans. A mathematical description of this dilemma was first introduced in [4] and is displayed in Table 1.

Table 1. The payoff function of the Prisoner's Dilemma: amount of money that each player will receive, according to the strategies of both players. Each cell gives (payoff of A, payoff of B).

                        Player B
                      C           D
    Player A   C    (3,3)       (0,5)
               D    (5,0)       (1,1)
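Because the HLANN (like the players of [6]) produces a continuous move in [-1, 1] while Table 1 is defined only at the four pure-strategy corners, the environment approximated in Fig. 3 has to interpolate the matrix in some way. The bilinear (expected-value) interpolation below, which reduces exactly to Table 1 at the corners, is one plausible choice; the paper does not state which interpolation was actually used, so this is an assumption.

```python
def payoff_A(a, b):
    """Payoff of player A for continuous moves a, b in [-1, 1]
    (1 = full cooperation, -1 = full defection).
    Bilinear interpolation of Table 1: (C,C)=3, (C,D)=0, (D,C)=5, (D,D)=1."""
    p = (a + 1.0) / 2.0    # A's degree of cooperation, in [0, 1]
    q = (b + 1.0) / 2.0    # B's degree of cooperation, in [0, 1]
    return 3*p*q + 0*p*(1 - q) + 5*(1 - p)*q + 1*(1 - p)*(1 - q)

def payoff_B(a, b):
    return payoff_A(b, a)  # the game is symmetric
```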

To test the HLANN, we make it participate in an IPD tournament against different well-known fixed strategies that have been studied and used in [3] and [4]. In this paper, we present the results of the HLANN against the following strategies: Always Cooperate, Always Defect, Tit for Tat (cooperate on the first move, and then, on every other move, mimic the strategy played by the other player on the previous move, which is the demonstrated winning strategy according to [3] and [4]), and Soft Majority (cooperate as long as the opponent has cooperated more than he has defected; otherwise, defect).
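These fixed opponents are simple enough to be written down directly; the sketch below encodes cooperation as 1 and defection as -1, as in the text. Treating a tied cooperation/defection count as cooperation (so that Soft Majority cooperates on its first move) and thresholding continuous moves at 0 are assumptions made here.

```python
def always_cooperate(opponent_history):
    return 1.0                                    # C encoded as 1

def always_defect(opponent_history):
    return -1.0                                   # D encoded as -1

def tit_for_tat(opponent_history):
    # cooperate first, then copy the opponent's previous move
    return 1.0 if not opponent_history else opponent_history[-1]

def soft_majority(opponent_history):
    # cooperate while the opponent has cooperated at least as often as defected
    cooperations = sum(1 for move in opponent_history if move > 0)
    defections = len(opponent_history) - cooperations
    return 1.0 if cooperations >= defections else -1.0

# opponent_history: list of the opponent's previous moves, oldest first
```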
For each game, the HLANN receives two inputs: the two possible strategies, namely Cooperation, mathematically represented by 1, and Defection, represented by -1. Its output is, as in [6], a continuous signal ranging between -1 (complete defection) and 1 (complete cooperation). We then give the HLANN the training sets (10) for its Feeling and Will networks, which are the same for both at the beginning, because the initial goal is to maximize the immediate feeling. The HLANN then begins with an exploration phase, in order to discover the payoff function, during which it maps the strategies of the two opponents to the payoff of the HLANN, as shown in Fig. 3.

Figure 3. Approximation of the environment for the IPD.

After this exploration phase, the game can begin: the HLANN produces an output (a strategy). The environment combines it with the strategy of the opponent and gives the HLANN its payoff. This payoff goes through the Will-network to produce the reinforcement signal. It also produces a feeling signal, which is then used to update the Will-network as explained in the previous section. For example, Fig. 4 shows the Will-network output after a game between two HLANNs. We can see that the most wanted situation (payoff) is no longer the largest payoff of 5 (as it was at the beginning), but the payoff of 3. This is explained by the fact that the weighted sum of the immediate feeling and the expectation of the future feeling of the payoff of 3 has become larger than that of the payoff of 5. Thus, when the payoff is 5, for example, the HLANN will change its weights in order to increase the Will signal, as expressed by (1), which corresponds to a reduction of the payoff (it follows the curve of Fig. 3 from abscissa 5 to abscissa 3). According to Fig. 3, a reduction of the payoff corresponds to an increase in the output signal, i.e. a change of strategy from strategy -1 (Defection) to strategy 1 (Cooperation). So the HLANN changes its weights in order to increase its output until the Will signal reaches the equilibrium, i.e. for a payoff of 3. Then, the output does not change anymore. As another example, if the opponent does not cooperate, then the HLANN will never reach the payoffs of 3 or 5. Its abscissa will then be between 0 and 3. Thus, according to the derivative of the curve of Fig. 3 in this area, the HLANN has to increase the payoff (in order to increase its critic signal during the next move), which means, according to Fig. 3, a reduction of the output of the HLANN. So it will change its weights in order to adopt strategy -1 (Defection).

Figure 4. Will-network output after a game between two HLANNs.

This strategy is the one found by the HLANN itself, without any expert knowledge. This strategy led the HLANN to a large win in the tournament, as shown in Table 2 and Fig. 5. As one can see in Table 2, the HLANN is the intelligent agent that, after having played against all the other intelligent agents, had the maximum average score (after 200 iterations in each game).

This win is not surprising, because the HLANN does not have a fixed strategy: it analyses by itself the possibilities in each situation, chooses the one that maximizes its feeling, and updates its Will signal after every move, in order to maximize the long-term feeling, as a rational player would do.
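A round-robin tournament of the kind whose results are reported in Table 2 and Fig. 5 can be run with a loop of the following shape, reusing the strategy and payoff functions sketched earlier; the HLANN itself would be wrapped as one more strategy callable that additionally performs its weight updates after each move. The bookkeeping details here are assumptions.

```python
def play_game(strategy_A, strategy_B, payoff_A, n_moves=200):
    """Average payoff of players A and B over one iterated game
    (the text uses 200 moves per game)."""
    history_A, history_B = [], []        # moves already played by each player
    total_A = total_B = 0.0
    for _ in range(n_moves):
        a = strategy_A(history_B)        # each player sees the opponent's history
        b = strategy_B(history_A)
        total_A += payoff_A(a, b)
        total_B += payoff_A(b, a)        # symmetric game
        history_A.append(a)
        history_B.append(b)
    return total_A / n_moves, total_B / n_moves

# Example: Tit for Tat against Soft Majority cooperates throughout, so
# play_game(tit_for_tat, soft_majority, payoff_A) returns (3.0, 3.0).
```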

Table 2. Averaged scores of the most well-known intelligent agents. Scores correspond to the average amount of money the intelligent agent would receive after playing 100 moves.

    Intelligent Agent (strategy)    Average Score
    HLANN                           320
    Tit for Tat                     288
    Soft Majority                   281
    Always Cooperate                248
    Always Defect                   237

What is very interesting is that the HLANN is able to adapt itself to each situation and to change its strategy, which is a characteristic of intelligence.

Figure 5. Average score of each of the 5 strategies and average score of their opponents.

Figure 5 shows that the HLANN is the intelligent agent that produces the best average score for itself, and a low score for its opponent. This is a typical ambitious (human) strategy. Furthermore, as we just said, the HLANN is able to adapt its strategy to the opponent. It is also able to adapt its Will to the situation, and thus to find the one strategy that optimizes its long-term feeling, which is the characteristic of an intelligent behavior.

5. Conclusions

In this paper, we have presented a human-inspired approach that led to the design of an Artificial Neural Network that is capable of intelligent behavior. This human-like Artificial Neural Network requires a specific reinforcement learning rule using Back-Propagation that is inspired by Gullapalli's SRV. We have also presented a simple example of the use of this human-like Artificial Neural Network: the Iterated Prisoner's Dilemma. This example demonstrates the efficiency of the architecture and of the learning algorithms developed for this Human-Like Artificial Neural Network. It also underlines the fact that this human-like Artificial Neural Network is capable of intelligent behavior in decision-making, without injecting any expert knowledge. It is indeed able to adapt itself to the situation, and to find a strategy that optimizes its feeling. Finally, because neural networks are universal function approximators, this Human-Like Artificial Neural Network can be a very useful tool in more complex situations, for example continuous situations with a more complex environment, where one has to find the optimal strategy without the intervention of expert knowledge.

Acknowledgements

The authors wish to acknowledge the support of Elodie Durand and Pierre Nicol through many useful discussions, and Chen Xianmin, Zhang Yunfei and Zhu Jiayi for their advice.

References

[1] J. Von Neumann and O. Morgenstern, Theory of Games and Economic Behavior. Princeton, NJ: Princeton Univ. Press, 1944.
[2] J. Maynard Smith, Evolution and the Theory of Games. Cambridge University Press, 1982.
[3] R. Axelrod, The Evolution of Cooperation. New York: Basic Books, 1984.
[4] R. Axelrod, "More effective choice in the iterated prisoner's dilemma", J. Conflict Resolution, vol. 24, pp. 379-403, 1980.
[5] C. L. Tan, T. S. Quah, and H. H. Teh, "An artificial neural network that models human decision making", Computer, vol. 29, pp. 64-70, 1996.
[6] P. G. Harrald and D. B. Fogel, "Evolving continuous behaviors in the iterated prisoner's dilemma", Biosystems, vol. 37, no. 1-2, pp. 135-145, 1996.
[7] B. Widrow, N. K. Gupta, and S. Maitra, "Punish/reward: Learning with a critic in adaptive threshold systems", IEEE Transactions on Systems, Man, and Cybernetics, vol. 5, pp. 455-465, 1973.
[8] A. G. Barto, "Learning by statistical cooperation of self-interested neuron-like computing elements", Human Neurobiology, vol. 4, pp. 229-256, 1985.
[9] V. Gullapalli, "A stochastic reinforcement learning algorithm for learning real-valued functions", Neural Networks, vol. 3, pp. 671-692, 1990.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
