

DECISION ENGINEERING:

PROJECT REPORT
Multi-objective analysis of the
collective risk dilemma

Jasper Busschers
May 24, 2020

Faculty of Sciences and Bio-engineering


Contents

0.1 Summary
0.2 Methodology
    0.2.1 Multi-agent multi-objective Markov decision process
0.3 Experiment
    0.3.1 Collective risk dilemma
0.4 Results
    0.4.1 Like-minded agents
0.5 Conclusion
0.6 Appendix
    0.6.1 Set of different alternatives
    0.6.2 Additional plots

Abstract

The field of multi-objective optimisation has a long history with many
approaches to evaluate the quality of different solutions. Often these
methods model the preferences of the decision maker by mapping multiple
objectives onto a single scalarised reward. In this work we propose a
model that takes into account the preferences of multiple decision makers.
The work is meant to bring multi-objective analysis approaches to the
newer field of reinforcement learning (RL). In reinforcement learning a
specialised learning algorithm is used to choose a sequence of actions
that optimises some objective. Multi-objective RL is a subfield of RL
that studies problems whose objective can be decomposed into multiple
smaller objectives. It focuses on discovering all optimal solutions and
providing decision support or recommendations.
In this work we focus on an extended version of the problem called
multi-agent multi-objective games[1]. In these problems a group of
decision makers must make decisions simultaneously, and each decision
maker has its own objectives and preferences. Since each decision maker
only cares about itself, so-called greedy behaviour often occurs. This is
similar to the well-known prisoner's dilemma, where the Nash equilibrium
shows that if every decision maker takes the action that is best for
itself, a globally sub-optimal solution is reached. We will use the tool
D-Sight to evaluate with which weights and preference functions over the
different objectives greedy behaviour can be avoided. The problem we
focus on is the collective risk dilemma. We study the impact of different
preference functions for each agent and implement a voting system in
which each agent can vote for the solution it likes most.
0.1 Summary
The field of reinforcement learning (RL) has made rapid progress over the past
years, from defeating top human players at the game of Go[2] to the emerging
industries of self-driving cars, factory robots and much more. Traditionally
the optimisation problem for RL is given as a Markov decision process (MDP),
which defines a set of actions the decision maker can take. We are then
interested in a so-called policy, a function that learns the best action to
take in each situation in order to maximise its objective. In the case of Go,
for example, the objective is simply winning the game; however, many problems
cannot easily be expressed as a single-objective problem. A self-driving car
must get to its destination as fast as possible, but also avoid risky
situations and respect traffic rules. Combining all these different objectives
into a single encapsulating reward function to be optimised can be extremely
challenging.
In reality, it is also often not only about the preferences of a single
decision maker. With self-driving cars, for example, there will be many more
cars on the road, each optimising its own objective. While each car may find a
reasonable solution, it is very unlikely that this solution will give the best
overall traffic conditions, which would be the global objective.
This is the kind of decision-making problem we study in this work. We use the
collective risk dilemma to illustrate this way of analysing multi-agent
multi-objective problems. The collective risk dilemma models different
decision makers who each invest money over a number of rounds in order to
reach a common goal. Each player in this game wants to minimise the money it
has to invest while maximising the investment of the others, and also wants to
reach the common goal. This is often compared to climate change, where
different countries have to invest in order to reach a goal. Some countries
may have more incentive to invest than others, based on the risks that climate
change poses to them and the amount of capital they own.
In this work we use the traditional Promethee methods to analyse the different
possible optimal outcomes for this problem, and we study which outcome is
preferred by most agents for different models of their preferences. We use a
voting system in which the ranking of the different solutions by each decision
maker is used as a new objective. Finally, we optimise over the rankings of
the agents to find solutions that are preferred by most of the decision
makers.
0.2 Methodology
0.2.1 Multi-agent multi-objective Markov decision process
In this paper we work with an extended version of the Markov decision process
in which there are multiple agents or decision makers, all operating
simultaneously while optimising their own reward, which is defined in terms of
multiple objectives. The problem is defined by the following components:

• a set of n agents or decision makers, AG;

• a set of environment and agent states, S;

• a set of possible actions, A;

• $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, the probability of transitioning from state s to state s' under action a;

• $\{g_1(s), g_2(s), \dots, g_k(s) \mid s \in S\}$, a set of k objectives to minimise or maximise;

• $\{W^a = \{w_1^a, w_2^a, \dots, w_k^a\} \mid a \in AG\}$, a set of k weights for each agent;

• $\{R^a(s, s') = w_1^a \, g_1(s') + w_2^a \, g_2(s') + \dots + w_k^a \, g_k(s') \mid a \in AG, s' \in S\}$, the set of rewards for each decision maker after taking action a in state s that resulted in state s'; a small illustrative sketch of this scalarisation follows the list.
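As an illustration, this linear scalarisation can be written in a few lines of Python. This is only a sketch; the names `Objective` and `scalarised_reward` are ours and not part of the formal definition above.

```python
from typing import Callable, Sequence

# A state can be any object; an objective maps a state to a float.
Objective = Callable[[object], float]

def scalarised_reward(weights: Sequence[float],
                      objectives: Sequence[Objective],
                      next_state: object) -> float:
    """Linear scalarisation R^a(s, s') = sum_i w_i^a * g_i(s') for one agent."""
    assert len(weights) == len(objectives)
    return sum(w * g(next_state) for w, g in zip(weights, objectives))

# Toy usage with two objectives on a state that is just a number in [0, 1].
money_left = lambda s: float(s)
fairness = lambda s: 1.0 - float(s)
print(scalarised_reward([0.5, 0.5], [money_left, fairness], 0.8))  # -> 0.5
```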

In our approach we extend this definition to also use the Promethee notion of
preference[3]. More specifically, we allow every agent to have a preference
function for each objective. This function is defined as follows:

$P_j(a, b) = \{F_j[d_j(a, b)] \mid \forall a, b \in S\}$

where $d_j(a, b)$ computes the difference between the values of objective j in
two solutions:

$d_j(a, b) = \{g_j(a) - g_j(b) \mid \forall a, b \in S\}, \qquad d_j(a, b) \geq 0$

The preference function is then used to indicate in more detail how much the
agent's preference changes as the value of an objective increases. In our work
the reward for each agent is thus defined as:

$\{R^a(s, s') = w_1^a \, P_1^a[g_1(s')] + w_2^a \, P_2^a[g_2(s')] + \dots + w_k^a \, P_k^a[g_k(s')] \mid \forall a \in AG, s' \in S\}$
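A minimal sketch of what such a preference function could look like, assuming the common linear criterion with an indifference threshold q and a preference threshold p, and applying it directly to the objective values as in the reward formula above (all names are illustrative):

```python
def linear_preference(d: float, q: float, p: float) -> float:
    """Linear preference criterion: no preference below the indifference
    threshold q, full preference above the preference threshold p, and a
    linear ramp in between (d is assumed non-negative)."""
    if d <= q:
        return 0.0
    if d >= p:
        return 1.0
    return (d - q) / (p - q)

def preference_reward(weights, objective_values, q, p):
    """Preference-weighted reward sum_i w_i * P_i[g_i(s')], using the same
    linear criterion and thresholds for every objective for simplicity."""
    return sum(w * linear_preference(g, q, p)
               for w, g in zip(weights, objective_values))
```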
Using this combined reward, it is possible to create a ranking over all the
alternatives for each agent. This ranking is used as a new objective, where a
value of 1 indicates the most preferred alternative and a value of 0 indicates
the least preferred one; all other alternatives are distributed along this
interval according to their rank. Let AL be the set of alternatives of length
l; there then exists a ranking such that the higher the reward of an
alternative, the lower (better) its position in the ranking. Writing the rank
of a solution as $R^a(s, s')$, the preference of each agent is then:

$\{R_p^a(s, s') = (l - R^a(s, s')) / l \mid s, s' \in S, a \in AG\}$

$R = \max_{a \in A}\left(R_1^p(s, s') + R_2^p(s, s') + \dots + R_n^p(s, s')\right)$

In our work we assume a democratic voting system in which the importance of
each ranking is set to 1; this can be seen as a hyperparameter that models the
voting rights of each decision maker.
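To make the rank-to-objective conversion and the democratic vote concrete, the following self-contained sketch uses our own naming; whether ranks are counted from 0 or 1 (and thus whether the extremes land exactly on 0 and 1) is a convention we simply fix here:

```python
from typing import Dict, List

def rank_to_preference(ranking: List[int]) -> Dict[int, float]:
    """Map a ranking of l alternatives (position 0 = most preferred) to a
    score following R_p = (l - rank) / l, so the most preferred alternative
    scores highest and the least preferred scores lowest."""
    l = len(ranking)
    return {alt: (l - pos) / l for pos, alt in enumerate(ranking)}

def democratic_score(rankings: List[List[int]],
                     voting_rights: List[float]) -> Dict[int, float]:
    """Sum the per-agent preference scores, weighted by each agent's voting
    right (all equal in the democratic setting of this report)."""
    totals: Dict[int, float] = {}
    for ranking, right in zip(rankings, voting_rights):
        for alt, score in rank_to_preference(ranking).items():
            totals[alt] = totals.get(alt, 0.0) + right * score
    return totals

# Example: three agents ranking four alternatives, each with voting right 1.
rankings = [[2, 0, 1, 3], [0, 2, 1, 3], [1, 2, 0, 3]]
scores = democratic_score(rankings, [1.0, 1.0, 1.0])
best = max(scores, key=scores.get)  # the alternative preferred by the group
```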

0.3 Experiment
0.3.1 Collective risk dilemma
The problem we study in this paper is the collective risk dilemma. It is
defined by the following components:

• a set of 3 agents or decision makers, AG;

• a duration in number of rounds and a target value, ROUNDS = 5, TARGET = 75;

• a set of possible investments for each round, A = [0, 5, 10];

• a start capital, C = 50;

• a risk for each decision maker, which decides how much of the start capital the agent loses whenever the goal is not reached, RISK = [0.25, 0.25, 0.25].

We then compute all possible outcomes using combinatorics and define a set of
objectives over these outcomes. We call the set of outcomes O. The reward for
each agent consists of the following objectives (a small enumeration sketch
follows the list):

• the percentage of money left, $ml_a(o) = (C - I)/C$, where I is the amount invested after 5 rounds, $o \in O$ and $a \in AG$;

• a binary value indicating whether the goal was reached, $gr(o) = \text{sum}(o) \geq TARGET$;

• the fairness of the solution, defined as $fa(o) = 1 - \text{std}(o)$, where std computes the standard deviation over the rewards of the different agents. This objective ensures that no single agent performs much better than all the others.
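The outcome set O and these objectives can be enumerated with a short script. This is our own simplification: since only each agent's total investment matters for the objectives above, we enumerate per-agent totals rather than full five-round investment sequences, and we assume the population standard deviation for the fairness term (which appears to reproduce the values in the appendix).

```python
from itertools import product
from statistics import pstdev

ROUNDS, TARGET, CAPITAL = 5, 75, 50
INVESTMENTS = [0, 5, 10]                    # possible investment per round

# Reachable per-agent totals over 5 rounds: 0, 5, ..., 50.
totals = sorted({sum(seq) for seq in product(INVESTMENTS, repeat=ROUNDS)})
outcomes = list(product(totals, repeat=3))  # one total invested per agent

def objectives(outcome):
    """Return [ml_1, ml_2, ml_3, gr, fa] for a single outcome."""
    money_left = [(CAPITAL - inv) / CAPITAL for inv in outcome]   # ml_a(o)
    goal_reached = 1 if sum(outcome) >= TARGET else 0             # gr(o)
    fairness = 1 - pstdev(money_left)                             # fa(o)
    return money_left + [goal_reached, fairness]
```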

In our case with 3 agents, the objective for each agent becomes:

$R_1(o) = \max_{a \in A}\left(w_1^1\, ml_1(o) - w_2^1\, ml_2(o) - w_3^1\, ml_3(o) + w_4^1\, gr(o) + w_5^1\, fa(o)\right)$

$R_2(o) = \max_{a \in A}\left(-w_1^2\, ml_1(o) + w_2^2\, ml_2(o) - w_3^2\, ml_3(o) + w_4^2\, gr(o) + w_5^2\, fa(o)\right)$

$R_3(o) = \max_{a \in A}\left(-w_1^3\, ml_1(o) - w_2^3\, ml_2(o) + w_3^3\, ml_3(o) + w_4^3\, gr(o) + w_5^3\, fa(o)\right)$
Each decision maker wants to maximise the money it has left in the end while
minimising the money the other players have left. Independent of the weights
of each agent, this problem has a set of non-dominated solutions, which we
computed by removing all dominated outcomes from O.
The final set contained 48 non-dominated solutions. Notably, only 1 of these
solutions did not reach the goal: the case where every decision maker invests
nothing. This is expected, because whenever the goal is not reached the
investments were for nothing, and it would always have been better to invest
less.
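One straightforward way to obtain the non-dominated set is a pairwise Pareto filter over the objective vectors. The sketch below is generic; exactly which objectives enter the dominance check, and with which orientation, follows the agents' reward definitions above.

```python
from typing import List, Sequence

def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """u dominates v if it is at least as good on every objective and
    strictly better on at least one (higher is assumed to be better)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def non_dominated(vectors: List[Sequence[float]]) -> List[int]:
    """Indices of the Pareto-optimal (non-dominated) objective vectors."""
    return [i for i, u in enumerate(vectors)
            if not any(dominates(v, u) for j, v in enumerate(vectors) if j != i)]
```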
We then use D-Sight to create a ranking of the solutions based on the
preferences of each decision maker. As discussed in the previous section, we
convert this ranking to a value between 0 and 1. This results in the final
optimisation problem, which reflects a democratic voting system:

$R(o) = \max_{a \in A}\left(R_1^p(o) + R_2^p(o) + R_3^p(o)\right)$

For each agent there is also a preference function defined for each of the
objectives related to remaining money. It is specified by a preference
threshold and an indifference threshold, set both for the agent's own
remaining money and for the other agents' remaining money. The indifference
threshold indicates at what point the agent starts to care about a difference
in remaining money; the preference threshold indicates from which point on the
agent cares maximally about such a difference.

0.4 Results
0.4.1 Like-minded agents
In the first experiment we assume that every agent follows the same
preferences, with weights defined as follows:

$\{W^a = [0.2, 0.2, 0.2, 0.2, 0.2] \mid \forall a \in AG\}$

We also set the preference function of each agent for each objective related
to remaining money. This is modelled as a linear function with an indifference
threshold of 0.2 and a preference threshold of 0.3. After using D-Sight to
compute a ranking for each agent, we end up with the following rankings.

Figure 1: Visualisation of scores of alternatives for agent 1.

Figure 2: Visualisation of scores of alternatives for agent 2.

Figure 3: Visualisation of scores of alternatives for agent 3.

We can see that every agent has significantly different preferred solutions.
We investigate this by looking at some representative solutions in more
detail. To keep it brief, we only compare the different solutions for the
first agent. We chose 3 solutions to take a closer look at: solution 30, which
was ranked as the best solution; solution 35, in which each agent invests
exactly the same amount; and solution 36, the worst-rated solution, in which
agent 1 pays all of its money and agent 2 also pays a bit. Full details of
each solution can be found in the appendix at the end of this report.

Figure 4: Visualisation of scores of alternatives for agent 1.

We see that solution 30, which contains greedy behaviour, is ranked as the
highest scoring; this will be the case for each agent as long as they follow a
similar weight profile.

Figure 5: Visualisation of scores of alternatives for agent 1.

We can see from the AB analysis that the fairest solution is not the highest
rated, while many greedy and fair solutions have a similar overall score. A
trade-off must be made between an agent's remaining money and the fairness of
the group. The next step is to see whether we can use a voting system to reach
a solution in which none of the other agents is exploited into paying most of
the sum. We now create a new dataset by using the ranking of each agent as a
new objective, and implement a system that takes the preference of each
decision maker into account.

Democracy optimisation problem
As discussed in section 0.3.1 we use the ranking of each agent to compute a new
objective and create a global optimisation problem out of these new objectives.
This will give us 3 objectives for each of the 48 possible solutions.

Figure 6: The overall ranking of solutions using equal weight for each agent's
ranking.

Notably, solution 35 scores very high, along with solution 20, which scored
the highest. Note that solution 35 is the fairest solution according to our
definition of fairness: every agent invests the same amount of 25. Solution 20
is quite good for agents 1 and 3 but very bad for agent 2, meaning that the
global optimum in this case is still a solution in which one of the agents is
exploited, although it is closely followed by the fairest solution, which
could become the optimum with a slight weight adjustment.

Different preferences
As second experiment we will test the model using different preferences for each
agent. We inspire the weight profiles on key figures who decide about climate
change. Note that the chosen weights are just arbitrarily chosen based on the
countries position against climate change.
Decision maker Own money Spending others Climate change Fairness
Donald Trump (USA) 50% 2x (20%) 0% 10%
Xi Yingping (CHINA) 20% 2x (10%) 50% 10%
Vladimir Poetin (RUSSIA) 40% 2x(20%) 5% 15%

We also define a preference and indifference threshold for each country (both
tables are summarised in the configuration sketch below).

Decision maker            Indifference   Preference
Donald Trump (USA)        0.1            0.3
Xi Jinping (CHINA)        0.3            0.5
Vladimir Putin (RUSSIA)   0.2            0.4
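Read together, the two tables can be collected into one configuration per decision maker. The structure below is purely illustrative; in particular, interpreting the "2x" column as two equal weights (one per other agent) is our assumption, and the signs of the money terms follow the reward definitions of section 0.3.1.

```python
# Weight order assumed: [own money left, money left of other 1,
# money left of other 2, goal reached (climate), fairness].
profiles = {
    "USA":    {"weights": [0.50, 0.20, 0.20, 0.00, 0.10], "q": 0.1, "p": 0.3},
    "China":  {"weights": [0.20, 0.10, 0.10, 0.50, 0.10], "q": 0.3, "p": 0.5},
    "Russia": {"weights": [0.40, 0.20, 0.20, 0.05, 0.15], "q": 0.2, "p": 0.4},
}

# Sanity check: every weight profile sums to 1.
assert all(abs(sum(p["weights"]) - 1.0) < 1e-9 for p in profiles.values())
```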
We follow the same steps as in the previous example and proceed to the
democracy optimisation problem.

Figure 7: The overall ranking of solutions using equal weight for each country.

We see that solution 35 has been rated as the best solution; in this case each
country invests the same amount. Solutions 43 and 30 are the second- and
third-best options, which we analyse further in a spider diagram.

Figure 8: Spider web chart of 3 best solutions (20,43,35) and best solution when
USA had more weight (42).

We see that the best solutions found favour China and Russia more than the
USA. Depending on the weight given to each country, this best decision may
shift, which we investigate using the following stability interval chart.

Figure 9: Stability of the weight for each country.

In our model we view the weight of each country as its voting right. We can
see that the USA has little say in the final outcome, as the decision would
not change even if the vote of the USA were ignored. Whenever the voting right
of the USA increases above 39.79%, the final outcome may change: solution 42,
which strongly favours the USA, then becomes the best solution. This can be
seen in figure 8.
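The weight-stability analysis behind figure 9 can be approximated by sweeping one country's voting right and recording where the winner changes. The function below is only a sketch and assumes the per-country preference scores of all alternatives are already available under illustrative names.

```python
from typing import Dict, List

def winner_for_usa_weight(preference_scores: Dict[str, List[float]],
                          usa_weight: float) -> int:
    """Index of the winning alternative when the USA holds `usa_weight` of
    the voting rights and the remainder is split evenly between China and
    Russia (one possible convention for such a sweep)."""
    rest = (1.0 - usa_weight) / 2
    rights = {"USA": usa_weight, "China": rest, "Russia": rest}
    n = len(next(iter(preference_scores.values())))
    totals = [sum(rights[c] * preference_scores[c][i] for c in rights)
              for i in range(n)]
    return max(range(n), key=totals.__getitem__)

# Usage sketch, where `scores` holds one preference value per alternative
# and per country, e.g. scores = {"USA": [...], "China": [...], "Russia": [...]}:
# for w in (0.0, 0.2, 0.4, 0.6):
#     print(w, winner_for_usa_weight(scores, w))
```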

0.5 Conclusion
In this work we illustrated how multi-objective analysis can be used to
investigate multi-agent decision-making problems. We showed that in this way
we can efficiently find weights that target a specific solution when training
a reinforcement learning agent. We believe that such a voting system can be
used to efficiently communicate and make decisions within a multi-agent
system, such as self-driving cars having to coordinate to approximate the best
global outcome. The model we propose is not limited to reinforcement learning
settings and can be used in any decision problem with multiple decision makers
who have different preferences. This could apply to businesses, where
shareholders could have different voting rights, or translate to politics. If
the preferences of each decision maker can be modelled accurately, it would
provide significant opportunities as a democratic system that mathematically
enforces monotonicity by taking the full ranking of each voter into account.

References

[1] R. Rădulescu, P. Mannion, D. M. Roijers, and A. Nowé, "Multi-objective
multi-agent decision making: A utility-based analysis and survey," 2019.

[2] D. Silver, A. Huang, et al., "Mastering the game of Go with deep neural
networks and tree search," Nature, vol. 529, pp. 484–489, 2016.

[3] J.-P. Brans and B. Mareschal, Multiple Criteria Decision Analysis: State
of the Art Surveys, ch. Promethee methods. Springer, 2005.

0.6 Appendix
0.6.1 Set of different alternatives
id goal reached money left 1 money left 2 money left 3 fairness
0 0 1 1 1 1
1 1 0.4 0.3 0.8 0.78
2 1 0.6 0.7 0.2 0.78
3 1 0.6 0.9 0 0.63
4 1 1 0 0.5 0.59
5 1 0.9 0.1 0.5 0.67
6 1 0.9 0.2 0.4 0.71
7 1 0.9 0.4 0.2 0.71
8 1 0.9 0.5 0.1 0.67
9 1 0.9 0.6 0 0.63
10 1 1 0.1 0.4 0.63
11 1 1 0.2 0.3 0.64
12 1 1 0.3 0.2 0.64
13 1 1 0.4 0.1 0.63
14 1 1 0.5 0 0.59
15 1 0.9 0 0.6 0.63
16 1 0.7 0 0.8 0.64
17 1 0.7 0.1 0.7 0.72
18 1 0.7 0.2 0.6 0.78
19 1 0.7 0.3 0.5 0.84
20 1 0.7 0.4 0.4 0.86
21 1 0.7 0.5 0.3 0.84
22 1 0.7 0.6 0.2 0.78
23 1 0.7 0.7 0.1 0.72
24 1 0.7 0.8 0 0.64
25 1 0.1 0.4 1 0.63
26 1 0.8 0 0.7 0.64
27 1 0.8 0.1 0.6 0.71
28 1 0.8 0.2 0.5 0.76
29 1 0.8 0.3 0.4 0.78
30 1 0.8 0.4 0.3 0.78
31 1 0.8 0.5 0.2 0.76
32 1 0.8 0.6 0.1 0.71
33 1 0.8 0.7 0 0.64

34 1 0.1 0.5 0.9 0.67
35 1 0.5 0.5 0.5 1.00
36 1 0 0.5 1 0.59
37 1 0 0.6 0.9 0.63
38 1 0 0.9 0.6 0.63
39 1 0 1 0.5 0.59
40 1 0 0.8 0.7 0.64
41 1 0 0.7 0.8 0.64
42 1 0.9 0.3 0.3 0.72
43 1 0.4 0.7 0.4 0.86
44 1 0.3 1 0.1 0.61
45 1 0.5 0 0.9 0.63
46 1 1 0 0.4 0.59
47 1 0 0.8 0.6 0.66

0.6.2 Additional plots

Figure 10: Visualisation of scores of alternatives for USA.

Figure 11: Visualisation of scores of alternatives for China.

Figure 12: Visualisation of scores of alternatives for Russia.

Figure 13: Evaluation of alternatives for Russia.

Figure 14: Evaluation of alternatives for USA.

Figure 15: Evaluation of alternatives for China.

