
Game Theory, Evolutionary Dynamics, and Multi-Agent Learning

Prof. Nicola Gatti (nicola.gatti@polimi.it)

Game theory
Game theory: basics

Normal form:
• Players
• Actions
• Outcomes
• Utilities
• Strategies
• Solutions

Game theory: basics

Running example: Rock-Paper-Scissors. The game is represented as a matrix in which Player 1 picks a row and Player 2 picks a column; each player's actions are Rock, Paper, and Scissors.

Game theory: basics

Outcomes (rows: Player 1, columns: Player 2):

            Rock            Paper           Scissors
Rock        Tie             Player 2 wins   Player 1 wins
Paper       Player 1 wins   Tie             Player 2 wins
Scissors    Player 2 wins   Player 1 wins   Tie

Game theory: basics

Utilities (Player 1's payoff, Player 2's payoff):

            Rock    Paper   Scissors
Rock        0,0     -1,1    1,-1
Paper       1,-1    0,0     -1,1
Scissors    -1,1    1,-1    0,0

Game theory: basics

Strategies: each player assigns a probability to each action. Player 1 plays Rock, Paper, and Scissors with probabilities σ_1(Rock), σ_1(Paper), σ_1(Scissors); Player 2 plays them with probabilities σ_2(Rock), σ_2(Paper), σ_2(Scissors).

Game theory: basics

A mixed strategy: both players play each action with probability 1/3.

            Rock    Paper   Scissors
Rock        0,0     -1,1    1,-1     1/3
Paper       1,-1    0,0     -1,1     1/3
Scissors    -1,1    1,-1    0,0      1/3
            1/3     1/3     1/3

Nash Equilibrium

A strategy profile $(\sigma_1^*, \sigma_2^*)$ is a Nash equilibrium if and only if:

• $\sigma_1^* \in \arg\max_{\sigma_1} \left\{ \sigma_1^\top U_1\, \sigma_2^* \right\}$
• $\sigma_2^* \in \arg\max_{\sigma_2} \left\{ \sigma_1^{*\top} U_2\, \sigma_2 \right\}$
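As a concrete check, here is a minimal sketch (Python with NumPy; the variable names are illustrative) that verifies the uniform strategy (1/3, 1/3, 1/3) for both players satisfies the two best-response conditions above in Rock-Paper-Scissors.

```python
import numpy as np

# Rock-Paper-Scissors payoff matrices (rows: Player 1's action, columns: Player 2's action).
U1 = np.array([[0, -1, 1],
               [1, 0, -1],
               [-1, 1, 0]], dtype=float)
U2 = -U1  # zero-sum game

sigma1 = np.array([1/3, 1/3, 1/3])
sigma2 = np.array([1/3, 1/3, 1/3])

# Expected utility of each pure action against the opponent's mixed strategy.
u1_per_action = U1 @ sigma2     # Player 1's utility for Rock, Paper, Scissors
u2_per_action = sigma1 @ U2     # Player 2's utility for Rock, Paper, Scissors

# A mixed strategy is a best response iff no pure action does better than the mix.
print(u1_per_action, sigma1 @ U1 @ sigma2)  # [0. 0. 0.] and 0.0 -> sigma1 is a best response
print(u2_per_action, sigma1 @ U2 @ sigma2)  # [0. 0. 0.] and 0.0 -> sigma2 is a best response
```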

Evolutionary dynamics
An evolutionary interpretation

Consider an infinite population of individuals associated with Player 1: 33% of them play Rock, 33% Paper, and 33% Scissors.

An evolutionary interpretation

The opponent is a second population (think, e.g., of two populations of bacteria), associated with Player 2: 50% play Rock, 20% Paper, and 30% Scissors.


An evolutionary interpretation

Player 1's population: 33% Rock, 33% Paper, 33% Scissors
Player 2's population: 50% Rock, 20% Paper, 30% Scissors

            Rock    Paper   Scissors
Rock        0,0     -1,1    1,-1
Paper       1,-1    0,0     -1,1
Scissors    -1,1    1,-1    0,0

At each round, each individual of a population plays against each individual of the opponent's population, weighted by the corresponding probability.

An evolutionary interpretation

Against Player 2's population (Rock 1/2, Paper 1/5, Scissors 3/10), the utility of each of Player 1's types is:

Utility(Rock)     =  0·1/2 − 1·1/5 + 1·3/10 =  0.1
Utility(Paper)    =  1·1/2 + 0·1/5 − 1·3/10 =  0.2
Utility(Scissors) = −1·1/2 + 1·1/5 + 0·3/10 = −0.3

An evolutionary interpretation

Player 1's population: 33% Rock (utility 0.1), 33% Paper (utility 0.2), 33% Scissors (utility −0.3).

The average utility is 1/3·0.1 + 1/3·0.2 + 1/3·(−0.3) = 0.

A positive (utility − average utility) leads to an increase of that type's share of the population; a negative (utility − average utility) leads to a decrease.

An evolutionary interpretation

New population after the replication: 37% Rock, 46% Paper, 17% Scissors (the types with above-average utility grow, Scissors shrinks).


Revision protocol

Question: how do the populations change?

Replicator dynamics:

$$\dot{\sigma}_1(a, t) = \sigma_1(a, t)\Big(e_a^\top U_1\, \sigma_2(t) - \sigma_1(t)^\top U_1\, \sigma_2(t)\Big)$$

where $e_a^\top U_1\, \sigma_2(t)$ is the utility obtained by playing $a$ with probability 1, and $\sigma_1(t)^\top U_1\, \sigma_2(t)$ is the average population utility.

Example (1)

$$\sigma_1(t) = \begin{bmatrix} \sigma_1(R,t) & \sigma_1(P,t) & \sigma_1(S,t) \end{bmatrix}, \qquad
\sigma_2(t) = \begin{bmatrix} \sigma_2(R,t) & \sigma_2(P,t) & \sigma_2(S,t) \end{bmatrix}$$

$$U_1 = \begin{bmatrix} 0 & -1 & 1 \\ 1 & 0 & -1 \\ -1 & 1 & 0 \end{bmatrix}, \qquad
U_2 = \begin{bmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{bmatrix}$$

$$U_1(R,\cdot) = \begin{bmatrix} 0 & -1 & 1 \end{bmatrix}, \quad
U_1(P,\cdot) = \begin{bmatrix} 1 & 0 & -1 \end{bmatrix}, \quad
U_1(S,\cdot) = \begin{bmatrix} -1 & 1 & 0 \end{bmatrix}$$

Example (2)

$$\dot{\sigma}_1(R, t) = \sigma_1(R, t)\Big(U_1(R,\cdot)\,\sigma_2(t) - \sigma_1(t)^\top U_1\, \sigma_2(t)\Big)$$
$$\dot{\sigma}_1(P, t) = \sigma_1(P, t)\Big(U_1(P,\cdot)\,\sigma_2(t) - \sigma_1(t)^\top U_1\, \sigma_2(t)\Big)$$
$$\dot{\sigma}_1(S, t) = \sigma_1(S, t)\Big(U_1(S,\cdot)\,\sigma_2(t) - \sigma_1(t)^\top U_1\, \sigma_2(t)\Big)$$
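A minimal sketch (Python with NumPy; assumptions: a simple Euler discretisation and the populations of the running example) that integrates these equations; at t = 0 it reproduces the per-type utilities 0.1, 0.2, and −0.3 computed earlier.

```python
import numpy as np

U1 = np.array([[0, -1, 1],
               [1, 0, -1],
               [-1, 1, 0]], dtype=float)
U2 = -U1

def replicator_step(own, other, U, dt=0.01):
    """Euler step of sigma_dot(a) = sigma(a) * (e_a' U other - own' U other)."""
    per_action = U @ other            # utility of each pure action
    average = own @ per_action        # average population utility
    return own + dt * own * (per_action - average)

sigma1 = np.array([1/3, 1/3, 1/3])    # Player 1: Rock, Paper, Scissors
sigma2 = np.array([1/2, 1/5, 3/10])   # Player 2: Rock, Paper, Scissors

print(U1 @ sigma2)                    # [ 0.1  0.2 -0.3]: utilities of Rock, Paper, Scissors
for _ in range(1000):
    sigma1, sigma2 = (replicator_step(sigma1, sigma2, U1),
                      replicator_step(sigma2, sigma1, U2.T))  # U2.T: rows index Player 2's actions
print(sigma1, sigma2)                 # populations after 1000 steps
```
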
Evolutionary Stable Strategies (1)

A strategy is an ESS if it is immune to invasion by mutant strategies, given that the mutants initially occupy a small fraction of the population.

Every ESS is an asymptotically stable fixed point of the replicator dynamics.

The ESSs are a subset of the Nash equilibria: while a NE always exists, an ESS may not exist.

Evolutionary Stable Strategies (2)

Given a Nash equilibrium, a small perturbation could make the equilibrium unstable.

            Action 1   Action 2
Action 1    1, 1       0, 0
Action 2    0, 0       0, 0

(Action 2, Action 2) is a Nash equilibrium. Perturb both players so that Action 1 is played with probability ε and Action 2 with probability 1 − ε: against the perturbed opponent, the only best response is Action 1, so the populations drift away from (Action 2, Action 2) and toward (Action 1, Action 1), which is the ESS.

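A small sketch (Python/NumPy, assuming the replicator dynamics of the previous slides and an illustrative ε = 0.05) of this perturbation argument: starting both populations at (ε, 1 − ε), the probability of Action 1 grows toward 1.

```python
import numpy as np

U = np.array([[1.0, 0.0],
              [0.0, 0.0]])       # the same payoff matrix for both players

eps = 0.05
x = np.array([eps, 1 - eps])     # Player 1: P(Action 1), P(Action 2)
y = np.array([eps, 1 - eps])     # Player 2

for _ in range(3000):
    fx, fy = U @ y, U.T @ x                    # per-action utilities
    x = x + 0.05 * x * (fx - x @ fx)           # replicator step for Player 1
    y = y + 0.05 * y * (fy - y @ fy)           # replicator step for Player 2

print(x, y)  # both populations end up (close to) [1, 0]: the (Action 1, Action 1) ESS
```
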
Prisoner's dilemma

            Cooperate   Defect
Cooperate   3,3         0,5
Defect      5,0         1,1


Prisoner's dilemma

[Figure: replicator dynamics for the prisoner's dilemma, plotted over Player 1's and Player 2's cooperate probabilities; the only asymptotically stable point is mutual defection.]

Stag hunt

            Stag    Hare
Stag        4,4     1,3
Hare        3,1     3,3


Stag hunt

[Figure: replicator dynamics for the stag hunt, plotted over Player 1's and Player 2's stag probabilities; both (Stag, Stag) and (Hare, Hare) are asymptotically stable.]

Matching pennies

            Head    Tail
Head        0,1     1,0
Tail        1,0     0,1


Matching pennies

[Figure 4: The replicator dynamics, plotted in the unit simplex, for the prisoner's dilemma (left), the stag hunt (center), and matching pennies (right). For matching pennies the trajectories are plotted over the players' head probabilities.]

Starting point

$$\dot{\sigma}_1(a, t) = \sigma_1(a, t)\Big(e_a^\top U_1\, \sigma_2(t) - \sigma_1(t)^\top U_1\, \sigma_2(t)\Big)$$

If the starting point is 0, i.e. $\sigma_1(a, 0) = 0$, then $\sigma_1(a, t)$ will always be zero.

Having a starting point with zero strategies therefore does not make sense.

Starting point

[Figure: replicator dynamics trajectories for a generic 2×2 game, plotted from different starting points over the players' action probabilities.]

Single-population games

• There is a single population of individuals that is playing against itself.
• An example: bacteria.

Example

A single population with three types; the entry in row i, column j is the payoff obtained by an individual of type i against an individual of type j. The population state is $\sigma = (\sigma(\text{Type 1}), \sigma(\text{Type 2}), \sigma(\text{Type 3}))$.

            Type 1   Type 2   Type 3
Type 1      1        2        0
Type 2      0        1        2
Type 3      2        0        1

Convergence

Theorem. The replicator equation for a matrix game (single-population) satisfies:
1. A stable rest point is a NE.
2. A convergent trajectory in the interior of the strategy space evolves to a NE.
3. A strict NE is locally asymptotically stable.



Multi-population games

• Every player is associated with a different population of individuals.
• Theorem. (p*, q*) is a two-species ESS if and only if (p*, q*) is a strict Nash equilibrium in a neighbourhood, and:
  • it is a locally asymptotically stable rest point of the replicator equation;
  • if it is in the interior, then it is a globally asymptotically stable rest point of the replicator equation.

Known dynamics

• Replicator dynamics
• Logit
• BNN (Brown–von Neumann–Nash)
• Smith

These dynamics differ in their properties.

Multi-agent learning
Markov decision problem
Reinforcement learning
Q-learning (1)

For every state/action pair:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]$$

where $\alpha$ is the learning rate, $r$ the reward, $\gamma$ the discount factor, and $s'$ the next state.
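A minimal sketch of this update in Python (the table is a plain dict; the α and γ values are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)       # Q(s, a), initialised to 0
alpha, gamma = 0.1, 0.9      # learning rate and discount factor (illustrative values)

def q_update(s, a, r, s_next, actions):
    """Tabular Q-learning update for one observed transition (s, a, r, s_next)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```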


Example: normal-form games

Player 1:
• 1 state
• 2 actions (Cooperate, Defect)

            Cooperate   Defect
Cooperate   3,3         0,5
Defect      5,0         1,1

Example: normal-form games

Player 1 always cooperates, Player 2 mostly defects:

$$\sigma_1(a) = \begin{cases} 1.0 & a = \text{Cooperate} \\ 0.0 & a = \text{Defect} \end{cases} \qquad
\sigma_2(a) = \begin{cases} 0.2 & a = \text{Cooperate} \\ 0.8 & a = \text{Defect} \end{cases}$$

With a single state, the Q-learning update reduces to $Q(a) \leftarrow Q(a) + \alpha\,(r - Q(a))$, with $\alpha = 0.2$.

round   Player 2's action   Player 1's Q function
t=0     —                   Q(Cooperate) = 0
t=1     a = Cooperate       Q(Cooperate) = 0.6
t=2     a = Defect          Q(Cooperate) = 0.48
t=3     a = Defect          Q(Cooperate) = 0.384
t=4     a = Defect          Q(Cooperate) = 0.3072
t=5     a = Defect          Q(Cooperate) = 0.24576
t=6     a = Cooperate       Q(Cooperate) = 0.796608

            Cooperate   Defect
Cooperate   3,3         0,5
Defect      5,0         1,1
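A short sketch (Python) that replays the table above: Player 1 always cooperates, so its reward is 3 when Player 2 cooperates and 0 when Player 2 defects, and Q(Cooperate) is updated with α = 0.2.

```python
alpha = 0.2
q_cooperate = 0.0
reward_when_p1_cooperates = {"Cooperate": 3.0, "Defect": 0.0}

# Player 2's observed actions at rounds t = 1..6, as in the table above.
p2_actions = ["Cooperate", "Defect", "Defect", "Defect", "Defect", "Cooperate"]

for t, a2 in enumerate(p2_actions, start=1):
    r = reward_when_p1_cooperates[a2]
    q_cooperate += alpha * (r - q_cooperate)   # stateless update: Q <- Q + alpha * (r - Q)
    print(t, a2, round(q_cooperate, 6))
# prints 0.6, 0.48, 0.384, 0.3072, 0.24576, 0.796608
```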


Q-learning (2)

Softmax (a.k.a. Boltzmann exploration):

$$\sigma_i(a) = \frac{\exp(Q(s, a)/\tau)}{\sum_{a'} \exp(Q(s, a')/\tau)}$$

where $\tau$ is the temperature.

• Every action is played with strictly positive probability.
• The larger the temperature, the smoother the function.
• If the temperature were 0, we would have a best response.

Example: normal-form games

With temperature τ = 1:

Q(Cooperate)   Q(Defect)   σ1(Cooperate)   σ1(Defect)
0              0           0.5             0.5
1              0           0.731           0.269
5              0           0.99331         0.00669
10             0           0.999955        0.000045
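A sketch (Python, standard library only) of the softmax policy with τ = 1, reproducing the rows of the table:

```python
import math

def softmax_policy(q_values, tau=1.0):
    """Boltzmann exploration: action probabilities from Q-values at temperature tau."""
    exps = [math.exp(q / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

for q_cooperate in (0, 1, 5, 10):
    p_c, p_d = softmax_policy([q_cooperate, 0.0])   # Q(Defect) = 0
    print(q_cooperate, round(p_c, 6), round(p_d, 6))
# 0 -> 0.5 / 0.5, 1 -> 0.731059 / 0.268941, 5 -> 0.993307 / 0.006693, 10 -> 0.999955 / 4.5e-05
```
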
Self-play Q-learning dynamics

Self-play learning: both Player 1 and Player 2 run a Q-learning algorithm and repeatedly play the prisoner's dilemma against each other.

            Cooperate   Defect
Cooperate   3,3         0,5
Defect      5,0         1,1


Learning dynamics

Assumptions:
• Time is continuous.
• All the actions can be selected simultaneously.

$$\dot{\sigma}_1(a, t) = \frac{\alpha\,\sigma_1(a, t)}{\tau}\Big(e_a^\top U_1\, \sigma_2(t) - \sigma_1(t)^\top U_1\, \sigma_2(t)\Big) - \alpha\,\sigma_1(a, t)\Big(\log \sigma_1(a, t) - \sum_{a'} \sigma_1(a', t)\log \sigma_1(a', t)\Big)$$

The first term is the exploitation term (a replicator-like term); the second is the exploration term.

When the temperature is 0, the Q-learning dynamics behave as the replicator dynamics.
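A minimal sketch (Python/NumPy) that integrates these dynamics for the prisoner's dilemma under the reconstruction above; the α, τ, step size, and starting point are illustrative, and the code only exercises the equation, it is not the implementation behind the figures that follow.

```python
import numpy as np

# Prisoner's dilemma payoffs (rows and columns: Cooperate, Defect).
U1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])
U2 = U1.T                       # symmetric game: Player 2's payoffs are the transpose

alpha, tau, dt = 0.01, 0.1, 0.1

def q_learning_dynamics_step(own, other, U):
    """Euler step of the Q-learning dynamics: exploitation term plus exploration term."""
    per_action = U @ other
    exploit = (alpha / tau) * own * (per_action - own @ per_action)
    explore = -alpha * own * (np.log(own) - own @ np.log(own))
    return own + dt * (exploit + explore)

x = np.array([0.9, 0.1])        # Player 1: P(Cooperate), P(Defect)
y = np.array([0.9, 0.1])        # Player 2

for _ in range(5000):
    x, y = q_learning_dynamics_step(x, y, U1), q_learning_dynamics_step(y, x, U2.T)

print(x, y)   # both players drift toward defection
```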

Prisoner's dilemma

[Figure: FAQ-learning and lenient FAQ-learning dynamics for the prisoner's dilemma, plotted over the players' cooperate probabilities, with the asymptotically stable point marked.]

Stag hunt

[Figure: FAQ-learning and lenient FAQ-learning dynamics for the stag hunt, plotted over the players' stag probabilities; both (Stag, Stag) and (Hare, Hare) are asymptotically stable.]

Matching pennies

[Figure: FAQ-learning dynamics for matching pennies, plotted over the players' head probabilities.]

Other MAL algorithms' dynamics based on the replicator equation:

• Infinitesimal Gradient Ascent (IGA)
• IGA-Win or Learn Fast
• Weighted Policy Learning
• Cross Learning
• Frequency-Adjusted Q-learning
• Regret Minimization (Polynomial Weights)
