
REINFORCEMENT LEARNING

GameMindsDT
OUR DECISION TRANSFORMER

Pol Fernández
Omar Aguilera
Shuang Long
Edgar Planell
Alex Barrachina

Documentation on GitHub!
Motivation
● We want to succeed at building an agent that
is able to learn and make its own decisions in
a complex environment.

● RL is an interesting field to dive into.

● Transformers are really popular nowadays.

● The fusion of both technologies is an innovative idea,
since it offers another approach to RL.

Framework & Tech Requirements

Decision Transformer
Definition:
● Based on a transformer architecture
● Reduces the RL problem to a sequence
modeling problem
● Uses a GPT-like model to make predictions

Decision Transformer
Implementation
Input embedding
Position embedding
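The two embedding steps above can be sketched as follows. This is a toy illustration, not our implementation: all dimensions and weight matrices are random stand-ins for learned parameters. Each timestep contributes three tokens (return-to-go, state, action), and each token gets its timestep's position embedding added.

```python
import numpy as np

# Hypothetical toy dimensions (not from the slides).
ctx_len, state_dim, act_dim, d_model = 4, 3, 2, 8
rng = np.random.default_rng(0)

# Stand-ins for learned linear input projections, one per modality.
W_rtg = rng.normal(size=(1, d_model))
W_state = rng.normal(size=(state_dim, d_model))
W_act = rng.normal(size=(act_dim, d_model))
# One learned embedding per timestep (shared by the three tokens of that step).
E_time = rng.normal(size=(ctx_len, d_model))

def embed_trajectory(rtg, states, actions, timesteps):
    """Interleave (return-to-go, state, action) tokens, each with its
    timestep embedding added, into one sequence of 3*ctx_len tokens."""
    tok_r = rtg @ W_rtg + E_time[timesteps]        # (ctx_len, d_model)
    tok_s = states @ W_state + E_time[timesteps]
    tok_a = actions @ W_act + E_time[timesteps]
    seq = np.stack([tok_r, tok_s, tok_a], axis=1)  # (ctx_len, 3, d_model)
    return seq.reshape(-1, d_model)                # (3*ctx_len, d_model)

seq = embed_trajectory(
    rtg=rng.normal(size=(ctx_len, 1)),
    states=rng.normal(size=(ctx_len, state_dim)),
    actions=rng.normal(size=(ctx_len, act_dim)),
    timesteps=np.arange(ctx_len),
)
print(seq.shape)  # (12, 8): three tokens per timestep
```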

Decision Transformer
Implementation

               Continuous                         Discrete
Architecture   Values inside range [-1, 1];      _
               Hyperbolic Tangent is used
Training       Mean Squared Error is used        Cross-Entropy is used
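A minimal sketch of the two output heads just described, with random stand-in weights: tanh keeps continuous actions in [-1, 1] and is trained with mean squared error, while the discrete head produces logits trained with cross-entropy.

```python
import numpy as np

# Toy dimensions and random stand-ins for learned parameters.
rng = np.random.default_rng(1)
d_model, act_dim, n_actions = 8, 2, 4
h = rng.normal(size=(d_model,))       # hidden state of the last state token

# Continuous head: linear layer + tanh squashes actions into [-1, 1];
# trained with mean squared error against the dataset action.
W_cont = rng.normal(size=(d_model, act_dim))
a_pred = np.tanh(h @ W_cont)
a_target = np.array([0.5, -0.2])      # invented target action
mse_loss = np.mean((a_pred - a_target) ** 2)

# Discrete head: linear layer produces logits over actions;
# trained with cross-entropy against the dataset action index.
W_disc = rng.normal(size=(d_model, n_actions))
logits = h @ W_disc
log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
ce_loss = -log_probs[1]               # invented target action index 1
```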

Atari environment

Why:
○ Complexity
○ Benchmarking

Implementation:
○ States are images
○ Discrete actions
○ Tanh → LayerNorm
○ Global position encoder

DT-Atari
Atari environment
Results:
○ Learned to play well!
○ Better scores than other offline RL techniques
○ Better with expert data
○ Initial return-to-go set to 5× the maximum return in the dataset
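The return-to-go conditioning in the last bullet can be sketched like this; `model.predict_action`, `env.reset`, and `env.step` are hypothetical names, and the dataset returns are invented toy values.

```python
# Invented toy dataset returns, for illustration only.
dataset_returns = [120.0, 340.0, 95.0]
target_return = 5 * max(dataset_returns)   # initial R-t-g = 5x dataset max

def run_episode(env, model, target_return):
    """Roll out one episode, decrementing the return-to-go as reward
    arrives (hypothetical env/model interfaces)."""
    rtg = target_return
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = model.predict_action(obs, rtg)
        obs, reward, done = env.step(action)
        total += reward
        rtg -= reward                      # condition on remaining return
    return total

print(target_return)  # 1700.0
```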

Atari environment
Future explorations:
○ Better data > lives
○ Split images into patches
○ Position encoding strategies → ALiBi, RoPE
○ Multi-Game-DT

Decision Transformer
applied to

Pybullet Environments-V0

Hopper Walker2D Halfcheetah Ant


Link to GitHub Readme
Pybullet - Datasets
(Sidebar: Expert policy? Continuous space?)

● Random: Datasets sampled with a randomly initialized policy.

● Medium: Datasets sampled with a medium-level policy.

● Mixed: Datasets collected during policy training.
Pybullet - Datasets
(Sidebar: Samples reduced)

● Normalization of episode lengths:

Excluded episodes shorter than the mean duration to avoid introducing noise into training.

● Normalization of the observation space:

Applied standard-score (z-score) normalization, using the mean and standard deviation computed from the data.

● Generation of additional model inputs:

Manually computed the return-to-go and timestep arrays.
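The preprocessing steps above can be sketched on toy data (the rewards and observations below are invented for illustration):

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 3.0])            # one episode's rewards
observations = np.array([[0.1, 2.0], [0.3, 4.0],
                         [0.2, 6.0], [0.4, 8.0]])

# Return-to-go: at each timestep, the sum of all future rewards.
rtg = np.cumsum(rewards[::-1])[::-1]
# -> [6., 5., 5., 3.]

# Standard-score (z-score) normalization of the observation space,
# using the mean/std computed from the dataset.
mean, std = observations.mean(axis=0), observations.std(axis=0)
obs_norm = (observations - mean) / (std + 1e-8)

# Timestep array, the other manually generated model input.
timesteps = np.arange(len(rewards))
```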
Pybullet - Hopper Environment

Root cause?
Normalization of the observation space during inference was missing!

Results: predicted actions were off by a factor of ~1000.

(~ 20 Train/Test runs )
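A sketch of the fix, assuming the training-set statistics were saved during preprocessing (all numbers here are invented): at inference, the raw observation must pass through the same standard-score normalization before being fed to the model.

```python
import numpy as np

# Stand-ins for statistics saved during dataset preprocessing.
obs_mean = np.array([0.25, 5.0])
obs_std = np.array([0.11, 2.23])

def normalize_obs(obs):
    """Apply the same standard-score normalization used during training."""
    return (obs - obs_mean) / (obs_std + 1e-8)

raw_obs = np.array([0.36, 7.23])
model_input = normalize_obs(raw_obs)   # feed this, not raw_obs, to the DT
```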
Pybullet - Walker 2D Environment
Training

Results:

859,26 : Test Run [Attempt 1]

1110,386 : Test Run [Attempt 2]

733,893 : Test Run [Attempt 3]

(~ 20 Train/Test runs )
Pybullet - Halfcheetah Environment
Changes implemented:
● Training with more computationally demanding
hyperparameters (more powerful hardware)
● Global Position Encoding

● Constant Return-to-go (∞ / 0):
the "exploration-exploitation trade-off"

Results: not successful.

(~ 100 Train/Test runs )
Pybullet - Ant Environment
● Environment with the highest dimensionality (↑ difficulty)

● Similar changes implemented as in the Halfcheetah environment
(more powerful hardware)

Training highlights: not successful results.

(~ 40 Train/Test runs )
Pybullet - Conclusions & Future Work
● The Decision Transformer succeeded in 2/4 Pybullet Environments-V0

● Train the Decision Transformer with a larger dataset (expert policy)

● Implement other positional encoding strategies:

✓ Global Position Encoding
✓ Rotary Position Embedding (RoPE)
✓ Attention with Linear Biases (ALiBi)
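As an illustration of one of the strategies listed above, here is a toy numpy sketch of RoPE. This is not our implementation, just the core idea: each pair of embedding dimensions is rotated by a position-dependent angle.

```python
import numpy as np

def rope(x, t):
    """Apply Rotary Position Embedding to one token embedding x at position t."""
    d = x.shape[0]
    theta = 10000.0 ** (-np.arange(d // 2) * 2 / d)  # per-pair frequencies
    ang = t * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    # Rotate each (even, odd) dimension pair by its angle.
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

x = np.ones(4)
print(rope(x, 0))  # position 0 leaves the embedding unchanged
```

Because each pair is rotated rather than shifted, the embedding norm is preserved and relative positions show up in attention dot products.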

● Try other physics environments (MuJoCo/Gymnasium):

✓ More up-to-date documentation (easier troubleshooting)
✓ Fewer deprecated dependencies & more compatibility

● Benchmark Decision Transformer performance vs state-of-the-art RL algorithms
MineRL “Find Cave” - From theory to practice

MineRL - Our DT Agent Finds Caves

[Diagram: VPT encodes episode frames into 1024-dimensional embeddings, paired with actions]
Results
● 50% success is not just luck
● Good explorer
● The other tasks fail

Key insights

● Training is quite simple
● VPT is powerful
● Not enough data for complex tasks
● Enough data for simple tasks

DT-MineRL

Project Milestones: Our Journey

Conclusions

📈 Deeper understanding of DTs, harnessing their potential

💡 Improvements to the original Decision Transformer
🐋 The versatility and importance of Docker

THANK YOU FOR YOUR
ATTENTION

Q&A TIME!