Santiago Paternain, Miguel Calvo-Fullana ESE680-005 Reinforcement Learning 1

ESE680-005 Reinforcement Learning
Santiago Paternain and Miguel Calvo-Fullana

Electrical and Systems Engineering, University of Pennsylvania
{spater,cfullana}@seas.upenn.edu
Thanks to: Juan Andrés Bazerque and José Lezama

Electrical Engineering, Universidad de la República
August 27, 2019

Organization
I Lectures
⇒ Tuesdays and Thursdays 9:00-10:30 at Moore 212
⇒ Miguel Calvo-Fullana and Santiago Paternain
I Office hours:
⇒ Clark Zhang: Monday 5pm-7pm GRASP conference room
clarkz@seas.upenn.edu
⇒ Kate Tolstaya: Wednesday 9am-11am 452C Walnut 3401
eig@seas.upenn.edu
⇒ Arbaaz Khan: On demand
arbaazk@seas.upenn.edu
I Canvas: http://canvas.upenn.edu/courses/1475618
⇒ Piazza

Evaluation
I Homework (50%): 5 homework assignments (10% each), in groups of two students

⇒ Homework 1 : MDPs (Individual)
⇒ Homework 2 : Policy Gradient ⇒ Reinforce
⇒ Homework 3 : Policy Gradient with baselines
⇒ Homework 4 : Q-learning
⇒ Homework 5 : Actor-critic
I In-class Midterm (20%): October 17
⇒ MDPs and Policy Gradient
I Take-home Final (30%): Due on December 5
⇒ Theoretical questions
⇒ Implementation

Textbook
I “Reinforcement Learning: An Introduction”, second edition

⇒ Richard S. Sutton and Andrew G. Barto
⇒ Available Online at
http://webdocs.cs.ualberta.ca/sutton/book/the-book.html
I “Algorithms for Reinforcement Learning”
⇒ Csaba Szepesvári
⇒ Available online at
https://sites.ualberta.ca/~szepesva/RLBook.html

Reinforcement Learning
I Model-free framework to formalize sequential decision making
I At each time step t = 0, 1, . . . the agent is in state S_t ∈ S
I It selects an action A_t ∈ A(S_t), where the set of admissible actions may depend on the state
I As a consequence of the action, the environment produces
⇒ A numerical reward R_t ∈ ℝ
⇒ The transition of the system to a new state S_{t+1}
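The interaction described above can be sketched as a loop. This is a minimal illustration, not part of the course material: the environment `CoinFlipEnv`, its reward rule, and the function `run_episode` are hypothetical names invented here, and the agent simply picks uniformly among the admissible actions.

```python
import random

class CoinFlipEnv:
    """Hypothetical two-state environment, invented for illustration only."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def actions(self, state):
        # A(s): the admissible actions may depend on the state.
        return [0, 1] if state == 0 else [0]

    def step(self, action):
        # Assumed reward rule: 1 when the action matches the current state.
        reward = 1.0 if action == self.state else 0.0
        # The environment then moves to a new state S_{t+1}.
        self.state = self.rng.choice([0, 1])
        return reward, self.state

def run_episode(env, horizon=10, seed=1):
    """Run the agent-environment loop for `horizon` steps."""
    rng = random.Random(seed)
    state = env.state
    trajectory = []
    for t in range(horizon):
        action = rng.choice(env.actions(state))  # A_t in A(S_t)
        reward, next_state = env.step(action)    # R_t and S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

Calling `run_episode(CoinFlipEnv())` returns a list of `(state, action, reward)` triples, one per time step.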

Examples: Cart-Pole
I Goal is to keep the pole in vertical position
I State space S ⊂ ℝ⁴
⇒ Position, velocity, angle, angular velocity
I Action space A ⊂ ℝ
⇒ Horizontal acceleration
I Dynamics of the system
$\ddot{x}\cos\theta + \ell\,\ddot{\theta} = -g\sin\theta$
$(m + m_p)\,\ddot{x} + m_p \ell\,\ddot{\theta}\cos\theta = F + m_p \ell\,\dot{\theta}^2 \sin\theta$
I Can we set up this problem in the RL framework?

I How? Should we?
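One way to set the problem up is to treat the dynamics as the environment's transition rule. The sketch below takes the two equations as written on this slide, solves them as a 2×2 linear system for the accelerations, and advances the state with a forward Euler step. The parameter values (masses, length, time step) and the function names are assumptions made for illustration, and forward Euler is just the simplest integrator, not necessarily what a real simulator would use.

```python
import math

def accelerations(theta, theta_dot, force, m=1.0, m_p=0.1, length=0.5, g=9.81):
    """Solve the two dynamics equations for (x_ddot, theta_ddot).

    The equations are linear in the accelerations:
        cos(theta) * x_ddot + length * theta_ddot               = b1
        (m + m_p) * x_ddot + m_p*length*cos(theta) * theta_ddot = b2
    Parameter values are illustrative assumptions.
    """
    a11, a12 = math.cos(theta), length
    a21, a22 = m + m_p, m_p * length * math.cos(theta)
    b1 = -g * math.sin(theta)
    b2 = force + m_p * length * theta_dot ** 2 * math.sin(theta)
    # det = -length * (m + m_p * sin(theta)^2), nonzero whenever m > 0.
    det = a11 * a22 - a12 * a21
    x_ddot = (b1 * a22 - a12 * b2) / det      # Cramer's rule
    theta_ddot = (a11 * b2 - a21 * b1) / det
    return x_ddot, theta_ddot

def euler_step(state, force, dt=0.01, **params):
    """One forward-Euler step; state = (x, x_dot, theta, theta_dot)."""
    x, x_dot, theta, theta_dot = state
    x_ddot, theta_ddot = accelerations(theta, theta_dot, force, **params)
    return (x + dt * x_dot,
            x_dot + dt * x_ddot,
            theta + dt * theta_dot,
            theta_dot + dt * theta_ddot)
```

In RL terms, `state` is S_t, `force` is the action A_t, and `euler_step` returns S_{t+1}. A reward still has to be chosen by the designer, for example one that penalizes deviation of the angle from the target position.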

Mountain Car Problem
I State space S: position and velocity, Action space A is the force applied
I Goal is to reach the top of the mountain

What Problems Can RL Solve?
I It considers Markov Decision Processes

⇒ Memoryless process ⇒ More in the next class
⇒ Given a state and an action, the system transitions into a new state
⇒ The agent receives a reward
I Previous examples are in this form ⇒ But they have infinite states

What Is Learned?
I In general we want to find a policy π : S → A that maximizes the expected total reward

$\operatorname*{argmax}_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{T} R_t \right]$
[Figure: map from S to A]
I The learning problem is to find such a policy based on the rewards collected
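The expectation in the objective can be approximated by averaging the total reward over sampled episodes. The sketch below does this for two fixed policies on a hypothetical five-state chain with noisy moves. The environment, the slip probability, and the names `rollout` and `estimate_return` are all assumptions invented for illustration.

```python
import random

def rollout(policy, horizon, rng, n_states=5, slip=0.2):
    """Total reward of one episode on a hypothetical 1-D chain.

    The agent starts in state 0 and earns reward 1 at every step it
    ends in the rightmost state. With probability `slip` the move is
    reversed.
    """
    state, total = 0, 0.0
    for t in range(horizon + 1):          # t = 0, ..., T
        action = policy(state)            # -1 (left) or +1 (right)
        if rng.random() < slip:
            action = -action
        state = min(max(state + action, 0), n_states - 1)
        total += 1.0 if state == n_states - 1 else 0.0
    return total

def estimate_return(policy, horizon=20, episodes=2000, seed=0):
    """Monte Carlo estimate of E[sum_{t=0}^{T} R_t] under `policy`."""
    rng = random.Random(seed)
    returns = [rollout(policy, horizon, rng) for _ in range(episodes)]
    return sum(returns) / episodes

def go_right(state):
    return +1

def go_left(state):
    return -1
```

Comparing `estimate_return(go_right)` with `estimate_return(go_left)` ranks the two policies using only collected rewards, which is the kind of information a learning algorithm has access to.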

What Is Learned?
I When the state is continuous it might be difficult to describe such a map
[Figure: map from S to A]
I What can we do?

What Is Learned?
I One possibility is to discretize the state space
[Figure: map from S to A]
I This is not always possible, and then we need to resort to other techniques
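A minimal sketch of uniform discretization follows: each coordinate of the continuous state is clipped to a range and mapped to one of a fixed number of cells. The function names and the bounds used in the usage note are assumptions for illustration, not values from the slides.

```python
def discretize(state, lows, highs, bins):
    """Map a continuous state to a tuple of cell indices (uniform grid)."""
    index = []
    for value, low, high, n in zip(state, lows, highs, bins):
        clipped = min(max(value, low), high)
        cell = int((clipped - low) / (high - low) * n)
        index.append(min(cell, n - 1))   # value == high goes in the last cell
    return tuple(index)

def table_size(bins):
    """Number of cells in the grid: the product of the bins per dimension."""
    size = 1
    for n in bins:
        size *= n
    return size
```

For a two-dimensional state (position, velocity) with assumed bounds `lows=(-1.2, -0.07)` and `highs=(0.6, 0.07)` and 10 bins per dimension, `discretize((0.0, 0.0), lows, highs, (10, 10))` gives the cell `(6, 5)`. The grid has `table_size((10, 10)) = 100` cells, while a four-dimensional state at the same resolution already needs 10,000, which is one reason discretization does not always scale.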

Random Policies
I Gridworld example ⇒ Given a state we choose which direction to move
I Actions are A = {up, right, down, left}
I The reward is −1 if the action would push the agent off the board, in which case the agent does not move
I Transitions are deterministic, i.e., they occur with probability one

I Policies might be random ⇒ mainly for exploration
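The gridworld rules above can be written down directly. In this sketch the board size (4×4) and the reward of 0 for moves that stay on the board are assumptions, since the slide only specifies the −1 penalty. The names `step` and `random_policy` are likewise chosen here for illustration.

```python
import random

# Each action is a (row, column) displacement.
ACTIONS = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}

def step(state, action, size=4):
    """Deterministic transition on a size x size board.

    Returns (next_state, reward). Moving off the board leaves the
    state unchanged and gives reward -1; other moves give reward 0
    (an assumption made for this sketch).
    """
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < size and 0 <= new_col < size:
        return (new_row, new_col), 0.0
    return state, -1.0

def random_policy(state, rng):
    """Uniformly random policy: ignores the state, useful for exploration."""
    return rng.choice(list(ACTIONS))
```

For example, `step((0, 0), "up")` returns `((0, 0), -1.0)` because the move would leave the board, while `step((0, 0), "right")` returns `((0, 1), 0.0)`.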

Success of Reinforcement Learning
I “Playing Atari with Deep Reinforcement Learning”, V. Mnih et al. (2013)
I “Human-level control through deep reinforcement learning”, V. Mnih et al. (2015)¹
I Train a Neural Network
⇒ Inputs: Pixels
⇒ Output: Joystick command
¹ https://www.youtube.com/watch?v=V1eYniJ0Rnk
I AlphaZero defeated the world-champion programs in chess, shogi, and Go, programs that already outplay the best humans

I How does AlphaZero play?

⇒ It learns the “value” of states
⇒ Uses the values to guide a search of possible moves
⇒ Selects the action with the largest value

I AlphaStar: Mastering the Real-Time Strategy Game StarCraft II
I Challenges
⇒ Imperfect information: Only part of the map is observed
⇒ Long-term planning: Actions may not pay off until the end of the game
⇒ Large action space: 10^26 legal actions per step
I One week of training ⇒ Equivalent to 200 years of gameplay
I OpenAI Five: Multi-agent Reinforcement Learning to play Dota 2
I A team of five neural networks
I No explicit communication between the members of the team
I April 2019 ⇒ defeated the world champion team
I Multi-agent settings pose challenges
⇒ State-space grows exponentially
⇒ Reward assignment
⇒ 45000 years of Dota self-play over 10 realtime months
² https://www.youtube.com/watch?v=yEOEqaEgu94
I Helicopter Maneuvers ⇒ Good model is available
I Design of a controller for those maneuvers was hard³
³ https://www.youtube.com/watch?v=VCdxqn0fcnE
I Models are not so good ⇒ Experiments in a controlled environment
I Lots of data is needed for training ⇒ Long training times
I But complex tasks are possible with these techniques⁴
⁴ https://www.youtube.com/watch?v=ZhsEKTo7V04
Not everything that shines is... RL
I Some of the most impressive robotic behaviors are achieved without RL⁵
⁵ https://www.youtube.com/watch?v=hSjKoEva5bg
Characteristics of Reinforcement Learning
I What is different about Reinforcement Learning?
⇒ There is no supervisor, only a reward signal
⇒ Feedback is delayed, not instantaneous
⇒ Time matters (sequential, data is not i.i.d.)
⇒ The agent’s actions affect the subsequent data it receives
I Frontier of Reinforcement Learning
⇒ Trying to apply Reinforcement Learning in robotics
⇒ Autonomous driving for instance
⇒ Fleets deployed: Not only to test but to collect data
⇒ Research in Safe Learning
