ESE680-005 Reinforcement Learning

Santiago Paternain and Miguel Calvo-Fullana


Electrical and Systems Engineering, University of Pennsylvania
{spater,cfullana}@seas.upenn.edu

Thanks to: Juan Andrés Bazerque and José Lezama


Electrical Engineering, Universidad de la República

August 27, 2019

Santiago Paternain, Miguel Calvo-Fullana ESE680-005 Reinforcement Learning 1


Organization

I Lectures
⇒ Tuesdays and Thursdays 9:00-10:30 at Moore 212
⇒ Miguel Calvo-Fullana and Santiago Paternain
I Office hours:
⇒ Clark Zhang: Monday 5pm-7pm GRASP conference room
clarkz@seas.upenn.edu
⇒ Kate Tolstaya: Wednesday 9am-11am 452C Walnut 3401
eig@seas.upenn.edu
⇒ Arbaaz Khan: On demand
arbaazk@seas.upenn.edu
I Canvas: http://canvas.upenn.edu/courses/1475618
⇒ Piazza



Evaluation

I Homework (50%): 5 homework assignments (10% each), in groups of two students


⇒ Homework 1 : MDPs (Individual)
⇒ Homework 2 : Policy Gradient ⇒ REINFORCE
⇒ Homework 3 : Policy Gradient with baselines
⇒ Homework 4 : Q-learning
⇒ Homework 5 : Actor-critic
I In-class Midterm (20%): October 17
⇒ MDPs and Policy Gradient
I Take-home Final (30%): Due on December 5
⇒ Theoretical questions
⇒ Implementation



Textbook

I “Reinforcement Learning: An Introduction”, second edition


⇒ Richard S. Sutton and Andrew G. Barto
⇒ Available online at
http://webdocs.cs.ualberta.ca/sutton/book/the-book.html
I “Algorithms for Reinforcement Learning”
⇒ Csaba Szepesvári
⇒ Available online at
https://sites.ualberta.ca/~szepesva/RLBook.html



Reinforcement Learning

I Model-free framework to formalize sequential decision making

I At each time step t = 0, 1, . . . the agent is in a state S_t ∈ S

I It selects an action A_t ∈ A(S_t), where the set of available actions may depend on the state
I As a consequence of the action, the environment produces
⇒ A numerical reward R_t ∈ ℝ
⇒ A transition of the system to a new state S_{t+1}
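
The interaction described above can be sketched as a simple loop. This is a minimal illustration, not part of the course material: the environment `TwoStateEnv`, its reward rule, and the helper `run_episode` are hypothetical names chosen only to show the state, action, reward, next-state cycle.

```python
class TwoStateEnv:
    """Hypothetical toy environment with states {0, 1} and actions {0, 1}."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Assumed reward rule: 1 if the action matches the current state, else 0.
        reward = 1.0 if action == self.state else 0.0
        # Assumed dynamics: the state flips at every step.
        self.state = 1 - self.state
        return self.state, reward


def run_episode(env, policy, horizon):
    """Roll out `policy` for `horizon` steps; return the list of (S_t, A_t, R_t)."""
    trajectory = []
    state = env.reset()
    for _ in range(horizon):
        action = policy(state)                 # agent selects A_t given S_t
        next_state, reward = env.step(action)  # environment returns S_{t+1}, R_t
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


trajectory = run_episode(TwoStateEnv(), policy=lambda s: s, horizon=5)
total_reward = sum(r for _, _, r in trajectory)
```

Note that the agent only sees states and rewards; it never reads the rules inside `step`, which is the sense in which the framework is model-free.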



Examples: Cart-Pole

I Goal is to keep the pole in vertical position

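I State space S ⊂ ℝ^4
⇒ Position, velocity, angle, angular velocity
I Action space A ⊂ ℝ
⇒ Horizontal acceleration
I Dynamics of the system
$$\ddot{x}\cos\theta + \ddot{\theta}\,\ell = -g\sin\theta$$
$$\ddot{x}\,(m + m_p) + \ddot{\theta}\, m_p\, \ell \cos\theta = F + m_p\, \ell\, \dot{\theta}^2 \sin\theta$$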

I Can we set up this problem in the RL framework?


I How? Should we?
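
One way to see how the cart-pole fits the framework is to turn its dynamics into a state-transition function. The sketch below treats the two dynamics equations as a 2×2 linear system in the accelerations ẍ and θ̈, solves it, and advances the state by one explicit Euler step. The parameter values, the time step, and the choice of Euler integration are assumptions made for illustration only.

```python
import math


def accelerations(theta, theta_dot, force, m=1.0, m_p=0.1, length=0.5, g=9.81):
    """Solve the two dynamics equations for (x_ddot, theta_ddot)."""
    # Coefficients of the linear system A @ [x_ddot, theta_ddot] = b.
    a11, a12 = math.cos(theta), length
    a21, a22 = m + m_p, m_p * length * math.cos(theta)
    b1 = -g * math.sin(theta)
    b2 = force + m_p * length * theta_dot ** 2 * math.sin(theta)
    # det = -length * (m + m_p * sin(theta)^2), which is never zero.
    det = a11 * a22 - a12 * a21
    x_ddot = (b1 * a22 - a12 * b2) / det
    theta_ddot = (a11 * b2 - a21 * b1) / det
    return x_ddot, theta_ddot


def euler_step(state, force, dt=0.02):
    """Map (state, action) to the next state with one explicit Euler step."""
    x, x_dot, theta, theta_dot = state
    x_ddot, theta_ddot = accelerations(theta, theta_dot, force)
    return (x + dt * x_dot,
            x_dot + dt * x_ddot,
            theta + dt * theta_dot,
            theta_dot + dt * theta_ddot)


next_state = euler_step((0.0, 0.0, 0.1, 0.0), force=1.0)
```

With the state (x, ẋ, θ, θ̇) and the force as the action, `euler_step` plays the role of the environment transition; a reward would still have to be designed separately.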



Mountain Car Problem

I State space S: position and velocity; action space A: the force applied
I Goal is to reach the top of the mountain





What Problems Can RL Solve?

I It considers Markov Decision Processes


⇒ Memoryless process ⇒ More in the next class
⇒ Given a state and an action, the system transitions into a new state
⇒ The agent receives a reward

I Previous examples are in this form ⇒ But they have infinite states
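
For a finite problem, the ingredients listed above fit in a small table. The sketch below is a hypothetical two-state example (the state names, actions, probabilities, and rewards are invented for illustration): each (state, action) pair lists its possible outcomes, and sampling the next state uses only the current state and action, which is the memoryless property.

```python
import random

# Hypothetical MDP: P[state][action] is a list of (probability, next_state, reward).
P = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "charge": [(1.0, "high", -1.0)],
    },
    "high": {
        "wait": [(1.0, "high", 0.0)],
        "work": [(0.7, "high", 2.0), (0.3, "low", 2.0)],
    },
}


def sample_transition(state, action, rng):
    """Sample (next_state, reward) given only the current state and action."""
    outcomes = P[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)
next_state, reward = sample_transition("high", "work", rng)
```

A table like this is only possible when the number of states is finite, which is why the continuous examples above need a different representation.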



What is Learned?

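I In general we want to find a policy π : S → A that maximizes the expected total reward

$$\operatorname*{argmax}_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{T} R_t \right]$$

[Figure: the policy π as a map from S to A]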
I The learning problem is to find such a policy based on the rewards collected
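
The objective can be approximated from collected rewards alone. The sketch below estimates the expected total reward of each candidate policy by averaging sampled rollouts and then takes the argmax over a small, fixed set of policies. The walk environment, the slip probability, the horizon, and the two candidate policies are all assumptions made for illustration; practical methods search over far richer policy classes.

```python
import random


def rollout(policy, rng, horizon=10, slip=0.2):
    """Total reward of one episode on a hypothetical 5-cell walk (cells 0..4)."""
    position, total = 2, 0.0
    for _ in range(horizon + 1):          # time steps t = 0, ..., T
        action = policy(position)         # -1 (left) or +1 (right)
        if rng.random() < slip:           # assumed noise: the move is reversed
            action = -action
        position = min(4, max(0, position + action))
        total += 1.0 if position == 4 else 0.0   # assumed reward: 1 at cell 4
    return total


def estimate_return(policy, episodes=2000, seed=0):
    """Monte Carlo estimate of the expected total reward under `policy`."""
    rng = random.Random(seed)
    return sum(rollout(policy, rng) for _ in range(episodes)) / episodes


policies = {
    "always_right": lambda position: +1,
    "always_left": lambda position: -1,
}
estimates = {name: estimate_return(pi) for name, pi in policies.items()}
best = max(estimates, key=estimates.get)   # argmax over the candidate policies
```

The estimator never inspects the slip probability or the reward rule; it only uses the rewards observed along the rollouts.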



What is Learned?

I When the state is continuous it might be difficult to describe such a map

[Figure: map from S to A]

I What can we do?



What is Learned?

I One possibility is to discretize the state-space

[Figure: map from S to A]

I This is not always possible, and then we need to resort to other techniques
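
A minimal sketch of discretization, assuming equal-width bins per state dimension: each continuous coordinate is clipped to a range and mapped to a bin index, so the state becomes a tuple of integers that can index a table. The ranges below are the position and velocity limits commonly used for the mountain car problem; they are an assumption here, as is the number of bins.

```python
def discretize(value, low, high, bins):
    """Index (0..bins-1) of the equal-width bin that contains `value`."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)   # the upper edge falls in the last bin


def discretize_state(state, bounds, bins):
    """Map a continuous state to a tuple of bin indices, one per dimension."""
    return tuple(discretize(v, lo, hi, bins) for v, (lo, hi) in zip(state, bounds))


# Assumed ranges: position in [-1.2, 0.6], velocity in [-0.07, 0.07].
bounds = [(-1.2, 0.6), (-0.07, 0.07)]
cell = discretize_state((-0.5, 0.0), bounds, bins=10)
```

With 10 bins per dimension this two-dimensional state gives 100 cells; the count grows exponentially with the number of dimensions, which is one reason discretization is not always feasible.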



Random Policies

I Gridworld example ⇒ Given a state we choose which direction to move


I Actions are A = {up, right, down, left}
I The reward is −1 if the action would push the agent off the board; in that case the agent does not move

I Transitions are deterministic, i.e., they occur with probability one


I Policies might be random ⇒ for exploration mainly
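
The gridworld rules above can be sketched in a few lines. The board size and the reward of 0 for ordinary moves are assumptions, since only the −1 case is stated; the uniformly random policy is one simple example of a random policy.

```python
import random

ACTIONS = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}
SIZE = 5  # assumed board size


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < SIZE and 0 <= new_col < SIZE:
        return (new_row, new_col), 0.0   # assumed reward for a regular move
    return state, -1.0                   # off the board: no movement, reward -1


def random_policy(state, rng):
    """Pick one of the four actions uniformly at random, ignoring the state."""
    return rng.choice(list(ACTIONS))


rng = random.Random(0)
state = (0, 0)
action = random_policy(state, rng)
state, reward = step(state, action)
```

Because the policy is random, repeated runs from the same state visit different cells, which is what makes it useful for exploration.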



Success of Reinforcement Learning

I “Playing Atari with Deep Reinforcement Learning”, V. Mnih et al. (2013)


I “Human-level control through deep reinforcement learning”, V. Mnih et al. (2015)¹

I Train a Neural Network


⇒ Inputs: Pixels
⇒ Output: Joystick command

¹ https://www.youtube.com/watch?v=V1eYniJ0Rnk

Success of Reinforcement Learning

I AlphaZero defeated the world-champion programs in chess, shogi, and Go, programs that already outplay the best humans



Success of Reinforcement Learning

I How does AlphaZero play?


⇒ It learns the “value” of states
⇒ Uses the values to guide a search of possible moves
⇒ Selects the action with the largest value
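
The last step, picking the action with the largest value, can be illustrated with a one-step lookahead. This is a deliberately simplified sketch and not AlphaZero's actual procedure, which runs a much deeper tree search; the toy game, the `successor` function, and the hand-written `value` function are hypothetical stand-ins for a learned value estimate.

```python
def greedy_action(state, actions, successor, value):
    """Return the action whose successor state has the largest estimated value."""
    return max(actions, key=lambda a: value(successor(state, a)))


# Hypothetical toy game: the state is an integer and a move adds -1, +1 or +2.
actions = [-1, 1, 2]
successor = lambda s, a: s + a
# Stand-in for a learned value function: states closer to 10 are better.
value = lambda s: -abs(10 - s)

choice = greedy_action(7, actions, successor, value)
```

From state 7 the successors are 6, 8 and 9, so the move +2 is selected because it lands closest to 10.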



Success of Reinforcement Learning

I AlphaStar: Mastering the Real-Time Strategy Game StarCraft II

I Challenges
⇒ Imperfect information: Only part of the map is observed
⇒ Long-term planning: Actions may not pay off until the end of the game
⇒ Large action space: approximately 10^26 legal actions per step
I One week of training ⇒ Equivalent to 200 years of gameplay

Success of Reinforcement Learning

I OpenAI Five: Multi-agent Reinforcement Learning to play Dota 2


I It is a team of five artificial neural networks
I No explicit communication between the members of the team
I April 2019 ⇒ defeated the world-champion team
I The multi-agent setting poses challenges
⇒ State-space grows exponentially
⇒ Reward assignment
⇒ 45000 years of Dota self-play over 10 realtime months

² https://www.youtube.com/watch?v=yEOEqaEgu94

Success of Reinforcement Learning

I Helicopter Maneuvers ⇒ Good model is available


I Designing a controller for those maneuvers was hard³

³ https://www.youtube.com/watch?v=VCdxqn0fcnE

Success of Reinforcement Learning

I Models are not so good ⇒ Experiments in controlled environment


I Lots of data is needed for training ⇒ Long training times
I But complex tasks are possible with these techniques⁴

⁴ https://www.youtube.com/watch?v=ZhsEKTo7V04

Not everything that shines is... RL

I Some of the most impressive robotic behaviors are achieved without RL⁵

⁵ https://www.youtube.com/watch?v=hSjKoEva5bg

Characteristics of Reinforcement Learning

I What is different about Reinforcement Learning?


⇒ There is no supervisor, only a reward signal
⇒ Feedback is delayed, not instantaneous
⇒ Time matters (sequential, data is not i.i.d.)
⇒ The agent’s actions affect the subsequent data it receives
I Frontier of Reinforcement Learning
⇒ Trying to apply Reinforcement Learning in robotics
⇒ Autonomous driving for instance
⇒ Fleets deployed: Not only to test but to collect data
⇒ Research in Safe Learning
