
CSD311: Artificial Intelligence

Finite MDP, Q-learning algorithm

▶ Q-learning is an algorithm to learn good/optimal actions in states when the state space S is finite.
▶ Q : S × A → R is a table that is updated in an iterative manner. The algorithm is model free, that is, it does not have to learn a state-change model. But it will not work for infinite state spaces.
▶ The update is done using Bellman's equation (see qlearning.ipynb for an example):

  Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]

  where α is the learning rate, r the immediate reward, and s′ the next state.
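A minimal Python sketch of the tabular update above. The environment interface (env.reset(), env.step(), env.actions()) and the hyperparameters are illustrative assumptions, not taken from qlearning.ipynb.

    # Tabular Q-learning sketch; `env` is a hypothetical finite-MDP environment.
    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                      # Q[(s, a)] -> current estimate
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy choice over the finite action set of s
                if random.random() < epsilon:
                    a = random.choice(env.actions(s))
                else:
                    a = max(env.actions(s), key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(s, a)
                # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
                best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q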
Monte Carlo sampling I

▶ In many RL problems Q-learning cannot be used directly due to the size of the table.
▶ One way to address this is to sample sequences of actions using MC sampling as per a policy p that uses the current values of the expected reward E_p[R_t | s_t, a_t] for (s_t, a_t) pairs, similar to Q-learning.
▶ A policy that is often used is the ε-greedy policy that was discussed for MAB but generalized to handle states.
▶ A possible implementation for an adversarial game is given below. The same approach can be used when there is a single agent (e.g. a video game).
▶ The (s, a) values can be initialized in some way; one possibility is random values in some interval.
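A possible sketch of the ε-greedy MC sampling described above, in Python. The game-specific helpers (legal_actions, next_state, is_terminal) and the initialization interval are assumptions, not from the slides.

    # Sampling one sequence of actions (a rollout) under an epsilon-greedy policy
    # built from the current estimates of E_p[R | s, a].
    import random

    def init_value():
        # one possible initialization: a random value in a small interval
        return random.uniform(-0.01, 0.01)

    def epsilon_greedy(values, s, actions, epsilon):
        # values[(s, a)] holds the current estimate of E_p[R | s, a]
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: values.get((s, a), 0.0))

    def sample_rollout(values, s0, legal_actions, next_state, is_terminal,
                       epsilon, max_moves=200):
        # returns the list of (s, a) pairs visited in one sampled rollout
        trajectory, s = [], s0
        for _ in range(max_moves):
            if is_terminal(s):
                break
            a = epsilon_greedy(values, s, legal_actions(s), epsilon)
            trajectory.append((s, a))
            s = next_state(s, a)
        return trajectory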
Monte Carlo sampling II

▶ The current (s, a) value is updated after a rollout of r moves by γ^(r−1) if there is a win, by −γ^(r−1) if there is a loss, and is left unchanged if there is a draw (no loss or win). The E_p[R | s, a] value is the unnormalized (s, a) value divided by the number of times it has been updated. The method is similar to how values were updated in MCTS.
▶ Assuming ε-greedy is being used, the algorithm chooses a random move with probability ε and the move with the highest expected value with probability (1 − ε).
▶ Often the ε value is changed using a decay function and a threshold. The decay function is an exponential in the number of rollouts. So, as rollouts increase and the E_p[R | s, a] values stabilize, the ε value reaches a low threshold level so that the program is exploiting most of the time.
▶ In play mode the ε value is set to 0 so that the agent plays the move with the highest reward for each pair (s, a).
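A sketch of the rollout update and the ε schedule just described. It assumes every rollout through an (s, a) pair counts as an update (a draw contributes 0); names and constants are illustrative.

    import math
    from collections import defaultdict

    totals = defaultdict(float)   # unnormalized (s, a) values
    counts = defaultdict(int)     # number of updates per (s, a)

    def update_after_rollout(s, a, outcome, r, gamma=0.95):
        # outcome: +1 win, -1 loss, 0 draw; r: number of moves in the rollout
        if outcome != 0:
            totals[(s, a)] += outcome * gamma ** (r - 1)
        counts[(s, a)] += 1          # assumption: a draw also counts as an update

    def expected_reward(s, a):
        # E_p[R | s, a]: unnormalized value divided by the number of updates
        return totals[(s, a)] / counts[(s, a)] if counts[(s, a)] else 0.0

    def epsilon_at(n_rollouts, eps0=1.0, decay=1e-4, eps_min=0.05):
        # exponential decay in the number of rollouts, clipped at a low threshold;
        # in play mode epsilon is simply set to 0
        return max(eps_min, eps0 * math.exp(-decay * n_rollouts))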
Monte Carlo sampling III

▶ When the number of (s, a) pairs is very large the above is not feasible and the only option is to use a function approximator that learns to predict E_p[R | s_t, a_t] for (s, a) pairs.
▶ In a particular state s the outcome for each action a_i after a rollout using the ε-greedy policy is used to estimate the value R_p[s, a_i] for each action, and the (s, a_i) pairs are fed to a function approximator, typically a neural network (often a convolutional neural network). The neural network is trained with the (s, a) and R_p[s, a] data.
▶ In most adversarial games two artificial agents play each other a large number of times and the generated data is continuously used to train the neural network as play progresses.
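A minimal sketch of the function-approximator variant, assuming PyTorch and a fixed-size numeric encoding of (s, a) pairs; the encoding step itself is not shown. A plain fully connected network is used here for brevity, whereas for board-like states a convolutional network would be typical, as noted above.

    import torch
    import torch.nn as nn

    class ValueNet(nn.Module):
        # predicts the estimated expected reward for an encoded (s, a) pair
        def __init__(self, in_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    def train_step(model, optimizer, x_batch, y_batch):
        # x_batch: encoded (s, a) pairs; y_batch: rollout estimates of E_p[R | s, a]
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
        return loss.item()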
