Unit 5
s[t] denotes the current state of the agent and s[t+1] denotes
the next state.
Markov Property
It says that "If the agent is in the current state s1,
performs an action a1, and moves to the state s2, then the
state transition from s1 to s2 depends only on the current
state s1 and the action a1; it does not depend on past
actions, rewards, or states."
Or, in other words, as per the Markov Property, the current
state transition does not depend on any past action or
state.
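Formally, the Markov Property is a condition on the transition
probabilities (standard notation, using the same s[t] convention as
above):

P(s[t+1] | s[t], a[t], s[t-1], a[t-1], ..., s[0]) = P(s[t+1] | s[t], a[t])

That is, conditioning on the whole history gives the same next-state
distribution as conditioning on the current state and action alone.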
Markov Decision Process
Q-Learning Algorithm
Q-learning is a popular model-free reinforcement learning
algorithm based on the Bellman equation.
Model-free means that the agent does not build a model of the
environment's dynamics to predict its response.
Instead of relying on such a model, it learns from trial and
error, using the rewards it observes.
The main objective of Q-learning is to learn a policy that
tells the agent which action to take, under which
circumstances, to maximize the reward.
The goal of the agent in Q-learning is to maximize the value of Q.
Q stands for quality in Q-learning: the Q-value specifies the
quality of an action taken by the agent in a given state.
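For intuition, the Q-values can be stored in a simple lookup table
keyed by (state, action). A minimal sketch in Python, with
hypothetical states and actions chosen only to show the structure:

# Minimal Q-table sketch: a dictionary from (state, action) to Q-value.
# The states "s1", "s2" and actions "left", "right" are hypothetical.
Q = {
    ("s1", "left"): 0.0,
    ("s1", "right"): 0.5,
    ("s2", "left"): 0.2,
    ("s2", "right"): 0.0,
}

# In state "s1" the agent picks the action with the highest Q-value.
state = "s1"
best_action = max(["left", "right"], key=lambda a: Q[(state, a)])
print(best_action)  # prints: right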
Q-Learning algorithm
The Q-Learning algorithm works like this:
1. Initialize all Q-values, e.g., with zeros.
2. Choose an action a in the current state s based on the current best Q-value.
3. Perform this action a and observe the outcome (new state s').
4. Measure the reward R after this action.
5. Update Q with an update formula called the Bellman equation (shown below).
6. Repeat steps 2 to 5 until the learning no longer improves.
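The update in step 5 is the standard Q-learning form of the Bellman
equation, where alpha is the learning rate and gamma the discount
factor:

Q(s, a) <- Q(s, a) + alpha * [ R + gamma * max over a' of Q(s', a') - Q(s, a) ]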
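The same loop as a minimal runnable sketch in Python. The env object
with an actions list and reset()/step() methods is a hypothetical
stand-in for whatever environment is used; epsilon-greedy exploration
is added so step 2 still occasionally tries non-best actions:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Step 1: initialize all Q-values to zero (defaultdict returns 0.0).
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Step 2: choose an action based on the current best Q-value,
            # with epsilon-greedy exploration mixed in.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            # Steps 3 and 4: perform the action, observe new state and reward.
            next_state, reward, done = env.step(action)
            # Step 5: Bellman update toward R + gamma * max_a' Q(s', a').
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q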
Q-Learning Algorithm
An example application of Q-learning is an advertisement
recommendation system.
Deep Q-Learning
The Q-Learning approach is practical only for very small
environments and quickly loses its feasibility as the
number of states and actions in the environment increases.
This limitation leads us to Deep Q-Learning, which uses a
deep neural network to approximate the Q-values.
Deep Q-Learning uses the Q-learning idea and takes it one
step further.
Instead of using a Q-table, we use a Neural Network that
takes a state and approximates the Q-values for each action
based on that state.
The basic working of Deep Q-Learning is that the current
state is fed into the neural network, which returns the
Q-values of all possible actions as output.
The difference between Q-Learning and Deep Q-Learning can be
summed up as follows: Q-Learning looks Q-values up in a table,
while Deep Q-Learning approximates them with a neural network.
Deep Q-Learning
We do this because using a classic Q-table is not very scalable.
It might work for simple games, but in a more complex game
with dozens of possible actions and game states, the Q-table
quickly becomes too large to store and update efficiently.
So now we use a deep neural network that gets the state
as input and produces a Q-value for each action.
Then, again, we choose the action with the highest Q-value.
The learning process still uses the same iterative update
approach, but instead of updating the Q-table, we update
the weights of the neural network so that its outputs improve.
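A minimal sketch of this substitution, assuming PyTorch (the slides
do not name a framework, so the library choice, layer sizes, and
state/action dimensions here are illustrative assumptions):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Hypothetical sizes, chosen for illustration only.
state_dim, n_actions = 4, 2
q_net = QNetwork(state_dim, n_actions)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Acting: pick the action with the highest predicted Q-value.
state = torch.randn(1, state_dim)  # stand-in for a real observation
action = q_net(state).argmax(dim=1)

# Learning: one gradient step toward the Bellman target
# R + gamma * max_a' Q(s', a').
next_state = torch.randn(1, state_dim)
reward = torch.tensor([1.0])
target = reward + gamma * q_net(next_state).max(dim=1).values.detach()
prediction = q_net(state).gather(1, action.view(-1, 1)).squeeze(1)
loss = nn.functional.mse_loss(prediction, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()

A full DQN would add an experience replay buffer and a separate
target network for stability; the sketch above shows only the core
idea of replacing the Q-table with a trainable network.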