

A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python
Ankit Choudhary — Updated On August 21st, 2023

Introduction
I have always been fascinated with games. The seemingly infinite options available to
perform an action under a tight timeline – it's a thrilling experience. There's nothing quite like
it.

So when I read about the incredible algorithms DeepMind was coming up with (like AlphaGo
and AlphaStar), I was hooked. I wanted to learn how to make these systems on my own
machine. And that led me into the world of deep reinforcement learning (Deep RL).

Deep RL is relevant even if you're not into gaming. Just check out the sheer variety of
functions currently using Deep RL for research.

What about industry-ready applications? Well, here are two of the most commonly cited
Deep RL use cases:

Google's Cloud AutoML
Facebook's Horizon Platform
The scope of Deep RL is IMMENSE. This is a great time to enter the field and make a
career out of it.


In this article, I aim to help you take your first steps into the world of deep reinforcement
learning. We'll use one of the most popular algorithms in RL, deep Q-learning, to understand
how deep RL works. And the icing on the cake? We will implement all our learning in an
awesome case study using Python.


Table of contents
Introduction
The Road to Q-Learning
RL Agent-Environment
Markov Decision Process (MDP)
Q Learning
What is Deep Q-Learning?
Why ‘Deep’ Q-Learning?
Deep Q-Networks
Challenges in Deep RL as Compared to Deep Learning
1. Target Network
2. Experience Replay
Putting it all Together
Implementing Deep Q-Learning in Python using Keras & OpenAI Gym
Frequently Asked Questions
End Notes

The Road to Q-Learning


There are certain concepts you should be aware of before wading into the depths of deep
reinforcement learning. Don’t worry, I’ve got you covered.

I have previously written various articles on the nuts and bolts of reinforcement learning to
introduce concepts like multi-armed bandit, dynamic programming, Monte Carlo learning
and temporal differencing. I recommend going through these guides in the below sequence:

Nuts & Bolts of Reinforcement Learning: Model-Based Planning using Dynamic Programming
Reinforcement Learning Guide: Solving the Multi-Armed Bandit Problem from Scratch in Python
Reinforcement Learning: Introduction to Monte Carlo Learning using the OpenAI Gym Toolkit
Introduction to Monte Carlo Tree Search: The Game-Changing Algorithm behind DeepMind's AlphaGo
Nuts and Bolts of Reinforcement Learning: Introduction to Temporal Difference (TD) Learning
These articles are good enough for getting a detailed overview of basic RL from the
beginning.

However, note that the articles linked above are in no way prerequisites for understanding
Deep Q-Learning. We will do a quick recap of the basic RL concepts before exploring what
deep Q-learning is and its implementation details.

RL Agent-Environment
A reinforcement learning task is about training an agent which interacts with its
environment. The agent arrives at different scenarios, known as states, by performing
actions. Actions lead to rewards, which can be positive or negative.

The agent has only one purpose here – to maximize its total reward across an episode. An
episode is everything that happens between the first state and the last or
terminal state within the environment. We reinforce the agent to learn to perform the best
actions by experience. This learned behaviour is the strategy, or policy.

Let’s take an example of the ultra-popular PubG game:

The soldier is the agent here, interacting with the environment
The states are exactly what we see on the screen
An episode is a complete game
The actions are moving forward, backward, left, right, jump, duck, shoot, etc.
Rewards are defined on the basis of the outcome of these actions. If the soldier is able
to kill an enemy, that calls for a positive reward, while getting shot by an enemy is a
negative reward
Now, in order to kill that enemy or get a positive reward, there is a sequence of actions
required. This is where the concept of delayed or postponed reward comes into play. The
crux of RL is learning to perform these sequences and maximizing the reward.


Markov Decision Process (MDP)


An important point to note – each state within an environment is a consequence of its
previous state, which in turn is the result of the state before it. However, storing all this
information, even for environments with short episodes, quickly becomes infeasible.

To resolve this, we assume that each state follows the Markov property, i.e., each state depends
solely on the previous state and the transition from that state to the current state. Picture a
maze in which two scenarios with two different starting points lead the agent, along different
paths, to the same penultimate state: a red square just before the exit.

It doesn't matter what path the agent takes to reach the red state. The next step to exit the
maze and reach the last state is to go right. Clearly, we only needed the information on the
red/penultimate state to find the next best action, which is exactly what the Markov property
implies.

Q Learning
Let's say we know the expected reward of each action at every step. This would essentially be
like a cheat sheet for the agent! Our agent will know exactly which action to perform.

It will perform the sequence of actions that will eventually generate the maximum total
reward. This total reward is also called the Q-value, and we will formalise our strategy as:

Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')

The above equation states that the Q-value yielded from being at state s and performing
action a is the immediate reward r(s,a) plus the highest Q-value possible from the next state
s’. Gamma here is the discount factor which controls the contribution of rewards further in
the future.

Q(s',a) again depends on Q(s'',a), which will then have a coefficient of gamma squared. So, the
Q-value depends on Q-values of future states, as shown here:

Q(s,a) = r(s,a) + gamma * [r(s',a) + gamma * Q(s'',a)] = r(s,a) + gamma * r(s',a) + gamma^2 * Q(s'',a) = ...

Adjusting the value of gamma will diminish or increase the contribution of future rewards.

Since this is a recursive equation, we can start by making arbitrary assumptions for all Q-
values. With experience, it will converge to the optimal policy. In practical situations, this is
implemented as an update:

Q(s,a) ← Q(s,a) + alpha * [r(s,a) + gamma * max_a' Q(s',a') – Q(s,a)]


where alpha is the learning rate or step size. This simply determines to what extent newly
acquired information overrides old information.
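To make the update concrete, here is a minimal tabular Q-learning sketch in Python. The 16-state, 4-action sizes are arbitrary placeholders, and only the update itself is shown, not a full training loop:

import numpy as np

# A minimal sketch of the tabular Q-learning update described above.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))  # arbitrary initial Q-values
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(s, a, r, s_next):
    # Target: r(s,a) + gamma * max_a' Q(s',a')
    td_target = r + gamma * np.max(Q[s_next])
    # Move Q(s,a) a step of size alpha towards the target
    Q[s, a] += alpha * (td_target - Q[s, a])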

What is Deep Q-Learning?


Deep Q-Learning is a reinforcement learning technique that combines Q-Learning, an
algorithm for learning optimal actions in an environment, with deep neural networks. It aims
to enable agents to learn optimal actions in complex, high-dimensional environments. By
using a neural network to approximate the Q-function, which estimates the expected
cumulative reward for each action in a given state, Deep Q-Learning can handle
environments with large state spaces. The network is updated iteratively through episodes,
using a combination of exploration and exploitation strategies. However, care must be taken
to mitigate instability caused by non-stationarity and divergence issues, typically addressed
by experience replay and target networks. Deep Q-Learning has proven effective in training
agents for various tasks, including video games and robotic control.

Why ‘Deep’ Q-Learning?


Q-learning is a simple yet quite powerful algorithm to create a cheat sheet for our agent. This
helps the agent figure out exactly which action to perform.

But what if this cheat sheet is too long? Imagine an environment with 10,000 states and 1,000
actions per state. This would create a table of 10 million cells. Things will quickly get out of
control!
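For scale: storing that table as 32-bit floats would already take 10,000 * 1,000 * 4 bytes ≈ 40 MB for a single array – and realistic state spaces (for example, raw game frames) are astronomically larger than 10,000 states.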

It is pretty clear that we can’t infer the Q-value of new states from already explored states.
This presents two problems:

First, the amount of memory required to save and update that table would increase as
the number of states increases
Second, the amount of time required to explore each state to create the required Q-
table would be unrealistic
Here’s a thought – what if we approximate these Q-values with machine learning models
such as a neural network? Well, this was the idea behind DeepMind’s algorithm that led to its
acquisition by Google for 500 million dollars!

Deep Q-Networks
In deep Q-learning, we use a neural network to approximate the Q-value function. The state
is given as the input and the Q-values of all possible actions are generated as the output.
Contrast this with tabular Q-learning, where we look up a single stored cell per state-action pair.


So, what are the steps involved in reinforcement learning using deep Q-networks
(DQNs)?

1. All the past experience is stored by the user in memory
2. The next action is determined by the maximum output of the Q-network
3. The loss function here is the mean squared error of the predicted Q-value and the target
Q-value – Q*. This is basically a regression problem. However, we do not know the
target or actual value here, as we are dealing with a reinforcement learning problem.
Going back to the Q-value update equation derived from the Bellman equation, we
have:

Q(s,a) ← r + gamma * max_a' Q(s',a')

The target here is the term r + gamma * max_a' Q(s',a'). We can argue that the network is
predicting its own value, but since r is the unbiased true reward, the network is going to
update its gradient using backpropagation to finally converge.
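To make the regression step concrete, here is a minimal sketch, assuming q_network is a Keras model that returns one Q-value per action and that the batch arrives as NumPy arrays (all names here are illustrative, not from a specific library):

import numpy as np

# A sketch of the DQN loss described above. Note that the same network
# produces both the prediction and the target here – the instability this
# causes is exactly the challenge discussed in the next section.
def dqn_loss(q_network, states, actions, rewards, next_states, dones, gamma=0.99):
    # Target: r + gamma * max_a' Q(s',a'); terminal transitions (done=1)
    # contribute only the immediate reward
    next_q = q_network.predict(next_states).max(axis=1)
    targets = rewards + gamma * next_q * (1.0 - dones)
    # Prediction: the Q-value of the action actually taken in each state
    predicted = q_network.predict(states)[np.arange(len(actions)), actions]
    return np.mean((targets - predicted) ** 2)  # mean squared error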

Challenges in Deep RL as Compared to Deep Learning
So far, this all looks great. We understood how neural networks can help the agent learn the
best actions. However, there is a challenge when we compare deep RL to deep learning (DL):

Non-stationary or unstable target: Recall the update we derived above – the target
r + gamma * max_a' Q(s',a') is computed by the very same network whose weights we are
updating.

So the target is continuously changing with each iteration. In deep learning, the target
variable does not change and hence the training is stable – which is just not true for RL.

To summarise, in reinforcement learning we often depend on the policy or value functions to
sample actions. However, these are frequently changing as we continuously learn what to
explore. As we play out the game, we get to know more about the ground-truth values of
states and actions and hence, the output keeps changing as well.

So, we are trying to learn a mapping for a constantly changing input and output. But then
what is the solution?

1. Target Network
Since the same network calculates both the predicted value and the target value, there could
be a lot of divergence between the two. So, instead of using one neural network for
learning, we can use two.

We could use a separate network to estimate the target. This target network has the same
architecture as the function approximator but with frozen parameters. For every C iterations
(a hyperparameter), the parameters from the prediction network are copied to the target
network. This leads to more stable training because it keeps the target function fixed (for a
while).
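As a minimal sketch of this trick in Keras, assuming q_network is the model being trained (train_one_step and num_training_steps are hypothetical placeholders):

from keras.models import clone_model

# Build a copy of the prediction network to serve as the frozen target
target_network = clone_model(q_network)
target_network.set_weights(q_network.get_weights())  # start in sync

C = 1000  # hyperparameter: iterations between synchronisations
for step in range(num_training_steps):
    # One gradient step on q_network, computing targets with target_network
    train_one_step(q_network, target_network)
    if step % C == 0:
        # Copy the prediction network's parameters into the target network
        target_network.set_weights(q_network.get_weights())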


2. Experience Replay

“To perform experience replay, we store the agent’s experiences – e_t = (s_t, a_t, r_t, s_t+1)”

What does the above statement mean? Instead of running Q-learning on state/action pairs as
they occur during simulation or actual experience, the system stores the data discovered –
[state, action, reward, next_state] – in a large table.

Let’s understand this using an example.

Suppose we are trying to build a video game bot where each frame of the game represents a
different state. During training, we could sample a random batch of 64 frames from the last
100,000 frames to train our network. This would get us a subset within which the correlation
amongst the samples is low and will also provide better sampling efficiency.
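Here is a minimal sketch of such a buffer in plain Python; the capacity and batch size mirror the numbers in the example above:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        # Oldest transitions are discarded automatically once full
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the correlation between
        # consecutive frames, as described above
        return random.sample(self.buffer, batch_size)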

Putting it all Together


The concepts we have learned so far? They all combine to make the deep Q-learning
algorithm that was used to achieve human-level performance in Atari games (using just
the video frames of the game).


I have listed the steps involved in a deep Q-network (DQN) below:

1. Preprocess and feed the game screen (state s) to our DQN, which will return the Q-
values of all possible actions in the state
2. Select an action using the epsilon-greedy policy. With probability epsilon, we select
a random action a, and with probability 1-epsilon, we select the action that has the
maximum Q-value, i.e., a = argmax(Q(s,a,w))
3. Perform this action in state s and move to a new state s' to receive a reward r. This
state s' is the preprocessed image of the next game screen. We store this transition in
our replay buffer as <s,a,r,s'>
4. Next, sample some random batches of transitions from the replay buffer and calculate
the loss
5. The loss is:

Loss = (r + gamma * max_a' Q(s',a'; w_target) – Q(s,a; w))^2

which is just the squared difference between the target Q and the predicted Q (here w are
the prediction network's weights and w_target the frozen target network's weights)


6. Perform gradient descent with respect to our actual network parameters in order to
minimize this loss
7. After every C iterations, copy our actual network weights to the target network
weights
8. Repeat these steps for M episodes (a condensed sketch of this whole loop follows below)
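In code, the loop could look like this minimal sketch (env follows the classic Gym API; preprocess, train_on_batch_fn, q_network, target_network, replay_buffer, epsilon, M and C are illustrative placeholders rather than a specific library's API):

import random
import numpy as np

total_steps = 0
for episode in range(M):                                     # step 8
    state = preprocess(env.reset())                          # step 1
    done = False
    while not done:
        if random.random() < epsilon:                        # step 2: explore
            action = env.action_space.sample()
        else:                                                # step 2: exploit
            action = np.argmax(q_network.predict(state[None])[0])
        next_state, reward, done, info = env.step(action)    # step 3
        next_state = preprocess(next_state)
        replay_buffer.store(state, action, reward, next_state)
        batch = replay_buffer.sample(64)                     # step 4
        train_on_batch_fn(q_network, target_network, batch)  # steps 5-6
        if total_steps % C == 0:                             # step 7
            target_network.set_weights(q_network.get_weights())
        state = next_state
        total_steps += 1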

Implementing Deep Q-Learning in Python using Keras & OpenAI Gym
Alright, so we have a solid grasp on the theoretical aspects of deep Q-learning. How about
seeing it in action now? That's right – let's fire up our Python notebooks!

We will make an agent that can play a game called CartPole. We could also use an Atari game,
but training an agent to play one takes a while (from a few hours to a day). The idea behind
our approach will remain the same, so you can try this on an Atari game on your machine.


CartPole is one of the simplest environments in the OpenAI Gym (a game simulator). The
goal of CartPole is to balance a pole that's connected by a single joint to the top of a moving
cart.

Instead of pixel information, the state consists of four values, such as the angle of the pole
and the position of the cart. The agent can move the cart by performing an action of 0 or 1
at each step, pushing the cart left or right.
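You can inspect this interface directly; a quick sketch using Gym's classic API:

import gym

env = gym.make('CartPole-v0')
state = env.reset()
print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2: push the cart left (0) or right (1)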

We will use the keras-rl library here which lets us implement deep Q-learning out of the box.

Step 1: Install keras-rl library


From the terminal, run the following code block:

git clone https://github.com/matthiasplappert/keras-rl.git


cd keras-rl
python setup.py install

Step 2: Install dependencies for the CartPole environment


Assuming you have pip installed, you need to install the following libraries:

pip install h5py


pip install gym

Step 3: Let’s get started!


First, we have to import the necessary modules:

import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Then, set the relevant variables:


ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions available in the CartPole problem
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

Next, we will build a very simple single hidden layer neural network model:

model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))  # flatten the (window, observation) input
model.add(Dense(16))                  # single hidden layer with 16 units
model.add(Activation('relu'))
model.add(Dense(nb_actions))          # one linear output per action: the Q-values
model.add(Activation('linear'))
print(model.summary())

Now, configure and compile our agent. We will set our policy as Epsilon Greedy and our
memory as Sequential Memory, because we want to store the results of the actions we
performed and the rewards we got for each action.

policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# Okay, now it's time to learn something! We visualize the training here for
# show, but this slows down training quite a lot.
dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)

Test our reinforcement learning model:

dqn.test(env, nb_episodes=5, visualize=True)

With visualize=True, you should see the agent balancing the pole in the render window.

Not bad! Congratulations on building your very first deep Q-learning model.
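If you would rather see a number than eyeball the render, the returned history object holds the per-episode rewards (a sketch assuming keras-rl's usual test output):

scores = dqn.test(env, nb_episodes=5, visualize=False)
print(np.mean(scores.history['episode_reward']))  # average reward per test episode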

Frequently Asked Questions


Q1. What is the difference between deep Q-learning and regular Q-learning?
A. The key difference between deep Q-learning and regular Q-learning lies in their
approaches to function approximation. Regular Q-learning uses a table to store Q-values for
each state-action pair, making it suitable for discrete state and action spaces. In contrast,
deep Q-learning employs a deep neural network to approximate Q-values, enabling it to
handle continuous and high-dimensional state spaces. While regular Q-learning guarantees
convergence, deep Q-learning's convergence is less assured due to non-stationarity issues
caused by updates to the neural network during learning. Techniques like experience replay
and target networks are used to stabilize deep Q-learning training.
Q2. What are the limitations of a deep Q-network?
A. Deep Q-Networks (DQNs) come with several limitations. They can suffer from instability
during training due to the non-stationarity problem caused by frequent updates of the neural
network. Additionally, DQNs might overestimate Q-values, impacting the learning process.
They struggle with handling continuous action spaces and can be computationally expensive,
requiring significant training time and resources. Exploration in high-dimensional state
spaces can be challenging, leading to suboptimal policies. Lastly, tuning hyperparameters for
DQNs can be intricate and sensitive, affecting convergence and overall performance. Despite
these limitations, techniques like Double Q-learning and prioritized experience replay aim to
address some of these challenges.

End Notes
OpenAI Gym provides several environments for using DQN on Atari games. Those who have
worked with computer vision problems might intuitively understand this: since the inputs
are raw frames of the game at each time step, the model uses a convolutional neural
network based architecture.

There are some more advanced Deep RL techniques, such as Double DQN, Dueling DQN
and Prioritized Experience Replay, which can further improve the learning process. These
techniques can achieve better scores using even fewer episodes. I will be covering these
concepts in future articles.

I encourage you to try the DQN algorithm on at least one environment other than CartPole,
to practice and understand how you can tune the model to get the best results.


About the Author


Ankit Choudhary
IIT Bombay graduate with a Masters and Bachelors in Electrical
Engineering. I have previously worked as a lead decision scientist for
the Indian National Congress, deploying statistical models (segmentation,
k-nearest neighbours) to help the party leadership/team make data-driven
decisions. My interest lies in putting data at the heart of business for
data-driven decision making.
