Deep Reinforcement Learning for Algorithmic Trading

Gaurav S.
Software Engineer at MediaMath

In my previous post, I trained a simple Neural Network to approximate a Bond Price-Yield function. As we saw, given a fairly large data set, a Neural Network can find the underlying statistical relationship between the inputs and the outputs by adjusting the weights and biases in its neurons. In this post, I will go a step further by training an Agent to make automated trading decisions in a simulated stochastic market environment using Reinforcement Learning, specifically Deep Q-Learning. This technique was pioneered by DeepMind, a UK firm, and was able to master a diverse range of Atari 2600 games at a superhuman level. You can watch it play by itself here.
Why use AI for algorithmic trading? The vast majority of algorithmic trading comprises statistical arbitrage / relative value strategies, which are mostly based on convergence to a mean, where the mean is derived from a randomly chosen sample of historical data. Algorithmic trading primarily has two components: policy and mechanism. The policy is chosen by the traders and the mechanism is implemented by the machines. It has always been a huge challenge to pick the right data sample for that universal spread measure through regression. The issue here is that statistical arbitrage is not so much a regression problem as it is a behavioral design problem; this is well understood but quite poorly implemented. With all the advancement in Artificial Intelligence and Machine Learning, the next wave of algorithmic trading will have the machines choose both the policy and the mechanism. Using advanced concepts such as Deep Reinforcement Learning and Neural Networks, it is possible to build a trading/portfolio management system with cognitive properties that can discover a long-term strategy through training in various stochastic environments.
IMPLEMENTATION
Based on the investment thesis of mean reversion of spreads, I will simulate 500 episodes of two mean-reverting stochastic processes and train the agent to run a long/short strategy. Think of it as two instruments (stocks or bonds) belonging to the same industry sector which more or less move together. The agent, i.e. the neural net, is the trader who exploits aberrations in their behavior caused by news, earnings reports, weather or other macro-economic events: it goes long on the cheaper instrument and short on the expensive one (and vice versa) until the spread reverts to its mean. In fact, the neural net won't even know about the mean-reversion behavior, or whether to run a statistical arbitrage strategy at all; it will discover this pattern by itself in its pursuit of maximizing the rewards/gains in every episode, i.e. it will learn the strategy through trial and error. Once trained in this environment, the agent should be able to trade any two instruments which exhibit a similar co-integration behavior and volatility range. We can safely assume that the trading volume is small enough to have no impact whatsoever on the market. I would like to re-emphasize the importance of generating unbiased data as opposed to using historical market data, a concept I defined as 'Smart Data' in my previous post.
ENVIRONMENT
The first and the most important part is to design the environment. The environment class should implement the following attributes/methods, based on the OpenAI Gym convention:
Init: initializes the environment at the beginning of the episode.
State: holds the prices of A and B at any given time t.
Step: advances the environment by one time step. With each call to this method, the environment returns four values:
a) next_state: the state resulting from the action performed by the agent; in our case, always the prices of A and B at t + 1.
b) reward: the reward associated with the action performed by the agent.
c) done: whether we have reached the end of the episode.
d) info: diagnostic information.
Reset: resets the environment after every episode of training. In this case, it restores the prices of both A and B to their respective means and simulates a new price path.
It's good practice to keep the environment code separate from that of the agent. Doing so makes it easier to modify the environment's behavior and retrain the agent on the fly. I wrote a Python class called market_env to implement this behavior.
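As a rough illustration of that interface, here is a minimal sketch of such a class. The constructor arguments, the mean-reversion speed kappa, the Euler discretization of the price update and the exact reward definition are all assumptions for illustration; the actual market_env may differ.

```python
import numpy as np

class market_env:
    """Gym-style environment simulating two mean-reverting prices, A and B."""

    def __init__(self, mean_a=100.0, mean_b=100.0, vol_a=0.10, vol_b=0.20,
                 kappa=2.0, n_steps=500):
        self.mean_a, self.mean_b = mean_a, mean_b
        self.vol_a, self.vol_b = vol_a, vol_b
        self.kappa = kappa          # speed of mean reversion (assumed)
        self.n_steps = n_steps
        self.reset()

    def reset(self):
        """Restore both prices to their means and start a fresh episode."""
        self.t = 0
        self.state = np.array([self.mean_a, self.mean_b])
        return self.state

    def _next_price(self, price, mean, vol):
        # One Euler step of an Ornstein-Uhlenbeck process.
        dt = 1.0 / self.n_steps
        return price + self.kappa * (mean - price) * dt + vol * mean * np.sqrt(dt) * np.random.randn()

    def step(self, action):
        """Advance one time step and return (next_state, reward, done, info)."""
        price_a, price_b = self.state
        next_a = self._next_price(price_a, self.mean_a, self.vol_a)
        next_b = self._next_price(price_b, self.mean_b, self.vol_b)
        # Reward = one-step P&L of the position implied by the action:
        # 0 = long A / short B, 1 = short A / long B, 2 = do nothing.
        if action == 0:
            reward = (next_a - price_a) - (next_b - price_b)
        elif action == 1:
            reward = (next_b - price_b) - (next_a - price_a)
        else:
            reward = 0.0
        self.state = np.array([next_a, next_b])
        self.t += 1
        done = self.t >= self.n_steps
        return self.state, reward, done, {}
```

The step reward sketched here is simply the one-step P&L of a unit long/short position, which is one straightforward way to encode "maximize the gains in every episode".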
A sample path of 500 time steps for the two assets generated by the environment, with A (blue): mean = 100.0, vol = 10% and B (green): mean = 100.0, vol = 20%, using the Ornstein–Uhlenbeck process (plotted with Python/matplotlib), is shown below. As you can see, the two processes cross each other many times, exhibiting a co-integration property: an ideal ground to train the agent for a long-short strategy.
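A chart like that can be reproduced by rolling the environment forward and plotting both prices; a small usage sketch, assuming the market_env sketch above:

```python
import matplotlib.pyplot as plt

env = market_env(mean_a=100.0, mean_b=100.0, vol_a=0.10, vol_b=0.20, n_steps=500)
state = env.reset()
prices_a, prices_b = [state[0]], [state[1]]

done = False
while not done:
    # Action 2 = do nothing; we only want to observe the simulated prices here.
    state, reward, done, info = env.step(2)
    prices_a.append(state[0])
    prices_b.append(state[1])

plt.plot(prices_a, color="blue", label="A (vol 10%)")
plt.plot(prices_b, color="green", label="B (vol 20%)")
plt.xlabel("time step")
plt.ylabel("price")
plt.legend()
plt.show()
```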

AGENT
The agent is an MLP (Multi-Layer Perceptron) neural network acting as a multi-class classifier. It takes two inputs from the environment, the prices of A and B, and chooses one of three actions: (0) long A, short B; (1) short A, long B; (2) do nothing, with the goal of maximizing the overall reward at every step. After every action, it receives the next observation (state) and the reward associated with its previous action. Since the environment is stochastic, the agent operates in an MDP (Markov Decision Process), i.e. the next action is based entirely on the current state and not on the history of prices/states/actions, and it discounts future reward(s) by a certain factor (gamma). The score is calculated at every step and saved in the agent's memory along with the action, the current state and the next state. The cumulative reward per episode is the sum of all the individual scores over the lifetime of an episode and will eventually judge the performance of the agent over its training. The complete workflow diagram is shown below:
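In code terms, that per-episode workflow looks roughly like the loop below. This is only a sketch: the agent object with act and remember methods is an assumption, wrapping the memory and network described later in the post.

```python
# One training episode: act, observe, remember, accumulate the reward.
state = env.reset()
episode_reward = 0.0
done = False
while not done:
    action = agent.act(state)           # 0 = long A/short B, 1 = short A/long B, 2 = do nothing
    next_state, reward, done, info = env.step(action)
    agent.remember(state, action, reward, next_state, done)
    episode_reward += reward
    state = next_state
```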
Why should this approach even work? Because the spread of the two co-integrated processes is stationary, i.e. it has a constant mean and variance over time and can be thought of as roughly normally distributed. The agent can exploit this statistical behavior by buying and selling A and B simultaneously based on their price spread (= Price_A - Price_B). For example, if the spread is negative, A is cheap and B is expensive, and the agent should figure out that the action with the higher reward is to go long A and short B. The agent approximates this through the Q(s, a) function, where 's' is the state and 'a' is the optimal action associated with that state, so as to maximize its returns over the lifetime of the episode. The policy for the next action is determined using the Bellman equation, described below:
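In its standard Q-learning form, the update is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where r is the immediate reward, s' is the next state, gamma is the discount factor and alpha is the learning rate.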
Through this mechanism, it also learns to weigh long-term prospects rather than just immediate rewards by assigning a different Q-value to each action. This is the crux of Reinforcement Learning. Since the input space can be massively large, we use a deep neural network to approximate the Q(s, a) function through backpropagation. Over multiple iterations, the Q(s, a) function converges to the optimal action in every possible state it has explored.
Speaking of the internal details, the agent has two major components:
Memory: a list of events. The agent stores information through iterations of exploration and exploitation, as tuples of the format (state, action, reward, next_state, message). A minimal sketch of this memory follows after this list.
Brain: a fully connected, feed-forward neural net which trains from the memory, i.e. past experiences. Given the current state as input, it predicts the next optimal action.
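A minimal replay-memory sketch, assuming a fixed capacity; the capacity of 2000 is an assumption, and the fifth field the post calls 'message' is represented here by the episode-termination flag.

```python
import random
from collections import deque

class Memory:
    """Fixed-size store of past experiences, sampled for training."""

    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)   # old experiences fall off the end

    def remember(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random batch no larger than what is stored."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```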
To train the agent, we need to build our neural network, which will learn to classify actions based on the inputs it receives. A simplified picture (the real neural net is more complicated than this):

Inputs (2): the prices of A and B (green).
Hidden (2 layers): denoted by 'H' nodes (blue).
Output (3): the classes of actions (red).

For the implementation, I am using Keras and TensorFlow, both of which are free and open-source Python libraries.
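A minimal Keras sketch of that network, with two inputs, two hidden layers and three outputs. The hidden-layer width, activations, loss and optimizer settings are assumptions; in a DQN-style setup the output layer usually produces linear Q-values trained with a mean-squared-error loss, which is what is sketched here rather than a softmax classifier.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_brain(state_size=2, action_size=3, hidden_units=24, learning_rate=0.001):
    """Fully connected, feed-forward net mapping a state to one Q-value per action."""
    model = Sequential([
        Dense(hidden_units, input_dim=state_size, activation="relu"),
        Dense(hidden_units, activation="relu"),
        Dense(action_size, activation="linear"),   # one Q-value per action
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=learning_rate))
    return model
```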
The neural net is trained on an arbitrarily sized sample drawn from its memory at the end of every episode, so after every episode the network collects more data and trains further on it. As a result, the Q(s, a) function converges over more iterations and the agent's performance improves over time until it reaches a saturation point. A sketch of this end-of-episode training step is shown below.
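A hedged sketch of that training step; epsilon-greedy exploration, the batch size and the discount factor value are assumptions, and the brain and memory objects are the ones sketched above.

```python
import numpy as np

GAMMA = 0.95      # discount factor (assumed)
EPSILON = 0.1     # exploration rate for epsilon-greedy action selection (assumed)

def act(brain, state, action_size=3):
    """Pick a random action with probability EPSILON, else the best current Q-value."""
    if np.random.rand() < EPSILON:
        return np.random.randint(action_size)
    q_values = brain.predict(state.reshape(1, -1), verbose=0)
    return int(np.argmax(q_values[0]))

def replay(brain, memory, batch_size=32):
    """Fit the brain on a batch sampled from memory using the Bellman target."""
    for state, action, reward, next_state, done in memory.sample(batch_size):
        target = reward
        if not done:
            target += GAMMA * np.max(brain.predict(next_state.reshape(1, -1), verbose=0)[0])
        target_q = brain.predict(state.reshape(1, -1), verbose=0)
        target_q[0][action] = target
        brain.fit(state.reshape(1, -1), target_q, epochs=1, verbose=0)
```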

The graph of cumulative reward per episode (with returns/rewards scaled) shows 3 different plots representing entire training scenarios of 500 episodes, each with 500 steps. With every step, the agent performs an action and receives its reward. In the beginning, since the agent has no preconception of the consequences of its actions, it takes randomized actions to observe the rewards associated with them. Hence the cumulative reward per episode fluctuates a lot from episode 0 to roughly episode 300; beyond 300 episodes, the agent starts learning from its training, and by the 400th episode it has almost converged in each of the training scenarios as it discovers the long-short pattern and starts to fully exploit it.
There are still many challenges here, and engineering both the agent and the environment is part of ongoing research. My aim was not to show a 'backtested profitable trading strategy' but to describe how to apply advanced Machine Learning concepts such as Deep Q-Learning and Neural Networks to the field of Algorithmic Trading. It is an extremely complicated process and pretty hard to explain in a single blog post, but I have tried my best to simplify things. I am going to put a much more detailed analysis and code on GitHub, so please watch this space if you are interested.
Furthermore, this approach can be extended to a large portfolio of stocks and bonds, and the agent can be trained under a diverse range of stochastic environments. Additionally, the agent's behavior can be constrained by various risk parameters such as sizing, hedging, etc. One can also have multiple agents training under different suitability criteria given the desired risk/return profiles. These approximations can be made more accurate using larger data sets and distributed computing power.
Eventually, the question is: can AI do everything? Probably not. Can we effectively train it to do anything? Possibly yes; with real intelligence, artificial intelligence can surely thrive.
Thanks for reading. Please feel free to share your ideas.
Hope you enjoyed the post!
DISCLAIMER:
1. Opinions expressed are solely my own and do not express the views or opinions of any of my
employers.
2. The information from the Site is based on financial models, and trading signals are generated
mathematically. All of the calculations, signals, timing systems, and forecasts are the result of
back testing, and are therefore merely hypothetical. Trading signals or forecasts used to produce
our results were derived from equations which were developed through hypothetical reasoning
based on a variety of factors. Theoretical buy and sell methods were tested against the past to
prove the profitability of those methods in the past. Performance generated through back testing
has many and possibly serious limitations. We do not claim that the historical performance,
signals or forecasts will be indicative of future results. There will be substantial and possibly
extreme differences between historical performance and future performance. Past performance
is no guarantee of future performance. There is no guarantee that out-of-sample performance
will match that of prior in-sample performance. The website does not claim or warrant that its
timing systems, signals, forecasts, opinions or analyses are consistent, logical or free from
hindsight or other bias or that the data used to generate signals in the back tests was available
to investors on the dates for which theoretical signals were generated.

COMMENTS

Igor Halperin (Research Professor of Financial Machine Learning at NYU Tandon School of Engineering):
A nice article. I would add a few comments: 1. Q-learning (or RL in general) is NOT a form of semi-supervised learning. 2. I would be interested to see any empirical support to keeping the word 'Deep' in Deep Reinforcement Learning that you use. Did you benchmark it against a simple linear (or bi-linear, in your case) set of basis functions, and a simple linear architecture? 3. I am not sure I agree with this statement: "Furthermore, this approach can be extended into a large portfolio of stocks and bonds and the agent can be trained under diverse range of stochastic environments." I think this particular approach of Q-learning cannot be extended to a multi-asset portfolio, but its generalization called G-learning can do that. You can find some details on this approach in my paper on RL in a multi-asset setting here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3174498

Gaurav S. (Software Engineer at MediaMath):
Igor, there have been known examples of linear architectures (linear regression models) used to find a constant 'fitted' spread. But as I have mentioned, "It has always been a huge challenge to pick the right data sample for that universal spread measure through regression since the mean is derived from a randomly chosen sample of historical data". The real reason to choose a NN, or Reinforcement Learning for that matter, is to identify the spread not just as a measure but rather as a function which we are trying to approximate, and it's anything but linear.

Igor Halperin (Research Professor of Financial Machine Learning at NYU Tandon School of Engineering):
Gerardo Lemus - I believe it is always better to start with the simplest model first (a linear architecture), and THEN see what changes when you move to neural networks. Every time I violated this rule, I ended up regretting it.

Gerardo Lemus (Quantitative Finance Practitioner):
Gaurav S. - I replicated Ritter's paper in https://medium.com/@gjlr2000/teaching-a-robot-to-buy-low-sell-high-c8d4f061b93d -- the code is in Google Colaboratory.

Gaurav S. (Software Engineer at MediaMath):
That's great. Thanks for sharing. I will review it.

Andreas Holm Nielsen (Trading Analyst - Algorithmic Trading at Danske Commodities):
Great article! I was a bit confused regarding the unbiased data simulation you described. Is it possible for you to explain this point a bit more? I guess you would need historical data to simulate data as well (distribution, Monte Carlo) unless you have some static pricing formula as in your bond example?

Gerardo Lemus (Quantitative Finance Practitioner):
One possibility is to estimate the mean-reverting parameters from history (see https://medium.com/@gjlr2000/mean-reversion-in-finance-definitions-45242f19526f for a Python example) and then use them as parameters for the Monte Carlo simulation for the reinforcement learning training. The solution should be similar to the dynamic programming optimal solution http://folk.uio.no/kennethk/articles/art53_journal.pdf (but the advantage of reinforcement learning is that you can add complexity to the simulation, like adding seasonality as mentioned above).

Gaurav S. (Software Engineer at MediaMath):
By unbiased I mean that the data should have no relationship with the actual historical data. The important part to note here is that you are training the agent on the market behavior, not on historical data. We would only need the long-running mean and volatility of the index (say the S&P 500 or FTSE 100) and then you can generate/simulate 5000 possible paths to train/test your agent/strategy fairly.

Gerardo Lemus (Quantitative Finance Practitioner):
I am curious to see how much better the deep learning optimal policy improves upon the 'rule-of-thumb' policies (like https://medium.com/@gjlr2000/trading-mean-reversion-1493ba10460f ). Also would be interested in seeing how different rewards could change the policies (e.g. log utility of wealth, others). Thomas Choi - I think that because the price action is a Monte Carlo _simulation_ you could change the simulation parameters to anything you want (change mean reversion, add jumps, add seasonality) and create a new optimal policy (just like changing the rules of Go, and retraining AlphaGo under the new rules).

Gaurav S. (Software Engineer at MediaMath):
Gerardo, great questions. The policy depends on the rewards, and the discount factor can be adjusted to any environment based on whether one wants a short-term view (low discount factor) or a very long-term view (very high discount factor). The reward function can include all the parameters you mentioned plus more, for example the current state of the portfolio, market volatility, etc. You are completely right in the sense that if one changes the simulation (i.e. the behavior of the environment), the agent will adjust its behavior based on the (new) reward policy.

Andrew Peters (General Manager at The Philippines Recruitment Company):
You've mentioned a few interesting points on machine learning here, thank you.

Gaurav S. (Software Engineer at MediaMath):
Thanks Andrew. I am glad you liked it.

Thomas Choi (System Architect/Electronic Trading/Solution Provider/Machine Learning/BlockChain):
Great idea, it is nice that a trading model can be built this way. A few questions regarding your model: 1) Modify the actions to a) long A, b) short A, c) long B, d) short B, e) nothing, so we can see whether the agent can learn the optimal policy that it thinks the long/short of A/B can have the maximum reward (or P&L). 2) I assume the model works best with a daily step, and the reward is calculated on the daily closing price. 3) Have you done the training of the agent and put it in simulation on real-time market data? 4) This is a kind of spread trading on two commodities, but there is also a seasonal factor in the relationship. Do you think this "seasonal" feature can be added to the model?

Gaurav S. (Software Engineer at MediaMath):
Thanks for taking the time to review, and great questions. Here are my answers: 1. If the actions are based on individual stocks, then we can do the experiment on just a single stock and not a pair. The input to the model is in fact just the spread (even though I show the prices of A and B). It is necessary that if we go long on A, we simultaneously go short on B, and vice versa, to keep a market-neutral position. 2. I haven't really specified the size of a time step. It can range from a microsecond to a month based on the instrument type. Some assets are very liquid, like FX derivatives, while others don't trade for months, like municipal bonds. 3. No. This is a fairly new concept and an evolving idea. I think there is more to this before I can put it to use in real trading. 4. I believe so. It can be added as a third input (in addition to the prices) to account for the classification of actions.

