Deep Optimal Stopping
Vanessa Pizante
University of Toronto
March 26, 2019
Optimal Stopping Problem
$X = (X_n)_{n=0}^N$ is a Markov process in $\mathbb{R}^d$ on $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^N, \mathbb{P})$.
Objective
Find $V = \sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau, X_\tau)$.
Introduction
Primary Resource
Becker, S., Cheridito, P., and Jentzen, A.: Deep Optimal Stopping. (2018).
Optimal stopping problems with finitely many stopping times can be solved
exactly.
Optimal V given by Snell Envelope.
Difficult to approximate numerically in higher dimensions.
[Becker et al., 2018] proposes decomposing τ into a sequence of N + 1 0-1
stopping decisions.
Each decision is learned via a deep feedforward neural network.
What is a Neural Network?
Feedforward Neural Network
Training a Neural Network
The training data can consist of input and target pairs $(x^m, z_1^m)_{m=1}^M$ or just inputs $(x^m)_{m=1}^M$.
Training the network refers to setting the weight $w_i$ and bias $b_i$ terms using the training data.
$\theta = \{w_1, w_2, w_3, w_z, b_1, b_2, b_3, b_z\}$ is the set of model parameters.
Training a Neural Network
The optimal parameters are found by minimizing the loss function via a
gradient descent algorithm.
e.g., the vanilla gradient descent update step with learning rate $\eta$:
$\theta_{t+1} \leftarrow \theta_t - \eta \cdot \nabla_\theta C(\theta_t)$
After training, we run the algorithm on a new dataset: the testing data.
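As a concrete illustration (my own sketch, not from the slides), one vanilla gradient descent step can be written in PyTorch as follows; the model, loss function, and learning rate are placeholders.

import torch

def gradient_descent_step(model, loss_fn, x, z, lr=0.01):
    # One vanilla update: theta_{t+1} = theta_t - eta * grad_theta C(theta_t)
    loss = loss_fn(model(x), z)          # C(theta) evaluated on a batch (x, z)
    model.zero_grad()                    # clear gradients from the previous step
    loss.backward()                      # compute grad_theta C(theta)
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad             # in-place parameter update
    return loss.item()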
Reinforcement Learning (Optional)
Problem set-up:
Agent receives reward $r_t = r(s_t, a_t)$ depending on its action $a_t$ and the state $s_t$.
Environment is a Markov Decision Process with unknown transition probabilities $p(s_{t+1} \mid s_t, a_t)$.
REINFORCE:
Q-Learning (Optional)
Optimal Stopping and Q-Learning (Optional)
Yu, H. and Bertsekas, D.P.: Q-learning Algorithms for Optimal Stopping Based on Least
Squares. (2007).
Optimal policy is to stop as soon as $(X_t)$ enters the set $D = \{x_t \mid f(x_t) \le Q^*(x_t)\}$.
Optimal Stopping and Q-Learning (Optional)
$Q^*(x_t) \equiv \phi(x_t)'\beta_t$
$\beta_{t+1} = \beta_t + \alpha(\hat\beta_{t+1} - \beta_t)$
where
$\hat\beta_t = \arg\min_{\beta} \sum_{k=0}^{t} \Big(\phi(x_k)'\beta - h(x_k, x_{k+1}) - \gamma \min\{f(x_{k+1}), \phi(x_{k+1})'\beta_t\}\Big)^2$
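A small sketch (my own, under the notation above) of this least-squares fit; the feature-matrix layout, variable names, and signature are assumptions, not taken from [Yu and Bertsekas, 2007].

import numpy as np

def ls_q_fit(phi, phi_next, h, f_next, beta_t, gamma):
    # Solve the least-squares problem defining beta_hat above.
    # phi:      (t+1, K) matrix with rows phi(x_k)
    # phi_next: (t+1, K) matrix with rows phi(x_{k+1})
    # h:        (t+1,) vector of rewards h(x_k, x_{k+1})
    # f_next:   (t+1,) vector of stopping payoffs f(x_{k+1})
    targets = h + gamma * np.minimum(f_next, phi_next @ beta_t)
    beta_hat, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return beta_hat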
Least-Squares Monte Carlo (LSMC)
Least-Squares Neural Network Approach
$q^{\theta_n} = a_2^{\theta_n} \circ \sigma \circ a_1^{\theta_n}$
where
$a_1 : \mathbb{R}^d \to \mathbb{R}^K$, $a_2 : \mathbb{R}^K \to \mathbb{R}$ with $a_i(x) = W_i x + b_i$ and $\sum_{k=0}^{K} |w_{2,k}| \le C_n$.
$\sigma : \mathbb{R}^K \to [0, 1]^K$ with $\sigma_k(y_k) = 1/(1 + e^{-y_k})$.
The algorithm recursively (backwards in time) creates training data $(x_n^m, e^{-r t_n} \cdot V_{n+1}^m)$ and uses a mean squared error loss function.
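A minimal PyTorch sketch of such a one-hidden-layer regression network (my own illustration; the constraint $\sum_k |w_{2,k}| \le C_n$ is not enforced here).

import torch.nn as nn

class ContinuationNet(nn.Module):
    # q^{theta_n} = a_2 ∘ sigma ∘ a_1, used to regress continuation values
    def __init__(self, d, K):
        super().__init__()
        self.a1 = nn.Linear(d, K)        # a_1: R^d -> R^K
        self.sigmoid = nn.Sigmoid()      # sigma: componentwise logistic sigmoid
        self.a2 = nn.Linear(K, 1)        # a_2: R^K -> R

    def forward(self, x):
        return self.a2(self.sigmoid(self.a1(x)))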
Extension: More Complex Neural Networks
Hu, R.: Deep Learning for Ranking Response Surfaces with Applications to Optimal
Stopping Problems. (2019).
Figure 1: Left: Feed-forward NN, similar to NN applied in [Becker et al., 2018], Right:
UNET, applied in [Hu, 2019].
Ranking Response Surfaces
Surface ranking problem: Extract index of minimal surface for each input x
and treat it as a class label.
For optimal stopping, the labels are just stop or continue (at each time step).
Can now use convolutional NNs (e.g. UNET), which can be more computationally efficient [Hu, 2019].
Expressing Stopping Times as a Series of 0-1 Decisions
Goal
Show the optimal decision to stop $(X_n)_{n=0}^N$ can be made according to $\{f_n(X_n)\}_{n=0}^N$, where $f_n : \mathbb{R}^d \to \{0, 1\}$.
Theorem 1
For $n \in \{0, \ldots, N-1\}$, let $\tau_{n+1} \in \mathcal{T}_{n+1}$ be of the form:
$$\tau_{n+1} = \sum_{k=n+1}^{N} k\, f_k(X_k) \prod_{j=n+1}^{k-1} \big(1 - f_j(X_j)\big) \qquad (2)$$
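In words, such a stopping time is simply the first time (after $n$) at which the 0-1 decision fires. A tiny sketch of the full decomposition starting at time 0 (my own illustration, with hypothetical decision functions f_k):

def stopping_time(path, decisions):
    # Evaluate tau of the form above on one path: the first k with f_k(X_k) = 1.
    # path:      sequence of states X_0, ..., X_N
    # decisions: functions f_0, ..., f_N mapping a state to 0 or 1 (f_N should be 1 everywhere)
    for k, (x, f) in enumerate(zip(path, decisions)):
        if f(x) == 1:
            return k
    return len(path) - 1   # only reached if no decision fired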
Sketch of Theorem 1 Proof
So $\mathbb{E}\, g(\tau_{n+1}, X_{\tau_{n+1}}) \ge \mathbb{E}\, g(\tilde\tau, X_{\tilde\tau}) - \varepsilon \implies \mathbb{E}[g(\tau_{n+1}, X_{\tau_{n+1}})\, I_{E^c}] \ge \mathbb{E}[g(\tau, X_\tau)\, I_{E^c}] - \varepsilon$.
What have we proven?
Introducing a Neural Network
Problem:
Neural Network with Continuous Output
where
$q_1$ and $q_2$ are the number of nodes in the hidden layers.
$a_1^\theta : \mathbb{R}^d \to \mathbb{R}^{q_1}$, $a_2^\theta : \mathbb{R}^{q_1} \to \mathbb{R}^{q_2}$, $a_3^\theta : \mathbb{R}^{q_2} \to \mathbb{R}$ satisfy $a_i^\theta(x) = W_i x + b_i$.
$\varphi_{q_i} : \mathbb{R}^{q_i} \to \mathbb{R}^{q_i}$ are ReLU activation functions, $\varphi_{q_i}(x_1, \ldots, x_{q_i}) = (x_1^+, \ldots, x_{q_i}^+)$.
$\psi : \mathbb{R} \to \mathbb{R}$ is the logistic sigmoid function, $\psi(x) = 1/(1 + e^{-x})$.
Return to Binary Decisions
Note [Becker et al., 2018] does not provide a formal proof that using $F^\theta$ to optimize $\theta$ will yield optimal parameters for $f^\theta$.
However, it makes sense intuitively when we consider $F^{\theta_n}(X_n)$ to be the probability that stopping is the optimal decision at time $n$ (given $X_n$).
Selecting a Reward Function
Main Idea
Want to define a reward function that yields a stopping decision at time n that
will maximize our expected future payoff.
If at time n we:
Stop $\implies f_n(X_n) = 1$ and we receive the payoff $g(n, X_n)$.
Continue and thereafter proceed optimally $\implies f_n(X_n) = 0$ and we eventually receive the payoff $g(\tau_{n+1}, X_{\tau_{n+1}})$.
$$\sup_{f \in \mathcal{D}} \mathbb{E}\big[g(n, X_n) f(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - f(X_n))\big] \qquad (8)$$
...But can we even replace $f$ in (8) with a neural network $f^\theta$?
Theorem 2
Let $n \in \{0, \ldots, N-1\}$ and fix a stopping time $\tau_{n+1} \in \mathcal{T}_{n+1}$. Then $\forall \varepsilon > 0$ there exist $q_1, q_2 \in \mathbb{N}^+$ such that
Corollary 1
For any $\varepsilon > 0$, there exist $q_1, q_2 \in \mathbb{N}^+$ and neural network functions of the form (7) such that $f^{\theta_N} \equiv 1$ and the stopping time
$$\hat\tau = \sum_{n=1}^{N} n\, f^{\theta_n}(X_n) \prod_{k=0}^{n-1} \big(1 - f^{\theta_k}(X_k)\big) \qquad (10)$$
Sketch of Theorem 2 Proof (Optional)
$$\mathbb{E}[g(n, X_n)\tilde f^\theta(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - \tilde f^\theta(X_n))] \ge \sup_{f \in \mathcal{D}} \mathbb{E}[g(n, X_n)f(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - f(X_n))] - \varepsilon/4 \qquad (11)$$
$B \mapsto \mathbb{E}[|g(n, X_n)|\, I_B(X_n)]$ and $B \mapsto \mathbb{E}[|g(\tau_{n+1}, X_{\tau_{n+1}})|\, I_B(X_n)]$
So $\exists K \subseteq A$ compact s.t.
Sketch of Theorem 2 Proof Cont. (Optional)
Let $\rho_K(x) = \inf_{y \in K} \|x - y\|_2$ and note that $k_j(x) = \max\{1 - j \cdot \rho_K(x), -1\}$, $j \in \mathbb{N}$, converges pointwise to $I_K - I_{K^c}$. So by the DCT:
$$\mathbb{E}[g(n, X_n) I_{\{k_j(X_n) \ge 0\}} + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - I_{\{k_j(X_n) \ge 0\}})] \ge \mathbb{E}[g(n, X_n) I_K(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - I_K(X_n))] - \varepsilon/4 \qquad (13)$$
By [Leshno et al., 1993], $k_j$ can be approximated uniformly on compact sets by $h(x) = \sum_{i=1}^{r} (v_i^T x + c_i)^+ - \sum_{i=1}^{s} (w_i^T x + d_i)^+$, and thus:
Computing Estimates via the Neural Network
Note (16) acts as the reward function we want to maximize w.r.t. $\theta_n$ via a gradient ascent algorithm.
Computing Estimates via the Neural Network
Overall optimal stopping time (estimate of τ̂ from (10)):
$$\tilde\tau^m = \sum_{n=1}^{N} n\, f^{\theta_n}(y_n^m) \prod_{k=0}^{n-1} \big(1 - f^{\theta_k}(y_k^m)\big)$$
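A sketch (my own, with assumed array shapes) of the resulting Monte Carlo estimate of $\mathbb{E}\, g(\tau, X_\tau)$, using the hard rule "stop once $F^{\theta_n}(y_n^m) \ge 1/2$" as a stand-in for the 0-1 decision $f^{\theta_n}$, which the slides do not spell out explicitly:

import numpy as np
import torch

def estimate_price(paths, payoffs, nets, threshold=0.5):
    # paths:   (M, N+1, d) array of simulated states y^m_n
    # payoffs: (M, N+1) array of (discounted) payoffs g(n, y^m_n)
    # nets:    list of N+1 trained networks F^{theta_n}; the last decision is forced to 1
    M, N1, _ = paths.shape
    stopped = np.zeros(M, dtype=bool)   # paths already stopped at an earlier date
    values = np.zeros(M)
    for n in range(N1):
        x = torch.as_tensor(paths[:, n, :], dtype=torch.float32)
        with torch.no_grad():
            stop = nets[n](x).squeeze(-1).numpy() >= threshold   # f^{theta_n}(y^m_n)
        if n == N1 - 1:
            stop[:] = True              # f^{theta_N} = 1: always stop at maturity
        newly = stop & ~stopped         # first n with f^{theta_n} = 1 gives tau^m
        values[newly] = payoffs[newly, n]
        stopped |= newly
    return values.mean()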
Application: Bermudan Max Call Option
Example
Bermudan max-call option expiring at time $T$ with strike price $K$ written on $d$ assets, $X^1, \ldots, X^d$, with $N + 1$ equidistant exercise times $t_n = nT/N$, $n = 0, 1, \ldots, N$.
The payoff at time $t$ is $\big(\max_{i \in \{1,\ldots,d\}} X_t^i - K\big)^+$.
So its price at time 0 is $\sup_{\tau} \mathbb{E}\Big[e^{-r\tau}\big(\max_{i \in \{1,\ldots,d\}} X_\tau^i - K\big)^+\Big]$.
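For reference, a direct translation of this payoff into code (my own helper functions; the names and signatures are purely illustrative):

import numpy as np

def max_call_payoff(x, K):
    # (max_i x_i - K)^+ for a batch of asset vectors x with shape (..., d)
    return np.maximum(np.max(x, axis=-1) - K, 0.0)

def discounted_payoff(n, x, K, r, T, N):
    # g(n, x) = e^{-r t_n} (max_i x_i - K)^+ with t_n = n T / N
    return np.exp(-r * n * T / N) * max_call_payoff(x, K)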
Application: Bermudan Max Call Option
where $\Delta t = T/N$ and $Z_{k,i}^m \sim \mathcal{N}(0, 1)$.
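The asset dynamics themselves are not reproduced on this slide; assuming the standard Black-Scholes model with interest rate r, dividend yield delta, and volatility sigma (the setting used in [Becker et al., 2018]), a path simulator could look like this sketch of my own:

import numpy as np

def simulate_gbm_paths(M, N, d, X0, r, delta, sigma, T, seed=0):
    # Assumed dynamics (geometric Brownian motion with dividend yield delta):
    #   X^i_{t_{k+1}} = X^i_{t_k} * exp((r - delta - sigma^2/2) dt + sigma sqrt(dt) Z_{k,i})
    rng = np.random.default_rng(seed)
    dt = T / N
    Z = rng.standard_normal((M, N, d))
    increments = (r - delta - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * Z
    log_paths = np.concatenate(
        [np.zeros((M, 1, d)), np.cumsum(increments, axis=1)], axis=1
    )
    return X0 * np.exp(log_paths)   # shape (M, N+1, d): paths at t_0, ..., t_N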
Building Neural Network using PyTorch
Constructing a neural network of the form $F^\theta$ from (6) is simple using PyTorch!

import torch.nn as nn

class NeuralNet(nn.Module):
    def __init__(self, d, q1, q2):
        super(NeuralNet, self).__init__()
        self.a1 = nn.Linear(d, q1)      # a_1: R^d -> R^{q_1}
        self.relu = nn.ReLU()           # phi: componentwise ReLU
        self.a2 = nn.Linear(q1, q2)     # a_2: R^{q_1} -> R^{q_2}
        self.a3 = nn.Linear(q2, 1)      # a_3: R^{q_2} -> R
        self.sigmoid = nn.Sigmoid()     # psi: logistic sigmoid

    def forward(self, x):
        out = self.a1(x)
        out = self.relu(out)
        out = self.a2(out)
        out = self.relu(out)
        out = self.a3(out)
        out = self.sigmoid(out)
        return out
Training the Network in PyTorch
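The training code itself is not reproduced here; below is a minimal sketch of my own, assuming the reward in (8)/(16) is approximated by a Monte Carlo average over simulated paths and maximized with Adam (the function name, tensor shapes, and hyperparameters are placeholders):

import torch

def train_decision_network(net, g_now, g_future, x, lr=0.001, steps=3000):
    # x:        (M, d) tensor of simulated states y^m_n
    # g_now:    (M,) tensor of payoffs g(n, y^m_n) if we stop now
    # g_future: (M,) tensor of payoffs g(tau^m_{n+1}, y^m_{tau^m_{n+1}}) if we continue
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        F = net(x).squeeze(-1)                          # soft stopping decision in (0, 1)
        reward = (g_now * F + g_future * (1.0 - F)).mean()
        loss = -reward                                  # gradient ascent = descent on -reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return net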
Adaptive Moment Estimation (Adam) Optimization
Input: $\alpha$, $r(\theta)$, $\theta_0$
1. Set $\beta_1 = 0.9$; $\beta_2 = 0.99$; $\varepsilon = 10^{-8}$
2. $m_0 = v_0 = 0$; $t = 0$
3. while $r(\theta_t)$ not minimized do
4.   $t = t + 1$
5.   $G_t = \nabla_\theta r(\theta_{t-1})$
6.   $m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot G_t$
7.   $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot (G_t)^2$
8.   $\hat m_t = m_t / (1 - (\beta_1)^t)$
9.   $\hat v_t = v_t / (1 - (\beta_2)^t)$
10.  $\theta_t = \theta_{t-1} - \alpha \cdot \hat m_t / (\sqrt{\hat v_t} + \varepsilon)$
Output: $\theta_t$
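For illustration only, the update above translates almost line for line into NumPy; in practice torch.optim.Adam, used in the training sketch earlier, implements the same steps.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    # One Adam update, following steps 4-10 of the pseudocode above (t starts at 1)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v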
Results
Note: The actual estimate for $d = 2$ is from [Hu, 2019], and those for $d = 5, 10$ are from [Becker et al., 2018].
Optimally Stopping a Fractional Brownian Motion
(Optional)
Objective
We want to evaluate $\sup_{0 \le \tau \le 1} \mathbb{E}\, W_\tau^H$.
Challenge of Optimally Stopping FBM (Optional)
So by the Optional Stopping Theorem, $\mathbb{E}\, W_\tau^{1/2} = 0$ for any stopping time $\tau$ such that $W_{\tau \wedge t}^{1/2}$ is bounded above by a constant.
Optimally Stopping a Fractional Brownian Motion
(Optional)
Discretize the time interval $[0, 1]$ into time steps $t_n = \frac{n}{100}$.
Create a 100-dimensional Markov process $(X_n)_{n=0}^{100}$ to describe $(W_{t_n}^H)_{n=0}^{100}$:
$X_0 = (0, 0, \ldots, 0)$
$X_1 = (W_{t_1}^H, 0, \ldots, 0)$
$X_2 = (W_{t_2}^H, W_{t_1}^H, \ldots, 0)$
$\vdots$
$X_{100} = (W_{t_{100}}^H, W_{t_{99}}^H, \ldots, W_{t_1}^H)$
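A sketch of my own (not from the slides) that simulates fractional Brownian motion by the Cholesky method and stacks the values into the states $X_n$ above; the function names and the simulation method are assumptions.

import numpy as np

def fbm_cov(H, n_steps):
    # Covariance of (W^H_{t_1}, ..., W^H_{t_n}) with t_k = k/n:
    #   Cov(W^H_s, W^H_t) = (s^{2H} + t^{2H} - |t - s|^{2H}) / 2
    t = np.arange(1, n_steps + 1) / n_steps
    s, u = np.meshgrid(t, t)
    return 0.5 * (s ** (2 * H) + u ** (2 * H) - np.abs(s - u) ** (2 * H))

def fbm_state_paths(H, M, n_steps=100, seed=0):
    # Simulate M FBM paths (Cholesky method) and stack them into the states X_0, ..., X_100
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(fbm_cov(H, n_steps))
    W = rng.standard_normal((M, n_steps)) @ L.T           # W^H at t_1, ..., t_100
    X = np.zeros((M, n_steps + 1, n_steps))
    for n in range(1, n_steps + 1):
        X[:, n, :n] = W[:, n - 1::-1]                     # (W_{t_n}, ..., W_{t_1}, 0, ..., 0)
    return X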
Optimally Stopping a Fractional Brownian Motion
(Optional)
Then train neural networks of the form (7) with $d = 100$, $q_1 = 110$, and $q_2 = 55$, and approximate (21) via Monte Carlo.
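With the PyTorch class from earlier, this would amount to something like the following (illustrative only):

nets = [NeuralNet(d=100, q1=110, q2=55) for _ in range(101)]   # one F^{theta_n} per time step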
Optimally Stopping a Fractional Brownian Motion
(Optional)
Figure 2: Results of [Becker et al., 2018]. Expected value of optimally stopped FBM
w.r.t. its Hurst parameter H.
Concluding Remarks
References
Becker, S., Cheridito, P., and Jentzen, A. (2018). Deep Optimal Stopping.
Hu, R. (2019). Deep Learning for Ranking Response Surfaces with Applications to Optimal Stopping Problems.
Leshno, M., Lin, V. Ya., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861-867.
Yu, H. and Bertsekas, D. P. (2007). Q-learning Algorithms for Optimal Stopping Based on Least Squares.