Slides: Active Flow Control with Deep Reinforcement Learning
J. Rabault (1), M. Kuchta (1), N. Cerardi (1,2), U. Reglade (1,2), A. Jensen (1), A. Kuhnle (3)
1: University of Oslo, Department of Mathematics
2: Mines ParisTech, CEMEF
3: University of Cambridge, Computer Science Laboratory
MINDS seminar, January 31st, 2019
Deep ANNs: ANNs with several hidden layers, feeding the output of each layer into the next: ’Deep Learning’, LeCun et al., Nature (2015).
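As an illustration, a minimal sketch of such a feed-forward network in plain Python/NumPy; the layer sizes and the tanh activation are arbitrary choices for this sketch, not the ones used later in the talk:

import numpy as np

def init_layer(n_in, n_out, rng):
    # One fully connected layer: small random weights, zero biases.
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    # Feed the output of each hidden layer into the next (tanh activation);
    # the last layer is left linear.
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = np.tanh(x @ W + b)
    return x @ W_out + b_out

rng = np.random.default_rng(0)
sizes = [8, 32, 32, 2]  # input, two hidden layers, output
layers = [init_layer(n, m, rng) for n, m in zip(sizes[:-1], sizes[1:])]
y = forward(np.ones(8), layers)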
Then: Q∗(s, a) = max_π Q^π(s, a), and V∗(s) = max_a Q∗(s, a). Finally: π∗(s) = argmax_a Q∗(s, a).
Therefore, knowledge of Q∗ is knowledge of π∗. Q-learning: iterative approximation of Q∗ using the Bellman equation:

Q∗(s, a) = r + γ max_{a'} Q∗(s', a').
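A minimal sketch of the corresponding tabular Q-learning update in Python; the learning rate, discount factor, and the indexing of the Q table are illustrative assumptions:

import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) towards the Bellman target r + γ max_{a'} Q(s', a').
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q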
Value: V(Θ) = E[ Σ_{t=0}^{H} R(s_t, a_t) | π_Θ ] = Σ_τ P(τ, Θ) R(τ)

∇_Θ V(Θ) = Σ_τ ∇_Θ P(τ, Θ) R(τ)
         = Σ_τ P(τ, Θ) (∇_Θ P(τ, Θ) / P(τ, Θ)) R(τ)
         = Σ_τ P(τ, Θ) ∇_Θ log(P(τ, Θ)) R(τ)

This is an expectation, so approximate it by empirical sampling under the policy π_Θ:

∇_Θ V(Θ) ≈ ĝ = (1/M) Σ_{i=1}^{M} ∇_Θ log P(τ^(i), Θ) R(τ^(i))
Understanding the policy gradient
∇_Θ V(Θ) ≈ ĝ = (1/M) Σ_{i=1}^{M} ∇_Θ log P(τ^(i), Θ) R(τ^(i))
The log-probability gradient depends only on the policy! No dynamics model is needed.
This estimator is unbiased:

∇_Θ V(Θ) = E[ĝ],

so we use ∇_Θ V(Θ) ≈ ĝ = (1/M) Σ_{i=1}^{M} ∇_Θ log P(τ^(i), Θ) R(τ^(i)),
with ∇_Θ log P(τ^(i), Θ) = ∇_Θ Σ_{t=0}^{H} log π_Θ(a_t^(i) | s_t^(i)).
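A minimal sketch of this estimator for a linear-softmax policy over discrete actions; the trajectory format and the toy policy parameterization are illustrative assumptions, not the setup used in the flow-control case:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # ∇_Θ log π_Θ(a|s) for π_Θ(.|s) = softmax(Θᵀ s), with Θ of shape (dim_s, n_actions).
    probs = softmax(theta.T @ s)
    g = -np.outer(s, probs)
    g[:, a] += s
    return g

def policy_gradient_estimate(theta, trajectories):
    # ĝ = (1/M) Σ_i [ Σ_t ∇_Θ log π_Θ(a_t|s_t) ] R(τ^(i))
    g_hat = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        R = sum(rewards)
        for s, a in zip(states, actions):
            g_hat += grad_log_pi(theta, s, a) * R
    return g_hat / len(trajectories)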
Proximal Policy Optimization (PPO)
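A minimal sketch of the clipped surrogate loss at the core of PPO (Schulman et al., 2017), L^CLIP(Θ) = E_t[ min( r_t(Θ) Â_t, clip(r_t(Θ), 1−ε, 1+ε) Â_t ) ] with r_t(Θ) = π_Θ(a_t|s_t) / π_Θold(a_t|s_t); the plain-array inputs and ε = 0.2 here are illustrative:

import numpy as np

def ppo_clipped_loss(ratio, advantage, eps=0.2):
    # Clipped surrogate objective; return the negative so that minimizing
    # the loss maximizes the objective.
    surrogate = np.minimum(ratio * advantage,
                           np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    return -surrogate.mean()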
Add velocity probes (s_t), normal jets (a_t), and reduce the drag coefficient C_D = D / (½ ρ U² L) (r_t).
Learning an active flow control strategy
Simple network: fully connected [512, 512].
Reward: r_t = −⟨C_D⟩_T − 0.2 |⟨C_L⟩_T|.
Learning for 1300 vortex sheddings.
The ANN plays with the flow and observes the effect on the reward: do more of the successful actuations, less of the unsuccessful ones. The weights of the ANN encode its ’understanding’ of the behavior of the system.
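A minimal sketch of how such a reward could be evaluated from force coefficients sampled over one averaging period T; the array inputs and the 0.2 lift penalty mirror the formula above, everything else is an illustrative assumption:

import numpy as np

def reward(cd_samples, cl_samples, cl_penalty=0.2):
    # r_t = -<C_D>_T - 0.2 |<C_L>_T|: penalize mean drag over the period and,
    # more weakly, mean lift, to discourage strategies that induce a large mean lift.
    return -np.mean(cd_samples) - cl_penalty * abs(np.mean(cl_samples))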
Dynamic system:

da_1/dt = σ a_1 − a_2
da_2/dt = σ a_2 + a_1
da_3/dt = −0.1 a_3 − 10 a_4
da_4/dt = −0.1 a_4 + 10 a_3 + b

σ = 0.1 − a_1² − a_2² − a_3² − a_4²
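A minimal sketch of integrating this system with an explicit Euler scheme; the time step, initial condition, and constant forcing b are illustrative assumptions:

import numpy as np

def rhs(a, b):
    # Right-hand side of the four-dimensional system above.
    a1, a2, a3, a4 = a
    sigma = 0.1 - a1**2 - a2**2 - a3**2 - a4**2
    return np.array([sigma * a1 - a2,
                     sigma * a2 + a1,
                     -0.1 * a3 - 10.0 * a4,
                     -0.1 * a4 + 10.0 * a3 + b])

def simulate(a0, b=0.0, dt=1e-3, steps=10_000):
    # Explicit Euler time stepping; fine for illustration, not for accuracy.
    a = np.array(a0, dtype=float)
    history = [a.copy()]
    for _ in range(steps):
        a = a + dt * rhs(a, b)
        history.append(a.copy())
    return np.array(history)

traj = simulate([0.1, 0.0, 0.1, 0.0])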
Cost function:
Do like the brain: low frequency, high latency, highly parallel, very large computational power, very energy efficient.
We need adapted algorithms to take advantage of increases in computing power!
Collaboration with MINDS