
Deep Reinforcement Learning: a new tool to apprehend complexity arising from Fluid Mechanics and Active Flow Control?

J. Rabault (1), M. Kuchta (1), N. Cerardi (1,2), U. Reglade (1,2), A. Jensen (1), A. Kuhnle (3)

1: University of Oslo, Department of Mathematics
2: Mines ParisTech, CEMEF
3: University of Cambridge, Computer Science Laboratory

January 31st, 2019



whoami / affiliation

Postdoc, University of Oslo. Collaborations with UNIS (Svalbard) and Mines ParisTech.


Main project: wave / ice interaction in the polar regions.
Mostly fieldwork and laboratory experiments.



whoami / Northwest Barents Sea



whoami / ANNs

Interest in Machine Learning / ANNs for Fluid Mechanics.


Keep informed about developments in AI and apply them to my field.



Complexity: when our tools break

Modern science has great difficulty handling the combination of:


Non-linearity
High dimensionality
Evolution in time
Most analytical tools break, and simulation can be expensive. Fluid Mechanics is ’lucky’ to have this problem:
Prove or disprove that ”In three space dimensions and time, given an
initial velocity field, there exists a vector velocity and a scalar pressure
field, which are both smooth and globally defined, that solve the
Navier–Stokes equations”, Clay Mathematics Institute / Millennium Prize
Problems.
”I think the next [21st] century will be the century of complexity”,
Stephen W. Hawking (2005).



Flow Control

’Delaying transition to turbulence by a passive mechanism’, Fransson et al. (2006).

Flow control: a relevant theoretical and industrial problem.


Difficult: ’the Art of Flow Control’.
Time dependence, nonlinearity, high dimensionality, turbulence.
We propose Artificial Neural Networks (ANNs) and Deep Reinforcement
Learning (DRL) as a tool to perform active flow control:
”Artificial Neural Networks trained through Deep Reinforcement Learning
discover control strategies for active flow control”, Rabault, Kuchta,
Jensen, Reglade, Cerardi, arXiv (2018), accepted in JFM.
Can we use our actuators / design parameter space better?
Agenda

Introduction to Artificial Neural Networks and Supervised Learning.


Taking Machine Learning further: Deep Reinforcement Learning.
Application to a simple case of Active Flow Control.
Attacking challenging multimodal non-linearity.

A few words of bonus material.


Collaboration with MINDS.



A short introduction to Artificial Neural Networks and
Supervised Learning



Artificial Neural Networks: a bit of history

Idea of ANNs appeared with the first computers:

’The perceptron, a perceiving and recognizing automaton’, Rosenblatt,


Cornell Aeronautical Laboratory (1957).

It took some time before useful applications appeared:


A few applications during the 70s.
Large-scale application of CNNs in the 90s.
The current ’AI spring’, since around 2012: Deep Learning.
Succession of ’AI springs’ and ’AI winters’, but now it is summer...
Deep ANNs: I
Artificial Neural Networks are simply a set of neurons.

Deep ANNs: ANNs with several hidden layers, feeding one layer into the next:
’Deep Learning’, LeCun et al., Nature (2015).

Convolution layers are a special type of layer:
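To make the convolution idea concrete, here is a minimal NumPy sketch (illustrative, not taken from the slides) of one 2D convolution filter applied to a single-channel image; the kernel values are arbitrary placeholders.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution (cross-correlation): slide the kernel over the image."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)          # e.g. a grayscale image
edge_kernel = np.array([[1., 0., -1.],  # arbitrary 3x3 filter (placeholder values)
                        [2., 0., -2.],
                        [1., 0., -1.]])
feature_map = conv2d_valid(image, edge_kernel)
print(feature_map.shape)                # (26, 26)
```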



Deep ANNs: II
What activation function to use?

Linear activation function: the whole network stays linear...

Non-linear functions: in practice, (leaky) ReLU works best.
Deep ANN + non-linear activation function = universal approximator.
’Multilayer feedforward networks are universal approximators’, Hornik et al., Neural Networks (1989). Even Turing complete when equipped with memory.
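As a minimal illustration of ’deep ANN + non-linear activation’ (a sketch only, not the network used later in this talk), a forward pass of a two-hidden-layer MLP with leaky ReLU in NumPy:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Non-linear activation: identity for x > 0, small slope alpha otherwise.
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(0)
sizes = [8, 64, 64, 1]                       # input, two hidden layers, output (arbitrary)
weights = [rng.standard_normal((m, n)) * np.sqrt(2.0 / m)
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # Feed each layer into the next; the last layer is kept linear.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = leaky_relu(x @ W + b)
    return x @ weights[-1] + biases[-1]

print(forward(rng.standard_normal(8)))       # output of shape (1,)
```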
Deep ANNs: III
Universal approximator - Easily scaled - Easily parallelizable
BUT... difficult to train.

Solution: gradient descent, back-propagation of errors: ’Applications


of advances in nonlinear sensitivity analysis’, Werbos, System
modeling and optimization (1982).

Feed data in, compute error.


Back-propagate the error gradient (chain rule, Leibniz).
Update the weights; efficient thanks to a compiled computational graph.
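A sketch of these training steps for the simplest possible case, a single linear neuron with a squared-error loss; the data and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))            # toy data
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3     # targets generated by a known rule

w, b, lr = np.zeros(3), 0.0, 0.1
for step in range(200):
    y_hat = X @ w + b                        # 1. feed data in
    err = y_hat - y
    loss = np.mean(err ** 2)                 # 2. compute the error
    grad_w = 2 * X.T @ err / len(y)          # 3. back-propagate the gradient (chain rule)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w                         # 4. update the coefficients
    b -= lr * grad_b
print(loss, w, b)                            # loss ~ 0, w ~ [1.5, -2.0, 0.5], b ~ 0.3
```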
The Deep Learning revolution

Large amounts of data.


Powerful GPUs, fast batch training.
A number of technical tricks.

’Large-scale deep unsupervised learning using graphics processors’, Raina et al. (2009)

’Understanding the difficulty of training deep feedforward neural networks’, X. Glorot et al. (2010)

’Deep sparse rectifier neural networks’, Glorot et al. (2011)

’ImageNet classification with deep convolutional neural networks’, Krizhevsky et al. (2012)



Machine Learning is taking over

Supervised learning and image analysis now work well...



Taking Machine Learning one step further: Deep
Reinforcement Learning



Deep Reinforcement Learning overview
Get computers to learn by trial and error.
ANNs can be used as a tool to build such systems.
Birth of Deep Reinforcement Learning: Deep Q-Learning:
’Human-level control through deep reinforcement learning’, Mnih et al., Nature (2015).
State-of-the-art, industry standard for continuous control: PPO:
’Proximal policy optimization algorithms’, Schulman et al., arXiv (2017).



The reinforcement learning framework
Interaction between Agent and Environment through 3 channels (state, action, reward).

Guide learning through the reward function: no need to know the solution in advance on a training set. Learn by experimenting.



Q-Learning (I)

’Human-level control through deep reinforcement learning’, Mnih et al., Nature (2015).

Based on older work about Q-Learning:
’Q-learning’, Watkins and Dayan, Machine Learning (1992).

Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = r_t + γ R_{t+1}.

Given a state s and a policy π, define the value function:

V^π(s) = E[ Σ_{t≥0} γ^t r_t | s, π ],

and the optimal value function: V*(s) = max_π V^π(s).
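A tiny sketch of the discounted return R_t = r_t + γ R_{t+1}, computed backwards over an illustrative reward sequence:

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * R_{t+1}, accumulated from the end of the episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

print(discounted_returns([1.0, 0.0, 0.0, 10.0], gamma=0.9))
# approximately [8.29, 8.1, 9.0, 10.0]
```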


Q-Learning (II)

Introduce the action of the agent: expected return, starting in state s, performing action a, then continuing with policy π: Q^π(s, a).

Then: Q*(s, a) = max_π Q^π(s, a), and V*(s) = max_a Q*(s, a). Finally:

π*(s) = argmax_a Q*(s, a).

Therefore, knowledge of Q* is knowledge of π*. Q-learning: iterative approximation of Q* using the Bellman equation:

Q*(s, a) = r + γ max_{a′} Q*(s′, a′).

Random action with probability ε, argmax action with probability 1 − ε, and:

Q_{n+1}(s_t, a_t) = Q_n(s_t, a_t) + α (r_{t+1} + γ max_a Q_n(s_{t+1}, a) − Q_n(s_t, a_t)).
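A sketch of the tabular ε-greedy Q-learning update above, for a generic environment with discrete states and actions; the gym-style `env` interface (reset/step) is an assumption.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: random action with probability eps, argmax otherwise
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # assumed environment API
            # Bellman-based update of the Q table
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```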



The ’maze’ illustration
A good way of thinking about Deep Q-Learning: exploring a maze.
Matrix Q, one entry per location.
For each location, what is the best possible outcome of each action?



Deep Q-Learning
The world is non-linear, high-dimensional, complex...

... use a Deep Neural Network for Q. Therefore ’Deep Q-Learning’.

’Mastering the game of Go without human knowledge’, Silver et al., Nature (2017).

Parametrize Q(s, a, Θ_i) using a deep ANN (Θ_i: the network weights).

max instead of argmax: the network gives the Q-value of each a in a different output unit.
Loss function: L_i(Θ_i) = E[ (r + γ max_{a′} Q(s′, a′, Θ_i) − Q(s, a, Θ_i))² ]
Experience replay buffer, inspired by biology.
Update the target values at the end of each training iteration, for stability.
Computers that learn without us knowing a solution. Adapted to non-linear, high-dimensional, complex problems. Deep ANNs can even be made to compete against each other (adversarial training). ANNs are just one possible choice (but they work well).
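A sketch of the DQN loss above for one minibatch sampled from the replay buffer, kept library-agnostic with NumPy; the network callables and batch layout are assumptions.

```python
import numpy as np

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error: (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2.

    q_net / target_net: callables mapping a batch of states to per-action Q-values.
    batch: dict with arrays 's', 'a', 'r', 's_next', 'done' (assumed layout).
    """
    q_sa = q_net(batch["s"])[np.arange(len(batch["a"])), batch["a"]]
    q_next = np.max(target_net(batch["s_next"]), axis=1)
    target = batch["r"] + gamma * (1.0 - batch["done"]) * q_next
    td_error = target - q_sa
    return np.mean(td_error ** 2)
```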
Policy Gradient methods

Policy Gradient: directly learn the policy π.

Policy gradient vs Q-Learning:


Often π simpler than Q.
Q: need to be able to effectively solve the argmax, tricky in high dimensions.
Policy gradient: more versatile (continuous learning).
Q: better exploration (can be taken care of).
Use a stochastic policy: π is a distribution for each state input:
Smooths out actions.
Train by slightly shifting distribution.
Characteristics parametrized by an ANN (weights Θi ).
Objective: find Θ such that max_Θ E[ Σ_{t=0}^{H} R(s_t) | π_Θ ], with R(s_t) = γ^t r_t.



Computing the policy gradient
τ: state-action sequence (s_0, a_0), (s_1, a_1), ..., (s_H, a_H), with R(τ) = Σ_{t=0}^{H} R(s_t, a_t).

Value: V(Θ) = E[ Σ_{t=0}^{H} R(s_t, a_t) | π_Θ ] = Σ_τ P(τ, Θ) R(τ)

∇_Θ V(Θ) = Σ_τ ∇_Θ P(τ, Θ) R(τ)
         = Σ_τ P(τ, Θ) [∇_Θ P(τ, Θ) / P(τ, Θ)] R(τ)
         = Σ_τ P(τ, Θ) ∇_Θ log(P(τ, Θ)) R(τ)

Expected value: approximate it by empirical sampling under the policy π_Θ:

∇_Θ V(Θ) ≈ ĝ = (1/M) Σ_{i=1}^{M} ∇_Θ log(P(τ^(i), Θ)) R(τ^(i))
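A sketch of this Monte Carlo estimator ĝ for a simple softmax policy that is linear in the state; the toy environment interface is assumed, and this is illustrative rather than the exact method used later in the talk.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    p = np.exp(z)
    return p / p.sum()

def policy_gradient_estimate(env, theta, n_traj=20, horizon=50, gamma=0.99):
    """g_hat = (1/M) * sum_i grad_theta log P(tau_i) * R(tau_i),
    where grad log P(tau) reduces to sum_t grad log pi(a_t | s_t)."""
    g_hat = np.zeros_like(theta)
    for _ in range(n_traj):
        s = env.reset()                        # assumed gym-like API
        grad_log_p, R = np.zeros_like(theta), 0.0
        for t in range(horizon):
            probs = softmax(theta @ s)         # theta: (n_actions, state_dim)
            a = np.random.choice(len(probs), p=probs)
            # grad_theta log pi(a|s) for a linear softmax policy:
            grad_log_p += (np.eye(len(probs))[a] - probs)[:, None] * s[None, :]
            s, r, done = env.step(a)
            R += gamma ** t * r
            if done:
                break
        g_hat += grad_log_p * R
    return g_hat / n_traj
```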
Understanding the policy gradient

∇_Θ V(Θ) ≈ ĝ = (1/M) Σ_{i=1}^{M} ∇_Θ log(P(τ^(i), Θ)) R(τ^(i))

Valid for discontinuous / unknown R.


Works with discrete or continuous state / action.
Works by improving the probability of paths with good R (and reducing
the probability of paths with bad R).



Computing the gradient of the log probability
Express the log-probability as a dynamics-model part and a policy part:

∇_Θ log P(τ^(i), Θ) = ∇_Θ log [ Π_{t=0}^{H} P(s_{t+1}^(i) | s_t^(i), a_t^(i)) π_Θ(a_t^(i) | s_t^(i)) ]
                    = ∇_Θ [ Σ_{t=0}^{H} log P(s_{t+1}^(i) | s_t^(i), a_t^(i)) + Σ_{t=0}^{H} log π_Θ(a_t^(i) | s_t^(i)) ]
                    = ∇_Θ Σ_{t=0}^{H} log π_Θ(a_t^(i) | s_t^(i))

The log-probability gradient depends only on the policy! No dynamics model is needed.
This estimator is unbiased: ∇_Θ V(Θ) = E[ĝ], so use

∇_Θ V(Θ) ≈ ĝ = (1/M) Σ_{i=1}^{M} ∇_Θ log(P(τ^(i), Θ)) R(τ^(i)),

with ∇_Θ log(P(τ^(i), Θ)) = Σ_{t=0}^{H} ∇_Θ log π_Θ(a_t^(i) | s_t^(i)).
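To connect this to continuous control (as used for flow control later), a sketch of ∇_Θ log π_Θ(a|s) for a Gaussian policy whose mean is linear in the state; the fixed σ and all names are illustrative assumptions.

```python
import numpy as np

def gaussian_logpi_grad(theta, s, a, sigma=0.1):
    """Policy: a ~ N(mean = theta @ s, sigma^2 I).

    log pi = -0.5 * ((a - mean) / sigma)^2 + const, so
    d log pi / d theta = ((a - mean) / sigma^2) outer s.
    """
    mean = theta @ s                        # theta: (action_dim, state_dim)
    return np.outer((a - mean) / sigma**2, s)

# Accumulate over a trajectory: grad log P(tau) = sum_t grad log pi(a_t | s_t)
theta = np.zeros((1, 4))
trajectory = [(np.random.randn(4), np.random.randn(1)) for _ in range(10)]
grad_log_p = sum(gaussian_logpi_grad(theta, s, a) for s, a in trajectory)
print(grad_log_p.shape)                     # (1, 4), same shape as theta
```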
Proximal Policy Optimization (PPO)

State-of-the-art gradient-based method:

’Proximal Policy Optimization Algorithms’, Schulman et al., arXiv (2017).

Implements a few tricks:

Clip the objective: avoid too-large policy updates (see the sketch below).
Stay within a trust region around the current policy.
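A minimal sketch of the PPO clipped surrogate objective from Schulman et al. (2017), written with NumPy; the variable names and the ε = 0.2 default are the usual choices, not code from Tensorforce.

```python
import numpy as np

def ppo_clipped_objective(log_pi_new, log_pi_old, advantages, eps=0.2):
    """L_clip = E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)],
    with r_t = pi_new(a|s) / pi_old(a|s)."""
    ratio = np.exp(log_pi_new - log_pi_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))  # surrogate to maximize
```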
Used in the following; Open Source implementation:

’Tensorforce: a TensorFlow library for applied reinforcement learning’, Kuhnle et al., GitHub (2017)
https://github.com/tensorforce/tensorforce

Built as a TensorFlow computational graph for efficiency.


Real-world applications
Games; Go; some recent industry deployments:

”Google just gave control over data center cooling to an AI”,


MIT Technology Review, Aug. 2018.

Energy savings in server cooling: −40% electricity used for cooling.



Application to a simple case of Active Flow Control



Kármán vortex street and the Turek benchmark

The Kármán vortex street is emblematic of Fluid Mechanics.

Thereafter: 2D FEniCS simulations at Re = ŪL/ν = 100 (laminar, unsteady), as in:


’Benchmark Computations of Laminar Flow Around a Cylinder’,
Schäfer and Turek, Flow Sims. with High-Perf. Computers (1996).

Add velocity probes (s_t), normal jets (a_t), reduce C_D = D / (½ ρ Ū² L) (r_t).
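In practice the CFD solver is wrapped as a reinforcement-learning environment. A schematic, gym-style sketch of such a wrapper is given below; the solver calls, probe layout, and reward weighting are placeholders, not the exact implementation of the paper.

```python
class CylinderFlowEnv:
    """Wraps the 2D CFD simulation: probes -> state, jets -> action, drag/lift -> reward."""

    def __init__(self, solver, dt_action):
        self.solver = solver            # hypothetical CFD solver object (e.g. built on FEniCS)
        self.dt_action = dt_action      # duration over which each jet setting is held

    def reset(self):
        self.solver.reset_to_baseline_flow()
        return self.solver.probe_velocities()        # s_t: velocities at the probes

    def step(self, jet_flow_rates):
        self.solver.set_jets(jet_flow_rates)         # a_t: flow rates of the normal jets
        self.solver.advance(self.dt_action)
        state = self.solver.probe_velocities()
        cd, cl = self.solver.time_averaged_coefficients()
        reward = -cd - 0.2 * abs(cl)                 # r_t = -<C_D>_T - 0.2 |<C_L>_T|
        return state, reward, False                  # episode termination handled outside
```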
Learning an active flow control strategy
Simple network: fully connected [512, 512].
Reward: r_t = −⟨C_D⟩_T − 0.2 |⟨C_L⟩_T|.
Learning for 1300 vortex sheddings.
ANN plays with the flow and observes the effect on the reward. Do more
of the successful actuation, less of the unsuccessful. Weights of the ANN
encode its ’understanding’ of the behavior of the system.

Need some tricks:


Control frequency: the need to learn time-correlated actuation signals.
Do not let the network cheat...
Results, methodology, implementation:

”Artificial Neural Networks trained through Deep Reinforcement Learning
discover control strategies for active flow control”,
Rabault et al., arXiv (2018), accepted in JFM (2019).
Results (I)

Effective control of the flow configuration and wake (video:


https://www.youtube.com/watch?v=0QbKk5SzMws).

8% drag reduction, 130% increase in wake area, altered shedding frequency.



Results (II)



Results (III)
Very small jets allow for ’boat tailing’.



Results (IV)
Recover 93% of the drag reduction relative to the case without vortex shedding; very similar recirculation area.

Many convenient points with Deep Reinforcement Learning:


Learn once, control as many times and for as long as wanted.
Cheap evaluation.
Trainable offline.
Scalable training.
Easy to use, black box ’solve it all’.
Learn the transfer functions of a real-world system (?).
Transfer Learning (?).
Taking it further: taming non-linearity



Simple PDE models

”Machine Learning Control - Taming Nonlinear Dynamics and Turbulence”, Duriez et al., Springer (2016).

Dynamical system:

da_1/dt = σ a_1 − a_2
da_2/dt = σ a_2 + a_1
da_3/dt = −0.1 a_3 − 10 a_4
da_4/dt = −0.1 a_4 + 10 a_3 + b
σ = 0.1 − a_1² − a_2² − a_3² − a_4²

Cost function:

J(a_1(t), a_2(t), b(t)) = a_1² + a_2² + γ b²

Weak (non-linear) coupling between the modes; it disappears when linearized.

Pseudo-harmonic, amplified oscillations.
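A sketch integrating this system with SciPy for the uncontrolled case (b = 0), to reproduce the amplified, pseudo-harmonic oscillations; a controller would simply supply b at each step.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, a, b=0.0):
    a1, a2, a3, a4 = a
    sigma = 0.1 - a1**2 - a2**2 - a3**2 - a4**2
    return [sigma * a1 - a2,
            sigma * a2 + a1,
            -0.1 * a3 - 10.0 * a4,
            -0.1 * a4 + 10.0 * a3 + b]

sol = solve_ivp(rhs, (0.0, 100.0), [0.01, 0.0, 0.0, 0.01], max_step=0.01)
amplitude = np.sqrt(sol.y[0]**2 + sol.y[1]**2)
print(amplitude[0], amplitude[-1])   # grows towards the sigma = 0 limit cycle
# The a_1^2 + a_2^2 part of J, integrated over time (rectangle rule):
cost = float(np.sum((sol.y[0]**2 + sol.y[1]**2)[:-1] * np.diff(sol.t)))
print(cost)
```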
Machine learning control
Adjoint method fails (trapped in local minima, weak coupling).
Genetic Programming works (with the right base functions).
Deep Reinforcement Learning performs satisfactorily.

(Figures: a_1(t) and a_2(t); a_3(t) and a_4(t).)


Conclusion



Summary of the results

Artificial Neural Networks trained through Deep Reinforcement


Learning can be successful at performing active flow control.
Opens a new direction for active flow control.
More generally, a new tool to attack complexity?
ANNs/DRL is a promising non-linear optimization tool for complex
systems:
Robust algorithms.
Easy to use: plug-and-play, out-of-the-box.
Leverage ANNs for optimization problems.
Unsupervised.
Field of (very) active research: many optimizations possible, constant improvement of the algorithms, reducing the need for a ’smart’ user...



Dramatic developments in the field

Figures from Google Brain blog.


Bonus material



ANNs, computing power, CPUs, GPUs, Moore’s law
Increasingly attractive to use ANNs: Moore’s law is NOT dead yet.
But single-thread ’free’ performance gains are dead, and many algorithms
are difficult to parallelize.

Power limitation is the new wall that computing power is hitting.


Dennard scaling, energy wall
Moore’s law holds, Dennard scaling is dead.
”Moore’s law gives us more transistors, Dennard scaling makes them useful”,
Bob Colwell, chief architect of the Pentium II, Pentium III and Pentium 4 at Intel;
Director of the Microsystems Technology Office, DARPA.



The way out: new architectures
Many problems: power dissipation wall, dark silicon, etc.
CPU: 1700 pJ/Op, GPU: 160 pJ/Op, ASIC: < 10 pJ/Op.

Do as the brain does: low frequency, high latency, highly parallel, very large
computation power, very energy efficient.
Adapted algorithms are needed to take advantage of computing-power increases!
Collaboration with MINDS



Collaboration with MINDS

Many thanks to Elie Hachem and Thierry Coupez!


Identify simple, proof-of-concept problems.
Refine to more realistic configurations.

Flow control: take it further (3D, higher Re, different shapes).


New problems: optimal cooling, optimal design.
Solver optimization / Physics-informed networks.

Many ideas to try. Powerful nonlinear optimization tool.

High risk, high gain.

