Deep Neural Networks and Partial Differential Equations:
Approximation Theory and Structural Properties
Philipp Petersen, University of Oxford
1 / 36
Today’s Goal
Goal of this talk: Discuss the suitability of neural networks as an
ansatz system for the solution of PDEs.
Two threads: approximation theory and structural properties of neural networks.
2 / 36
Outline
Neural networks
  - Introduction to neural networks
  - Approaches to solve PDEs
Structural results
  - Convexity
  - Closedness
  - Stable parametrization
3 / 36
Neural networks
Overwhelming success:
- Image classification
- Text understanding
5 / 36
Why are neural networks interesting? - II
6 / 36
How can we apply NNs to solve PDEs?
7 / 36
Approaches to solve PDEs - Examples
8 / 36
How can we apply NNs to solve PDEs?
9 / 36
Approximation theory of neural networks
10 / 36
Complexity of neural networks
Recall:
$$\Phi^\varrho(x) = T_L\bigl(\varrho(T_{L-1}(\varrho(\cdots\varrho(T_1(x)))))\bigr), \qquad x \in \mathbb{R}^d.$$
Each affine linear mapping $T_\ell$ is defined by a matrix $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and a translation $b_\ell \in \mathbb{R}^{N_\ell}$ via $T_\ell(x) = A_\ell x + b_\ell$.
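As a concrete illustration of this definition, here is a minimal NumPy sketch (not part of the talk; the names, the random weights, and the choice of ReLU are purely illustrative) that evaluates such a network $\Phi^\varrho$ from its matrices $A_\ell$ and translations $b_\ell$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)          # rho(x) = x_+, the ReLU

def network(x, weights, biases, rho=relu):
    """Evaluate Phi^rho(x) = T_L(rho(T_{L-1}(... rho(T_1(x))))) with T_l(x) = A_l x + b_l."""
    for A, b in zip(weights[:-1], biases[:-1]):
        x = rho(A @ x + b)             # hidden layers: affine map followed by the activation
    A_L, b_L = weights[-1], biases[-1]
    return A_L @ x + b_L               # output layer T_L is applied without activation

# Example: d = 2 inputs, hidden width N_1 = 5, output width N_2 = 1, random weights.
rng = np.random.default_rng(0)
widths = [2, 5, 1]
weights = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
biases = [rng.standard_normal(widths[i + 1]) for i in range(len(widths) - 1)]
print(network(np.array([0.3, -0.7]), weights, biases))
```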
12 / 36
Power of the architecture — Exemplary results
Not so many...
Theorem (Maiorov, Pinkus; 1999)
There exists an activation function $\varrho_{\mathrm{weird}} : \mathbb{R} \to \mathbb{R}$ that
- is analytic and strictly increasing,
- satisfies $\lim_{x\to-\infty} \varrho_{\mathrm{weird}}(x) = 0$ and $\lim_{x\to\infty} \varrho_{\mathrm{weird}}(x) = 1$,
such that for any $d \in \mathbb{N}$, any $f \in C([0,1]^d)$, and any $\varepsilon > 0$, there is a 3-layer $\varrho_{\mathrm{weird}}$-network $\Phi^{\varrho_{\mathrm{weird}}}_\varepsilon$ with $\|f - \Phi^{\varrho_{\mathrm{weird}}}_\varepsilon\|_{L^\infty} \leq \varepsilon$ and $N(\Phi^{\varrho_{\mathrm{weird}}}_\varepsilon) = 9d + 3$.
12 / 36
Power of the architecture — Exemplary results
- Barron; 1993: Approximation rate for functions with one finite Fourier moment using shallow networks with an activation function $\varrho$ that is sigmoidal of order zero.
- Mhaskar; 1993: Let $\varrho$ be a sigmoidal function of order $k \geq 2$. For $f \in C^s([0,1]^d)$, we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim N(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) = L(d, s, k)$.
- Yarotsky; 2017: For $f \in C^s([0,1]^d)$ and $\varrho(x) = x_+$ (the ReLU), we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim W(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) \sim \log(n)$ (see the sketch after this list).
- Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer NNs.
- He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs reproduce the approximation rates of h-, p-, and hp-FEM.
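As a complement to the Yarotsky entry above, here is a small NumPy sketch (not taken from the talk; one standard way of presenting the idea) of the hat-function construction behind such ReLU rates: the hat function $g$ is a three-neuron ReLU network, and $x \mapsto x^2$ is approximated on $[0,1]$ by $x - \sum_{s=1}^{m} g^{\circ s}(x)/4^{s}$ with uniform error $2^{-2m-2}$.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # g(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1]; realizable with three ReLU neurons.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def approx_square(x, m):
    """Piecewise linear interpolation of x**2 on the dyadic grid of level m."""
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = hat(g)                     # s-fold composition g o ... o g
        out -= g / 4.0**s
    return out

x = np.linspace(0.0, 1.0, 2001)
for m in [1, 2, 4, 8]:
    err = np.max(np.abs(approx_square(x, m) - x**2))
    print(f"m = {m}: sup error = {err:.2e}  (bound 2^(-2m-2) = {2.0**(-2*m - 2):.2e})")
```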
13 / 36
Lower bounds
Options:
(A) Place restrictions on the activation function (e.g. only consider the ReLU), thereby excluding pathological examples like $\varrho_{\mathrm{weird}}$. (→ VC-dimension bounds)
14 / 36
Asymptotic min-max rate distortion
Encoders: Let $\mathcal{C} \subset L^2(\mathbb{R}^d)$ and $\ell \in \mathbb{N}$:
$$\mathcal{E}^\ell := \bigl\{ E : \mathcal{C} \to \{0,1\}^\ell \bigr\}, \qquad \mathcal{D}^\ell := \bigl\{ D : \{0,1\}^\ell \to L^2(\mathbb{R}^d) \bigr\}.$$
(A function is encoded as a bit string such as $\{0, 1, 0, 0, 1, 1, 1\}$ and then decoded again.)
Optimal exponent:
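One standard formalization of this quantity, following Bölcskei, Grohs, Kutyniok, Petersen (arXiv:1705.01714); the precise normalization below is one common convention and may differ slightly from the slide:
$$L(\varepsilon, \mathcal{C}) := \min\Bigl\{\ell \in \mathbb{N} : \exists\,(E, D) \in \mathcal{E}^\ell \times \mathcal{D}^\ell \text{ with } \sup_{f \in \mathcal{C}} \|D(E(f)) - f\|_{L^2} \le \varepsilon\Bigr\},$$
$$\gamma^*(\mathcal{C}) := \sup\Bigl\{\gamma > 0 : L(\varepsilon, \mathcal{C}) = \mathcal{O}\bigl(\varepsilon^{-1/\gamma}\bigr) \text{ as } \varepsilon \to 0\Bigr\}.$$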
15 / 36
Asymptotic min-max rate distortion
16 / 36
Some instances of optimal approximation
17 / 36
ReLU Approximation
Piecewise smooth functions: $E^{\beta,d}$ denotes the class of $d$-dimensional, $C^\beta$-piecewise smooth functions.
18 / 36
High-dimensional approximation
Curse of dimension: To guarantee approximation with error $\leq \varepsilon$ of functions in $E^{\beta,d}$, one requires networks with $\mathcal{O}(\varepsilon^{-2(d-1)/\beta})$ weights.
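To see how quickly this blows up, a worked instance (the specific numbers are chosen here only for illustration):
$$\beta = 2,\quad d = 11,\quad \varepsilon = 10^{-1} \;\Longrightarrow\; \varepsilon^{-2(d-1)/\beta} = \bigl(10^{-1}\bigr)^{-10} = 10^{10}\ \text{weights}.$$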
19 / 36
Curse of dimension
Two-step setup: f = χ ◦ τ
- $\tau : \mathbb{R}^D \to \mathbb{R}^d$ is a smooth, dimension-reducing feature map.
- $\chi \in E^{\beta,d}$ performs the classification on the low-dimensional space.
20 / 36
Compositional functions
$$\mathbb{R}^8 \ni x \mapsto h_{13}\bigl(h_{12}(h_{11}(x_1, x_2),\, h_{21}(x_3, x_4)),\; h_{22}(h_{31}(x_5, x_6),\, h_{41}(x_7, x_8))\bigr)$$
(a binary tree of bivariate functions over the inputs $x_1, \ldots, x_8$)
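A minimal NumPy sketch (the node function $h$ below is a hypothetical placeholder, not the talk's) of how such a compositional function is evaluated along the binary tree:

```python
import numpy as np

def h(a, b):
    # Placeholder for a generic smooth bivariate node function h_ij(a, b).
    return np.tanh(a + b)

def compositional_f(x):
    """Evaluate h_13(h_12(h_11(x1,x2), h_21(x3,x4)), h_22(h_31(x5,x6), h_41(x7,x8)))."""
    level1 = [h(x[0], x[1]), h(x[2], x[3]), h(x[4], x[5]), h(x[6], x[7])]  # h_11, ..., h_41
    level2 = [h(level1[0], level1[1]), h(level1[2], level1[3])]            # h_12, h_22
    return h(level2[0], level2[1])                                          # h_13

print(compositional_f(np.arange(8) / 8.0))
```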
21 / 36
Extensions
22 / 36
Optimal parametrization
Optimal parametrization:
- Neural networks yield optimal representations of many function classes relevant in PDE applications.
- Approximation is flexible, and its quality improves if low-dimensional structure is present.
PDE discretization:
- Problem complexity is drastically reduced.
- No design of an ansatz system is necessary, since NNs approximate almost every function class well.
23 / 36
The inconvenient structure of neural networks
24 / 36
Fixed architecture networks
Goal: Fix a space of networks with prescribed shape and
understand the associated set of functions.
(Illustrated architecture: $d = 8$, $N_1 = 12$, $N_2 = 12$, $N_3 = 12$, $N_4 = 8$.)
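For the slides that follow, the set of functions in question can be written as (notation in the spirit of P., Raslan, Voigtlaender, arXiv:1806.08459; the exact wording of the definition on this slide is assumed):
$$\mathcal{NN}^\varrho(d, N_1, \ldots, N_{L-1}, 1) := \bigl\{\Phi^\varrho : \mathbb{R}^d \to \mathbb{R} \;\bigm|\; \Phi \text{ has architecture } (d, N_1, \ldots, N_{L-1}, 1)\bigr\},$$
that is, all functions realized by $\varrho$-networks with input dimension $d$, hidden-layer widths $N_1, \ldots, N_{L-1}$, and a single output neuron.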
25 / 36
Back to the basics
26 / 36
Star-shapedness
...but...
27 / 36
Convexity?
28 / 36
Weak Convexity?
29 / 36
Illustration
Illustration: The set $\mathcal{NN}^\varrho(d, N_1, \ldots, N_{L-1}, 1)$ has very few centers; it is scaling-invariant, not approximately convex, and nowhere dense.
30 / 36
Closedness in Lp
Compact weights: If the activation function $\varrho$ is continuous, then a compactness argument shows that the set of networks with parameters in a compact set is closed.
Theorem (P., Raslan, Voigtlaender; 2018)
Let $d, L, N_1, \ldots, N_{L-1} \in \mathbb{N}$, $N_0 = d$. If $\varrho$ has one of the properties below, then $\mathcal{NN}^\varrho(d, N_1, \ldots, N_{L-1}, 1)$ is not closed in $L^p$, $p \in (0, \infty)$:
- analytic, bounded, and not constant,
- $C^1$ but not $C^\infty$,
- continuous, monotone, bounded, and $\varrho'(x_0)$ exists and is non-zero for at least one point $x_0 \in \mathbb{R}$,
- continuous, monotone, continuously differentiable outside a compact set, and $\lim_{x\to\infty} \varrho'(x)$, $\lim_{x\to-\infty} \varrho'(x)$ exist and do not coincide.
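A minimal numerical sketch (not from the talk; the sigmoid activation and the interval $[0,1]$ are assumptions) of the mechanism behind such statements: one-neuron networks $\Phi_n$ converge uniformly to $f(x) = x$, a function that no single sigmoid neuron realizes exactly, while the outer weight of $\Phi_n$ blows up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(0.0, 1.0, 1001)
x0 = 0.0
for n in [1, 10, 100, 1000]:
    a = n / sigmoid_prime(x0)                           # outer weight, grows like n
    phi_n = a * (sigmoid(x0 + x / n) - sigmoid(x0))     # one-neuron network Phi_n
    err = np.max(np.abs(phi_n - x))                     # sup-error against the limit f(x) = x
    print(f"n = {n:5d}   outer weight = {a:9.1f}   sup error = {err:.2e}")
```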
31 / 36
Closedness in L∞
32 / 36
Illustration
Illustration: For most activation functions $\varrho$ (a notable exception being the ReLU), the set $\mathcal{NN}^\varrho(d, N_1, \ldots, N_{L-1}, 1)$ is star-shaped with center 0, not approximately convex, and not closed.
33 / 36
Stable parametrization
34 / 36
Consequences
No convexity:
- We want to solve $\nabla J(\Phi) = 0$ for an energy $J$ and a NN $\Phi$.
- Not only may $J$ be non-convex, but so may the set over which we optimize.
- This is similar to $N$-term approximation by dictionaries.
No closedness:
- Exploding coefficients (if $P_{\mathcal{NN}}(f) \notin \mathcal{NN}$).
- No low-neuron approximation.
No inverse-stable parametrization:
- The error term can be very small while the parametrization is far from optimal.
- Potentially very slow convergence.
35 / 36
Where to go from here?
Different networks:
- Special types of networks could be more robust.
- Convolutional neural networks are probably still too large a class [P., Voigtlaender; 2018].
Stronger norms:
- Stronger norms naturally help with closedness and inverse stability.
- An example is Sobolev training [Czarnecki, Osindero, Jaderberg, Swirszcz, Pascanu; 2017].
- Many of the arguments in our results break down if the $W^{1,\infty}$ norm is used.
36 / 36
Conclusion
Approximation: NNs are a very powerful approximation tool:
- often optimally efficient parametrization,
- overcome the curse of dimension,
- surprisingly efficient black-box optimization.
37 / 36
References:
H. Andrade-Loarca, G. Kutyniok, O. Öktem, P. Petersen, Extraction of digital wavefront sets using applied harmonic analysis and deep neural networks, arXiv:1901.01388.
H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen, Optimal Approximation with Sparsely Connected Deep Neural Networks, arXiv:1705.01714.
J. Opschoor, P. Petersen, Ch. Schwab, Deep ReLU Networks and High-Order Finite Element Methods, SAM, ETH Zürich, 2019.
P. Petersen, F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Networks, 2018.
P. Petersen, M. Raslan, F. Voigtlaender, Topological properties of the set of functions generated by neural networks of fixed size, arXiv:1806.08459.
37 / 36
Thank you for your attention!
36 / 36