
Deep Neural Networks and Partial Differential

Equations:
Approximation Theory and Structural Properties

Philipp Christian Petersen


Joint work with:


- Helmut Bölcskei (ETH Zürich)
- Philipp Grohs (University of Vienna)
- Joost Opschoor (ETH Zürich)
- Gitta Kutyniok (TU Berlin)
- Mones Raslan (TU Berlin)
- Christoph Schwab (ETH Zürich)
- Felix Voigtlaender (KU Eichstätt-Ingolstadt)

1 / 36
Today’s Goal
Goal of this talk: Discuss the suitability of neural networks as an
ansatz system for the solution of PDEs.

Two threads:

Approximation theory:
- universal approximation
- optimal approximation rates for all classical function spaces
- reduced curse of dimension

Structural properties:
- non-convex, non-closed ansatz spaces
- parametrization not stable
- very hard to optimize over
2 / 36
Outline

Neural networks
  Introduction to neural networks
  Approaches to solve PDEs

Approximation theory of neural networks
  Classical results
  Optimality
  High-dimensional approximation

Structural results
  Convexity
  Closedness
  Stable parametrization

3 / 36
Neural networks

We consider neural networks as a special kind of function:

- d = N_0 ∈ N: input dimension,
- L: number of layers,
- ϱ : R → R: activation function,
- T_ℓ : R^{N_{ℓ-1}} → R^{N_ℓ}, ℓ = 1, ..., L: affine-linear maps.

Then Φ_ϱ : R^d → R^{N_L} given by

Φ_ϱ(x) = T_L(ϱ(T_{L-1}(ϱ(... ϱ(T_1(x)))))),  x ∈ R^d,

is called a neural network (NN). The sequence (d, N_1, ..., N_L) is
called the architecture of Φ_ϱ.
4 / 36
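To make the definition concrete, here is a minimal sketch of the realization map in Python/NumPy (an editorial illustration, not part of the original slides): given the affine maps T_ℓ(z) = A_ℓ z + b_ℓ and an activation ϱ, it evaluates Φ_ϱ(x). All variable names are my own.

```
import numpy as np

def relu(x):
    # rho(x) = max(x, 0), applied componentwise
    return np.maximum(x, 0.0)

def realize(As, bs, rho, x):
    """Evaluate Phi_rho(x) = T_L(rho(T_{L-1}(... rho(T_1(x)) ...))),
    where T_l(z) = As[l-1] @ z + bs[l-1]."""
    z = x
    for A, b in zip(As[:-1], bs[:-1]):   # layers 1, ..., L-1 with activation
        z = rho(A @ z + b)
    return As[-1] @ z + bs[-1]           # last affine map T_L, no activation

# Example: architecture (d, N1, N2, N3) = (2, 3, 3, 1) with random weights.
rng = np.random.default_rng(0)
dims = [2, 3, 3, 1]
As = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
bs = [rng.standard_normal(m) for m in dims[1:]]
print(realize(As, bs, relu, np.array([0.5, -1.0])))
```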
Why are neural networks interesting? - I
Deep Learning: Deep learning describes a variety of techniques
based on data-driven adaptation of the affine linear maps in a neural
network.

Overwhelming success:

- Image classification
- Text understanding
- Game intelligence

(Figure: Ren, He, Girshick, Sun; 2015)

Hardware design of the future!
5 / 36
Why are neural networks interesting? - II

Expressibility: Neural networks constitute a very powerful
architecture.

Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999)
Let d ∈ N, K ⊂ R^d be compact, f : K → R continuous, and
ϱ : R → R continuous and not a polynomial. Then for every ε > 0
there exists a two-layer NN Φ_ϱ with ‖f − Φ_ϱ‖_∞ ≤ ε.

Efficient expressibility: R^M ∋ θ ↦ (T_1, ..., T_L) ↦ Φ_ϱ^θ yields a
parametrized system of functions. In a sense, this parametrization is
optimally efficient (more on this below).

6 / 36
How can we apply NNs to solve PDEs?

PDE problem: For D ⊂ R^d, d ∈ N, find u such that

G(x, u(x), ∇u(x), ∇²u(x)) = 0 for all x ∈ D.

Approach of [Lagaris, Likas, Fotiadis; 1998]: Let (x_i)_{i∈I} ⊂ D be
collocation points and find a NN Φ_ϱ^θ such that

G(x_i, Φ_ϱ^θ(x_i), ∇Φ_ϱ^θ(x_i), ∇²Φ_ϱ^θ(x_i)) = 0 for all i ∈ I.

Standard methods can be used to find parameters θ.
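A minimal sketch of this collocation idea in PyTorch (my own illustration, not the original Lagaris et al. code), for the toy problem u''(x) = −π² sin(πx) on (0, 1) with u(0) = u(1) = 0: the residual G is squared over the collocation points and minimized over θ with a standard optimizer. Network size, learning rate and iteration count are arbitrary choices.

```
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                      nn.Linear(32, 32), nn.Tanh(),
                      nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.linspace(0.0, 1.0, 101).reshape(-1, 1)      # collocation points x_i
xb = torch.tensor([[0.0], [1.0]])                      # boundary points
f = lambda t: -math.pi ** 2 * torch.sin(math.pi * t)   # right-hand side of u'' = f

for step in range(5000):
    opt.zero_grad()
    xi = x.clone().requires_grad_(True)
    u = model(xi)
    du = torch.autograd.grad(u, xi, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, xi, torch.ones_like(du), create_graph=True)[0]
    residual = ((d2u - f(xi)) ** 2).mean()     # squared residual of G at the x_i
    boundary = (model(xb) ** 2).mean()         # penalty enforcing u(0) = u(1) = 0
    (residual + boundary).backward()
    opt.step()

# the exact solution of this toy problem is u(x) = sin(pi x)
print(torch.max(torch.abs(model(x) - torch.sin(math.pi * x))).item())
```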

7 / 36
Approaches to solve PDEs - Examples

General framework: Deep Ritz method [E, Yu; 2017]: NNs serve
as trial functions; SGD naturally replaces quadrature.

High-dimensional PDEs: [Sirignano, Spiliopoulos; 2017]: Let
D ⊂ R^d, d ≥ 100, and find u such that

∂u/∂t (t, x) + H(u)(t, x) = 0,  (t, x) ∈ [0, T] × Ω,  + BC + IC.

As the number of parameters of the NN increases, the minimizer of the
associated energy approaches the true solution. No mesh generation
required!

[Berner, Grohs, Hornung, Jentzen, von Wurstemberger; 2017]:
Phrasing the problem as empirical risk minimization yields provably no
curse of dimension, neither in the approximation problem nor in the
number of samples.

8 / 36
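For contrast with the collocation approach above, here is a rough sketch of the Deep Ritz idea for the toy problem −u'' = f on (0, 1) with u(0) = u(1) = 0: instead of the strong-form residual, one minimizes a Monte-Carlo estimate of the energy ∫ (½|u'|² − f u) dx plus a boundary penalty, so SGD over freshly sampled points replaces quadrature. Again illustrative code with arbitrary hyperparameters, not the authors' implementation.

```
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
u = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                  nn.Linear(32, 32), nn.Tanh(),
                  nn.Linear(32, 1))
opt = torch.optim.Adam(u.parameters(), lr=1e-3)
f = lambda t: math.pi ** 2 * torch.sin(math.pi * t)   # -u'' = f has solution sin(pi x)
xb = torch.tensor([[0.0], [1.0]])

for step in range(5000):
    opt.zero_grad()
    x = torch.rand(256, 1, requires_grad=True)          # fresh Monte-Carlo sample per step
    ux = u(x)
    dux = torch.autograd.grad(ux, x, torch.ones_like(ux), create_graph=True)[0]
    energy = (0.5 * dux ** 2 - f(x) * ux).mean()         # MC estimate of the Ritz energy
    penalty = 500.0 * (u(xb) ** 2).mean()                # soft boundary conditions
    (energy + penalty).backward()
    opt.step()

xt = torch.linspace(0.0, 1.0, 101).reshape(-1, 1)
print(torch.max(torch.abs(u(xt) - torch.sin(math.pi * xt))).item())
```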
How can we apply NNs to solve PDEs?

Deep learning and PDEs: Both approaches above are based on
two ideas.
- Neural networks are highly efficient in representing solutions of
  PDEs, hence the complexity of the problem can be greatly reduced.
- There exist black box methods from machine learning that solve
  the optimization problem.

This talk:
- We will show exactly how efficient the representations are.
- Raise doubt that the black box can produce reliable results in general.

9 / 36
Approximation theory of neural
networks

10 / 36
Complexity of neural networks
Recall:

Φ_ϱ(x) = T_L(ϱ(T_{L-1}(ϱ(... ϱ(T_1(x)))))),  x ∈ R^d.

Each affine-linear mapping T_ℓ is defined by a matrix A_ℓ ∈ R^{N_ℓ × N_{ℓ-1}}
and a translation b_ℓ ∈ R^{N_ℓ} via T_ℓ(x) = A_ℓ x + b_ℓ.

The number of weights W(Φ_ϱ) and the number of neurons N(Φ_ϱ) are

W(Φ_ϱ) = Σ_{j≤L} (‖A_j‖_{ℓ^0} + ‖b_j‖_{ℓ^0})   and   N(Φ_ϱ) = Σ_{j=0}^{L} N_j.
11 / 36
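As a small sanity check of these definitions, the following snippet (an illustration only, names of my own choosing) counts W(Φ_ϱ) as the number of non-zero entries of all A_j and b_j, and N(Φ_ϱ) as the sum of all layer widths including input and output.

```
import numpy as np

def num_weights(As, bs):
    # W(Phi) = sum_j ( ||A_j||_{l^0} + ||b_j||_{l^0} ), i.e. non-zero entries only
    return sum(int(np.count_nonzero(A)) + int(np.count_nonzero(b))
               for A, b in zip(As, bs))

def num_neurons(dims):
    # N(Phi) = sum_{j=0}^{L} N_j, with N_0 = d the input dimension
    return sum(dims)

dims = [2, 3, 3, 1]
rng = np.random.default_rng(0)
As = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
bs = [rng.standard_normal(m) for m in dims[1:]]
print(num_weights(As, bs), num_neurons(dims))   # here: 25 and 9 (all entries non-zero)
```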
Power of the architecture — Exemplary results

Given f from some class of functions, how many weights/neurons


does an ε-approximating NN need to have?

Not so many...
Theorem (Maiorov, Pinkus; 1999)
There exists an activation function ϱ_weird : R → R that
- is analytic and strictly increasing,
- satisfies lim_{x→−∞} ϱ_weird(x) = 0 and lim_{x→∞} ϱ_weird(x) = 1,
such that for any d ∈ N, any f ∈ C([0, 1]^d), and any ε > 0, there is
a 3-layer ϱ_weird-network Φ_ε^{ϱ_weird} with ‖f − Φ_ε^{ϱ_weird}‖_{L∞} ≤ ε and
N(Φ_ε^{ϱ_weird}) = 9d + 3.

12 / 36
Power of the architecture — Exemplary results
- Barron; 1993: Approximation rate for functions with one finite
  Fourier moment using shallow networks with an activation function ϱ
  that is sigmoidal of order zero.
- Mhaskar; 1993: Let ϱ be a sigmoidal function of order k ≥ 2.
  For f ∈ C^s([0, 1]^d), we have ‖f − Φ_n^ϱ‖_{L∞} ≲ N(Φ_n^ϱ)^{−s/d} and
  L(Φ_n^ϱ) = L(d, s, k).
- Yarotsky; 2017: For f ∈ C^s([0, 1]^d) and ϱ(x) = x_+ (the ReLU),
  we have ‖f − Φ_n^ϱ‖_{L∞} ≲ W(Φ_n^ϱ)^{−s/d} and L(Φ_n^ϱ) ≍ log(n).
  (A numerical sketch of the construction behind this result follows below.)
- Shaham, Cloninger, Coifman; 2015: One can implement certain
  wavelets using 4-layer NNs.
- He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs
  reproduce the approximation rates of h-, p- and hp-FEM.
13 / 36
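The sketch promised above: Yarotsky's construction approximates x² on [0, 1] by x minus a telescoping sum of composed ReLU "hat" functions; each additional composition adds a fixed number of ReLU neurons and reduces the sup error by a factor of 4. This is a plain NumPy illustration of that construction, not code from the cited paper.

```
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # g(x) = 2 relu(x) - 4 relu(x - 1/2) + 2 relu(x - 1): a small ReLU sub-network;
    # on [0, 1] it is the tent function with peak g(1/2) = 1
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_square(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s with g_s the s-fold composition of g;
    # f_m is the piecewise linear interpolant of x^2 on a grid with 2^m intervals
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = hat(g)
        out = out - g / 4.0 ** s
    return out

x = np.linspace(0.0, 1.0, 10001)
for m in (1, 3, 5, 7):
    print(f"m = {m}: sup error = {np.max(np.abs(approx_square(x, m) - x ** 2)):.2e}")
```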
Lower bounds

Optimal approximation rates: Lower bounds on the required network
size only exist under additional assumptions (recall the networks based
on ϱ_weird).

Options:
(A) Place restrictions on the activation function (e.g. only consider
    the ReLU), thereby excluding pathological examples like ϱ_weird.
    (→ VC-dimension bounds)

(B) Place restrictions on the weights.
    (→ Information-theoretic bounds, entropy arguments)

(C) Use still other concepts, like continuous N-widths.

14 / 36
Asymptotic min-max rate distortion
Encoders: Let C ⊂ L²(R^d) and ℓ ∈ N:

E^ℓ := { E : C → {0, 1}^ℓ },    D^ℓ := { D : {0, 1}^ℓ → L²(R^d) }.

[Figure: a function f is encoded by E into a bit string such as {0, 1, 0, 0, 1, 1, 1} and decoded by D.]

Min-max code length:

L(ε, C) := min { ℓ ∈ N : ∃ D ∈ D^ℓ, E ∈ E^ℓ : sup_{f ∈ C} ‖D(E(f)) − f‖_2 < ε }.

Optimal exponent:

γ*(C) := inf { γ > 0 : L(ε, C) = O(ε^{−γ}) }.




15 / 36
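To make the encoder/decoder formalism concrete, here is a toy pair (E, D) for [0, 1]-valued Lipschitz functions on [0, 1] (my own illustration, unrelated to neural networks): E samples f at n grid points and quantises each value to k bits, so the code length is ℓ = n·k, and D returns the piecewise-constant reconstruction; the distortion then behaves like max(1/n, 2^{-k}).

```
import numpy as np

def encode(f, n, k):
    # E: sample f at n grid points and quantise each value in [0, 1] to k bits,
    # giving a code of length l = n * k
    grid = (np.arange(n) + 0.5) / n
    q = np.rint(f(grid) * (2 ** k - 1)).astype(int)
    return "".join(format(v, f"0{k}b") for v in q)

def decode(bits, n, k):
    # D: read the quantised samples back and return a piecewise-constant function
    vals = np.array([int(bits[i * k:(i + 1) * k], 2) for i in range(n)]) / (2 ** k - 1)
    return lambda x: vals[np.minimum((x * n).astype(int), n - 1)]

f = lambda x: 0.25 + 0.5 * x ** 2        # a 1-Lipschitz function with values in [0, 1]
n, k = 64, 8
bits = encode(f, n, k)
fh = decode(bits, n, k)
x = np.linspace(0.0, 1.0, 4096)
l2_err = np.sqrt(np.mean((f(x) - fh(x)) ** 2))
print(f"code length l = {len(bits)} bits, L2 error ~ {l2_err:.1e}")
```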
Asymptotic min-max rate distortion

Theorem (Bölcskei, Grohs, Kutyniok, P.; 2017)
Let C ⊂ L²(R^d) and ϱ : R → R. Then for all ε > 0:

sup_{f ∈ C}  inf { W(Φ_ϱ) : Φ_ϱ NN with quantised weights, ‖Φ_ϱ − f‖_2 ≤ ε }  ≳  ε^{−γ*(C)}.   (1)

Optimal approximation/parametrization: If for C ⊂ L²(R^d) one
also has ≲ in (1), then NNs approximate the function class optimally.

Versatility: It turns out that NNs achieve optimal approximation
rates for many practically used function classes.

16 / 36
Some instances of optimal approximation

- Mhaskar; 1993: Let ϱ be a sigmoidal function of order k ≥ 2.
  For f ∈ C^s([0, 1]^d), we have ‖f − Φ_n^ϱ‖_{L∞} ≲ N(Φ_n^ϱ)^{−s/d}.
  We have γ*({f ∈ C^s([0, 1]^d) : ‖f‖ ≤ 1}) = d/s.
- Shaham, Cloninger, Coifman; 2015: One can implement certain
  wavelets using 4-layer ReLU NNs. Optimal whenever wavelets are
  optimal.
- Bölcskei, Grohs, Kutyniok, P.; 2017: Networks yield optimal
  rates if any affine system does. Example: shearlets for
  cartoon-like functions.

17 / 36
ReLU Approximation
Piecewise smooth functions: E_{β,d} denotes the d-dimensional
C^β-piecewise smooth functions on [0, 1]^d with interfaces in C^β.

[Figure: example of a piecewise smooth function on [0, 1]².]

Theorem (P., Voigtlaender; 2018)
Let d ∈ N, β ≥ 0, and ϱ : R → R, ϱ(x) = x_+. Then

sup_{f ∈ E_{β,d}}  inf { W(Φ_ϱ) : Φ_ϱ NN with quantised weights, ‖Φ_ϱ − f‖_2 ≤ ε }  ∼  ε^{−γ*(E_{β,d})}  =  ε^{−2(d−1)/β}.

The optimal depth of the networks is ∼ β/d.

18 / 36
High-dimensional approximation
Curse of dimension: To guarantee approximation with error ≤ ε
of functions in E_{β,d}, one requires networks with O(ε^{−2(d−1)/β})
weights.

Symmetries and invariances: Image classifiers are often:
- translation, dilation, and rotation invariant,
- invariant to small deformations,
- invariant to small changes in brightness, contrast, color.

19 / 36
Curse of dimension
Two-step setup: f = χ ∘ τ
- τ : R^D → R^d is a smooth, dimension-reducing feature map.
- χ ∈ E_{β,d} performs classification on the low-dimensional space.

Theorem (P., Voigtlaender; 2017)
Let ϱ(x) = x_+. There are constants c > 0, L ∈ N such that for any
f = χ ∘ τ and any ε ∈ (0, 1/2), there is a NN Φ_ε^ϱ with at most L
layers and at most c · ε^{−2(d−1)/β} non-zero weights such that

‖Φ_ε^ϱ − f‖_{L²} < ε.

The asymptotic approximation rate depends only on d, not on D.

20 / 36
Compositional functions

Compositional functions: [Mhaskar, Poggio; 2016]
High-dimensional functions as a dyadic composition of 2-dimensional
functions:

R^8 ∋ x ↦ h_1^3(h_1^2(h_1^1(x_1, x_2), h_2^1(x_3, x_4)), h_2^2(h_3^1(x_5, x_6), h_4^1(x_7, x_8)))

[Figure: binary tree composing the inputs x_1, ..., x_8 pairwise.]

21 / 36
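A small sketch of such a compositional structure (the constituent function h below is an arbitrary stand-in of my choosing): the point is that a network mirroring the binary tree only ever has to approximate functions of two variables, so the approximation rate is governed by the constituent dimension 2 rather than the ambient dimension 8.

```
import numpy as np

# The constituent 2-variable function; an arbitrary smooth stand-in for the h_i's.
h = lambda a, b: np.tanh(a + 2.0 * b)

def compositional(x):
    # f(x) = h(h(h(x1,x2), h(x3,x4)), h(h(x5,x6), h(x7,x8))): a binary tree of depth 3
    y = [h(x[2 * i], x[2 * i + 1]) for i in range(4)]   # bottom level
    z = [h(y[0], y[1]), h(y[2], y[3])]                  # middle level
    return h(z[0], z[1])                                # root

x = np.random.default_rng(1).standard_normal(8)
print(compositional(x))
```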
Extensions

Approximation with respect to Sobolev norms: ReLU NNs Φ
are Lipschitz continuous. Hence, for s ∈ [0, 1], p ≥ 1 and
f ∈ W^{s,p}(Ω), we can measure

‖f − Φ‖_{W^{s,p}(Ω)}.

ReLU networks achieve the same approximation rates as h-, p-, and
hp-FEM [Opschoor, P., Schwab; 2019].

Convolutional neural networks: Direct correspondence between
approximation by CNNs (without pooling) and approximation by
fully-connected networks [P., Voigtlaender; 2018].

22 / 36
Optimal parametrization

Optimal parametrization:
- Neural networks yield optimal representations of many function
  classes relevant in PDE applications,
- Approximation is flexible and quality is improved if
  low-dimensional structure is present.

PDE discretization:
- Problem complexity drastically reduced,
- No design of ansatz system necessary, since NNs approximate
  almost every function class well.

Can neural networks really be this good?

23 / 36
The inconvenient structure of
neural networks

24 / 36
Fixed architecture networks
Goal: Fix a space of networks with prescribed shape and
understand the associated set of functions.

Fixed architecture networks: Let d, L ∈ N, N_1, ..., N_{L-1} ∈ N, and
ϱ : R → R. Then we denote by

NN_ϱ(d, N_1, ..., N_{L-1}, 1)

the set of NNs with architecture (d, N_1, ..., N_{L-1}, 1).

[Figure: a fully-connected network with d = 8, N_1 = 12, N_2 = 12, N_3 = 12, N_4 = 8.]

25 / 36
Back to the basics

Topological properties: Is NN_ϱ(d, N_1, ..., N_{L-1}, 1)
- star-shaped?
- convex? approximately convex?
- closed?
Is the map (T_1, ..., T_L) ↦ Φ open?

Implications for optimization:
If we do not have the properties above, then we can have
- terrible local minima,
- exploding weights,
- very slow convergence.

26 / 36
Star-shapedness

Star-shapedness: NN_ϱ(d, N_1, ..., N_{L-1}, 1) is trivially star-shaped
with center 0. (A small numerical sketch of this scaling argument
follows below.)

...but...

Proposition (P., Raslan, Voigtlaender; 2018)
Let d, L, N, N_1, ..., N_{L-1} ∈ N and let ϱ : R → R be locally
Lipschitz continuous. Then the number of linearly independent
centers of NN_ϱ(d, N_1, ..., N_L) is at most Σ_{ℓ=1}^{L} (N_{ℓ-1} + 1) N_ℓ,
where N_0 = d.

27 / 36
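The scaling argument behind center 0, as a quick numerical check (illustrative code with names of my own choosing): for any network Φ and any t ∈ [0, 1], the function t·Φ is realized with the same architecture by scaling only the last affine map, so the whole segment from 0 to Φ stays inside NN_ϱ(d, N_1, ..., N_{L-1}, 1).

```
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def realize(As, bs, x):
    z = x
    for A, b in zip(As[:-1], bs[:-1]):
        z = relu(A @ z + b)
    return As[-1] @ z + bs[-1]

rng = np.random.default_rng(0)
dims = [2, 5, 1]
As = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
bs = [rng.standard_normal(m) for m in dims[1:]]

# t * Phi is realized by the SAME architecture: just scale the last affine map.
# Hence the segment {t * Phi : t in [0, 1]} lies in NN_rho(d, N1, ..., 1),
# which is exactly the statement that 0 is a center.
t = 0.3
As_t = As[:-1] + [t * As[-1]]
bs_t = bs[:-1] + [t * bs[-1]]
x = rng.standard_normal(2)
print(np.allclose(t * realize(As, bs, x), realize(As_t, bs_t, x)))   # True
```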
Convexity?

Corollary (P., Raslan, Voigtlaender; 2018)
Let d, L, N, N_1, ..., N_{L-1} ∈ N, N_0 = d, and let ϱ : R → R be
locally Lipschitz continuous.
If NN_ϱ(d, N_1, ..., N_{L-1}, 1) contains more than Σ_{ℓ=1}^{L} (N_{ℓ-1} + 1) N_ℓ
linearly independent functions, then NN_ϱ(d, N_1, ..., N_{L-1}, 1) is
not convex.

From translation invariance: If NN_ϱ(d, N_1, ..., N_{L-1}, 1) contains only
finitely many linearly independent functions, then ϱ is a finite sum of
complex exponentials multiplied with polynomials.

28 / 36
Weak Convexity?

Weak convexity: NN_ϱ(d, N_1, ..., N_{L-1}, 1) is almost never
convex, but what about NN_ϱ(d, N_1, ..., N_{L-1}, 1) + B_ε^{‖·‖_∞}(0) for a
hopefully small ε > 0?

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L, N, N_1, ..., N_{L-1} ∈ N, N_0 = d. For all commonly used
activation functions there does not exist ε > 0 such that
NN_ϱ(d, N_1, ..., N_{L-1}, 1) + B_ε^{‖·‖_∞}(0) is convex.

As a corollary, we also get that NN_ϱ(d, N_1, ..., N_{L-1}, 1) is usually
nowhere dense.

29 / 36
Illustration
Illustration: The set NN_ϱ(d, N_1, ..., N_{L-1}, 1) has very few
centers; it is scaling-invariant, not approximately convex, and
nowhere dense.

30 / 36
Closedness in Lp
Compact weights: If the activation function ϱ is continuous, then
a compactness argument shows that the set of networks with a
compact parameter set is closed.

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L, N_1, ..., N_{L-1} ∈ N, N_0 = d. If ϱ has one of the properties
below, then NN_ϱ(d, N_1, ..., N_{L-1}, 1) is not closed in L^p,
p ∈ (0, ∞):
- analytic, bounded, not constant,
- C^1 but not C^∞,
- continuous, monotone, bounded, and ϱ'(x_0) exists and is non-zero
  in at least one point x_0 ∈ R,
- continuous, monotone, continuously differentiable outside a
  compact set, and lim_{x→∞} ϱ'(x), lim_{x→−∞} ϱ'(x) exist and do
  not coincide.

31 / 36
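A numerical illustration in the spirit of these non-closedness arguments (my own toy example, not taken from the paper): the difference quotients (tanh(x + h) − tanh(x))/h are each realized by a one-hidden-layer tanh network with two neurons, and they converge uniformly to tanh'(x) as h → 0, yet the output weights ±1/h blow up; this is exactly the "exploding coefficients" phenomenon mentioned in the consequences later on.

```
import numpy as np

sigma = np.tanh
x = np.linspace(-4.0, 4.0, 2001)
target = 1.0 - np.tanh(x) ** 2                 # sigma'(x), the uniform limit

for h in (1.0, 0.1, 0.01, 0.001):
    # (sigma(x + h) - sigma(x)) / h is realized by a one-hidden-layer tanh network
    # with two neurons whose output weights are +1/h and -1/h
    approx = (sigma(x + h) - sigma(x)) / h
    sup_err = np.max(np.abs(approx - target))
    print(f"h = {h:7.3f}: sup error = {sup_err:.2e}, output weight magnitude = {1.0 / h:.0e}")
```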
Closedness in L∞

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L, N, N_1, ..., N_{L-1} ∈ N, N_0 = d. If ϱ has one of the
properties below, then NN_ϱ(d, N_1, ..., N_{L-1}, 1) is not closed in
L^∞:
- analytic, bounded, not constant,
- C^1 but not C^∞,
- ϱ ∈ C^p and |ϱ(x) − x_+^p| bounded, for p ≥ 1.

ReLU: The set of two-layer ReLU NNs is closed in L^∞!

32 / 36
Illustration
Illustration: For most activation functions ϱ (except the ReLU), the
set NN_ϱ(d, N_1, ..., N_{L-1}, 1) is star-shaped with center 0, not
approximately convex, and not closed.

33 / 36
Stable parametrization

Continuous parametrization: It is not hard to see that if ϱ is
continuous, then so is the map R_ϱ : (T_1, ..., T_L) ↦ Φ.

Quotient map: We can also ask whether R_ϱ is a quotient map, i.e., if
Φ_1, Φ_2 are NNs which are close (w.r.t. ‖·‖_sup), then there exist
(T_1^1, ..., T_L^1) and (T_1^2, ..., T_L^2) which are close in some norm and

R_ϱ((T_1^1, ..., T_L^1)) = Φ_1 and R_ϱ((T_1^2, ..., T_L^2)) = Φ_2.

Proposition (P., Raslan, Voigtlaender; 2018)
Let ϱ be Lipschitz continuous and not affine-linear. Then R_ϱ is not a
quotient map.

34 / 36
Consequences

No convexity:
- We want to solve ∇J(Φ) = 0 for an energy J and a NN Φ.
- Not only J can be non-convex, but also the set we optimize over.
- Similar to N-term approximation by dictionaries.

No closedness:
- Exploding coefficients (if P_NN(f) ∉ NN).
- No low-neuron approximation.

No inverse-stable parametrization:
- The error term can be very small while the parametrization is far from optimal.
- Potentially very slow convergence.

35 / 36
Where to go from here?
Different networks:
- Special types of networks could be more robust.
- Convolutional neural networks are probably still too large a
  class [P., Voigtlaender; 2018].

Stronger norms:
- Stronger norms naturally help with closedness and inverse stability.
- An example is Sobolev training [Czarnecki, Osindero, Jaderberg,
  Swirszcz, Pascanu; 2017].
- Many arguments in our results break down if the W^{1,∞} norm is used.

36 / 36
Conclusion
Approximation: NNs are a very powerful approximation tool:
- often optimally efficient parametrization,
- overcome the curse of dimension,
- surprisingly efficient black-box optimization.

Topological structure: NNs form an impractical set:
- non-convex,
- non-closed,
- no inverse-stable parametrization.

37 / 36
References:
H. Andrade-Loarca, G. Kutyniok, O. Öktem, P. Petersen, Extraction of digital wavefront sets using applied
harmonic analysis and deep neural networks, arXiv:1901.01388

H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen, Optimal Approximation with Sparsely Connected Deep
Neural Networks, arXiv:1705.01714

J. Opschoor, P. Petersen, Ch. Schwab, Deep ReLU Networks and High-Order Finite Element Methods,
SAM, ETH Zürich, 2019.

P. Petersen, F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural
networks, Neural Networks, (2018)

P. Petersen, M. Raslan, F. Voigtlaender, Topological properties of the set of functions generated by neural
networks of fixed size, arXiv:1806.08459

P. Petersen, F. Voigtlaender, Equivalence of approximation by convolutional neural networks and
fully-connected networks, arXiv:1809.00973

37 / 36
Thank you for your attention!
36 / 36
