Deep Neural Networks and Partial Differential Equations:

Approximation Theory and Structural Properties

Philipp Christian Petersen

Joint work

Joint work with:

- Helmut Bölcskei (ETH Zürich)
- Philipp Grohs (University of Vienna)
- Joost Opschoor (ETH Zürich)
- Gitta Kutyniok (TU Berlin)
- Mones Raslan (TU Berlin)
- Christoph Schwab (ETH Zürich)
- Felix Voigtlaender (KU Eichstätt-Ingolstadt)

Today’s Goal
Goal of this talk: Discuss the suitability of neural networks as an
ansatz system for the solution of PDEs.

Two threads:

Approximation theory:

- universal approximation
- optimal approximation rates for all classical function spaces
- reduced curse of dimension

Structural properties:

- non-convex, non-closed ansatz spaces
- parametrization not stable
- very hard to optimize over




Neural networks
Introduction to neural networks
Approaches to solve PDEs

Approximation theory of neural networks

Classical results
High-dimensional approximation

Structural results
Stable parametrization

Neural networks

We consider neural networks as a special kind of function:

- d = N_0 ∈ ℕ: input dimension,
- L: number of layers,
- ϱ : ℝ → ℝ: activation function,
- T_ℓ : ℝ^{N_{ℓ−1}} → ℝ^{N_ℓ}, ℓ = 1, …, L: affine-linear maps.

Then Φ^ϱ : ℝ^d → ℝ^{N_L} given by

Φ^ϱ(x) = T_L(ϱ(T_{L−1}(ϱ(… ϱ(T_1(x)))))), x ∈ ℝ^d,

is called a neural network (NN). The sequence (d, N_1, …, N_L) is
called the architecture of Φ^ϱ.
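
The definition above can be sketched in a few lines of plain Python. The representation of a network as a list of (A, b) pairs and the names `affine`, `realize` and `architecture` are assumptions of this sketch, not notation from the talk; the ReLU is used as a default activation only for concreteness.

```python
def affine(A, b, x):
    """Apply the affine-linear map T(x) = A x + b (A: list of rows, b: list)."""
    return [sum(a * v for a, v in zip(row, x)) + c for row, c in zip(A, b)]

def relu(t):
    return max(t, 0.0)

def realize(layers, x, rho=relu):
    """Evaluate T_L(rho(T_{L-1}(... rho(T_1(x))))) for layers = [(A_1, b_1), ..., (A_L, b_L)]."""
    for A, b in layers[:-1]:
        x = [rho(t) for t in affine(A, b, x)]
    A, b = layers[-1]
    return affine(A, b, x)  # no activation after the last affine map

def architecture(layers):
    """Return (d, N_1, ..., N_L), read off from the matrix shapes."""
    return (len(layers[0][0][0]),) + tuple(len(A) for A, _ in layers)

# |x| = rho(x) + rho(-x) written as a two-layer ReLU network
abs_net = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
```

For instance, `realize(abs_net, [-3.0])` evaluates the absolute value at -3, and `architecture(abs_net)` reads off the layer sizes.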
Why are neural networks interesting? - I
Deep Learning: Deep learning describes a variety of techniques
based on data-driven adaptation of the affine linear maps in a neural
network.
Overwhelming success:

- Image classification

- Text understanding

Ren, He, Girshick, Sun; 2015

- Game intelligence

Hardware design of the


Why are neural networks interesting? - II

Expressibility: Neural networks constitute a very powerful


Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999)

Let d ∈ ℕ, K ⊂ ℝ^d compact, f : K → ℝ continuous, ϱ : ℝ → ℝ
continuous and not a polynomial. Let ε > 0, then there exists a
two-layer NN Φ^ϱ such that ‖f − Φ^ϱ‖_∞ ≤ ε.

Efficient expressibility: ℝ^M ∋ θ ↦ (T_1, …, T_L) ↦ Φ^ϱ_θ yields a
parametrized system of functions. In a sense this parametrization is
optimally efficient. (More on this below).
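
As a one-dimensional illustration of universal approximation, the following sketch builds a two-layer ReLU network that interpolates a given function at equispaced nodes. This is the textbook piecewise-linear construction, swapped in for concreteness; it is not the proof technique of the theorem above, and the names `relu_interpolant` and `sup_error` are hypothetical.

```python
import math

def relu_interpolant(f, a, b, n):
    """Two-layer ReLU network (returned as a function) interpolating f at n + 1 equispaced nodes."""
    h = (b - a) / n
    nodes = [a + i * h for i in range(n + 1)]
    slopes = [(f(nodes[i + 1]) - f(nodes[i])) / h for i in range(n)]
    # outer weights are the slope changes at the kinks; one hidden neuron per kink
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n)]
    offset = f(a)

    def phi(x):
        return offset + sum(c * max(x - t, 0.0) for c, t in zip(coeffs, nodes))

    return phi

def sup_error(f, phi, a, b, samples=2001):
    """Maximal error on an equispaced sample grid of [a, b]."""
    grid = (a + (b - a) * k / (samples - 1) for k in range(samples))
    return max(abs(f(x) - phi(x)) for x in grid)
```

For a twice differentiable f the interpolation error is at most h² · max|f''| / 8 with h = (b − a)/n, so increasing the number of hidden neurons n drives the error below any prescribed ε.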

How can we apply NNs to solve PDEs?

PDE problem: For D ⊂ ℝ^d, d ∈ ℕ, find u such that

G(x, u(x), ∇u(x), ∇²u(x)) = 0 for all x ∈ D.

Approach of [Lagaris, Likas, Fotiadis; 1998]: Let (x_i)_{i∈I} ⊂ D,
find a NN Φ^ϱ_θ such that

G(x_i, Φ^ϱ_θ(x_i), ∇Φ^ϱ_θ(x_i), ∇²Φ^ϱ_θ(x_i)) = 0 for all i ∈ I.

Standard methods can be used to find parameters θ.
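
A minimal sketch of the collocation idea, under strong simplifying assumptions: a toy first-order equation u' − (1 − u²) = 0 in one variable replaces the general second-order G, the network has one hidden tanh layer so that its derivative can be written by hand, and no training loop is included. All names (`net_and_derivative`, `collocation_loss`, `exact`, `perturbed`) are hypothetical.

```python
import math

def net_and_derivative(theta, x):
    """Phi(x) = sum_k c_k tanh(a_k x + b_k) and Phi'(x), for theta = [(a_k, b_k, c_k), ...]."""
    value, deriv = 0.0, 0.0
    for a, b, c in theta:
        t = math.tanh(a * x + b)
        value += c * t
        deriv += c * a * (1.0 - t * t)
    return value, deriv

def G(x, u, du):
    """Toy problem assumed for this sketch: u' - (1 - u^2) = 0."""
    return du - (1.0 - u * u)

def collocation_loss(theta, points):
    """Mean squared residual of G at the collocation points x_i."""
    return sum(G(x, *net_and_derivative(theta, x)) ** 2 for x in points) / len(points)

points = [-1.0 + 0.1 * i for i in range(21)]
exact = [(1.0, 0.0, 1.0)]      # Phi(x) = tanh(x) solves the toy equation
perturbed = [(1.2, 0.1, 0.9)]  # a nearby network with non-zero residual
```

Minimizing `collocation_loss` over θ with a gradient-based optimizer is what "standard methods" would amount to in this toy setting; here the loss is only evaluated, which already separates the exact solution from the perturbed network.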

Approaches to solve PDEs - Examples

General Framework: Deep Ritz Method [E, Yu; 2017]: NNs as trial
functions, SGD naturally replaces quadrature.

High-dimensional PDEs: [Sirignano, Spiliopoulos; 2017]: Let
D ⊂ ℝ^d, d ≥ 100, find u such that
∂_t u(t, x) + H(u)(t, x) = 0, (t, x) ∈ [0, T] × Ω, + BC + IC.
As the number of parameters of the NNs increases, the minimizer of the
associated energy approaches the true solution. No mesh generation.

[Berner, Grohs, Hornung, Jentzen, von Wurstemberger;
Phrasing the problem as empirical risk minimization: provably no
curse of dimension in the approximation problem or the number of samples.

How can we apply NNs to solve PDEs?

Deep learning and PDEs: Both approaches above are based on two ideas.
- Neural networks are highly efficient in representing solutions of
  PDEs, hence the complexity of the problem can be greatly reduced.
- There exist black box methods from machine learning that
  solve the optimization problem.
This talk:
- We will show exactly how efficient the representations are.
- Raise doubt that the black box can produce reliable results in

Approximation theory of neural networks

Complexity of neural networks
Φ^ϱ(x) = T_L(ϱ(T_{L−1}(ϱ(… ϱ(T_1(x)))))), x ∈ ℝ^d.
Each affine linear mapping T_ℓ is defined by a matrix A_ℓ ∈ ℝ^{N_ℓ × N_{ℓ−1}}
and a translation b_ℓ ∈ ℝ^{N_ℓ} via T_ℓ(x) = A_ℓ x + b_ℓ.

The number of weights W(Φ^ϱ) and the number of neurons N(Φ^ϱ) are

W(Φ^ϱ) = Σ_{j≤L} (‖A_j‖_{ℓ^0} + ‖b_j‖_{ℓ^0}) and N(Φ^ϱ) = Σ_{j=0}^{L} N_j.
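
The two complexity measures can be computed directly from the matrices and translations. The sketch below assumes a network is stored as a list of (A_j, b_j) pairs; the function names are hypothetical.

```python
def num_weights(layers):
    """W(Phi): number of non-zero entries of all matrices A_j and translations b_j."""
    return sum(
        sum(1 for row in A for a in row if a != 0) + sum(1 for v in b if v != 0)
        for A, b in layers
    )

def num_neurons(layers):
    """N(Phi) = N_0 + N_1 + ... + N_L, read off from the matrix shapes."""
    return len(layers[0][0][0]) + sum(len(A) for A, _ in layers)

# |x| = rho(x) + rho(-x) as a ReLU network with architecture (1, 2, 1)
abs_net = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
```

Because the ℓ^0 "norm" counts only non-zero entries, the zero translations in `abs_net` do not contribute to W.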
Power of the architecture — Exemplary results

Given f from some class of functions, how many weights/neurons

does an ε-approximating NN need to have?

Not so many...
Theorem (Maiorov, Pinkus; 1999)
There exists an activation function ϱ_weird : ℝ → ℝ that
- is analytic and strictly increasing,
- satisfies lim_{x→−∞} ϱ_weird(x) = 0 and lim_{x→∞} ϱ_weird(x) = 1,
such that for any d ∈ ℕ, any f ∈ C([0, 1]^d), and any ε > 0, there is
a 3-layer ϱ_weird-network Φ_ε^{ϱ_weird} with ‖f − Φ_ε^{ϱ_weird}‖_{L^∞} ≤ ε and
N(Φ_ε^{ϱ_weird}) = 9d + 3.

Power of the architecture — Exemplary results
- Barron; 1993: Approximation rate for functions with one finite
  Fourier moment using shallow networks with activation function
  ϱ sigmoidal of order zero.
- Mhaskar; 1993: Let ϱ be a sigmoidal function of order k ≥ 2.
  For f ∈ C^s([0, 1]^d), we have ‖f − Φ^ϱ_n‖_{L^∞} ≲ N(Φ^ϱ_n)^{−s/d} and
  L(Φ^ϱ_n) = L(d, s, k).
- Yarotsky; 2017: For f ∈ C^s([0, 1]^d), we have for ϱ(x) = x_+
  (called ReLU) that ‖f − Φ^ϱ_n‖_{L^∞} ≲ W(Φ^ϱ_n)^{−s/d} and
  L(Φ^ϱ_n) ≍ log(n).
- Shaham, Cloninger, Coifman; 2015: One can implement
  certain wavelets using 4-layer NNs.
- He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019:
  ReLU NNs reproduce approximation rates of h-, p- and hp-FEM.
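
To make the logarithmic depth in the ReLU result tangible, here is a sketch of the well-known building block from Yarotsky's construction: x² on [0, 1] is approximated by subtracting scaled compositions of a "tooth" function, each of which is a small ReLU layer. The function names are hypothetical, and the code evaluates the construction as ordinary Python functions rather than assembling explicit weight matrices.

```python
def relu(t):
    return max(t, 0.0)

def hat(x):
    """Tooth function g on [0, 1], written with three ReLU neurons."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_approx(x, m):
    """f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s, where g_s is the s-fold composition of g."""
    out, g = x, x
    for s in range(1, m + 1):
        g = hat(g)          # one more layer of depth per term
        out -= g / 4 ** s
    return out

def sup_error(m, samples=4097):
    """Maximal error |f_m(x) - x^2| on a dyadic grid of [0, 1]."""
    return max(
        abs(square_approx(k / (samples - 1), m) - (k / (samples - 1)) ** 2)
        for k in range(samples)
    )
```

Here f_m is the piecewise linear interpolant of x² on a grid of width 2^{−m}, so the error is 2^{−2m−2}: the depth m needed for accuracy ε grows only like log(1/ε).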
Lower bounds

Optimal approximation rates: Lower bounds on required network size
only exist under additional assumptions. (Recall networks based on
ϱ_weird.)

(A) Place restrictions on the activation function (e.g. only consider
the ReLU), thereby excluding pathological examples like ϱ_weird.
(→ VC dimension bounds)

(B) Place restrictions on the weights.
(→ Information theoretical bounds, entropy arguments)

(C) Use still other concepts like continuous N-widths.

Asymptotic min-max rate distortion
Encoders: Let C ⊂ L²(ℝ^d), ℓ ∈ ℕ,

E^ℓ := {E : C → {0, 1}^ℓ}, D^ℓ := {D : {0, 1}^ℓ → L²(ℝ^d)}.

Min-max code length:

L(ε, C) := min{ℓ ∈ ℕ : ∃ D ∈ D^ℓ, E ∈ E^ℓ : sup_{f∈C} ‖D(E(f)) − f‖_2 < ε}.

Optimal exponent:

γ*(C) := inf{γ > 0 : L(ε, C) = O(ε^{−γ})}.
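
A toy encoder/decoder pair may help to read these definitions. The sketch assumes the deliberately simple class C = {c · 1_{[0,1]^d} : c ∈ [0, 1]}, for which the L² distance between two elements is just the distance of their constants; the class and the names `encode`, `decode` and `worst_case_error` are illustrative assumptions, not taken from the talk.

```python
def encode(c, ell):
    """E: map the constant c in [0, 1] to a bit string of length ell (index of its cell)."""
    k = min(int(c * 2 ** ell), 2 ** ell - 1)
    return [(k >> i) & 1 for i in reversed(range(ell))]

def decode(bits):
    """D: map a bit string back to the midpoint of the corresponding cell."""
    k = 0
    for bit in bits:
        k = 2 * k + bit
    return (k + 0.5) / 2 ** len(bits)

def worst_case_error(ell, samples=1001):
    """sup over sampled c of |D(E(c)) - c|, i.e. the L2 distortion for the toy class."""
    return max(
        abs(decode(encode(j / (samples - 1), ell)) - j / (samples - 1))
        for j in range(samples)
    )
```

With ℓ bits the distortion is at most 2^{−(ℓ+1)}, so for this toy class L(ε, C) grows only logarithmically in 1/ε and the optimal exponent is 0. Richer classes, such as the smoothness classes discussed in the talk, have positive exponents.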

Asymptotic min-max rate distortion

Theorem (Bölcskei, Grohs, Kutyniok, P.; 2017)

Let C ⊂ L²(ℝ^d), ϱ : ℝ → ℝ, then for all ε > 0:

sup_{f∈C} inf{W(Φ^ϱ) : Φ^ϱ NN with quantised weights, ‖Φ^ϱ − f‖_2 ≤ ε} ≳ ε^{−γ*(C)}. (1)

Optimal approximation/parametrization: If for C ⊂ L²(ℝ^d) one
also has ≲ in (1), then NNs approximate the function class optimally.
Versatility: It turns out that NNs achieve optimal approximation
rates for many practically-used function classes.

Some instances of optimal approximation

- Mhaskar; 1993: Let ϱ be a sigmoidal function of order k ≥ 2.
  For f ∈ C^s([0, 1]^d), we have ‖f − Φ^ϱ_n‖_{L^∞} ≲ N(Φ^ϱ_n)^{−s/d}.
  We have γ*({f ∈ C^s([0, 1]^d) : ‖f‖ ≤ 1}) = d/s.
- Shaham, Cloninger, Coifman; 2015: One can implement
  certain wavelets using 4-layer ReLU NNs. Optimal, when
  wavelets are optimal.
- Bölcskei, Grohs, Kutyniok, P.; 2017: Networks yield
  optimal rates if any affine system does. Example: shearlets for
  cartoon-like functions.

ReLU Approximation
Piecewise smooth functions: E_{β,d} denotes the d-dimensional
C^β-piecewise smooth functions on [0, 1]^d with interfaces in C^β.

Theorem (P., Voigtlaender; 2018)

Let d ∈ ℕ, β ≥ 0, ϱ : ℝ → ℝ, ϱ(x) = x_+, then

sup_{f∈E_{β,d}} inf{W(Φ^ϱ) : Φ^ϱ NN with quantised weights, ‖Φ^ϱ − f‖_2 ≤ ε} ∼ ε^{−γ*(E_{β,d})} = ε^{−2(d−1)/β}.

The optimal depth of the networks is ∼ β/d.

High-dimensional approximation
Curse of dimension: To guarantee approximation with error ≤ ε
of functions in E_{β,d} one requires networks with O(ε^{−2(d−1)/β})
weights.

Symmetries and invariances:

Image classifiers are often:
- translation, dilation, and rotation invariant,
- invariant to small deformations,
- invariant to small changes in brightness, contrast, color.

Curse of dimension
Two-step setup: f = χ ∘ τ
- τ : ℝ^D → ℝ^d is a smooth dimension-reducing feature map.
- χ ∈ E_{β,d} performs classification on a low-dimensional space.


Theorem (P., Voigtlaender; 2017)

Let ϱ(x) = x_+. There are constants c > 0, L ∈ ℕ such that for any
f = χ ∘ τ and any ε ∈ (0, 1/2), there is a NN Φ^ϱ_ε with at most L
layers, and at most c · ε^{−2(d−1)/β} non-zero weights such that

‖Φ^ϱ_ε − f‖_{L²} < ε.

Asymptotic approximation rate depends only on d, not on D.
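
A short arithmetic sketch of what the exponent 2(d−1)/β means in practice. The constant c is ignored, and the values β = 2, ε = 0.1, d = 3 and D = 100 in the usage note are arbitrary illustrative choices, not figures from the talk.

```python
def weight_exponent(d, beta):
    """Exponent 2(d-1)/beta in the bound c * eps^(-2(d-1)/beta)."""
    return 2.0 * (d - 1) / beta

def weights_order(eps, d, beta):
    """Order of magnitude eps^(-2(d-1)/beta) of the weight count, ignoring the constant c."""
    return eps ** (-weight_exponent(d, beta))
```

With β = 2 and ε = 0.1, a direct approximation in dimension 100 would scale like `weights_order(0.1, 100, 2.0)`, about 10^99, whereas with a feature map onto d = 3 dimensions the theorem gives the scaling `weights_order(0.1, 3, 2.0)`, about 10^2.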

Compositional functions

Compositional functions: [Mhaskar, Poggio; 2016]

High-dimensional functions as dyadic composition of 2-dimensional
functions:

ℝ^8 ∋ x ↦ h^3_1(h^2_1(h^1_1(x_1, x_2), h^1_2(x_3, x_4)), h^2_2(h^1_3(x_5, x_6), h^1_4(x_7, x_8)))
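
The dyadic structure can be sketched as a binary tree evaluated level by level. The constituent functions below (addition and multiplication) are placeholders chosen only for illustration, and the name `compose_dyadic` is hypothetical.

```python
def compose_dyadic(x, levels):
    """Evaluate a binary-tree composition; levels[k] lists the bivariate functions of level k + 1."""
    values = list(x)
    for functions in levels:
        values = [h(values[2 * i], values[2 * i + 1]) for i, h in enumerate(functions)]
    return values[0]

# hypothetical constituent functions h_i^k
add = lambda s, t: s + t
mul = lambda s, t: s * t

# level 1: four functions of (x1, x2), ..., (x7, x8); level 2: two functions; level 3: one
levels = [[add, add, add, add], [mul, mul], [add]]
```

Each constituent function depends on only two variables, which is the low-dimensional structure a network mirroring the tree can exploit.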

Approximation with respect to Sobolev norms: ReLU NNs Φ
are Lipschitz continuous. Hence, for s ∈ [0, 1], p ≥ 1 and
f ∈ W^{s,p}(Ω), we can measure

‖f − Φ‖_{W^{s,p}(Ω)}.

ReLU networks achieve the same approximation rates as h-, p-,
hp-FEM [Opschoor, P., Schwab; 2019].
Convolutional neural networks: Direct correspondence between
approximation by CNNs (without pooling) and approximation by
fully-connected networks [P., Voigtlaender; 2018].

Optimal parametrization

Optimal parametrization:
- Neural networks yield optimal representations of many function
  classes relevant in PDE applications,
- Approximation is flexible and quality is improved if
  low-dimensional structure is present.

PDE discretization:
- Problem complexity drastically reduced,
- No design of ansatz system necessary, since NNs approximate
  almost every function class well.

Can neural networks really be this good?

The inconvenient structure of
neural networks

Fixed architecture networks
Goal: Fix a space of networks with prescribed shape and
understand the associated set of functions.

Fixed architecture networks: Let d, L ∈ ℕ, N_1, …, N_{L−1} ∈ ℕ,
ϱ : ℝ → ℝ, then we denote by
NN^ϱ(d, N_1, …, N_{L−1}, 1)
the set of NNs with architecture (d, N_1, …, N_{L−1}, 1).

[Figure: network with d = 8, N_1 = 12, N_2 = 12, N_3 = 12, N_4 = 8]

Back to the basics

Topological properties: Is NN^ϱ(d, N_1, …, N_{L−1}, 1)

- star-shaped?
- convex? approximately convex?
- closed?
Is the map (T_1, …, T_L) ↦ Φ open?

Implications for optimization:

If we do not have the properties above, then we can have
- terrible local minima,
- exploding weights,
- very slow convergence.

Star-shapedness: NN^ϱ(d, N_1, …, N_{L−1}, 1) is trivially star-shaped
with center 0.
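
The reason star-shapedness is trivial can be checked numerically: scaling the last affine map by λ scales the realized function by λ without changing the architecture, so the whole segment between 0 and Φ stays in the set. The sketch assumes ReLU networks stored as lists of (A, b) pairs; the example weights and function names are arbitrary.

```python
def realize(layers, x):
    """Evaluate a ReLU network given as [(A_1, b_1), ..., (A_L, b_L)]."""
    for k, (A, b) in enumerate(layers):
        x = [sum(a * v for a, v in zip(row, x)) + c for row, c in zip(A, b)]
        if k < len(layers) - 1:
            x = [max(t, 0.0) for t in x]  # activation on hidden layers only
    return x

def scale_output(layers, lam):
    """Multiply the last affine map by lam; the architecture is unchanged."""
    A, b = layers[-1]
    return layers[:-1] + [([[lam * a for a in row] for row in A], [lam * c for c in b])]

# arbitrary example network with architecture (2, 2, 1)
net = [([[1.0, -2.0], [0.5, 1.0]], [0.1, -0.3]), ([[2.0, -1.0]], [0.25])]
```

For any λ ∈ [0, 1], `realize(scale_output(net, lam), x)` equals λ times `realize(net, x)`.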


Proposition (P., Raslan, Voigtlaender; 2018)

Let d, L, N, N_1, …, N_{L−1} ∈ ℕ and let ϱ : ℝ → ℝ be locally
Lipschitz continuous. Then the number of linearly independent
centers of NN^ϱ(d, N_1, …, N_L) is at most Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ,
where N_0 = d.

Corollary (P., Raslan, Voigtlaender; 2018)

Let d, L, N, N_1, …, N_{L−1} ∈ ℕ, N_0 = d, and let ϱ : ℝ → ℝ be
locally Lipschitz continuous.
If NN^ϱ(d, N_1, …, N_{L−1}, 1) contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ
linearly independent functions, then NN^ϱ(d, N_1, …, N_{L−1}, 1) is
not convex.
From translation invariance: If NN^ϱ(d, N_1, …, N_{L−1}, 1) contains only
finitely many linearly independent functions, then ϱ is a finite sum of
complex exponentials multiplied with polynomials.
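
The bound in the proposition is simply the total number of entries of all matrices and translations, which the following sketch evaluates for a given architecture. The function name is hypothetical, and the architecture (8, 12, 12, 12, 8) in the usage note is only an example.

```python
def center_bound(architecture):
    """Sum over layers of (N_{l-1} + 1) * N_l for architecture (N_0, ..., N_L)."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(architecture, architecture[1:]))
```

For the architecture (8, 12, 12, 12, 8) this gives 108 + 156 + 156 + 104 = 524. The set of realized functions is typically infinite-dimensional, so the corollary's condition of more than this many linearly independent functions is easily met.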

Weak Convexity?

Weak convexity: NN^ϱ(d, N_1, …, N_{L−1}, 1) is almost never
convex, but what about NN^ϱ(d, N_1, …, N_{L−1}, 1) + B^∞_ε(0) for a
hopefully small ε > 0?

Theorem (P., Raslan, Voigtlaender; 2018)

Let d, L, N, N_1, …, N_{L−1} ∈ ℕ, N_0 = d. For all commonly-used
activation functions there does not exist ε > 0 such that
NN^ϱ(d, N_1, …, N_{L−1}, 1) + B^∞_ε(0) is convex.

As a corollary, we also get that NN^ϱ(d, N_1, …, N_{L−1}, 1) is usually
nowhere dense.

Illustration: The set NN^ϱ(d, N_1, …, N_{L−1}, 1) has very few
centers, it is scaling invariant, not approximately convex, and
nowhere dense.

Closedness in L^p
Compact weights: If the activation function ϱ is continuous, then
a compactness argument shows that the set of networks with parameters
in a compact set is closed.

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L, N_1, …, N_{L−1} ∈ ℕ, N_0 = d. If ϱ has one of the properties
below, then NN^ϱ(d, N_1, …, N_{L−1}, 1) is not closed in L^p,
p ∈ (0, ∞).
- analytic, bounded, not constant,
- C^1 but not C^∞,
- continuous, monotone, bounded, ϱ'(x_0) exists and is non-zero
  in at least one point x_0 ∈ ℝ,
- continuous, monotone, continuously differentiable outside a
  compact set, and lim_{x→∞} ϱ'(x), lim_{x→−∞} ϱ'(x) exist and do
  not coincide.
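
One mechanism behind such non-closedness results can be illustrated with difference quotients: a network with two neurons and outer weights ±n approximates the derivative ϱ' ever better while its weights blow up. The sketch below uses tanh as an example of an analytic, bounded, non-constant activation. It only demonstrates the convergence with exploding weights on the interval [−4, 4]; it does not verify that the limit lies outside the set, which is the content of the theorem. All names are hypothetical.

```python
import math

def difference_quotient_net(n):
    """Two-neuron tanh network n * tanh(x + 1/n) - n * tanh(x); its outer weights are +n and -n."""
    return lambda x: n * math.tanh(x + 1.0 / n) - n * math.tanh(x)

def sup_distance_to_derivative(n, samples=801):
    """Maximal distance to tanh'(x) = 1 - tanh(x)^2 on a grid of [-4, 4]."""
    grid = [-4.0 + 8.0 * k / (samples - 1) for k in range(samples)]
    phi = difference_quotient_net(n)
    return max(abs(phi(x) - (1.0 - math.tanh(x) ** 2)) for x in grid)
```

The distance decays like 1/n while the outer weights grow like n, which matches the "exploding weights" phenomenon listed among the implications for optimization.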

Closedness in L^∞

Theorem (P., Raslan, Voigtlaender; 2018)

Let d, L, N, N_1, …, N_{L−1} ∈ ℕ, N_0 = d. If ϱ has one of the
properties below, then NN^ϱ(d, N_1, …, N_{L−1}, 1) is not closed in
L^∞.
- analytic, bounded, not constant,
- C^1 but not C^∞,
- ϱ ∈ C^p and |ϱ(x) − x_+| bounded, for p ≥ 1.

ReLU: The set of two-layer ReLU NNs is closed in L^∞!

Illustration: For most activation functions ϱ (except the ReLU) the
set NN^ϱ(d, N_1, …, N_{L−1}, 1) is star-shaped with center 0, not
approximately convex, not closed.

Stable parametrization

Continuous parametrization: It is not hard to see that if ϱ is
continuous, then so is the map R^ϱ : (T_1, …, T_L) ↦ Φ.
Quotient map: We can also ask if R^ϱ is a quotient map, i.e., if
Φ_1, Φ_2 are NNs which are close (w.r.t. ‖·‖_sup), then there exist
(T^1_1, …, T^1_L) and (T^2_1, …, T^2_L) which are close in some norm and

R^ϱ((T^1_1, …, T^1_L)) = Φ_1 and R^ϱ((T^2_1, …, T^2_L)) = Φ_2.

Proposition (P., Raslan, Voigtlaender; 2018)

Let ϱ be Lipschitz continuous and not affine linear, then R^ϱ is not a
quotient map.
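
A much weaker but easy-to-check relative of this statement is that the realization map is far from injective: by positive homogeneity of the ReLU, very distant parameters can realize exactly the same function. The sketch below shows only this rescaling symmetry for a one-neuron network; it is not a proof of the proposition, and the function names are hypothetical.

```python
def phi(a, c, x):
    """One-neuron ReLU network x -> c * relu(a * x)."""
    return c * max(a * x, 0.0)

def rescale(a, c, lam):
    """For lam > 0, (a, c) and (lam * a, c / lam) realize the same function."""
    return lam * a, c / lam
```

For example, `rescale(1.0, 1.0, 1024.0)` moves the inner weight from 1 to 1024 while the realized function is unchanged, so closeness of functions says little about closeness of a given pair of parametrizations.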

No convexity:
- Want to solve ∇J(Φ) = 0 for an energy J and NN Φ.
- Not only J could be non-convex, but also the set we optimize over.
- Similar to N-term approximation by dictionaries.
No closedness:
- Exploding coefficients (if P_{NN}(f) ∉ NN).
- No low-neuron approximation.
No inverse-stable parametrization:
- Error term very small, while parametrization is far from optimal.
- Potentially very slow convergence.

Where to go from here?
Different networks:
- Special types of networks could be more robust.
- Convolutional neural networks are probably still too large a
  class [P., Voigtlaender; 2018].
Stronger norms:
- Stronger norms naturally help with closedness and inverse
  stability.
- Example is Sobolev training [Czarnecki, Osindero, Jaderberg,
  Swirszcz, Pascanu; 2017].
- Many arguments of our results break down if the W^{1,∞} norm is
  used.

Approximation: NNs are a very powerful approximation tool:
- often optimally efficient
- overcome curse of dimension
- surprisingly efficient black-box

Topological structure: NNs form an impractical set:

- non-convex
- non-closed
- no inverse-stable parametrization.

