Invariant Scattering Convolution Networks
CMAP, Ecole Polytechnique, Palaiseau, France
arXiv:1203.1513v2 [cs.CV] 8 Mar 2012
Abstract—A wavelet scattering network computes a translation invariant image representation, which is stable to deformations and preserves high frequency information for classification. It cascades wavelet transform convolutions with non-linear modulus and averaging operators. The first network layer outputs SIFT-type descriptors, whereas the next layers provide complementary invariant information which improves classification. The mathematical analysis of wavelet scattering networks explains important properties of deep convolution networks for classification.
A scattering representation of stationary processes incorporates higher order moments and can thus discriminate textures having the same Fourier power spectrum. State-of-the-art classification results are obtained for handwritten digits and texture discrimination, with a Gaussian kernel SVM and a generative PCA classifier.

1 INTRODUCTION

A major difficulty of image classification comes from the considerable variability within image classes and the inability of Euclidean distances to measure image similarities. Part of this variability is due to rigid translations, rotations or scaling. This variability is often uninformative for classification and should thus be eliminated. In the framework of kernel classifiers [31], metrics are defined as a Euclidean distance applied on a representation Φ(x) of signals x. The operator Φ must therefore be invariant to these rigid transformations.

Non-rigid deformations also induce important variability within object classes [3], [15], [34]. For instance, in handwritten digit recognition, one must take into account digit deformations due to different writing styles. However, a full deformation invariance would reduce discrimination since a digit can be deformed into a different digit, for example a one into a seven. The representation must therefore not be deformation invariant but continuous to deformations, to handle small deformations with a kernel classifier. A small deformation of an image x into x′ should correspond to a small Euclidean distance ‖Φ(x) − Φ(x′)‖ in the representation space, as further explained in Section 2.

Translation invariant representations can be constructed with registration algorithms [32] or with the Fourier transform modulus. However, Section 2.1 explains why these invariants are not stable to deformations and hence not adapted to image classification. Trying to avoid Fourier transform instabilities suggests replacing sinusoidal waves by localized waveforms such as wavelets. However, wavelet transforms are not invariant to translations. Building invariant representations from wavelet coefficients requires introducing non-linear operators, which leads to a convolution network architecture.

Deep convolution networks have the ability to build large-scale invariants which are stable to deformations [18]. They have been applied to a wide range of image classification tasks. Despite the remarkable successes of this neural network architecture, the properties and optimal configurations of these networks are not well understood because of the cascaded non-linearities. Why use multiple layers? How many layers? How to optimize filters and pooling non-linearities? How many internal and output neurons? These questions are mostly answered through numerical experimentations that require significant expertise.

Deformation stability is obtained with localized wavelet filters which separate the image variations at multiple scales and orientations [22]. Computing a non-zero translation invariant representation from wavelet coefficients requires introducing a non-linearity, which is chosen to be a modulus to optimize stability [6]. Wavelet scattering networks, introduced in [23], [22], build translation invariant representations with average poolings of wavelet modulus coefficients. The output of the first network layer is similar to SIFT [21] or Daisy [33] type descriptors. However, this limited set of locally invariant coefficients is not sufficiently informative to discriminate complex structures over large-size domains. The information lost by the averaging is recovered by computing a next layer of invariant coefficients, with the same wavelet convolutions and average modulus poolings. A wavelet scattering is thus a deep convolution network which cascades wavelet transforms and modulus operators. The mathematical properties of scattering operators [22] explain how these deep network coefficients relate to image sparsity and geometry. The network architecture is optimized in Section 3, to retain important information while avoiding useless computations.

A scattering representation of stationary processes is introduced for texture discrimination. As opposed to the Fourier power spectrum, it provides information on higher order moments and can thus discriminate non-Gaussian textures having the same power spectrum.
Classification applications are studied in Section 4.1. Scattering classification properties are demonstrated with a Gaussian kernel SVM and a generative classifier, which selects affine space models computed with a PCA. State-of-the-art results are obtained for handwritten digit recognition on the MNIST and USPS databases, and for texture discrimination. Software is available at www.cmap.polytechnique.fr/scattering.

2 TOWARDS A CONVOLUTION NETWORK

Section 2.1 formalizes the deformation stability condition as a Lipschitz continuity property, and explains why high Fourier frequencies are sources of instabilities. Section 2.2 introduces a wavelet-based scattering transform, which is translation invariant and stable to deformations, and Section 2.3 describes its convolutional network architecture.

2.1 Fourier and Registration Invariants

A representation Φ(x) is invariant to global translations L_c x(u) = x(u − c) by c = (c1, c2) ∈ R² if

Φ(L_c x) = Φ(x) .   (1)

A canonical invariant [15], [32] Φ(x) = x(u − a(x)) registers x with an anchor point a(x), which is translated when x is translated: a(L_c x) = a(x) + c. It is therefore invariant: Φ(L_c x) = Φ(x). For example, the anchor point may be a filtered maximum a(x) = arg max_u |x ⋆ h(u)|, for some filter h(u).

The Fourier transform modulus is another example of a translation invariant representation. Let x̂(ω) be the Fourier transform of x(u). Since the Fourier transform of L_c x is e^{−ic·ω} x̂(ω), its modulus |x̂(ω)| does not depend upon c.

To obtain appropriate similarity measurements between images which have undergone non-rigid transformations, the representation must also be stable to small deformations. A small deformation can be written L_τ x(u) = x(u − τ(u)), where τ(u) depends upon u and thus deforms the image. The deformation gradient tensor ∇τ(u) is a matrix whose norm |∇τ(u)| measures the deformation amplitude at u. A small deformation is an invertible transformation if |∇τ(u)| < 1 [2], [34]. Stability to deformations is expressed as a Lipschitz continuity condition relative to this deformation metric:

‖Φ(L_τ x) − Φ(x)‖ ≤ C ‖x‖ sup_u |∇τ(u)| ,   (2)

where ‖x‖² = ∫ |x(u)|² du. This property implies global translation invariance, because if τ(u) = c then ∇τ(u) = 0, but it is much stronger.

A Fourier modulus is translation invariant but unstable with respect to deformations at high frequencies. Indeed, | |x̂(ω)| − |L̂_τ x(ω)| | can be arbitrarily large at a high frequency ω, even for small deformations and in particular small dilations. As a result, Φ(x) = |x̂| does not satisfy the deformation continuity condition (2) [22]. A Fourier modulus also loses too much information. For example, a Dirac δ(u) and a linear chirp e^{iu²} are totally different signals whose Fourier transforms have moduli which are equal and constant. Very different signals may thus not be discriminated from their Fourier modulus.

A registration invariant Φ(x) = x(u − a(x)) carries more information than a Fourier modulus, and characterizes x up to a global absolute position [32]. However, it has the same high-frequency instability as a Fourier transform. Indeed, for any choice of anchor point a(x), applying the Plancherel formula proves that

‖x(u − a(x)) − x′(u − a(x′))‖ ≥ (2π)^{−1} ‖ |x̂(ω)| − |x̂′(ω)| ‖ .   (3)

If x′ = L_τ x, the Fourier transform instability at high frequencies implies that Φ(x) = x(u − a(x)) is also unstable with respect to deformations.

2.2 Scattering Wavelets

A wavelet is a localized waveform and is thus stable to deformations, as opposed to Fourier sinusoidal waves. A wavelet transform computes convolutions with wavelets. It is thus translation covariant, not invariant. A scattering transform computes non-linear invariants with modulus and averaging pooling functions.

Two-dimensional directional wavelets are obtained by scaling and rotating a single band-pass filter ψ. Let G be a discrete, finite rotation group in R². Multiscale directional wavelet filters are defined for any j ∈ Z and rotation r ∈ G by

ψ_{2^j r}(u) = 2^{2j} ψ(2^j r^{−1} u) .   (4)

If the Fourier transform ψ̂(ω) is centered at a frequency η, then ψ̂_{2^j r}(ω) = ψ̂(2^{−j} r^{−1} ω) has a support centered at 2^j rη, with a bandwidth proportional to 2^j. To simplify notations, we denote λ = 2^j r ∈ Λ = G × Z, and |λ| = 2^j.

A wavelet transform filters x using a family of wavelets: {x ⋆ ψ_λ(u)}_λ. It is computed with a filter bank of dilated and rotated wavelets having no orthogonality property. As further explained in Section 3.1, it is stable and invertible if the rotated and scaled wavelet filters cover the whole frequency plane. On discrete images, to avoid aliasing, we only capture frequencies in the circle |ω| ≤ π inscribed in the image frequency square. However, most digital natural images and textures have negligible energy outside this frequency circle.

Let u·u′ and |u| denote the inner product and norm in R². A Morlet wavelet ψ is an example of such a wavelet, given by

ψ(u) = C1 (e^{iu·ξ} − C2) e^{−|u|²/(2σ²)} ,

where C2 is adjusted so that ∫ ψ(u) du = 0. Figure 1 shows the Morlet wavelet with σ = 0.85 and ξ = 3π/4 used in all classification experiments.
Fig. 1. Complex Morlet wavelet. (a): Real part of ψ. (b): Imaginary part of ψ. (c): Fourier modulus |ψ̂|.
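The Morlet family (4) used in the experiments can be sketched numerically. The code below is a minimal illustration, not the authors' released software available at www.cmap.polytechnique.fr/scattering; the grid sampling, the crude normalisation, and the helper names `morlet_2d` and `morlet_filter_bank` are choices made here for illustration only.

```python
import numpy as np

def morlet_2d(M, N, sigma, xi, theta):
    """Morlet wavelet psi(u) = C1 (exp(i u.xi_theta) - C2) exp(-|u|^2/(2 sigma^2)),
    sampled on an M x N grid; C2 enforces the zero-mean condition stated above."""
    y, x = np.mgrid[0:M, 0:N]
    y, x = y - M // 2, x - N // 2
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * xi * (x * np.cos(theta) + y * np.sin(theta)))
    C2 = (carrier * envelope).sum() / envelope.sum()   # makes the filter integrate to 0
    psi = (carrier - C2) * envelope
    return np.fft.ifftshift(psi)                        # put the filter origin at pixel (0, 0)

def morlet_filter_bank(M, N, J=3, C=6, sigma0=0.85, xi0=3 * np.pi / 4, sigma_phi=0.7):
    """Dilated and rotated Morlets (the paper's psi_{2^j r}) and a Gaussian low-pass phi_{2^J},
    returned in the Fourier domain so that convolutions reduce to pointwise products.
    Scale index j = 0..J-1 corresponds to a wavelet centred at frequency xi0 / 2^j."""
    psi_hat = {}
    for j in range(J):
        for c in range(C):
            theta = c * np.pi / C                       # "positive" rotations in [0, pi)
            psi = morlet_2d(M, N, sigma0 * 2 ** j, xi0 / 2 ** j, theta)
            psi_hat[(j, c)] = np.fft.fft2(psi)
    y, x = np.mgrid[0:M, 0:N]
    y, x = y - M // 2, x - N // 2
    phi = np.exp(-(x ** 2 + y ** 2) / (2.0 * (sigma_phi * 2 ** J) ** 2))
    phi /= phi.sum()                                    # phi integrates to 1
    phi_hat = np.fft.fft2(np.fft.ifftshift(phi))
    return psi_hat, phi_hat
```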
A wavelet transform commutes with translations, and is therefore not translation invariant. To build a translation invariant representation, it is necessary to introduce a non-linearity. If R is a linear or non-linear operator which commutes with translations, R(L_c x) = L_c Rx, then the integral ∫ Rx(u) du is translation invariant. Applying this to Rx = x ⋆ ψ_λ gives a trivial invariant: ∫ x ⋆ ψ_λ(u) du = 0 for all x, because ∫ ψ_λ(u) du = 0. If Rx = M(x ⋆ ψ_λ) but M is linear and commutes with translations, then the integral still vanishes, which imposes choosing a non-linear M. Taking advantage of the wavelet transform stability to deformations, to obtain integrals which are also stable to deformations we also impose that M commutes with deformations:

∀τ(u) ,  M L_τ = L_τ M .

By adding a weak differentiability condition, one can prove [6] that M must necessarily be a pointwise operator, which means that Mx(u) only depends on the value x(u). If we also impose the L²(R²) stability

∀(x, y) ∈ L²(R²)² ,  ‖Mx‖ = ‖x‖  and  ‖Mx − My‖ ≤ ‖x − y‖ ,

then one can verify [6] that necessarily Mx = e^{iα} |x|, and we set α = 0. The resulting translation invariant coefficients are therefore L¹(R²) norms: ‖x ⋆ ψ_λ‖₁ = ∫ |x ⋆ ψ_λ(u)| du.

The L¹(R²) norms {‖x ⋆ ψ_λ‖₁}_λ form a crude signal representation, which measures the sparsity of the wavelet coefficients. For appropriate wavelets, one can prove [36] that x can be reconstructed from {|x ⋆ ψ_λ(u)|}_λ, up to a multiplicative constant. The information loss thus comes from the integration of |x ⋆ ψ_λ(u)|, which removes all non-zero frequency components. These non-zero frequencies can be recovered by calculating the wavelet coefficients {|x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}(u)}_{λ2} of |x ⋆ ψ_{λ1}|. Their L¹(R²) norms define a much larger family of invariants, for all λ1 and λ2:

‖ |x ⋆ ψ_{λ1}| ⋆ ψ_{λ2} ‖₁ = ∫ | |x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}(u) | du .

More translation invariant coefficients can be computed by further iterating on the wavelet transform and modulus operators. Let U[λ]x = |x ⋆ ψ_λ|. Any sequence p = (λ1, λ2, ..., λm) defines a path, i.e., the ordered product of non-linear and non-commuting operators

U[p]x = U[λm] ... U[λ2] U[λ1] x = | ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ... | ⋆ ψ_{λm} | ,

with U[∅]x = x. A scattering transform along the path p is defined as an integral, normalized by the response of a Dirac:

Sx(p) = μ_p^{−1} ∫ U[p]x(u) du   with   μ_p = ∫ U[p]δ(u) du .

Each scattering coefficient Sx(p) is invariant to a translation of x. We shall see that this transform has many similarities with the Fourier transform modulus, which is also translation invariant. However, a scattering is Lipschitz continuous to deformations, as opposed to the Fourier transform modulus.

For classification, it is often better to compute localized descriptors which are invariant to translations smaller than a predefined scale 2^J, while keeping the spatial variability at scales larger than 2^J. This is obtained by localizing the scattering integral with a scaled spatial window φ_{2^J}(u) = 2^{−2J} φ(2^{−J} u). It defines a windowed scattering transform in the neighborhood of u:

S_J[p]x(u) = U[p]x ⋆ φ_{2^J}(u) = ∫ U[p]x(v) φ_{2^J}(u − v) dv ,

and hence

S_J[p]x(u) = | ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ... | ⋆ ψ_{λm}| ⋆ φ_{2^J}(u) ,

with S_J[∅]x = x ⋆ φ_{2^J}. For each path p, S_J[p]x(u) is a function of the window position u, which can be subsampled at intervals proportional to the window size 2^J. The averaging by φ_{2^J} implies that S_J[p]x(u) is nearly invariant to translations L_c x(u) = x(u − c) if |c| ≪ 2^J. Section 3.1 proves that it is also stable relative to deformations.
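As a concrete illustration of the windowed scattering formula above, the following sketch cascades wavelet-modulus operations along one path p and applies the final averaging by φ_{2^J}. It is a direct, unsubsampled transcription of the formula, and it assumes Fourier-domain filters such as those produced by the hypothetical `morlet_filter_bank` sketch given after Figure 1.

```python
import numpy as np

def windowed_scattering_path(x, path, psi_hat, phi_hat):
    """S_J[p]x = | ... ||x * psi_{l1}| * psi_{l2}| ... * psi_{lm}| * phi_{2^J} for one path
    p = (l1, ..., lm); psi_hat maps each path element to its Fourier-domain filter."""
    u = x.astype(complex)
    for lam in path:
        # U[lam]: wavelet convolution (a pointwise Fourier product) followed by the modulus
        u = np.abs(np.fft.ifft2(np.fft.fft2(u) * psi_hat[lam]))
    # the final averaging by phi_{2^J} makes the coefficients locally translation invariant
    return np.fft.ifft2(np.fft.fft2(u) * phi_hat).real

# example, reusing the filter bank sketched after Figure 1:
# psi_hat, phi_hat = morlet_filter_bank(64, 64, J=3, C=6)
# s = windowed_scattering_path(x, [(0, 0), (1, 2)], psi_hat, phi_hat)
```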
2.3 Scattering Convolution Network

If p is a path of length m, then S_J[p]x(u) is called a scattering coefficient of order m at the scale 2^J. It is computed at the layer m of a convolution network which is specified below. For large scale invariants, several layers are necessary to avoid losing crucial information.

For appropriate wavelets, first order coefficients S_J[λ1]x are equivalent to SIFT coefficients [21]. Indeed, SIFT computes the local sum of image gradient amplitudes among image gradients having nearly the same direction, in a histogram having 8 different direction bins. The DAISY approximation [33] shows that these coefficients are well approximated by S_J[2^j r]x = |x ⋆ ψ_{2^j r}| ⋆ φ_{2^J}(u), where ψ_{2^j r} is the partial derivative of a Gaussian computed at the finest image scale 2^j, for 8 different rotations r. The averaging filter φ_{2^J} is a scaled Gaussian.

Partial derivative wavelets are well adapted to detect edges or sharp transitions, but do not have enough frequency and directional resolution to discriminate complex directional structures. For texture analysis, many researchers [19], [30], [28] have been using averaged wavelet coefficient amplitudes |x ⋆ ψ_λ| ⋆ φ_{2^J}(u), but calculated with a complex wavelet ψ having a better frequency and directional resolution.

A scattering transform computes higher-order coefficients by further iterating on wavelet transforms and modulus operators. At a maximum scale 2^J, wavelet coefficients are computed at frequencies 2^j ≥ 2^{−J}, and lower frequencies are filtered by φ_{2^J}(u) = 2^{−2J} φ(2^{−J} u). Since images are real-valued signals, it is sufficient to consider "positive" rotations r ∈ G+ with angles in [0, π):

W_J x(u) = { x ⋆ φ_{2^J}(u) , x ⋆ ψ_λ(u) }_{λ∈Λ_J}   (5)

with Λ_J = {λ = 2^j r : r ∈ G+, j ≥ −J}. For a Morlet wavelet ψ, the averaging filter φ is chosen to be a Gaussian. Let us emphasize that 2^J is a spatial scale variable, whereas λ = 2^j r is assimilated to a frequency variable.

A wavelet modulus propagator keeps the low-frequency averaging and computes the modulus of the complex wavelet coefficients:

U_J x(u) = { x ⋆ φ_{2^J}(u) , |x ⋆ ψ_λ(u)| }_{λ∈Λ_J} .   (6)

Let Λ_J^m be the set of all paths p = (λ1, ..., λm) of length m. We denote U[Λ_J^m]x = {U[p]x}_{p∈Λ_J^m} and S_J[Λ_J^m]x = {S_J[p]x}_{p∈Λ_J^m}. Since

U_J U[p]x = { U[p]x ⋆ φ_{2^J} , |U[p]x ⋆ ψ_λ| }_{λ∈Λ_J} ,

and S_J[p]x = U[p]x ⋆ φ_{2^J}, it results that

U_J U[Λ_J^m]x = {U_J U[p]x}_{p∈Λ_J^m} = { S_J[Λ_J^m]x , U[Λ_J^{m+1}]x } .   (7)

This implies that S_J[p]x can be computed along paths of length m ≤ m_max by first calculating U_J x = {S_J[∅]x , U[Λ_J^1]x} and iteratively applying U_J to each U[Λ_J^m]x for increasing m ≤ m_max. This algorithm is illustrated in Figure 2.

A scattering transform thus appears to be a deep convolution network [18], with some particularities. As opposed to most convolution networks, a scattering network outputs coefficients S_J[p]x at all layers m ≤ m_max, and not just at the last layer m_max [18]. The next section proves that the energy of the deepest layer converges quickly to zero as m_max increases.

A second distinction is that filters are not learned from data but are predefined wavelets. Wavelets are stable with respect to deformations and provide sparse image representations. Stability to deformations is a strong condition which imposes a separation of the different image scales [22], hence the use of wavelets.

The modulus operator which recombines the real and imaginary parts can be interpreted as a pooling function in the context of convolution networks. The averaging by φ_{2^J} at the output is also a pooling operator which aggregates coefficients to build an invariant. It has been argued [7] that an average pooling loses information, which has motivated the use of other operators such as hierarchical maxima [8]. The high frequencies lost by the averaging are recovered as wavelet coefficients in the next layers, which explains the importance of using a multilayer network structure. As a result, the network only loses the phase of these wavelet coefficients. This phase may however be recovered from the modulus thanks to the wavelet transform redundancy. It has been proved [36] that the wavelet-modulus operator U_J x = {x ⋆ φ_{2^J} , |x ⋆ ψ_λ|}_{λ∈Λ_J} is invertible with a continuous inverse. It means that x, and hence the complex phase of each x ⋆ ψ_λ, can be reconstructed. Although U_J is invertible, the scattering transform is not exactly invertible because of instabilities. Indeed, applying U_J in (7) for m ≤ m_max computes all S_J[Λ_J^m]x for m ≤ m_max, but also the last layer of internal network coefficients U[Λ_J^{m_max+1}]x. The next section proves that U[Λ_J^{m_max+1}]x can be neglected because its energy converges to zero as m_max increases. However, this introduces a small error which accumulates when iterating on U_J^{−1}.

Scattering coefficients can be displayed in the frequency plane. Let {Ω[p]}_{p∈Λ_J^m} be a partition of R². To each frequency ω ∈ R² we associate the path p(ω) such that ω ∈ Ω[p]. We display S_J[p(ω)]x(u), which is a piecewise constant function of ω ∈ R², for each position u and each m = 1, 2. For m = 1, each Ω[2^{j1} r1] is chosen to be a quadrant rotated by r1, to approximate the frequency support of ψ̂_{2^{j1} r1}, whose size is proportional to ‖ψ_{2^{j1} r1}‖² and hence to 2^{j1}. This defines a partition of a dyadic annulus, illustrated in Figure 3(a). For m = 2, Ω[2^{j1} r1, 2^{j2} r2] is obtained by subdividing Ω[2^{j1} r1], as illustrated in Figure 3(b). Each Ω[2^{j1} r1] is subdivided along the radial axis into quadrants indexed by j2. Each of these quadrants is itself subdivided along the angular variable into rotated quadrants Ω[2^{j1} r1, 2^{j2} r2] having a surface proportional to ‖ |ψ_{2^{j1} r1}| ⋆ ψ_{2^{j2} r2} ‖².

Figure 4 shows the Fourier transform of two images, and the amplitude of their scattering coefficients of orders m = 1 and m = 2, at a maximum scale 2^J equal to the image size. A scattering coefficient over a quadrant Ω[2^{j1} r1] gives an approximation of the Fourier transform energy over the support of ψ̂_{2^{j1} r1}. Although the top and bottom images are very different, they have the same order m = 1 scattering coefficients. Here, first-order coefficients are not sufficient to discriminate between two very different images. However, coefficients of order m = 2 succeed in discriminating between the two images. The top image has wavelet coefficients which are much more sparse than the bottom image. As a result, Section 3.1 shows that its second-order scattering coefficients have a larger amplitude.
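The layer-by-layer computation described by (6) and (7), and illustrated in Figure 2, can be sketched as follows. This is a plain, unsubsampled version that keeps every path, whereas Section 3.4 restricts the computation to frequency-decreasing paths; the helper names are illustrative.

```python
import numpy as np

def propagator_UJ(u, psi_hat, phi_hat):
    """One application of the wavelet modulus propagator (6): returns the averaged output
    u * phi_{2^J} and the modulus wavelet coefficients {|u * psi_lambda|} of the next layer."""
    u_hat = np.fft.fft2(u)
    out = np.fft.ifft2(u_hat * phi_hat).real
    nxt = {lam: np.abs(np.fft.ifft2(u_hat * p)) for lam, p in psi_hat.items()}
    return out, nxt

def scattering_network(x, psi_hat, phi_hat, m_max=2):
    """Iterate U_J as in (7), outputting S_J[p]x at every layer m <= m_max."""
    S = {}
    layer = {(): x.astype(float)}              # U[empty path]x = x
    for m in range(m_max + 1):
        next_layer = {}
        for p, u in layer.items():
            S[p], children = propagator_UJ(u, psi_hat, phi_hat)
            if m < m_max:                      # the deepest internal layer is discarded
                for lam, v in children.items():
                    next_layer[p + (lam,)] = v
        layer = next_layer
    return S                                   # keys are paths p, values are the images S_J[p]x
```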
Fig. 2. A scattering propagator U_J applied to x computes each U[λ1]x = |x ⋆ ψ_{λ1}| and outputs S_J[∅]x = x ⋆ φ_{2^J} (black arrow). Applying U_J to each U[λ1]x computes all U[λ1, λ2]x and outputs S_J[λ1]x = U[λ1]x ⋆ φ_{2^J} (black arrows). Applying U_J iteratively to each U[p]x outputs S_J[p]x = U[p]x ⋆ φ_{2^J} (black arrows) and computes the next path layer.
Fig. 3. For m = 1 and m = 2, a scattering is displayed as piecewise constant functions equal to S_J[p]x(u) over each frequency subset Ω[p]. (a): For m = 1, each Ω[2^{j1} r1] is a rotated quadrant of surface proportional to 2^{j1}. (b): For m = 2, each Ω[2^{j1} r1] is subdivided into a partition of subsets Ω[2^{j1} r1, 2^{j2} r2].
Higher-order coefficients are not displayed because they have a negligible energy, as explained in Section 3.

3 SCATTERING PROPERTIES

A convolution network is highly non-linear, which makes it difficult to understand how the coefficient values relate to the signal properties. For a scattering network, Section 3.1 analyzes the coefficient properties and optimizes the network architecture. For texture analysis, the scattering transform of stationary processes is studied in Section 3.2. The regularity of scattering coefficients can be exploited to reduce the size of a scattering representation, by using a cosine transform, as shown in Section 3.3. Finally, Section 3.4 provides a fast computational algorithm.

3.1 Energy Conservation and Deformation Stability

A windowed scattering S_J is computed with a cascade of wavelet modulus operators U_J, and its properties thus depend upon the wavelet transform properties. Conditions are given on wavelets to define a scattering transform which is contracting and preserves the signal norm. This analysis shows that ‖S_J[p]x‖ decreases quickly as the length of p increases, and is non-negligible only over a particular subset of frequency-decreasing paths. Reducing computations to these paths defines a convolution network with much fewer internal and output coefficients.
Fig. 4. Scattering display of two images having the same first order scattering coefficients. (a) Image x(u). (b) Fourier modulus |x̂(ω)|. (c) Scattering S_J[p(ω)]x for m = 1. (d) Scattering S_J[p(ω)]x for m = 2.
The norm of a sequence of transformed signals Rx = {g_n}_{n∈Ω} is defined by ‖Rx‖² = Σ_{n∈Ω} ‖g_n‖². If x is real and there exists ε > 0 such that for all ω ∈ R²

1 − ε ≤ |φ̂(ω)|² + (1/2) Σ_{j=1}^{∞} Σ_{r∈G} |ψ̂(2^{−j} rω)|² ≤ 1 ,   (8)

then applying the Plancherel formula proves that W_J x = {x ⋆ φ_{2^J} , x ⋆ ψ_λ}_{λ∈Λ_J} satisfies

(1 − ε) ‖x‖² ≤ ‖W_J x‖² ≤ ‖x‖² ,   (9)

with ‖W_J x‖² = ‖x ⋆ φ_{2^J}‖² + Σ_{λ∈Λ_J} ‖x ⋆ ψ_λ‖². In the following we suppose that ε < 1, and hence that the wavelet transform is a contracting and invertible operator, with a stable inverse. If ε = 0, then W_J is unitary. The Morlet wavelet ψ in Figure 1 satisfies (8) with ε = 0.25, together with φ(u) = C exp(−|u|²/(2σ0²)) with σ0 = 0.7 and C adjusted so that ∫ φ(u) du = 1. These functions are used in all classification applications. Rotated and dilated cubic spline wavelets are constructed in [22] to satisfy (8) with ε = 0.

The modulus is contracting in the sense that | |a| − |b| | ≤ |a − b|. Since U_J x = {x ⋆ φ_{2^J} , |x ⋆ ψ_λ|}_{λ∈Λ_J} is obtained with a wavelet transform W_J followed by a modulus, which are both contractive, it is also contractive:

‖U_J x − U_J y‖ ≤ ‖x − y‖ .

If W_J is unitary, then U_J also preserves the signal norm: ‖U_J x‖ = ‖x‖.

Let P_J = ∪_{m≥0} Λ_J^m be the set of all possible paths of any length m ∈ N. The norm of S_J[P_J]x = {S_J[p]x}_{p∈P_J} is ‖S_J[P_J]x‖² = Σ_{p∈P_J} ‖S_J[p]x‖². Since S_J iteratively applies U_J, which is contractive, it is also contractive:

‖S_J x − S_J y‖ ≤ ‖x − y‖ .

If W_J is unitary, then ε = 0 in (9) and, for appropriate wavelets, it is proved in [22] that

‖S_J x‖² = Σ_{m=0}^{∞} ‖S_J[Λ_J^m]x‖² = Σ_{m=0}^{∞} Σ_{p∈Λ_J^m} ‖S_J[p]x‖² = ‖x‖² .   (10)

This result uses the fact that U_J preserves the signal norm and that U_J U[Λ_J^m]x = {S_J[Λ_J^m]x , U[Λ_J^{m+1}]x}. Proving (10) is thus equivalent to proving that the energy of the last network layer converges to zero when m_max increases:

lim_{m_max→∞} ‖U[Λ_J^{m_max}]x‖² = lim_{m_max→∞} Σ_{m=m_max}^{∞} ‖S_J[Λ_J^m]x‖² = 0 .   (11)

This result is also important for numerical applications because it explains why the network depth can be limited with a negligible loss of signal energy.

The scattering energy conservation also provides a relation between the network energy distribution and the wavelet transform sparsity. For p = (λ1, ..., λm), we denote λ + p = (λ, λ1, ..., λm). Applying (10) to U[λ]x = |x ⋆ ψ_λ| instead of x, and separating the first term for m = 0, yields

‖S_J[λ]x‖² + Σ_{m=1}^{∞} Σ_{p∈Λ_J^m} ‖S_J[λ + p]x‖² = ‖x ⋆ ψ_λ‖² .   (12)

But S_J[λ]x = |x ⋆ ψ_λ| ⋆ φ_{2^J} is a local L¹(R²) norm, and one can prove [22] that lim_{J→∞} 2^{2J} ‖S_J[λ]x‖² = ‖φ‖² ‖x ⋆ ψ_λ‖₁². The more sparse x ⋆ ψ_λ(u), the smaller ‖x ⋆ ψ_λ‖₁, and (12) then implies that the total energy Σ_{m=1}^{∞} Σ_{p∈Λ_J^m} ‖S_J[λ + p]x‖² of higher-order scattering terms is larger. Figure 4 shows two images having the same first order scattering coefficients, but the top image is piecewise regular and hence has wavelet coefficients which are much more sparse than the uniform texture at the bottom. As a result, the top image has second order scattering coefficients of larger amplitude than the bottom one. For typical images, as in the Caltech101 dataset [10], Table 1 shows that the scattering energy has an exponential decay as a function of the path length m. As proved by (11), the energy of scattering coefficients converges to 0 as m increases, and is below 1% for m ≥ 3.

The energy conservation (10) is proved by showing that the scattering energy ‖U[p]x‖² propagates towards lower frequencies along frequency-decreasing paths.
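The decay stated by (10)-(11) and reported in Table 1 can be inspected numerically by summing the output energy per path length. A rough sketch operating on the dictionary returned by the `scattering_network` sketch of Section 2.3; with only approximately admissible wavelets, the total output energy only approximately equals ‖x‖².

```python
import numpy as np

def energy_per_order(S, m_max=2):
    """Fraction of the scattering output energy sum_p ||S_J[p]x||^2 carried by each order m."""
    e = np.zeros(m_max + 1)
    for p, s in S.items():
        e[len(p)] += float(np.sum(np.asarray(s, dtype=float) ** 2))
    return e / e.sum()
```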
Fig. 5. Two different textures having the same Fourier power spectrum. (a) Textures X(u). Top: Brodatz texture. Bottom: Gaussian process. (b) Same estimated power spectrum R̂X(ω). (c) Nearly the same scattering coefficients S_J[p]X for m = 1, with 2^J equal to the image width. (d) Different scattering coefficients S_J[p]X for m = 2.
Figure 6 compares the decay of the sorted variances E(|S_J[p]X − E(S_J[p]X)|²) with the variance decay in the Karhunen-Loève basis, computed on paths of length m = 1 and on paths of length m = 2, over the Caltech image dataset with a Morlet wavelet. The variance decay is much faster in the Karhunen-Loève basis, which shows that there is a strong correlation between scattering coefficients of the same path length.

A change of variables proves that a rotation and scaling X_{2^l r}(u) = X(2^{−l} r u) produces a rotation and inverse scaling on the path variable p = (2^{j1} r1, ..., 2^{jm} rm):

SX_{2^l r}(p) = SX(2^l r p)   where   2^l r p = (2^{l+j1} r r1, ..., 2^{l+jm} r rm) .

If images are randomly rotated and scaled by 2^l r^{−1}, then the path p is randomly rotated and scaled [27]. In this case, the scattering transform has stationary variations along the scale and rotation variables. This suggests approximating the Karhunen-Loève basis by a cosine basis along these variables. Let us parameterize each rotation r by its angle θ ∈ [0, 2π]. A path p is then parameterized by ([j1, θ1], ..., [jm, θm]).

Since scattering coefficients are computed along frequency-decreasing paths for which −J ≤ jk < jk−1, to reduce boundary effects a separable cosine transform is computed along the variables j̃1 = j1, j̃2 = j2 − j1, ..., j̃m = jm − jm−1, and along each angle variable θ1, θ2, ..., θm. We define the cosine scattering transform as the coefficients obtained by applying this separable discrete cosine transform along the scale and angle variables of S_J[p]X(u), for each u and each path length m. Figure 6 shows that the cosine scattering coefficients have variances for m = 1 and m = 2 which decay nearly as fast as the variances in the Karhunen-Loève basis. It shows that a DCT across scales and orientations is nearly optimal to decorrelate scattering coefficients. Lower-frequency DCT coefficients absorb most of the scattering energy. On natural images, more than 99% of the scattering energy is absorbed by the 1/3 lowest frequency cosine scattering coefficients.

3.4 Fast Scattering Computations

Section 3.1 shows that the scattering energy is concentrated along frequency-decreasing paths p = (2^{jk} rk)_k satisfying 2^{−J} ≤ 2^{jk+1} < 2^{jk}. If the wavelet transform is computed along C directions, then the total number of frequency-decreasing paths of length m is C^m (J choose m). Since φ_{2^J} is a low-pass filter, S_J[p]x(u) = U[p]x ⋆ φ_{2^J}(u) can be uniformly sampled at intervals α 2^J, with α = 1 or α = 1/2. If x(n) is a discrete image with N pixels, then each S_J[p]x has 2^{−2J} α^{−2} N coefficients. The scattering representation along all frequency-decreasing paths of length at most m thus has a total number of coefficients equal to N_J = N α^{−2} 2^{−2J} Σ_{q=0}^{m} C^q (J choose q). This reduced scattering representation is computed by a cascade of convolutions, modulus, and sub-samplings, with O(N log N) operations. The final DCT further compresses the resulting representation.

Let us recall from Section 2.3 that scattering coefficients are computed by iteratively applying the one-step propagator U_J. To compute subsampled scattering coefficients along frequency-decreasing paths, this propagator is truncated. For any scale 2^k, U_{k,J} transforms a signal x(2^k α n) into

U_{k,J} x = { x ⋆ φ_{2^J}(2^J α n) , |x ⋆ ψ_{2^j r}(2^j α n)| }_{−J<j≤k, r∈G+} .   (16)

The algorithm computes subsampled scattering coefficients by iterating on this propagator.

Algorithm 1 Reduced Scattering Transform
  Compute U_{0,J}(x)
  Output x ⋆ φ_{2^J}(2^J α n)
  for m = 1 to m_max − 1 do
    for all 0 ≥ j1 > ... > jm > −J do
      for all (r1, ..., rm) ∈ G+^m do
        if m = m_max − 1 then
          Compute | ||x ⋆ ψ_{2^{j1} r1}| ⋆ ...| ⋆ ψ_{2^{jm} rm}| ⋆ φ_{2^J}(2^J α n)
        else
          Compute U_{jm,J}( | ||x ⋆ ψ_{2^{j1} r1}| ⋆ ...| ⋆ ψ_{2^{jm} rm}| )
        end if
        Output | ||x ⋆ ψ_{2^{j1} r1}| ⋆ ...| ⋆ ψ_{2^{jm} rm}| ⋆ φ_{2^J}(2^J α n)
      end for
    end for
  end for

If x is a signal of size P, then FFTs compute U_{k,J} x with O(P log P) operations. A reduced scattering transform thus computes its N_J = N α^{−2} 2^{−2J} Σ_{m=0}^{m_max} C^m (J choose m) coefficients with O(N_J log N) operations. If m_max = 2, then N_J = N α^{−2} 2^{−2J} (CJ + C² J(J − 1)/2). It decreases exponentially when the scale 2^J increases.

Scattering coefficients are decorrelated with a separable DCT along each scale variable j̃1 = j1, j̃2 = j2 − j1, ..., j̃m = jm − jm−1 and each rotation angle variable θ1, θ2, ..., θm, which also requires O(N_J log N) operations. For natural images, more than 99.5% of the total signal energy is carried by the resulting N_J/2 cosine scattering coefficients of lower frequencies.

Numerical computations in this paper are performed by rotating wavelets along C = 6 directions, for scattering representations of maximum order m_max = 2. The resulting reduced cosine scattering representation has at most three times as many coefficients as a dense SIFT representation. SIFT represents small blocks of 4² pixels with 8 coefficients. A cosine scattering representation represents each image block of 2^{2J} pixels with N_J 2^{2J}/(2N) = (CJ + C² J(J − 1)/2)/2 coefficients, which is equal to 24 for C = 6 and J = 2. The cosine scattering transform is thus three times the size of SIFT for J = 2, but as J increases, the relative size decreases. If J = 3, then the size of a cosine scattering representation is twice the size of a SIFT representation, but for J = 7 it is about 20 times smaller.
Fig. 6. Sorted variances for m = 1 (left) and m = 2 (right). (A): Sorted variances of scattering coefficients. (B): Sorted variances of DCT scattering coefficients. (C): Variances in the scattering Karhunen-Loève basis.
A scattering transform eliminates the image variability due to translation and is stable to deformations. The resulting classification properties are studied with a PCA and an SVM classifier applied to scattering representations computed with a Morlet wavelet. State-of-the-art results are obtained for hand-written digit recognition and for texture discrimination.

4.1 PCA Affine Scattering Space Selection

Although discriminant classifiers such as SVM have better asymptotic properties than generative classifiers [26], the situation can be inverted for small training sets. We introduce a simple robust generative classifier based on affine space models computed with a PCA. Applying a DCT on scattering coefficients has no effect on any linear classifier because it is a linear orthogonal transform. However, keeping the 50% lower frequency cosine scattering coefficients reduces computations and has a negligible effect on classification results. The classification algorithm is described directly on scattering coefficients to simplify explanations. Each signal class is represented by a random vector X_k, whose realizations are images of N pixels in the class.

Let E(S_J X) = {E(S_J[p]X(u))}_{p,u} be the family of N_J expected scattering values, computed along all frequency-decreasing paths of length m ≤ m_max and all subsampled positions u = α 2^J n. The difference S_J X_k − E(S_J X_k) is approximated by its projection in a linear space of low dimension d ≪ N_J. The covariance matrix of S_J X_k is a matrix of size N_J². Let V_{d,k} be the linear space generated by the d PCA eigenvectors of this covariance matrix having the largest eigenvalues. Among all linear spaces of dimension d, this is the space which approximates S_J X_k − E(S_J X_k) with the smallest expected quadratic error. This is equivalent to approximating S_J X_k by its projection on an affine approximation space:

A_{d,k} = E{S_J X_k} + V_{d,k} .

The class k̂(X) assigned to a signal X is the one whose affine space yields the best approximation:

k̂(X) = argmin_{k≤K} ‖S_J X − P_{A_{d,k}}(S_J X)‖ .   (17)

The minimization of this distance has similarities with the minimization of a tangential distance [12], in the sense that we remove the principal scattering directions of variability to evaluate the distance. However it is much simpler, since it does not evaluate a tangential space which depends upon S_J x. Let V_{d,k}^⊥ be the orthogonal complement of V_{d,k}, corresponding to directions of lower variability. This distance is also equal to the norm of the difference between S_J x and the average class "template" E(S_J X_k), projected in V_{d,k}^⊥:

‖S_J x − P_{A_{d,k}}(S_J x)‖ = ‖ P_{V_{d,k}^⊥}( S_J x − E(S_J X_k) ) ‖ .   (18)

Minimizing the affine space approximation error is thus equivalent to finding the class centroid E(S_J X_k) which is the closest to S_J x, without taking into account the first d principal variability directions. The d principal directions of the space V_{d,k} result from deformations and from structural variability. The projection P_{A_{d,k}}(S_J x) is the optimum linear prediction of S_J x from these d principal modes. The selected class has the smallest prediction error.

This affine space selection is effective if S_J X_k − E(S_J X_k) is well approximated by a projection in a low-dimensional space. This is the case if realizations of X_k are translations and limited deformations of a single template. Indeed, the Lipschitz continuity condition implies that small deformations are linearized by the scattering transform. Hand-written digit recognition is an example. This is also valid for stationary textures where S_J X_k has a small variance, which can be interpreted as structural variability.

The dimension d must be adjusted so that S_J X_k has a better approximation in the affine space A_{d,k} than in the affine spaces A_{d,k′} of other classes k′ ≠ k. This is a model selection problem, which requires optimizing the dimension d in order to avoid over-fitting [5].
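A minimal sketch of this affine space selection, for scattering vectors already arranged as rows of a matrix; the class name, array layout, and the way d is fixed are illustrative choices, and d would in practice be set by the cross-validation described next.

```python
import numpy as np

class AffinePCAClassifier:
    """Affine space model A_{d,k} = E(S_J X_k) + V_{d,k} per class, with classification by the
    smallest approximation error, as in (17)-(18)."""

    def __init__(self, d):
        self.d = d                                     # number of principal directions removed
        self.means, self.bases = {}, {}

    def fit(self, X, y):
        """X: (n_samples, n_features) scattering vectors, y: class labels."""
        for k in np.unique(y):
            Xk = X[y == k]
            mu = Xk.mean(axis=0)
            # d leading eigenvectors of the class covariance, via an SVD of the centred data
            _, _, Vt = np.linalg.svd(Xk - mu, full_matrices=False)
            self.means[k], self.bases[k] = mu, Vt[: self.d].T
        return self

    def predict(self, X):
        labels, errors = list(self.means), []
        for k in labels:
            r = X - self.means[k]                            # S_J x - E(S_J X_k)
            r = r - (r @ self.bases[k]) @ self.bases[k].T    # projection onto V_{d,k}^perp
            errors.append((r ** 2).sum(axis=1))              # squared residual norm, eq. (18)
        return np.asarray(labels)[np.argmin(errors, axis=0)]
```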
The invariance scale 2^J must also be optimized. When the scale 2^J increases, translation invariance increases, but it comes with a partial loss of information which brings the representations of different signals closer. One can prove [22] that for any x and x′

‖S_{J+1} x − S_{J+1} x′‖ ≤ ‖S_J x − S_J x′‖ .

When 2^J goes to infinity, this scattering distance converges to a non-zero value. To classify deformed templates such as hand-written digits, the optimal 2^J is of the order of the maximum pixel displacements due to translations and deformations. In a stochastic framework where x and x′ are realizations of stationary processes, S_J x and S_J x′ converge to the expected scattering transforms Sx and Sx′. In order to classify stationary processes such as textures, the optimal scale is the maximum scale, equal to the image width, because it minimizes the variance of the windowed scattering estimator.

A cross-validation procedure is used to find the dimension d and the scale 2^J which yield the smallest classification error. This error is computed on a subset of the training images which is not used to estimate the covariance matrix for the PCA calculations.

As in the case of SVM, the performance of the affine PCA classifier can be improved by equalizing the descriptor space. Table 1 shows that scattering vectors have an unequal energy distribution along their path variables, in particular as the order varies. A robust equalization is obtained by renormalizing each S_J[p]X(u) by the maximum ‖S_J[p]X_i‖ = (Σ_u |S_J[p]X_i(u)|²)^{1/2} over all training signals X_i:

S_J[p]X(u) / sup_{X_i} ‖S_J[p]X_i‖ .   (19)

To simplify notations, we still write S_J X for this normalized scattering vector.

Affine space scattering models can be interpreted as generative models computed independently for each class. As opposed to discriminative classifiers such as SVM, they do not estimate cross-terms between classes, apart from the choice of the model dimensionality d. Such estimators are particularly effective for a small number of training samples per class. Indeed, if there are few training samples per class, then variance terms dominate bias errors when estimating off-diagonal covariance coefficients between classes [4].

An affine space approximation classifier can also be interpreted as a robust quadratic discriminant classifier obtained by coarsely quantizing the eigenvalues of the inverse covariance matrix. For each class, the eigenvalues of the inverse covariance are set to 0 in V_{d,k} and to 1 in V_{d,k}^⊥, where d is adjusted by cross-validation. This coarse quantization is justified by the poor estimation of covariance eigenvalues from few training samples. These affine space models are typically applied to distributions of scattering vectors which are non-Gaussian, where a Gaussian Fisher discriminant can lead to important errors.

4.2 Handwritten Digit Recognition

The MNIST database of hand-written digits is an example of structured pattern classification, where most of the intra-class variability is due to local translations and deformations. It comprises at most 60000 training samples and 10000 test samples. If the training dataset is not augmented with deformations, the state of the art was achieved by deep-learning convolutional networks [29], deformation models [15], and dictionary learning [25]. These results are improved by a scattering classifier.

All computations are performed on the reduced cosine scattering representation described in Section 3.3, which keeps the lower-frequency half of the coefficients. Table 4 gives classification errors on a fixed set of test images, depending upon the size of the training set, for different representations and classifiers. The affine space selection of Section 4.1 is compared with an SVM classifier using RBF kernels, which are computed using Libsvm [9], and whose variance is adjusted by standard cross-validation over a subset of the training set. The SVM classifier is trained with a renormalization which maps all coefficients to [−1, 1]. The PCA classifier is trained with the renormalization (19). The first two columns of Table 4 show that classification errors are much smaller with an SVM than with the PCA algorithm when applied directly on the image. The 3rd and 4th columns give the classification error obtained with a PCA or an SVM classifier applied to the modulus of a windowed Fourier transform. The spatial size 2^J of the window is optimized with a cross-validation which yields a minimum error for 2^J = 8. It corresponds to the largest pixel displacements due to translations or deformations in each class. Removing the complex phase of the windowed Fourier transform yields a locally invariant representation whose high frequencies are, however, unstable to deformations, as explained in Section 2.1. Suppressing this local translation variability improves the classification rate by a factor 3 for a PCA and by almost 2 for an SVM. The comparison between PCA and SVM confirms the fact that generative classifiers can outperform discriminative classifiers when training samples are scarce [26]. As the training set size increases, the bias-variance trade-off turns in favor of the richer SVM classifiers, independently of the descriptor.

Columns 6 and 8 give the PCA classification results obtained with a windowed scattering representation for m_max = 1 and m_max = 2. The cross-validation also chooses 2^J = 8. For the digit '3', Figure 7 displays the 4-by-4 array of normalized scattering vectors. For each u = 2^J(n1, n2) with 1 ≤ n_i ≤ 4, the scattering vector S_J[p]X(u) is displayed for paths of length m = 1 and m = 2, as circular frequency energy distributions following Section 2.3.

Increasing the scattering order from m_max = 1 to m_max = 2 reduces errors by about 30%, which shows that second order coefficients carry important information even at a relatively small scale 2^J = 8.
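The renormalization (19) applied before the PCA classifier amounts to dividing each path's coefficients by the largest norm observed on the training set. A minimal sketch, where the array layout and helper name are our own choices:

```python
import numpy as np

def renormalize_scattering(train, test):
    """Eq. (19): divide S_J[p]X(u) by sup_i ||S_J[p]X_i|| computed over the training signals.
    Both arrays have shape (n_signals, n_paths, n_positions)."""
    sup_norms = np.sqrt((train ** 2).sum(axis=2)).max(axis=0)   # one supremum per path p
    scale = np.maximum(sup_norms, 1e-12)[None, :, None]
    return train / scale, test / scale
```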
Fig. 7. (a): Image X(u) of a digit '3'. (b): Array of scattering vectors S_J[p]X(u), for m = 1 and u sampled at intervals 2^J = 8. (c): Scattering vectors S_J[p]X(u), for m = 2.
TABLE 4
MNIST classification results (percentage of errors).

Training   x              Wind. Four.    Scat. m_max=1    Scat. m_max=2    Conv.
size       PCA    SVM     PCA    SVM     PCA     SVM      PCA     SVM      Net.
300        14.5   15.4    7.35   7.4     5.7     8        4.7     5.6      7.18
1000       7.2    8.2     3.74   3.74    2.35    4        2.3     2.6      3.21
2000       5.8    6.5     2.99   2.9     1.7     2.6      1.3     1.8      2.53
5000       4.9    4       2.34   2.2     1.6     1.6      1.03    1.4      1.52
10000      4.55   3.11    2.24   1.65    1.5     1.23     0.88    1        0.85
20000      4.25   2.2     1.92   1.15    1.4     0.96     0.79    0.58     0.76
40000      4.1    1.7     1.85   0.9     1.36    0.75     0.74    0.53     0.65
60000      4.3    1.4     1.80   0.8     1.34    0.62     0.7     0.43     0.53
Fig. 8. Examples of textures from the CUReT database with normalized mean and variance. Each row corresponds to
a different class, showing intra-class variability in the form of stochastic variability and changes in pose and illumination.
TABLE 9
Percentage of errors on CUReT for different training sizes.
Fig. 9. (a): Example of a CUReT texture X(u). (b): Scattering coefficients S_J[p]X, for m = 1 and 2^J equal to the image width. (c): Scattering coefficients S_J[p]X(u), for m = 2.
For texture classification, the scale 2^J is set to the image width to reduce the scattering estimation variance. Indeed, contrary to a power spectrum estimation, the variance of the scattering vector decreases when 2^J increases. Figure 9 displays the scattering coefficients S_J[p]X of order m = 1 and m = 2 of a CUReT textured image X. When m_max = 1, the error drops to 0.5%, although first-order scattering coefficients essentially depend upon second order moments, as the Fourier spectrum does. This is probably due to the fact that image textures have a spectrum which typically decays like |ω|^{−α}. For such a spectrum, an estimation over dyadic frequency bands provides a better bias versus variance trade-off than a windowed Fourier spectrum [1]. For m_max = 2, the error further drops to 0.2%. Indeed, scattering coefficients of order m = 2 depend upon moments of order 4, which are necessary to differentiate textures having the same second order moments, as in Figure 5. The dimension of the affine approximation space model is d = 16, the intra-class normalized approximation error is σ_d² = 2.5 · 10^{−1}, and the separation ratio is λ_d = 3 for m_max = 2.

The PCA classifier provides a partial rotation invariance by removing principal components. It averages scattering coefficients along path rotation parameters, which comes with a loss of discriminability. However, a more efficient rotation invariant texture classification is obtained by cascading this translation invariant scattering with a second, rotation invariant scattering [24]. It transforms each layer of the translation invariant scattering network with new wavelet convolutions along rotation parameters, followed by modulus and average pooling operators, which are cascaded. A combined translation and rotation scattering yields a translation and rotation invariant representation which is stable to deformations [22].