2020 C H Chen - Handbook of Pattern Recognition and Computer Vision (6th Edition)
HANDBOOK OF PATTERN RECOGNITION
AND COMPUTER VISION
6th Edition
editor
C H Chen
University of Massachusetts Dartmouth, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center,
Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from
the publisher.
Printed in Singapore
PREFACE TO THE 6TH EDITION
processing, the readers may be interested in other recent work by Dr. Fletcher reported in a chapter on Using Prior Information to Enhance Sensitivity of Longitudinal Brain Change Computation, in Frontiers of Medical Imaging (World Scientific Publishing, 2015). Chapter 2.5, Automatic Segmentation of IVUS Images Based on Temporal Texture Analysis, is devoted to a more traditional approach to intravascular ultrasonic image analysis using both textural and spatial (or multi-image) information for the analysis and delineation of lumen and external elastic membrane boundaries. The use of multiple images in a sequence, processed by the discrete wavelet transform, clearly provides better segmentation results than many of those reported in the literature. We take this traditional approach because the available data set for the study is limited.
Chapter 2.6 by F. Liwicki and Prof. M. Liwicki on Deep Learning for Historical Documents provides an overview of the state of the art and recent methods in the area of historical document analysis, especially those using deep learning and Long Short-Term Memory networks. Historical documents differ from ordinary documents due to the presence of various artifacts. Their idea of detecting graphical elements in historical documents and their ongoing efforts toward the creation of large databases are also presented. Graphs allow us to simultaneously model the local features and the global structure of a handwritten signature in a natural and comprehensive way. In Chapter 2.7, Drs. Maergner, Riesen et al. thoroughly review two standard graph matching algorithms that can be readily integrated into an end-to-end signature verification framework. The system presented in the chapter is able to combine the complementary strengths of the structural approach and statistical models to improve signature verification performance. The reader may also be interested in the chapter in the 5th edition of the Handbook series, also by Prof. Riesen, on Graph Edit Distance: Novel Approximation Algorithms.
Chapter 2.8 by Prof. Huang and Dr. Hsieh is on Cellular Neural Network for Seismic Pattern Recognition. The discrete-time cellular neural network (DT-CNN) is used as an associative memory, and the associative memory is then used to recognize seismic patterns. The seismic patterns are the bright spot pattern and the right and left pinch-out patterns, which have the structure of gas and oil sand zones. In comparison with the Hopfield associative memory, the DT-CNN has better recovery capacity. The results of seismic image interpretation using the DT-CNN are also good. An automatic matching algorithm is necessary for a quick and accurate search of law enforcement face databases or surveillance cameras using a forensic sketch. In Chapter 2.9, Incorporating Facial Attributes in Cross-Modal Face Verification and Synthesis, by H. Kazemi et al., two deep learning frameworks are introduced to train a Deep Coupled Convolutional Neural Network for facial attribute guided sketch-to-photo matching and synthesis. The
C.H. Chen
February 3, 2020
CONTENTS
Dedication
Preface
A BRIEF INTRODUCTION
another hot topic in the 1970s (see e.g. [1,7]). For many years, researchers have considered the so-called "Hughes phenomenon" (see e.g. [8]), which states that for a finite training sample size there is a peak mean recognition accuracy. A larger feature dimension, however, may imply better separability among pattern classes. The support vector machine is one way to increase the number of features for better classification.
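The peaking phenomenon can be illustrated with a small simulation (the distributions, sample sizes, and classifier below are all invented for illustration, not taken from the text): two Gaussian classes differ only in their first few features, every additional feature is pure noise, and a plug-in nearest-mean classifier is designed from a small fixed training set. Test accuracy first rises and then degrades as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes differing only in the first d_signal features;
# any further features are pure noise. A nearest-mean classifier is
# designed from a small training set and evaluated on a large test set.
def accuracy(d_total, n_train=10, n_test=2000, d_signal=5, delta=0.6):
    mu0 = np.zeros(d_total)
    mu1 = np.zeros(d_total)
    mu1[:min(d_signal, d_total)] = delta
    X0 = rng.normal(mu0, 1.0, size=(n_train, d_total))
    X1 = rng.normal(mu1, 1.0, size=(n_train, d_total))
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)       # estimated class means
    Xt = np.vstack([rng.normal(mu0, 1.0, size=(n_test, d_total)),
                    rng.normal(mu1, 1.0, size=(n_test, d_total))])
    yt = np.r_[np.zeros(n_test), np.ones(n_test)]
    pred = np.linalg.norm(Xt - m1, axis=1) < np.linalg.norm(Xt - m0, axis=1)
    return float(np.mean(pred == yt))

for d in [2, 5, 10, 20, 50, 100]:
    print(d, round(accuracy(d), 3))
```

With the training sample size held fixed, the estimated means accumulate noise in the uninformative dimensions, which is exactly the finite-sample effect the Hughes phenomenon describes.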
Syntactic pattern recognition is a very different approach, with different feature extraction and decision-making processes. It consists of string grammar-based, tree grammar-based, and graph grammar-based methods. The most important book is by Fu [9]. More recent books include those by Bunke et al. [10] and Flasinski [11], the latter of which has over 1,000 entries in its bibliography. Structural pattern recognition (see e.g. [12]) can be more closely related to signal/image segmentation and can be closely linked to syntactic pattern recognition.
In more recent years, much research effort in pattern recognition has gone into sparse representation (see e.g. [13]) and tree classifiers such as random forests, as well as various forms of machine learning involving neural networks. In connection with sparse representation, compressive sensing (not data compression) has been very useful in some complex image and signal recognition problems (see e.g. [14]).
The development of computer vision largely evolved from digital image processing, with early frontier work by Rosenfeld [15] and many of his subsequent publications. A popular textbook on digital image processing is by Gonzalez and Woods [16]. Digital image processing by itself can only be considered low-level to mid-level computer vision. While image segmentation and edge extraction can be loosely considered middle-level computer vision, high-level computer vision, which is supposed to be like human vision, has not been well defined. Among the many textbooks in computer vision is the work of Haralick et al. listed in [17,18]. There have been many advances in computer vision, especially in the last 20 years (see e.g. [19]).
Machine learning has been the fundamental process in both pattern recognition and computer vision. In pattern recognition, many supervised, semi-supervised, and unsupervised learning approaches have been explored. The neural network approaches are particularly suitable for machine learning for pattern recognition. Multilayer perceptrons using the back-propagation training algorithm, kernel methods for support vector machines, self-organizing maps, and dynamically driven recurrent networks represent much of what neural networks have contributed to machine learning [4].
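As a toy illustration of the first of these, a small multilayer perceptron trained by back-propagation can learn the XOR function, the classic problem a single-layer perceptron cannot solve. The network size, learning rate, and iteration count below are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0.0, 1.0, (2, 4)); b1 = np.zeros(4)   # hidden layer (4 units)
W2 = rng.normal(0.0, 1.0, (4, 1)); b2 = np.zeros(1)   # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # backward pass: gradients of the mean-squared error through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates (learning rate 1.0)
    W2 -= h.T @ d_out; b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h;   b1 -= d_h.sum(axis=0)

print(round(losses[0], 4), round(losses[-1], 4))  # training loss decreases
```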
The recently popular deep learning neural networks started with a complex extension of the multilayered neural network by LeCun et al. [20] and expanded into various versions of convolutional neural networks (see e.g. [21]). Deep learning implies a large amount of learning, with many parameters (weights), on a large data set. As expected, some performance improvement over traditional neural network methods can be achieved. As an emphasis of this Handbook edition, we have included several chapters dealing with deep learning. Clearly, deep learning, as a renewed effort in neural networks since the mid-nineties, is among the important steps toward mature artificial intelligence. However, we take a balanced view in this book by placing as much importance on past work in pattern recognition and computer vision as on new approaches like deep learning. We believe that any work built on a solid mathematical and/or physical foundation will have long-lasting value. Examples are the Bayes decision rule, the nearest-neighbor decision rule, snake-based image segmentation models, etc.
Though theoretical work on pattern recognition and computer vision has moved at a fairly slow but steady pace, software and hardware development has progressed much faster, thanks to ever-increasing computer power. MATLAB alone, for example, has served software needs so well that it has diminished the need for dedicated software systems. Rapid development in powerful sensors and scanners has made possible many real-time or near-real-time uses of pattern recognition and computer vision. Throughout this Handbook series, we have included several chapters on hardware development. Perhaps continued and increased commercial and non-commercial needs have driven the rapid progress in hardware as well as software development.
References
1. K. Fukunaga, "Introduction to Statistical Pattern Recognition", second edition, Academic Press, 1990.
2. R. Duda, P. Hart, and D. G. Stork, "Pattern Classification", second edition, Wiley, 2001.
3. P. A. Devijver and J. Kittler, "Pattern Recognition: A Statistical Approach", Prentice-Hall, 1982.
4. S. Haykin, "Neural Networks and Learning Machines", third edition, 2008.
5. K. S. Fu and T. S. Yu, "Statistical Pattern Classification Using Contextual Information", Research Studies Press, a Division of Wiley, 1976.
6. G. Toussaint, "The use of context in pattern recognition", Pattern Recognition, Vol. 10, pp. 189-204, 1978.
7. C. H. Chen, "On information and distance measures, error bounds, and feature selection", Information Sciences, Vol. 10, 1976.
8. D. Landgrebe, "Signal Theory Methods in Multispectral Remote Sensing", Wiley, 2003.
9. K. S. Fu, "Syntactic Pattern Recognition and Applications", Prentice-Hall, 1982.
10. H. Bunke and A. Sanfeliu, editors, "Syntactic and Structural Pattern Recognition: Theory and Applications", World Scientific Publishing, 1992.
11. M. Flasinski, "Syntactic Pattern Recognition", World Scientific Publishing, 2019.
12. T. Pavlidis, "Structural Pattern Recognition", Springer, 1977.
13. Y. Chen, T. D. Tran, and N. M. Nasrabadi, "Sparse representation for target detection and classification in hyperspectral imagery", Chapter 19 of "Signal and Image Processing for Remote Sensing", second edition, edited by C. H. Chen, CRC Press, 2012.
14. M. L. Mekhalfi, F. Melgani, et al., "Land use classification with sparse models", Chapter 14 of "Compressive Sensing of Earth Observations", edited by C. H. Chen, CRC Press, 2017.
15. A. Rosenfeld, "Picture Processing by Computer", Academic Press, 1969.
16. R. C. Gonzalez and R. E. Woods, "Digital Image Processing", 4th edition, Prentice-Hall, 2018.
17. R. M. Haralick and L. G. Shapiro, "Computer and Robot Vision", Vol. 1, Addison-Wesley Longman, 2002.
18. R. M. Haralick and L. G. Shapiro, "Computer and Robot Vision", Vol. 2, Addison-Wesley Longman, 2002.
19. C. H. Chen, editor, "Emerging Topics in Computer Vision", World Scientific Publishing, 2012.
20. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning", Nature, Vol. 521, No. 7553, pp. 436-444, 2015.
21. I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning", MIT Press, 2016.
CHAPTER 1.1
Typical classification rules input sample data and output a classifier. This is different from the engineering paradigm in which an optimal operator is derived based on a model and cost function. If the model is uncertain, one can incorporate prior knowledge of the model with data to produce an optimal Bayesian operator. In classification, the model is a feature-label distribution and, if this is known, then a Bayes classifier provides optimal classification relative to the classification error. This chapter reviews optimal Bayesian classification, in which there is an uncertainty class of feature-label distributions governed by a prior distribution, a posterior distribution is derived by conditioning the prior on the sample, and the optimal Bayesian classifier possesses minimal expected error relative to the posterior. The chapter covers binary and multi-class classification, prior construction from scientific knowledge, and optimal Bayesian transfer learning, where the training data are augmented with data from a different source.
1 Introduction
process that generates the random sample constitutes the sampling distribution.
A classifier is optimal relative to a feature-label distribution and a collection C of classifiers if it is in C and its error is minimal among all classifiers in C:

$$\psi_{\rm opt} = \arg\min_{\psi \in C} \varepsilon[\psi]. \tag{1}$$
Suppose the feature-label distribution is unknown, but we know that it is characterized by an uncertainty class Θ of parameter vectors corresponding to feature-label distributions fθ(x, y) for θ ∈ Θ. Now suppose we have scientific knowledge regarding the features and labels, and this allows us to construct a prior distribution π(θ) governing the likelihood that θ ∈ Θ parameterizes the true feature-label distribution, where we assume the prior is uniform if we have no knowledge except that the true feature-label distribution lies in the uncertainty class. Then the optimal classifier, known as an intrinsically Bayesian robust classifier (IBRC), is defined by

$$\psi_{\rm IBR}^{\Theta} = \arg\min_{\psi \in C} {\rm E}_{\pi}[\varepsilon_{\theta}[\psi]], \tag{2}$$

where εθ[ψ] is the error of ψ relative to fθ(x, y) and Eπ is expectation relative to π (Refs. 11, 12). The IBRC is optimal on average over the uncertainty class, but it will not be optimal for any particular feature-label distribution unless it happens to be a Bayes classifier for that distribution.
Going further, suppose we have a random sample Sn = {(X1, Y1), . . . , (Xn, Yn)} of vector-label pairs drawn from the actual feature-label distribution. The posterior distribution is defined by π*(θ) = π(θ|Sn), and the optimal classifier, known as an optimal Bayesian classifier (OBC), denoted $\psi_{\rm OBC}^{\Theta}$, is defined by Eq. 2 with π* in place of π (Ref. 11). An OBC is an IBRC relative to the posterior, and an IBRC is an OBC with a null sample. Because we are generally interested in design using samples, we focus on the OBC. For both the IBRC and the OBC, we omit the Θ in the notation if the uncertainty class is clear from the context. Given our prior knowledge and the data, the OBC is the best classifier to use.
A sample-dependent minimum-mean-square-error (MMSE) estimator ε̂(Sn) of εθ[ψ] minimizes Eπ,Sn[|εθ[ψ] − ξ(Sn)|²] over all Borel measurable functions ξ(Sn), where Eπ,Sn denotes expectation with respect to the prior distribution and the sampling distribution. According to classical estimation theory, ε̂(Sn) is the conditional expectation given Sn. Thus,

$$\hat{\varepsilon}(S_n) = {\rm E}_{\pi}[\varepsilon_{\theta}[\psi] \,|\, S_n] = {\rm E}_{\pi^*}[\varepsilon_{\theta}[\psi]]. \tag{3}$$

In this light, Eπ*[εθ[ψ]] is called the Bayesian MMSE error estimator (BEE) and is denoted by ε̂Θ[ψ; Sn] (Refs. 13, 14). The OBC can be reformulated as

$$\psi_{\rm OBC}^{\Theta}(S_n) = \arg\min_{\psi \in C} \hat{\varepsilon}_{\Theta}[\psi; S_n]. \tag{4}$$

Besides minimizing Eπ,Sn[|εθ[ψ] − ξ(Sn)|²], the BEE is also an unbiased estimator of εθ[ψ] over the distribution of θ and Sn:

$${\rm E}_{S_n}[\hat{\varepsilon}_{\Theta}[\psi; S_n]] = {\rm E}_{S_n}[{\rm E}_{\pi}[\varepsilon_{\theta}[\psi] \,|\, S_n]] = {\rm E}_{\pi, S_n}[\varepsilon_{\theta}[\psi]]. \tag{5}$$
where ny is the number of y-labeled points (xi , yi ) in the sample, Sny is the subset
of sample points from class y, and the constant of proportionality can be found by
normalizing the integral of π ∗ (θy ) to 1. The term f (Sny |θy ) is called the likelihood
function.
Although we call π(θy ), y = 0, 1, the “prior probabilities,” they are not required
to be valid density functions. A prior is called “improper” if the integral of π(θy )
is infinite. When improper priors are used, Bayes’ rule does not apply. Hence,
assuming the posterior is integrable, we take Eq. 8 as the definition of the posterior
distribution, normalizing it so that its integral is equal to 1.
Owing to the posterior independence between c, θ0, and θ1, and the fact that εyθ[ψ], the error on class y, is a function of θy only, the BEE can be expressed as

$$\hat{\varepsilon}_{\Theta}[\psi; S_n] = {\rm E}_{\pi^*}[c]\, {\rm E}_{\pi^*}[\varepsilon_{\theta}^{0}[\psi]] + (1 - {\rm E}_{\pi^*}[c])\, {\rm E}_{\pi^*}[\varepsilon_{\theta}^{1}[\psi]], \tag{9}$$

where

$${\rm E}_{\pi^*}[\varepsilon_{\theta}^{y}[\psi]] = \int_{\Theta_y} \varepsilon_{\theta_y}^{y}[\psi]\, \pi^*(\theta_y)\, d\theta_y \tag{10}$$

is the posterior expectation of the error contributed by class y. Letting $\hat{\varepsilon}_{\Theta}^{y}[\psi; S_n] = {\rm E}_{\pi^*}[\varepsilon_{\theta}^{y}[\psi]]$, Eq. 9 takes the form

$$\hat{\varepsilon}_{\Theta}[\psi; S_n] = {\rm E}_{\pi^*}[c]\, \hat{\varepsilon}_{\Theta}^{0}[\psi; S_n] + (1 - {\rm E}_{\pi^*}[c])\, \hat{\varepsilon}_{\Theta}^{1}[\psi; S_n]. \tag{11}$$

The effective class-conditional density is defined by

$$f_{\Theta}(x|y) = \int_{\Theta_y} f_{\theta_y}(x|y)\, \pi^*(\theta_y)\, d\theta_y. \tag{12}$$
The following theorem provides the key representation for the BEE.

Theorem 1 [11]. If ψ(x) = 0 for x ∈ R0 and ψ(x) = 1 for x ∈ R1, where R0 and R1 are measurable sets partitioning $\mathbb{R}^d$, then, given a random sample Sn, the BEE is given by

$$\hat{\varepsilon}_{\Theta}[\psi; S_n] = {\rm E}_{\pi^*}[c] \int_{R_1} f_{\Theta}(x|0)\, dx + (1 - {\rm E}_{\pi^*}[c]) \int_{R_0} f_{\Theta}(x|1)\, dx$$
$$= \int_{\mathbb{R}^d} \left( {\rm E}_{\pi^*}[c]\, f_{\Theta}(x|0)\, I_{x \in R_1} + (1 - {\rm E}_{\pi^*}[c])\, f_{\Theta}(x|1)\, I_{x \in R_0} \right) dx, \tag{13}$$

$$\hat{\varepsilon}_{\Theta}^{y}[\psi; S_n] = {\rm E}_{\pi^*}[\varepsilon_{\theta}^{y}[\psi; S_n]] = \int_{\mathbb{R}^d} f_{\Theta}(x|y)\, I_{x \in R_{1-y}}\, dx. \tag{14}$$
In the unconstrained case in which the OBC is over all possible classifiers, Theorem 1 leads to a pointwise expression of the OBC by simply minimizing Eq. 13.

Theorem 2 [11]. The optimal Bayesian classifier over the set of all classifiers is given by

$$\psi_{\rm OBC}^{\Theta}(x) = \begin{cases} 0 & {\rm if}\ {\rm E}_{\pi^*}[c]\, f_{\Theta}(x|0) \ge (1 - {\rm E}_{\pi^*}[c])\, f_{\Theta}(x|1), \\ 1 & {\rm otherwise}. \end{cases} \tag{15}$$

The representation in the theorem is the representation for the Bayes classifier for the feature-label distribution defined by class-conditional densities fΘ(x|0) and fΘ(x|1), and class-0 prior probability Eπ*[c]; that is, the OBC is the Bayes classifier for the effective class-conditional densities. We restrict our attention to the OBC over all possible classifiers.
$$f_{\Theta}(j|y) = \frac{U_j^y + \alpha_j^y}{n_y + \sum_{i=1}^{b} \alpha_i^y}. \tag{18}$$

From Eq. 13,

$$\hat{\varepsilon}_{\Theta}[\psi; S_n] = \sum_{j=1}^{b} \left( {\rm E}_{\pi^*}[c]\, \frac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0}\, I_{\psi(j)=1} + (1 - {\rm E}_{\pi^*}[c])\, \frac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}\, I_{\psi(j)=0} \right). \tag{19}$$

In particular,

$$\hat{\varepsilon}_{\Theta}^{y}[\psi; S_n] = \sum_{j=1}^{b} \frac{U_j^y + \alpha_j^y}{n_y + \sum_{i=1}^{b} \alpha_i^y}\, I_{\psi(j)=1-y}. \tag{20}$$

From Eq. 15, using the effective class-conditional densities in Eq. 18 (Ref. 11),

$$\psi_{\rm OBC}^{\Theta}(j) = \begin{cases} 1 & {\rm if}\ {\rm E}_{\pi^*}[c]\, \dfrac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0} < (1 - {\rm E}_{\pi^*}[c])\, \dfrac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}, \\ 0 & {\rm otherwise}. \end{cases} \tag{21}$$
$$\varepsilon_{\rm OBC} = \sum_{j=1}^{b} \min\left\{ {\rm E}_{\pi^*}[c]\, \frac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0},\ (1 - {\rm E}_{\pi^*}[c])\, \frac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1} \right\}. \tag{22}$$
The OBC minimizes the BEE by minimizing each term in the sum of Eq. 19 by
assigning ψ(j) the class with the smaller constant scaling the indicator function.
The OBC is optimal on average across the posterior distribution, but its behavior
for any specific feature-label distribution is not guaranteed. Generally speaking, if
the prior is concentrated in the vicinity of the true feature-label distribution, then
results are good. But there is risk. If one uses a tight prior that is concentrated away
from the true feature-label distribution, results can be very bad. Correct knowledge
helps; incorrect knowledge hurts. Thus, prior construction is very important, and
we will return to that issue in a subsequent section.
Following an example in Ref. 3, suppose the true distribution is discrete with
c = 0.5,
p1 = p2 = p3 = p4 = 3/16,
p5 = p6 = p7 = p8 = 1/16,
q1 = q2 = q3 = q4 = 1/16,
q5 = q6 = q7 = q8 = 3/16.
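The discrete-model OBC of Eqs. 18 and 21 is simple enough to sketch directly on this example. In the sketch below, the uniform Dirichlet hyperparameters (αⱼʸ = 1), the assumption that Eπ*[c] = 0.5 is known, and the sample size n = 20 are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# True discrete model from the text: b = 8 bins, c = P(Y = 0) = 0.5,
# class-0 bin probabilities p_j and class-1 bin probabilities q_j.
b = 8
c_true = 0.5
p = np.array([3, 3, 3, 3, 1, 1, 1, 1]) / 16.0
q = np.array([1, 1, 1, 1, 3, 3, 3, 3]) / 16.0

def obc_discrete(x, y, alpha0, alpha1, Ec):
    """Discrete-model OBC (Eq. 21): psi(j) = 1 iff the class-1 term of
    Eq. 19 is the smaller one, using the effective densities of Eq. 18."""
    U0 = np.bincount(x[y == 0], minlength=b)   # class-0 bin counts
    U1 = np.bincount(x[y == 1], minlength=b)   # class-1 bin counts
    f0 = (U0 + alpha0) / (U0.sum() + alpha0.sum())   # f_Theta(j | 0)
    f1 = (U1 + alpha1) / (U1.sum() + alpha1.sum())   # f_Theta(j | 1)
    return (Ec * f0 < (1 - Ec) * f1).astype(int)

# Draw a sample of size n = 20 from the true model.
n = 20
y = (rng.random(n) < 1 - c_true).astype(int)
x = np.array([rng.choice(b, p=q if label else p) for label in y])

# Uniform Dirichlet hyperparameters alpha_j^y = 1 and known E[c] = 0.5.
psi = obc_discrete(x, y, np.ones(b), np.ones(b), 0.5)

# True error of the designed classifier, summed bin by bin.
true_err = float(np.sum(np.where(psi == 1, c_true * p, (1 - c_true) * q)))
print(psi.tolist(), round(true_err, 4))
```

For this model the Bayes error is 0.25, so the true error of any designed classifier lies at or above that floor, approaching it as the sample grows.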
Fig. 1. Average true errors, as a function of sample size, for the histogram classifier and for OBCs under five different prior distributions, shown together with the Bayes error (c = 0.5). [Reprinted from Dougherty, Optimal Signal Processing Under Uncertainty, SPIE Press, 2018.]
where $\hat{\mu}_y$ and $\hat{\Sigma}_y$ are the sample mean and sample covariance for class y. Similar results are found in Ref. 15.
The posterior can be expressed as
Fig. 2. The classifiers ψOBC, ψIBR, and the plug-in classifier ψPI in the (x1, x2) feature space, with level curves of the two class-conditional distributions corresponding to the expected parameters.
where Ψy is the scale matrix in Eq. 35. The OBC discriminant becomes

$$D_{\rm OBC}(x) = K \left( 1 + \frac{1}{k_0} (x - m_0^*)^T \Psi_0^{-1} (x - m_0^*) \right)^{k_0+d} - \left( 1 + \frac{1}{k_1} (x - m_1^*)^T \Psi_1^{-1} (x - m_1^*) \right)^{k_1+d}, \tag{37}$$

where

$$K = \left( \frac{1 - {\rm E}_{\pi^*}[c]}{{\rm E}_{\pi^*}[c]} \right)^2 \left( \frac{k_0}{k_1} \right)^d \frac{|\Psi_0|}{|\Psi_1|} \left( \frac{\Gamma(k_0/2)\, \Gamma((k_1+d)/2)}{\Gamma((k_0+d)/2)\, \Gamma(k_1/2)} \right)^2. \tag{38}$$
ψOBC (x) = 0 if and only if DOBC (x) ≤ 0. This classifier has a polynomial decision
boundary as long as k0 and k1 are integers, which is true if κ0 and κ1 are integers.
Consider a synthetic Gaussian model with d = 2 features, independent general covariance matrices, and a proper prior defined by known c = 0.5 and hyperparameters ν0 = κ0 = 20d, m0 = [0, . . . , 0], ν1 = κ1 = 2d, m1 = [1, . . . , 1], and Sy = (κy − d − 1)Id. We assume that the true model corresponds to the means of the parameters, and take a stratified sample of 10 randomly chosen points from each true class-conditional distribution. We find both the IBRC ψIBR and the OBC ψOBC relative to the family of all classifiers. We also consider a plug-in classifier ψPI, which is the Bayes classifier corresponding to the means of the parameters; ψPI is linear. Figure 2 shows ψOBC, ψIBR, and ψPI. Level curves for the class-conditional distributions corresponding to the expected parameters are also shown.
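The discriminant of Eqs. 37 and 38 is straightforward to evaluate numerically. The sketch below assumes the posterior parameters m*, Ψ, and k for each class have already been computed (the update formulas are in Eq. 35, not reproduced here), and evaluates K through log-gammas for numerical stability:

```python
import math
import numpy as np

def d_obc(x, Ec, m0, Psi0, k0, m1, Psi1, k1):
    """D_OBC(x) of Eqs. 37-38; classify x as class 0 iff D_OBC(x) <= 0.
    m_y, Psi_y, k_y are assumed to be the already-updated posterior
    parameters of the effective (multivariate t) densities."""
    d = len(x)
    q0 = (x - m0) @ np.linalg.solve(Psi0, x - m0)   # quadratic form for class 0
    q1 = (x - m1) @ np.linalg.solve(Psi1, x - m1)   # quadratic form for class 1
    # log K (Eq. 38), via log-gammas for numerical stability
    logK = (2.0 * math.log((1.0 - Ec) / Ec)
            + d * math.log(k0 / k1)
            + math.log(np.linalg.det(Psi0) / np.linalg.det(Psi1))
            + 2.0 * (math.lgamma(k0 / 2) + math.lgamma((k1 + d) / 2)
                     - math.lgamma((k0 + d) / 2) - math.lgamma(k1 / 2)))
    return math.exp(logK) * (1 + q0 / k0) ** (k0 + d) - (1 + q1 / k1) ** (k1 + d)

# Symmetric sanity check (values invented for illustration): with equal
# hyperparameters for both classes and Ec = 0.5, the decision boundary is
# the perpendicular bisector of m0 and m1.
m0, m1, I2 = np.zeros(2), np.ones(2), np.eye(2)
print(d_obc(np.array([0.1, 0.0]), 0.5, m0, I2, 5, m1, I2, 5) < 0)  # near m0 -> class 0
print(d_obc(np.array([0.9, 1.0]), 0.5, m0, I2, 5, m1, I2, 5) > 0)  # near m1 -> class 1
```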
For the Gaussian and discrete models discussed herein, the OBC can be solved analytically; however, in many real-world situations Gaussian models are not suitable. Shortly after the introduction of the OBC, Markov-chain Monte Carlo (MCMC) methods were utilized for RNA-Seq applications.19,20 Other MCMC-based OBC applications include liquid chromatography-mass spectrometry data,21 selected reaction monitoring data,22 and classification based on dynamical measurements of single-gene expression,23 the latter using an IBR classifier because no sample data were included. Another practical issue pertains to missing values, which are common in many applications, such as genomic classification. The OBC has been reformulated to take missing values into account.24 Finally, let us note that, while random sampling is a common assumption in classification theory, non-random sampling can be beneficial for classifier design.25 In the case of the OBC, optimal sampling has been considered under different scenarios.3,26
5 Multi-class Classification
In this section, we generalize the BEE and OBC to treat multiple classes with arbitrary loss functions. We present the analogous concepts of the Bayesian risk estimator (BRE) and optimal Bayesian risk classifier (OBRC), and show that the BRE and OBRC can be represented in the same form as the expected risk and Bayes decision rule with unknown true densities replaced by effective densities. We consider M classes, y = 0, . . . , M − 1, let f(y | c) be the probability mass function of Y parameterized by a vector c, and for each y let f(x | y, θy) be the class-conditional density function for X parameterized by θy. Let θ be composed of the θy.

Let L(i, y) be a loss function quantifying the penalty in predicting label i when the true label is y. The conditional risk in predicting label i for a given point x is defined by R(i, x, c, θ) = E[L(i, Y) | x, c, θ]. A direct calculation yields

$$R(i, x, c, \theta) = \frac{\sum_{y=0}^{M-1} L(i, y)\, f(y \,|\, c)\, f(x \,|\, y, \theta_y)}{\sum_{y=0}^{M-1} f(y \,|\, c)\, f(x \,|\, y, \theta_y)}. \tag{39}$$
is the probability that a class y point will be assigned class i by ψ, and the Ri =
{x : ψ(x) = i} partition the feature space into decision regions.
A Bayes decision rule (BDR) minimizes the expected risk or, equivalently, the conditional risk at each fixed point x:

$$\psi(x) = \arg\min_{i \in \{0, \ldots, M-1\}} R(i, x, c, \theta).$$

We break ties with the lowest index i ∈ {0, . . . , M − 1} minimizing R(i, x, c, θ). In the binary case with the zero-one loss function, L(i, y) = 0 if i = y and L(i, y) = 1 if i ≠ y, the expected risk reduces to the classification error, so that the BDR is a Bayes classifier.
With uncertainty in the multi-class framework, we assume that c is the probability mass function of Y, that is, c = {c0, . . . , cM−1} ∈ ΔM−1, where f(y | c) = cy and ΔM−1 is the standard M − 1 simplex defined by cy ∈ [0, 1] for y ∈ {0, . . . , M − 1} and $\sum_{y=0}^{M-1} c_y = 1$. Also assume θy ∈ Θy for some parameter space Θy, and θ ∈ Θ = Θ0 × · · · × ΘM−1. Let C and T denote random vectors for the parameters c and θ. We assume that C and T are independent prior to observing data, and assign prior probabilities π(c) and π(θ). Note the change of notation: up until now, c and θ have denoted both the random variables and the parameters. The change is made to avoid confusion regarding the expectations in this section.
Let Sn be a random sample, $x_i^y$ the ith sample point in class y, and ny the number of class-y sample points. Given Sn, the priors are updated to posteriors:

$$\pi^*(c, \theta) = f(c, \theta \,|\, S_n) \propto \pi(c)\, \pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_i^y, y \,|\, c, \theta_y), \tag{43}$$

where the product on the right is the likelihood function. Since $f(x_i^y, y \,|\, c, \theta_y) = c_y f(x_i^y \,|\, y, \theta_y)$, we may write π*(c, θ) = π*(c) π*(θ), where

$$\pi^*(c) = f(c \,|\, S_n) \propto \pi(c) \prod_{y=0}^{M-1} (c_y)^{n_y} \tag{44}$$

and

$$\pi^*(\theta) = f(\theta \,|\, S_n) \propto \pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_i^y \,|\, y, \theta_y). \tag{45}$$
The OBRC minimizes the average loss weighted by fΘ (y)fΘ (x | y). The OBRC
has the same functional form as the BDR with fΘ (y) substituted for the true class
probability f (y | c), and fΘ (x | y) substituted for the true density f (x | y, θy ) for all
y. Closed-form OBRC representation is available for any model in which fΘ (x | y)
has been found, including discrete and Gaussian models. For binary classification,
the BRE reduces to the BEE and the OBRC reduces to the OBC.
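The OBRC decision rule described above can be sketched compactly, assuming the effective class probabilities fΘ(y) and effective densities fΘ(x | y) are already available. The 1-D Gaussian effective densities and the three-class setup below are invented purely for illustration:

```python
import numpy as np
from math import exp, pi, sqrt

def obrc_predict(x, loss, eff_probs, eff_densities):
    """OBRC: choose label i minimizing sum_y L(i, y) f_Theta(y) f_Theta(x|y),
    i.e. the Bayes decision rule with effective densities substituted for
    the unknown true ones. Ties are broken by the lowest index."""
    scores = np.array([eff_probs[y] * f(x) for y, f in enumerate(eff_densities)])
    risks = loss @ scores          # risks[i] = sum_y L(i, y) * scores[y]
    return int(np.argmin(risks))   # np.argmin returns the first (lowest) minimizer

def gauss(m, s):
    # 1-D Gaussian density, used here as a stand-in effective density
    return lambda x: exp(-((x - m) ** 2) / (2 * s * s)) / (s * sqrt(2 * pi))

# Three classes, equal effective class probabilities, zero-one loss:
# the OBRC then reduces to the maximum-a-posteriori rule on the
# effective densities, mirroring the binary OBC reduction.
densities = [gauss(-2, 1), gauss(0, 1), gauss(2, 1)]
probs = [1 / 3, 1 / 3, 1 / 3]
zero_one = 1.0 - np.eye(3)

print([obrc_predict(x, zero_one, probs, densities) for x in (-2.5, 0.1, 3.0)])  # [0, 1, 2]
```

Replacing `zero_one` with an asymmetric loss matrix shifts the decision boundaries toward the less costly errors, exactly as in the classical Bayes decision rule.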
6 Prior Construction
In 1968, E. T. Jaynes remarked,28 "Bayesian methods, for all their advantages, will not be entirely satisfactory until we face the problem of finding the prior probability squarely." Twelve years later, he added,29 "There must exist a general formal theory of determination of priors by logical analysis of prior information — and that to develop it is today the top priority research problem of Bayesian theory." The problem is to transform scientific knowledge into prior distributions.
Historically, prior construction has usually been treated independently of real prior knowledge. Subsequent to Jeffreys' non-informative prior,17 objective-based methods were proposed.30 These were followed by information-theoretic and statistical approaches.31 In all of these methods, there is a separation between prior knowledge and observed sample data. Several specialized methods have been proposed for prior construction in the context of the OBC. In Ref. 32, data from unused features are used to construct a prior. In Refs. 19 and 20, a hierarchical Poisson prior that models cellular mRNA concentrations using a log-normal distribution is employed, with the uncertainty placed on the feature-label distribution. In the context of phenotype classification, knowledge concerning genetic signaling pathways has been integrated into prior construction.33-35
Here, we outline a general paradigm for prior formation involving an optimization constrained by incorporating existing scientific knowledge augmented by slackness variables.36 The constraints tighten the prior distribution in accordance with prior knowledge, while at the same time avoiding inadvertent over-restriction of the prior. Two definitions provide the general framework.
Given a family of proper priors π(θ, γ) indexed by γ ∈ Γ, a maximal knowledge-driven information prior (MKDIP) is a solution to the optimization

$$\arg\min_{\gamma \in \Gamma}\ {\rm E}_{\pi(\theta,\gamma)}[C_{\theta}(\xi, \gamma, D)], \tag{58}$$

where Cθ(ξ, γ, D) is a cost function depending on (1) the random vector θ parameterizing the uncertainty class, (2) the parameter γ, and (3) the state ξ of our prior knowledge and part of the sample data D. When the cost function is additively decomposed into costs on the hyperparameters and the data, it takes the form

$$C_{\theta}(\xi, \gamma, D) = (1 - \beta)\, g_{\theta}^{(1)}(\xi, \gamma) + \beta\, g_{\theta}^{(2)}(\xi, D), \tag{59}$$

where β ∈ [0, 1] is a regularization parameter, and $g_{\theta}^{(1)}$ and $g_{\theta}^{(2)}$ are cost functions. Various cost functions in the literature can be adapted for the MKDIP.36
A maximal knowledge-driven information prior with constraints takes the form of the optimization in Eq. 58 subject to the constraints ${\rm E}_{\pi(\theta,\gamma)}[g_{\theta,i}^{(3)}(\xi)] = 0$, i = 1, 2, . . . , nc, where the $g_{\theta,i}^{(3)}$, i = 1, 2, . . . , nc, are constraints resulting from the state ξ of our knowledge, via a mapping

$$T : \xi \mapsto \left( {\rm E}_{\pi(\theta,\gamma)}[g_{\theta,1}^{(3)}(\xi)], \ldots, {\rm E}_{\pi(\theta,\gamma)}[g_{\theta,n_c}^{(3)}(\xi)] \right). \tag{60}$$
A nonnegative slackness variable εi can be considered for each constraint to make the constraint structure more flexible, thereby allowing potential error or uncertainty in prior knowledge (allowing inconsistencies in prior knowledge). The slackness variables become optimization parameters, and a linear function times a regulatory coefficient is added to the cost function of the optimization in Eq. 58, so that the optimization in Eq. 58 relative to Eq. 59 becomes

$$\arg\min_{\gamma \in \Gamma,\ \varepsilon \in E}\ {\rm E}_{\pi(\theta,\gamma)}\!\left[ \lambda_1 \left[ (1 - \beta)\, g_{\theta}^{(1)}(\xi, \gamma) + \beta\, g_{\theta}^{(2)}(\xi, D) \right] \right] + \lambda_2 \sum_{i=1}^{n_c} \varepsilon_i, \tag{61}$$
subject to $-\varepsilon_i \le {\rm E}_{\pi(\theta,\gamma)}[g_{\theta,i}^{(3)}(\xi)] \le \varepsilon_i$, i = 1, 2, . . . , nc, where λ1 and λ2 are nonnegative regularization parameters, and ε = (ε1, . . . , εnc) and E represent the vector of all slackness variables and the feasible region for slackness variables, respectively.
Each slackness variable determines a range — the more uncertainty regarding a
constraint, the greater the range for the corresponding slackness variable.
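The structure of the slack-relaxed optimization in Eq. 61 can be conveyed by a deliberately tiny toy problem. Everything in the sketch below, from the Beta prior family to the specific costs and the single constraint E[θ] = 0.3, is invented for illustration and is not from the text:

```python
import numpy as np
from math import lgamma

# Toy instance of the slack-relaxed MKDIP optimization of Eq. 61. The prior
# family is Beta(a, b) with gamma = (a, b); the data cost g2 is the negative
# log marginal likelihood of D (k successes in n trials); the single
# constraint g3(theta) = theta - 0.3 encodes prior knowledge E[theta] = 0.3,
# relaxed by the smallest feasible slack eps.
k, n = 2, 10
lam1, lam2 = 1.0, 5.0

def neg_log_marginal(a, b):
    # -log Beta-binomial marginal likelihood of D under Beta(a, b),
    # up to the constant binomial coefficient
    logB = lambda x, z: lgamma(x) + lgamma(z) - lgamma(x + z)
    return -(logB(a + k, b + n - k) - logB(a, b))

best = None
for a in np.arange(0.5, 10.5, 0.5):
    for b in np.arange(0.5, 10.5, 0.5):
        eps = abs(a / (a + b) - 0.3)   # smallest slack satisfying the constraint
        obj = lam1 * neg_log_marginal(a, b) + lam2 * eps
        if best is None or obj < best[0]:
            best = (obj, float(a), float(b))

_, a_star, b_star = best
print(a_star, b_star, round(a_star / (a_star + b_star), 3))
```

The selected prior mean lands between what the data alone favor (k/n = 0.2) and what the stated prior knowledge asserts (0.3); the penalty weight λ2 controls how much inconsistency between the two is tolerated.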
Scientific knowledge is often expressed in the form of conditional probabilities characterizing conditional relations. For instance, if a system has m binary random variables X1, X2, . . . , Xm, then potentially there are $m 2^{m-1}$ probabilities for which a single variable is conditioned by the other variables:

$$g_{\theta,i}^{(3)}(\xi) = P_{\theta}(X_i = k_i \,|\, X_1 = k_1, \ldots, X_{i-1} = k_{i-1}, X_{i+1} = k_{i+1}, \ldots, X_m = k_m) - a_{k_i}^{i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m). \tag{63}$$
When slackness variables are introduced, the optimization constraints take the form

Not all constraints will be used, depending on our prior knowledge. In fact, the general conditional probabilities, conditioned on all expressions Xj = kj, j ≠ i, will not likely be used because they will likely not be known when there are many random variables, so that conditioning will be on subsets of these expressions.
Regardless of how the prior is constructed, the salient point regarding optimal Bayesian operator design (including the OBC) is that uncertainty is quantified relative to the scientific model (the feature-label distribution for classification). The prior distribution is on the physical parameters. This differs from the common method of placing prior distributions on the parameters of the operator. For instance, if we compare optimal Bayesian regression37 to standard Bayesian linear regression models,38-40 in the latter the connection of the regression function and prior assumptions with the underlying physical system is unclear. As noted in Ref. 37, there is a scientific gap in constructing operator models and making prior assumptions on them. In fact, operator uncertainty is a consequence of uncertainty in the physical system and is related to the latter via the optimization procedure that produces an optimal operator. A key reason why the MKDIP approach works is that the prior is on the scientific model, and therefore scientific knowledge can be applied directly in the form of constraints.
The standard assumption in classification theory is that training and future data come from the same feature-label distribution. In transfer learning, the training data from the actual feature-label distribution, called the target, are augmented with data from a different feature-label distribution, called the source.41 The key issue is to quantify domain relatedness. This can be achieved by extending the OBC framework so that transfer learning from the source to the target domain is via a joint prior probability density function for the model parameters of the feature-label distributions of the two domains.42 The posterior distribution of the target model parameters can be updated via the joint prior probability distribution function in conjunction with the source and target data. We use π to denote a joint prior distribution and p to denote a conditional distribution involving uncertainty parameters. As usual, a posterior distribution refers to a distribution of uncertainty parameters conditioned on the data.
We consider $L$ common classes in each domain. Let $S_s$ and $S_t$ denote samples from the source and target domains with sizes $N_s$ and $N_t$, respectively. For $l = 1, 2, \ldots, L$, let $S_s^l = \{\mathbf{x}_s^{l,1}, \mathbf{x}_s^{l,2}, \cdots, \mathbf{x}_s^{l,n_s^l}\}$ and $S_t^l = \{\mathbf{x}_t^{l,1}, \mathbf{x}_t^{l,2}, \cdots, \mathbf{x}_t^{l,n_t^l}\}$. Moreover, $S_s = \cup_{l=1}^{L} S_s^l$, $S_t = \cup_{l=1}^{L} S_t^l$, $N_s = \sum_{l=1}^{L} n_s^l$, and $N_t = \sum_{l=1}^{L} n_t^l$. Since the feature spaces are the same in both domains, $\mathbf{x}_s^l$ and $\mathbf{x}_t^l$ are $d$-vectors for the $d$ features of the source and target domains, respectively. Since in transfer learning there is no joint sampling of the source and target domains, we cannot use a general joint sampling model, but instead assume that there are two datasets separately sampled from the source and target domains. Transferability (relatedness) is characterized by how we define a joint prior distribution for the source and target precision matrices, $\Lambda_s^l$ and $\Lambda_t^l$, $l = 1, 2, \ldots, L$.
We employ a Gaussian model for the feature-label distribution, $\mathbf{x}_z^l \sim \mathcal{N}(\mu_z^l, (\Lambda_z^l)^{-1})$ for $l \in \{1, \ldots, L\}$, where $z \in \{s, t\}$ denotes the source $s$ or target $t$ domain, $\mu_s^l$ and $\mu_t^l$ are the mean vectors in the source and target domains for label $l$, respectively, and $\Lambda_s^l$ and $\Lambda_t^l$ are the $d \times d$ precision matrices in the source and target domains for label $l$, respectively. We employ a joint Gaussian-Wishart distribution as a prior for the means and precision matrices of the Gaussian models. The joint prior distribution for $\mu_s^l$, $\mu_t^l$, $\Lambda_s^l$, and $\Lambda_t^l$ takes the form
$$\pi\left(\mu_s^l, \mu_t^l, \Lambda_s^l, \Lambda_t^l\right) = p\left(\mu_s^l, \mu_t^l \mid \Lambda_s^l, \Lambda_t^l\right) \pi\left(\Lambda_s^l, \Lambda_t^l\right). \qquad (65)$$
We assume that, for any $l$, $\mu_s^l$ and $\mu_t^l$ are conditionally independent given $\Lambda_s^l$ and $\Lambda_t^l$, so that $p(\mu_s^l, \mu_t^l \mid \Lambda_s^l, \Lambda_t^l) = p(\mu_s^l \mid \Lambda_s^l)\, p(\mu_t^l \mid \Lambda_t^l)$.
Based on a theorem in Ref. 43, we define the joint prior distribution $\pi(\Lambda_s^l, \Lambda_t^l)$ of the precision matrices of the source and target domains for class $l$:
$$\pi(\Lambda_t^l, \Lambda_s^l) = K \,\mathrm{etr}\!\left(-\frac{1}{2}\left[\left(\mathbf{M}_t^l\right)^{-1} + \left(\mathbf{F}^l\right)^T \mathbf{C}^l \mathbf{F}^l\right] \Lambda_t^l\right) \mathrm{etr}\!\left(-\frac{1}{2}\left(\mathbf{C}^l\right)^{-1} \Lambda_s^l\right) \left|\Lambda_s^l\right|^{\frac{\nu^l - d - 1}{2}} \left|\Lambda_t^l\right|^{\frac{\nu^l - d - 1}{2}} \, {}_0F_1\!\left(\frac{\nu^l}{2};\, \frac{1}{4}\mathbf{G}^l\right), \qquad (67)$$
where $\mathrm{etr}(\mathbf{A}) = \exp(\mathrm{tr}(\mathbf{A}))$ and
$$\mathbf{M}^l = \begin{bmatrix} \mathbf{M}_t^l & \mathbf{M}_{ts}^l \\ \left(\mathbf{M}_{ts}^l\right)^T & \mathbf{M}_s^l \end{bmatrix}. \qquad (68)$$
Based upon a theorem in Ref. 45, $\Lambda_t^l$ and $\Lambda_s^l$ possess Wishart marginal distributions: $\Lambda_z^l \sim W_d(\mathbf{M}_z^l, \nu^l)$ for $l \in \{1, \ldots, L\}$ and $z \in \{s, t\}$.
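This marginal property can be checked numerically. The sketch below is our own illustration (not the chapter's code, and the scale matrix values are arbitrary): it draws Wishart matrices with a blocked $4 \times 4$ scale matrix via the outer-product construction and verifies that the empirical mean of the top-left block matches $\nu \mathbf{M}_t$, consistent with $\Lambda_t \sim W_d(\mathbf{M}_t, \nu)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, nu = 2, 8                      # diagonal-block size and degrees of freedom
# Blocked scale matrix M = [[M_t, M_ts], [M_ts^T, M_s]]
M = np.array([[2.0, 0.5, 0.3, 0.1],
              [0.5, 1.5, 0.2, 0.4],
              [0.3, 0.2, 1.0, 0.3],
              [0.1, 0.4, 0.3, 2.5]])
L = np.linalg.cholesky(M)

def wishart_draw(rng, L, nu):
    """One draw from W_4(M, nu): A A^T with A = chol(M) @ G, G ~ N(0, I)."""
    G = rng.standard_normal((L.shape[0], nu))
    A = L @ G
    return A @ A.T

samples = np.array([wishart_draw(rng, L, nu) for _ in range(20000)])
top_left = samples[:, :d, :d].mean(axis=0)    # empirical E[Lambda_t]
print(np.round(top_left, 2))                  # should approximate nu * M_t
assert np.allclose(top_left, nu * M[:d, :d], rtol=0.05)
```

Since $E[W] = \nu \mathbf{M}$ for a Wishart matrix, the diagonal block of the mean isolates $\nu \mathbf{M}_t$, matching the stated marginal.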
We need the posterior distribution of the parameters of the target domain upon
observing the source and target samples. The likelihoods of the samples St and
Ss are conditionally independent given the parameters of the target and source
domains. The dependence between the two domains is due to the dependence of the
prior distributions of the precision matrices. Within each domain, the likelihoods of
the classes are conditionally independent given the class parameters. Under these
conditions, and assuming that the priors of the parameters in different classes are
independent, the joint posterior can be expressed as a product of the individual
class posteriors:42
L
π(μt , μs , Λt , Λs |St , Ss ) = π(μlt , μls , Λlt , Λls |Stl , Ssl ), (73)
l=1
where
π(μlt , μls , Λlt , Λls |Stl , Ssl ) ∝ p(Stl |μlt , Λlt )p(Ssl |μls , Λls )
× p μls |Λls p μlt |Λlt π Λls , Λlt . (74)
The next theorem gives the posterior for the target domain.
Theorem 5 [42]. Given the target sample $S_t$ and source sample $S_s$, the posterior distribution of the target mean $\mu_t^l$ and target precision matrix $\Lambda_t^l$ for class $l$ is a Gaussian-hypergeometric-function distribution of the form
$$\pi(\mu_t^l, \Lambda_t^l \mid S_t^l, S_s^l) = A^l \left|\Lambda_t^l\right|^{\frac{1}{2}} \exp\!\left(-\frac{\kappa_{t,n}^l}{2}\left(\mu_t^l - \mathbf{m}_{t,n}^l\right)^T \Lambda_t^l \left(\mu_t^l - \mathbf{m}_{t,n}^l\right)\right) \left|\Lambda_t^l\right|^{\frac{\nu^l + n_t^l - d - 1}{2}} \mathrm{etr}\!\left(-\frac{1}{2}\left(\mathbf{T}_t^l\right)^{-1} \Lambda_t^l\right) h\!\left(\Lambda_t^l\right),$$
where $A^l$ is the normalizing constant, $h(\Lambda_t^l)$ is a matrix-variate hypergeometric factor induced by the source sample, and the updated hyperparameters $\kappa_{t,n}^l$, $\mathbf{m}_{t,n}^l$, and $\mathbf{T}_t^l$ are given in Ref. 42. Let $\pi^*(\mu_t^l, \Lambda_t^l) = \pi(\mu_t^l, \Lambda_t^l \mid S_t^l, S_s^l)$ denote this posterior of $(\mu_t^l, \Lambda_t^l)$ upon observation of $S_t^l$ and $S_s^l$. The effective class-conditional density in the target domain is its expectation, $f_{\mathrm{OBTL}}(\mathbf{x} \mid l) = E_{\pi^*}\!\left[p(\mathbf{x} \mid \mu_t^l, \Lambda_t^l)\right]$, which the next theorem evaluates.
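The notion of an effective class-conditional density (the class-conditional density averaged over the posterior) can be illustrated in one dimension, where the Normal-Gamma analogue of the Gaussian-Wishart model has a closed-form answer, namely a Student-t density. The sketch below is our own simplified illustration, not the matrix-variate case of the theorems; the hyperparameter values are arbitrary:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
# Posterior hyperparameters of a Normal-Gamma(m, kappa, a, b) model:
# mu | lam ~ N(m, 1/(kappa*lam)),  lam ~ Gamma(a, rate=b)
m, kappa, a, b = 0.5, 4.0, 3.0, 2.0
x = 1.2

# Monte Carlo estimate of the effective density E[ p(x | mu, lam) ]
lam = rng.gamma(shape=a, scale=1.0 / b, size=400_000)
mu = rng.normal(m, 1.0 / np.sqrt(kappa * lam))
mc = np.mean(np.sqrt(lam / (2 * np.pi)) * np.exp(-0.5 * lam * (x - mu) ** 2))

# Closed form: Student-t with df = 2a, loc = m, scale^2 = b(kappa+1)/(a*kappa)
df, scale = 2 * a, math.sqrt(b * (kappa + 1) / (a * kappa))
z = (x - m) / scale
closed = (math.gamma((df + 1) / 2)
          / (math.gamma(df / 2) * math.sqrt(df * math.pi) * scale)
          * (1 + z * z / df) ** (-(df + 1) / 2))

print(mc, closed)
assert abs(mc - closed) < 5e-3
```

The Monte Carlo average of the class-conditional density over posterior draws converges to the closed-form predictive density, which is exactly the role $f_{\mathrm{OBTL}}(\mathbf{x} \mid l)$ plays in the matrix-variate setting.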
Theorem 6 [42]. If $\mathbf{F}^l$ is full rank or null, then the effective class-conditional density in the target domain for class $l$ is given by
$$f_{\mathrm{OBTL}}(\mathbf{x} \mid l) = \pi^{-\frac{d}{2}} \left(\frac{\kappa_{t,n}^l}{\kappa_x^l}\right)^{\frac{d}{2}} \Gamma_d\!\left(\frac{\nu^l + n_t^l + 1}{2}\right) \Gamma_d^{-1}\!\left(\frac{\nu^l + n_t^l}{2}\right) \left|\mathbf{T}_x^l\right|^{\frac{\nu^l + n_t^l + 1}{2}} \left|\mathbf{T}_t^l\right|^{-\frac{\nu^l + n_t^l}{2}}$$
$$\times\; {}_2F_1\!\left(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l + 1}{2};\, \frac{\nu^l}{2};\, \mathbf{T}_s^l \mathbf{F}^l \mathbf{T}_x^l \left(\mathbf{F}^l\right)^T\right) {}_2F_1^{-1}\!\left(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l}{2};\, \frac{\nu^l}{2};\, \mathbf{T}_s^l \mathbf{F}^l \mathbf{T}_t^l \left(\mathbf{F}^l\right)^T\right). \qquad (79)$$
A Dirichlet prior is assumed for the prior probabilities $c_t^l$ that the target sample belongs to class $l$: $\mathbf{c}_t = (c_t^1, \cdots, c_t^L) \sim \mathrm{Dir}(L, \xi_t)$, where $\xi_t = (\xi_t^1, \cdots, \xi_t^L)$ is the vector of concentration parameters and $\xi_t^l > 0$ for $l \in \{1, \ldots, L\}$. As the Dirichlet distribution is conjugate for the class probabilities, the posterior of $\mathbf{c}_t$ given the target sample is again Dirichlet.

If there is no interaction between the source and target domains in all the classes, then the OBTL classifier reduces to the OBC in the target domain. Specifically, if $\mathbf{M}_{ts}^l = \mathbf{0}$ for all $l \in \{1, \ldots, L\}$, then $\psi_{\mathrm{OBTL}} = \psi_{\mathrm{OBC}}$.
Figure 3 shows simulation results comparing the OBC (trained only with target
data) and the OBTL classifier for two classes and ten features (see Ref. 42 for
simulation details). α is a parameter measuring the relatedness between the source
and target domains: α = 0 when the two domains are not related, and α close to
1 indicates greater relatedness. Part (a) shows average classification error versus the number of source points, with the number of target points fixed at 10, and part (b) shows average classification error versus the number of target points, with the number of source points fixed at 200.

Fig. 3. Average classification error: (a) average classification error versus the number of source training data per class, (b) average classification error versus the number of target training data per class.
8 Conclusion
References
[1] Yoon, B-J., Qian, X., and E. R. Dougherty, Quantifying the objective cost of uncer-
tainty in complex dynamical systems, IEEE Trans Signal Processing, 61, 2256-2266,
(2013).
[2] Dalton, L. A., and E. R. Dougherty, Intrinsically optimal Bayesian robust filtering,
IEEE Trans Signal Processing, 62, 657-670, (2014).
[3] Dougherty, E. R., Optimal Signal Processing Under Uncertainty, SPIE Press, Belling-
ham, (2018).
[4] Silver, E. A., Markovian decision processes with uncertain transition probabilities or
rewards, Technical report, Defense Technical Information Center, (1963).
[5] Martin, J. J., Bayesian Decision Problems and Markov Chains, Wiley, New York,
(1967).
[6] Kuznetsov, V. P., Stable detection when the signal and spectrum of normal noise
are inaccurately known, Telecommunications and Radio Engineering, 30-31, 58-64,
(1976).
[7] Poor, H. V., On robust Wiener filtering, IEEE Trans Automatic Control, 25, 531-536,
(1980).
[8] Grigoryan, A. M. and E. R. Dougherty, Bayesian robust optimal linear filters, Signal
Processing, 81, 2503-2521, (2001).
[9] Dougherty, E. R., Hua, J., Z. Xiong, and Y. Chen, Optimal robust classifiers, Pattern
Recognition, 38, 1520-1532, (2005).
[10] Dehghannasiri, R., Esfahani, M. S., and E. R. Dougherty, Intrinsically Bayesian robust Kalman filter: an innovation process approach, IEEE Trans Signal Processing, 65, 2531-2546, (2017).
[11] Dalton, L. A., and E. R. Dougherty, Optimal classifiers with minimum expected
error within a Bayesian framework–part I: discrete and Gaussian models, Pattern
Recognition, 46, 1288-1300, (2013).
[12] Dalton, L. A., and E. R. Dougherty, Optimal classifiers with minimum expected error
within a Bayesian framework–part II: properties and performance analysis, Pattern
Recognition, 46, 1301-1314, (2013).
[13] Dalton, L. A., and E. R. Dougherty, Bayesian minimum mean-square error estimation
for classification error–part I: definition and the Bayesian MMSE error estimator for
discrete classification, IEEE Trans Signal Processing, 59, 115-129, (2011).
[14] Dalton, L. A. , and E. R. Dougherty, Bayesian minimum mean-square error estimation
for classification error–part II: linear classification of Gaussian models, IEEE Trans
Signal Processing, 59, 130-144, (2011).
[15] DeGroot, M. H., Optimal Statistical Decisions, McGraw-Hill, New York, (1970).
[16] Raiffa, H., and R. Schlaifer, Applied Statistical Decision Theory, MIT Press, Cam-
bridge, (1961).
[17] Jeffreys, H., An invariant form for the prior probability in estimation problems, Proc
Royal Society of London. Series A, Mathematical and Physical Sciences, 186, 453-461,
(1946).
[18] Jeffreys, H., Theory of Probability, Oxford University Press, London, (1961).
[19] Knight, J., Ivanov, I., and E. R. Dougherty, MCMC Implementation of the optimal
Bayesian classifier for non-Gaussian models: model-based RNA-seq classification,
BMC Bioinformatics, 15, (2014).
[20] Knight, J., Ivanov, I., Chapkin, R., and E. R. Dougherty, Detecting multivariate
gene interactions in RNA-seq data using optimal Bayesian classification, IEEE/ACM
Trans Computational Biology and Bioinformatics, 15, 484-493, (2018).
[21] Nagaraja, K., and U. Braga-Neto, Bayesian classification of proteomics biomarkers
from selected reaction monitoring data using an approximate Bayesian computation–
Markov chain monte carlo approach, Cancer Informatics, 17, (2018).
[22] Banerjee, U., and U. Braga-Neto, Bayesian ABC-MCMC classification of liquid
chromatography–mass spectrometry data, Cancer Informatics, 14, (2015).
[23] Karbalayghareh, A., Braga-Neto, U. M., and E. R. Dougherty, Intrinsically Bayesian
robust classifier for single-cell gene expression time series in gene regulatory networks,
BMC Systems Biology, 12, (2018).
[24] Dadaneh, S. Z., Dougherty, E. R., and X. Qian, Optimal Bayesian classification with
missing values, IEEE Trans Signal Processing, 66, 4182-4192, (2018).
[25] Zollanvari, A., Hua, J., and E. R. Dougherty, Analytic study of performance of lin-
ear discriminant analysis in stochastic settings, Pattern Recognition, 46, 3017-3029,
(2013).
[26] Broumand, A., Yoon, B-J., Esfahani, M. S., and E. R. Dougherty, Discrete optimal
Bayesian classification with error-conditioned sequential sampling, Pattern Recogni-
tion, 48, 3766-3782, (2015).
[27] Dalton, L. A., and M. R. Yousefi, On optimal Bayesian classification and risk estimation under multiple classes, EURASIP J. Bioinformatics and Systems Biology, (2015).
[28] Jaynes, E. T., Prior Probabilities, IEEE Trans Systems Science and Cybernetics, 4,
227-241, (1968).
[29] Jaynes, E., What is the question? in Bayesian Statistics, J. M. Bernardo et al., Eds.,
Valencia University Press, Valencia, (1980).
[30] Kashyap, R., Prior probability and uncertainty, IEEE Trans Information Theory,
IT-17, 641-650, (1971).
[31] Rissanen, J., A universal prior for integers and estimation by minimum description
length, Annals of Statistics, 11, 416-431, (1983).
[32] Dalton, L. A., and E. R. Dougherty, Application of the Bayesian MMSE error esti-
mator for classification error to gene-expression microarray data, Bioinformatics, 27,
1822-1831, (2011).
[33] Esfahani, M. S., Knight, J., Zollanvari, A., Yoon, B-J., and E. R. Dougherty, Classifier
design given an uncertainty class of feature distributions via regularized maximum
likelihood and the incorporation of biological pathway knowledge in steady-state phe-
notype classification, Pattern Recognition, 46, 2783-2797, (2013).
[34] Esfahani, M. S., and E. R. Dougherty, Incorporation of biological pathway knowledge
in the construction of priors for optimal Bayesian classification, IEEE/ACM Trans
Computational Biology and Bioinformatics, 11, 202-218, (2014).
[35] Esfahani, M. S., and E. R. Dougherty, An optimization-based framework for the
transformation of incomplete biological knowledge into a probabilistic structure and
its application to the utilization of gene/protein signaling pathways in discrete phe-
notype classification, IEEE/ACM Trans Computational Biology and Bioinformatics,
12, 1304-1321, (2015).
[36] Boluki, S., Esfahani, M. S., Qian, X., and E. R. Dougherty, Incorporating biological
prior knowledge for Bayesian learning via maximal knowledge-driven information
priors, BMC Bioinformatics, 18, (2017).
[37] Qian, X., and E. R. Dougherty, Bayesian regression with network prior: optimal
Bayesian filtering perspective, IEEE Trans Signal Processing, 64, 6243-6253, (2016).
[38] Bernardo, J., and A. Smith, Bayesian Theory, Wiley, Chichester, U.K., (2000).
[39] Bishop, C., Pattern Recognition and Machine Learning. Springer-Verlag, New York,
(2006).
[40] Murphy, K., Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge,
(2012).
[41] Pan, S. J., and Q. Yang, A survey on transfer learning, IEEE Trans Knowledge and Data Engineering, 22, 1345-1359, (2010).
[42] Karbalayghareh, A., Qian, X., and E. R. Dougherty, Optimal Bayesian transfer learn-
ing, IEEE Trans Signal Processing, 66, 3724-3739, (2018).
[43] Halvorsen, K., Ayala, V., and E. Fierro, On the marginal distribution of the diagonal blocks in a blocked Wishart random matrix, International J. Analysis, 2016, 1-5, (2016).
[44] Nagar, D. K., and J. C. Mosquera-Benítez, Properties of matrix variate hypergeometric function distribution, Applied Mathematical Sciences, 11, 677-692, (2017).
[45] Muirhead, R. J., Aspects of Multivariate Statistical Theory, Wiley, Hoboken, (2009).
March 12, 2020 10:0 ws-rv961x669 HBPRCV-6th Edn.–11573 chapter˙Shi˙Gong page 31
CHAPTER 1.2
This chapter introduces two deep discriminative feature learning methods for object recognition that do not require increasing the network complexity: one based on an entropy-orthogonality loss and the other based on a Min-Max loss. These two losses enforce better within-class compactness and between-class separability on the learned feature vectors. The discriminative ability of the learned feature vectors is thereby greatly improved, which is essential to object recognition.
1. Introduction
Recent years have witnessed the bloom of convolutional neural networks (CNNs) in
many pattern recognition and computer vision applications, including object recog-
nition,1–4 object detection,5–8 face verification,9,10 semantic segmentation,6 object
tracking,11 image retrieval,12 image enhancement,13 image quality assessment,14
etc.
These impressive accomplishments mainly benefit from the three factors below:
(1) the rapid progress of modern computing technologies represented by GPGPUs
and CPU clusters has allowed researchers to dramatically increase the scale and
complexity of neural networks, and to train and run them within a reasonable time
frame, (2) the availability of large-scale datasets with millions of labeled training
samples has made it possible to train deep CNNs without a severe overfitting,
and (3) the introduction of many training strategies, such as ReLU,1 Dropout,1
DropConnect,15 and batch normalization,16 can help produce better deep models
by the back-propagation (BP) algorithm.
Recently, a common and popular way to improve the object recognition performance of CNNs has been to develop deeper network structures with higher complexity and then train them with large-scale datasets. However, this strategy is unsustainable and is inevitably reaching its limit, because training very deep CNNs is becoming more and more difficult to converge and also requires GPGPU/CPU clusters and complex distributed computing platforms. These requirements go beyond the limited budgets of many research groups and many real applications.
Learned features with good discriminative ability are essential to object recognition.17–21 Discriminative features are features with better within-class compactness and between-class separability. Many discriminative feature learning methods22–27 that are not based on deep learning have been proposed. However, constructing a highly efficient discriminative feature learning method for a CNN is non-trivial: because the BP algorithm with mini-batches is used to train a CNN, a single mini-batch cannot reflect the global distribution of the training set very well, and owing to the large scale of the training set, it is unrealistic to input the whole training set in each iteration. In recent years, the contrastive loss10 and triplet loss28 have been proposed to strengthen the discriminative ability of the features learned by a CNN. However, both suffer from dramatic data expansion when composing the sample pairs or triplets from the training set. Moreover, it has been reported that the way pairs or triplets of training samples are constituted can affect the accuracy of a CNN model by a few percentage points.17,28 As a result, using such losses may lead to slower model convergence, higher computational cost, and increased training complexity and uncertainty.
For almost all visual tasks, the human visual system (HVS) is always superior
to current machine visual systems. Hence, developing a system that simulates some
properties of the HVS will be a promising research direction. Actually, existing
CNNs are well known for their local connectivity and shared weight properties that
originate from discoveries in visual cortex research.
Research findings in the areas of neuroscience, physiology, and psychology29–31 have shown that object recognition in the human visual cortex (HVC) is accomplished by the ventral stream, starting from the V1 area through the V2 and V4 areas to the inferior temporal (IT) area, and then to the prefrontal cortex (PFC) area. Through this hierarchy, raw input stimuli from the retina are gradually transformed into higher-level representations that have better discriminative ability for speedy and accurate object recognition.
In this chapter, we introduce two deep discriminative feature learning methods
for object recognition by drawing lessons from HVC object recognition mechanisms,
one inspired by the class-selectivity of the neurons in the IT area, and another one
inspired by the “untangling” mechanism of HVC.
In the following, we first introduce the class-selectivity of the neurons in the IT
area and “untangling” mechanism of HVC, respectively.
Class-selectivity of the neurons in the IT area. Research findings30 have
revealed the class-selectivity of the neurons in the IT area. Specifically, the response
of an IT neuron to visual stimulus is sparse with respect to classes, i.e., it only
responds to very few classes. The class-selectivity implies that the feature vectors
from different classes can be easily separated.
Fig. 1. (color online) In the beginning, manifolds corresponding to different object classes are highly curved and “tangled”; for instance, a chair manifold (the blue manifold) and all other non-chair manifolds (of which the black manifold is just one example). After a series of transformations, each manifold corresponding to an object category becomes very compact, the distances between different manifolds become very large, and discriminative features are thereby learned.30,33
Inspired by the class-selectivity of the neurons in the IT area, Shi et al.34 pro-
posed to improve the discriminative feature learning of CNN models by enabling
the learned feature vectors to have class-selectivity. To achieve this, a novel loss
function, termed entropy-orthogonality loss (EOL), is proposed to modulate the
neuron outputs (i.e., feature vectors) in the penultimate layer of a CNN model.
The EOL explicitly enables the feature vectors learned by a CNN model to have
the following properties: (1) each dimension of the feature vectors only responds
strongly to as few classes as possible, and (2) the feature vectors from different
classes are as orthogonal as possible. Hence this method makes an analogy between
the CNN’s penultimate layer neurons and the IT neurons, and the EOL measures
the degree of discrimination of the learned features. The EOL has the same training requirements as the softmax loss, with no need to carefully recombine training sample pairs or triplets. Accordingly, training CNN models with it is more efficient and easier to implement. When combined with the softmax loss, the EOL not only enlarges the differences between feature vectors of different classes but also reduces the variations among feature vectors within each class. The discriminative ability of the learned feature vectors is therefore greatly improved, which is essential to object recognition. In the following, we introduce the framework of the EOL-based deep discriminative feature learning method.
2.1. Framework

Assume that $T = \{\mathbf{X}_i, c_i\}_{i=1}^n$ is the training set, where $\mathbf{X}_i$ represents the $i$th training sample (i.e., input image), $c_i \in \{1, 2, \cdots, C\}$ refers to the ground-truth label of $\mathbf{X}_i$, $C$ refers to the number of classes, and $n$ refers to the number of training samples in $T$. For the input image $\mathbf{X}_i$, we denote the output^a of the penultimate layer of a CNN by $\mathbf{x}_i$, and view $\mathbf{x}_i$ as the feature vector of $\mathbf{X}_i$ learned by the CNN.
This method improves the discriminative feature learning of a CNN by embedding the entropy-orthogonality loss (EOL) into the penultimate layer of the CNN during training. For an $L$-layer CNN model, embedding the EOL into layer $L-1$ of the CNN gives the overall objective function:
$$\min_{\mathbf{W}} \mathcal{L} = \sum_{i=1}^{n} \ell(\mathbf{W}, \mathbf{X}_i, c_i) + \lambda M(\mathbf{F}, \mathbf{c}), \qquad (1)$$
where $\ell(\mathbf{W}, \mathbf{X}_i, c_i)$ is the softmax loss for sample $\mathbf{X}_i$, $\mathbf{W}$ denotes the total layer parameters of the CNN model, $\mathbf{W} = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^{L}$, $\mathbf{W}^{(l)}$ represents the filter weights of the $l$th layer, and $\mathbf{b}^{(l)}$ refers to the corresponding biases. $M(\mathbf{F}, \mathbf{c})$ denotes the EOL, $\mathbf{F} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]$, and $\mathbf{c} = \{c_i\}_{i=1}^n$. The hyperparameter $\lambda$ adjusts the balance between the softmax loss and the EOL.
a Assume that the output has been reshaped into a column vector.
$$P_k^c = \frac{\sum_{j \in \pi_c} |x_{kj}|}{\sum_{i=1}^{n} |x_{ki}|}, \qquad (3)$$
where $x_{ki}$ (i.e., $x_i(k)$) refers to the $k$th dimension of $\mathbf{x}_i$ and $\pi_c$ represents the index set of the samples belonging to the $c$th class.
The maximum possible value of $E(k)$ is 1 when $P_k^c = \frac{1}{C}$ for all $c$, which means that the set of supported classes of dimension $k$ includes all the classes and, therefore, dimension $k$ is not class-selective at all (it is extremely “class-sharing”). Similarly, the minimum possible value of $E(k)$ is 0 when $P_k^c = 1$ for some $c$ and $P_k^{c'} = 0$ for all $c' \neq c$, which means that the set of supported classes of dimension $k$ includes just the one class $c$ and, therefore, dimension $k$ is extremely class-selective. For dimension $k$, the degree of class-selectivity is determined by the value of $E(k)$ (between 0 and 1): as $E(k)$ decreases, the class-selectivity of dimension $k$ increases.
According to the discussion above, the entropy loss $E(\mathbf{F}, \mathbf{c})$ can be defined as:
$$E(\mathbf{F}, \mathbf{c}) = \sum_{k=1}^{d} E(k). \qquad (4)$$
The entropy loss does not consider the connections between different dimensions, which is problematic. Take 3-dimensional feature vectors as an example. Suppose we have six feature vectors from 3 different classes: $\mathbf{x}_1$ and $\mathbf{x}_2$ come from class 1, $\mathbf{x}_3$ and $\mathbf{x}_4$ come from class 2, and $\mathbf{x}_5$ and $\mathbf{x}_6$ come from class 3. For the feature vector matrix $\mathbf{F} = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4, \mathbf{x}_5, \mathbf{x}_6]$, when it takes the following values $\mathbf{A}$ and $\mathbf{B}$, respectively, $E(\mathbf{A}, \mathbf{c}) = E(\mathbf{B}, \mathbf{c})$, where $\mathbf{c} = \{1, 1, 2, 2, 3, 3\}$. However, the latter cannot be classified at all, because $\mathbf{x}_2$, $\mathbf{x}_4$, and $\mathbf{x}_6$ have the same value. Although this situation can be partially avoided by the softmax loss, it can still contradict the softmax loss and therefore hurt the discriminative ability of the learned features.
$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 & 1 \\ \frac{1}{2} & 1 & \frac{1}{2} & 1 & \frac{1}{2} & 1 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1 & 0 \end{bmatrix} \qquad (5)$$
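This equality can be verified numerically. The sketch below is our own (not the chapter's code); it computes $P_k^c$ via Eq. (3) and takes $E(k)$ to be the Shannon entropy of $(P_k^1, \ldots, P_k^C)$ normalized by $\ln C$, which is the definition implied by the discussion around Eq. (4). It confirms $E(\mathbf{A}, \mathbf{c}) = E(\mathbf{B}, \mathbf{c})$ even though the columns of $\mathbf{B}$ for $\mathbf{x}_2$, $\mathbf{x}_4$, $\mathbf{x}_6$ coincide:

```python
import numpy as np

def entropy_loss(F, labels):
    """E(F, c) = sum_k E(k), with E(k) the ln(C)-normalized entropy of P_k^c (Eq. (3))."""
    classes = sorted(set(labels))
    C = len(classes)
    total = 0.0
    for k in range(F.shape[0]):
        row = np.abs(F[k])
        P = np.array([row[[i for i, c in enumerate(labels) if c == cls]].sum()
                      for cls in classes]) / row.sum()
        P = P[P > 0]                          # convention: 0 * ln 0 = 0
        total += -(P * np.log(P)).sum() / np.log(C)
    return total

c = [1, 1, 2, 2, 3, 3]
A = np.array([[1, 1, 0, 0, 1, 1],
              [0, 0, 1, 1, 1, 1],
              [0.5, 1, 0.5, 1, 0.5, 1]])
B = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 1],
              [1, 0, 0, 0, 1, 0]])
print(entropy_loss(A, c), entropy_loss(B, c))
assert np.isclose(entropy_loss(A, c), entropy_loss(B, c))
```

Both matrices give $2\ln 2/\ln 3 + 1 \approx 2.26$, showing that the entropy loss alone cannot distinguish the separable arrangement from the degenerate one.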
To address this problem, we need to promote orthogonality (i.e., minimize dot products) between the feature vectors of different classes. Specifically, we introduce the following orthogonality loss $O(\mathbf{F}, \mathbf{c})$:
$$O(\mathbf{F}, \mathbf{c}) = \sum_{i,j=1}^{n} \left(\mathbf{x}_i^T \mathbf{x}_j - \phi_{ij}\right)^2 = \left\|\mathbf{F}^T \mathbf{F} - \mathbf{\Phi}\right\|_F^2, \qquad (6)$$
where
$$\phi_{ij} = \begin{cases} 1, & \text{if } c_i = c_j, \\ 0, & \text{else}, \end{cases} \qquad (7)$$
$\mathbf{\Phi} = (\phi_{ij})_{n \times n}$, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and the superscript $T$ denotes the transpose of a matrix. Minimizing the orthogonality loss is equivalent to enforcing that (1) the feature vectors from different classes are as orthogonal as possible, (2) the $L_2$-norm of each feature vector is as close as possible to 1, and (3) the distance between any two feature vectors belonging to the same class is as small as possible.
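The two forms of Eq. (6), the pairwise sum and the Frobenius norm, can be checked against each other with a few random feature vectors (a quick sketch of ours, with arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(2)
labels = np.array([1, 1, 2, 2, 3, 3])
F = rng.standard_normal((4, 6))               # d = 4 features, n = 6 samples

Phi = (labels[:, None] == labels[None, :]).astype(float)   # phi_ij of Eq. (7)

# Pairwise-sum form of Eq. (6)
G = F.T @ F
pair_sum = sum((G[i, j] - Phi[i, j]) ** 2
               for i in range(6) for j in range(6))
# Frobenius-norm form of Eq. (6)
frob = np.linalg.norm(G - Phi, 'fro') ** 2

assert np.isclose(pair_sum, frob)
```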
Based on the above discussion and definitions, the entropy-orthogonality loss (EOL) $M(\mathbf{F}, \mathbf{c})$ is obtained by integrating Eq. (4) and Eq. (6):
$$M(\mathbf{F}, \mathbf{c}) = \alpha E(\mathbf{F}, \mathbf{c}) + (1 - \alpha) O(\mathbf{F}, \mathbf{c}) = \alpha \sum_{k=1}^{d} E(k) + (1 - \alpha) \left\|\mathbf{F}^T \mathbf{F} - \mathbf{\Phi}\right\|_F^2, \qquad (8)$$
where $\alpha$ is a hyperparameter that adjusts the balance between the two terms. Combining Eq. (8) with Eq. (1), the overall objective function becomes:
$$\min_{\mathbf{W}} \mathcal{L}(\mathbf{W}, T) = \sum_{i=1}^{n} \ell(\mathbf{W}, \mathbf{X}_i, c_i) + \lambda \alpha E(\mathbf{F}, \mathbf{c}) + \lambda (1 - \alpha) O(\mathbf{F}, \mathbf{c}) = \sum_{i=1}^{n} \ell(\mathbf{W}, \mathbf{X}_i, c_i) + \lambda_1 E(\mathbf{F}, \mathbf{c}) + \lambda_2 O(\mathbf{F}, \mathbf{c}), \qquad (9)$$
where $\lambda_1 = \lambda \alpha$ and $\lambda_2 = \lambda (1 - \alpha)$. Next, we introduce the optimization algorithm for Eq. (9).
Fig. 2. The flowchart of the training process in one iteration for the EOL-based deep discriminative feature learning method.34 The CNN shown in this figure consists of 3 convolutional (conv) layers and 2 fully connected (fc) layers, i.e., it is a 5-layer CNN model. The last layer, fc2, outputs a C-dimensional prediction vector, where C is the number of classes. The penultimate layer in this model is fc1, so the entropy-orthogonality loss (EOL) is applied to layer fc1. The EOL is independent of the CNN structure.
2.3. Optimization

We employ the BP algorithm with mini-batches to train the CNN model. The overall objective function is Eq. (9). Hence, we need to compute the gradients of $\mathcal{L}$ with respect to (w.r.t.) the activations of all layers, which are called the error flows of the corresponding layers. The gradient calculation for the softmax loss is straightforward. In the following, we focus on obtaining the gradients of $E(\mathbf{F}, \mathbf{c})$ and $O(\mathbf{F}, \mathbf{c})$ w.r.t. the feature vectors $\mathbf{x}_i = [x_{1i}, x_{2i}, \cdots, x_{di}]^T$ $(i = 1, 2, \cdots, n)$, respectively.

The gradient of $E(\mathbf{F}, \mathbf{c})$ w.r.t. $\mathbf{x}_i$ is
$$\frac{\partial E(\mathbf{F}, \mathbf{c})}{\partial \mathbf{x}_i} = \left[\frac{\partial E(1)}{\partial x_{1i}}, \frac{\partial E(2)}{\partial x_{2i}}, \cdots, \frac{\partial E(d)}{\partial x_{di}}\right]^T, \qquad (10)$$
$$\frac{\partial E(k)}{\partial x_{ki}} = -\sum_{c=1}^{C} \frac{1 + \ln(P_k^c)}{\ln(C)} \cdot \frac{\partial P_k^c}{\partial x_{ki}}, \qquad (11)$$
$$\frac{\partial P_k^c}{\partial x_{ki}} = \begin{cases} \dfrac{\sum_{j \notin \pi_c} |x_{kj}|}{\left(\sum_{j=1}^{n} |x_{kj}|\right)^2} \times \mathrm{sgn}(x_{ki}), & i \in \pi_c, \\[2ex] -\dfrac{\sum_{j \in \pi_c} |x_{kj}|}{\left(\sum_{j=1}^{n} |x_{kj}|\right)^2} \times \mathrm{sgn}(x_{ki}), & i \notin \pi_c, \end{cases} \qquad (12)$$
where $\mathrm{sgn}(\cdot)$ is the sign function.
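Eqs. (11)–(12) can be sanity-checked by finite differences. The sketch below is ours; strictly positive features are used so that $\mathrm{sgn}(x_{ki}) = 1$, all $P_k^c > 0$, and $|\cdot|$ is differentiable:

```python
import numpy as np

rng = np.random.default_rng(3)
labels = [1, 1, 2, 2, 3, 3]
classes = sorted(set(labels))
C, (d, n) = len(classes), (3, 6)
F = rng.uniform(0.1, 1.0, size=(d, n))        # positive entries only

def E_k(F, k):
    """Normalized entropy of dimension k, using P_k^c of Eq. (3)."""
    row = np.abs(F[k])
    P = np.array([row[[i for i, c in enumerate(labels) if c == cls]].sum()
                  for cls in classes]) / row.sum()
    return -(P * np.log(P)).sum() / np.log(C)

def grad_E_k(F, k, i):
    """Analytic dE(k)/dx_{ki} from Eqs. (11)-(12)."""
    row = np.abs(F[k]); S = row.sum()
    g = 0.0
    for cls in classes:
        idx = [j for j, c in enumerate(labels) if c == cls]
        Sc = row[idx].sum()
        dP = ((S - Sc) if labels[i] == cls else -Sc) / S ** 2 * np.sign(F[k, i])
        g += -(1 + np.log(Sc / S)) / np.log(C) * dP
    return g

k, i, eps = 1, 2, 1e-6
Fp, Fm = F.copy(), F.copy()
Fp[k, i] += eps; Fm[k, i] -= eps
numeric = (E_k(Fp, k) - E_k(Fm, k)) / (2 * eps)
assert np.isclose(grad_E_k(F, k, i), numeric, atol=1e-5)
```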
$O(\mathbf{F}, \mathbf{c})$ can be written as:
$$O(\mathbf{F}, \mathbf{c}) = \left\|\mathbf{F}^T \mathbf{F} - \mathbf{\Phi}\right\|_F^2 = \mathrm{Tr}\!\left(\left(\mathbf{F}^T \mathbf{F} - \mathbf{\Phi}\right)^T \left(\mathbf{F}^T \mathbf{F} - \mathbf{\Phi}\right)\right) = \mathrm{Tr}\!\left(\mathbf{F}^T \mathbf{F} \mathbf{F}^T \mathbf{F}\right) - 2\,\mathrm{Tr}\!\left(\mathbf{\Phi} \mathbf{F}^T \mathbf{F}\right) + \mathrm{Tr}\!\left(\mathbf{\Phi}^T \mathbf{\Phi}\right), \qquad (13)$$
where $\mathrm{Tr}(\cdot)$ refers to the trace of a matrix. The gradient of $O(\mathbf{F}, \mathbf{c})$ w.r.t. $\mathbf{x}_i$ is
$$\frac{\partial O(\mathbf{F}, \mathbf{c})}{\partial \mathbf{x}_i} = 4 \mathbf{F} \left(\mathbf{F}^T \mathbf{F} - \mathbf{\Phi}\right)_{(:,i)}, \qquad (14)$$
where the subscript $(:, i)$ represents the $i$th column of a matrix.
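Eq. (14) can likewise be verified against finite differences (a sketch of ours, with arbitrary random features):

```python
import numpy as np

rng = np.random.default_rng(4)
labels = np.array([1, 1, 2, 2, 3, 3])
F = rng.standard_normal((4, 6))
Phi = (labels[:, None] == labels[None, :]).astype(float)

O = lambda F: np.linalg.norm(F.T @ F - Phi, 'fro') ** 2
analytic = 4 * F @ (F.T @ F - Phi)            # column i is dO/dx_i, Eq. (14)

i, eps = 3, 1e-6
numeric = np.zeros(4)
for r in range(4):
    Fp, Fm = F.copy(), F.copy()
    Fp[r, i] += eps; Fm[r, i] -= eps
    numeric[r] = (O(Fp) - O(Fm)) / (2 * eps)
assert np.allclose(analytic[:, i], numeric, atol=1e-4)
```

The factor 4 arises because $\mathbf{x}_i$ enters $\mathbf{F}^T\mathbf{F}$ through both a row and a column of the symmetric product.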
Fig. 2 shows the flowchart of the training process in an iteration for the EOL-
based deep discriminative feature learning method. Based on the above derivatives,
the training algorithm for this method is listed in Algorithm 1.
In principle, the Min-Max loss is independent of the CNN structure and can be applied to any layer of a CNN model. The experimental evaluations20,33 show that applying the Min-Max loss to the penultimate layer is most effective for improving the model's object recognition accuracy. In the following, we introduce the framework of the Min-Max loss based deep discriminative feature learning method.
3.1. Framework

Let $\{\mathbf{X}_i, c_i\}_{i=1}^n$ be the set of input training data, where $\mathbf{X}_i$ denotes the $i$th raw input data, $c_i \in \{1, 2, \cdots, C\}$ denotes the corresponding ground-truth label, $C$ is the number of classes, and $n$ is the number of training samples. The goal of training the CNN is to learn filter weights and biases that minimize the classification error at the output layer. A recursive function for an $M$-layer CNN model can be defined as follows:
$$\mathbf{X}_i^{(m)} = f\!\left(\mathbf{W}^{(m)} * \mathbf{X}_i^{(m-1)} + \mathbf{b}^{(m)}\right), \qquad (15)$$
$$i = 1, 2, \cdots, n; \quad m = 1, 2, \cdots, M; \quad \mathbf{X}_i^{(0)} = \mathbf{X}_i, \qquad (16)$$
where $\mathbf{W}^{(m)}$ denotes the filter weights of the $m$th layer to be learned, $\mathbf{b}^{(m)}$ refers to the corresponding biases, $*$ denotes the convolution operation, $f(\cdot)$ is an element-wise nonlinear activation function such as ReLU, and $\mathbf{X}_i^{(m)}$ represents the feature maps generated at layer $m$ for sample $\mathbf{X}_i$. The total parameters of the CNN model can be denoted as $\mathbf{W} = \{\mathbf{W}^{(1)}, \cdots, \mathbf{W}^{(M)}; \mathbf{b}^{(1)}, \cdots, \mathbf{b}^{(M)}\}$ for simplicity.
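The recursion of Eqs. (15)–(16) can be sketched directly in NumPy. The toy below is our own illustration (single channel, "valid" convolution implemented as cross-correlation, as is conventional in CNNs, with ReLU for $f$); it stacks two layers:

```python
import numpy as np

def conv2d_valid(X, W):
    """2-D 'valid' convolution (implemented as cross-correlation) of X with filter W."""
    h, w = W.shape
    H, Wd = X.shape
    out = np.empty((H - h + 1, Wd - w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(X[r:r + h, c:c + w] * W)
    return out

def forward(X, weights, biases):
    """X^(m) = f(W^(m) * X^(m-1) + b^(m)) with f = ReLU and X^(0) = X (Eqs. (15)-(16))."""
    for W, b in zip(weights, biases):
        X = np.maximum(conv2d_valid(X, W) + b, 0.0)
    return X

rng = np.random.default_rng(5)
X0 = rng.standard_normal((8, 8))
weights = [rng.standard_normal((3, 3)) for _ in range(2)]
biases = [0.1, -0.2]
out = forward(X0, weights, biases)
print(out.shape)                               # two 3x3 valid convs: 8 -> 6 -> 4
assert out.shape == (4, 4) and (out >= 0).all()
```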
This method improves the discriminative feature learning of a CNN model by embedding the Min-Max loss into a certain layer of the model during the training process. Embedding this loss into the $k$th layer is equivalent to using the following cost function to train the model:
$$\min_{\mathbf{W}} \mathcal{L} = \sum_{i=1}^{n} \ell(\mathbf{W}, \mathbf{X}_i, c_i) + \lambda L\!\left(\mathcal{X}^{(k)}, \mathbf{c}\right), \qquad (17)$$
where $\ell(\mathbf{W}, \mathbf{X}_i, c_i)$ is the softmax loss for sample $\mathbf{X}_i$ and $L(\mathcal{X}^{(k)}, \mathbf{c})$ denotes the Min-Max loss. Its inputs are $\mathcal{X}^{(k)} = \{\mathbf{X}_1^{(k)}, \cdots, \mathbf{X}_n^{(k)}\}$, the set of feature maps produced at layer $k$ for all the training samples, and $\mathbf{c} = \{c_i\}_{i=1}^n$, the set of corresponding labels. The hyperparameter $\lambda$ controls the balance between the classification error and the Min-Max loss.

Note that $\mathcal{X}^{(k)}$ depends on $\mathbf{W}^{(1)}, \cdots, \mathbf{W}^{(k)}$. Hence, directly constraining $\mathcal{X}^{(k)}$ will modulate the filter weights of the 1st through $k$th layers (i.e., $\mathbf{W}^{(1)}, \cdots, \mathbf{W}^{(k)}$) by back-propagation during the training phase.
Fig.: Illustration of within-manifold compactness and the between-manifold margin between two feature manifolds (Manifold-1 and Manifold-2).
3.3. Optimization

We use the back-propagation method, carried out with mini-batches, to train the CNN model. Therefore, we need to calculate the gradients of the overall objective function with respect to the features of the corresponding layers. Because the softmax loss is used as the first term of Eqs. (23) and (30), its gradient calculation is straightforward. In the following, we focus on obtaining the gradient of the Min-Max loss with respect to the feature maps $\mathbf{x}_i$ in the corresponding layer. The within-manifold and between-manifold scatter matrices are
$$S^{(W)} = \frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(W)} \left(\mathbf{x}_i - \mathbf{x}_j\right) \left(\mathbf{x}_i - \mathbf{x}_j\right)^T, \qquad (34)$$
$$S^{(B)} = \frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(B)} \left(\mathbf{x}_i - \mathbf{x}_j\right) \left(\mathbf{x}_i - \mathbf{x}_j\right)^T, \qquad (35)$$
where $n$ is the number of inputs in a mini-batch, and $\Omega_{ij}^{(W)}$ and $\Omega_{ij}^{(B)}$ are, respectively, the $(i, j)$ elements of the within-manifold adjacency matrix $\mathbf{\Omega}^{(W)} = (\Omega_{ij}^{(W)})_{n \times n}$ and the between-manifold adjacency matrix $\mathbf{\Omega}^{(B)} = (\Omega_{ij}^{(B)})_{n \times n}$, built from the features $\mathcal{X}^{(k)}$ (i.e., the feature maps generated at the $k$th layer, $\mathcal{X}^{(k)} = \{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$) in the current mini-batch.
$$\frac{\partial\, \mathrm{Tr}(S^{(B)})}{\partial \mathbf{x}_i} = \left(\mathbf{x}_i \mathbf{1}_n^T - \mathbf{H}\right) \left(\mathbf{\Omega}^{(B)} + \left(\mathbf{\Omega}^{(B)}\right)^T\right)_{(:,i)}, \qquad (39)$$
where $\mathbf{H} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]$ and the subscript $(:, i)$ denotes the $i$th column of a matrix. Then the gradient of the Min-Max loss with respect to the features $\mathbf{x}_i$ is
$$\frac{\partial L}{\partial \mathbf{x}_i} = \frac{\mathrm{Tr}\!\left(S^{(W)}\right) \dfrac{\partial\, \mathrm{Tr}(S^{(B)})}{\partial \mathbf{x}_i} - \mathrm{Tr}\!\left(S^{(B)}\right) \dfrac{\partial\, \mathrm{Tr}(S^{(W)})}{\partial \mathbf{x}_i}}{\left[\mathrm{Tr}\!\left(S^{(W)}\right)\right]^2}. \qquad (40)$$
Incremental mini-batch training procedure
March 12, 2020 10:0 ws-rv961x669 HBPRCV-6th Edn.–11573 chapter˙Shi˙Gong page 44
In practise, when the number of the classes is large relative to the mini-batch size,
because there is no guarantee that each mini-batch will contain training samples
from all the classes, the above gradient must be calculated in an incremental fashion.
th
Firstly, the mean vector of the c class can be updated as
i∈πc (t) xi (t) + Nc (t − 1)mc (t − 1)
mc (t) = , (41)
Nc (t)
where (t) indicates the t-th iteration, N_c(t) represents the cumulative total number of
c-th class training samples, \pi_c(t) denotes the index set of the samples belonging to
the c-th class in the mini-batch, and n_c(t) = |\pi_c(t)|. Accordingly, the overall mean
vector m(t) can be updated by
m(t) = \frac{1}{n} \sum_{c=1}^{C} n_c(t) \, m_c(t) ,    (42)
where n = \sum_{c=1}^{C} |\pi_c(t)|, i.e., n is the number of training samples in a mini-batch.
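The running-mean updates of Eqs. (41)-(42) are straightforward to implement; the sketch below (illustrative, with my own variable names) maintains the cumulative count N_c and the class mean across mini-batches:

```python
import numpy as np

def update_class_mean(m_prev, N_prev, batch_x):
    # Eq. (41): fold the class-c samples of this mini-batch into the
    # running mean; N_c(t) = N_c(t-1) + n_c(t) is the cumulative count.
    n_c = len(batch_x)
    N = N_prev + n_c
    m = (batch_x.sum(axis=0) + N_prev * m_prev) / N
    return m, N

def overall_mean(class_means, batch_counts):
    # Eq. (42): batch-weighted combination of the running class means,
    # with n = sum_c |pi_c(t)| the number of samples in the mini-batch.
    n = sum(batch_counts)
    return sum(nc * mc for nc, mc in zip(batch_counts, class_means)) / n
```

After any number of mini-batches, the running mean agrees with the mean of all samples seen so far for that class.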
In such a scenario, at the t-th iteration, the within-manifold distance S_c^{(W)}(t) for
class c can be represented as
S_c^{(W)}(t) = \sum_{i \in \pi_c(t)} (x_i(t) - m_c(t))^\top (x_i(t) - m_c(t)) ,    (43)
the total within-manifold distance S^{(W)}(t) can be denoted as

S^{(W)}(t) = \sum_{c=1}^{C} S_c^{(W)}(t) ,    (44)
and the total between-manifold distance S^{(B)}(t) can be expressed as

S^{(B)}(t) = \sum_{c=1}^{C} n_c(t) \, (m_c(t) - m(t))^\top (m_c(t) - m(t)) .    (45)
Then the gradients of S^{(W)}(t) and S^{(B)}(t) with respect to x_i(t) become:

\frac{\partial S^{(W)}(t)}{\partial x_i(t)} = \sum_{c=1}^{C} I(i \in \pi_c(t)) \, \frac{\partial S_c^{(W)}(t)}{\partial x_i(t)} = 2 \sum_{c=1}^{C} I(i \in \pi_c(t)) \left[ (x_i(t) - m_c(t)) + \frac{n_c(t) m_c(t) - \sum_{j \in \pi_c(t)} x_j(t)}{N_c(t)} \right] ,    (46)
and

\frac{\partial S^{(B)}(t)}{\partial x_i(t)} = \frac{\partial \sum_{c=1}^{C} n_c(t) (m_c(t) - m(t))^\top (m_c(t) - m(t))}{\partial x_i(t)} = 2 \sum_{c=1}^{C} I(i \in \pi_c(t)) \, \frac{n_c(t) (m_c(t) - m(t))}{N_c(t)} ,    (47)
where I(\cdot) refers to the indicator function that equals 1 if the condition is satisfied,
and 0 otherwise. Accordingly, the gradient of the Min-Max loss with respect to the
feature x_i(t) is

\frac{\partial L}{\partial x_i(t)} = \frac{S^{(W)}(t) \, \frac{\partial S^{(B)}(t)}{\partial x_i(t)} - S^{(B)}(t) \, \frac{\partial S^{(W)}(t)}{\partial x_i(t)}}{[S^{(W)}(t)]^2} .    (48)
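At the first iteration (when N_c(t) = n_c(t)), Eqs. (43)-(48) reduce to batch statistics, which makes them easy to sanity-check. The sketch below is my own illustration of that special case; it computes S^(W)(t), S^(B)(t) and the gradient of Eq. (48):

```python
import numpy as np

def batch_stats(X, y, C):
    # First iteration: the running statistics coincide with the batch
    # statistics, i.e. N_c(t) = n_c(t) and m_c(t) is the batch class mean.
    means = np.stack([X[y == c].mean(axis=0) for c in range(C)])
    counts = np.array([(y == c).sum() for c in range(C)])
    m = (counts[:, None] * means).sum(axis=0) / counts.sum()  # Eq. (42)
    return means, counts, m

def S_W(X, y, means):
    # Eqs. (43)-(44): total within-manifold distance.
    return sum(np.sum((X[y == c] - means[c]) ** 2) for c in range(len(means)))

def S_B(means, counts, m):
    # Eq. (45): total between-manifold distance.
    return np.sum(counts * np.sum((means - m) ** 2, axis=1))

def grad_minmax(X, y, i, C):
    means, counts, m = batch_stats(X, y, C)
    c = y[i]
    n_c = N_c = counts[c]            # first iteration: N_c(t) = n_c(t)
    # Eq. (46); the bracketed second term vanishes here, since n_c m_c = sum_j x_j.
    gW = 2 * ((X[i] - means[c]) + (n_c * means[c] - X[y == c].sum(axis=0)) / N_c)
    # Eq. (47)
    gB = 2 * n_c * (means[c] - m) / N_c
    sW, sB = S_W(X, y, means), S_B(means, counts, m)
    # Eq. (48)
    return (sW * gB - sB * gW) / sW ** 2
```

In this special case the analytic gradient matches a finite-difference derivative of S^(B)/S^(W) with the batch statistics recomputed, which is a useful unit test for an implementation.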
The total gradient with respect to x_i is the sum of the gradient from the softmax
loss and that of the Min-Max loss.
4.2. Datasets
The CIFAR10,41 CIFAR100,41 MNIST42 and SVHN43 datasets are chosen to conduct
performance evaluations. CIFAR10 and CIFAR100 are natural-image datasets.
MNIST is a dataset of hand-written digit (0-9) images. SVHN is collected from
house numbers in Google Street View images; an SVHN image may contain more
than one digit, but the task is to classify the digit in the image center.
Table 1 lists the details of the CIFAR10, CIFAR100, MNIST and SVHN datasets.
These four datasets are very popular in the image classification research community,
because they contain a large number of small images and hence enable models to be
trained in a reasonable time on computers with moderate configurations.
Table 1. Details of the CIFAR10, CIFAR100, MNIST and SVHN datasets.

Dataset     Classes   Images    Image size   Color        Split
CIFAR10     10        60000     32×32        RGB          training/test: 50000/10000
CIFAR100    100       60000     32×32        RGB          training/test: 50000/10000
MNIST       10        70000     28×28        gray-scale   training/test: 60000/10000
SVHN        10        630420    32×32        RGB          training/test/extra: 73257/26032/531131
connected (fc) layers. We evaluated the QCNN model using CIFAR10, CIFAR100
and SVHN, respectively. MNIST cannot be used to evaluate the QCNN model,
because the input size of QCNN must be 32×32, but the images in MNIST are
28×28 in size.
Table 2 shows the test set top-1 error rates on CIFAR10, CIFAR100 and SVHN,
respectively. It can be seen that training QCNN with the EOL or the Min-Max loss
effectively improves performance compared to the respective baselines. These
remarkable performance improvements clearly reveal the effectiveness of the EOL
and the Min-Max loss.
Next, we apply the EOL or the Min-Max loss to the NIN models.40 NIN consists of
9 conv layers without fc layers. The four datasets, i.e., CIFAR10, CIFAR100,
MNIST and SVHN, are used in the evaluation. For fairness, we complied with the
same training/testing protocols and data preprocessing as in.40,44
Table 3 provides the comparison of test set top-1 error rates on the four datasets.
For the NIN baseline, to be fair, we report the evaluation results from both our own
experiments and the original paper.40 We also include the results of DSN44 in this
table. DSN is also based on NIN, with layer-wise supervision. These results again
reveal the effectiveness of the EOL and the Min-Max loss.
Fig. 4. Feature visualization of the CIFAR10 test set, with (a) QCNN; (b) QCNN+EOL. Each
dot denotes an image; different colors denote different classes.
Fig. 5. Feature visualization of the CIFAR10 test set, with (a) NIN; (b) NIN+EOL.
5. Discussions
From Section 4, all the experiments indicate the superiority of the EOL and the Min-
Max loss. The reasons why better within-class compactness and between-class
separability lead to better discriminative ability of the learned feature vectors
are as follows:
(1) Almost all data clustering methods,46–48 discriminant analysis meth-
ods,35,49,50 etc., use this principle to learn discriminative features to better
accomplish the task. Data clustering can be regarded as unsupervised data
classification. Therefore, by analogy, learning features that possess the above
property can be expected to improve accuracy for supervised data classi-
fication.
References
17. Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach
for deep face recognition. In Proceedings of the European Conference on Computer
Vision, pp. 499–515, (2016).
18. W. Shi, Y. Gong, J. Wang, and N. Zheng. Integrating supervised laplacian objective
with cnn for object recognition. In Pacific Rim Conference on Multimedia, pp. 64–73,
(2016).
19. G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, When deep learning meets metric
learning: Remote sensing image scene classification via learning discriminative cnns,
IEEE Transactions on Geoscience and Remote Sensing. (2018). doi: 10.1109/TGRS.
2017.2783902.
20. W. Shi, Y. Gong, and J. Wang. Improving cnn performance with min-max objective.
In Proceedings of the International Joint Conference on Artificial Intelligence, pp.
2004–2010, (2016).
21. W. Shi, Y. Gong, X. Tao, and N. Zheng, Training dcnn by combining max-margin,
max-correlation objectives, and correntropy loss for multilabel image classification,
IEEE Transactions on Neural Networks and Learning Systems. 29(7), 2896–2908,
(2018).
22. C. Li, Q. Liu, W. Dong, F. Wei, X. Zhang, and L. Yang, Max-margin-based dis-
criminative feature learning, IEEE Transactions on Neural Networks and Learning
Systems. 27(12), 2768–2775, (2016).
23. G.-S. Xie, X.-Y. Zhang, X. Shu, S. Yan, and C.-L. Liu. Task-driven feature pooling for
image classification. In Proceedings of the IEEE International Conference on Computer
Vision, pp. 1179–1187, (2015).
24. G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, Sde: A novel selective, discriminative
and equalizing feature representation for visual recognition, International Journal of
Computer Vision. 124(2), 145–168, (2017).
25. G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, Hybrid cnn and dictionary-based
models for scene recognition and domain adaptation, IEEE Transactions on Circuits
and Systems for Video Technology. 27(6), 1263–1274, (2017).
26. J. Tang, Z. Li, H. Lai, L. Zhang, S. Yan, et al., Personalized age progression with bi-
level aging dictionary learning, IEEE Transactions on Pattern Analysis and Machine
Intelligence. 40(4), 905–917, (2018).
27. G.-S. Xie, X.-B. Jin, Z. Zhang, Z. Liu, X. Xue, and J. Pu, Retargeted multi-view
feature learning with separate and shared subspace uncovering, IEEE Access. 5,
24895–24907, (2017).
28. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face
recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 815–823, (2015).
29. T. Serre, A. Oliva, and T. Poggio, A feedforward architecture accounts for rapid
categorization, Proceedings of the National Academy of Sciences. 104(15), 6424–6429,
(2007).
30. J. J. DiCarlo, D. Zoccolan, and N. C. Rust, How does the brain solve visual object
recognition?, Neuron. 73(3), 415–434, (2012).
31. S. Zhang, Y. Gong, and J. Wang. Improving dcnn performance with sparse category-
selective objective function. In Proceedings of the International Joint Conference on
Artificial Intelligence, pp. 2343–2349, (2016).
32. N. Pinto, N. Majaj, Y. Barhomi, E. Solomon, D. Cox, and J. DiCarlo. Human versus
machine: comparing visual object recognition systems on a level playing field. In
Computational and Systems Neuroscience, (2010).
33. W. Shi, Y. Gong, X. Tao, J. Wang, and N. Zheng, Improving cnn performance accuracies with min-max objective, IEEE Transactions on Neural Networks and Learning Systems. 29(7), 2872–2885, (2018).
34. W. Shi, Y. Gong, D. Cheng, X. Tao, and N. Zheng, Entropy and orthogonality based
deep discriminative feature learning for object recognition, Pattern Recognition. 81,
71–80, (2018).
35. S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, Graph embedding and
extensions: a general framework for dimensionality reduction, IEEE Transactions on
Pattern Analysis and Machine Intelligence. 29(1), 40–51, (2007).
36. M. Sugiyama. Local fisher discriminant analysis for supervised dimensionality reduc-
tion. In Proc. Int. Conf. Mach. Learn., pp. 905–912, (2006).
37. G. S. Xie, X. Y. Zhang, Y. M. Zhang, and C. L. Liu. Integrating supervised subspace
criteria with restricted boltzmann machine for feature extraction. In Int. Joint Conf.
on Neural Netw., (2014).
38. M. K. Wong and M. Sun, Deep learning regularized fisher mappings, IEEE Transac-
tions on Neural Networks. 22, 1668–1675, (2011).
39. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Pro-
ceedings of the ACM International Conference on Multimedia, pp. 675–678, (2014).
40. M. Lin, Q. Chen, and S. Yan, Network in network, arXiv preprint arXiv:1312.4400.
(2013).
41. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images,
Master’s thesis, University of Toronto. (2009).
42. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE. 86(11), 2278–2324, (1998).
43. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in
natural images with unsupervised feature learning. In Neural Information Processing
Systems (NIPS) workshop on deep learning and unsupervised feature learning, vol.
2011, p. 5, (2011).
44. C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In
Artificial Intelligence and Statistics, pp. 562–570, (2015).
45. L. Van der Maaten and G. Hinton, Visualizing data using t-sne, Journal of Machine
Learning Research. 9, 2579–2605, (2008).
46. A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: a review, ACM computing
surveys (CSUR). 31(3), 264–323, (1999).
47. U. Von Luxburg, A tutorial on spectral clustering, Statistics and computing. 17(4),
395–416, (2007).
48. S. Zhou, Z. Xu, and F. Liu, Method for determining the optimal number of clusters
based on agglomerative hierarchical clustering, IEEE Transactions on Neural Net-
works and Learning Systems. (2016).
49. R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of
eugenics. 7(2), 179–188, (1936).
50. J. L. Andrews and P. D. Mcnicholas, Model-based clustering, classification, and dis-
criminant analysis via mixtures of multivariate t-distributions, Statistics and Com-
puting. 22(5), 1021–1029, (2012).
CHAPTER 1.3

Deep Learning Based Background Subtraction
Machine learning has been widely applied to the detection of moving objects from
static cameras. Recently, many methods using deep learning for background
subtraction have been reported, with very promising performance. This chapter
provides a survey of different deep learning based background subtraction
methods. First, a comparison of the architectures of the methods is provided,
followed by a discussion of specific application requirements such as
spatio-temporal and real-time constraints. After analyzing the strategies of each
method and showing their limitations, a comparative evaluation on the large-
scale CDnet2014 dataset is provided. Finally, we conclude with some potential
future research directions.
1. Introduction
52 J. H. Giraldo1, et al.
2. Background Subtraction
image patches for each pixel and then combine them with the corresponding
patches from the background model. The size of the image patches in this research is
27×27. After that, these combined patches are used as input of a neural network
to predict the probability of a pixel being foreground or background. The authors
use 5×5 local receptive fields, and 3×3 non-overlapping receptive fields for all
pooling layers. The numbers of feature maps of the first two convolutional layers
are 6 and 16, respectively. The first fully connected layer consists of 120 neurons
and the output layer is a single sigmoid unit. There are 20,243 parameters,
which are trained using back-propagation with a cross-entropy loss function. For
training, the algorithm needs the foreground results of a previous segmentation
algorithm (IUTIS [65]) or the ground-truth information provided in CDnet 2014
[19]. The CDnet 2014 dataset was divided into two halves: one for training and
one for testing purposes. The ConvNet shows a very similar performance to other
state-of-the-art methods. Moreover, it outperforms all other methods significantly
when the ground-truth information is used, especially on videos with hard
shadows and night videos. The F-Measure score of ConvNet on the CDnet2014
dataset is 0.9046. DNN approaches have been applied in other applications such
as vehicle detection [69] and pedestrian detection [127]. To be more precise,
Yan et al. [127] used a similar scheme to detect pedestrians with both visible and
thermal images. The inputs of the network consist of the visible frame (RGB),
the thermal frame (IR), the visible background (RGB) and the thermal background (IR),
which sum up to an input size of 64×64×8. This method shows a great
improvement on the OTCBVS dataset, in comparison with T2F-MOG, SuBSENSE,
and DECOLOR.
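To make the patch-wise pipeline concrete, here is a minimal sketch (my own illustration, not the authors' code): `predict_prob` stands in for the trained network, and the 27×27 patch around each pixel of the frame is stacked with the co-located background patch before scoring. The double loop over pixels is exactly the per-pixel cost criticized in the remarks below.

```python
import numpy as np

def foreground_mask(frame, background, predict_prob, patch=27, thr=0.5):
    # For every pixel, cut a (patch x patch) neighborhood from the current
    # frame and from the background model, stack them, and score the stack.
    # `predict_prob` is a stand-in for the trained CNN: it maps a
    # (patch, patch, 2) array to a foreground probability in [0, 1].
    r = patch // 2
    fp = np.pad(frame, r, mode="edge")
    bp = np.pad(background, r, mode="edge")
    h, w = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):          # this per-pixel loop is the expensive part
        for x in range(w):
            stack = np.dstack([fp[y:y + patch, x:x + patch],
                               bp[y:y + patch, x:x + patch]])
            mask[y, x] = predict_prob(stack) > thr
    return mask
```

With a toy scorer such as the mean absolute difference between the two channels, this already behaves like naive background subtraction; the learned network replaces that hand-crafted score.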
Remarks: ConvNet is one of the simplest approaches to model the
differences between the background and the foreground using CNNs. A key
contribution of the study of Braham and Van Droogenbroeck [67] is that it was the
first application of deep learning to background subtraction. For this reason, it
can be used as a reference for comparison in terms of the improvement in
performance. However, it has several limitations. First, it is difficult to
learn high-level information through patches [93]. Second, due to the over-
fitting caused by using highly redundant data for training, the network is
scene-specific. In practice, the model can only process a particular scene, and
needs to be retrained for other video scenes. For many applications where the
camera is fixed and always captures a similar scene, this is not a serious
problem. However, it may not be the case in certain applications, as discussed by
Hu et al. [71]. Third, ConvNet processes each pixel independently, so the
foreground mask may contain isolated false positives and false negatives. Fourth,
this method requires the extraction of a large number of patches from each frame
of the video, which is computationally very expensive, as pointed out by Lim
and Keles [59]. Fifth, the method requires pre- or post-processing of the data, and
thus is not applicable to an end-to-end learning framework. The long-term
dependencies of the input video sequences are not considered, since ConvNet
uses only a few frames as input. ConvNet is a deep encoder-decoder network,
i.e., a generator network. However, one of the disadvantages of classical generator
networks is that they are unable to preserve object edges, because they minimize a
classical loss function (e.g., Euclidean distance) between the predicted output and
the ground truth [93]. This leads to the generation of blurry foreground regions.
Since this first valuable study, subsequent methods have been introduced to
alleviate these limitations.
MatConvNet. The Cascaded CNN suffers from several limitations: 1) the model
is more suitable for ground-truth generation than for an automated background/
foreground separation application, and 2) it is computationally expensive.
In another study, Lim and Keles [59] proposed a method based on a
triplet CNN with a Transposed Convolutional Neural Network (TCNN) attached
at the end, in an encoder-decoder structure. The model, called FgSegNet_M,
reuses the four blocks of the pre-trained VGG-16 [73] under a triplet framework as
a multiscale feature encoder. At the end of the network, a decoder network is
integrated to map the features to a pixel-level foreground probability map.
Finally, the binary segmentation labels are generated by applying a threshold to
this probability map. Similar to the method proposed by Wang et al. [64], the
network is trained with only a few frames (from 50 up to 200). Experimental
results [59] show that TCNN outperforms both ConvNet [67] and Cascaded CNN
[64]. In addition, it obtained an overall F-Measure score of 0.9770, which
outperformed all the reported methods. A variant of FgSegNet_M, called
FgSegNet, was introduced by Lim and Keles [60] by adding a feature pooling
module (FPM) that operates on top of the final encoder (CNN) layer. Lin et al. [64]
further improved the model by proposing a modified FPM with feature fusion.
FgSegNet_V2 achieves the highest performance on the CDnet 2014 dataset.
FgSegNet_V2 achieves the highest performance on the CDnet 2014 dataset.
A common drawback of these previous methods is that they require a large
amount of densely labeled video training data. To solve this problem, a novel
training strategy to train a multi-scale cascaded scene-specific (MCSS) CNNs
was proposed by Liao et al. [119]. The network is constructed by joining the
ConvNets [67] and the multiscale-cascaded architecture [64] with a training that
takes advantage of the balance of positive and negative training samples.
Experimental results demonstrate that MCSS obtains the score of 0.904 on the
CDnet2014 dataset (excluding the PTZ category), which outperforms Deep CNN
[72], TCNN [95] and SFEN [104].
A multi-scale CNN based background subtraction method was introduced
by Liang et al. [128]. A specific CNN model is trained for each video to ensure
accuracy, but the authors manage to avoid manual labeling. First, Liang et al.
[128] used the SuBSENSE algorithm to generate an initial foreground mask. This
initial foreground mask is not accurate enough to be directly used as ground
truth. Instead, it is used to select reliable pixels to guide the CNN training. A
simple strategy to automatically select informative frames for guided learning is
also proposed. Experiments on the CDnet 2014 dataset show that Guided Multi-
scale CNN outperforms DeepBS and SuBSENSE, with an F-Measure score of
0.759.
multiple filters at various scales on the same input, and integrates two
Complementary Feature Flows (CFF) and a Pivotal Feature Flow (PFF)
architecture. The authors also exploited intra-domain transfer learning in order to
improve the accuracy of foreground region prediction. In MV-FCN, the inception
modules are employed at early and late stages with three different sizes of
receptive fields to capture invariance at different scales. To enhance the spatial
representation, the features learned in the encoding phase are fused with
appropriate feature maps in the decoding phase through residual connections.
These multi-view receptive fields, together with the residual feature connections,
provide generalized features which are able to improve the performance of pixel-
wise foreground region identification. Alikan [84] evaluated the MV-FCN model on
the CDnet 2014 dataset, in comparison with classical neural networks (Stacked
Multi-Layer [87], Multi-Layered SOM [26]) and two deep learning approaches
(SDAE [88], Deep CNN [72]). However, only results on selected sequences are
reported, which makes the comparison less complete.
Zeng and Zhu [89] targeted moving object detection in infrared videos
and designed a Multiscale Fully Convolutional Network (MFCN). MFCN
does not require the extraction of background images. The network takes as
input frames from different video sequences and generates a probability map.
The authors borrow the architecture of the VGG-16 net and use an input size of
224×224. The VGG-16 network consists of five blocks; each block contains
some convolution and max pooling operations. The deeper blocks have a lower
spatial resolution and contain more high-level features, whilst the shallower
blocks contain more low-level features at a higher resolution. After the
output feature layer, a contrast layer is added, based on the average pooling
operation with a kernel size of 3×3. To exploit multiscale features from
multiple layers, Zeng and Zhu [89] proposed a set of deconvolution operations to
upsample the features, creating an output probability map with the same size as
the input. The cross-entropy is used to compute the loss function. The
network uses the pretrained weights for the layers from VGG-16, whilst the other
weights are randomly initialized with a truncated normal distribution. Those
randomly initialized weights are then trained using the Adam optimizer. MFCN
obtains the best score in the THM category of the CDnet 2014 dataset, with an
F-Measure score of 0.9870, whereas Cascaded CNN [64] obtains 0.8958. Over all
the categories, the F-Measure score of MFCN is 0.96. In a further study, Zeng
and Zhu [90] introduced a method called CNN-SFC, which fuses the results
produced by different background subtraction algorithms (SuBSENSE [86],
FTSG [91], and CwisarDH+ [92]) and achieves even better performance. This
method outperforms its direct competitor IUTIS [65] on the CDnet 2014 dataset.
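The contrast layer described above admits a short sketch: subtract a 3×3 local average from the feature map, so that each response is measured relative to its neighborhood. This is my own illustration; in particular, the edge-padding choice is an assumption, since the chapter does not specify the padding scheme.

```python
import numpy as np

def contrast_layer(feat, k=3):
    # Contrast feature: each response minus its k x k local average
    # (stride-1 average pooling). Edge padding is an assumption here.
    r = k // 2
    p = np.pad(feat, r, mode="edge")
    avg = np.empty_like(feat, dtype=float)
    h, w = feat.shape
    for y in range(h):
        for x in range(w):
            avg[y, x] = p[y:y + k, x:x + k].mean()
    return feat - avg
```

A constant feature map yields a zero contrast map, while an isolated peak stands out strongly against its local average, which is the intended effect.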
mask than ConvNet [67] and is not very prone to outliers in the presence of dynamic
backgrounds. Experimental results show that deep CNN based background
subtraction outperforms the existing algorithms when the challenge does not lie
in background modeling maintenance. The F-Measure score of Deep CNN on the
CDnet2014 dataset is 0.7548. However, Deep CNN suffers from the following
limitations: 1) it does not handle very well the camouflaged regions within
foreground objects, 2) it performs poorly on the PTZ videos category, and 3) due to
the corruption of the background images, it provides poor performance in the
presence of large changes in the background.
In another work, Zhao et al. [95] designed an end-to-end two-stage deep
CNN (TS-CNN) framework. The network consists of two stages: a convolutional
encoder-decoder followed by a Multi-Channel Fully-Convolutional sub-Network
(MCFCN). The target of the first stage is to reconstruct the background images
and encode rich prior knowledge of background scenes, whilst the latter stage
aims to accurately detect the foreground. The authors decided to jointly optimize
the reconstruction loss and the segmentation loss. In practice, the encoder consists of
a set of convolutions which represent the input image as a latent feature
vector. The feature vectors are used by the decoder to restore the background
image. The l2 distance is used to compute the reconstruction loss. The encoder-
decoder network learns from training data to separate the background from the
input image and restores a clean background image. After training, the second
network can learn the semantic knowledge of the foreground and background.
Therefore, the model is able to handle various challenges such as night light,
shadows and camouflaged foreground objects. Experimental results [95] show
that the TS-CNN obtains an F-Measure score of 0.7870, which is more accurate
than SuBSENSE [86], PAWCS [99], FTSG [91] and SharedModel [100] in the
case of night videos, camera jitter, shadows, thermal imagery and bad weather.
The Joint TS-CNN achieves a score of 0.8124 on the CDnet2014 dataset.
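The jointly optimized objective can be sketched as follows (my own illustration; the relative weight `lam` and the binary cross-entropy form of the segmentation loss are assumptions, not details given in the text):

```python
import numpy as np

def joint_loss(bg_pred, bg_true, fg_prob, fg_label, lam=1.0):
    # l2 reconstruction loss for the restored background, plus a binary
    # cross-entropy segmentation loss for the foreground probability map.
    recon = np.mean((bg_pred - bg_true) ** 2)
    eps = 1e-12  # numerical guard for log(0)
    seg = -np.mean(fg_label * np.log(fg_prob + eps)
                   + (1 - fg_label) * np.log(1 - fg_prob + eps))
    return recon + lam * seg
```

Minimizing the two terms together is what lets the first stage restore a clean background while the second stage learns an accurate foreground segmentation.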
Li et al. [101] proposed to predict object locations in a surveillance scene
with an adaptive deep CNN (ADCNN). First, a generic CNN-based classifier is
transferred to the surveillance scene by selecting useful kernels. After that, a
regression model is employed to learn the context information of the surveillance
scene in order to achieve accurate location prediction. ADCNN obtains very
promising performance on several surveillance datasets for pedestrian detection
and vehicle detection. However, ADCNN focuses on object detection, and thus
does not use the principle of background subtraction. Moreover, the performance
of ADCNN was reported on the CUHK square dataset [102], the MIT traffic
dataset [103] and PETS 2007 instead of the CDnet2014 dataset.
boundaries and holes in the foreground mask. Experimental results on CDnet
2014 show that Struct-CNN outperforms SuBSENSE [86], PAWCS [99], FTSG
[91] and SharedModel [100] in the case of bad weather, camera jitter, low frame
rate, intermittent object motion and thermal imagery. The F-Measure score
excluding the PTZ category is 0.8645. The authors excluded this category,
arguing that they focused only on static cameras.
2.6. 3D CNNs
Sakkos et al. [106] proposed an end-to-end 3D-CNN to track temporal changes in
video sequences, without using a background model for the training. For this
reason, the 3D-CNN is able to process multiple scenes without further fine-tuning.
The network architecture is inspired by the C3D branch [107]. In practice, the 3D-
CNN outperforms ConvNet [67] and Deep CNN [72]. Furthermore, the
evaluation on the ESI dataset [108], with extreme and sudden illumination
changes, shows that the 3D-CNN obtains higher scores than two background
subtraction methods designed to be illumination-invariant (Universal Multimode
Background Subtraction (UMBS) [109] and ESI [108]). On the CDnet 2014 dataset,
the proposed framework achieved an average F-Measure of 0.9507.
Yu et al. [117] designed a spatial-temporal attention-based 3D ConvNet to
jointly learn the appearance and motion of objects-of-interest in a video, with a
Relevant Motion Event detection Network (ReMotENet). Similar to the work of
Sakkos et al. [106], the architecture of the proposed network is borrowed from the
C3D branch [107]. However, instead of using max pooling both spatially and
temporally, the authors separated the spatial and temporal max pooling, to capture
fine-grained temporal information, and to make the network deeper so that it learns
better representations. Experimental results show that ReMotENet obtains
results comparable to the object detection-based method, while being three to four
orders of magnitude faster. With a model size of less than 1 MB, it is able to detect
relevant motion in a 15 s video in 4-8 milliseconds on a GPU, and in a fraction of a
second on a CPU.
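The separation of spatial and temporal max pooling can be sketched on a raw clip tensor of shape (time, height, width); this is an illustration of the idea only, and the kernel sizes are my own assumptions:

```python
import numpy as np

def spatial_then_temporal_pool(clip, ks=2, kt=2):
    # Pool space and time in two separate stages instead of one joint 3D
    # max pool, so fine-grained temporal information survives the first stage.
    t, h, w = clip.shape
    # stage 1: ks x ks spatial max pooling, frame by frame
    s = (clip[:, :h - h % ks, :w - w % ks]
         .reshape(t, h // ks, ks, w // ks, ks)
         .max(axis=(2, 4)))
    # stage 2: temporal max pooling over kt consecutive frames
    s2 = s[:t - t % kt].reshape(t // kt, kt, *s.shape[1:]).max(axis=1)
    return s2
```

In a real network, convolutions sit between the two pooling stages; the point here is only the shape bookkeeping of splitting the 3D pool.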
In another work, Hu et al. [71] developed a 3D atrous CNN model which
can learn deep spatial-temporal features without losing resolution information.
The authors combined this model with two convolutional long short-term
memory (ConvLSTM) networks to capture both short-term and long-term spatio-
temporal information of the input frames. In addition, the 3D atrous ConvLSTM
does not require any pre- or post-processing of the data, but processes the data in a
completely end-to-end manner. Experimental results on the CDnet 2014 dataset show
that the 3D atrous CNN outperforms SuBSENSE, Cascaded CNN and DeepBS.
3. Experimental Results
evaluation is reported for all frames. All the challenges of these different
categories have different spatial and temporal properties.
The F-measures obtained by the different DNN algorithms are compared
with the F-measures of other representative background subtraction algorithms
over the complete evaluation dataset: (1) two conventional statistical models
(MOG [128], RMOG [132]), and (2) three advanced non-parametric models
(SuBSENSE [126], PAWCS [127], and Spectral-360 [114]). The evaluation of
deep learning based background separation models is reported for the following
categories:
• Pixel-wise algorithms: The algorithms in this category were directly applied
by the authors to background/foreground separation without considering
spatial and temporal constraints. Thus, they may introduce isolated false
positives and false negatives. We compare two algorithms: FgSegNet (multi-
scale) [80] and BScGAN [10].
• Temporal-wise algorithms: These algorithms model the dependencies
among adjacent temporal pixels and thus enforce temporal coherence. We
compare one algorithm: 3D-CNN [110].
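The F-measure scores compared below are computed from the binary foreground masks; a minimal sketch of the metric (the per-video and per-category averaging conventions of CDnet are omitted here):

```python
import numpy as np

def f_measure(pred, gt):
    # Harmonic mean of precision and recall, from binary masks.
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the metric balances precision and recall, both the isolated false positives of pixel-wise methods and the missed pixels of over-smoothed masks lower the score.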
Table 1 groups the different F-measures, which come either from the
corresponding papers or from the CDnet 2014 website. In the same way, Table 2
shows some visual results obtained using SuBSENSE [126], FgSegNet-V2 [61],
and BPVGAN [63].
Table 1. F-measure metric over the 11 categories of the CDnet2014 dataset, namely Baseline (BSL), Dynamic Background (DBG), Camera Jitter (CJT), Intermittent
Object Motion (IOM), Shadows (SHD), Thermal (THM), Bad Weather (BDW), Low Frame Rate (LFR), Night Videos (NVD), PTZ, and Turbulence (TBL). In
bold, the best score in each algorithm's category. The top 10 methods are indicated with their rank. There are three groups of leading methods: the FgSegNet
group, the 3D-CNNs group and the GANs group.
Algorithms (Authors) BSL DBG CJT IOM SHD THM BDW LFR NVD PTZ TBL Average F-Measure
Basic statistical models
Table 2. Visual results on CDnet 2014 dataset: From left to right: Original images, Ground-Truth images, SubSENSE [126], FgSegNet-V2 [61],
BPVGAN [63].
Rows, from top to bottom (category / sequence / frame): B-Weather / Skating (in002349); Baseline / Pedestrian (in000490); C-Jitter / Badminton (in001123); Dynamic-B / Fall (in002416); I-O-Motion / Sofa (in001314).
Deep Learning Based Background Subtraction 67
4. Conclusion
In this chapter, we have presented a full review of recent advances in the use of
deep neural networks applied to background subtraction for the detection of moving
objects in videos taken by a static camera. The experiments reported on the large-
scale CDnet 2014 dataset show the performance gap obtained by the
supervised deep neural network methods in this field. Although applying deep
neural networks to the background subtraction problem has received significant
attention in the last two years since the paper of Braham and Van Droogenbroeck
[67], there are many unsolved important issues. Researchers need to answer the
question: what is the most suitable type of deep neural network and its
corresponding architecture for background initialization, background subtraction
and deep learned features in the presence of complex backgrounds? Several
authors avoid experiments on the "PTZ" category, and when the F-Measure is
provided the score is not always very high. Thus, it seems that the current deep
neural networks tested have problems in the case of moving cameras. In the field
of background subtraction, only convolutional neural networks and generative
adversarial networks have been employed. Thus, future directions may
investigate the adequacy of deep belief networks, deep restricted kernel
neural networks [129], probabilistic neural networks [130] and fuzzy neural
networks [131], in the case of static as well as moving cameras.
References
[1] S. Cheung, C. Kamath, “Robust Background Subtraction with Foreground Validation for
Urban Traffic Video”, Journal of Applied Signal Processing, 14, 2330-2340, 2005.
[2] J. Carranza, C. Theobalt, M. Magnor, H. Seidel, “Free-Viewpoint Video of Human Actors”,
ACM Transactions on Graphics, 22 (3), 569-577, 2003.
[3] F. El Baf, T. Bouwmans, B. Vachon, “Comparison of Background Subtraction Methods for
a Multimedia Learning Space”, SIGMAP 2007, Jul. 2007.
[4] I. Junejo, A. Bhutta, H. Foroosh, “Single Class Support Vector Machine (SVM) for Scene
Modeling”, Journal of Signal, Image and Video Processing, May 2011.
[5] J. Wang, G. Bebis, M. Nicolescu, M. Nicolescu, R. Miller, “Improving target detection by
coupling it with tracking”, Machine Vision and Application, pages 1-19, 2008.
[6] A. Tavakkoli, M. Nicolescu, G. Bebis, “A Novelty Detection Approach for Foreground
Region Detection in Videos with Quasi-stationary Backgrounds”, ISVC 2006, pages 40-49,
Lake Tahoe, NV, November 2006.
[7] F. El Baf, T. Bouwmans, B. Vachon, “Fuzzy integral for moving object detection”, IEEE
FUZZ-IEEE 2008, pages 1729–1736, June 2008.
[8] F. El Baf, T. Bouwmans, B. Vachon, “Type-2 fuzzy mixture of Gaussians model:
Application to background modeling”, ISVC 2008, pages 772–781, December 2008.
68 J. H. Giraldo et al.
[74] L. Cinelli, “Anomaly Detection in Surveillance Videos using Deep Residual Networks",
Master Thesis, Universidade de Rio de Janeiro, February 2017.
[75] K. He, X. Zhang, S. Ren, "Deep residual learning for image recognition", IEEE CVPR
2016, June 2016.
[76] L. Yang, J. Li, Y. Luo, Y. Zhao, H. Cheng, J. Li, "Deep Background Modeling Using Fully
Convolutional Network", IEEE Transactions on Intelligent Transportation Systems, 2017.
[77] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, “DeepLab: Semantic image
segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs”,
Tech. Rep., 2016.
[78] K. He, X. Zhang, S. Ren, J. Sun, "Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification", IEEE ICCV 2015, pages 1026–1034, 2015.
[79] C. Stauffer, W. Grimson, “Adaptive background mixture models for real-time tracking”,
IEEE CVPR 1999, pages 246-252, 1999.
[80] K. Kim, T. H. Chalidabhongse, D. Harwood, L. Davis, “Background Modeling and
Subtraction by Codebook Construction”, IEEE ICIP 2004, 2004.
[81] O. Barnich, M. Van Droogenbroeck, “ViBe: a powerful random technique to estimate the
background in video sequences”, ICASSP 2009, pages 945-948, April 2009.
[82] M. Hofmann, P. Tiefenbacher, G. Rigoll, "Background Segmentation with Feedback: The
Pixel-Based Adaptive Segmenter", IEEE Workshop on Change Detection, CVPR 2012,
June 2012.
[83] L. Yang, H. Cheng, J. Su, X. Li, “Pixel-to-model distance for robust background
reconstruction", IEEE Transactions on Circuits and Systems for Video Technology, April
2015.
[84] T. Akilan, "A Foreground Inference Network for Video Surveillance using Multi-View
Receptive Field", Preprint, January 2018.
[85] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, A. Rabinovich,
"Going deeper with convolutions", IEEE CVPR 2015, pages 1-9, 2015.
[86] P. St-Charles, G. Bilodeau, R. Bergevin, "Flexible Background Subtraction with Self-
Balanced Local Sensitivity", IEEE CDW 2014, June 2014.
[87] Z. Zhao, X. Zhang, Y. Fang, “Stacked multilayer self-organizing map for background
modeling”, IEEE Transactions on Image Processing, Vol. 24, No. 9, pages 2841–2850,
2015.
[88] Y. Zhang, X. Li, Z. Zhang, F. Wu, L. Zhao, “Deep learning driven block-wise moving
object detection with binary scene modeling”, Neurocomputing, Vol. 168, pages 454-463,
2015.
[89] D. Zeng, M. Zhu, "Multiscale Fully Convolutional Network for Foreground Object
Detection in Infrared Videos", IEEE Geoscience and Remote Sensing Letters, 2018.
[90] D. Zeng, M. Zhu, “Combining Background Subtraction Algorithms with Convolutional
Neural Network”, Preprint, 2018.
[91] R. Wang, F. Bunyak, G. Seetharaman, K. Palaniappan, “Static and moving object detection
using flux tensor with split Gaussian model”, IEEE CVPR 2014 Workshops, pages 414–
418, 2014.
[92] M. De Gregorio, M. Giordano, “CwisarDH+: Background detection in RGBD videos by
learning of weightless neural networks”, ICIAP 2017, pages 242–253, 2017.
[93] C. Lin, B. Yan, W. Tan, "Foreground Detection in Surveillance Video with Fully
Convolutional Semantic Network", IEEE ICIP 2018, pages 4118-4122, Athens, Greece,
October 2018.
[94] J. Long, E. Shelhamer, T. Darrell, “Fully convolutional networks for semantic
segmentation,” IEEE CVPR 2015, pages 3431-3440, 2015.
[95] X. Zhao, Y. Chen, M. Tang, J. Wang, "Joint Background Reconstruction and Foreground
Segmentation via A Two-stage Convolutional Neural Network", Preprint, 2017.
[96] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. Efros, "Context encoders: Feature
learning by inpainting", arXiv preprint arXiv:1604.07379, 2016.
CHAPTER 1.4

Similarity Domains Network for Modeling Shapes and Extracting Skeletons without Large Datasets
Sedat Ozer
Bilkent University, Ankara, Turkey
sedatist@gmail.com
In this chapter, we present a method to model and extract the skeleton of a shape with the
recently proposed similarity domains network (SDN). SDN is especially useful when only
one image sample is available and no additional pre-trained model exists. SDN is a neural
network with a single hidden layer and explainable kernel parameters. Within the SDN
framework, kernel parameters have a geometric meaning, encapsulated by similarity
domains (SDs) in the feature space. We model the SDs with Gaussian kernel functions. A
similarity domain is a d-dimensional sphere in the d-dimensional feature space; it represents
the region of influence of an important data sample, and any other data point that falls inside
the similarity domain of that important sample is considered similar to it and shares its class
label. In this chapter, we first demonstrate how SDN can model a pixel-based image in terms
of SDs, and then demonstrate how those learned SDs can be used to extract the skeleton of a
shape.
1. Introduction
Recent advances in deep learning have drawn the attention of many researchers to neural-
network-based solutions for shape understanding, shape analysis and parametric shape
modeling. While a significant amount of research on skeleton extraction and modeling
from shapes was done in the past, the improved success of deep networks in object
detection and classification applications has also moved the attention of researchers
towards neural-network-based solutions for skeleton extraction and modeling. In this
chapter, we introduce a novel shape modeling algorithm based on Radial Basis
Networks (RBNs), a particular type of neural network that uses radial basis
functions (RBFs) as the activation functions in its hidden layer. RBFs have been used in the
literature for many classification tasks, including the original LeNet architecture [1]. Even
though RBFs are useful for modeling surfaces and for various classification tasks as in [2–8],
many challenges appear when the goal is modeling a shape and extracting a skeleton
with RBFs in neural networks. Two such challenges are: (I) estimating
the optimal number of RBFs used in the network (e.g., the number of yellow circles in our 2D image
examples) along with their optimal locations (their centroid values), and (II)
Fig. 1. This figure demonstrates how the shape parameters of SDN can be utilized on shapes. (a) The original
binary input image. (b) The altered image obtained by utilizing SDN’s shape parameters: each object is scaled and
shifted individually. We first used a region growing algorithm to isolate the kernel parameters
of each object and then individually scaled and shifted them. (c) All the computed shape parameters of the input
binary image, visualized. (d) Only the foreground parameters, visualized.
estimating the optimal parameters of those RBFs by relating them to shapes geometrically. The
kernel parameter is typically known as the scale or the shape parameter (representing
the radius of a circle in the figures); the two terms are used interchangeably in the literature. The
standard RBNs as defined in [9] apply the same kernel parameter value to each basis function
used in the network architecture. Recent literature has focused on using multiple kernels with
individual and distinct kernel parameters, as in [10] and [11]. While the idea of utilizing
different kernels with individual parameters has been heavily studied in the literature under
the “Multiple Kernel Learning” (MKL) framework as formally modeled in [11], there are
not many efficient approaches and available implementations focusing on utilizing multiple
kernels with their own parameters in RBNs for shape modeling. Recently, the work in [12]
combined the optimization advances achieved in the kernel-machine domain with radial
basis networks and introduced a novel algorithm for shape analysis. In this chapter, we
call that algorithm the “Similarity Domains Network” (SDN) and discuss its benefits from
both the shape modeling (see Figure 1) and the skeleton extraction perspectives. As we
demonstrate in this chapter, the computed similarity domains of SDN can be used to
obtain parametric models not only for shapes but also for their skeletons.
2. Related Work
Skeleton extraction has been widely studied in the literature, as in [13–16]. In this
chapter, however, we focus on how to utilize the SDs obtained by a recently
introduced algorithm, the Similarity Domains Network, and demonstrate how to obtain
parametric models for shapes and to extract the skeleton of a shape. SDs retain only a portion
of the entire data, and thus provide a means to reduce the computational complexity of
skeleton extraction and shape modeling. Our proposed algorithm, SDN, is related to both
radial basis networks and kernel machines. However, in this chapter, we mainly present
our algorithm from the neural-network perspective and relate it to radial
basis networks (RBNs). In the past, RBN-related research mainly focused on computing
the single optimal kernel parameter (i.e., the scale or shape parameter) that was used in all of
the RBFs, as in [17, 18]. While the computation of parameters for multiple kernels has been
heavily studied under the MKL framework in the literature (for examples, see the survey
papers [19, 20]), the computation of multiple kernel parameters in RBNs has mostly been
studied under two main approaches: optimization-based methods and heuristic methods. For
example, in [21], the authors proposed using multiple scales as opposed to a single
scale value in RBNs. Their approach first computes the standard deviation of each
cluster (after applying a k-means-like clustering on the data) and then uses a scaled
version of each cluster's standard deviation as the shape parameter of each RBF in
the network. The work in [22] used a similar approach, taking the root-mean-square
deviation (RMSD) between each RBF center and the data as the parameter of that RBF in the
network; the authors used a modified orthogonal least squares (OLS) algorithm to select
the RBF centers. The work in [10] applied the k-means algorithm to the training data to choose
k centers, used those as the RBF centers, and then ran separate optimizations to
compute the kernel parameters and the kernel weights (see the next section for the formal
definitions). Using additional optimization steps for different sets of parameters is costly
and makes it harder to interpret those parameters and to relate them to shapes geometrically
and accurately. As an alternative, the work in [12] proposed a geometric
approach that uses the distance between data samples as a geometric constraint. In [12],
we did not use the well-known MKL model. Instead, we defined the interpretable similarity-
domains concept using RBFs and developed our own optimization approach with geometric
constraints, similar to the original Sequential Minimal Optimization (SMO) algorithm [23].
Consequently, the SDN algorithm combines both RBN and kernel-machine concepts into
a novel algorithm with geometrically interpretable kernel parameters.
In this chapter, we demonstrate the use of SDN for parametric shape modeling and skeleton
extraction. Unlike existing work on radial basis networks, instead of applying an initial
k-means or OLS algorithm to compute the kernel centers separately, or using
multiple cost functions, SDN chooses the RBF centers and their number automatically
via its sparse modeling and optimizes a single cost function with a geometric
constraint. That is where SDN differs from other similar RBN works, which would have
difficulty computing all those parameters within a single optimization step while
automatically adjusting the number of RBFs used in the network.
Fig. 2. An illustration of SDN as a radial basis network. The network contains a single hidden layer.
The input layer (d dimensional input vector) is connected to n radial basis functions. The output is
the weighted sum of the radial basis functions’ outputs.
3. Similarity Domains
A similarity domain [12] is a geometric concept that defines a local similarity region around a
particular data sample, where that sample represents the center of a similarity sphere (i.e., the
similarity domain) in Euclidean space. Through similarity domains, we can define a unified
optimization problem in which the kernel parameters are computed automatically and
geometrically. We formalize the similarity domain of $x_i \in \mathbb{R}^d$ as the sphere in $\mathbb{R}^d$ whose
center is the support vector (SV) $x_i$ and whose radius is $r_i$. The radius $r_i$ is defined as
follows:

For any (+1) labelled support vector $x_i^+$, where $x_i^+ \in \mathbb{R}^d$ and the superscript (+) represents the (+1) class:

$r_i = \min(\|x_i^+ - x_1^-\|, \ldots, \|x_i^+ - x_k^-\|)/2,$   (1)

where the superscript (−) means the (−1) class.

For any (−1) labelled support vector $x_i^-$:

$r_i = \min(\|x_i^- - x_1^+\|, \ldots, \|x_i^- - x_k^+\|)/2.$   (2)
In this work, we use the Gaussian kernel function to represent similarities and similarity
domains as follows:

$K_{\sigma_i}(x, x_i) = \exp(-\|x - x_i\|^2 / \sigma_i^2)$   (3)

where $\sigma_i$ is the kernel parameter for SV $x_i$. The similarity (kernel) function takes its
maximum value where $x = x_i$. The relation between $r_i$ and $\sigma_i$ is $r_i^2 = a\sigma_i^2$,
where $a$ is a domain-specific scalar (constant). In our image experiments, the value of $a$ is
found via a grid search, and we observed that setting $a = 2.85$ suffices for all the images
used in our experiments.
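The radius and kernel definitions above can be sketched in a few lines of NumPy. This is our own illustration, not the authors' code; the toy points below are hypothetical:

```python
import numpy as np

def sd_radii(X, y):
    """Similarity-domain radii of Eqs. (1)-(2): r_i is half the distance
    from x_i to the closest sample of the opposite class."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    opposite = y[:, None] != y[None, :]        # consider opposite-class pairs only
    dists = np.where(opposite, dists, np.inf)
    return dists.min(axis=1) / 2.0

def gaussian_kernel(x, x_i, sigma2):
    """Gaussian similarity of Eq. (3): exp(-||x - x_i||^2 / sigma_i^2)."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(x_i)) ** 2) / sigma2))

# Toy example: two (+1) samples and one (-1) sample on a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
y = np.array([1, 1, -1])
r = sd_radii(X, y)   # r[0] = 2.0: half the distance from x_0 to the (-1) sample
```

Here the kernel evaluates to 1 exactly at the center, matching the statement that the similarity function peaks at $x = x_i$.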
Note that, in contrast to [24, 25], our similarity domain definition differs from the term
“minimal enclosing sphere”. In our approach, we define the term similarity domain as
the dominant region of a SV in which the SV is the centroid and all the points within the
domain are similar to the SV. The boundary of the similarity domain of a SV is defined
based on its distance to the closest point from the other class. Thus any given vector within
a similarity domain (a region) will be similar to the associated SV of that similarity domain.
We will use the similarity domain concept to define a kernel machine that computes its
kernel parameters automatically and geometrically in the next section.
4. Similarity Domains Network

A typical radial basis network (RBN) includes a single hidden layer and uses a radial basis
function (RBF) as the activation function in each neuron in that hidden layer (i.e., the
hidden layer uses n RBFs). Similar to RBN, a similarity domains network also uses a single
hidden layer where the activation functions are radial basis functions. Unlike a typical RBN
that uses the same kernel parameter in all the radial basis functions in the hidden layer, SDN
uses different kernel parameters for each RBF used in the hidden layer. The illustration of
SDN as a radial basis network is given in Figure 2. While the number of RBFs in the hidden
layer is decided by different algorithms in RBNs (as discussed in the previous section),
SDN assigns an RBF to each training sample. In the figure, the hidden layer uses all of the n
training samples as RBF centers and then, through its sparse optimization, selects a subset of
the training data (e.g., a subset of pixels for shape modeling), reducing that number n to k,
where n ≥ k. SDN models the decision boundary as a weighted combination of Similarity
Domains (SDs). A similarity domain is a d dimensional sphere in the d dimensional feature
space. Each similarity domain is centered at an RBF center and modeled with a Gaussian
RBF in SDN. SDN estimates the label $\hat{y}$ of a given input vector $x$ as shown below:

$\hat{y} = \mathrm{sign}(f(x)) \quad \text{and} \quad f(x) = \sum_{i=1}^{k} \alpha_i y_i K_{\sigma_i}(x, x_i),$   (4)

where the scalar $\alpha_i$ is a nonzero weight for the RBF center $x_i$, $y_i \in \{-1, +1\}$ is the class label
of the training data and $k$ is the total number of RBF centers. $K(\cdot)$ is the Gaussian RBF
kernel defined as:

$K_{\sigma_i}(x, x_i) = \exp(-\|x - x_i\|^2 / \sigma_i^2)$   (5)
where $\sigma_i$ is the shape parameter for the center $x_i$. The centers are automatically selected
among the training data during training via the following cost function:

$\max_{\alpha} Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K_{\sigma_{ij}}(x_i, x_j),$

subject to: $\sum_{i=1}^{n} \alpha_i y_i = 0$, $C \geq \alpha_i \geq 0$ for $i = 1, 2, \ldots, n$,   (6)
where $T$ is a constant scalar value assuring that the RBF function yields a smaller value
for any given pair of samples from different classes. The shape parameter $\sigma_{ij}$ is defined as
$\sigma_{ij} = \min(\sigma_i, \sigma_j)$. For a given closest pair of vectors $x_i$ and $x_j$ for which $y_i y_j = -1$, we
can define the kernel parameters as follows:

$\sigma_i^2 = \sigma_j^2 = \dfrac{-\|x_i - x_j\|^2}{\ln(K(x_i, x_j))}$   (7)
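Eq. (7) simply inverts the Gaussian kernel for the closest opposite-class pair. The following sketch assumes the kernel value at that pair is the training threshold $T$ (the chapter trains SDN at T = 0.05); reading $K(x_i, x_j) = T$ here is our interpretation, not an explicit statement of the text:

```python
import numpy as np

def shape_param_from_pair(x_i, x_j, T):
    """Shared sigma^2 of Eq. (7) for the closest opposite-class pair,
    with the kernel value K(x_i, x_j) taken to be the threshold T."""
    d2 = float(np.sum((np.asarray(x_i) - np.asarray(x_j)) ** 2))
    return -d2 / np.log(T)

# Unit-distance pair; by construction exp(-d^2 / sigma2) recovers T.
sigma2 = shape_param_from_pair([0.0, 0.0], [1.0, 0.0], T=0.05)
```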
As a result, the decision function takes the form:

$f(x) = \sum_{i=1}^{k} \alpha_i y_i \exp\!\left(-\dfrac{\|x - x_i\|^2}{\sigma_i^2}\right) - b$   (8)

where $k$ is the total number of support vectors. In our algorithm the bias value $b$ is constant
and equal to 0.
Discussion: Normally, avoiding the term $b$ in the decision function eliminates the
constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$ in the optimization problem. However, since $y_i \in \{-1, +1\}$, the sum in
that constraint can be rewritten as $\sum_{i=1}^{n} \alpha_i y_i = \sum_{i=1}^{m_1} \alpha_i - \sum_{i=1}^{m_2} \alpha_i = 0$, where $m_1 + m_2 = n$.
This means that if the $\alpha_i$ values are around (or equal to) the value of 1 for all $i$, then this
constraint also implies that the total numbers of support vectors from the two classes should be
equal or similar, i.e., $m_1 \approx m_2$. That is why we keep the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$
in our algorithm, as it helps us compute SVs from both classes in comparable
numbers.
The decision function $f(x)$ can be expressed as:
$f(x) = \sum_{i=1}^{k_1} \alpha_i y_i K_i(x, x_i) + \sum_{j=1}^{k_2} \alpha_j y_j K_j(x, x_j)$, where $k_1$ is the total number of SVs
near the vector $x$, i.e., those for which $\|x_i - x\|^2 - \sigma_i^2 \approx 0$, and $k_2$ is the total
number of SVs for which $\|x_j - x\|^2 - \sigma_j^2 \gg 0$. Notice that $k_1 + k_2 = k$. Since the
Gaussian kernel responses of the far SVs are negligible, local predictions can be made by the
approximated decision function:
$f(x) \approx \sum_{i=1}^{k_1} \alpha_i y_i K_i(x, x_i)$. This approach can simplify the computations on large
datasets, since we do not require access to all of the available SVs. Further
details on SDs and the SDN formulation can be found in [12].
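The decision function of Eq. (4) and its local approximation can be sketched as follows, assuming a small set of already-trained centers, weights and shape parameters; the "near" cut-off below is our own heuristic, not a rule from the chapter:

```python
import numpy as np

def sdn_predict(x, centers, alphas, labels, sigma2s, local=False):
    """Evaluate f(x) of Eq. (4) and return sign(f(x)).
    With local=True, keep only the SVs near x (the k_1 terms) and drop
    the far SVs, whose Gaussian kernel response is negligible."""
    d2 = np.sum((centers - np.asarray(x)) ** 2, axis=1)   # ||x - x_i||^2
    K = np.exp(-d2 / sigma2s)                             # Eq. (5)
    w = alphas * labels
    if local:
        near = d2 - sigma2s < 3.0 * sigma2s   # heuristic cut-off (our choice)
        w, K = w[near], K[near]
    return int(np.sign(np.sum(w * K)))

# Hypothetical trained model: one (+1) and one (-1) center.
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([1, -1])
sigma2s = np.array([1.0, 1.0])
```

A query near the first center is dominated by its kernel term and is classified +1; a query near the second center is classified -1.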
5. Shape Modeling with SDN

Sparse and parametric shape modeling is a challenge in the literature. For shape modeling,
we propose using SDNs, which can model shapes sparsely with their computed kernel
parameters. We first train SDN to learn the shape as a decision boundary from the given
binary image: we label the shape (e.g., the white region in Fig. 3a) as foreground
and everything else (e.g., the black region in Fig. 3a) as background, while using each
pixel's 2D coordinates as features. Once the image is learned by SDN, the computed kernel
parameters of SDN, along with their 2D coordinates, are used to model the shape with our
one-class classifier without performing any re-training.
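The construction of the training set from a binary image can be sketched as below (our illustration; the function name and the toy image are hypothetical):

```python
import numpy as np

def image_to_training_set(img):
    """Each pixel's (row, col) coordinates become a 2-D feature vector;
    white (nonzero) pixels are labelled +1 (foreground), black ones -1."""
    rows, cols = np.indices(img.shape)
    X = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    y = np.where(img.ravel() > 0, 1, -1)
    return X, y

img = np.zeros((4, 4), dtype=int)
img[1:3, 1:3] = 1                      # a small white square on black
X, y = image_to_training_set(img)      # 16 samples, 4 labelled +1
```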
As mentioned earlier, we can use Gaussian RBFs and their shape parameters (i.e., the
kernel parameters) to model shapes parametrically within the SDN framework. For that
purpose, we can save and use only the foreground (the shape's) RBF (or SD) centers and
their shape parameters to obtain a one-class classifier. The computed RBF centers of SDN
can be grouped into a foreground set and a background set as
$C_1 = \{x_i\}_{i=1, y_i = +1}^{s_1}$ and $C_2 = \{x_i\}_{i=1, y_i = -1}^{s_2}$, where $s_1 + s_2 = k$, $s_1$ is the total number
of centers from the (+1) class and $s_2$ is the total number of centers from the (−1) class.
Since the Gaussian kernel functions (RBFs) now represent local SDs geometrically, the
original decision function $f(x)$ can be approximated by using only $C_1$ (or by using
only $C_2$). Therefore, we define the one-class approximation using only the centers and
the associated kernel parameters from $C_1$, for any given $x$, as follows:

$\hat{y} = +1$ if $\|x - x_i\| < a\sigma_i^2$ for some $x_i \in C_1$; otherwise $\hat{y} = -1$,   (9)

where the SD radius for the $i$th center $x_i$ is defined as $a\sigma_i^2$ and $a$ is a domain-specific
constant. One-class approximation examples are given in Figure 1b, where we used only
the SDs from the foreground to reconstruct the altered image.
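Eq. (9) reduces to a simple membership test over the foreground centers. A minimal sketch, using the chapter's radius convention $a\sigma_i^2$ with $a = 2.85$ (the toy centers are hypothetical):

```python
import numpy as np

def one_class_predict(x, fg_centers, fg_sigma2s, a=2.85):
    """Eq. (9): +1 if x falls inside the similarity domain of any
    foreground center (radius a * sigma_i^2), otherwise -1."""
    d = np.linalg.norm(fg_centers - np.asarray(x), axis=1)   # ||x - x_i||
    return 1 if np.any(d < a * fg_sigma2s) else -1

# One foreground SD centered at the origin with sigma^2 = 1 (radius 2.85).
fg_centers = np.array([[0.0, 0.0]])
fg_sigma2s = np.array([1.0])
```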
The parametric and geometric properties of SDN provide new parameters for analyzing a
shape via its similarity domains. Furthermore, while typical neural-network-based
approaches to skeleton estimation focus on learning from multiple images, SDN can learn
a shape's parameters from a single given image, without requiring any additional dataset
or pre-trained model. Therefore, SDN is advantageous especially in cases where the
data is very limited or only one sample shape is available.
6. Skeleton Extraction from Similarity Domains

Once learned and computed by SDN, the similarity domains (SDs) can be used to
extract the skeleton of a given shape. Because the computation considers only the existing
SDs, skeleton extraction requires only a subset of the pixels (i.e., the SD centers).
To extract the skeleton, we first bin the computed shape parameters
($\sigma_i^2$) into $m$ bins (in our experiments $m$ is set to 10). Since the majority of the
similarity domains typically lie around the object (or shape) boundary, their parameters take small values.
Fig. 4. The results of filtering the shape parameters at different thresholds, visualized for the image
shown in Figure 3a. The remaining similarity domains after thresholding and the skeletons extracted
from them are shown: (a) all of the foreground radii $r_i$; (b) for $\sigma_i^2 > 29.12$; (c) for $\sigma_i^2 > 48.32$; (d) for
$\sigma_i^2 > 67.51$; (e) for $\sigma_i^2 > 86.71$; (f) for $\sigma_i^2 > 105.90$.
With a simple thresholding process, we can eliminate the small SDs from the subset in
which we search for the skeleton; eliminating them first leaves fewer SDs to
consider for skeleton extraction. After eliminating those small SDs and their computed
parameters, we connect the centers of the remaining SDs by tracing the overlapping SDs.
If non-overlapping SDs exist within the same shape after the thresholding process, we
perform linear estimation and connect the closest SDs, interpolating a line between them
to visualize the skeleton in our figures. Thresholding the kernel parameters of the SDs at
different values yields different sets of SDs, and therefore different skeletons, as shown
in Figure 4.
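The threshold-and-connect steps above can be sketched naively as follows. This is our simplified illustration: it links the centers of any two surviving SDs whose spheres overlap, and omits the chapter's center-suppression and line-interpolation details. The toy centers are hypothetical:

```python
import numpy as np

def skeleton_edges(centers, sigma2s, threshold, a=2.85):
    """Threshold away small SDs, then link centers of overlapping SDs
    (center distance below the sum of the two SD radii a * sigma^2)."""
    keep = sigma2s > threshold                  # drop small boundary SDs
    C, radii = centers[keep], a * sigma2s[keep]
    edges = [(i, j)
             for i in range(len(C)) for j in range(i + 1, len(C))
             if np.linalg.norm(C[i] - C[j]) < radii[i] + radii[j]]
    return C, edges

# Two large interior SDs and one small boundary SD.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
sigma2s = np.array([5.0, 5.0, 0.1])
C, edges = skeleton_edges(centers, sigma2s, threshold=1.0)
```

The small SD is filtered out, and the two remaining overlapping SDs are joined by a single skeleton segment.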
7. Experiments
In this section, we demonstrate how to use SDN for parametric shape learning from a single
input image, without requiring any additional dataset. Since it is hard to model shapes
with the standard RBNs, and since no good RBN implementation was available to
us, we did not use any other RBN in our experiments for comparison. As discussed
in the earlier sections, the standard RBNs involve many issues and multiple individual steps
to compute the RBN parameters, including choosing the total number of RBF centers, finding
the center values, and computing the shape parameters at those centers. Moreover, a
comparison of kernel machines (SVM) and SDN on shape modeling has already been
studied in the literature (see [12]). Therefore, in this section, we focus on parametric
shape modeling and skeleton extraction from SDs by using SDNs. In the figures, all the
images are resized to fit.
Here, we first demonstrate visualizing the computed shape parameters of SDN on a
sample image in Figure 3. Figure 3a shows the original input image. We used each pixel's
2D coordinates in the image as the features of the training data, and each pixel's color
(black or white) as the training label. SDN was trained at T = 0.05. SDN learned and
modeled the shape and reconstructed it with zero pixel error by using 1393 SDs, where the
pixel error is the total number of wrongly classified pixels in the image. Figure 3b visualizes all the
computed shape parameters of the RBF centers of SDN as circles, and Figure 3c visualizes
those of the foreground only. The radius of a circle in all figures is computed as $a\sigma_i^2$
where $a = 2.85$; we found the value of $a$ through a heuristic search and noticed that 2.85
suffices for all the shape experiments that we had. There are a total of 629 foreground RBF
centers computed by SDN (only 2.51% of all the input image pixels).
Next, we demonstrate skeleton extraction from the computed similarity domains as a
proof of concept. Extracting the skeleton from the SDs, as opposed to extracting it from the
pixels, simplifies the computations, since the SDs are only a small portion of the total number of
pixels (reducing the search space). To extract the skeleton from the computed SDs, we first
quantize the shape parameters of the object into 10 bins and then, starting from the largest
bin, select the most useful bin value to threshold the shape parameters. The remaining
SD centers are connected based on their overlapping similarity domains. If multiple SDs
overlap inside the same SD, we look at their centers and ignore the SDs whose centers
fall within the same SD (keeping the original SD center). That is why some points of
the shape are not considered part of the skeleton in Figure 4. Figure 4 shows
the remaining (thresholded) SD centers and their radii at various thresholds in yellow.
In the figure, the skeletons (shown as blue lines) are extracted by considering only the
SDs remaining after thresholding, as explained in Section 6. Another example is shown in
Figure 5. The input binary image is shown in Figure 5a, Figure 5b shows all the foreground
similarity domains, and the skeleton extracted from the thresholded SDs is visualized as a
blue line in Figure 5c.
One benefit of using only SDs to (re)compute skeletons is that, since the SDs are a subset
of the training data, the number of SDs is considerably smaller than the total number of pixels
that would otherwise need to be considered for skeleton computation. While the skeleton extraction
algorithm shown here is a naive and basic one, our goal is to show the use of SDs for the
(re)computation of skeletons instead of using all the pixels of the shape.
Table 1. Bin centers for the quantized foreground shape parameters ($\sigma_i^2$) and the total number of shape
parameters that fall in each bin for the image in Fig. 3a.
Bin Center: 9.93 29.12 48.32 67.51 86.71 105.90 125.09 144.29 163.48 182.68
Total Counts: 591 18 7 3 2 4 0 0 1 3
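The quantization behind Table 1 amounts to an equal-width histogram of the foreground shape parameters. A sketch with toy $\sigma_i^2$ values (the table itself is computed from Fig. 3a's data, which we do not have here):

```python
import numpy as np

# Histogram foreground sigma_i^2 values into m = 10 equal-width bins and
# report the bin centers and per-bin counts, as in Table 1.
sigma2 = np.array([9.0, 10.0, 11.0, 30.0, 50.0, 180.0])   # toy values
counts, edges = np.histogram(sigma2, bins=10)
centers = (edges[:-1] + edges[1:]) / 2    # bin centers, one per bin
```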
Fig. 5. (color online) Visualization of the skeleton (shown as a blue line) extracted from SDs on another
image. (a) Input image: 64 × 83 pixels. (b) Foreground SDs ($\sigma_i^2 > 0$). (c) Skeleton for $\sigma_i^2 > 6.99$.
8. Conclusion
In this chapter, we introduced, as a proof of concept, how the computed SDs of the SDN
algorithm can be used to extract skeletons from shapes. Instead of using and processing
all the pixels to extract the skeleton of a given shape, we propose to use the SDs of the
shape. SDs are a subset of the training data (i.e., a subset of all the
pixels); thus using SDs can considerably reduce the cost of (re)computing skeletons at different
parameters. The SDs and their parameters are obtained by SDN after the training steps. The
RBF shape parameters of SDN define the size of the SDs, and they can be used to
model a shape as described in Section 5 and as visualized in our experiments. While the
presented skeleton extraction algorithm is a naive solution to demonstrate the use of SDs,
future work will focus on more elegant solutions for extracting the skeleton from
SDs. SDN is a novel classification algorithm and has potential in many shape analysis
applications besides skeleton extraction. The SDN architecture contains a single-hidden-
layer neural network and uses RBFs as the activation functions in the hidden layer; each
RBF has its own kernel parameter.
The optimization algorithm plays an important role in obtaining meaningful SDs with SDN
for skeleton extraction. We use a modified version of the Sequential Minimal Optimization
(SMO) algorithm [23] to train SDN. While we have not yet tested its performance with other
optimization techniques, we do not expect other standard batch or stochastic gradient-
based algorithms to yield the same results as our algorithm. Future work
will focus on the optimization part and perform a more detailed analysis from the
optimization perspective.
A shape can be modeled parametrically with SDNs via its similarity domains, where the
SDs are modeled with radial basis functions. A further reduction in parameters can be
obtained with the one-class classification approximation of SDN, as shown in Eq. 9. SDN
can parametrically model a given single shape without requiring or using large datasets.
Therefore, it can be used to learn and model a shape efficiently even when only one
image is available and no additional dataset or model can be provided.
Future work may include a better skeleton algorithm utilizing SDs. The current naive
technique relies on manual thresholding; a future technique may eliminate such manual
operations in extracting the skeleton.
Acknowledgement
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the
Quadro P6000 GPU used for this research. The author would like to thank Prof. Chi Hau
Chen for his valuable comments and feedback.
References
[1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document
recognition, Proceedings of the IEEE. 86(11), 2278–2324, (1998).
[2] S. Ozer, D. L. Langer, X. Liu, M. A. Haider, T. H. van der Kwast, A. J. Evans, Y. Yang, M. N.
Wernick, and I. S. Yetik, Supervised and unsupervised methods for prostate cancer segmenta-
tion with multispectral mri, Medical physics. 37(4), 1873–1883, (2010).
[3] L. Jiang, S. Chen, and X. Jiao, Parametric shape and topology optimization: A new level set
approach based on cardinal basis functions, International Journal for Numerical Methods in
Engineering. 114(1), 66–87, (2018).
[4] S.-H. Yoo, S.-K. Oh, and W. Pedrycz, Optimized face recognition algorithm using radial basis
function neural networks and its practical applications, Neural Networks. 69, 111–125, (2015).
[5] M. Botsch and L. Kobbelt. Real-time shape editing using radial basis functions. In Computer
graphics forum, vol. 24, pp. 611–621. Blackwell Publishing, Inc Oxford, UK and Boston, USA,
(2005).
[6] S. Ozer, M. A. Haider, D. L. Langer, T. H. van der Kwast, A. J. Evans, M. N. Wernick, J. Trachtenberg,
and I. S. Yetik. Prostate cancer localization with multispectral MRI based on relevance
vector machines. In Biomedical Imaging: From Nano to Macro, 2009. ISBI’09. IEEE International
Symposium on, pp. 73–76. IEEE, (2009).
[7] S. Ozer, On the classification performance of support vector machines using Chebyshev kernel
functions, Master’s Thesis, University of Massachusetts, Dartmouth. (2007).
[8] S. Ozer, C. H. Chen, and H. A. Cirpan, A set of new Chebyshev kernel functions for support
vector machine pattern classification, Pattern Recognition. 44(7), 1435–1447, (2011).
[9] R. P. Lippmann, Pattern classification using neural networks, IEEE communications magazine.
27(11), 47–50, (1989).
[10] L. Fu, M. Zhang, and H. Li, Sparse rbf networks with multi-kernels, Neural processing letters.
32(3), 235–247, (2010).
[11] F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the
SMO algorithm. In Proceedings of the twenty-first international conference on Machine learning,
p. 6. ACM, (2004).
[12] S. Ozer, Similarity domains machine for scale-invariant and sparse shape modeling, IEEE
Transactions on Image Processing. 28(2), 534–545, (2019).
[13] N. D. Cornea, D. Silver, and P. Min, Curve-skeleton properties, applications, and algorithms,
IEEE Transactions on Visualization & Computer Graphics. (3), 530–548, (2007).
[14] H. Sundar, D. Silver, N. Gagvani, and S. Dickinson. Skeleton based shape matching and re-
trieval. In 2003 Shape Modeling International., pp. 130–139. IEEE, (2003).
[15] P. K. Saha, G. Borgefors, and G. S. di Baja, A survey on skeletonization algorithms and their
applications, Pattern Recognition Letters. 76, 3–12, (2016).
1. Introduction
88 On Curvelet-based Texture Features for Pattern Classification
of the curvelet transform has gone through two generations. The first generation
attempted to extend the ridgelet transform in small blocks of smoothly
partitioned subband-filtered images to obtain piecewise line segments that
successively approximate a curve segment at each scale. It suffered from the
accuracy problem of using partitioned blocks of small size to compute the local
ridgelet transform. The second generation adopts a new formulation through
curvelet design in the frequency domain that results in an equivalent outcome,
attaining the same curve characteristics in the spatial domain. A fast digital
implementation of the discrete curvelet transform is also available for use in
applications.11,13 The curvelet transform has been applied to image de-noising,
estimation, contrast enhancement, image fusion, texture classification, inverse
problems and sparse sensing.2,14-17
There is an excellent paper by J. Ma and G. Plonka,18 which serves as a review and
tutorial on the curvelet transform for engineering and information scientists. The
subject has also been discussed in sections of S. Mallat’s book4 on signal
processing and of the book by J.-L. Starck et al.19 on sparse image and signal
processing. This chapter provides a concise description of the second-generation
curvelet transform and of feature extraction based on multi-scale curvelet
coefficients for image texture pattern classification.
C.-C. Li and W.-C. Lin. 89

The curvelet at scale $j$, orientation $\theta$ and position $k = (k_1, k_2)$ is defined in the spatial domain by

$$\varphi_{j,\theta,k}(x_1, x_2) = 2^{-3j/4}\,\varphi\!\left(R_\theta\left[2^{-j}(x_1 - k_1),\; 2^{-j/2}(x_2 - k_2)\right]\right)$$

and in the frequency domain by

$$\hat{\varphi}_{j,\theta,k}(\omega_1, \omega_2) = 2^{3j/4}\,\hat{\varphi}\!\left(R_\theta^{T}\left[2^{j}\omega_1,\; 2^{j/2}\omega_2\right]\right) e^{-i(\omega_1 k_1 + \omega_2 k_2)}, \tag{1}$$

where $R_\theta$ is the rotation matrix

$$R_\theta = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.$$

The set $\{\varphi_{j,\theta,k}(x_1, x_2)\}$ is a tight frame that can be used to give an optimal
representation of a function $f(x_1, x_2)$ by the linear combination of the curvelets
with coefficients $\{c_{j,\theta,k}\}$, which are the set of inner products

$$f = \sum_{j,\theta,k} c_{j,\theta,k}\, \varphi_{j,\theta,k}(x_1, x_2)$$

and

$$c_{j,\theta,k} = \langle f, \varphi_{j,\theta,k} \rangle = \frac{1}{(2\pi)^2} \langle \hat{f}, \hat{\varphi}_{j,\theta,k} \rangle. \tag{2}$$
In the discrete curvelet transform, let us consider the design of the curvelet $\varphi_j(x_1, x_2)$ through its Fourier transform $\hat{\varphi}_j(\omega_1, \omega_2)$ in the polar frequency domain $(r, \theta)$
via a pair of windows, a radial window $W(r)$ and an angular window $V(t)$, where
$r \in (1/2, 2)$ and $t \in [-1, 1]$. Note that $r$ is the normalized radial frequency
variable with normalization constant $\pi$, and the angular variable $\theta$ is normalized
by $2\pi$ to give the parameter $t$, which may vary around the normalized orientation
$\theta_l$ in the range $[-1, 1]$. Both $W(r)$ and $V(t)$ are smooth non-negative real-valued
functions subject to the admissibility conditions
$$\sum_{j=-\infty}^{\infty} W^2(2^j r) = 1, \qquad r \in \left(\frac{3}{4}, \frac{3}{2}\right); \tag{3}$$

$$\sum_{l=-\infty}^{\infty} V^2(t - l) = 1, \qquad t \in \left(-\frac{1}{2}, \frac{1}{2}\right). \tag{4}$$
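The admissibility conditions (3) and (4) can be verified numerically. The cosine windows below are one simple choice that satisfies them (Meyer-type windows are common in practice); they are illustrative only, not the windows used by any particular toolbox.

```python
import numpy as np

# Numerical check of the admissibility (partition-of-unity) conditions using
# simple cosine windows, one valid illustrative choice.
def W(r):
    # radial window supported on (1/2, 2)
    out = np.zeros_like(r)
    inside = (r > 0.5) & (r < 2.0)
    out[inside] = np.cos(0.5 * np.pi * np.log2(r[inside]))
    return out

def V(t):
    # angular window supported on (-1, 1)
    out = np.zeros_like(t)
    inside = np.abs(t) < 1.0
    out[inside] = np.cos(0.5 * np.pi * t[inside])
    return out

r = np.linspace(0.76, 1.49, 200)                       # r in (3/4, 3/2)
sum_W2 = sum(W((2.0 ** j) * r) ** 2 for j in range(-3, 4))
t = np.linspace(-0.49, 0.49, 200)                      # t in (-1/2, 1/2)
sum_V2 = sum(V(t - l) ** 2 for l in range(-3, 4))
print(np.allclose(sum_W2, 1.0), np.allclose(sum_V2, 1.0))
```

For each $r$ in the admissible interval only two dyadic dilations of $W$ overlap, and $\cos^2 + \sin^2 = 1$ makes their squares sum to one; the same mechanism applies to the integer translates of $V$.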
Let $U_j$ be defined by

$$U_j(r, \theta) = 2^{-3j/4}\, W(2^{-j} r)\, V\!\left(\frac{2^{\lfloor j/2 \rfloor}\,\theta}{2\pi}\right), \tag{5}$$

where $l$ indexes the normalized orientation $\theta_l$ at scale $j$ ($j \geq 0$). With the symmetry property of the
Fourier transform, the range of $\theta$ is now $(-\pi/2, \pi/2)$ and thus the resolution unit
can be reduced to half size.
Let $U_j$ be the polar wedge defined with the support of $W$ and $V$:

$$U_j(r, \theta) = 2^{-3j/4}\, W(2^{-j} r)\, V\!\left(\frac{2^{\lfloor j/2 \rfloor}\,\theta}{2\pi}\right), \qquad \theta \in (-\pi/2, \pi/2), \tag{6}$$

where $\lfloor j/2 \rfloor$ denotes the integer part of $j/2$. This is illustrated by the shaded
sector in Figure 2. In the frequency domain, the scaled curvelet at scale $j$ without
shift can be chosen with the polar wedge $U_j$, given by
$$\hat{\varphi}_j(\omega_1, \omega_2) = U_j(r, \theta),$$

$$\hat{\varphi}_{j,l,k}(\omega_1, \omega_2) = U_j(r, \theta - \theta_l)\, e^{-i(\omega_1 k_1 + \omega_2 k_2)}, \tag{7}$$

where $\theta_l = 2\pi \cdot l \cdot 2^{-\lfloor j/2 \rfloor}$, with $l = 0, 1, 2, \ldots$ such that $0 \leq \theta_l < \pi$. Then,
through Plancherel's theorem, the curvelet coefficients can be obtained by using
the inner product in the frequency domain:

$$c(j, l, k) := \frac{1}{(2\pi)^2} \int \hat{f}(\omega)\, \overline{\hat{\varphi}_{j,l,k}(\omega)}\, d\omega
= \frac{1}{(2\pi)^2} \int \hat{f}(\omega_1, \omega_2)\, U_j(r, \theta - \theta_l)\, e^{i(k_1 \omega_1 + k_2 \omega_2)}\, d\omega_1\, d\omega_2. \tag{8}$$
Fig. 1. A narrow rectangular support for a curvelet in the spatial domain is shown on the right; its
width and length have two different scales according to the parabolic scaling rule. Also shown are its
shift and rotation in the spatial domain. On the left, a 2-dimensional frequency plane in polar
coordinates is shown with radial windows in a circular corona supporting curvelets at different
orientations; the shaded sector illustrates a radial wedge with parabolic scaling.
information, will wrap its information into the rectangular wedge so that it contains the
same frequency information as in the parallelepiped and, thus, in the original
wedge, allowing the inner product with the given image to be computed and yield the same
result. Although the wrapped wedge appears to contain broken pieces
of the data, it is actually just a re-indexing of the components of the original
data. In this way, the inner product can be computed for each wedge and
immediately followed by the inverse FFT to obtain the contribution to the
curvelet coefficients from that original wedge with the trapezoidal support.
Pooling the contributions from all the wedges gives the final curvelet coefficients
at that scale. Software for both algorithms is freely available from Candès'
laboratory;13 we have used the second algorithm in our study of curvelet-based
texture pattern classification of prostate cancer tissue images.
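The re-indexing character of the wrapping step can be seen in a toy sketch: samples supported on a sheared band are mapped into a small rectangle by index arithmetic modulo the rectangle size, and no sample value is altered. The sizes and the band geometry below are arbitrary toy choices, not the actual CurveLab wedge geometry.

```python
import numpy as np

# Toy illustration of wrapping as pure re-indexing: because no sample value is
# changed, quantities computed on the wrapped rectangle (here, the energy)
# equal those computed on the original sheared support.
def wrap_to_rectangle(data, rows, cols):
    wrapped = np.zeros((rows, cols), dtype=data.dtype)
    ys, xs = np.nonzero(data)
    for y, x in zip(ys, xs):
        wrapped[y % rows, x % cols] += data[y, x]
    return wrapped

rng = np.random.default_rng(0)
plane = np.zeros((64, 64))
# a "wedge": a thin sheared band of nonzero samples in the frequency plane
for y in range(8):
    plane[y, 2 * y: 2 * y + 6] = rng.standard_normal(6)
wedge_sum_sq = (plane ** 2).sum()
wrapped = wrap_to_rectangle(plane, 8, 12)
print(np.isclose((wrapped ** 2).sum(), wedge_sum_sq))
```

In the real algorithm the curvelet window is periodized in the same way, which is why the inner products taken over the wrapped rectangle match those over the original trapezoidal wedge.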
Fig. 2. The digital coronae in the frequency plane with pseudo-polar coordinates; the trapezoidal
wedges shown also satisfy the parabolic scaling rule.
Fig. 3. A schematic diagram to illustrate the concept of the wrapping algorithm for computing
digital curvelet coefficients at a given scale. A shaded trapezoidal wedge in a digital corona in the
frequency plane is sheared into a parallelepiped wedge, and then mapped into a rectangular
wedge by a wrapping process so that it has identical frequency information content by
virtue of the periodization. Although the wrapped wedge appears to contain broken pieces of
the data, it is actually just a re-indexing of the components of the original data.
Fig. 4. Fuzzy yeast cells on a dark background. The original image27 has very smooth intensity
variation. Scales 2–4 illustrate the integrated curvelets extracted from the original image.
Fig. 5. NASA satellite view of cloud pattern over the South Atlantic Ocean (NASA courtesy/Jeff
Schmaltz).
Fig. 6. Iris image. The scale 2–5 curvelet coefficient patterns illustrate different texture
distributions of the original image.28
Fig. 7. TMA prostate images of Gleason grades P3S3, P3S4, P4S3 and P4S4, along with their
respective curvelet patterns at scales 2–5, demonstrate the transitions from the benign class through the
critical intermediate classes to the carcinoma class.
The value of the curvelet coefficient cj,l,k at position (k1, k2) and scale j denotes the
strength of the curvelet component oriented at angle θl in the representation of an
image function f(x1, x2). It contains information on edginess coordinated
along a short path of connected pixels in that orientation. Intuitively, it would be
advantageous to extract texture features in the curvelet coefficient space. One
may use standard statistical measures, such as entropy, energy, mean,
standard deviation, and 3rd and 4th order moments of an estimated marginal
distribution (histogram) of curvelet coefficients as texture features,20 and also of
the co-occurrence of curvelet coefficients. The correlation of coefficients across
orientations and across scales may also be utilized. These may provide more
discriminative power in texture classification than features extracted by
classical approaches.
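The statistical measures named above can be sketched as follows; a random array stands in for real curvelet coefficients, which would come from a curvelet transform toolbox, and the function name is hypothetical.

```python
import numpy as np

# Sketch of the statistical texture measures computed from the coefficients of
# one scale: entropy of the estimated marginal distribution (histogram),
# energy, mean, standard deviation, and the 3rd and 4th order moments.
def texture_features(coeffs, bins=64):
    c = np.abs(coeffs).ravel()
    hist, _ = np.histogram(c, bins=bins)
    p = hist / hist.sum()                         # estimated marginal distribution
    p_nz = p[p > 0]
    entropy = -(p_nz * np.log2(p_nz)).sum()       # histogram entropy
    energy = (c ** 2).mean()
    mean, std = c.mean(), c.std()
    z = (c - mean) / std
    skew, kurt = (z ** 3).mean(), (z ** 4).mean() # 3rd and 4th order moments
    return {"entropy": entropy, "energy": energy, "mean": mean,
            "std": std, "skewness": skew, "kurtosis": kurt}

coeffs = np.random.default_rng(1).standard_normal((128, 128))
feats = texture_features(coeffs)
print(sorted(feats))
```

Co-occurrence and cross-scale correlation features would be computed analogously, on pairs of coefficients rather than on the marginal histogram.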
Dettori, Semler and Zayed7-8 studied the use of curvelet-based texture features
for recognizing the normality of organ sections in CT images and reported a
significant improvement in recognition accuracy compared with results
obtained using wavelet-based and ridgelet-based features. Arivazhagan,
Ganesan and Kumar studied texture classification on a set of natural images
from the VisTex dataset33-34 using curvelet-based statistical and co-occurrence
features, and they also obtained superior classification results. Alecu, Munteanu, et
al.35 conducted an information-theoretic analysis of the correlations of curvelet
coefficients within a scale, between orientations, and across scales; they showed that
the generalized Gaussian density function gives a better fit for the marginal
probability density functions of curvelet coefficients. Because of the sparse
representation by curvelet coefficients, there are fewer significant
coefficients, and the histogram at a given scale generally appears more peaked
with a long tail. Following that notion, Gomez and Romero developed
a new set of curvelet-based texture descriptors under the generalized Gaussian
model for the marginal density functions and demonstrated
their success in a classification experiment using a set of natural images from the
KTH-TIPS dataset.32 Murtagh and Starck also adopted the generalized
Gaussian model for the histograms of curvelet coefficients at each scale and selected
the second, third and fourth order moments as statistical texture
features for classifying and grading aggregate mixtures, with superb
experimental results.19-20 Rotation-invariant curvelet features were used in
studies of region-based image retrieval by Zhang, Islam and Sumana, by
Cavusoglu, and by Zand, Doraisamy, Halin and Mustaffa, all giving superior
results in their comparative studies.29-31
One SVM machine at the next level differentiates P3S3 from P3S4, and the
other classifies P4S3 versus P4S4.

A central area of 768 × 768 pixels of the tissue image was taken, which should
be sufficient to cover the biometrics of the prostate cells and the glandular
structure characteristics. Sampled by the aforementioned approach, each patch
of 256 × 256 pixels then undergoes the fast discrete curvelet transform, using the
Curvelab Toolbox software,13 to generate curvelet coefficients cj,l,k at 4
scales. The prostate cellular and ductile structures contained therein are
represented by the texture characteristics of the image region where the curvelet-based
analysis is performed. As shown in Fig. 7, four patches were taken from the four
prostate patterns, including the two critical in-between grades P3S4 and P4S3. The
curvelet coefficients at each scale are displayed in the lower part to illustrate the
edge information integrated over all orientations.
The "scale number" used here for curvelet coefficients corresponds to the
subband index number considered in the discrete frequency domain. For a 256 ×
256 image, the scale 5 refers to the highest frequency subband, that is subband 5,
and scales 4, 3 and 2 refer to the successively lower frequency subbands.
Their statistical measures, including mean μj, variance σj2, entropy ej, and
energy Ej of the curvelet coefficients at each scale j, are computed for each patch as
textural features. Nine features were selected to form a 9-dimensional
feature vector for use in pattern classification: entropy at scales
3–4, energy at scales 2–4, mean at scales 2–3 and variance at scales 2–3.
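Assembling the 9-dimensional feature vector described above might look like the following sketch; the per-scale coefficient arrays are random stand-ins for real curvelet coefficients, and the helper name is hypothetical.

```python
import numpy as np

# Random stand-ins for the curvelet coefficient arrays at scales 2-5.
rng = np.random.default_rng(0)
coeffs = {j: rng.standard_normal((2 ** j, 2 ** j)) for j in (2, 3, 4, 5)}

def entropy(c, bins=64):
    # entropy of the histogram of coefficient magnitudes at one scale
    hist, _ = np.histogram(np.abs(c), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

feature_vector = np.array(
    [entropy(coeffs[j]) for j in (3, 4)] +                 # entropy, scales 3-4
    [float((coeffs[j] ** 2).mean()) for j in (2, 3, 4)] +  # energy, scales 2-4
    [float(coeffs[j].mean()) for j in (2, 3)] +            # mean, scales 2-3
    [float(coeffs[j].var()) for j in (2, 3)]               # variance, scales 2-3
)
```

The resulting 2 + 3 + 2 + 2 = 9 components form the feature vector fed to the SVM classifiers.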
All three kernel SVMs were successfully trained, together with the
majority decision rule, giving a trained tree classifier for the 4 classes of critical
Gleason scoring with no training error. Leave-one-image-out cross-validation
was applied to assess the trained classifier. The 10% jackknife cross-validation
tests were carried out for 100 realizations for all three SVMs, and the
statistical results are listed in Table 1, with above 95.7% accuracy for the individual
machines and an overall accuracy of 93.68% for the 4 classes. The trained classifier
was tested with 96 images (24 images per class). The result given in Table 2
shows remarkable testing accuracy in classifying tissue images of the four critical
Gleason scores (GS) 3+3, 3+4, 4+3 and 4+4, as compared to the published
results that we are aware of. The lowest correct classification rate (87.5%)
was obtained for the intermediate grade P4S3, which lies between
P3S4 and P4S4, where subtle textural characteristics are difficult to
differentiate.
Grade Accuracy
P3S3 95.83%
P3S4 91.67%
P4S3 87.50%
P4S4 91.67%
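The two-level SVM tree can be sketched as below, assuming scikit-learn for the kernel SVMs and synthetic 9-dimensional feature vectors in place of the real curvelet features; the data layout and all names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the tree classifier: SVM1 separates the (P3S3, P3S4) group from
# (P4S3, P4S4); SVM2 then separates P3S3 vs P3S4 and SVM3 separates
# P4S3 vs P4S4. Synthetic, well-separated 9-D features stand in for the
# curvelet features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 9)) + np.repeat(np.arange(4)[:, None], 50, 0)
y = np.repeat([0, 1, 2, 3], 50)                     # P3S3, P3S4, P4S3, P4S4

svm1 = SVC(kernel="rbf").fit(X, y >= 2)             # level 1: P3 group vs P4 group
svm2 = SVC(kernel="rbf").fit(X[y < 2], y[y < 2])    # level 2: P3S3 vs P3S4
svm3 = SVC(kernel="rbf").fit(X[y >= 2], y[y >= 2])  # level 2: P4S3 vs P4S4

def classify(x):
    x = x.reshape(1, -1)
    if svm1.predict(x)[0]:
        return int(svm3.predict(x)[0])
    return int(svm2.predict(x)[0])

acc = np.mean([classify(x) == t for x, t in zip(X, y)])
```

The majority decision rule described in the chapter would sit on top of this tree, aggregating patch-level decisions into an image-level label.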
We have discussed the exciting development over the past decade in the application
of the curvelet transform to texture feature extraction in several biomedical imaging,
material grading and document retrieval problems.9,37-43 One may consider the
curvelet as a sophisticated “texton”: the image texture is characterized in terms of its
dynamic and geometric distributions at multiple scales. With the implication of
sparse representation, this leads to efficient and effective multiresolution texture
descriptors capable of providing enhanced pattern classification performance.
Much more work needs to be done to explore its full potential in various
applications. The closely related wave atoms44 representation of oscillatory
patterns may guide new joint development of image texture characterization in
different fields of practical application.
Appendix
sign of those negative coefficients and merge them into the pool of the positive
curvelet coefficients of significant magnitude, we obtain a single mode
distribution of the curvelet magnitude which is nearly rotation invariant. Thus,
for a given scale, at each location, the maximum curvelet coefficient is defined
by
Patches in images of GS 3+4 and GS 4+3 may have textures that are a mixture of grade
G3 and grade G4, rather than purely G3 or G4; thus a refinement of the feature
vector used by SVM 2 at the second level was obtained by adding some fine-scale
feature components and re-ranking the features to enhance the
differentiation between GS 3+4 and GS 4+3, which resulted in the selection of a set
of 10 feature components as given below.
Label      Test      Indecision
GS 3+4     31        1
GS 4+3     31        1
Average    96.88%    3.12%
References
18. Jianwei Ma and Gerlind Plonka. "The curvelet transform." Signal Processing Magazine,
IEEE 27.2, pp.118-133 (2010).
19. J.-L. Starck, F. Murtagh and J. M. Fadili, “Sparse Image and Signal Processing”, Cambridge
University Press, Chap. 5, (2010).
20. F. Murtagh and J.-L. Starck, “Wavelet and Curvelet Moments for Image Classification:
Application to Aggregate Mixture Grading,” Pattern Recognition Letters, vol. 29, pp. 1557-
1564, (2008).
21. C. Mosquera-Lopez, S. Agaian, A. Velez-Hoyos and I. Thompson, “Computer-aided
Prostate Cancer Diagnosis from Digitized Histopathology: A Review on Texture-based
Systems,” IEEE Review in Biomedical Engineering, v.8, pp. 98-113, (2015).
22. D. F. Gleason, and G. T. Mellinger, “The Veterans Administration Cooperative Urological
Research Group: Prediction of Prognosis for Prostatic Adenocarcinoma by Combined
Histological Grading and Clinical Staging,” J Urol, v.111, pp. 58-64, (1974).
23. Luthringer, D. J., and Gross, M.,"Gleason Grade Migration: Changes in Prostate Cancer
Grade in the Contemporary Era," PCRI Insights, vol. 9, pp. 2-3, (August 2006).
24. J.I. Epstein, "An update of the Gleason grading system," J Urology, v. 183, pp. 433-440,
(2010).
25. Pierorazio PM, Walsh PC, Partin AW, and Epstein JI., “Prognostic Gleason grade grouping:
data based on the modified Gleason scoring system,” BJU International, (2013).
26. D. F. Gleason, and G. T. Mellinger, “The Veterans Administration Cooperative Urological
Research Group: Prediction of Prognosis for Prostatic Adenocarcinoma by Combined
Histological Grading and Clinical Staging,” J Urol, v.111, pp. 58-64, (1974).
27. Gonzalez, Rafael C., and Richard E. Woods. "Digital image processing 3rd edition." (2007).
28. John Daugman, University of Cambridge, Computer Laboratory. [Online]
http://www.cl.cam.ac.uk/~jgd/sampleiris.jpg.
29. B. Cavusoglu, “Multiscale Texture Retrieval based on Low-dimensional and Rotation-
invariant Features of Curvelet Transform,” EURASIP Journal on Image and Video
Processing, paper 2014:22, (2014).
30. D. Zhang, M. M. Islam, G. Lu and I. J. Sumana, “Rotation Invariant Curvelet Features for
Region Based Image Retrieval,” Intern. J. Computer Vision, vol. 98, pp. 187-201, (2012).
31. Zand, Mohsen, et al. "Texture classification and discrimination for region-based image
retrieval." Journal of Visual Communication and Image Representation 26, pp. 305-316.
(2015).
32. F. Gomez and E. Romero, “Rotation Invariant Texture Classification using a Curvelet based
Descriptor,” Pattern Recognition Letter, vol. 32, pp. 2178-2186, (2011).
33. S. Arivazhagan and T. G. S. Kumar, “Texture Classification using Curvelet Transform,”
Intern. J. Wavelets, Multiresolution & Inform Processing, vol. 5, pp. 451-464, (2007).
34. S. Arivazhagan, L. Ganesan and T. G. S. Kumar, “Texture Classification using Curvelet
Statistical and Co-occurrence Features,” Proc. IEEE ICPR’06, pp. 938-941, (2006).
35. A. Alecu, A. Munteanu, A. Pizurica, W. Philips, J. Cornelis and P. Schelkens, “Information-
Theoretic Analysis of Dependencies between Curvelet Coefficients,” Proc. IEEE ICIP, pp.
1617-1620, (2006).
36. J. Daugman, “How Iris Recognition Works,” IEEE Trans. On Circuits & Systems for Video
Technology, vo. 14, pp. 121-130, (2004).
37. L. Shen and Q. Yin, “Texture Classification using Curvelet Transform,” Proc. ISIP’09,
China, pp. 319-324, (2009).
38. H. Chang and C. C. J. Kuo, “Texture analysis and Classification with Tree-structured
Wavelet Transform,” IEEE Trans. Image Proc., vol. 2, pp. 429-444, (1993).
39. M. Unser and M. Eden, “Multiresolution Texture Extraction and Selection for Texture
Segmentation,” IEEE Trans. PAMI, vol. 11, pp. 717-728, (1989).
40. M. Unser, “Texture Classification and Segmentation using Wavelet Frames,” IEEE Trans.
IP, vol. 4, pp. 1549-1560, (1995).
41. A. Laine and J. Fan, “Texture Classification by Wavelet Packet Signatures,” IEEE Trans.
PAMI, vol. 15, pp. 1186-1191, (1993).
42. A. Laine and J. Fan, “Frame Representation for Texture Segmentation,” IEEE Trans. IP, vol. 5,
pp. 771-780, (1996).
43. B. Nielsen, F. Albregtsen and H. E. Danielsen, “Statistical Nuclear Texture Analysis in Cancer Research:
A Review of Methods and Applications,” Critical Reviews in Oncogenesis, vol. 14, pp. 89-
164, (2008).
44. L. Demanet and L. Ying, “Wave Atoms and Sparsity of Oscillatory Patterns,” Appl.
Comput. Harmon. Anal., vol. 23, pp. 368-387, (2007).
45. Lin, Wen-Chyi, Ching-Chung Li, Jonathan I. Epstein, and Robert W. Veltri. "Curvelet-
based texture classification of critical Gleason patterns of prostate histological images." In
Computational Advances in Bio and Medical Sciences (ICCABS), 2016 IEEE 6th
International Conference on, pp. 1-6, (2016).
46. Lin, Wen-Chyi, Ching-Chung Li, Jonathan I. Epstein, and Robert W. Veltri. "Advance on
curvelet application to prostate cancer tissue image classification." In 2017 IEEE 7th
International Conference on Computational Advances in Bio and Medical Sciences
(ICCABS), pp. 1-6, (2017).
47. C. Mosquera-Lopez, S. Agaian and A. Velez-Hoyos. "The development of a multi-stage
learning scheme using new descriptors for automatic grading of prostatic carcinoma." Proc.
IEEE ICASSP, pp. 3586-3590, (2014).
48. D. Fehr, H. Veeraraghavan, A. Wibmer, T. Gondo, K. Matsumoto, H. A. Vargas, E. Sala,
H. Hricak, and J. O. Deasy. "Automatic classification of prostate cancer Gleason scores
from multiparametric magnetic resonance images," Proceedings of the National Academy
of Sciences 112, no. 46, pp.6265-6273, (2015).
CHAPTER 1.6
AN OVERVIEW OF EFFICIENT DEEP LEARNING
ON EMBEDDED SYSTEMS
Xianju Wang
Bedford, MA, USA
wxjzyw@gmail.com
Deep neural networks (DNNs) have exploded in popularity in the past few years, particularly in the
areas of visual recognition and natural language processing. At this point, they have
exceeded human levels of accuracy and have set new benchmarks in several tasks.
However, the complexity of the computations requires specific thought to the network
design, especially when the applications need to run on low-latency, energy-efficient
embedded devices.
In this chapter, we provide a high-level overview of DNNs along with specific
architectural constructs of DNNs, like convolutional neural networks (CNNs), which are
better suited for image recognition tasks. We detail the design choices that deep-learning
practitioners can use to get DNNs running efficiently on embedded systems. We
introduce the chips most commonly used for this purpose, namely microprocessors, digital
signal processors (DSPs), embedded graphics processing units (GPUs), field-programmable
gate arrays (FPGAs) and application-specific integrated circuits (ASICs), and the specific
considerations to keep in mind for their usage. Additionally, we detail some
computational methods to gain more efficiency, such as quantization, pruning, network
structure optimization (AutoML), and the Winograd and fast Fourier transform (FFT) algorithms, which can
further optimize ML networks after the choice of network and hardware has been made.
1. Introduction
Deep learning has evolved into the state-of-the-art technique for artificial
intelligence (AI) tasks since early 2010. Since the breakthrough application of
deep learning for image recognition and natural language processing (NLP), the
number of applications that use deep learning has increased significantly.
In many applications, deep neural networks (DNNs) are now able to surpass
human levels of accuracy. However, this superior accuracy comes at
the cost of high computational complexity. DNNs are both computationally
intensive and memory intensive, making them difficult to deploy and run on
embedded devices with limited hardware resources [1,2,3].
The DNN workflow includes two stages, training and
inference, which have different computational needs. Training is the stage in
which the network tries to learn from the data, while inference is the phase in
which a trained model is used to make predictions on real samples.
Network training often requires a large dataset and significant computational
resources. In many cases, training a DNN model still takes several hours to days
to complete and thus is typically executed in the cloud. For the inference, it is
desirable to have its processing near the sensor and on the embedded systems to
reduce latency and improve privacy and security. In many applications, inference
requires high speed and low power consumption. Thus, implementing deep
learning on embedded systems becomes more critical and difficult. In this
chapter, we are going to focus on reviewing different methods to implement deep
learning inference on embedded systems.
At the highest level, one can think of a DNN as a series of smooth geometric
transformations from an input space to a desired output space. The “series” refers to
the layers that are stacked one after another, with the output of each layer
fed as input to the next. The input space could contain mathematical
representations of images, language or any other feature set, while the output is
the desired “answer”, which is fed to the network during the training phase and
predicted during inference. The geometric transformations can take several
forms, and the choice often depends on the nature of the problem to be
solved. In the world of image and video processing, the most common form is
the convolutional neural network (CNN). CNNs have contributed specifically to
the rapid growth in computer vision due to the nature of their connections and the
computational efficiency they offer compared to other types of networks.
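The layered view above can be sketched in a few lines; the layer shapes and the ReLU nonlinearity below are arbitrary illustrative choices, not a prescription from the chapter.

```python
import numpy as np

# A minimal sketch of the "series of geometric transformations" view: each
# layer is an affine map followed by a nonlinearity, with one layer's output
# fed as the next layer's input.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((16, 8)), np.zeros(8)),   # (W1, b1)
          (rng.standard_normal((8, 4)), np.zeros(4))]    # (W2, b2)

def forward(x, layers):
    # each layer: affine transformation followed by a ReLU nonlinearity
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x

out = forward(rng.standard_normal((2, 16)), layers)
print(out.shape)   # (2, 4)
```

A CNN replaces the dense matrices here with convolution kernels, which is what gives it the parameter sharing and computational efficiency noted above.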
3.1. Microprocessors
For many years microprocessors have been applied as the only efficient way to
implement embedded systems. Advanced RISC machine (ARM) processors use
reduced instruction sets and require fewer transistors than those with a complex
instruction set computing (CISC) architecture (such as the x86 processors used in
most personal computers), which reduces the size and complexity, while
lowering the power consumption. ARM processors have been extensively used
in consumer electronic devices such as smart phones, tablets, multimedia players
and other mobile devices.
Microprocessors are extremely flexible in terms of programmability, and all
workloads can be run reasonably well on them. While ARM is quite powerful, it
is not a good choice for massive data-parallel computations and is used mainly for
low-speed or low-cost applications. Recently, Arm Holdings developed Arm NN,
an open-source neural-network machine learning (ML) software framework, and NXP
Semiconductors released the eIQ™ machine learning software development
environment. Both of these include inference engines, neural network compilers
and optimized libraries, which help users develop and deploy machine learning
and deep learning systems with ease.
3.2. DSPs
DSPs are well known for their high computation performance, low-power
consumption and relatively small size. DSPs have highly parallel architectures
with multiple functional units, VLIW/SIMD features and pipeline capability,
which allow complex arithmetic operations to be performed efficiently.
Compared to microprocessors, one of the primary advantages of DSPs is their
capability to handle multiple instructions at the same time [5] without
significantly increasing the size of the hardware logic. DSPs are suitable for
accelerating computationally intensive tasks on embedded devices and have been
used in many real time signal and image processing systems.
3.3. GPUs

Graphics processing units (GPUs) are currently the most widely used hardware option for machine and deep
learning. GPUs are designed for high data parallelism and memory bandwidth
(i.e. they can transport more “stuff” from memory to the compute cores). A typical
NVIDIA GPU has thousands of cores, allowing fast execution of the same
operation across many cores, and GPUs are widely used in network
training. Although extremely capable, GPUs have had trouble
gaining traction in the embedded space given the power, size and cost constraints
often found in embedded applications.
3.4. FPGAs
FPGAs were developed for digital embedded systems, based on the idea of using
arrays of reconfigurable complex logic blocks (LBs) with a network of
programmable interconnects surrounded by a perimeter of I/O blocks (IOBs).
FPGAs allow the design of custom circuits that implement specific
time-consuming computations in hardware. The benefit of an FPGA is its great flexibility in
logic, providing extreme parallelism in data flow and in processing vision
applications, especially at the low and intermediate levels, where they can
exploit the parallelism inherent in images. For example, 640 parallel
accumulation buffers and ALUs can be created, summing an entire 640x480
image in just 480 clock cycles [6, 7]. In many cases, FPGAs have the potential to
exceed the performance of a single DSP or multiple DSPs. However, a big
disadvantage is their relatively poor power efficiency.
FPGAs have more recently become a target platform for machine learning
researchers, and big companies like Microsoft and Baidu have invested heavily in
them. FPGAs can offer much higher performance/watt than
GPUs because, even though they cannot compete on pure performance, they use
much less power. Generally, an FPGA is about an order of magnitude less efficient
than an ASIC. However, modern FPGAs contain hardware resources, such as DSP blocks
for arithmetic operations and on-chip memories located next to the DSP blocks, which
increase flexibility and reduce the efficiency gap between FPGAs and ASICs
[2].
3.5. ASICs
As discussed in Section 3, the superior accuracy of DNNs comes at the cost of
high computational complexity, and DNNs are both computationally and
memory intensive, making them difficult to deploy and run on embedded devices
with limited hardware resources. In the past few years, several methods have been
proposed to implement efficient inference.
4.1.1. Quantization
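The quantization idea can be sketched as a generic post-training linear quantization, mapping float32 weights to int8 with a per-tensor scale; this is an illustrative sketch, not the code of any specific toolchain.

```python
import numpy as np

# Sketch of post-training linear quantization: float32 weights are mapped to
# int8 with a per-tensor scale, shrinking weight storage by 4x; inference then
# dequantizes (or, on suitable hardware, uses integer arithmetic directly).
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(err <= s / 2 + 1e-6)   # max error bounded by half a quantization step
```

Per-channel scales and quantization-aware training are common refinements that recover most of the accuracy lost to this rounding.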
Network pruning has been widely used to compress CNN and recurrent neural
network (RNN) models. Neural network pruning is an old concept, dating back to
1990 [10]. The main idea is that, among many parameters in the network, most
are redundant and do not contribute much to the output. It has been proven to be
an effective way to reduce the network complexity and over-fitting [11,12,13].
As shown in Figure 1, before pruning, every neuron in each layer has a
connection to the following layer and there are a lot of multiplications to execute.
After pruning, the network becomes sparse and only connects each neuron to a
few others, which saves a lot of multiplication computations.
As shown in Figure 2, pruning usually includes a three-step process: training
connectivity, pruning connections and retraining the remaining weights. It starts
by learning the connectivity via normal network training. Next, it prunes the
small-weight connections: all connections with weights below a threshold are
removed from the network. Finally, it retrains the network to learn the final
weights for the remaining sparse connections. This is the most straight-forward
method of pruning and is called one-shot pruning. Song Han et. al show that this
is surprisingly effective and usually can reduce 2x the connections without losing
accuracy. They also noticed that after pruning followed by retraining, they can
achieve much better results with higher sparsity at no accuracy loss. They called
this iterative pruning. We can think of iterative pruning as repeatedly learning
which weights are important, removing the least important weights, and then
retraining the model to let it "recover" from the pruning by adjusting the
remaining weights. The number of parameters was reduced by 9x and 12x for
AlexNet and VGG-16 model respectively [2,10].
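The one-shot magnitude criterion described above can be sketched in a few lines. This is an illustrative numpy sketch, not code from the cited works; the function name and return convention are our own:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """One-shot magnitude pruning: zero out the `sparsity` fraction of
    weights with the smallest absolute values, returning the pruned
    weights and the boolean mask of surviving connections."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)          # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold     # keep only the larger weights
    return weights * mask, mask
```

Iterative pruning would call this with a gradually increasing sparsity target, retraining the surviving weights between calls.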
Fig. 2. The three-step pruning pipeline: train connectivity, prune connections, retrain the remaining weights.
Pruning makes network weights sparse. While it reduces model size and
computation, it also reduces the regularity of the computations. This makes it
more difficult to parallelize on most embedded systems. To avoid the need for
custom hardware like FPGAs, structured pruning was developed; it involves
pruning groups of weights, such as kernels, filters, and even entire
feature-maps. The resulting weights can better align with the data-parallel
architectures of general-purpose processors such as CPUs and GPUs.
X. Wang 115
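The filter-level variant of structured pruning can be sketched as follows, assuming a convolution weight tensor in (out, in, kh, kw) layout; the L1 ranking criterion and names follow common practice rather than a specific implementation from the text:

```python
import numpy as np

def prune_filters(conv_w, keep_ratio):
    """Structured pruning sketch: drop entire output filters of a conv
    layer, ranked by L1 norm.  conv_w has shape
    (out_channels, in_channels, kh, kw); returns the smaller dense
    tensor and the indices of the filters kept."""
    norms = np.abs(conv_w).sum(axis=(1, 2, 3))      # L1 norm per filter
    n_keep = max(1, int(round(keep_ratio * conv_w.shape[0])))
    keep = np.sort(np.argsort(norms)[-n_keep:])     # strongest filters, original order
    return conv_w[keep], keep
```

Because whole filters are removed, the result stays a dense, smaller tensor that standard data-parallel hardware can execute without sparse indexing.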
5. Summary
While DNNs have seen vast growth in the past few years and have surpassed the
levels of accuracy that humans can achieve in many tasks, the computational
complexity and demands make them sub-optimal for efficient embedded
processing. Consequently, techniques that enable efficiency in processing and
throughput are critical to expanding DNNs to serve within embedded platforms.
In this chapter, we reviewed some of the methods that can be used to improve
energy efficiency without sacrificing accuracy within cost-effective hardware,
such as quantization, pruning, network structure optimization (AutoML),
Winograd and FFT. These methods are useful for increasing and diversifying the
capabilities of DNNs, which makes DNNs more accessible to end-users. We
expect that research and commercial applications in this area will continue to
grow over the next few years.
References
1. Song Han, Huizi Mao, William J. Dally, Deep Compression: Compressing deep neural
networks with pruning, trained quantization and Huffman coding, ICLR, San Juan, Puerto
Rico, October 2016.
2. Vivienne Sze, Tien-Ju Yang, Yu-Hsin Chen, Joe Emer, Efficient Processing of Deep Neural
Networks: A Tutorial and Survey, Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329,
December 2017.
3. Song Han, Efficient Methods and Hardware for Deep Learning, Ph.D. thesis, Stanford
University, USA (2017).
4. Kamel Abdelouahab, Maxime Pelcat, Francois Berry, Jocelyn Serot, Accelerating CNN
inference on FPGAs: Survey, 2018, https://hal.archives-ouvertes.fr/hal-01695375/document
5. http://www.ti.com/lit/an/sprabf2/sprabf2.pdf
CHAPTER 1.7
Many classification problems are naturally multi-view in the sense that their data are
described through multiple heterogeneous descriptions. For such tasks, dissimilar-
ity strategies are effective ways to make the different descriptions comparable and
to easily merge them, by (i) building intermediate dissimilarity representations
for each view and (ii) fusing these representations by averaging the dissimilari-
ties over the views. In this work, we show that the Random Forest proximity
measure can be used to build the dissimilarity representations, since this mea-
sure reflects similarities between features but also class membership. We then
propose a Dynamic View Selection method to better combine the view-specific
dissimilarity representations. This allows a decision to be taken, for each instance
to predict, using only the most relevant views for that instance. Experiments are
conducted on several real-world multi-view datasets, and show that the Dynamic
View Selection offers a significant improvement in performance compared to the
simple average combination and two state-of-the-art static view combinations.
1. Introduction
In many real-world pattern recognition problems, the available data are complex
in that they cannot be described by a single numerical representation. This may
be due to multiple sources of information, as for autonomous vehicles for example,
where multiple sensors are jointly used to recognize the environment.1 It may also
be due to the use of several feature extractors, such as in image recognition tasks,
often based on multiple families of features, such as color, shape, texture descriptors,
etc.2
Learning from these types of data is called multi-view learning and each modal-
ity/set of features is called a view. For this type of task, it is assumed that the views
convey different types of information, each of which can contribute to the pattern
recognition task. Therefore, the challenge is generally to carry out the learning task
taking into account the complementarity of the views. However, the difficulty with
119
March 12, 2020 10:55 ws-rv961x669 HBPRCV-6th Edn.–11573 bernard˙HBPRCV page 120
this is that these views can be very different from each other in terms of dimension,
nature and meaning, and therefore very difficult to compare or merge. In a recent
work,2 we proposed to use dissimilarity strategies to overcome this issue. The idea
is to use a dissimilarity measure to build intermediate representations from each
view separately, and to merge them afterward. By describing the instances with
their dissimilarities to other instances, the merging step becomes straightforward
since the intermediate dissimilarity representations are fully comparable from one
view to another.
For using dissimilarities in multi-view learning, two questions must be addressed:
(i) how to measure and exploit the dissimilarity between instances for building the
intermediate representation? and (ii) how to combine the view-specific dissimilarity
representations for the final prediction?
In our preliminary work,2 the first question has been addressed with Random
Forest (RF) classifiers. RF are known to be versatile and accurate classifiers3,4
but they are also known to embed a (dis)similarity measure between instances.5
The advantage of such a mechanism in comparison to traditional similarity mea-
sures is that it takes the classification/regression task into account for computing
the similarities. For classification for example, the instances that belong to the
same class are more likely to be similar according to this measure. Therefore, a
RF trained on a view can be used to measure the dissimilarities between instances
according to the view, and according to their class membership as well. The way
this measure is used to build the per-view intermediate representations is by cal-
culating the dissimilarity of a given instance x to all the n training instances. By
doing so, x can be represented by a new feature vector of size n, or in other words
in a n-dimensional space where each dimension is the dissimilarity to one of the
training instances. This space is called the dissimilarity space6,7 and is used as the
intermediate representation for each view.
As for the second question, we addressed the combination of the view-specific
dissimilarity representations by computing the average dissimilarities over all the
views. That is to say, for an instance x, all the view-specific dissimilarity vectors are
computed and averaged to obtain a final vector of size n. Each value in this vector is
thus the average dissimilarity between x and one of the n training instances over the
views. This is a simple, yet meaningful way to combine the information conveyed
by each view. However, one could find it a little naive when considering the true
rationale behind multi-view learning. Indeed, even if the views are expected to be
complementary to each other, they are likely to contribute in very different ways
to the final decision. One view in particular is likely to be less informative than
another, and this contribution is even likely to be very different from an instance to
predict to another. In that case, it is desirable to estimate and take this importance
into account when merging the view-specific representations. This is the goal we
follow in the present work.
In a nutshell, our preliminary work2 has validated the generic framework ex-
plained above, with the two following key steps: (i) building the dissimilarity space
with the RF dissimilarity mechanism and (ii) combining the views afterward by
averaging the dissimilarities. In the present work, we deepen the second step by
investigating two methods to better combine the view-specific dissimilarities:
The rest of this chapter is organized as follows. The Random Forest dissimi-
larity measure is firstly explained in Section 2. The way it is used for multi-view
classification is detailed in Section 3. The different strategies for combining the
dissimilarity representations are given in Section 4, along with our two proposals
for static and dynamic view combinations. Finally, the experimental validation is
presented in Section 5.
where h_k(x) is the k-th Random Tree of the forest, built using the mechanisms
explained above.3,8 Note however that there exist many other RF learning methods
that differ from the Breiman’s method by the use of different randomization tech-
niques for growing the trees.9
For predicting the class of a given instance x with a Random Tree, x goes down
the tree structure from its root to one of its leaves. The descending path followed by
x is determined by successive tests on the values of its features, one per node along
the path. The prediction is given by the leaf in which x has landed. More informa-
tion about this procedure can be found in the recently published RF reviews.8–10
The key point here is that, if two test instances land in the same terminal node,
they are likely to belong to the same class and they are also likely to share simi-
larities in their feature vectors, since they have followed the same descending path.
This is the main motivation behind using RF for measuring dissimilarities between
instances.
Note that the final prediction of a RF classifier is usually obtained via majority
voting over the component trees. Here again, there exist alternatives to majority
voting,9 but the latter remains the most used as far as we know.
\[ p_H(x_i, x_j) = \frac{1}{M} \sum_{k=1}^{M} p_k(x_i, x_j) = \frac{1}{M} \sum_{k=1}^{M} \frac{1}{\exp(w \cdot g_{ijk})} \tag{4} \]

where g_{ijk} is the number of tree branches between the two terminal nodes occupied
by x_i and x_j in the k-th tree of the forest, and where w is a parameter to control the
influence of g in the computation. When l_k(x_i) = l_k(x_j), d_k(x_i, x_j) is still equal to
0, but in the opposite situation the resulting value is in ]0, 1].
A second variant,11 noted RFD in the following, leans on a measure of instance
hardness, namely the κ-Disagreeing Neighbors (κDN) measure,12 which estimates the
intrinsic difficulty of predicting an instance as follows:

\[ \kappa DN(x_i) = \frac{|\{x_k : x_k \in \kappa NN(x_i) \wedge y_k \neq y_i\}|}{\kappa} \tag{5} \]
where κNN(x_i) is the set of the κ nearest neighbors of x_i. This value is used for
measuring the dissimilarity d̂(x, x_i) between any instance x and any of the training
instances x_i, as follows:

\[ \hat{d}(x, x_i) = \frac{\sum_{k=1}^{M} (1 - \kappa DN_k(x_i)) \times d_k(x, x_i)}{\sum_{k=1}^{M} (1 - \kappa DN_k(x_i))} \tag{6} \]

where κDN_k(x_i) is the κDN measure computed in the subspace formed by the
sole features used in the k-th tree of the forest.
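For illustration, the κDN measure of Eq. 5 can be computed directly. A small numpy sketch, assuming Euclidean neighbors in the full feature space (whereas κDN_k in Eq. 6 is computed in per-tree subspaces):

```python
import numpy as np

def kdn(X, y, kappa=3):
    """kappa-Disagreeing Neighbors (Eq. 5): for each training instance,
    the fraction of its kappa nearest neighbors (Euclidean distance
    here) that carry a different class label."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # an instance is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :kappa]    # indices of the kappa nearest
    return (y[nn] != y[:, None]).mean(axis=1)
```

Instances in homogeneous regions of the feature space get a hardness near 0, while instances surrounded by other classes approach 1.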
Any of these variants could be used to compute the dissimilarities in our frame-
work. However, we choose to use the RFD variant in the following, since it has been
shown to give very good results when used for building dissimilarity representations
for multi-view learning.11
The key principle of the proposed framework is to compute the RFD matrices
D_H^{(q)} from each of the Q training subsets T^{(q)}. For that purpose, each T^{(q)} is fed
to the RF learning procedure, resulting in Q RF classifiers noted H^{(q)}, ∀q = 1..Q.
The RFD measure is then used to compute the Q RFD matrices D_H^{(q)}, ∀q = 1..Q.
Once these RFD matrices are built, they have to be merged in order to build the
joint dissimilarity matrix DH that will serve as a new training set for an additional
learning phase. This additional learning phase can be realized with any learning
algorithm, since the goal is to address the classification task. For simplicity and
because they are as accurate as they are versatile, the same Random Forest method
used to calculate the dissimilarities is also used in this final learning stage.
Regarding the merging step, which is the main focus of the present work, it can
be straightforwardly done by a simple average of the Q RFD matrices:
\[ D_H = \frac{1}{Q} \sum_{q=1}^{Q} D_H^{(q)} \tag{10} \]
The average dissimilarity is a simple, yet meaningful way to merge the dissimilarity
representations built from all the views. However, it intrinsically considers that
all the views are equally relevant with regard to the task and that the resulting
dissimilarities are as reliable as each other. This is likely to be wrong from our
point of view. In multi-view learning problems, the different views are meant to be
complementary in some ways, that is to say to convey different types of information
March 12, 2020 10:55 ws-rv961x669 HBPRCV-6th Edn.–11573 bernard˙HBPRCV page 127
regarding the classification task. These different types of information may not
have the same contribution to the final predictions. That is the reason why it
may be important to differentiate these contributions, for example with a weighted
combination in which the weights would be defined according to the view reliability.
The calculation of these weights can be done following two paradigms: static
weighting and dynamic weighting. The static weighting principle is to weight the
views once for all, with the assumption that the importance of each view is the same
for all the instances to predict. The dynamic weighting principle, on the other hand,
aims at setting different weights for each instance to predict, with the assumption
that the contribution of each view to the final prediction is likely to be different
from one instance to another.
Given a set of dissimilarity matrices {D(1) , D(2) , . . . , D(Q) } built from Q different
views, our goal is to find the best set of non-negative weights {w(1) , w(2) , . . . , w(Q) },
so that the joint dissimilarity matrix is:
\[ D = \sum_{q=1}^{Q} w^{(q)} D^{(q)} \tag{11} \]

where w^{(q)} ≥ 0 and \sum_{q=1}^{Q} w^{(q)} = 1.
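The static combination of Eq. 11, with the simple average of Eq. 10 as a special case, is just a weighted sum of matrices. A minimal sketch (function name is ours):

```python
import numpy as np

def joint_dissimilarity(D_views, weights=None):
    """Joint dissimilarity matrix (Eq. 11).  D_views is a list of Q
    view-specific (n, n) matrices; weights=None recovers the simple
    average baseline of Eq. 10."""
    D = np.stack(D_views)                                # (Q, n, n)
    if weights is None:
        weights = np.full(len(D_views), 1.0 / len(D_views))
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0.0) and np.isclose(w.sum(), 1.0)
    return np.tensordot(w, D, axes=1)                    # sum_q w_q * D_q
```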
There exist several ways, proposed in the literature, to compute the weights
of such a static combination of dissimilarity matrices. The most natural one is to
deduce the weights from a quality score measured on each view. For example, this
principle has been used for multi-scale image classification14 where each view is a
version of the image at a given scale, i.e. the weights are derived directly from the
scale factor associated with the view. Obviously, this only makes sense with regard
to the application, for which the scale factor gives an indication of the reliability
for each view.
Another, more generic and classification-specific approach, is to evaluate the
quality of the dissimilarity matrix using the performance of a classifier. This makes
it possible to estimate whether a dissimilarity matrix sufficiently reflects class mem-
bership.14,15 For example, one can train an SVM classifier from each dissimilarity
matrix and use its accuracy as an estimation of the corresponding weights.14 kNN
classifiers are also very often used for that purpose.15,16 The reason is that a good
dissimilarity measure is expected to propose good neighborhoods, or in other words
the most similar instances should belong to the same class.
Since kernel matrices can be viewed as similarity matrices, there are also a few
solutions from the kernel methods literature that could be used to estimate the
quality of a dissimilarity matrix. The most notable is the Kernel Alignment (KA)
estimate17 A(K1 , K2 ), for measuring the similarity between two kernel matrices K1
and K_2:

\[ A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \, \langle K_2, K_2 \rangle_F}} \tag{12} \]

where K_i is a kernel matrix and ⟨·, ·⟩_F denotes the Frobenius inner product.17
In order to use the KA measure to estimate the quality of a given kernel matrix, a
target matrix must be defined beforehand. This target matrix is an ideal theoretical
similarity matrix, regarding the task. For example, for binary classification, the
ideal target matrix is usually defined as K∗ = yyT , where y = {y1 , y2 , . . . , yn } are
the true labels of the training instances, in {−1, +1}. Thus, each value in K∗ is:
\[ K^*_{ij} = \begin{cases} 1, & \text{if } y_i = y_j \\ -1, & \text{otherwise} \end{cases} \tag{13} \]
In other words, the ideal matrix is the similarity matrix in which instances are
considered similar (K∗ij = 1) if and only if they belong to the same class. This
estimate is transposed to multi-class classification problems as follows:18
\[ K^*_{ij} = \begin{cases} 1, & \text{if } y_i = y_j \\ \frac{-1}{C-1}, & \text{otherwise} \end{cases} \tag{14} \]

where C is the number of classes.
Both kNN and KA methods presented above are used in the experimental part
for comparison purposes (cf. Section 5). However, in order to use the KA method
for our problem, some adaptations are required. Firstly, the dissimilarity matrices
need to be transformed into similarity matrices by S(q) = 1 − D(q) . The following
heuristic is then used to deduce the weight from the KA measure:19
\[ w^{(q)} = \frac{A(S^{(q)}, yy^T)}{\sum_{h=1}^{Q} A(S^{(h)}, yy^T)} \tag{15} \]
Strictly speaking, for the similarity matrices S^{(q)} to be considered as kernel matrices,
it must be proven that they are p.s.d. When such matrices are proven to be p.s.d.,
the KA estimate is necessarily non-negative, and the corresponding w^{(q)} are also
non-negative.17,19 However, as it is not proven that our matrices S^{(q)} built from
RFD are p.s.d., we propose to use the softmax function to normalize the weights
and to ensure they are strictly positive:

\[ w^{(q)} = \frac{\exp(A(S^{(q)}, K^*))}{\sum_{h=1}^{Q} \exp(A(S^{(h)}, K^*))} \tag{16} \]
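Equations 12 and 16 can be sketched together as follows; the function names are our own:

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Kernel Alignment (Eq. 12): cosine of the angle between two
    matrices under the Frobenius inner product."""
    num = np.sum(K1 * K2)
    return num / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def ka_softmax_weights(S_views, K_star):
    """Softmax-normalised KA weights (Eq. 16) for per-view similarity
    matrices S = 1 - D; the weights stay strictly positive even when
    the S matrices are not p.s.d."""
    a = np.array([kernel_alignment(S, K_star) for S in S_views])
    e = np.exp(a)
    return e / e.sum()
```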
The main drawback of the methods mentioned above is that they evaluate the
quality of the dissimilarity matrices based solely on the training set. This is the
very essence of these methods, which are designed to evaluate (dis)similarity matri-
ces built from a sample, e.g. the training set. However, this may cause overfitting
issues when these dissimilarity matrices are used for classification purposes, as is
the case in our framework. Ideally, the weights should be set from the quality of the
dissimilarity representations estimated on an independent validation dataset. Obviously,
this requires having additional labeled instances. The method we propose
in this section allows estimating the quality of the dissimilarity representations
without the use of additional validation instances.
The idea behind our method is that the relevance of a RFD space is reflected by
the accuracy of the RF classifier used to build it. This accuracy can be efficiently
estimated with a mechanism called the Out-Of-Bag (OOB) error. This OOB error
is an estimate supplied by the Bagging principle, known to be a reliable estimate of
the generalization error.3 Since the RF classifiers in our framework are built with
the Bagging principle, the OOB error can be used to estimate their generalization
error without the need of an independent validation dataset.
Let us briefly explain here how the OOB error is obtained from a RF: let B
denote a Bootstrap sample formed by randomly drawing p instances from T , with
replacement. When p = n, n being the number of instances in T , it can be proven
that about one third of T , on average, will not be drawn to form B.3 These instances
are called the OOB instances of B. Using Bagging for growing a RF classifier, each
tree in the forest is trained on a Bootstrap sample, that is to say using only about
two thirds of the training instances. Similarly, each training instance x is used for
growing about two thirds of the trees in the forest. The remaining trees are called
the OOB trees of x. The OOB error is the error rate measured on the whole training
set by only using the OOB trees of each training instance.
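The "about one third" figure follows from the bootstrap itself: the probability that a given instance is never drawn in n draws with replacement is (1 − 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A quick check:

```python
import math

def oob_fraction(n):
    """Expected fraction of T left out-of-bag for a bootstrap of size n
    drawn with replacement from n instances: (1 - 1/n)**n -> 1/e."""
    return (1.0 - 1.0 / n) ** n
```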
Therefore, the method we propose consists in using the OOB error of the
RF classifier trained on a view directly as its weight in the weighted combination.
This method is noted SWOOB in the following.
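A sketch of SWOOB under one plausible reading: each view's OOB error is converted to an accuracy and the accuracies are normalised so the weights sum to one. The exact normalisation here is our assumption, not taken from the text:

```python
import numpy as np

def swoob_weights(oob_errors):
    """SWOOB sketch: turn per-view OOB error rates into static view
    weights.  We use accuracy = 1 - error and normalise to sum to one;
    this normalisation is an assumption for illustration."""
    acc = 1.0 - np.asarray(oob_errors, dtype=float)
    return acc / acc.sum()
```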
The generation of the pool is the first key step of Dynamic Classifier Selection (DCS). As the aim is to select the
most competent classifier on the fly for each given test instance, the classifiers in
the pool must be as diverse and as individually accurate as possible. In our case,
the challenge is not to create diversity among the classifiers, since they are trained
on different joint dissimilarity matrices, generated with different sets of weights.
The challenge is rather to generate these different weight tuples used to compute
the joint dissimilarity matrices. For such a task, a traditional grid search strategy
could be used. However, the number of candidate solutions increases exponentially
with respect to the number of views. For example, suppose that we sample the
weights with 10 values in [0, 1]. For Q views, it would result in 10^Q different weight
tuples. Six views would thus imply generating 1 million weight tuples and training
1 million classifiers afterwards. Here again, this is obviously highly inefficient.
The alternative approach we propose is to select a subset of views for every
candidate in the pool, instead of considering a weighted combination of all of them.
By doing so, for each instance to predict, only the views that are considered infor-
mative enough are expected to be used for its prediction. The selected views are
then combined by averaging. For example, if a problem is described with six views,
there are 2^6 − 1 = 63 possible combinations (the situation in which none of the
views is selected is obviously ignored), which will result in a pool of 63 classifiers
H = {H1 , H2 , . . . , H63 }. Lines 1 to 6 of Algorithm 2 give a detailed implementation
of this procedure.
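Enumerating the candidate pool then amounts to listing the non-empty subsets of view indices; a minimal sketch (name is ours):

```python
from itertools import combinations

def view_subsets(Q):
    """Enumerate the 2**Q - 1 non-empty subsets of Q view indices; each
    subset defines one candidate in the DCS pool (the selected views'
    dissimilarity matrices are averaged)."""
    return [s for r in range(1, Q + 1) for s in combinations(range(Q), r)]
```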
The selection of the most competent classifier is the second key step of DCS. Gen-
erally speaking, this selection is made through two steps:20 (i) the definition of a
region of competence for the instance to predict and (ii) the evaluation of each clas-
sifier in the pool for this region of competence, in order to select the most competent
one.
• Create the pool of classifiers by using all the possible subsets of views, to avoid
the expensive grid search for the weights generation (lines 4-5 of Algorithm 2).
• Define the region of competence in the dissimilarity space by using the RFD
dissimilarity measure, to circumvent the issues that arise from high dimensional
spaces (lines 12-13 of Algorithm 2).
• Evaluate the competence of each candidate classifier with its OOB error rate,
so that no additional validation instances are required (line 14 of Algorithm 2).
• Select the best classifier for xt (lines 16-17 of Algorithm 2).
These steps are also illustrated in Figure 2 with the generation of the pool of
classifiers in the upper part, and with the evaluation and selection of the classifier
in the lower part of the figure. For illustration purposes, the classifier ultimately
selected for predicting the class of xt is assumed to be the second candidate (in
red).
Fig. 2. The DCSRF D procedure, with the training and prediction phases. The best candidate
classifier that gives the final prediction for xt is H[2] in this illustration (in red).
5. Experiments
The multi-view datasets used in this experiment are described in Table 1. All
these datasets are real-world multi-view datasets, supplied with several views of
the same instances: NonIDH1, IDHcodel, LowGrade and Progression are medical
imaging classification problems, with different families of features extracted from
different types of radiographic images; LSVT and Metabolomic are two other med-
ical related classification problems, the first one for Parkinson’s disease recognition
and the second one for colorectal cancer detection; BBC and BBCSport are text
classification problems from news articles; Cal7, Cal20, Mfeat, NUS-WIDE2, NUS-
WIDE3, AWA8 and AWA15 are image classification problems made up of different
families of features extracted from the images. More details about how these
datasets have been constituted can be found in the paper (and references therein)
cited in the caption of Table 1.
All the methods used in these experiments include the same first stage, i.e.
building the RF classifiers from each view and then building the view-specific RFD
matrices. Therefore, for a fair comparison on each dataset, all the methods use
the exact same RF classifiers, made up with the same 512 trees.2 As for the other
important parameters of the RF learning procedure, the mtry parameter is set to
√m_q, where m_q is the dimension of the q-th view, and all the trees are grown to
their maximum depth (i.e. with no pre-pruning).
The methods compared in this experiment differ in the way they combine the
view-specific RFD matrices afterwards. We recall below these differences:
• Avg denotes the baseline method for which the joint dissimilarity representation
is formed by a simple average of the view-specific dissimilarity representations.
• SW_3NN and SW_KA both denote static weighting methods for determining Q
weights, one per view. The first one derives the weights from the performance
of a 3NN classifier applied on each RFD matrix; the second one uses the KA
method to estimate the relevancy of each RFD matrix with regard to the
classification problem.
• SWOOB is the static weighting method we propose in this work and presented
in Section 4.1; it computes the weights of each view from the OOB error rate
of its RF classifier.
• DCSRF D is the dynamic selection method we propose in this work and pre-
sented in Section 4.2; it computes different combinations of the RFD matrices
for each instance to predict based on its k nearest neighbors, with k = 7 fol-
lowing the recommendation in the literature.20
After each method determines a set of Q weights, the joint RFD matrix is computed.
This matrix is then used as a new training set for a RF classifier learnt with the
same parameters as above (512 trees, mtry = √n with n the number of training
instances, fully grown trees).
As for the pre-processing of the datasets, a stratified random splitting procedure
is repeated 10 times, with 50% of the instances for training and 50% for testing.
The mean accuracies, with standard deviations, are computed over the 10 runs and
reported in Table 2. Bold values in this table are the best average performance
obtained on each dataset.
Table 2. Accuracy (mean ± standard deviation) and average ranks. Columns in
order: Avg, SW_3NN, SW_KA, SW_OOB, DCS_RFD.
AWA8 56.22% ± 1.01 56.22% ± 0.99 56.12% ± 1.42 56.59% ± 1.41 57.28% ± 1.49
AWA15 38.23% ± 0.83 38.13% ± 0.87 38.27% ± 1.05 38.23% ± 1.26 38.82% ± 1.56
BBC 95.46% ± 0.65 95.52% ± 0.64 95.36% ± 0.74 95.46% ± 0.60 95.42% ± 0.59
BBCSport 90.18% ± 1.96 90.29% ± 1.83 90.26% ± 1.78 90.26% ± 1.95 90.44% ± 1.89
Cal7 96.03% ± 0.53 96.10% ± 0.57 96.11% ± 0.60 96.10% ± 0.60 94.65% ± 1.09
Cal20 89.76% ± 0.80 89.88% ± 0.82 89.77% ± 0.68 90.00% ± 0.71 89.15% ± 0.97
IDHCodel 76.76% ± 3.59 77.06% ± 3.43 77.35% ± 3.24 76.76% ± 3.82 77.65% ± 3.77
LowGrade 63.95% ± 5.62 62.56% ± 6.10 63.95% ± 3.57 63.95% ± 5.01 65.81% ± 5.31
LSVT 84.29% ± 3.51 84.29% ± 3.65 84.60% ± 3.54 84.76% ± 3.63 84.44% ± 3.87
Metabolomic 69.17% ± 5.80 68.54% ± 5.85 70.00% ± 4.86 70.00% ± 6.12 70.21% ± 4.85
Mfeat 97.53% ± 1.00 97.53% ± 1.09 97.53% ± 1.09 97.57% ± 1.01 97.63% ± 0.99
NonIDH1 80.70% ± 3.76 80.47% ± 3.32 80.00% ± 3.15 80.93% ± 4.00 79.77% ± 2.76
NUS-WIDE2 92.82% ± 1.93 92.86% ± 1.88 92.60% ± 2.12 92.97% ± 1.72 93.30% ± 1.58
NUS-WIDE3 80.32% ± 1.95 79.95% ± 2.40 80.09% ± 2.07 80.14% ± 2.20 80.77% ± 2.06
Progression 65.79% ± 4.71 65.79% ± 4.71 65.79% ± 4.99 66.32% ± 4.37 66.84% ± 5.29
Avg rank 3.67 3.50 3.30 2.40 2.13
The average ranks place SW_OOB and DCS_RFD in the first two positions. To better assess the extent to which these differences are
significant, a pairwise analysis based on the Sign test is computed on the number of
wins, ties and losses between the baseline method Avg and all the other methods.
The result is shown in Figure 3.
Fig. 3. Pairwise comparison between each method and the baseline Avg. The vertical lines are
the level of statistical significance according to the Sign test.
From this statistical test, one can observe that none of the static weighting
methods reaches the significance level of wins over the baseline method. It
indicates that the simple average combination, when using dissimilarity representa-
tions for multi-view learning, is a quite strong baseline. It also underlines that all
views are globally relevant for the final classification task. There is no view that is
always irrelevant, for all the predictions.
Figure 3 also shows that the dynamic selection method proposed in this work
is the only method that predominantly improves the accuracy over this baseline,
to the point of reaching statistical significance. From our point of view, it shows
that all the views do not contribute to the same extent to the good prediction
of every instance. Some instances are better recognized when the dissimilarities
are computed by relying on some views more than on the others. These views are
certainly not the same ones from one instance to another, and some instances may
need the dissimilarity information from all the views at some point. Nevertheless,
this highlights that the confusion between the classes is not always consistent from
one view to another. In that sense, the views complement each other, and this can
be efficiently exploited for multi-view learning provided that we can identify the
views that are the most reliable for every instance, one by one.
6. Conclusion
Multi-view data are now very common in real world applications. Whether they
arise from multiple sources or from multiple feature extractors, the different views
are supposed to provide a more accurate and complete description of objects than
a single description would do. Our proposal in this work was to address multi-
view classification tasks using dissimilarity strategies, which give an efficient way to
Acknowledgement
This work is part of the DAISI project, co-financed by the European Union with
the European Regional Development Fund (ERDF) and by the Normandy Region.
References
1. X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection net-
work for autonomous driving. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 6526–6534, (2017).
2. H. Cao, S. Bernard, R. Sabourin, and L. Heutte, Random forest dissimilarity based
multi-view learning for radiomics application, Pattern Recognition. 88, 185–197,
(2019).
3. L. Breiman, Random forests, Machine Learning. 45(1), 5–32, (2001).
4. M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, Do we need hundreds
of classifiers to solve real world classification problems?, Journal of Machine Learning
Research. 15, 3133–3181, (2014).
5. C. Englund and A. Verikas, A novel approach to estimate proximity in a random
forest: An exploratory study, Expert Systems with Applications. 39(17), 13046–13050,
(2012).
CHAPTER 1.8
This chapter reviews the recent development of image colourisation, which aims
at adding colour to a given greyscale image. There are numerous applications
involving image colourisation, such as converting black and white photos or movies
to colour, restoring historic photographs to improve the aesthetics of the image, as
well as colourising many other types of images lacking colour (e.g. medical images,
infrared night time images). According to the source from which the colours come,
the existing methods can be categorised into three classes: colourisation by
reference, colourisation by scribbles and colourisation by deep learning. In this
chapter, we introduce the basic idea and survey the typical algorithms of each
type of method.
1. Introduction
The first monochrome (black and white) photograph was captured in 1839, and
until the mid-20th century the majority of photography remained monochrome. In
order to produce more realistic images, photographers and artists attempted to add
colours to black and white images. From the mid-19th century to the mid-20th
century, monochrome photographs were coloured by hand, as shown in Fig. 1.
However, hand-colouring of photographs requires expert knowledge and is time
consuming. In 1970, the term colourisation was first introduced by Wilson Markle1
to describe the computer-assisted process for adding colours to black and white
movies. The technique arose from colouring classic black and white photos and
videos, and has since been applied in various fields, such as hyperspectral image
visualisation, cartoon design, and 3-dimensional data rendering.
Human beings can readily imagine a plausible colourisation of a greyscale image.
For computers, however, the task is much less direct, since it requires predicting
R, G and B colours for
∗ B.Li is with the School of Mathematics and Information Science, Nanchang Hangkong University,
Nanchang, China, and also with the School of Educational Information Technology, Central China
Normal University, Wuhan, China. e-mail: bolimath@gmail.com. Y.-K. Lai and P. L. Rosin are
with the School of Computer Science and Informatics, Cardiff University, Cardiff, UK.
Fig. 1. A hand-coloured print from the same negative, hand-coloured by Stillfried & Andersen
between 1875 and 1885. https://en.wikipedia.org/wiki/Hand-colouring_of_photographs
each pixel with the given intensity. Mathematically, image colourisation can be
formulated as follows. Given a greyscale image $L \in \mathbb{R}^{m \times n}$, where m and n are the
width and height of the image, image colourisation aims at finding a mapping
function f from the intensity image L to its corresponding colour version $C \in \mathbb{R}^{m \times n \times 3}$,
neural networks, it is difficult to control the deep model to generate user-desired
colourful images due to its black-box property.
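To make the mapping-function view concrete, the sketch below decomposes an RGB image into a luma channel, which plays the role of the greyscale input L, and two chrominance channels that a colouriser f would have to predict. The 4×4 random image and the use of the BT.601 colour matrix are illustrative assumptions, not part of the chapter:

```python
import numpy as np

# A hypothetical 4x4 colour image with values in [0, 1] (illustrative only).
rgb = np.random.default_rng(0).random((4, 4, 3))

# ITU-R BT.601 RGB -> YUV transform: the luma channel Y plays the role of
# the greyscale input L, and a colouriser f must recover the remaining
# two chrominance channels (U, V) for every pixel.
M = np.array([[ 0.299,  0.587,  0.114],
              [-0.147, -0.289,  0.436],
              [ 0.615, -0.515, -0.100]])
yuv = rgb @ M.T
luma = yuv[..., 0]      # L in R^{m x n}: the only observed data
chroma = yuv[..., 1:]   # what f(L) has to predict (m x n x 2)
```

In this decomposition, colourisation reduces to predicting the two missing chrominance planes given the one observed luma plane.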
This chapter is organised as follows. We will introduce the basic idea and re-
view the corresponding typical algorithms of each type of method respectively in
sections 2–5, and finally draw conclusions in section 6.
2. Colourisation by Reference

Colourisation by reference image means that, given a target greyscale image and
a colour reference image, the colour will be transferred automatically from the
reference image to the grey image to produce a colourisation result. The basic
pipeline of colourisation by reference image is shown in Fig. 2. Given a colour
reference and greyscale destination pair of images, the first step is feature extraction
for both images. Next, for each pixel in the destination image, the most similar pixel
in the reference image will be found by feature matching, and then the chrominance
information will be transferred to the destination image according to the matching
results to form the initial colourisation image. Finally, a propagation process is
performed to produce a smooth colour image.
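The matching-and-transfer steps of this pipeline can be sketched as follows. The luminance/neighbourhood-standard-deviation feature, the brute-force nearest-neighbour search, and all function names are illustrative assumptions; the final propagation step is omitted:

```python
import numpy as np

def transfer_colour(target_l, ref_l, ref_uv, w=1):
    """Toy reference-based transfer: match each target pixel to the
    reference pixel with the closest (luminance, local std) feature and
    copy that pixel's chrominance. A real system adds propagation."""
    def features(l):
        m, n = l.shape
        pad = np.pad(l, w, mode='edge')
        feats = np.empty((m, n, 2))
        for i in range(m):
            for j in range(n):
                patch = pad[i:i + 2 * w + 1, j:j + 2 * w + 1]
                feats[i, j] = (l[i, j], patch.std())
        return feats.reshape(-1, 2)

    ft, fr = features(target_l), features(ref_l)
    # brute-force nearest neighbour in feature space
    d = ((ft[:, None, :] - fr[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return ref_uv.reshape(-1, 2)[idx].reshape(target_l.shape + (2,))
```

The brute-force search is O(N²) in the number of pixels; practical systems sample the reference image or use approximate nearest-neighbour structures.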
The pipeline of the proposed method is shown in Fig. 3. Given a target greyscale
image and a reference colour image, the method first segments the reference colour
image by using a robust supervised classification scheme. Next, each pixel in the
target image is mapped to one segment. As pixel classification can lead to a vast
number of misclassified pixels, a voting post-processing step is conducted to enhance
locality consistency, and then the pixels with a sufficiently high confidence level are
provided as colour strokes. Finally, a colour propagation scheme is used to diffuse
the colours from these strokes to the whole image. The work exploits higher level
features which can discriminate between different regions rather than processing
each pixel independently, and guarantees spatial consistency by adopting a voting
process and global diffusion. However, the performance of this method is highly
reliant upon the image segmentation stage.
In Bugeau et al.,10 the optimal feature matching and colour propagation are simultaneously solved by
a variational energy minimisation problem. First, for each pixel in the destination
image, some candidate matching pixels in the reference image are selected by fast
feature matching. Then a variational energy function is designed to choose the best
candidate and produce a smooth colourisation result by minimising the colour vari-
ance in the interior region while keeping the edges as sharp as possible. However,
the total variation regularisation term used in Bugeau et al.10 is only composed of
chrominance channels, which results in obvious halo effects around strong bound-
aries. Pierre et al.11 proposed a new non-convex variational framework based on
total variation defined on both luminance and chrominance channels to reduce the
halo effects. With the regularisation of the luminance image, the method produces
more spatially consistent results whilst preserving image contours. In addition, the
authors prove the convergence of the proposed non-convex model.
Instead of local pixelwise prediction, Charpiat et al.12 tried to solve the prob-
lem by learning multimodal probability distributions of colours, and finally a global
graph cut is used for automatic colour assignment. In this paper, image colourisa-
tion is stated as a global optimisation problem with an explicit energy. First, the
probability of every possible colour at each pixel is predicted, which can be seen
as a conditional probability of colours given the intensity feature. Then a spatial
coherence criterion is learned, and finally a global graph cut algorithm is used to
find the optimal colourisation result. The method operates at the global level, and
is more robust to texture noise and local prediction errors with the help of graph
cuts.
A global histogram regression based colourisation method was proposed in Liu
et al.13 The basic assumption is that the final colourisation image and the reference
image should have similar colour distributions. First, a locally weighted linear re-
gression on the luminance histograms of both source and target images is performed.
Next, zero-points (i.e., local maxima and minima) of the approximated histogram
can be detected and adjusted to match between target and reference images. Then,
the luminance-colour correspondence for the target image can be solved by calcu-
lating the average colour from the source image. Finally, the colourisation result is
achieved by directly mapping this luminance-colour correspondence with the target
image. However, because the method does not take the structural information of
the target image into account, it may produce colour bleeding artefacts around
boundaries.
In order to remove the influences of illumination, Liu et al.14 proposed an
illumination-independent intrinsic image colourisation algorithm (Fig. 5). First,
both reference images and destination image are decomposed into reflectance
(albedo) components and illumination (shading) components. In order to obtain
robust intrinsic decomposition, multiple reference images containing a similar scene
to the destination image are collected from the Internet. Then the colours from the
reference reflectance image will be transferred to the pixels of the grey destination
3. Colourisation by Scribbles
Given a destination greyscale image with some pre-scribbled colour strokes, colouri-
sation by scribbles attempts to propagate the colour from the desired colour strokes
to the whole image automatically based on the assumption that neighbouring pixels
with similar intensity features should have similar colours. The performance of the
colourisation depends on the construction of the affinity matrix, and reducing the
colour bleeding effects around boundaries is another crucial problem for
scribble-based colourisation.
The first colourisation model by scribbles was proposed in Levin et al.16 They
assume that neighbouring pixels that have similar intensity features should have
similar colours. Based on the above basic assumption, the colourisation is resolved
by an optimisation process. The algorithm is composed of three steps. First, the
user must paint some colour scribbles in the interior of various regions, such as
shown in Fig. 6. Then an affinity matrix W is constructed, where each element
of the affinity matrix ωr,s measures the similarity between pixels r and s. Finally,
the colours from the scribbles will be propagated to the whole image by minimising
the following quadratic energy function which measures the difference between the
colour $u_r$ at pixel r and the weighted average of the colours at neighbouring pixels,

$\min_{u} \sum_{r} \Big( u_r - \sum_{s \in N_r} \omega_{r,s} u_s \Big)^2, \quad \text{s.t. } u_r = u_{0,r},\ r \in \Omega$   (2)
where u denotes a chrominance channel, $N_r$ is the set of neighbours of pixel r, and Ω denotes the set of user-scribbled pixels.
As problem (2) is a smooth convex optimisation, it can be solved effectively by
traditional methods.
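A minimal sketch of this optimisation, in the spirit of Eq. (2): affinity weights come from intensity similarity between 4-neighbours, scribbled pixels are hard constraints, and the resulting linear system is solved (densely here for brevity; real implementations use sparse solvers). The Gaussian weight, σ value, and function names are assumptions for illustration:

```python
import numpy as np

def colourise(grey, scribble_uv, scribble_mask, sigma=0.05):
    """Toy scribble propagation: solve u_r = sum_s w_{r,s} u_s for
    unconstrained pixels, with u_r fixed to the scribble values on Omega."""
    m, n = grey.shape
    N = m * n
    A = np.zeros((N, N))
    b = np.zeros((N, 2))
    for i in range(m):
        for j in range(n):
            r = i * n + j
            if scribble_mask[i, j]:
                A[r, r] = 1.0              # u_r = u_{0,r} on the scribbles
                b[r] = scribble_uv[i, j]
                continue
            nbrs = [(i + di, j + dj)
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= i + di < m and 0 <= j + dj < n]
            w = np.array([np.exp(-(grey[i, j] - grey[p, q]) ** 2 / sigma)
                          for p, q in nbrs])
            w /= w.sum()
            A[r, r] = 1.0                  # u_r - sum_s w_{r,s} u_s = 0
            for (p, q), wk in zip(nbrs, w):
                A[r, p * n + q] = -wk
    return np.linalg.solve(A, b).reshape(m, n, 2)
```

On a constant-intensity image the weights are uniform, so a single scribble diffuses its chrominance to the whole image, which matches the isotropic-diffusion behaviour criticised later in the text.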
However, the performance of Levin et al.’s method16 is highly dependent on the
accuracy and amount of user scribbles. For images with complex textures, a very
large number of strokes are required to guarantee high quality colourisation results.
In addition, there are obvious colour bleeding effects around boundaries due to the
characteristic of isotropic diffusion defined in (2).
In order to reduce the burden of users, an efficient interactive colourisation
algorithm was proposed in Luan et al.17 Compared with Levin et al.’s method,16
only a small number of colour strokes are required. The algorithm consists of two
stages: a colour labelling stage and a colour mapping stage. In the first stage, the image
will be segmented into coherent regions according to the intensity similarity to a
small number of user-provided colour strokes. Instead of propagating colours from
strokes to the neighbourhood pixels directly, this method first groups pixels with
similar texture features which should have similar colours. The amount of colour
strokes is reduced dramatically by this strategy. In the colour mapping stage, the
user needs to assign the colour for a few pixels with significant luminance in each
coherent region, and then the colours of the remaining pixels are produced by
simple linear blending via a piece-wise linear mapping in the luminance channel.
Xu et al.18 proposed to reduce the colour strokes from a feature space perspec-
tive. The method adaptively determines the impact of each colour stroke in the
feature space composed of spatial location, image structure, and spatial distance.
Each stroke is confined to control a subset of pixels via a Laplacian weighted global
optimisation. Various regularisation terms can be incorporated into the global
optimisation to enhance its edge-preserving property.
In Ding et al.,19 an automatic scribble generation and colourisation method
was proposed. Instead of assigning colour strokes by users, the authors propose
to generate scribbles automatically by distinguishing the pixels where the spatial
distribution entropy achieves locally extreme values. Then the colourisation will
be conducted by computing quaternion wavelet phases along equal-phase lines, and
a contour strength model is also established in scale space to guide the colour
propagation while preserving the edge structure.
In order to fix the artefacts of colour bleeding around boundaries, an adaptive
edge detection based colourisation algorithm was proposed in Huang et al.20 First
the reliable edge information is extracted from the greyscale image, and then a
propagation method similar to Levin et al.'s16 is applied with the assistance of the
edge structure. In the work by Anagnostopoulos et al.,21 salient contours are
utilised to reduce the colour bleeding artefacts caused by weak object boundaries.
Their method is composed of two stages. In the first stage, the
user-provided scribble image is enhanced with the assistance of salient contours
automatically detected in the destination greyscale image. Meanwhile, the image
will be segmented into homogeneous colour regions of high confidence and critical
attention-needing regions. For pixels in the homogeneous regions, the colour will
be diffused by the model proposed in Levin et al.’s method,16 while for the pix-
els in attention-needing regions, a second edge-preserving diffusion stage will be
performed with the guidance of salient contours.
In order to reduce the complexity of optimisation-based colour propagation,
a fast image colourisation algorithm using chrominance blending was proposed in
Yatziv et al.22 Based on the basic observation that most of the time is spent on the
iterative solution of the optimisation model defined in Levin et al.,16 a non-iterative
method was proposed in this paper. The proposed scheme is based on the concept
of weighted colour blending derived from the geodesic distance between different
pixels computed in the luminance channel. The method is fast and permits the
user to interactively get the desired results promptly after providing a reduced set
of chrominance scribbles.
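The geodesic-blending idea can be sketched as below: path costs accumulate luminance differences, so distances grow quickly across edges, and each pixel blends the scribble colours with inverse-distance weights. The 4-neighbour graph, the small per-step constant, and the inverse-distance weighting are illustrative assumptions rather than the exact formulation of the paper:

```python
import heapq
import numpy as np

def geodesic_dist(lum, seed):
    """Dijkstra on the pixel grid: each step costs the luminance
    difference (plus a tiny constant), so strong edges act as barriers."""
    m, n = lum.shape
    dist = np.full((m, n), np.inf)
    dist[seed] = 0.0
    pq = [(0.0, seed)]
    while pq:
        d, (i, j) = heapq.heappop(pq)
        if d > dist[i, j]:
            continue
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            p, q = i + di, j + dj
            if 0 <= p < m and 0 <= q < n:
                nd = d + abs(lum[i, j] - lum[p, q]) + 1e-6
                if nd < dist[p, q]:
                    dist[p, q] = nd
                    heapq.heappush(pq, (nd, (p, q)))
    return dist

def blend(lum, scribbles):
    """scribbles: list of ((i, j), (u, v)); inverse-geodesic-distance
    blending of the scribble chrominances at every pixel."""
    ws, cs = [], []
    for seed, uv in scribbles:
        ws.append(1.0 / (geodesic_dist(lum, seed) + 1e-6))
        cs.append(np.asarray(uv, dtype=float))
    ws = np.stack(ws)                 # (k, m, n)
    ws /= ws.sum(axis=0)
    return np.einsum('kmn,kc->mnc', ws, np.stack(cs))
```

Because no global system is solved, this style of method is non-iterative and fast, which is the point emphasised in the text.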
Scribble-based image colourisation can also be solved by sparse representation
learning.23 First, an over-complete dictionary in chrominance space is trained on
numerous sample colour patches to explore the low-dimensional subspace manifold
structure. Given a greyscale image with a small subset of colour strokes, the image
is first segmented into overlapping patches, and then the sparse coefficients of each
patch on the pretrained dictionary can be learned using a sparse representation
based on the luminance and the given chrominance within the patch. Once the
sparse coefficients are solved, the colour of each patch can be generated by a sparse
linear combination of the colour dictionary. A large dataset which can cover the
variation of target images is required to train the dictionary, and each patch is
processed independently without considering the locality consistency.
Benefitting from the strong theories and tractable computations of matrix re-
covery, Wang et al.24 made the first attempt to reformulate the task of image
colourisation as a matrix completion problem. Each chrominance image can be
seen as a corrupted matrix with reliable values only on the locations of scribbles,
then the task of image colourisation is formulated to complete the chrominance
matrix with a semi-supervised learning method. Based on the basic assumption
that any natural image can be effectively approximated by a low-rank matrix plus a
sparse matrix, a low-rank subspace learning method which can be effectively solved
by the augmented Lagrange multiplier algorithm is utilised to complete the colour
matrix.
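The low-rank completion idea can be illustrated with a minimal singular-value-thresholding loop. This is a generic sketch of matrix completion, not the augmented Lagrange multiplier solver of the paper; the threshold τ and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def complete_lowrank(M_obs, mask, tau=0.1, n_iter=300):
    """Singular-value-thresholding sketch for completing a chrominance
    matrix observed only at scribble locations (mask == True): shrink
    the singular values, then re-impose the observed entries."""
    X = np.where(mask, M_obs, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt  # low-rank shrinkage
        X = np.where(mask, M_obs, X)                    # keep observations
    return X
```

Soft-thresholding the singular values is the proximal operator of the nuclear norm, the standard convex surrogate for rank, which is why this simple alternation pushes the completed matrix towards a low-rank explanation of the scribbles.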
However, the low-rank assumption on the whole image matrix may not hold for
images with complex textures. To handle this case, a colourisation algorithm based
on patch-based local low-rank matrix completion was proposed in Yao et al.25
Instead of assuming that the whole image matrix is low-rank, the method first
divides the image into small patches, and assumes that the matrix formed by the
patches has a low-rank structure. A local low-rank matrix factorisation algorithm
is then used to complete the colour image, solved by an efficient optimisation based
on the alternating direction method of multipliers.
An image colourisation method based on colour propagation and low-rank minimisation
was proposed in Ling et al.26 Given a greyscale image and a few colour
strokes, the paper first propagates the colours from the strokes to the neighbourhood
pixels according to the Chi-square distance of local texture features, and
meanwhile a confidence map is computed. As the initial colourisation result computed
by propagation is not accurate enough, a rank optimisation constrained by
the previously computed confidence map is used to improve the performance.
Fig. 7. Deep colourisation.36
4. Colourisation by Deep Learning

The first deep learning based image colourisation method was proposed by Cheng
et al.36 In this paper, image colourisation is reformulated as a regression problem
and is solved by a regular fully-connected deep neural network (Fig. 7). Finally, a
joint bilateral filtering is utilised as a post-processing step to reduce the artefacts.
The model utilised in this paper is a three-layer fully connected neural network.
Three levels of features are utilised, including the raw image patch, DAISY features,
and semantic features. Given the features as the input, the output of the network
is the prediction of the corresponding chrominance. A set of 2344 training images
is used to train the network. A limitation is that the model requires handcrafted
features as the input of the network rather than learning features solely from the
input images themselves.
Instead of relying on hand-crafted features, a fully automatic end-to-end image
colourisation model was proposed by Larsson et al.37 A pretrained VGG network
is utilised to generate features of different scales. For each pixel, a hypercolumn
feature is extracted by concatenating the features at its spatial location in all layers,
which incorporates the semantic information and localisation property. Taking into
account that some objects (such as clothing) may be drawn from many suitable
colours, this paper treats colour prediction as a histogram estimation task rather
than as regression, and a KL-divergence based loss function is designed to measure
the prediction accuracy.
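Treating colour prediction as histogram estimation with a KL-divergence loss can be sketched per pixel as follows; the function name, the bin layout, and the numerical-stability constant are illustrative assumptions:

```python
import numpy as np

def kl_colour_loss(logits, target_hist, eps=1e-12):
    """Per-pixel KL divergence KL(target || predicted) between a target
    colour-bin histogram and the softmax of the predicted logits."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return np.sum(target_hist * (np.log(target_hist + eps) - np.log(p + eps)),
                  axis=-1)
```

Because the target is a distribution over colour bins rather than a single chrominance value, multi-modal objects (e.g. clothing) no longer force the network towards a desaturated average.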
Due to the underlying uncertainty of image colourisation, regression based learn-
ing methods often result in desaturated colourisation. Zhang et al.38 proposed a
novel classification based colourisation network. In order to model the multimodal
nature of image colourisation, the authors attempt to predict a distribution of pos-
sible colours for each pixel rather than a fixed colour. Because the distribution of
chrominance values in natural images is strongly biased, a class rebalancing process
is utilised to emphasise rare colours. Finally, vibrant and realistic colourisation
results are produced by taking the “annealed mean” of the distribution. The main
contribution of this work is designing an appropriate objective function that han-
dles the multimodal uncertainty of the colourisation problem and captures a wide
diversity of colours. In addition, this paper proposes a novel framework for testing
colourisation results.
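The "annealed mean" of a per-pixel colour distribution can be sketched as below: sharpen the distribution with a temperature, renormalise, and take the expectation over the bin centres, so that T → 0 approaches the mode and T = 1 recovers the plain mean. The default temperature here is illustrative, not necessarily the paper's value:

```python
import numpy as np

def annealed_mean(probs, bin_values, T=0.38):
    """Interpolate between the mode (T -> 0) and the mean (T = 1) of a
    distribution over colour bins; bin_values holds the bin centres."""
    sharp = np.exp(np.log(probs + 1e-12) / T)   # temperature sharpening
    sharp /= sharp.sum(axis=-1, keepdims=True)
    return sharp @ bin_values                    # expectation over bins
```

This is what lets the method trade the spatial consistency of the mean against the vibrancy of the mode.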
A novel end-to-end framework which combines both global priors and local image
features was proposed in Iizuka et al.39 The proposed architecture can extract
local, mid-level and global features jointly from an image, which can then be fused
for predicting the final colourisation. In addition, a global semantic class label is
utilised during the training process to learn more discriminative global features. The
proposed model is composed of four main components: a low-level feature network,
a mid-level feature network, a global feature network and a colourisation network.
First, a 6-layer convolutional neural network is used to learn the low-level features
from the image, and then mid-level and high-level features are learned based on the
shared low-level features. Next, a fusion layer is designed to incorporate the global
features into local mid-level features, and then the fused features are processed by a
set of convolutions and upsampling layers to generate the final colourisation results.
In order to incorporate global semantic priors, a global classification branch is added
to help learn the global context of the image. In addition, the model can directly
transfer the style of an image into the colourisation of another.
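The fusion step can be sketched as follows: the global feature vector is replicated at every spatial location of the mid-level feature map and concatenated along the channel axis (in the actual architecture a learned projection follows; the shapes here are arbitrary illustrative choices):

```python
import numpy as np

def fuse(mid_feats, global_feat):
    """Fusion-layer sketch: tile a global feature vector over the
    spatial grid of a mid-level feature map and concatenate channels."""
    h, w, _ = mid_feats.shape
    tiled = np.broadcast_to(global_feat, (h, w, global_feat.size))
    return np.concatenate([mid_feats, tiled], axis=-1)
```

Tiling gives every local position direct access to image-level context, which is what allows the global semantic prior to influence per-pixel colour decisions.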
Colourisation is an ambiguous problem, with multiple plausible colourisation
results being possible for a single grey-level image. For example, a tree can be
green, yellow, brown or red. However, the above end-to-end deep learning methods
can only produce a single colourisation.
A user-guided deep image colourisation method was proposed in Zhang et al.40
Compared with traditional optimisation-based interactive colourisation methods,16
the proposed deep neural network propagates user edits by fusing low-level cues
along with high-level semantic information, learned from a million images rather
than using hand-defined rules. The proposed network learns how to propagate
sparse user hints by training a deep network to directly predict the mapping from
a greyscale image and randomly generated user colour hints to a full colour image
on a large dataset. In addition, a data-driven colour palette is designed to suggest
colours for each pixel.
Another user-guided deep colourisation method was proposed in He et al.41
In this paper, a reference colour image is used to guide the output of the deep
colour model rather than using user-provided local colour hints as in Zhang et
al.40 It is the first deep learning approach for exemplar-based local colourisation.
The proposed network is composed of two sub-networks, a similarity sub-net and
a colourisation sub-net. The similarity sub-net can be seen as a pre-processing
step for the colourisation sub-net. It computes the semantic similarities between
the reference and the target image by using a pretrained VGG-19 network, and
generates a dense bidirectional mapping function by using a deep image analogy
technique. Then the greyscale target image, the colour reference image and the
learned bidirectional mapping functions are fed into the colourisation sub-net. The
architecture of the colourisation sub-net is a typical multi-task learning framework
consisting of two branches. The first branch is used to measure the chrominance
ods, a high-level semantic loss function which can make the output indistinguishable
from reality is learned automatically by a generative adversarial network, which is
then used to train the network to learn the mapping from the input image to the
output image. In addition, the proposed architecture can learn a loss that adapts
to the data, which avoids designing different loss functions for specific tasks.
The image-to-image translation network of Isola et al.47 has to be trained on
aligned image pairs; however, paired training data is not available for many tasks.
A novel unpaired image-to-image translation framework was proposed in Zhu et
al.48 by using cycle-consistent adversarial networks. The proposed cycle generative
model first learns a mapping from the source domain to the target domain, and then
an inverse mapping is introduced to learn the mapping from the target domain to
the input domain coupled with a cycle consistency loss function.
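The cycle consistency term can be written compactly as below. The toy inverse mappings in the test, the L1 form, and the weight λ are illustrative assumptions in the spirit of such frameworks:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F, lam=10.0):
    """Cycle loss sketch: G maps domain X -> Y and F maps Y -> X; the
    loss penalises failing to return to the starting image after a
    round trip in either direction."""
    return lam * (np.abs(F(G(x)) - x).mean() + np.abs(G(F(y)) - y).mean())
```

If F is an exact inverse of G the loss vanishes, which is precisely the constraint that substitutes for paired supervision.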
Some experimental results of deep learning based colourisation methods are
shown in Fig. 8. For popular scenes, such as that shown in the first row, most of the
algorithms can produce plausible results. However, when the textures of different
objects look similar, many semantically wrong colours will be generated, as shown
in the second and third rows. In addition, the colourisation results generated by
most of the existing methods are not yet colourful enough, as shown in the last
three rows.
Fig. 8. Colourisation results by deep learning models. From (a) to (e) are respectively the
ground truth and the results of the methods of Larsson et al.,37 Zhang et al.,38 Iizuka et al.39 and
Zhang et al.40
the outlines. Then in the second stage, incorrect colour regions are detected and
refined with an additional set of user hints. Each stage is learned independently in
the training phase, and in the test phase they are concatenated to produce the final
colourisation.
6. Conclusion
References
CHAPTER 1.9
We discuss two important areas in deep learning based automatic speech recog-
nition (ASR) where significant research attention has been given recently: end-
to-end (E2E) modeling and robust ASR. E2E modeling aims at simplifying the
modeling pipeline and reducing the dependency on domain knowledge by intro-
ducing sequence-to-sequence translation models. These models usually optimize
the ASR objectives end-to-end with few assumptions, and can potentially improve
the ASR performance when abundant training data is available. Robustness is
critical to, but is still less than desired in, practical ASR systems. Many new
attempts, such as teacher-student learning, adversarial training, improved speech
separation and enhancement, have been made to improve the systems’ robust-
ness. We summarize the recent progress in these two areas with a focus on the
successful technologies proposed and the insights behind them. We also discuss
possible research directions.
1. Introduction
Recent advances in automatic speech recognition (ASR) have mostly been due to
the use of deep learning algorithms to build hybrid ASR systems with deep
acoustic models like feed-forward deep neural networks (DNNs), convolutional neu-
ral networks (CNNs), and recurrent neural networks (RNNs). The hybrid systems
usually contain an acoustic model which calculates the likelihood of an acoustic
signal given phonemes; a language model which calculates the probability of a word
sequence; and a lexicon model which decomposes words into phonemes. Such a
system also requires a very complicated decoder to generate word hypotheses at runtime.
Among these components, the most important one is the acoustic model, which
generates pseudo likelihoods with neural networks. Given their effectiveness and
robustness, hybrid systems still dominate ASR services in industry.
However, hybrid systems have the limitation that many components in the system
either require expert knowledge to build or can only be trained separately. To
overcome this limitation, in the last few years, researchers in ASR have been
developing fully end-to-end (E2E) systems [1–10]. E2E ASR systems directly translate
an input sequence of acoustic features to an output sequence of tokens (characters,
160 J. Li and D. Yu
words etc). This reconciles well with the notion that ASR is inherently a sequence-
to-sequence conversion task that maps input waveforms to output token sequences.
We will describe the three most popular E2E systems in detail and discuss practical
issues to be solved in E2E systems in Section 2.
Although significant progress has been made in ASR, it is still challenging for
even well-trained ASR systems to perform well in highly mismatched environments.
Model adaptation with target domain data is one solution, but it usually requires
labeled data from the target domain to be effective. Teacher-student learning [11,
12] is another model adaptation technique. It has been gaining popularity in the
industry [13–15] since it can exploit large amounts of unlabeled data. Adversarial
learning [16] tackles the problem from a different angle: it aims at generating
models that are less sensitive to factors irrelevant to the task. Newly developed
speech separation and enhancement techniques, on the other hand, significantly
improve the system's robustness when recognizing overlapping or noisy speech. We
Finally we discuss open problems and future directions in Section 4.
2. End-to-End Models
Fig. 1. Flowchart of three popular end-to-end techniques: a) CTC; b) RNN-T; c) AED [26]
$h_t^{enc} = f^{enc}(x_t),$   (4)

where t is the time index.
The final posterior of each output token is obtained after applying the softmax
operation on top of the logits vector transformed from $h_t^{enc}$.
Compared to the traditional cross-entropy training in the hybrid system, CTC
is harder to train without proper initialization. In [1], the long short-term mem-
ory (LSTM) network in the CTC system was initialized from the LSTM network
trained with the cross-entropy criterion. This initialization step can be circumvented
by using very large amounts of training data, which also helps to prevent
overfitting [32]. However, even with a large training set, the randomly initialized
CTC model tends to be difficult to train when presented with difficult samples.
In [34], a learning strategy called SortaGrad was proposed. With this strategy, the
system first presents the CTC network with short utterances (easy samples) and
then presents it with longer utterances (harder samples) in the early training stage.
In the later epochs, the training utterances are fed to the CTC network completely
randomly. This strategy significantly improves the convergence of CTC training.
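The SortaGrad schedule described above can be sketched as follows; the function name and batching details are illustrative assumptions:

```python
import numpy as np

def sortagrad_batches(utterances, batch_size, epoch, rng):
    """SortaGrad-style curriculum sketch: in the first epoch feed
    utterances shortest-first so CTC sees easy alignments early; in
    later epochs shuffle them completely."""
    if epoch == 0:
        order = sorted(range(len(utterances)),
                       key=lambda i: len(utterances[i]))
    else:
        order = list(rng.permutation(len(utterances)))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

Short utterances have fewer alignment paths to sum over, which is why presenting them first stabilises early CTC training.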
Inspired by the CTC model, Povey et al. proposed the lattice-free maximum
mutual information (LFMMI) [35] strategy to train deep networks directly from
random initialization. This single-step training procedure has a great advantage over
the popular two-step strategy, which trains the model first with the cross-entropy
criterion and then with the sequence discriminative criterion. To build a reliable
LFMMI training pipeline, Povey et al. developed many tricks, including a phoneme
HMM topology in which the first frame of a phoneme has a different label from
the remaining frames, a phoneme n-gram language model used to create the denominator
graph, a time constraint similar to the delay constraint in CTC [33], several
regularization methods to reduce overfitting, and frame stacking. LFMMI has
been proven effective on tasks with different scales and underlying models. The
detailed LFMMI training procedure can be found in [7].
The most criticized aspect of CTC is its conditional independence assumption.
Several attempts have been made to relax or remove this assumption. In [36,
37], attention modeling was directly integrated into the CTC framework by using
time convolution features, non-uniform attention, implicit language modeling, and
component attention. Such attention-based CTC model relaxes the conditional
independence assumption by working on the hidden layers. It does not change the
CTC objective function and training process, and hence enjoys the simplicity of
CTC modeling.
with

$P(y_u \mid x, y_{1:u-1}) = \mathrm{AttentionDecoder}(h^{enc}, y_{1:u-1}),$   (11)

Again, the training objective is to minimize $-\ln P(y \mid x)$.
The flowchart of the attention-based model is shown in Figure 1(c). Here, the
encoder transforms the whole speech input sequence x into a high-level hidden vector
sequence $h^{enc} = (h_1^{enc}, h_2^{enc}, \ldots, h_L^{enc})$, where $L \le T$. At each step in generating an
$h^{enc}$ so that the most related hidden vectors are used for the prediction. Comparing
Eq. (10) to Eq. (2), we can see that the attention-based model does not make the
conditional independence assumption made by CTC.
The decoder network in AED has three components: a multinomial distribution
generator (13), an RNN decoder (14), and an attention network (15)-(20), as follows:

$y_u = \mathrm{Generate}(y_{u-1}, s_u, c_u),$   (13)

$s_u = \mathrm{Recurrent}(s_{u-1}, y_{u-1}, c_u),$   (14)

$c_u = \mathrm{Annotate}(\alpha_u, h^{enc}) = \sum_{t=1}^{T} \alpha_{u,t}\, h_t^{enc},$   (15)
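The attention step, i.e. computing the weights α_u and the context vector c_u of Annotate, can be sketched with simple dot-product scoring (the actual attention network is learned; the dot-product stand-in is an assumption for brevity):

```python
import numpy as np

def attend(s_u, h_enc):
    """Dot-product attention sketch: score the decoder state s_u against
    each encoder vector, softmax the scores into weights alpha_u, and
    return the weighted sum c_u = sum_t alpha_{u,t} h_t^enc."""
    scores = h_enc @ s_u                   # (T,)
    alpha = np.exp(scores - scores.max())  # stable softmax
    alpha /= alpha.sum()
    c_u = alpha @ h_enc
    return alpha, c_u
```

Because the weights depend on the decoder state at every step, each output token can focus on a different region of the encoded utterance.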
• using word pieces as the modeling unit [44], which balances generalization
and language model quality;
• using scheduled sampling [45], which feeds the predicted label from the
previous time step (instead of the ground-truth label) during training to
make training and testing consistent;
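Scheduled sampling, as described in the bullet above, can be sketched in a few lines; the decay schedule for the teacher-forcing probability (linear, exponential, or inverse-sigmoid in the original paper [45]) is left to the caller:

```python
import random

def next_decoder_input(ground_truth_label, predicted_label, teacher_forcing_prob):
    """Scheduled sampling [45], sketched: with probability
    teacher_forcing_prob feed the ground-truth previous label, otherwise feed
    the model's own prediction, so training-time inputs gradually come to
    resemble test-time inputs."""
    if random.random() < teacher_forcing_prob:
        return ground_truth_label
    return predicted_label
```

Early in training the probability is kept near 1 (pure teacher forcing) and is then decayed toward 0 so the decoder learns to recover from its own mistakes.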
March 12, 2020 11:35 ws-rv961x669 HBPRCV-6th Edn.–11573 ”Speech Regn” page 165
In [49], the AED model is optimized together with a CTC model in a multi-
task learning framework by sharing the encoder. Such a training strategy greatly
improves the convergence of the attention-based model and mitigates the alignment
issue. In [50], the system’s performance is further improved by combining the scores
from both the AED model and the CTC model during decoding.
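The multi-task training of [49] and the decoding-time score combination of [50] both reduce to simple interpolations; the weight value 0.3 below is illustrative, not a value taken from the chapter:

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.3):
    """Multi-task objective for hybrid CTC/attention training [49]: the
    shared encoder is trained with an interpolation of the two losses.
    lam = 0.3 is a commonly used setting, not one fixed by the chapter."""
    return lam * ctc_loss + (1.0 - lam) * attention_loss

def joint_decoding_score(log_p_ctc, log_p_att, lam=0.3):
    """Score combination during decoding [50]: hypotheses are ranked by a
    weighted sum of CTC and attention log-probabilities."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att
```

The CTC branch enforces monotonic alignment, which is what stabilizes convergence of the attention branch.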
Streaming is critical for speech recognition services in industry. However, in
the vanilla AED model, attention is applied to the whole input utterance in order
to achieve good performance. This introduces significant latency and prevents the
model from being used in streaming mode. Attempts have been made to support
streaming in AED models. The basic idea of these methods is to apply AED to
chunks of input audio; the attempts differ in how the chunks are determined and
used for attention. In [51], monotonic chunkwise
the chunks are determined and used for attention. In [51], monotonic chunkwise
attention (MoChA) was proposed to stream the attention by splitting the encoder
outputs into small fixed-size chunks so that the soft attention is only applied to
those small chunks instead of the whole utterance. It was later improved with
adaptive-size chunks in [52]. In [53], CTC segments are used to decide the chunks,
which are used to trigger the attention. In [54], a continuous integrate-and-fire
strategy, which simulates the behavior of a spiking neuron model, is used to decide
the boundaries of attention.
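The common idea of these streaming methods, restricting soft attention to a small chunk of encoder outputs, can be sketched as follows; the dot-product scorer and fixed chunk size only loosely follow MoChA's fixed-chunk variant [51], and the monotonic boundary selection itself is not shown:

```python
import numpy as np

def chunkwise_attention(h_enc, s_u, boundary_t, chunk_size=4):
    """Chunkwise soft attention, sketched: attention is computed only over a
    small fixed-size window of encoder outputs ending at boundary_t, instead
    of the whole utterance, so the decoder never waits for future frames.
    The dot-product scorer is a simplification of a learned attention net."""
    start = max(0, boundary_t + 1 - chunk_size)
    window = h_enc[start:boundary_t + 1]       # the current chunk only
    scores = window @ s_u                      # toy scoring function
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return (alpha[:, None] * window).sum(0)

rng = np.random.default_rng(1)
h_enc = rng.standard_normal((10, 4))           # 10 encoder frames, dim 4
s_u = rng.standard_normal(4)                   # current decoder state
ctx = chunkwise_attention(h_enc, s_u, boundary_t=7)
```

Because frames outside the chunk receive zero weight by construction, latency is bounded by the chunk size rather than the utterance length.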
166 J. Li and D. Yu
• Shallow fusion [62]: The external LM and E2E model are trained sepa-
rately. The external LM is interpolated log-linearly with the E2E model at
inference time only.
• Deep fusion [62]: The external LM and E2E model are trained separately.
Then the external LM is integrated into the E2E model by fusing the ex-
ternal neural LM’s hidden states and the E2E decoder score.
• Cold fusion [63]: The E2E model is trained from scratch by integrating
with a pre-trained external LM.
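Shallow fusion is the simplest of the three to sketch: a log-linear interpolation applied per hypothesis during beam search, with the LM weight tuned on a development set (0.3 here is only an illustrative value):

```python
def shallow_fusion_score(log_p_e2e, log_p_lm, lm_weight=0.3):
    """Shallow fusion [62]: at inference time the hypothesis score is a
    log-linear interpolation of the E2E model and an external LM. The two
    models are trained separately and combined only in the beam search."""
    return log_p_e2e + lm_weight * log_p_lm
```

Deep and cold fusion instead couple the LM to the E2E network itself, at the hidden-state level or during training, so they cannot be reduced to a single score formula.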
were shown to be more effective by using TTS data to train a separate translation
model that corrects the errors made by E2E models. Because the TTS data is used
only to train this spelling correction model, without changing the E2E models, this
approach better circumvents the quality limitations of TTS data.
Hybrid systems are usually equipped with an on-the-fly rescoring strategy that
dynamically adjusts the LM weights of a small number of n-grams relevant
to the particular recognition context, such as contacts, locations, and playlists.
Such context modeling significantly boosts ASR accuracy in those specific
scenarios. It is thus desirable for E2E models to also support context modeling. One
solution is to add a context bias encoder, in addition to the original audio encoder,
to the E2E model [68]. The recognition performance on rare words in the context
can be further improved by adding a phoneme encoder for the words in the context
[69]. However, as shown in [68], it becomes challenging for the bias attention module
to focus if the biasing list is too large. Hence, a more practical way is to perform
shallow fusion with a contextual biasing LM [70].
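A deliberately minimal sketch of contextual biasing via shallow fusion follows; real systems compile the biasing phrases into a WFST and score partial matches on the fly [70], whereas this stand-in adds a fixed bonus per fully matched phrase (and, as a further simplification, matches substrings rather than word boundaries):

```python
def biasing_bonus(hypothesis_words, bias_phrases, bonus=2.0):
    """Contextual shallow fusion, sketched: during beam search a score bonus
    is added whenever the hypothesis contains a phrase from the recognition
    context (contacts, playlists, ...). The bonus value is illustrative and
    would be tuned per scenario in practice."""
    text = " ".join(hypothesis_words)
    return bonus * sum(1 for p in bias_phrases if p in text)
```

During decoding this bonus is simply added to the hypothesis log-score, boosting beams that contain in-context names.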
In dialog scenarios, which contain multiple related utterances, the context from
previous utterances can help the recognition of the current utterance. In [71], the
state of the previous utterance is used as the initial state of the current utterance.
In [72], a text encoder is used to embed the decoding hypothesis from the previous
utterance as additional input to the decoder of the current utterance. In [73], a
more explicit model was proposed for the two-party conversation scenario, using
a speaker-specific cross-attention mechanism that attends to the output of both
speakers to better recognize long conversations.
3. Robustness
While E2E modeling has advanced general ASR technology, robustness continues
to be a critical problem for enabling natural interaction between humans
and machines. Current state-of-the-art systems can achieve remarkable recognition
accuracy when the test and training conditions match, especially when both are
under a quiet close-talk setup. However, performance degrades dramatically under
mismatched or complicated environments, such as high-noise conditions including
music or interfering talkers, or speech with strong accents [74, 75]. The solutions
to this problem include adaptation, speech enhancement, and robust modeling.
The most straightforward way to improve the recognition accuracy in a new domain
is to collect and label data in the new domain and fine-tune the model trained in the
source domain with the newly labeled data. Many adaptation techniques have been
proposed recently. A detailed review of these techniques can be found in [76]. While
the conventional adaptation techniques require large amounts of labeled data in
the target domain, the teacher-student (T/S) paradigm can better take advantage
of large amounts of unlabeled data and has been widely used in industrial scale
tasks [13–15].
The concept of T/S learning was originally introduced in [77] but became popu-
lar after it was used to learn a shallow network from a deep network by minimizing
the L2 distance of logits between the teacher and student networks [78]. In T/S
learning, the network of interest, the student, is trained to mimic the behavior of
a well-trained network, the teacher. There are two popular applications of T/S in
speech recognition: model compression, which aims to train a small network that
performs similarly to a large network [11, 12], and domain adaptation, which strives
to improve a model's performance on a target domain by learning the behavior
of a model trained on a source domain [13, 79].
The most popular T/S learning strategy for ASR was proposed in 2014 [11].
In this work, Li et al. proposed to minimize the Kullback-Leibler (KL) divergence
between the output posterior distributions of the teacher network and the student
network. Hinton et al. [12] later suggested an interpolated version which uses a
weighted sum of the soft posteriors and the one-hot hard label to train the student
model. Their method is known as knowledge distillation, but it is essentially the same
as T/S learning, which transfers the teacher's knowledge to the student. In addition
to learning from pure soft labels [11] and from interpolated labels [12], conditional
T/S learning [80] was recently proposed to selectively learn from either the soft
label or the hard label conditioned on whether the teacher can correctly predict the
hard label.
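The frame-level losses of [11] and [12] can be written in a few lines; lam below is the interpolation weight of knowledge distillation, and its value is a tuning choice, not one fixed by the chapter:

```python
import numpy as np

def kd_loss(student_log_probs, teacher_probs, hard_label, lam=0.5):
    """Frame-level teacher-student losses, sketched.

    Pure T/S [11] minimizes KL(teacher || student), which up to a constant
    equals the cross-entropy of the student's log-probs under the teacher's
    posteriors. Knowledge distillation [12] interpolates that soft target
    with the one-hot hard label using weight lam.
    """
    soft = -np.sum(teacher_probs * student_log_probs)   # CE with soft labels
    hard = -student_log_probs[hard_label]               # CE with one-hot label
    return (1.0 - lam) * hard + lam * soft

# Toy posteriors over 3 senones for a single frame.
student_log_probs = np.log(np.array([0.7, 0.2, 0.1]))
onehot_teacher = np.array([1.0, 0.0, 0.0])
```

With lam = 1 (pure soft labels) no hard label is needed at all, which is what lets T/S learning exploit unlabeled data.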
The aforementioned T/S learning exploits the frame-level similarity between the
teacher and student networks. Although it was successful in many applications, it
may not be the best solution for ASR because ASR is a sequence translation problem
while the frame-based posteriors from the teacher network may not fully capture the
sequential nature of speech data. In [81], Wong and Gales proposed sequence-level
T/S learning which optimizes the student network by minimizing the KL-divergence
between lattice arc sequence posteriors from teacher and student networks. The
teacher can be an ensemble network which combines the sequence posteriors from
all the experts so that the student network can approximate the performance of the
powerful ensemble network instead of individual experts. This was further improved
in [82] where different sets of state clusters are used to compute the sequence T/S
criterion between the teacher and student models. Similar work [83, 84] was conducted
on LFMMI models.
Given the success of T/S learning, a natural question to ask is why it is superior
to standard training with hard labels. We conjecture the following advantages:
• T/S learning with pure soft labels, whether at the frame [11] or sequence [82]
level, can leverage large amounts of unlabeled data. This is particularly
Note that although the majority of T/S learning work has been conducted on hybrid
models, it can easily be applied to E2E models as well [88–90].
While model adaptation adapts the source model so that the system performs better
in the target domain, it is more desirable if the model, trained once, performs
robustly under various conditions and domains. Adversarial training [16] aims to
achieve this goal by building robust models during the training stage without the
need for target-domain data. The original idea of adversarial learning was proposed
in the generative adversarial network by Goodfellow et al. [16] for data generation,
where a generator network captures the data distribution and an auxiliary discriminator
network estimates the probability that a sample comes from the real data.
Later, adversarial learning was applied to unsupervised domain adaptation by
generating a deep feature that is both discriminative for the main task in the source
domain and invariant to the shifts between source and target domains [91]. A
gradient reversal layer was proposed to facilitate this adversarial learning. A
similar idea was then applied to domain [92–94] and speaker adaptation [95] of the
acoustic model.
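The gradient reversal layer of [91] is simple enough to sketch with a manual backward pass (no autograd framework assumed):

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer [91], sketched: the forward pass is the
    identity, while the backward pass multiplies the incoming gradient by
    -lam. Placed between the acoustic encoder and a domain classifier, it
    makes the encoder maximize the domain loss that the classifier
    minimizes, pushing the learned features toward domain invariance."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                      # identity in the forward direction

    def backward(self, grad_output):
        return -self.lam * grad_output  # sign-flipped, scaled gradient
```

In a full system the senone classifier and the domain classifier share the encoder; only the domain branch passes through this layer.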
Due to the inherent inter-domain variability of the speech signal, a model trained
on multi-conditional data exhibits high variance in its hidden- and output-unit
distributions. Adversarial learning effectively improves noise robustness [96–99] and
reduces the
In this chapter, we summarized progress in two areas where significant effort has
been invested for ASR, namely E2E modeling and robust modeling. Although
the effort to replace hybrid systems with E2E models in industry applications has
had only limited success (for example, Google deployed RNN-T on devices, where
the LM is weak [26]) due to E2E systems' inability to model samples rarely observed
during training [133], E2E modeling is still a direction worth further investigation.
E2E modeling directly optimizes the objective of the task and provides additional
flexibility in choosing models. It can easily integrate various information through
multiple encoders [69, 134]. In addition, some tasks that are complicated for hybrid
models can be easily realized with E2E models. For example, code-switching or
multi-lingual ASR is difficult in a hybrid system, but is relatively easy in E2E
models, which can model characters and sub-words from all languages with an
extended output layer [135–138]. As another example, it is challenging to identify
who spoke what and when during a conversation, a task usually accomplished
with separate ASR and speaker diarization systems. In [139], a single E2E model
was proposed to perform both tasks by generating speaker-decorated transcriptions.
Among the three E2E models introduced in Section 2, CTC has a clear disadvantage
due to its output independence assumption, while RNN-T and AED have better
potential. RNN-T usually outperforms CTC, and AED performs best among
the three models due to its powerful structure. Because RNN-T is a streaming
model and AED can achieve better accuracy, a recent study combines the two
by using RNN-T for first-pass decoding and AED for second-pass rescoring.
This strategy improved recognition accuracy with reasonable
perceived latency [133]. Given its success in machine translation, the Transformer
is also a promising E2E model structure.
While we may continue to propose new E2E model structures, an equally important
task is solving the practical issues that prevent E2E models from being used in
industrial applications. Some of these issues have been discussed in Section 2.4.
Accuracy is not everything: to deploy a system, we need to trade off accuracy,
latency, and computational cost.
While the performance of ASR systems has surpassed the threshold for adoption in
matched training-test environments, the research focus has shifted to developing
robust ASR systems that perform well in challenging real-world scenarios, such as
mismatched test environments and overlapping speech. Model adaptation is the
most straightforward solution. T/S learning has been successful in industry-scale
tasks due to its effectiveness in using large amounts of unlabeled data. The challenge
of T/S learning for model adaptation is its reliance on parallel data, although
in many cases simulated parallel data is effective. How to apply T/S model adaptation
to scenarios where parallel data is not available and is difficult to simulate
is an interesting research topic. If there is no prior knowledge about the testing
environment, adversarial learning can be a good choice. It trains ASR models to
generate domain-invariant features. The efficacy of adversarial learning has been
reported on tasks with small datasets; its effectiveness remains to be examined when
huge amounts of training data are available, in which case the network may learn
domain-invariant features implicitly, without adversarial training.
Great progress has been made in speech separation with the introduction of
DPCL, PIT and their variants. However, there are still challenging problems to be
solved.
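Utterance-level PIT, one of the variants mentioned above, can be sketched as a minimum over output-to-reference assignments; MSE on raw signals is used here purely for illustration, whereas real systems typically compute the loss on spectral masks or features:

```python
import numpy as np
from itertools import permutations

def pit_loss(estimates, references):
    """Permutation invariant training (PIT), sketched: because the order of
    speakers at the separator's outputs is arbitrary, the training loss is
    the minimum, over all output-to-reference assignments, of the summed
    per-pair MSE. Utterance-level PIT fixes one permutation per utterance."""
    n = len(references)
    return min(
        sum(np.mean((estimates[i] - references[p[i]]) ** 2) for i in range(n))
        for p in permutations(range(n))
    )
```

The factorial number of permutations is harmless for the usual two or three speakers; DPCL sidesteps the problem differently, by clustering learned embeddings.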
5. Acknowledgement
The first author would like to thank Dr. Zhong Meng, Dr. Jeremy Wong, and Dr.
Amit Das at Microsoft for their valuable input, which improved the quality of this chapter.
References
[10] J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong. Advancing acoustic-to-word CTC
model. In Proc. ICASSP, (2018).
[11] J. Li, R. Zhao, J.-T. Huang, and Y. Gong. Learning small-size DNN with output-
distribution-based criteria. In Proc. Interspeech, pp. 1910–1914, (2014).
[12] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network,
arXiv preprint arXiv:1503.02531. (2015).
[13] J. Li, R. Zhao, Z. Chen, et al. Developing far-field speaker system via teacher-student
learning. In Proc. ICASSP, (2018).
[14] L. Mošner, M. Wu, A. Raju, S. H. K. Parthasarathi, K. Kumatani, S. Sundaram,
R. Maas, and B. Hoffmeister. Improving noise robustness of automatic speech recog-
nition via parallel data and teacher-student learning. In Proc. ICASSP, pp. 6475–
6479, (2019).
[15] S. H. K. Parthasarathi and N. Strom. Lessons from building acoustic models with a
million hours of speech. In Proc. ICASSP, pp. 6670–6674, (2019).
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pp. 2672–2680, (2014).
[17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks.
In Proceedings of the 23rd international conference on Machine learning, pp. 369–
376. ACM, (2006).
[18] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent
neural networks. In Proc. ICML, pp. 1764–1772, (2014).
[19] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk,
and Y. Bengio, Learning phrase representations using RNN encoder-decoder for
statistical machine translation, arXiv preprint arXiv:1406.1078. (2014).
[20] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning
to align and translate, arXiv preprint arXiv:1409.0473. (2014).
[21] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end
attention-based large vocabulary speech recognition. In Proc. ICASSP, pp. 4945–
4949. IEEE, (2016).
[22] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based
models for speech recognition. In NIPS, pp. 577–585, (2015).
[23] A. Graves, Sequence transduction with recurrent neural networks, CoRR.
abs/1211.3711, (2012).
[24] H. Soltau, H. Liao, and H. Sak, Neural speech recognizer: Acoustic-to-word LSTM
model for large vocabulary speech recognition, arXiv preprint arXiv:1610.09975.
(2016).
[25] K. Rao, H. Sak, and R. Prabhavalkar. Exploring architectures, data and units for
streaming end-to-end speech recognition with RNN-transducer. In Proc. ASRU,
(2017).
[26] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach,
A. Kannan, Y. Wu, R. Pang, et al. Streaming end-to-end speech recognition for
mobile devices. In Proc. ICASSP, pp. 6381–6385, (2019).
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
and I. Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems, pp. 6000–6010, (2017).
[28] L. Dong, S. Xu, and B. Xu. Speech-transformer: a no-recurrence sequence-to-
sequence model for speech recognition. In Proc. ICASSP, pp. 5884–5888, (2018).
[29] S. Zhou, L. Dong, S. Xu, and B. Xu. Syllable-based sequence-to-sequence speech
speech recognition. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 518–529, (2017).
[51] C.-C. Chiu and C. Raffel, Monotonic chunkwise attention, arXiv preprint
arXiv:1712.05382. (2017).
[52] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, An online attention-based model for
speech recognition, arXiv preprint arXiv:1811.05247. (2018).
[53] N. Moritz, T. Hori, and J. Le Roux. Triggered attention for end-to-end speech
recognition. In Proc. ICASSP, pp. 5666–5670, (2019).
[54] L. Dong and B. Xu, CIF: Continuous integrate-and-fire for end-to-end speech recog-
nition, arXiv preprint arXiv:1905.11235. (2019).
[55] L. Lu, X. Zhang, and S. Renals. On training the recurrent neural network encoder-
decoder for large vocabulary end-to-end speech recognition. In Proc. ICASSP, pp.
5060–5064. IEEE, (2016).
[56] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, Di-
rect acoustics-to-word models for English conversational speech recognition, arXiv
preprint arXiv:1703.07754. (2017).
[57] J. Li, G. Ye, R. Zhao, J. Droppo, and Y. Gong. Acoustic-to-word model without
OOV. In Proc. ASRU, (2017).
[58] K. Audhkhasi, B. Kingsbury, B. Ramabhadran, G. Saon, and M. Picheny. Build-
ing competitive direct acoustics-to-word models for English conversational speech
recognition. In Proc. ICASSP, (2018).
[59] Y. Gaur, J. Li, Z. Meng, and Y. Gong. Acoustic-to-phrase models for speech recog-
nition. In Proc. Interspeech, (2019).
[60] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words
with subword units. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin,
Germany, (2016).
[61] H. Xu, S. Ding, and S. Watanabe. Improving end-to-end speech recognition with
pronunciation-assisted sub-word modeling. In Proc. ICASSP, pp. 7110–7114, (2019).
[62] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares,
H. Schwenk, and Y. Bengio, On using monolingual corpora in neural machine trans-
lation, arXiv preprint arXiv:1503.03535. (2015).
[63] A. Sriram, H. Jun, S. Satheesh, and A. Coates. Cold fusion: Training seq2seq models
together with language models. In Proc. Interspeech, (2018).
[64] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu. A
comparison of techniques for language model integration in encoder-decoder speech
recognition. In Proc. SLT, pp. 369–375, (2018).
[65] A. Tjandra, S. Sakti, and S. Nakamura. Listening while speaking: Speech chain by
deep learning. In Proc. ASRU, pp. 301–308, (2017).
[66] J. Guo, T. N. Sainath, and R. J. Weiss. A spelling correction model for end-to-end
speech recognition. In Proc. ICASSP, pp. 5651–5655, (2019).
[67] S. Zhang, M. Lei, and Z. Yan. Investigation of transformer based spelling correction
model for CTC-based end-to-end Mandarin speech recognition. In Proc. Interspeech,
(2016).
[68] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao. Deep context:
end-to-end contextual speech recognition. In Proc. SLT, pp. 418–425, (2018).
[69] A. Bruguier, R. Prabhavalkar, G. Pundak, and T. N. Sainath. Phoebe:
Pronunciation-aware contextualization for end-to-end speech recognition. In Proc.
ICASSP, pp. 6171–6175, (2019).
[70] D. Zhao, T. N. Sainath, D. Rybach, D. Bhatia, B. Li, and R. Pang. Shallow-fusion
[92] S. Sun, B. Zhang, L. Xie, and Y. Zhang, An unsupervised deep domain adaptation
approach for robust speech recognition, Neurocomputing. (2017).
[93] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong. Unsupervised adaptation with
domain separation networks for robust speech recognition. In Proc. ASRU, (2017).
[94] P. Denisov, N. T. Vu, and M. F. Font. Unsupervised domain adaptation by ad-
versarial learning for robust speech recognition. In Speech Communication; 13th
ITG-Symposium, pp. 1–5. VDE, (2018).
[95] Z. Meng, J. Li, and Y. Gong. Adversarial speaker adaptation. In Proc. ICASSP, pp.
5721–5725, (2019).
[96] Y. Shinohara. Adversarial multi-task learning of deep neural networks for robust
speech recognition. In Proc. Interspeech, pp. 2369–2372, (2016).
[97] D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and
Y. Bengio, Invariant representations for noisy speech recognition, arXiv preprint
arXiv:1612.01928. (2016).
[98] Z. Meng, J. Li, Y. Gong, and B.-H. F. Juang. Adversarial teacher-student learning
for unsupervised domain adaptation. In Proc. ICASSP, (2018).
[99] Z. Meng, J. Li, and Y. Gong. Attentive adversarial learning for domain-invariant
training. In Proc. ICASSP, pp. 6740–6744. IEEE, (2019).
[100] G. Saon, G. Kurata, T. Sercu, et al. English conversational telephone speech recog-
nition by humans and machines. In Proc. Interspeech, (2017).
[101] Z. Meng, J. Li, Y. Gong, and B.-H. F. Juang. Speaker-invariant training via adver-
sarial learning. In Proc. ICASSP, (2018).
[102] L. Tóth and G. Gosztolya. Reducing the inter-speaker variance of CNN acoustic
models using unsupervised adversarial multi-task training. In International Confer-
ence on Speech and Computer, pp. 481–490. Springer, (2019).
[103] L. Wu, H. Chen, L. Wang, P. Zhang, and Y. Yan, Speaker-invariant feature-mapping
for distant speech recognition via adversarial teacher-student learning, Interspeech.
1, 1, (2019).
[104] J. Yi, J. Tao, Z. Wen, and Y. Bai. Adversarial multilingual training for low-resource
speech recognition. In Proc. ICASSP, (2018).
[105] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, Massively multilingual
adversarial speech recognition, NAACL-HLT. (2019).
[106] K. Hu, H. Sak, and H. Liao, Adversarial training for multilingual acoustic modeling,
arXiv preprint arXiv:1906.07093. (2019).
[107] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie. Domain adversarial
training for accented speech recognition. In Proc. ICASSP, (2018).
[108] Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square
error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech,
and Signal Processing. 33(2), 443–445, (1985).
[109] D. Wang and G. Brown, Computational Auditory Scene Analysis: Principles, Algo-
rithms, and Applications. (Wiley-IEEE Press, 2006).
[110] C. Févotte, E. Vincent, and A. Ozerov. Single-channel audio source separation with
NMF: Divergences, constraints and algorithms. In Audio Source Separation, pp.
1–24. Springer, (2018). doi: 10.1007/978-3-319-73031-8_1.
[111] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe. Deep clustering: Discrimi-
native embeddings for segmentation and separation. In the Proceedings of ICASSP,
pp. 31–35, (2016).
[112] Y. Isik, J. Le Roux, Z. Chen, et al. Single-channel multi-speaker separation using
deep clustering. In Interspeech, pp. 545–549, (2016).
[113] Z. Chen, Y. Luo, and N. Mesgarani. Deep attractor network for single-microphone
APPLICATIONS
184 Introduction
A BRIEF INTRODUCTION
I recall that as early as the 1960s, Prof. K.S. Fu emphasized the applications of
pattern recognition techniques. He was involved in no fewer than two dozen pattern
recognition applications [1, 2]. Early efforts focused on speech recognition,
character recognition, medical diagnosis and remote sensing. In our daily life we
have now benefited enormously from the progress in speech and
character/document processing and recognition. Throughout this handbook series
we have published important work on both speech and character recognition. It is
difficult to provide an extensive list of books in an area with over 60 years of
development; references 3 and 4 are among the many important books in the area.
The progress in computer vision and pattern recognition has had an enormous
impact on rapid advances in person identification, especially face recognition
(see e.g. [5]), fingerprint recognition and biometric authentication [6], which now
achieve great accuracy thanks to advances in neural network techniques. Though
medical diagnosis using pattern recognition has a long history, reliable automated
systems have been few. Despite enormous progress in medical imaging hardware,
much success with computer vision and pattern recognition has yet to be seen
(see e.g. [7]). The success in remote sensing, for both hyperspectral/multispectral
and synthetic aperture radar data, is however much more evident, as reflected in
some important books (see e.g. [8, 9, 10, 11]). There are many hundreds of other
applications of pattern recognition and computer vision, with varying degrees of
success.
In a recent report by Dr. Bob Fisher of the University of Edinburgh
(rbf@inf.ed.ac.uk), 300 areas of computer vision applications are listed. The
number may be on the low side considering the many ongoing efforts to integrate
computer vision into larger automation systems. On the issue of software
development, there has been limited success with comprehensive pattern
recognition or computer vision software packages. Specialized speech analysis
and recognition software is better developed. There are also many popular
neural network software systems available, such as those offered by MATLAB. It
is also useful to mention recent deep learning software by R. Cresson [12].
Examples of some difficult applications from my experience are tele-seismic
signal recognition (see e.g. [13, 14]), signal and image recognition of
underwater objects (see e.g. [15, 16]), and automated sorting of fish (see e.g.
[17]). For such applications, a correct recognition rate of 90% is considered very
good. The use of multiple sensors and knowledge-based methods can provide
References
1. K.S. Fu, editor, “Applications of Pattern Recognition”, CRC Press 1982.
2. K.S. Fu, editor, “Syntactic Pattern Recognitions, Applications”, Springer-Verlag 1982.
3. F. Jelinek, "Statistical Methods for Speech Recognition (Language, Speech, and
Communication)", MIT Press 1998, now in 4th printing, available through Amazon.com.
4. M. Cheriet, N. Kharma, C.-L. Liu, C.Y. Suen, "Character Recognition Systems: A Guide for
Students and Practitioners", Wiley 2007.
5. S. Li and A.K. Jain, editors, “Handbook of Face Recognition”, Springer 2011.
6. S.Y. Kung, M.W. Mak and S.H. Lin, "Biometric Authentication", Prentice-Hall, 2006.
7. C.H. Chen, editor, “Computer Vision in Medical Imaging”, World Scientific Publishing 2014.
8. L. Alparone, B. Aiazzi, S. Baronti and A. Garzelli, "Remote Sensing Image Fusion", CRC Press,
2015.
9. C.H. Chen, editor, “Signal and Image Processing for Remote Sensing”, 2nd edition, CRC Press
2012.
10. Q. Zhang and R. Skjetne, “Sea Ice Image Processing with MATLAB”, CRC Press 2018.
11. C.H. Chen, editor, “Signal and Image Processing for Remote Sensing”, CRC Press 2006 (first
edition), 2012 (second edition).
12. R. Cresson, “Deep Learning on Remote Sensing Images with Open Source Software”, CRC
Press, June 2020.
13. C.H. Chen, “Seismic signal recognition”, Geoexploration, vol. 6, no. 1, pp. 133–146, 1978.
14. H.H. Liu and K.S. Fu, “A syntactic approach to seismic pattern recognition”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 4, pp. 136–140, 1982.
15. C.H. Chen, “Recognition of underwater transient patterns”, Pattern Recognition, vol. 18, no. 9,
pp. 485-490, 1985.
16. C.H. Chen, “Neural networks for active sonar classification”. IEEE OCEANS 1990.
17. K. Stokesbury, Lecture on status of the marine fishery research program at UMass Dartmouth,
May 15, 2019.
18. A. Wilson, “Deep learning brings a new dimension to machine vision”, Laser Focus World,
May 2019, pp. 43–47.
March 13, 2020 9:24 ws-rv961x669 HBPRCV-6th Edn.–11573 chapter-rhaensch page 187
CHAPTER 2.1
Ronny Hänsch∗
Department SAR Technology,
German Aerospace Center (DLR), 82234 Weßling, Germany
Remote Sensing plays an essential role in Earth Observation and thus in
understanding the complex relationship between bio/geo-physical processes and
human welfare. The growing quality and quantity of remotely sensed data make
manual interpretation infeasible and require accurate and efficient methods to
automatically analyse the acquired data. This chapter discusses two Machine
Learning approaches that address these tasks, using the example application of
semantic segmentation of polarimetric Synthetic Aperture Radar images. It shows
the importance of proper classifier design and training as well as the benefits of
automatically learned features.
1. Introduction
∗ R. Hänsch is with the Department SAR Technology, German Aerospace Center (DLR),
82234 Weßling, Germany, and was, during the writing of this chapter, with the Computer Vision
& Remote Sensing group, Technische Universität Berlin, 10587 Berlin, Germany. e-mail:
ronny.haensch@dlr.de; webpage: www.rhaensch.de.
context, however, it usually means acquiring data about the Earth (or other plane-
tary objects) by either air- or space-borne sensors often including (semi-)automatic
processing and analysis.
The corresponding sensors can be divided into two groups depending on whether
the sensor emits radiation on its own (i.e. being an active sensor) or relies on
external radiation sources such as the sun (i.e. being a passive sensor). Examples
of active sensors are Synthetic Aperture Radar (SAR, emitting microwaves) and
Light Detection and Ranging (LiDAR, using laser light), while examples of passive
sensors are CCD cameras, infrared sensors, and imaging spectrometers.
These sensors provide information about terrestrial, marine, and atmospheric
variables, such as changes in land cover, sea surfaces, and temperatures. Modern
Remote Sensing data products (combined with data from other sources) are used
with decades of scientific development and operational experience to address tasks
such as natural hazard monitoring, global climate change, and urban planning.
To this end, a large number of air- and space-borne sensors deployed by research
institutes and industry of many countries provide multi-source (LiDAR, SAR, op-
tical, etc.), multi-temporal, and multi-resolution Remote Sensing data of increasing
quantity and quality (e.g. with higher spatial and spectral resolution).
However, acquiring, processing, and interpreting these data comes with several
challenges that are very different from the challenges of close-range Computer Vision
and which often hinder a successful direct transfer and application of correspond-
ing tools and methods. Apart from special applications such as medical Computer
Vision and autonomous driving, close-range Computer Vision is dominated by op-
tical digital cameras. This technology is sufficiently mature and industrialised to be
affordable for most individual persons and easily manageable even by laymen. In
contrast, Remote Sensing consists of a large variety of sensors with very different
properties. While commercial solutions for some sensors exist, sensor development
and launch are often still active fields of research. Data acquisition can only seldom be performed by individuals, as significant monetary, infrastructural, and knowledge resources are needed to deploy and manage, for example, a satellite or to perform air-borne measurement campaigns. Also, the data processing often requires expert knowledge, as it involves, for example, atmospheric correction, sensor calibration, and geo-referencing. This is why Remote Sensing data is often owned by
specific research institutes, space agencies, or companies and not directly available
to the public. Nowadays, more and more data is freely available (e.g. via ESA’s
Copernicus programme [1]) or can be obtained via scientific proposals. However,
national and international law as well as paywalls still hinder free access to many
Remote Sensing data products. On the other hand, the interpretation of Remote
Sensing data often requires domain specific expertise as well. While most people
nowadays are able to understand overhead optical imagery, the visual interpretation
of a SAR image poses difficulties even for trained experts. This is only worsened if the aim is not a semantic interpretation but a bio/geo-physical understanding, e.g.
estimating soil moisture, forest height, ice thickness, vegetation health, or biomass.
The difficulties regarding data access and data interpretability pose a significant
obstacle when applying Machine Learning for the automatic analysis of Remote
Sensing data. Most Machine Learning methods that aim to estimate a mapping
from the input data to the target variable require training data, i.e. samples for
which input and desired output are known. Close-range Computer Vision often
deals with every-day objects that can be labelled by laymen via e.g. crowd-sourcing.
The same is only seldom possible for Remote Sensing data as its interpretation
requires expert knowledge and some target variables can only be determined by
in-situ measurements.
Nevertheless, there has been great success in the development and application of Machine Learning methods in Remote Sensing. A complete overview of the corresponding methods and applications would fill a whole book series and is therefore
not possible within a single book chapter. Instead, this chapter focuses on data of
a single sensor type, i.e. SAR, and on a single Machine Learning aspect, i.e. the
learning of optimal features for semantic segmentation, i.e. the estimation of a class
label for each pixel within the images. Most of the methods introduced in the next
sections, however, can be easily transferred to other sensors (e.g. hyperspectral
cameras) or other tasks (e.g. regression).
Synthetic Aperture Radar (SAR) is an active air- or space-borne sensor that emits
microwaves and records the backscattered echo. It is independent of daylight, only
insignificantly weather-dependent and able to penetrate clouds, dust and, to a cer-
tain extent, vegetation depending on the wavelength used. Polarimetric SAR (Pol-
SAR) transmits and receives in different polarizations and thus records multichannel
images. The change in orientation and degree of polarization depends on various
surface properties, including moisture, roughness, and object geometry. Conse-
quently, the recorded data contains valuable information about physical processes
as well as semantic object classes on the illuminated ground.
Since modern sensors record increasing volumes of these data, which makes manual interpretation infeasible, there is a large need for methods that automatically
analyse PolSAR images. One typical task is the creation of semantic maps, i.e. the
assignment of a semantic label to each pixel in the images. This task is accom-
plished by supervised Machine Learning methods that change the internal param-
eters of a generic model so that the system provides (on average) the correct label
when a training sample is given, i.e. a sample for which the true class is known.
One way to approach this problem is to model the relationship between data and
semantic label with probabilistic distributions (or mixtures thereof) (e.g. [2; 3;
4]). On the other hand, there are discriminative approaches, which are usually easier to train and more robust than such generative models. These methods extract
task-specific image features and apply classifiers such as Support Vector Machines
(SVMs, e.g. in [5]), Multi-Layer Perceptrons (MLPs, e.g. in [6]) or Random Forests
(RFs, e.g. in [7]). The feature extraction step often involves manual designing
and selecting operators that are specific for the given classification task and thus
requires expert knowledge.
Modern approaches avoid the extraction of predefined features by adapting the
involved classifier to work directly on the complex-valued PolSAR data, e.g. by
using complex-valued MLPs [8] or SVMs with kernels defined over the complex-
domain [9]. Other methods use quasi-exhaustive feature sets that at least poten-
tially contain all the information necessary to solve any given classification problem.
The high dimensionality of the corresponding feature space is problematic for many
modern classifiers which is why it is often reduced by techniques such as principal
component analysis [10], independent component analysis [11], or linear discrimi-
nant analysis [12]. Another solution to this problem is to apply classifiers that are
not prone to this curse of dimensionality. One example are Random Forests (RFs)
due to their inbuilt feature selection as for example in [13] which computes hundreds
of features from a given PolSAR image as input for a RF.
These kinds of methods are less biased towards specific classification tasks. However, the large number of features consumes a huge amount of memory and computation time. The following sections present a RF variant that can be applied directly to PolSAR data without the need to pre-compute any features, which drastically decreases the required amount of memory and processing time.
allowing a certain randomness during tree creation. Many of these trees will agree
on the correct label for most samples, while the remaining trees give wrong but
inconsistent answers. Consequently, a simple majority vote of all trees leads to the
correct answer. An in-depth discussion of RFs is beyond the scope of this chapter
but can be found e.g. in [17]. The following sections provide a brief introduction
to RFs but focus on how to enable them to learn from PolSAR images directly.
Operator, region size and position are randomly selected. The results of applying
the operator at each region are compared to each other by Equations 2-4, where C̃
is a reference covariance matrix randomly selected from the whole image:
1-point projection:  d(C_{R1}, C̃) < θ    (2)
2-point projection:  d(C_{R1}, C_{R2}) < θ    (3)
4-point projection:  d(C_{R1}, C_{R2}) − d(C_{R3}, C_{R4}) < θ    (4)
These projections (illustrated in Figure 1) are able to analyse the local spectral
and textural content. They make use of a proper distance measure d(A, B) defined
over the corresponding data space, i.e. Hermitian matrices in the case of PolSAR
images, as for example the log-Euclidean distance d(A, B) = ||log(A) − log(B)||F
(where || · ||F is the Frobenius norm).
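The distance and the three projection tests can be sketched in a few lines of NumPy. This is an illustrative sketch, not the chapter's implementation; the function names and the eigenvalue clipping are ours:

```python
import numpy as np

def matrix_log(C):
    """Matrix logarithm of a Hermitian positive-definite matrix via its
    eigendecomposition; eigenvalues are clipped for numerical stability."""
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 1e-12, None)
    return (V * np.log(w)) @ V.conj().T

def log_euclidean(A, B):
    """Log-Euclidean distance d(A, B) = ||log(A) - log(B)||_F."""
    return np.linalg.norm(matrix_log(A) - matrix_log(B))

def one_point_test(C_R1, C_ref, theta):
    """1-point projection (Equation 2): d(C_R1, C_ref) < theta."""
    return log_euclidean(C_R1, C_ref) < theta

def two_point_test(C_R1, C_R2, theta):
    """2-point projection (Equation 3): d(C_R1, C_R2) < theta."""
    return log_euclidean(C_R1, C_R2) < theta

def four_point_test(C_R1, C_R2, C_R3, C_R4, theta):
    """4-point projection (Equation 4)."""
    return log_euclidean(C_R1, C_R2) - log_euclidean(C_R3, C_R4) < theta
```

For 3×3 polarimetric sample covariance matrices, d(I, 2I) = √3 · log 2, which provides a quick sanity check of the implementation.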
Every internal node creates multiple test candidates and selects the best test
according to a quality criterion which is usually based on the drop of impurity ΔI:
ΔI = I(P(y|D_n)) − P_L · I(P(y|D_{nL})) − P_R · I(P(y|D_{nR}))    (5)

I(P(y)) = 1 − Σ_{i=1}^{C} P(y_i)²    (6)
where n_L and n_R are the left and right child nodes of node n, with the respective data subsets D_{nL}, D_{nR} (with D_{nL} ∪ D_{nR} = D_n and D_{nL} ∩ D_{nR} = ∅) and the corresponding prior probabilities P_{L/R} = |D_{nL/R}|/|D_n|. The Gini impurity (Equation 6)
of the corresponding local class posteriors P (y|Dn ) of node n is a typical choice to
measure the node impurity and is estimated based on the local subset Dn of the
training set.
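Equations 5 and 6 can be sketched as follows; a minimal NumPy version under the assumption of integer class labels (all names are ours):

```python
import numpy as np

def gini(labels, n_classes):
    """Gini impurity I(P(y)) = 1 - sum_i P(y_i)^2 (Equation 6),
    estimated from the labels of a node's local data subset."""
    p = np.bincount(labels, minlength=n_classes) / max(len(labels), 1)
    return 1.0 - np.sum(p ** 2)

def impurity_drop(labels_n, labels_L, labels_R, n_classes):
    """Drop of impurity (Equation 5) for a candidate split of node n into
    left/right children, weighted by the priors |D_child| / |D_n|."""
    P_L = len(labels_L) / len(labels_n)
    P_R = len(labels_R) / len(labels_n)
    return (gini(labels_n, n_classes)
            - P_L * gini(labels_L, n_classes)
            - P_R * gini(labels_R, n_classes))

# A perfect split of a balanced two-class node removes all impurity:
parent = np.array([0, 0, 1, 1])
print(impurity_drop(parent, parent[:2], parent[2:], 2))  # 0.5
```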
After the RF is created and trained, it can be used for prediction during which a
query sample is propagated through all trees. It will reach exactly one leaf n_t(x) in every tree t. The estimate stored in these leaves, i.e. the class posterior P(y|n_t(x)),
is averaged to obtain the final class posterior P (y|x):
P(y|x) = (1/T) Σ_{t=1}^{T} P(y|n_t(x))    (7)
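The averaging of Equation 7 amounts to a single line; a small sketch with invented numbers:

```python
import numpy as np

def forest_posterior(leaf_posteriors):
    """Equation 7: average the class posteriors P(y|n_t(x)) stored in the
    leaves reached by a query sample x, one leaf per tree."""
    return np.mean(leaf_posteriors, axis=0)

# Three trees, three classes: two leaves favour class 0, one favours class 2.
leaves = np.array([[0.8, 0.1, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.1, 0.2, 0.7]])
p = forest_posterior(leaves)
print(p.round(2), p.argmax())  # [0.5 0.2 0.3] 0
```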
Fig. 1.: Different spatial projections within a node test function [19].
Germany) is shown in Figure 2c (the used reference data is shown in Figure 2b)
[19]. Table 1 shows the corresponding confusion matrix.
If compared to the results obtained by extracting a large set of real-valued
features as input to the RF [13], a very similar accuracy is achieved, i.e. a slight
drop from 89.4% to 87.5%.
Fig. 2.: (a) Image data acquired by the E-SAR sensor (DLR). (b) Reference data. (c) Result (log-Euclidean).
This makes it possible to load only as many samples as memory allows while the learning process still exploits all available data. Only the current batch of samples and the local split statistics have to be kept in memory. The threshold τ should be selected sufficiently large to allow an accurate estimation of the split statistics. However, RFs do not rely on optimal node splits and not only tolerate but even require a considerable amount of uncertainty, which allows τ to be kept relatively small. Furthermore, too large a τ would lead to well-optimised splits but slow down tree growth.
This batch processing (with a RF with T = 50 trees of maximum height H = 50, 10 × 1 tiles, and a batch size of B = 10,000) was applied to the fully-polarimetric example image of TerraSAR-X (X-band) provided by DLR shown in Figure 3a (with the corresponding reference data shown in Figure 3b). It was acquired over Plattling, Germany, a large rural area with multiple small settlements, roads, agricultural fields, forests, and water, and contains 10,310 × 11,698 pixels, which corresponds to roughly 2.1 GB of memory.
Fig. 3.: (a) PolSAR image, TerraSAR-X, DLR. (b) Reference data: Urban (red), Road (magenta), Forest (green), Field (yellow), Water (blue). (c) Classification result (same color code as for the reference data).
Figure 3c shows the estimated label map which has a balanced accuracy (i.e.
average class detection rate) of 75.1% (please see [25] for more details).
Fig. 4.: The stacking framework discussed here uses at level 0 a RF that is trained on the image data and the reference data only. Subsequent RFs use the estimated class posterior as an additional feature. This enables refining the class decision and leads to more accurate semantic maps [29].
model) within the second stage. The Tier-2 model uses a more sophisticated fusion
rule by learning when to ignore which of the Tier-1 models and how to combine
their answers. This means that even errors of the Tier-1 models can be exploited
since consistent errors might provide descriptive information about the true class.
The idea presented in the following slightly differs from the original formulation
of stacking in two major points [29]: 1) Only a single RF is trained as Tier-1 model,
i.e. the RF variant discussed above which can directly be applied to PolSAR data.
The estimated pixel-wise class posterior of this RF already contains a high level of
semantic information and is then used by a second RF as Tier-2 model together
with the original image data. To this aim, the original RF framework is extended
by including internal node tests that can be applied to class posteriors. 2) This
procedure is repeated multiple times: a RF at the (i+1)-th level uses the original image data and the posterior estimate of the RF at level i as input. This makes it possible to
improve the class posterior estimate by learning which decisions are consistent with
the reference labels and how to correct errors.
Figure 4 illustrates the basic principle of stacking as applied here. The first level
(i.e. level 0) consists of a RF that is trained as described in Section 3.1.1. It only
uses the image data, i.e. the polarimetric sample covariance matrices, as well as the
reference data. This RF is then used to predict the class posterior of each sample
of the training data which completes the first level.
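The level-wise training and prediction can be sketched as below. This is an illustrative stand-in, not the chapter's implementation: a trivial nearest-centroid classifier with a soft posterior replaces the PolSAR-specific RF, and all names are ours.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in for the PolSAR RF: any classifier that yields a
    soft class posterior would do here."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Soft posterior from distances to the class centroids.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        w = np.exp(-d)
        return w / w.sum(axis=1, keepdims=True)

def train_stack(X, y, levels=3):
    """Stacking: the level-0 model sees only the image features; every
    level l > 0 additionally sees the posterior estimated at level l-1."""
    models, features = [], X
    for _ in range(levels):
        m = NearestCentroid().fit(features, y)
        posterior = m.predict_proba(features)
        models.append(m)
        features = np.hstack([X, posterior])  # image data + previous posterior
    return models

def predict_stack(models, X):
    """Propagate a query through all levels; return the final labels."""
    features = X
    for m in models:
        posterior = m.predict_proba(features)
        features = np.hstack([X, posterior])
    return posterior.argmax(axis=1)
```

In the real framework, the stand-in would be the RF of Section 3.1.1 at level 0 and the posterior-aware RF variant at the higher levels.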
A RF at level l (with 0 < l ≤ L) uses image and reference data, but also the
class posterior estimated at level l − 1. This makes it possible to refine the class estimates and
to correct errors made by the previous RF. One possible example are pixels showing
double bounce backscattering which frequently happens within urban areas due to
the geometric structure of buildings. It also happens frequently within a forest
where it is caused by tree stems. A RF in the early stages will interpret double
bounce as an indication for urban areas and thus mislabel a forest pixel showing
this type of backscattering. A RF at a higher stage will learn that isolated double
bounce pixels labelled as urban area but surrounded by forest actually belong to
the forest class.
RFs that are supposed to analyse local class posteriors require node tests that
are designed for patches of probability distributions. In such a sample patch x, each pixel contains a probability distribution P(c) ∈ [0, 1]^{|C|} defining to which degree this pixel belongs to a class c ∈ C. As described above in Section 3.1.1, every node
test randomly samples several regions within a patch and selects one of the pixels
based on an operator. Possible operators for probability regions are for example
selecting the centre value or the region element with minimal/maximal entropy.
These probability distributions are then compared by a proper distance measure d
as for example the histogram intersection d_HI(P, Q) = Σ_{c∈C} min(P(c), Q(c)) [29].
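A minimal sketch of these node-test ingredients for posterior patches, i.e. an entropy-based selection operator and the histogram intersection; all names are ours:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (0 * log 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def hist_intersection(P, Q):
    """Histogram intersection sum_c min(P(c), Q(c)); equals 1 iff the
    two distributions agree."""
    return np.sum(np.minimum(P, Q))

def min_entropy_pixel(region):
    """Operator: from a region of per-pixel class posteriors (one row per
    pixel), select the most certain pixel, i.e. the one with minimal entropy."""
    return region[np.argmin([entropy(p) for p in region])]

region = np.array([[0.5, 0.5],    # maximally uncertain pixel
                   [0.9, 0.1]])   # fairly certain pixel
print(min_entropy_pixel(region))                # [0.9 0.1]
print(hist_intersection(region[0], region[1]))  # 0.6
```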
While the node tests in Section 3.1.1 analyse local spectral and textural proper-
ties within the image space, the node tests discussed here analyse the local structure
of the label space. This integrates the final classification decision of the previous RF
with its certainty as well as their spatial distributions. This enables the framework
to analyse spectral, spatial, as well as semantic information.
The following experiments are carried out on the Oberpfaffenhofen data set of
the previous section.
The RF at level 0 has access only to the image (and reference) data and achieved a balanced accuracy of 86.8% (the corresponding semantic map is shown in Figure 5c). Despite the already quite high accuracy, there are several remaining problems, for example fluctuating labels (e.g. within the central forest area) and
areas that are consistently misclassified. Figure 6 shows one of the problematic
areas in greater detail. The RF associates image edges with either city or road and
thus assigns wrong class labels to boundaries between fields, forest, and shrubland
(see first image in the first row of Figure 6).
The entropy and margin of the estimated class posterior (shown in the second
and third row of Figure 6) measure the degree of certainty of the classifier in its
decision. They range from completely uncertain (margin equals zero, entropy equals
one, both shown in blue) to completely certain (margin equals one, entropy equals
zero, both shown in dark red). While most parts of the forest and field class show
a high degree of certainty, in particular the wrongly labelled regions show a high
degree of uncertainty. The remaining rows of Figure 6 show the class posterior. The
columns of Figure 6 illustrate the learning through the individual stacking levels and
show that the largest changes occur within the first few levels. Every RF corrects
some of the remaining mistakes and gains certainty in decisions that are already
correct by using the semantic information provided by its predecessors as well as
the original image data. RFs at higher levels learn that edges within the image correspond to city or road only if other criteria (e.g. certain context properties) are
fulfilled. The large field area at the top of the image which is largely confused as
Fig. 5.: Input data and results of the Oberpfaffenhofen data set [29]. (a) Image data (E-SAR, DLR, L-band). (b) Reference data: City (red), Road (blue), Forest (dark green), Shrubland (light green), Field (yellow); unlabelled pixels in white.
field at the 0-level is now correctly labelled as field. Not all errors are corrected, e.g.
the large field area at the bottom of the image stays falsely classified as shrubland.
Nevertheless, the overall accuracy of the estimate of the last RF and thus the final
output improved significantly as shown in Figure 5d.
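Margin and entropy can be computed from a pixel's class posterior as follows. This is a small sketch; the normalisation by log |C| is an assumption of ours so that full uncertainty gives entropy one, matching the convention used in the text:

```python
import numpy as np

def certainty(posterior):
    """Return (margin, normalised entropy) of a class posterior:
    margin 1 / entropy 0 is a fully certain decision,
    margin 0 / entropy 1 a fully uncertain one."""
    p = np.sort(posterior)[::-1]
    margin = p[0] - p[1]           # top-1 minus top-2 probability
    q = posterior[posterior > 0]   # 0 * log 0 := 0
    entropy = -np.sum(q * np.log(q)) / np.log(len(posterior))
    return margin, entropy

m, e = certainty(np.array([1.0, 0.0, 0.0]))  # fully certain: margin 1, entropy 0
m, e = certainty(np.array([1/3, 1/3, 1/3]))  # fully uncertain: margin 0, entropy 1
```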
Figure 7 shows the development of margin (Figure 7a), entropy (Figure 7b), and
classification accuracy (Figure 7c) over different stacking levels. The classification
accuracy increases monotonically over all stacking levels, starting at 86.8% at level 0 and ending at 90.7% at level 9. There is a significant change in accuracy at
the first levels which saturates quickly after roughly four levels. Interestingly, the
results differ for different classes: All classes (besides the street class which loses 1%
accuracy) benefit from stacking but not to the same extent. While, for example,
the accuracy of the field class appears to saturate already after level 2, the forest class continues to improve slightly even at the last iteration.
Fig. 6.: Detail of the Oberpfaffenhofen data set. The columns illustrate level 0, 1, 2, 5,
and 9 of stacking. From top to bottom: Label map (same color code as in Figure 5b);
Entropy; Margin; Class posterior of City, Street, Forest, Shrubland, Field [29].
While the accuracy quickly saturates, the certainty of the RFs continues to
increase (shown in Figures 7a and 7b). While the changes are large at the early
stacking levels, higher levels only achieve marginal improvements.
Fig. 7.: Results over the different stacking levels on the Oberpfaffenhofen data set [29]: (a) margin, (b) entropy, (c) accuracy.
Deep Learning methods that extract learned features are said to outperform tradi-
tional Machine Learning approaches with hand-crafted features. They have changed
paradigms in many areas, shifting research efforts from feature development to de-
signing deep architectures and creating databases. The latter is especially important
because deep learning works very well if (and often only if) large amounts of data
are available.
Recently, there has been a trend to provide satellite data for research for free
(e.g., [1]), which opens up new possibilities for methods requiring much data. In-
deed, deep learning methods are increasingly being used in Remote Sensing applica-
tions [30]. Traditional methods based on hand-crafted features, however, sometimes
still outperform deep learning approaches. An example is a recent classification chal-
lenge [31] where the four winning approaches (e.g. [32]) used ensemble techniques.
The problem of deep learning is usually not the lack of data, but the lack of labelled
data. More specifically, the lack of data of a particular sensor in conjunction with
the particular target variable to be learned.
As discussed at the beginning of this chapter, RS sensors cover a lot of very dif-
ferent modalities, which indicates that very large data sets are required for different
combinations of sensor type and target variable. More and larger labelled data sets
will surely emerge over time. However, the sheer amount of possible combinations
is clearly prohibitive.
In the absence of large amounts of labelled data, unsupervised or semi-supervised
methods may be beneficial: Instead of solving the primary problem directly, another
task is being learned for which more data is available. This proxy task is intended
to force the model to partially learn the primary task as well.
This paradigm can be successfully applied to the classification of PolSAR images
by using the proxy task of transcoding PolSAR images into optical multispectral
images for which large amounts of data is freely available. On the one hand, such
a transcoding provides a more intuitive visualization of SAR data. On the other
hand, the corresponding network must learn to recognize semantic entities in order
Fig. 8.: Transcoding examples using either adversarial loss (left) or L2 regression loss
(right). With the adversarial loss, structures not present in the SAR data are hallucinated,
leading to the synthesis of texture [33].
loss). Structures are transcoded into their respective average colors if they can
be differentiated in the PolSAR data, but are lost if there is no direct translation, causing many different types of land use to be mapped to similar colors.
If the network is supposed to differentiate between classes, it needs to repro-
duce not only the average colors but also the corresponding, class-specific textures.
This can be achieved by training the network as generator of a conditioned Gen-
erative Adversarial Network (conditioned GAN) similar to [34], which does not aim to reproduce the exact optical image from the SAR image but tries to produce a plausible optical image. In this context, plausible means that it is not possible to
distinguish the transcoded optical image from a real image given the SAR data.
A second convolutional network, the discriminator, computes this adversarial
loss, i.e. whether the optical output appears real given the SAR data. The training
of generator and discriminator is interleaved (more details on the exact training
procedure can be found in [33]).
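The two objectives contrasted here can be written down as simple loss functions. This is a simplified NumPy illustration of the difference only; the actual convolutional networks and training loop of [33] are not reproduced, and `d_real`/`d_fake` (discriminator outputs given the SAR input) are our names:

```python
import numpy as np

def l2_loss(fake_opt, real_opt):
    """Plain regression loss: reproducing the exact optical image tends to
    average colors and suppresses class-specific texture."""
    return np.mean((fake_opt - real_opt) ** 2)

def adversarial_losses(d_real, d_fake, eps=1e-8):
    """Non-saturating GAN losses: the discriminator scores real vs.
    transcoded optical images (conditioned on the SAR input); the
    generator is rewarded when its output is judged 'real'."""
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

# Early in training the discriminator wins easily, so the generator loss is large:
d_loss, g_loss = adversarial_losses(np.array([0.99]), np.array([0.01]))
```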
The described framework is applied to two scenes around Wroclaw and Poznan,
Poland, where the first scene is used for training and the second for evaluation.
Figure 9 shows crops from the training set from the optical image, the PolSAR
image, and the transcoding result. While the transcoding is not always accurate
in terms of colors (in particular for fields), the result does match the semantic
meaning of a region, i.e. the optical texture of a class is synthesized if this class
is truly present in the image. This shows that the transcoding network learned
useful features to detect and differentiate between different semantic classes. These
features can then be used in a subsequent classification network to achieve high
performance with only very few labelled training samples.
In the following, the prefix “FS” denotes methods trained from scratch for
classification, i.e. a simple U-shape ConvNet (FS U-net), a ConvNet of similar ar-
chitecture as the generator (FS generator), and a RF as described in Section 3.1.1
(FS RF). The prefix “PT” refers to methods that are pre-trained on the transcoding
task, i.e. only fine tuning the last three convolution layers of the transcoding net-
work (PT last layers) or retraining a smaller upsampling branch of the transcoding
network (PT upsampling).
Similar to [35], samples of each class are spatially clustered into 16 clusters to
reduce the amount of training data in a more natural way than simple random sampling. This makes it possible to choose 1, 2, 4, 8, or all 16 clusters, which corresponds roughly to 1/16, 1/8, 1/4, 1/2, or all of the training samples.
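The spatial subsampling can be sketched with a minimal k-means on pixel coordinates; this is a stand-in for the clustering of [35], and all names and parameters are ours:

```python
import numpy as np

def kmeans_2d(coords, k, iters=20, seed=0):
    """Minimal k-means on (n, 2) pixel coordinates; any spatial
    clusterer would do."""
    rng = np.random.default_rng(seed)
    centers = coords[rng.choice(len(coords), k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = coords[labels == j].mean(axis=0)
    return labels

def spatial_subsample(coords, n_clusters=16, keep=4, seed=0):
    """Cluster one class's samples into 16 spatial clusters and keep only
    `keep` of them (roughly keep/16 of the training data)."""
    labels = kmeans_2d(coords, n_clusters, seed=seed)
    kept = np.random.default_rng(seed).choice(n_clusters, keep, replace=False)
    return np.isin(labels, kept)  # boolean mask over the samples
```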
Qualitative and quantitative results on the test data set, i.e. a completely sepa-
rate image pair that was neither used for training nor for manual tuning, are shown
in Figures 10 and 11. All methods perform reasonably well if a sufficient amount
of training samples are available. The pre-trained networks using features from the
transcoding task outperform the methods trained from scratch. In particular, when
the amount of training data is small the methods trained from scratch show a large
performance decrease, while the pre-trained methods remain surprisingly stable at
Fig. 9.: Left: PolSAR image from Sentinel-1. Middle: Transcoded optical image using
only the SAR image as input. Right: Actual optical image from Sentinel-2. Not all
bands/channels are shown for the sake of brevity [33].
about 75% accuracy. Interestingly, the RF (as a shallow feature learner) is on par
with the deep ConvNets if they are not pre-trained, and even slightly superior if only a few training samples are available.
Fig. 10.: Qualitative classification results for several parts of the test data. Columns: SAR (classifier input), optical (for reference), “FS generator” and “PT last layers” trained with 1/16 of the data, and the same two methods trained with the full amount. Only “FS generator” and “PT last layers” are shown [33].
Fig. 11.: Classification accuracies achieved with different amounts of training data [33].
4. Conclusion
This chapter discussed the crucial role Remote Sensing plays for Earth Observation
and consequently for human welfare. City planning, forest monitoring, and nat-
ural hazard management are only a few possible applications which benefit from
high quality Remote Sensing data and a corresponding automatic analysis. Auto-
matic procedures to interpret (remotely sensed) images are often based on Machine Learning, which in recent years in particular has shown tremendous success, achieving unprecedented accuracy and robustness.
One of the most challenging and interesting Remote Sensing data sources is
Synthetic Aperture Radar (SAR) which served as an example to discuss modern
Machine Learning approaches in this chapter. In particular, Random Forests (RFs) were presented as one of the few shallow learning methods that are able to learn a direct mapping from the data to the desired output variable. It was shown how the general concept of RFs can be extended to process large amounts of data, as well as how to integrate it into even more elaborate frameworks. As a second
contemporary Machine Learning example Generative Adversarial Networks (GAN)
were discussed in the context of feature learning based on the proxy task of image
to image transcoding. The resulting optical-like SAR image representation might have an intrinsic value, e.g. for visualization purposes, but, more importantly, the learned features can be used to ease a subsequent classification task.
The presented results show that automatic feature learning leads to superior
performance and can relax the requirement of ConvNets for a large training set.
It was also shown that in particular for small training set sizes a shallow learner
such as RFs can still compete with Deep Learning approaches if trained and applied
properly.
The future of Machine Learning in Remote Sensing will make even stronger
use of larger data sets of freely available multi-modal, multi-temporal data. It will
connect new learning strategies, such as the integration of physical models, with new target variables, leading to many high-level data products that will further
strengthen our understanding of geological, biological, and physical processes in our
environment.
References
vision through Mellin transform and Meijer functions. In 2016 24th European Signal
Processing Conference (EUSIPCO), pp. 518–522, Budapest, Hungary, (2016).
5. P. Mantero, G. Moser, and S. B. Serpico, Partially supervised classification of remote
sensing images through svm-based probability density estimation, IEEE Transactions
on Geoscience and Remote Sensing. 43(3), 559–570, (2005).
6. L. Bruzzone, M. Marconcini, U. Wegmuller, and A. Wiesmann, An advanced system
for the automatic classification of multitemporal sar images, IEEE Transactions on
Geoscience and Remote Sensing. 42(6), 1321–1334, (2004).
7. R. Hänsch and O. Hellwich. Random forests for building detection in polarimetric sar
data. In 2010 IEEE International Geoscience and Remote Sensing Symposium, pp.
460–463, (2010).
8. R. Hänsch, Complex-valued multi-layer perceptrons - an application to polarimetric
sar data, Photogrammetric Engineering & Remote Sensing. 9, 1081–1088, (2010).
9. G. Moser and S. B. Serpico. Kernel-based classification in complex-valued feature
spaces for polarimetric sar data. In 2014 IEEE Geoscience and Remote Sensing Sym-
posium, pp. 1257–1260, (2014).
10. G. Licciardi, R. G. Avezzano, F. D. Frate, G. Schiavon, and J. Chanussot, A novel ap-
proach to polarimetric SAR data processing based on nonlinear PCA, Pattern Recog-
nition. 47(5), 1953 – 1967, (2014).
11. M. Tao, F. Zhou, Y. Liu, and Z. Zhang, Tensorial independent component analysis-
based feature extraction for polarimetric SAR data classification, IEEE Transactions
on Geoscience and Remote Sensing. 53(5), 2481–2495, (2015).
12. C. He, T. Zhuo, D. Ou, M. Liu, and M. Liao, Nonlinear compressed sensing-based
LDA topic model for polarimetric SAR image classification, IEEE Journal of Selected
Topics in Applied Earth Observations and Remote Sensing. 7(3), 972–982, (2014).
13. R. Hänsch. Generic object categorization in PolSAR images - and beyond. PhD thesis,
TU Berlin, (2014).
14. Y. Zhou, H. Wang, F. Xu, and Y. Q. Jin, Polarimetric SAR image classification using
deep convolutional neural networks, IEEE Geoscience and Remote Sensing Letters.
13(12), 1935–1939, (2016).
15. R. Hänsch and O. Hellwich. Complex-valued convolutional neural networks for object
detection in polsar data. In 8th European Conference on Synthetic Aperture Radar,
pp. 1–4, Aachen, Germany, (2010).
16. Z. Zhang, H. Wang, F. Xu, and Y. Q. Jin, Complex-valued convolutional neural net-
work and its application in polarimetric SAR image classification, IEEE Transactions
on Geoscience and Remote Sensing. PP(99), 1–12, (2017).
17. A. Criminisi and J. Shotton, Decision Forests for Computer Vision and Medical Image
Analysis. (Springer Publishing Company, Incorporated, 2013).
18. B. Fröhlich, E. Rodner, and J. Denzler. Semantic segmentation with millions of fea-
tures: Integrating multiple cues in a combined Random Forest approach. In 11th Asian
Conference on Computer Vision, pp. 218–231, Daejeon, Korea, (2012).
19. R. Hänsch and O. Hellwich, Skipping the real world: Classification of polsar images
without explicit feature extraction, ISPRS Journal of Photogrammetry and Remote
Sensing. 140, 122–132, (2017). ISSN 0924-2716.
20. T. K. Ho, The random subspace method for constructing decision forests, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence. 20(8), 832–844, (1998).
21. L. Breiman, Random forests, Machine Learning. 45(1), 5–32, (2001).
22. L. Breiman, Bagging predictors, Machine Learning. 24(2), 123–140, (1996).
23. R. Hänsch and O. Hellwich. Evaluation of tree creation methods within random forests
for classification of polsar images. In 2015 IEEE International Geoscience and Remote
² University of Iceland, Faculty of Electrical and Computer Engineering,
107 Reykjavik, Iceland
benedikt@hi.is
1. Introduction
210 F. Kizel and J.A. Benediktsson
the type of output (EMs and/or fractions), b) the type of EMs used/extracted, i.e., from the image or from a library, and c) whether the proposed model includes a term for sparse regularization. Following these definitions, a general overview of the existing spatially adaptive unmixing approaches yields the following five main groups of methods:
• Group E. Like the methods in the previous group, methods in this group are
intended to estimate both the EMs and their corresponding fractions
simultaneously. However, the sparsity of the fractions is also considered
and represented in the objective function [56]–[59].
For more information about the taxonomy of the existing spatially adaptive
methods, please see an overview in [39].
• The entire set of EMs is used to unmix all the pixels in the image, without
any explicit decision regarding the probable presence of only a subset of
the EMs in a given pixel. EMs that are not actually present in a specific pixel
are nevertheless considered in the optimization process, which results in
erroneous, overestimated positive fraction values.
Unlike other spatially adaptive methods, the GBSAU method addresses the two
drawbacks above to further enhance the accuracy of the obtained fractions.
Instead of the commonly used (local) spatial regularization, the entire
spatial information of the image is incorporated within a supervised unmixing
process, and the spatial distribution of each EM fraction is represented as a 2D
analytical surface. In the rest of this chapter, we present the basic concepts of the
GBSAU method and show how these concepts can be used to reconstruct
fraction surfaces for images with a low SNR and a high percentage of corrupted
pixels.
The process in GBSAU is based on the realistic assumption that the fractions of an
EM are spatially distributed around a finite number of cores (Fig. 2), and that their
values decrease (or, in some cases, remain at most constant) as the
pixel's distance from the closest core increases. In particular, the spatial decay of
fraction values outward from a spectral core is assumed to be Gaussian. Under
these assumptions, the overall process in GBSAU combines two main steps as
follows:
The first step starts with generating a spectral similarity surface for each EM by
calculating the spectral angle mapper (SAM) [60][61] between the EM spectrum and
the spectral signature at each pixel in the image. Given the spectral signature at an
arbitrary location in the image, i.e., m(x, y), the corresponding value of the spectral
similarity surface of the ith EM at pixel (x, y) is given by the SAM between m(x, y)
and the spectrum of that EM.
Once the spectral similarity surfaces of all the EMs are created, a multi-layer
surface is formed by stacking all the generated similarity surfaces along the z axis.
Then, two processes for the extraction of single-layer and multi-layer regional
maxima are applied. Single-layer regional maxima have a maximal value relative
to their spatial 2D neighborhood within the spectral similarity surface
of a given EM. Multi-layer regional maxima, in contrast, have a maximal value
relative to their 3D neighborhood, which includes the corresponding
2D neighborhoods in all other layers of the multi-layer surface. In addition to many
real spectral cores, the detection of points that do not represent a real spectral core
Hyperspectral and Spatially Adaptive Unmixing 215
is also probable. The elimination of these unreal cores is handled during the
process in the next step. Please see [39] for full details regarding these
processes.
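The two detection processes described above can be sketched in a few lines of NumPy/SciPy. The function names and the neighborhood size are our own illustrative choices, not those of [39]:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def sam_similarity_surface(cube, em_spectrum):
    """Similarity surface for one EM: the negated spectral angle (SAM)
    between the EM spectrum and each pixel signature, so that regional
    *maxima* mark candidate spectral cores.

    cube: (rows, cols, bands) image; em_spectrum: (bands,) vector."""
    dots = cube @ em_spectrum
    norms = np.linalg.norm(cube, axis=2) * np.linalg.norm(em_spectrum)
    angles = np.arccos(np.clip(dots / (norms + 1e-12), -1.0, 1.0))
    return -angles

def single_layer_maxima(surface, size=3):
    """Pixels maximal within their 2D spatial neighborhood."""
    return surface == maximum_filter(surface, size=size)

def multi_layer_maxima(surfaces, size=3):
    """Pixels maximal within a 3D neighborhood that also spans the
    corresponding 2D neighborhoods of all other EM layers."""
    stack = np.stack(surfaces, axis=2)          # (rows, cols, L)
    filt = maximum_filter(stack, size=(size, size, stack.shape[2]))
    return stack == filt
```

A pixel whose signature is perfectly aligned with the EM spectrum has a spectral angle of zero and therefore a (negated) similarity of zero, the largest attainable value.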
In the second step, an optimization process is applied to reconstruct the
fraction surface of each EM by fitting an analytical surface represented as a sum
of anisotropic 2D Gaussians. Accordingly, given that h_i Gaussians represent the
fraction surface of the ith EM, the fraction value f_i at a given location (x, y) can
be written as

    f_i(x, y) = Σ_{j=1}^{h_i} G_j(x, y),  with  G(x, y) = a_0 + a_1 e^{−u/2},   (4)

where

    u = (x′/T_x)² + (y′/T_y)²,

    x′ = (x − x_0) sin R + (y − y_0) cos R,

and

    y′ = (x − x_0) cos R − (y − y_0) sin R.
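A minimal NumPy sketch of one such anisotropic Gaussian, parameterized by a_0, a_1, T_x, T_y, R, x_0, y_0 (the rotation sign convention below is our assumption; u is unaffected by which orthogonal convention is chosen):

```python
import numpy as np

def aniso_gaussian(x, y, a0, a1, Tx, Ty, R, x0, y0):
    """One anisotropic, rotated 2D Gaussian, as in Eq. (4).
    x, y may be scalars or arrays (e.g., a meshgrid)."""
    xr = (x - x0) * np.sin(R) + (y - y0) * np.cos(R)
    yr = (x - x0) * np.cos(R) - (y - y0) * np.sin(R)
    u = (xr / Tx) ** 2 + (yr / Ty) ** 2
    return a0 + a1 * np.exp(-u / 2.0)

def fraction_surface(x, y, gaussians):
    """Fraction surface of one EM: the sum of its h_i Gaussians,
    each given as a tuple (a0, a1, Tx, Ty, R, x0, y0)."""
    return sum(aniso_gaussian(x, y, *g) for g in gaussians)
```

At the Gaussian's center (x_0, y_0) we have u = 0, so the surface attains its peak value a_0 + a_1.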
The beneficial use of a sum of spatial Gaussians for the approximation of a
spatial surface has been shown in [62]–[67]. The objective in the second step is to
adjust the parameters of the Gaussians while reconstructing the fraction surfaces
of all the EMs. Each Gaussian has seven parameters, a_0, a_1, T_x, T_y, R, x_0, y_0, and
all the Gaussians are adjusted simultaneously. The process is initialized by locating
a narrow 2D Gaussian at each spectral core extracted in the previous step.
where c and r are the numbers of columns and rows of the image, respectively,
and G(x, y) [60] represents the local spectral similarity between the source spectral
signature at the point (x, y) and the reconstructed signature (given by the endmember set
E and the vector of estimated fractions f̂(x, y)). G is defined as follows:

    G(x, y) = m(x, y)ᵀ E f̂(x, y) / ( ‖m(x, y)‖ · ‖E f̂(x, y)‖ ).   (6)
    P̂^(k+1) = P̂^(k) + η ∂Ψ/∂P̂ |_(P̂^(k)),   (8)

where k is the iteration index, η is the step size, and ∂Ψ/∂P̂ is the gradient of the
objective function Ψ at P̂. Given that the number of EMs is L and that the number
of Gaussians that represent the fraction surface of the ith EM is h_i, the overall
gradient is given by
    P̂ = [ P̂_1^1, P̂_1^2, …, P̂_1^(h_1), P̂_2^1, P̂_2^2, …, P̂_2^(h_2), …, P̂_L^1, P̂_L^2, …, P̂_L^(h_L) ]ᵀ,   (9)

where P̂_i^j is the vector of derivatives of the objective function Ψ with respect to
the parameters of the jth Gaussian of the ith EM.
A full description of the derivatives is given in [39]. All the derivatives of the
objective function Ψ with respect to the Gaussian parameters can be derived
analytically. Thus, the overall optimization process is relatively computationally
light and allows for adjusting many Gaussians for each EM. Moreover, during the
optimization process, Gaussians located at a point that does not represent
a real fraction core vanish, i.e., their magnitude converges to zero,
whereas the other Gaussians change their parameters and take approximately the
shape of the real fraction distribution around the core. This property significantly
reduces the sensitivity of the GBSAU method to the false detection of unreal cores
during the first step.
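The update rule in Eq. (8) is plain gradient ascent. A toy sketch follows, with a numerical gradient standing in for the analytic one derived in [39], and a hypothetical quadratic objective in place of Ψ:

```python
import numpy as np

def numerical_grad(objective, P, eps=1e-6):
    """Central-difference gradient (in GBSAU the gradient is analytic)."""
    g = np.zeros_like(P)
    for j in range(P.size):
        d = np.zeros_like(P)
        d[j] = eps
        g[j] = (objective(P + d) - objective(P - d)) / (2 * eps)
    return g

def gradient_ascent(objective, P0, step=0.1, iters=200):
    """P^(k+1) = P^(k) + eta * dPsi/dP evaluated at P^(k), as in Eq. (8)."""
    P = np.asarray(P0, dtype=float)
    for _ in range(iters):
        P = P + step * numerical_grad(objective, P)
    return P

# toy concave objective with its maximum at P = (1, -2)
psi = lambda P: -((P[0] - 1.0) ** 2 + (P[1] + 2.0) ** 2)
P_hat = gradient_ascent(psi, [0.0, 0.0])
```

In GBSAU all seven parameters of every Gaussian of every EM are stacked into the single vector P̂ of Eq. (9) and updated simultaneously by this rule.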
where C is the set of spectral cores extracted during the first step of GBSAU and
P̂ is the vector of estimated parameters of all the Gaussians in the problem. The
problem in (11) can be modified to estimate the parameters of the Gaussians based
on data with corrupted pixels as follows:

    P̂_c = arg max_{P_c} Ψ(E, H_c, C_c, P_c),   (12)
based on data without corrupted pixels, P̂. Let cp = no_cp / no_p denote the
percentage of corrupted pixels in the image H_c, where no_cp and no_p are the
number of corrupted pixels and the overall number of pixels in the image,
respectively. We assume that the error e_p is relatively robust to the factor cp, i.e.,
that we can estimate the parameters of the Gaussians with sufficient accuracy even
using data with many corrupted pixels.
To test the performance of the modified GBSAU for the reconstruction of fraction
surfaces, we applied a comparative evaluation using a patch of 50 × 50 pixels
from a real AisaDUAL image (Fig. 1). Six EMs, of asphalt, vegetation, red roof,
concrete, and two types of soil, were selected from the image and used for the
evaluation. To create data with a synthetic ground truth of the fractions, we created
a semi-synthetic spectral image with an affinity to the real image, as described in
[39]. The semi-synthetic image contains mixed pixels with a variety of
combinations of the used EMs. To simulate real conditions in each experimental
scenario, we added Gaussian noise to each pixel in the image. To test the
performance of the examined methods in the presence of corrupted pixels, we
created scenes with different percentages of corrupted pixels over the image. We
created nine scenarios with different combinations of signal-to-noise ratio (SNR)
and percentage of corrupted pixels, as presented in Table 1. An RGB composite of
the generated image in each scenario is presented in Fig. 3. The performance of
the modified GBSAU method in each case was compared to that of two ordinary
(non-spatial) methods, SUnSAL [26] and VPGDU [60], and the spatially adaptive
method SUnSAL-TV [52]. Except for the GBSAU, the examined methods are not
adaptable to data with corrupted pixels. Thus, an interpolated image in each
scenario was used as the input for these methods. A linear interpolation was applied
only to retrieve the data in the corrupted pixels, while pixels with original data were
not modified. In contrast, the modified GBSAU was always applied to the original
image with corrupted pixels, without the use of any interpolated data.
Table 1. Experimental scenarios with different combinations of SNR and percentage of corrupted
pixels.

Scenario   SNR (dB)   Corrupted pixels (%)
1          30         0
2          30         40
3          30         80
4          10         0
5          10         40
6          10         80
7          5          0
8          5          40
9          5          80
Fig. 1. RGB composite and reflectance spectra of the six EMs, selected from the AisaDUAL image.
Fig. 2. Scatter of spectral core points for EM of Asphalt; (a)—(b) single-layer and multi-layer
regional maxima points, respectively, using an image without corrupted pixels; (c)—(d) single-layer
and multi-layer regional maxima points, respectively, using an image with 40% corrupted pixels.
Fig. 3. RGB composite of the generated semi-synthetic spectral images (a)–(i) according to the
presented parameters in scenarios 1–9, respectively.
    MAE = (1/L) Σ_{i=1}^{L} MAE_i,   (13)

where

    MAE_i = (1/(r·c)) Σ_{x=1}^{c} Σ_{y=1}^{r} | f̂_i(x, y) − f_i(x, y) |,
f_i(x, y) and f̂_i(x, y) are, respectively, the synthetic-true fractions and the estimated
fractions of the ith EM at (x, y), and c and r are the numbers of columns and rows
in the image, respectively. The results are summarized in Table 2.
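Eq. (13) translates directly into code; `F_true` and `F_est` below are assumed to be stacks of per-EM fraction surfaces:

```python
import numpy as np

def mae_per_em(f_true, f_est):
    """MAE_i over one EM's r x c fraction surface, as in Eq. (13)."""
    return np.mean(np.abs(f_est - f_true))

def overall_mae(F_true, F_est):
    """Average of MAE_i over the L endmembers.

    F_true, F_est: arrays of shape (L, rows, cols)."""
    return np.mean([mae_per_em(t, e) for t, e in zip(F_true, F_est)])
```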
Table 2. Quantitative MAE measures for evaluating the accuracy of the estimated fractions in each
experimental scenario, using the GBSAU, SUnSAL-TV, SUnSAL, and VPGDU methods, relative
to the true synthetic fractions. The results of the GBSAU are based only on the original images with
corrupted pixels, while the other methods were applied to interpolated images.

MAE, 0% corrupted pixels
SNR     SUnSAL   VPGDU    SUnSAL-TV   GBSAU
30 dB   0.0178   0.0200   0.0132      0.0143
10 dB   0.0846   0.1200   0.0842      0.0628
5 dB    0.1101   0.1417   0.1065      0.0816
The results clearly show the advantage of the GBSAU over the other methods,
especially as the SNR decreases and the percentage of corrupted pixels increases.
In general, the spatially adaptive methods perform better than the ordinary ones in the case of
SNR = 30 dB. The SUnSAL-TV method loses its advantage over the non-spatial
SUnSAL method in some of the cases with corrupted pixels and low SNR; this
probably indicates the negative influence of the interpolated data on the spatial
regularization. Otherwise, the advantage of the modified GBSAU over all the other
examined methods is clear. Recall that its results are obtained without
the need for any interpolation. The modified GBSAU thus provides a beneficial tool for
the retrieval of valuable information from spectral data with a high level of noise
and a high percentage of corrupted pixels. To illustrate the results summarized in Table
2, Fig. 4 presents the surface of the MAE values for the fractions obtained by
GBSAU and SUnSAL-TV. First, the MAE value in each scenario, i.e., for a
particular combination of SNR and percentage of corrupted pixels, is assigned to
a corresponding pixel to create a 3 × 3 surface. Then, for better visualization, the
surface is resized into a 15 × 15 surface using bicubic interpolation.
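This visualization step can be reproduced with, e.g., `scipy.ndimage.zoom`, which performs bicubic interpolation at `order=3`. The MAE values below are placeholders, not the measured ones:

```python
import numpy as np
from scipy.ndimage import zoom

# hypothetical 3 x 3 grid of MAE values, one per experimental scenario
mae_surface = np.array([[0.013, 0.063, 0.082],
                        [0.031, 0.080, 0.105],
                        [0.052, 0.101, 0.130]])

# bicubic resize of the 3 x 3 scenario grid to a 15 x 15 surface
mae_fine = zoom(mae_surface, 5, order=3)
```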
Fig. 4. MAE surfaces, (a) and (b), for the obtained results by GBSAU and SUnSAL-TV, respectively.
The illustration in Fig. 4 sheds light on the obtained MAE values and emphasizes
the advantage of GBSAU over SUnSAL-TV with regard to the accuracy of the
estimated fractions. The increase of the MAE values, as the SNR decreases or the
percentage of corrupted pixels increases, is evident for both methods. However, the
trend of this increase is more moderate for the results of GBSAU. For further visual
evaluation of the results, we present the obtained fraction surfaces for the red roof
EM in each scenario. Fig. 5 and Fig. 6 present the surfaces obtained by SUnSAL-TV
and GBSAU, respectively.
Fig. 5. Fraction surfaces of the red roof EM, (a)-(i) as obtained by SUnSAL-TV for the images in
scenarios 1-9, respectively. The results for cases with corrupted pixels are obtained using an
interpolated image.
Fig. 6. Fraction surfaces of the red roof EM, (a)-(i) as obtained by GBSAU for the images in scenarios
1-9, respectively. The results for cases with corrupted pixels are obtained using only non-corrupted
pixels.
It is noticeable that both methods can retrieve reliable information even from very noisy
data. However, the advantage of the GBSAU over SUnSAL-TV in all the scenarios is clear.
The ability of the GBSAU to reconstruct the fractions under conditions of low SNR and a
high percentage of corrupted pixels is noteworthy. In addition to the spatial distribution,
the GBSAU also preserves the sparsity of the fractions, as can be observed from the
presented surfaces: while SUnSAL-TV overestimates fractions with zero value, GBSAU
obtains a zero value for these pixels in most of the cases.
4. Conclusions
We presented a new strategy for retrieving the fractional abundances from very
noisy hyperspectral images with a high percentage of corrupted pixels. The new
strategy is based on a modification of the GBSAU method. An experimental
evaluation of the proposed strategy with respect to state-of-the-art spatially adaptive
and non-spatial unmixing methods was conducted. The outcomes in the evaluation
section emphasize the advantage of using the unmixing process for the extraction
of information from hyperspectral data. All the methods succeed in retrieving some
of the information regarding the spatial distribution of the abundance
fractions with a certain degree of accuracy. This accuracy decreases as the SNR
decreases or the percentage of corrupted pixels increases. The GBSAU method
outperforms all the other methods with regard to the accuracy of the obtained
results. This higher accuracy is mainly due to the use of spatial information from
the entire image and not only from local regions. The problem of spatially adaptive
unmixing is similar to the fitting of a particular function using grid data. While the
solution in SUnSAL-TV (and all the other methods) aims to achieve this objective
through a piecewise regularization, the solution in GBSAU fits a single continuous
and smooth function for each EM across the entire image. Thus, the influence of
local anomalies on the surfaces obtained by GBSAU is much lower. For similar
reasons, all the other unmixing methods cannot be applied to data without
continuity, and an interpolation method must be applied as a preprocessing step
before the spectral unmixing. In contrast, the framework in GBSAU provides a
novel solution for unmixing images with both a low SNR and non-continuity due
to the presence of corrupted pixels.
References
[1] N. Keshava and J. F. Mustard, “Spectral unmixing,” IEEE Signal Process. Mag., vol. 19,
no. 1, pp. 44–57, 2002.
[2] A. Plaza, Q. Du, J. M. Bioucas-Dias, X. Jia, and F. A. Kruse, “Foreword to the special issue
on spectral unmixing of remotely sensed data,” IEEE Trans. Geosci. Remote Sens., vol. 49,
no. 11 PART 1, pp. 4103–4105, 2011.
[3] M. Brown, H. G. Lewis, and S. R. Gunn, “Linear spectral mixture models and support vector
machines for remote sensing,” IEEE Trans. Geosci. Remote Sens., vol. 38, no. 5, pp. 2346–
2360, 2000.
[4] J. Bioucas-Dias et al., “Hyperspectral unmixing overview: Geometrical, statistical, and
sparse regression-based approaches,” IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, vol. 5, no. 2. pp. 354–379, 2012.
[5] B. Hapke, “Bidirectional reflectance spectroscopy,” Icarus, vol. 195, no. 2. pp. 918–926, 2008.
[6] J. M. P. Nascimento and J. M. Bioucas-Dias, “Nonlinear mixture model for hyperspectral
unmixing,” in Proceedings of SPIE conference on Image and Signal Processing for Remote
Sensing XV, 2009, vol. 7477, pp. 74770I-1-74770I–8.
[7] Y. Altmann, N. Dobigeon, J. Y. Tourneret, and S. McLaughlin, “Nonlinear unmixing of
hyperspectral images using radial basis functions and orthogonal least squares,” Geoscience
and Remote Sensing Symposium (IGARSS), 2011 IEEE International. pp. 1151–1154, 2011.
[8] R. Heylen, M. Parente, and P. Gader, “A review of nonlinear hyperspectral unmixing
methods,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing, vol. 7, no. 6. pp. 1844–1868, 2014.
[9] C. I. Chang, “Constrained subpixel target detection for remotely sensed imagery,” IEEE
Trans. Geosci. Remote Sens., vol. 38, no. 3, pp. 1144–1159, 2000.
[10] N. Keshava, “A Survey of Spectral Unmixing Algorithms,” Lincoln Lab. J., vol. 14, no. 1,
pp. 55–78, 2003.
[11] A. Bateson and B. Curtiss, “A method for manual endmember selection and spectral
unmixing,” Remote Sens. Environ., vol. 55, no. 3, pp. 229–243, 1996.
[12] M. Parente and A. Plaza, “Survey of geometric and statistical unmixing algorithms for
hyperspectral images,” in 2nd Workshop on Hyperspectral Image and Signal Processing:
Evolution in Remote Sensing, WHISPERS 2010 - Workshop Program, 2010.
[13] J. W. Boardman, F. A. Kruse, and R. O. Green, “Mapping target signatures via partial
unmixing of AVIRIS data,” Summ. JPL Airborne Earth Sci. Work., pp. 3–6, 1995.
[14] C. Gonzalez, D. Mozos, J. Resano, and A. Plaza, “FPGA implementation of the N-FINDR
algorithm for remotely sensed hyperspectral image analysis,” IEEE Trans. Geosci. Remote
Sens., vol. 50, no. 2, pp. 374–388, 2012.
[15] C. I. Chang, C. C. Wu, C. S. Lo, and M. L. Chang, “Real-Time Simplex Growing Algorithms
for Hyperspectral Endmember Extraction,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 48, no. 4. pp. 1834–1850, 2010.
[16] X. Geng, Z. Xiao, L. Ji, Y. Zhao, and F. Wang, “A Gaussian elimination based fast
endmember extraction algorithm for hyperspectral imagery,” ISPRS J. Photogramm.
Remote Sens., vol. 79, pp. 211–218, May 2013.
[17] M. E. Winter, “N-FINDR: an algorithm for fast autonomous spectral end-member
determination in hyperspectral data,” SPIE’s Int. Symp. Opt. Sci. Eng. Instrum., vol. 3753,
no. July, pp. 266–275, 1999.
[18] C. I. Chang, C. C. Wu, W. M. Liu, and Y. C. Ouyang, “A new growing method for simplex-
based endmember extraction algorithm,” IEEE Trans. Geosci. Remote Sens., vol. 44, no. 10,
pp. 2804–2819, 2006.
[19] M. D. Craig, “Minimum-volume transforms for remotely sensed data,” IEEE Trans. Geosci.
Remote Sens., vol. 32, no. 3, pp. 542–552, 1994.
[20] J. M. P. Nascimento and J. M. Bioucas-Dias, “Hyperspectral Unmixing Based on Mixtures
of Dirichlet Components,” IEEE Transactions on Geoscience and Remote Sensing, 2011.
[21] E. M. T. Hendrix, I. Garcia, J. Plaza, G. Martin, and A. Plaza, “A New Minimum-Volume
Enclosing Algorithm for Endmember Identification and Abundance Estimation in
Hyperspectral Data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no.
7. pp. 2744–2757, 2012.
[22] J. M. P. Nascimento and J. M. B. Dias, “Vertex component analysis: A fast algorithm to
unmix hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 4, pp. 898–910,
2005.
[23] A. Plaza, P. Martinez, R. Perez, and J. Plaza, “A quantitative and comparative analysis of
endmember extraction algorithms from hyperspectral data,” IEEE Trans. Geosci. Remote
Sens., vol. 42, no. 3, pp. 650–663, 2004.
[24] S. Sánchez, G. Martín, and A. Plaza, “Parallel implementation of the N-FINDR endmember
extraction algorithm on commodity graphics processing units,” in International Geoscience
and Remote Sensing Symposium (IGARSS), 2010, pp. 955–958.
[25] Z. Shi, W. Tang, Z. Duren, and Z. Jiang, “Subspace matching pursuit for sparse unmixing
of hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 6, pp. 3256–3274,
2014.
[26] J. M. Bioucas-Dias and M. A. T. Figueiredo, “Alternating direction algorithms for
constrained sparse regression: Application to hyperspectral unmixing,” in 2nd Workshop on
Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, WHISPERS
2010 - Workshop Program, 2010.
[27] M. D. Iordache, J. M. Bioucas-Dias, and A. Plaza, “Collaborative Sparse Regression for
Hyperspectral Unmixing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52,
no. 1. pp. 341–354, 2014.
[28] M. D. Iordache, J. M. Bioucas-Dias, and A. Plaza, “Sparse Unmixing of Hyperspectral
Data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6. pp. 2014–
2039, 2011.
[29] W. Ouerghemmi, C. Gomez, S. Naceur, and P. Lagacherie, “Applying blind source
separation on hyperspectral data for clay content estimation over partially vegetated
surfaces,” Geoderma, vol. 163, no. 3–4, pp. 227–237, 2011.
[30] I. Meganem, Y. Deville, S. Hosseini, P. Déliot, and X. Briottet, “Linear-quadratic blind
source separation using NMF to unmix urban hyperspectral images,” IEEE Trans. Signal
Process., vol. 62, no. 7, pp. 1822–1833, 2014.
[31] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning
Algorithms and Applications. 2003.
[32] Y. Zhong, X. Wang, L. Zhao, R. Feng, L. Zhang, and Y. Xu, “Blind spectral unmixing based
on sparse component analysis for hyperspectral remote sensing imagery,” ISPRS J.
Photogramm. Remote Sens., vol. 119, pp. 49–63, Sep. 2016.
[33] C.-H. Lin, C.-Y. Chi, Y.-H. Wang, and T.-H. Chan, “A Fast Hyperplane-Based Minimum-
Volume Enclosing Simplex Algorithm for Blind Hyperspectral Unmixing,” IEEE Trans.
Signal Process., vol. 64, no. 8, pp. 1946–1961, Apr. 2016.
[34] S. Zhang, A. Agathos, and J. Li, “Robust Minimum Volume Simplex Analysis for
Hyperspectral Unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 11, pp. 6431–
6439, Nov. 2017.
[35] Y. Zhong, X. Wang, L. Zhao, R. Feng, L. Zhang, and Y. Xu, “Blind spectral unmixing based
on sparse component analysis for hyperspectral remote sensing imagery,” ISPRS J.
Photogramm. Remote Sens., vol. 119, pp. 49–63, Sep. 2016.
[36] S. Jia and Y. Qian, “Constrained nonnegative matrix factorization for hyperspectral
unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 1, pp. 161–173, 2009.
[37] A. Buades, B. Coll, and J.-M. Morel, “A Non-Local Algorithm for Image Denoising,” in
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 2, pp. 60–65.
[38] Y. Zhong, R. Feng, and L. Zhang, “Non-local sparse unmixing for hyperspectral remote
sensing imagery,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 7, no. 6, pp. 1889–
1909, 2014.
[39] F. Kizel and M. Shoshany, “Spatially adaptive hyperspectral unmixing through endmembers
analytical localization based on sums of anisotropic 2D Gaussians,” ISPRS J. Photogramm.
Remote Sens., vol. 141, pp. 185–207, Jul. 2018.
[40] A. Plaza, P. Martinez, R. Perez, and J. Plaza, “Spatial/spectral endmember extraction by
multidimensional morphological operations,” IEEE Trans. Geosci. Remote Sens., vol. 40,
no. 9, pp. 2025–2041, 2002.
[41] D. M. Rogge, B. Rivard, J. Zhang, A. Sanchez, J. Harris, and J. Feng, “Integration of spatial-
spectral information for the improved extraction of endmembers,” Remote Sens. Environ.,
vol. 110, no. 3, pp. 287–303, 2007.
[42] M. Zortea and A. Plaza, “Spatial preprocessing for endmember extraction,” IEEE Trans.
Geosci. Remote Sens., vol. 47, no. 8, pp. 2679–2693, 2009.
[43] G. Martín and A. Plaza, “Spatial-spectral preprocessing prior to endmember identification
and unmixing of remotely sensed hyperspectral data,” IEEE J. Sel. Top. Appl. Earth Obs.
Remote Sens., vol. 5, no. 2, pp. 380–395, 2012.
[44] G. Martín and A. Plaza, “Region-based spatial preprocessing for endmember extraction and
spectral unmixing,” IEEE Geosci. Remote Sens. Lett., vol. 8, no. 4, pp. 745–749, 2011.
[45] B. Somers, M. Zortea, A. Plaza, and G. P. Asner, “Automated extraction of image-based
endmember bundles for improved spectral unmixing,” IEEE J. Sel. Top. Appl. Earth Obs.
Remote Sens., vol. 5, no. 2, pp. 396–408, 2012.
[46] M. Shoshany and T. Svoray, “Multidate adaptive unmixing and its application to analysis
of ecosystem transitions along a climatic gradient,” Remote Sens. Environ., 2002.
[47] A. Zare, “Spatial-spectral unmixing using fuzzy local information,” in International
Geoscience and Remote Sensing Symposium (IGARSS), 2011, pp. 1139–1142.
[48] X. Song, X. Jiang, and X. Rui, “Spectral unmixing using linear unmixing under spatial
autocorrelation constraints,” in International Geoscience and Remote Sensing Symposium
(IGARSS), 2010, pp. 975–978.
[49] O. Eches, N. Dobigeon, and J. Y. Tourneret, “Enhancing hyperspectral image unmixing
with spatial correlations,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11 PART 1, pp.
4239–4247, 2011.
CHAPTER 2.3

IMAGE PROCESSING FOR SEA ICE PARAMETER IDENTIFICATION FROM VISUAL IMAGES
Qin Zhang
Department of Marine Technology, Norwegian University of Science and
Technology, 7052, Trondheim, Norway
chin.qz.chang@gmail.com
Sea ice statistics and ice properties are important for the analysis of ice–structure
interaction. The use of cameras as sensors on mobile sensor platforms will aid
the development of sea ice observation to, for instance, support the estimation
of ice forces that are critical to marine operations in Arctic waters. The time-
and geo-referenced sea ice images captured from cameras will provide valuable
information to observe the type and state of sea ice and the corresponding physical
phenomena taking place. However, there has been a lack of methods to effectively
extract engineering-scale parameters from sea ice images, leaving scientists and
engineers to do their analysis manually. This chapter introduces novel sea ice
image processing algorithms to automatically extract useful ice information, such
as ice concentration, ice types and ice floe size distribution, which are important
in various fields of ice engineering.
1. Introduction
Various types of remotely sensed data and imaging technology have been developed
for sea ice observation. Image data from various sources, such as visible cameras,
infrared cameras, radar, and satellites, are rich in information about the environment,
from which many of the sea ice parameters can be extracted. Identifying ice pa-
rameters for a wide-scale region from satellite data has been widely studied.1–6
Recently, ice cover data on a global scale have become available on a daily basis
due to the development of microwave satellite sensors, making it possible to moni-
tor the global variability of sea ice extent over time-scales from days to seasons.
However, the satellite observing systems are unable to monitor the local variability
of sea ice parameters (e.g., the sea ice in contact with a marine/offshore structure
or coastal infrastructure), and this is still an issue at the engineering scale (e.g., for
predicting sea ice behavior and loads with numerical models) due to the lack of
sub-grid-scale information on the ice parameters.7 This has motivated attention to
the boundary detection of individual floes and the estimation of floe size distributions.8,9
March 12, 2020 15:13 ws-rv961x669 HBPRCV-6th Edn.–11573 ch12˙Zhang-Qin page 232
Focusing on a relatively small scale, camera imagery becomes one of the most
information-rich remote sensing tools and has been used on mobile sensor platforms
(e.g., aircraft, shipboard, or unmanned vehicles) to characterize ice conditions
for engineering purposes.10–12 Cameras as field observation sensors have the poten-
tial for continuous, high-precision measurements that capture a wide range of the
sea ice field, from a few meters to hundreds of meters. The visible image
data obtained from cameras have high resolution, which is particularly important
for providing detailed localized information on sea ice in order to collect observational
data on an operational basis.13 The information about the object and environment
provided by such visible images is close to human visual perception, both in tonal
structure and resolution, meaning that the determination of sea ice characteristics via
visible sea ice images is similar to manual visual observation. Thus, cameras can be
used as a supplementary means of obtaining necessary and important information on
the actual ice conditions, for the validation of theories and the estimation of parameters,
in combination with other instruments for sea ice remote sensing.
Despite the advantages of using visible cameras for sea ice observation, an im-
portant requirement is clear weather and sight when capturing sea ice image data.
Moreover, one of the major problems of surveying sea ice via cameras has been the
difficulty of image processing for the numerical extraction of sea ice information, which
is vital for estimating the sea ice properties and understanding the behavior of sea
ice, especially on a relatively small scale. Since the sea ice condition is a complicated
multi-domain process, it is not easy to analyze sea ice images quantitatively. The
lack of an effective method for image processing has hampered the understanding
of the dynamic properties of sea ice on a small scale. This chapter introduces novel
sea ice image processing algorithms to automatically extract useful ice informa-
tion, such as ice concentration, ice types and ice floe size distribution, which are
important in various fields of ice engineering.
Ice Concentration (IC) from a digital visual image is, for simplicity, defined as the
area of sea surface covered by visible ice observable in a 2D visual image taken
vertically from above, as a fraction of the whole sea surface domain area.14 It can
be calculated as the ratio of the number of pixels of visible ice to the total number
of pixels within the image domain, where the domain area is an effective area within
the image excluding land and other non-relevant areas. This means that the ice
concentration is given by a binary decision for each pixel, determining whether it
belongs to the class “ice” or to the class “water”; the distinction of ice pixels from
water pixels is thus crucial to calculating the ice concentration value.
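The definition above amounts to a single ratio over a binary classification map; a minimal sketch (the mask names are ours):

```python
import numpy as np

def ice_concentration(ice_mask, domain_mask=None):
    """IC = (# ice pixels) / (# pixels in the effective domain).

    ice_mask: boolean array, True where a pixel is classified as ice.
    domain_mask: boolean array, True inside the effective image domain
    (excluding land and other non-relevant areas); defaults to the
    whole image."""
    if domain_mask is None:
        domain_mask = np.ones_like(ice_mask, dtype=bool)
    ice_pixels = np.count_nonzero(ice_mask & domain_mask)
    return ice_pixels / np.count_nonzero(domain_mask)
```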
The pixels in the same region normally have similar intensity. Since ice is whiter
than water, ice pixels usually have higher intensity values than water pixels in a
uniformly illuminated ice image.

Image Processing for Sea Ice Parameter Identification from Visual Images 233

The thresholding method, which extracts the objects from the background based on
the pixels' grayscale values and converts the grayscale image to a binary one, is a
natural choice for separating an ice image into an “ice region” and a “water region”.
The automatic selection
of an appropriate grayscale threshold value is crucial when using the thresholding
method for determining ice pixels. Otsu thresholding, an exhaustive search
for the globally optimal threshold, is one of the most common automatic threshold
segmentation methods.15 This method assumes that the grayscale histogram of an
image is bimodal and that the illumination of the image is uniform. It then divides
the histogram into two classes (i.e., the pixels are identified as either foreground or
background) and finds the threshold value that minimizes the within-class variance.
Otsu's method is a bi-level image thresholding technique and can be further extended
to multi-level thresholding for image segmentation. Multi-level thresholding using
Otsu's method is computationally simple when dividing the image into two or three
classes, but as the number of classes increases, the minimization procedure becomes
more complex, which makes the multi-level Otsu thresholding method time consuming.
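A from-scratch sketch of the exhaustive search (libraries such as scikit-image expose the same functionality as `skimage.filters.threshold_otsu`); maximizing the between-class variance is equivalent to minimizing the within-class variance:

```python
import numpy as np

def otsu_threshold(gray):
    """Exhaustive search over all 8-bit thresholds for the one that
    maximizes the between-class variance.

    gray: 2D uint8 array. Pixels >= threshold are labeled "ice"."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    best_t, best_sigma = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0          # class means
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if sigma_b > best_sigma:
            best_t, best_sigma = t, sigma_b
    return best_t
```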
K-means clustering, which minimizes the within-cluster sum of distance to par-
tition a set of data into groups, is another ice pixel detection method.16 This
algorithm iterates two steps: assignment and update. In the assignment step, each
point of the data set is assigned to its nearest centroid; and in the update step, the
position of the centroid is adjusted to match the sample means of the data points
that they are responsible for. The iteration stops when the positions of centroids no
longer change. K-means clustering likewise minimizes a within-class variance, but it does not need to compute any variance explicitly. The algorithm is therefore computationally fast and well suited to a quick review of the data, especially if the objects are classified into many clusters.17
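The assignment/update iteration described above can be sketched for one-dimensional intensity features as follows (an illustrative NumPy version, not the implementation behind the cited results; `kmeans_1d` and its parameters are hypothetical names):

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Plain k-means on pixel intensities (1-D features).

    Iterates the two steps from the text: assign each value to its nearest
    centroid, then move each centroid to the mean of its assigned values.
    Stops early when the centroids no longer change."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False).astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every pixel value.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]),
                           axis=1)
        # Update step: centroid = mean of the values assigned to it.
        new_centroids = np.array([
            values[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                      # centroids no longer change
        centroids = new_centroids
    return labels, centroids

# Bright (ice) and dark (water) pixels separate into two clusters.
pixels = np.array([20., 25., 30., 200., 210., 215., 205.])
labels, centroids = kmeans_1d(pixels, k=2)
# The cluster with the higher centroid is "ice", the lower one "water".
ice_label = int(np.argmax(centroids))
```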
Using the Otsu thresholding or the k-means clustering method for determining ice pixels, the ice image is divided into two or more classes. The class with the lowest average intensity value is considered to be water, while the other classes are considered to be ice.14,18 The ice pixel detection results using the Otsu thresholding method are similar to the results using the k-means method when the intensity values of all the ice pixels are significantly higher than those of the water pixels.19 However, the bi-level Otsu thresholding method can only find “light ice” pixels. The “dark ice” (e.g., brash ice, slush, and ice that is submerged in water), whose pixel intensity values lie between the threshold and those of water, may be lost. According to the definition of ice concentration, both “light ice” and “dark ice” visible in the ice images should be included when calculating the ice concentration. Since the multi-level Otsu thresholding method requires more computational time to detect ice pixels, the k-means clustering method is preferred; it achieves better detection by dividing the image into three or more clusters, as seen in Fig. 1.
Image Processing for Sea Ice Parameter Identification from Visual Images 235
initial contour, which is a starting set of snake points for the evolution and should
be placed close to the true boundary. Otherwise, the snake will likely converge to
an incorrect result. To overcome these limitations, the gradient vector flow, which
is derived from the image by minimizing a certain energy functional in a variational
framework, was introduced into the traditional snake algorithm.20 The GVF field
is computed as a spatial diffusion of the gradient vectors of an edge map to expand
the capture range of external force fields from boundary regions to homogeneous
regions and to enable the external forces to point into deep concavities of object
boundaries. The GVF snake is thus computationally faster and less restricted by
the initial contour.
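The GVF computation sketched above can be illustrated as follows (a simplified 2-D NumPy version of the Xu–Prince diffusion iteration; periodic boundary handling via `np.roll` and the parameter values are simplifying assumptions, not the chapter's implementation):

```python
import numpy as np

def gradient_vector_flow(edge_map, mu=0.2, n_iter=200):
    """Compute a gradient vector flow (GVF) field (u, v) from an edge map f
    by iterating the diffusion equations
        u <- u + mu * laplacian(u) - (u - fx) * (fx**2 + fy**2)
        v <- v + mu * laplacian(v) - (v - fy) * (fx**2 + fy**2)
    The diffusion spreads the edge-map gradients (fx, fy) from boundary
    regions into homogeneous regions, enlarging the capture range."""
    fy, fx = np.gradient(edge_map.astype(float))
    u, v = fx.copy(), fy.copy()
    mag2 = fx ** 2 + fy ** 2          # gradient magnitude squared

    def laplacian(a):                 # 5-point stencil, periodic borders
        return (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
                np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4.0 * a)

    for _ in range(n_iter):
        u = u + mu * laplacian(u) - (u - fx) * mag2
        v = v + mu * laplacian(v) - (v - fy) * mag2
    return u, v

# A single bright square on a dark background: far from the square the raw
# gradient is zero, but the GVF field is non-zero and points toward the edge.
img = np.zeros((64, 64))
img[24:40, 24:40] = 1.0
u, v = gradient_vector_flow(img)
```

Evaluating the field a few pixels outside the square shows the extended capture range: the raw gradient there is exactly zero, while the GVF component still points toward the nearest boundary.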
The GVF snake algorithm operates on the grayscale image in which the real
boundary information, particularly weak boundaries, has been preserved. This al-
gorithm is able to detect the weak edges between ice floes, and to ensure that the
detected boundary is closed. As an example, shown in Fig. 2(b), given an initial
contour (red curve), the snake finds the floe boundary (green curve) after a few
iterations (yellow curves). The GVF snake algorithm relaxes the requirements of
the initial contour. However, because the snake deforms to conform to the nearest salient contour, a proper initial contour for an object is still necessary. Especially when identifying many ice floes in an ice image, many initial contours are required for the GVF snake algorithm to detect all individual ice floe boundaries, and these initial contours should have proper locations, sizes and shapes.
Fig. 2 gives an example showing the floe boundary detection results affected
by initializing the contour at different locations. In Fig. 2(a), the initial contour
is located in the water, close to the ice boundaries. The snake rapidly detects boundaries; however, they are the boundaries of the water region rather than of the ice. When
initializing the contour at the center of an ice floe, as shown in Fig. 2(b), the
snake accurately finds the boundary after a few iterations even though the initial
contour is some distance away from the floe boundary. A weak connection will also
be detected if the initial contour is located on it, as shown in Fig. 2(c). However,
when the initial contour is located near the floe boundary inside the floe, as shown
in Fig. 2(d), the snake may only find a part of the floe boundary near the initial
contour (it should be noted that the curve is always closed regardless of how it
deforms, even in the cases of Fig. 2(c) and Fig. 2(d), which appear to be non-
closed curves. This behavior occurs because the area bounded by the closed curve
tends toward zero). This example indicates that, although the snake will find a boundary regardless of where the initial contour is located, the result obtained when the initial contour lies inside the floe and close to the floe center is the most effective.
In addition to location, the size of the initial contour also affects the results of ice floe boundary detection. The initial contour in the GVF snake algorithm does not need to be as close to the true boundary as in the traditional snake
algorithm. However, if the initial contour, located at the floe center and inside of
(a) Initial contour 1 located in the water; the water region boundary is found. (b) Initial contour 2 located at the center of an ice floe; the whole floe boundary is found. (c) Initial contour 3 located at a weak connection; the weak connection is found. (d) Initial contour 4 located near the floe boundary inside the floe; only a part of the floe boundary is found.
Fig. 2. Initial contours located at different positions and their corresponding curve evolutions.
The red curves are the initial contours, the yellow curves are iterative runs of the GVF snake
algorithm, and the green curves are the final detected boundaries.
the floe, is too small, it will be slightly “far away” from the floe boundary and more
iterations will be needed for the snake to find the boundary. The snake may also
converge to an incorrect result if the initial contour is farther from the floe boundary, especially when the grayscale of the floe is uneven. Fig. 3 serves as an
example. Fig. 3(a) contains some light reflection in the middle of a model ice floe
(in an ice tank) where the pixels that belong to the reflection are lighter than the
other pixels of the floe. Fig. 3(d) contains speckle inside a sea ice floe, where the pixels of the speckle are darker. These phenomena affect the boundary detection when the initial contour (the red curves in Fig. 3(b) and Fig. 3(e)) is too small and not close to the actual boundary. The snake takes many steps (the yellow curves in Fig. 3(b) and Fig. 3(e)) and finds only a part of the floe boundary (the green curve in Fig. 3(b)), or fails to find the complete boundary because it is blocked by the speckle
(the green curve in Fig. 3(e)). If we enlarge the initial contour, as shown in Fig.
3(c) and Fig. 3(f), the initial contour allows for a faster determination of the entire
floe boundary. Therefore, the initial contour should still be set as close as possible
to the actual floe boundary.
In conclusion, to increase the efficiency of the ice floe boundary detection method based on the GVF snake algorithm, the initial contours should be adapted to the floe size and located inside the floe, as close to the floe center as possible.18 In image analysis, the ice floes can be separated from water by converting the ice image into a binary one using a thresholding method or the k-means clustering method. These methods make it easy to locate the initial contours inside the ice floes. Thus, the binarized ice image and its distance transform are used to automatically initialize contours for evolving the GVF snake efficiently in ice floe boundary detection. This automatic contour initialization algorithm proceeds as follows:
Step 1: Convert the ice image into a binary image after separating ice regions from water regions, in which case the pixels with value ‘1’ indicate ice and the pixels with value ‘0’ indicate water; see Fig. 4(a) and Fig. 5(b).
Step 2: Perform the distance transform to the binarized ice image. Find the regional
(a) Model ice floe image (b) A small contour ini- (c) A large contour ini-
with light reflection. tialized at the model ice tialized at the model ice
floe center, giving con- floe center, giving con-
vergence of the snake to vergence of the snake to
the incomplete bound- the correct boundary.
ary.
(d) Sea ice floe image (e) A small contour ini- (f) A large contour ini-
with speckle. tialized at the sea ice floe tialized at the sea ice
center, giving erroneous floe center, giving con-
evolutions of the snake. vergence of the snake to
the correct boundary.
Fig. 3. Initial circles with different radii and their curve evolutions. The red curves are the initial
contours, the yellow curves are iterative runs of the GVF snake algorithm, and the green curves
are the final detected boundaries.
maxima shown as the green numerals in Fig. 4(b) and green ‘+’ in Fig. 5(d).
Step 3: Merge those regional maxima that lie within a short distance (within a threshold Tseed) of each other. Find the “seeds”, which are the centers of the regional maxima and merged regions, shown as red ‘+’ in Fig. 4(b) and Fig. 5(e).
Step 4: Initialize circular contours located at the seeds. The radius of each circle is chosen according to the pixel value at the seed in the distance map; see the blue circles in Fig. 4(b) and Fig. 5(e).
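Steps 2–4 above can be sketched as follows (an illustrative SciPy version, not the chapter's implementation; `t_seed` plays the role of Tseed, and `radius_scale` is a hypothetical parameter standing in for the radius-selection rule):

```python
import numpy as np
from scipy import ndimage

def initialize_contours(binary_ice, t_seed=10, radius_scale=0.7):
    """Automatic contour initialization (Steps 2-4 of the text).

    binary_ice   : 2-D array, 1 for ice pixels, 0 for water pixels.
    t_seed       : merge regional maxima closer than this distance (Tseed).
    radius_scale : fraction of the seed's distance-map value used as the
                   circle radius, keeping the circle inside the floe.
    Returns a list of (row, col, radius) initial circles.
    """
    # Step 2: distance transform of the binary image and its regional maxima.
    dist = ndimage.distance_transform_edt(binary_ice)
    local_max = (ndimage.maximum_filter(dist, size=3) == dist) & (dist > 0)
    coords = np.argwhere(local_max).astype(float)

    # Step 3: greedily merge maxima within t_seed of each other into seeds.
    seeds = []
    for c in coords:
        for s in seeds:
            if np.hypot(*(c - s["center"])) < t_seed:
                s["members"].append(c)
                s["center"] = np.mean(s["members"], axis=0)
                break
        else:
            seeds.append({"center": c, "members": [c]})

    # Step 4: one circular initial contour per seed; the radius follows the
    # distance-map value at the seed so the circle stays inside the floe.
    circles = []
    for s in seeds:
        r, c = np.round(s["center"]).astype(int)
        circles.append((r, c, radius_scale * dist[r, c]))
    return circles

# Two disconnected square "floes" yield two initial circles.
ice = np.zeros((40, 40), dtype=int)
ice[5:15, 5:15] = 1
ice[22:38, 20:36] = 1
circles = initialize_contours(ice)
```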
At Step 2 in this algorithm, a regional maximum in the distance map of the
binarized ice image ideally corresponds to the center of an ice floe, but in many cases more than one regional maximum is detected. Thus, regional maxima that lie within a short distance of each other are merged (e.g., by a dilation operator) into a single one at Step 3. The circular shape is chosen for the initial contour at Step 4 because it deforms to the floe boundary more uniformly than other shapes when the floe’s irregular shape and orientation are unknown. Moreover,
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0
0 1 1 1 1 1 1 0 0 1 2 2 2 2 1 0
0 1 1 1 1 1 1 0 0 1 2 3 3 2 1 0
0 0 1 1 1 1 1 0 0 0 1 2 3 2 1 0
0 0 1 1 1 1 0 0 0 0 1 1 2 1 0 0
0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fig. 4. (a) A binarized ice image (left) and (b) its distance map (right).
the use of the seed’s pixel value in the distance map as the basis for selecting the radius of the circle ensures that the initial contour (circle) is contained strictly inside the floe and adapted to the floe size. Therefore, this contour initialization algorithm satisfies the requirements on the initial contour for the GVF snake without manual interaction.
After initializing the contours, the GVF snake algorithm is run on each contour to identify the floe boundary. Superimposing all the boundaries on the binarized ice image, i.e., setting all the identified boundary pixels to value ‘0’ (note that the boundary pixels can be specifically labeled for special handling in subsequent use), results in the separation of the connected ice floes. This GVF snake-based ice floe segmentation procedure, carried out on a sea ice floe image, is shown in Fig. 5.
It should be noted that regional maxima separated by more than the given threshold Tseed will not be merged into one seed at Step 3 of the introduced contour initialization algorithm. This means that some floes may have more than one seed. Two or more seeds for one ice floe will not affect its boundary detection, but they may increase the computational time.
Fig. 5. GVF snake-based ice floe segmentation. (a) Sea ice floe image. (b) Binarized image of Fig. 5(a); the ice floes are connected. (c) Distance map of Fig. 5(b). (d) Binary ice floe image with regional maxima (green ‘+’). (e) Binary ice floe image with seeds (red ‘+’) and initial contours (blue circles). (f) Segmentation result; the connected ice floes are separated.
Fig. 6. (a) Ice floe image with speckle. (b) Segmentation result of Fig. 6(a). (c) Shape enhancement result of Fig. 6(b).
It should be noted that the arrangement of ice pieces in order of increasing size is
required for the morphological cleaning. Otherwise, the smaller ice piece contained
in a larger ice floe may not be removed.
Sea ice typically shows a wide variability of ice floe sizes, together with a content of brash ice and slush and, possibly, a snow cover. Usually, part of the ice pixels have low intensity values close to those of water pixels, as seen in a MIZ image in Fig. 7(a), and
they may not be identified by the bi-level Otsu thresholding method. The k-means
clustering method with three or more clusters is then applied to determine more ice
pixels. By comparing the difference between bi-level Otsu thresholding detection
result and k-means clustering detection result, we obtain “dark ice” pixels, as shown
in Fig. 7(d). We will see later that creating the individual “light ice” and “dark
ice” image layers is advantageous for the computation of the initial contours for the
GVF snake algorithm.
(c) Ice detection using the k-means method with 3 clusters. (d) “Dark ice”, the difference between Fig. 7(b) and Fig. 7(c).
Note that the results of ice pixel detection using the same method but with different numbers of levels may be similar to each other;9 see, for example, Fig. 1(d) and Fig. 1(e), which show ice pixel detection results using k-means clustering with two and three clusters, respectively. The difference between these two resulting images is too minor for the “dark ice” pixels to be determined. Therefore, it is necessary to use two different methods for the extraction of the sea ice pixels in order to widen the gap between the results. Also note that k-means clustering
is computationally faster than the multi-level Otsu thresholding method. Thus, the
bi-level Otsu thresholding method is used to detect the “light ice” pixels, and the
k-means clustering method with three or more clusters is used to determine “dark
ice” pixels.
The bi-level Otsu method detects fewer ice pixels; however, the ice pixels under-detected by the bi-level Otsu method result in more “holes” in the binarized image, as seen in the “light ice” image layer in Fig. 7(b), and this is essential for the subsequent initialization of the contours for the GVF snake algorithm, in order to separate the connected ice floes in areas where the sea ice is crowded. Those under-detected ice pixels can then be compensated by the “dark ice” pixels detected by the additional k-means clustering with three or more clusters. Conversely, if we use the k-means clustering method to detect “light ice” or more ice pixels, there might be few “holes” among a massive amount of mutually connected ice floes, as seen in Fig. 1(d) and Fig. 7(c), which makes it difficult to initialize the contours for the GVF snake algorithm. Therefore, both “light ice”
and “dark ice” image layers are needed for a more accurate result, especially for
individual ice floe identification.
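The construction of the two layers can be sketched as follows (an illustrative NumPy version; the simple intensity thresholds in the example are stand-ins for the actual bi-level Otsu and k-means detectors, and the function name is hypothetical):

```python
import numpy as np

def split_ice_layers(light_mask, kmeans_ice_mask):
    """Build the "light ice" and "dark ice" layers from two detections.

    light_mask      : bi-level Otsu result -- True for "light ice" pixels.
    kmeans_ice_mask : k-means (3+ clusters) result -- True for all pixels
                      assigned to any ice cluster.
    "Dark ice" is the difference between the two detections: pixels that
    k-means labels as ice but bi-level Otsu misses.
    """
    light = np.asarray(light_mask, dtype=bool)
    all_ice = np.asarray(kmeans_ice_mask, dtype=bool) | light
    dark = all_ice & ~light
    return light, dark

# Toy row of pixels: bright floe, submerged "dark ice", and water.
intensity = np.array([220, 210, 120, 115, 30, 25])
light, dark = split_ice_layers(intensity > 150,   # stand-in Otsu threshold
                               intensity > 100)   # stand-in k-means ice set
# light -> [T, T, F, F, F, F], dark -> [F, F, T, T, F, F]
```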
To start the GVF snake algorithm, the “light ice” and “dark ice” layers are used
individually to calculate the initialization of the contours. Then the GVF snake
algorithm is run to individually derive “light ice” segmentation as seen by the white
ice pieces in Fig. 8, and the “dark ice” segmentation as seen by the gray ice pieces
in Fig. 8. Collecting all the ice pieces in the “light ice” and “dark ice” segmented
image layers results in the final segmented image, as exemplified in Fig. 8.
It should be noted that the “light ice” and the “dark ice” should be labeled
differently in the final segmented image. Otherwise, it may be impossible to separate
some “light ice” and “dark ice” pieces if they are connected.
4.1.3. Marginal Ice Shape Enhancement and Final Image Processing Result
In many cases, the grayscale of an ice floe is uneven, as seen in Fig. 9(a). The
lighter part of the floe is considered as “light ice” (the white pixels in Fig. 9(b)
and Fig. 9(c)), while the darker part is considered as “dark ice” (the gray pixels in
Fig. 9(b) and Fig. 9(c)). This means the ice floe, as shown in Fig. 9(b), cannot be
Fig. 8. Sea ice segmentation image. The white ice is the segmentation result for the “light ice”
in Fig. 7(b), and the gray ice is the segmentation result for the “dark ice” in Fig. 7(d).
completely identified when it has both “light ice” pixels and “dark ice” pixels. If we perform the ice shape enhancement on the “light ice” segmentation and the “dark ice” segmentation independently, there will be overlap between the resulting individual
light ice piece identification and individual dark ice piece identification. This means
that some ice pixels may be identified as belonging to different ice floes, with the
risk that large ice floes are still incomplete. Therefore, all the detected ice pieces from both the segmented “light ice” and “dark ice” layers should be labeled as one input to the ice shape enhancement step, to ensure the completeness of each ice floe and the removal of smaller ice pieces contained in a larger ice floe.
(a) Ice floe image with uneven grayscale. (b) Segmentation result of Fig. 9(a). (c) Shape enhancement result of Fig. 9(b).
Fig. 9. Sea ice shape enhancement. The white pixels are the “light ice” pixels, and the gray pixels are the “dark ice” pixels.
The ice shape enhancement results in the identification of individual ice pieces.
Furthermore, to distinguish brash ice from ice floes, we define a brash ice threshold
parameter (pixel number, area, or characteristic length) that can be tuned for each
application. The ice pieces with sizes larger than the threshold are considered to
be ice floes, while smaller pieces are considered to be brash ice. The remaining ice
pixels, e.g., single ice pixels or the ice pieces that are too small to be treated as
brash ice, are labeled as slush. This results in four layers of a sea ice image (using Fig. 7(a) as an example): ice floe (Fig. 10(a)), brash ice (Fig. 10(b)), slush (Fig. 10(c)), and water (Fig. 10(d)). (Note that the incorrect identification in the foggy bottom-left corner of the figure can be improved by, e.g., processing this blurred region locally.18 However, we keep this error here as a special case and will discuss it in later sections.) Moreover, the residue ice, which consists of the detected edge pixels between the connected ice pieces, is in this example considered as slush (since there is often an edge layer of slush between two ice floes) and included in Fig. 10(c). However, the residue ice, as shown in Fig. 10(e), can also simply be labeled as “residue ice” and defined specifically by the user and the application of the data.
Based on the four layers, a total of 2888 ice floes and 3452 brash ice pieces are identified from Fig. 7(a). The coverage percentages are 58.00% ice floe, 4.85% brash ice, 21.21% slush, and 15.94% water. The total ice concentration is 84.06%, and the histogram of ice floe size distribution grouped by mean caliper diameter (MCD) is shown in Fig. 11.
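The size-based split into floe, brash, slush, and water layers can be sketched as follows (an illustrative SciPy version; both threshold values are hypothetical and, as stated above, application-tunable):

```python
import numpy as np
from scipy import ndimage

def classify_ice_pieces(segmented_ice, brash_threshold=20, slush_threshold=4):
    """Split a segmented binary ice image into floe / brash / slush / water
    layers by piece size (measured here in pixels; area or characteristic
    length could be used instead)."""
    labels, n = ndimage.label(segmented_ice)
    sizes = ndimage.sum(segmented_ice, labels, index=np.arange(1, n + 1))

    floe = np.zeros_like(labels, dtype=bool)
    brash = np.zeros_like(labels, dtype=bool)
    slush = np.zeros_like(labels, dtype=bool)
    for i, size in enumerate(sizes, start=1):
        piece = labels == i
        if size >= brash_threshold:
            floe |= piece          # large pieces are ice floes
        elif size >= slush_threshold:
            brash |= piece         # mid-sized pieces are brash ice
        else:
            slush |= piece         # remaining tiny pieces are slush
    water = (segmented_ice == 0)
    return floe, brash, slush, water

# One large floe (36 px), one brash piece (6 px), one slush pixel.
ice = np.zeros((20, 20), dtype=int)
ice[2:8, 2:8] = 1       # 36 pixels -> floe
ice[12:14, 12:15] = 1   # 6 pixels  -> brash ice
ice[17, 3] = 1          # 1 pixel   -> slush
floe, brash, slush, water = classify_ice_pieces(ice)
```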
Fig. 10. (a) Layer showing the “ice floes” (marked with white dots at floe centers); the color bar shows the MCD of the ice floes. (b) Layer showing the “brash ice”. (c) Layer showing the “slush”. (d) Layer showing the “water”. (e) Residue ice (edge pixels).
Fig. 12. A close-up view of floe ice, brash ice and their corresponding numerical representation.
overlap are labeled with red color in Fig. 13(a). Afterwards, for each calculation
iteration, the collision detection algorithm identifies existing overlaps; and the col-
lision responses are calculated and applied to eliminate the overlaps. Fig. 13(b)
shows one snapshot of the ice field domain, within which overlaps are gradually resolved. Notably, to save computational resources, not all the ice floes are involved in each iteration of the calculation. Ice floes without overlap that are far away from the overlapped ice floe clusters are kept in “sleeping mode” in the adopted algorithm (see Fig. 13(b)).
Fig. 13(b) shows that the ice floes in the ice field’s bottom-left corner have more overlaps. Nevertheless, by applying the non-smooth DEM calculation procedures, all the overlaps are eventually resolved in Fig. 13(c), and the finally generated ice field is shown in Fig. 13(d). After resolving the overlaps, the exact locations of the ice floes in Fig. 13(d) and Fig. 13(a) are not the same, but the differences are minor. Each ice floe’s shape and size, as well as the overall ice mass, are conserved.
Fig. 13. (a) Initial phase of the ice floe field with overlap (legend: ice floes without overlap; with overlap). (b) Calculation phase of the ice floe field with overlap (legend: sleeping; active with overlap; active without overlap). (c) All overlaps are resolved. (d) Finally generated ice floe field.
Similarly, brash ice can be imported into the same non-smooth DEM based simulator and treated as discrete bodies. From a non-smooth DEM calculation’s point of view, simplifying each brash ice piece as a circular disk of equivalent area makes the collision detection and the consequent collision response calculation much more efficient compared to arbitrary polygons. Given the amount of brash ice and its relatively small mass, this simplification is reasonable and has been adopted in previous studies.31–33
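The collision detection and response for disk-shaped bodies can be sketched as follows (a much-simplified illustration of pairwise overlap resolution, not the non-smooth DEM solver used here; names and iteration scheme are assumptions):

```python
import numpy as np

def resolve_disk_overlaps(centers, radii, n_iter=100):
    """Iteratively push apart overlapping circular disks (the simplified
    brash-ice representation): detect every overlapping pair, then move
    each disk of the pair along the line between centers by half the
    penetration depth, until no overlap remains."""
    centers = np.asarray(centers, dtype=float).copy()
    radii = np.asarray(radii, dtype=float)
    n = len(radii)
    for _ in range(n_iter):
        moved = False
        for i in range(n):
            for j in range(i + 1, n):
                d = centers[j] - centers[i]
                dist = np.hypot(*d)
                overlap = radii[i] + radii[j] - dist
                if overlap > 0:
                    # Collision response: separate along the center line.
                    direction = d / dist if dist > 0 else np.array([1.0, 0.0])
                    centers[i] -= 0.5 * overlap * direction
                    centers[j] += 0.5 * overlap * direction
                    moved = True
        if not moved:        # all overlaps resolved
            break
    return centers

disks = np.array([[0.0, 0.0], [1.5, 0.0], [10.0, 10.0]])
radii = np.array([1.0, 1.0, 1.0])
out = resolve_disk_overlaps(disks, radii)
# The first two disks end up 2.0 apart; the distant third disk never moves.
```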
For the current demonstration, the identified brash ice and its numerical repre-
sentation are additionally imported to the ice field in Fig. 13(a). This is illustrated
in Fig. 14(a), which shows relatively more overlaps. An enlarged view within the
field center is also presented, where the circular disk-shaped bodies are the brash
ice representations.
For the current ice field’s composition, i.e., 58.00% ice floe and 4.85% brash ice, the calculation times to resolve all overlaps for the cases with and without brash ice show no significant difference. In both cases, the bottleneck for calculation time is the overlap resolution among the bottom-left corner’s large ice floes. However, it is expected that as the amount of brash ice increases, the calculation time would also increase, eventually becoming the decisive bottleneck for the calculation.
Fig. 14. Ice field generation with both floe ice and brash ice.
To calculate ice concentration, both the Otsu thresholding and the k-means clustering methods separate ice pixels from water pixels by unconditionally dividing an image into two or more classes. This implicitly assumes that there must be some water and some ice in the image, and it fails in the boundary cases where the ice concentration is 0% or 100%, which have to be dealt with as particular cases. How to choose the number of classes automatically for arbitrary image data is critical, and there is no explicit mathematical criterion that can be evaluated to find it. Neither method is adaptive to varying light conditions, varying shading, melt ponds, surface water on the ice, etc. Instead, they must be tuned to become as robust as possible to such variations for a given place and ambient conditions. Moreover, neither method includes detailed ice physics beyond the grayscale values of the image. Therefore, the learning-based object detection could
To determine ice floe statistics and properties, the GVF snake algorithm is adopted
to identify individual floes due to its superior detection capability of weak bound-
aries. The GVF snake uses a diffusion of the gradient vectors of an edge map as the
source of its external forces, resulting in a smooth attraction field that is defined
in the whole image and spreads the influence of boundary concavities. However,
due to the inherent competition of the diffusion process, the capture range of the
strong edges may dominate the external force field. The external forces near the
weak edges, which are close to the stronger ones, will be too weak to pull the
snake toward the desired weak boundaries. As a result, the snake is likely to pass
over the weak edge and terminate at the corresponding strong edge. Hence, the
under-detection of the blurred ice edges by the GVF snake results in the incorrect identification in the foggy bottom-left corner of Fig. 7(a). A solution to this
would be to process this region as a particular case.18
Furthermore, the GVF snake-based method separates the connected ice floes and identifies individual floes one by one in the image, and it may take from minutes to hours to complete, depending on the image size and the number of ice floes in the image. It typically requires more computational time for larger images with more ice floes, which challenges real-time applications. Thus, an adaptive, faster, and parallelized algorithm for identifying individual ice floes needs to be developed.
With the identified sea ice field parameters involving the geometries and locations of ice floes and brash ice, a non-smooth DEM based method is adopted to assign basic physics to each ice floe and brash ice piece. The digitalized ice field usually involves overlaps among different bodies, mainly because of the geometrical simplifications made for ice floes and brash ice. The primary intention is hence to resolve these overlaps. As a demonstration, the non-smooth DEM successfully resolved all these overlaps among ice floes and brash ice. Notably, given the simplified circular disk-shaped numerical representation of brash ice, reaching the final ice field without overlap demands minimal additional computational time. However, as the amount of brash ice keeps increasing, further simplification might be desired, e.g., modeling brash ice as a viscous flow governed by conservation laws as a material collection.
References
CHAPTER 2.4
Deep learning implementations using convolutional neural nets (CNNs) have recently demonstrated promise in many areas of medical imaging. This chapter presents two aspects of CNN use with magnetic resonance imaging (MRI) of the brain.
First, we describe production-level output of brain segmentation from whole head
images, a crucial processing task that is resource-intensive under standard CPU
methods and human quality control. With an extremely large archive of MRIs for
training and testing, our segmentation performs robustly across multiple imag-
ing cohorts, with greatly increased throughput. Second, we present robust brain
structure edge labeling, which enables studies of greater statistical power than Canny edge detection or hand-crafted probabilistic algorithms.
1. Introduction
Deep learning refers to a variety of techniques for extracting features from data,
usually by neural networks involving a hierarchy of many layers.1,2 This chapter
reports on our use of deep learning via convolutional neural nets (CNNs)3 to au-
tomate and improve the robustness and statistical power of two necessary tasks
for processing structural brain magnetic resonance images (MRIs). CNNs have re-
cently been used in a variety of applications for medical image processing.4–6 CNN
applications have the potential to equal or exceed expert medical image evaluation
and to greatly speed up computationally intensive aspects of image processing.
In this chapter we describe two projects using CNNs for rapid and robust identi-
fication of brain locations within structural MRIs. First, segmentation of the brain
from the whole head (i.e. skull-stripping) is an indispensable task in any standard
pipeline of MRI processing. However, it can be resource intensive, thus creating a
processing bottleneck in the analysis of large data sets. For example, using a state-
of-the-art technique of atlas-matching,7 in which at least 10 carefully segmented
atlas images are nonlinearly matched to a target, skull-stripping a single whole
head MRI requires around 27 CPU hours and must be followed by human qual-
ity control (QC) taking typically an hour or more. In answer to this, we present
a method that greatly reduces processing time while robustly performing over a
variety of imaging cohorts. Next, brain structural edge labeling is vital for many
analyses. Here we will describe the use of CNN-derived edge labels to enhance lon-
gitudinal registration of same-subject scans, an important task in the burgeoning
field of longitudinal analysis. This approach leads to improved robustness of edge
labeling and increased statistical power for computing longitudinal atrophy rates.
Both of the tasks that we focus on in this chapter fall under the heading of
segmentation, or the labeling of 3D voxel locations in a structural brain MRI.2 The
gold standard for this process has typically been manual human labeling from draw-
ing on images slice-by-slice, but this is slow, subject to human error and effectively
limits the practical size of useful data sets of labeled images. The promise of deep
learning derives from its potential to rival or exceed human expert recognition of
brain structures, while performing automatically or with reduced need for human
quality control, in much shorter times than are needed for manual labeling. Chal-
lenges to the success of deep learning in medical image processing stem from the
relative dearth of high-quality ground truth labeling (i.e. the very problem that
deep learning aims to address) which is necessary for training to achieve robust
performance given varied image quality and scanner characteristics.2
2. Methods
This section outlines our deep learning hardware and software environment, followed
by specific methods for each of the two tasks we addressed.
2.1.2. Software
The TensorFlow platform (https://www.tensorflow.org/) was used to implement
and train the neural network, as well as to calculate the similarity and volume
metrics. The relevant steps for the analyses presented below were performed by in-
house code developed as part of our image processing suite. These were: multi-atlas
brain mask extraction for ground truth training, testing and validation (Sec. 2.2.2
and Sec. 2.2.3), structural edge labeling using our implementation of a 3D Canny
labeling algorithm for edge ground truth labels (Sec. 2.3.2), and longitudinal same-
subject sequential scan registration to compute patterns of atrophy (Sec. 2.3.4),
followed by statistical analysis of the results (Sec. 2.3.6 and Sec. 2.3.7). All addi-
tional processing was done in Python using built-in as well as the following external
modules: NiBabel, NumPy, SciPy, scikit-image, Pydicom, pytoml, and tqdm. Re-
sult analysis was performed using Pandas and plotted using Bokeh via HoloViews
in JupyterLab.
Fig. 1. (a) Whole head MRI. (b) Brain location mask. (c) Extracted brain.
Table 1. Segmentation CNN encoder architecture by stage.

Stage   Conv layers   Kernel    Filters
1       2             3×3×3     32
2       2             3×3×3     128
3       3             3×3×3     256
4       3             3×3×3     512
5       3             3×3×3     1024
For training and testing, we used 11,663 structural T1-weighted MRI brain scans
selected from our archive of almost 26,000 scan sessions, representing data from
multiple national imaging studies. The composition of this set is detailed in Table 2.
The diversity of imaging cohorts by subject demographics and MRI acquisitions
allowed us to train for robust segmentation in the face of image variability due to
multiple factors.
Each structural MRI in our laboratory has an accompanying brain mask created
by an automated multi-atlas segmentation procedure7 followed by human quality
control. Our brain segmentation protocol includes the entire intracranial cavity
(ICC) defined out to the pia mater. This deviates from many standard brain seg-
mentations, which stop at the brain boundary. By segmenting the larger space, we
obtain a more robust and invariant measure of head size over time.
We started with previously generated ICC masks via this method as our ground
truth (GT). Our full training set included about 90% as masks from atlas-based
segmentation, supplemented by about 10% CNN-generated masks from previous
iterations of training. A handful of volumes were excluded from the set due to large
slice thickness, excessive noise (e.g. ghosting), or severe pathologies such as large
tumor or stroke.
Network training was deeply supervised, with loss function penalties calculated
at each stage as well as the final fused prediction. Training example pairs were
sampled one at a time in round-robin fashion by cohort. Individual cohort sets
were continually cycled to maintain influence until a fixed number of training steps
had been completed. Training was completed in about 32 hours using the following
hyperparameters:
• Loss function: Summed cross-entropy of each stage and the fused prediction
compared to ground truth.13
• Optimization: Exponential moving average of a stochastic gradient descent
with Nesterov momentum.14
• Learning rate: 10−2
• Momentum: 0.9
• Moving average decay: 0.999
• Batch size: 1
• Steps: 25,000
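As an illustrative sketch of the optimizer configuration above (plain NumPy on a toy quadratic loss stands in for TensorFlow training of the actual network; the function and variable names here are ours, not the chapter's code):

```python
import numpy as np

def nesterov_sgd_with_ema(grad_fn, w0, lr=1e-2, momentum=0.9,
                          ema_decay=0.999, steps=25000):
    """Toy SGD with Nesterov momentum, tracking an exponential moving
    average (EMA) of the weights, mirroring the hyperparameters listed."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    ema = w.copy()
    for _ in range(steps):
        g = grad_fn(w + momentum * v)   # look-ahead gradient (Nesterov)
        v = momentum * v - lr * g
        w = w + v
        ema = ema_decay * ema + (1 - ema_decay) * w
    return w, ema

# toy convex loss L(w) = 0.5 * ||w||^2, so grad(w) = w
w, ema = nesterov_sgd_with_ema(lambda w: w, w0=[4.0, -2.0], steps=2000)
```

The EMA of the weights, rather than the raw final iterate, is what the text evaluates; on this toy problem both trajectories settle near the minimum.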
Fig. 2. Neural network diagram tracking the 3D volume through the 5-stage encoder and fused
decoder. Blocks: 3 × 3 × 3 convolution followed by a ReLU nonlinearity; 2 × 2 × 2 max pooling;
1 × 1 × 1 convolution reduction; upsampling convolutional transpose.
are calculated non-destructively, and applied on the fly right before neural network
processing.
Prediction maps for voxel brain membership from the CNN are binarized using
a threshold of p > 0.34 to form the brain segmentation mask.
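A minimal sketch of this binarization step (the probability map here is random and the voxel size hypothetical; only the 0.34 threshold comes from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
prob = rng.random((8, 8, 8))     # hypothetical CNN per-voxel brain probability
voxel_volume_mm3 = 1.0           # hypothetical 1 mm isotropic voxels

# binarize at the threshold used in the text to form the brain mask
brain_mask = prob > 0.34
mask_volume_ml = brain_mask.sum() * voxel_volume_mm3 / 1000.0
```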
We evaluated the performance and quality of the CNN brain mask predictions using
three comparison measures: model generalization, model consistency and resource
efficiency. Model generalization refers to the ability of the trained neural net to
match ground truth masks of the test samples across a variety of imaging cohorts.
This is important because imaging cohorts vary by characteristics of scanner and
participants, and we want to achieve consistently good matches regardless of cohort.
We used the Dice similarity coefficient (DSC)11,12,15 for match quality between CNN
and ground truth masks. The DSC is defined as follows:
DSC = 2|A ∩ B| / (|A| + |B|),   (1)
where A is the set of predicted voxels in the CNN mask and B is the set of voxels
in the GT mask.
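As a minimal sketch (assuming the two masks are binary NumPy arrays), Eq. (1) can be computed as:

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask):
    """Dice similarity coefficient between two binary masks (Eq. 1)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * intersection / total

# hypothetical toy example on a 4x4x4 volume
a = np.zeros((4, 4, 4), dtype=bool)
b = np.zeros((4, 4, 4), dtype=bool)
a[:2] = True   # 32 voxels
b[:3] = True   # 48 voxels; overlap with a is 32 voxels
print(dice_coefficient(a, b))  # 2*32 / (32+48) = 0.8
```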
Model consistency is the ability to generate brain masks for longitudinal same-
subject repeated scans that are close in volume. This is crucial because of our
protocol segmenting the ICC volume, which unlike brain volume is constant over
time, meaning that estimated ICC volumes should ideally be unchanging over re-
peated scans. We used the maximum volumetric differences over all scans of a
subject to assess this consistency. Resource efficiency encompasses the two aspects
of computational and human resource time. Our current atlas-based brain mask
computations require about 27 CPU hours of computation followed by an average of
45-75 minutes of human QC. We compare the corresponding times for computation
and human QC of the CNN masks.
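The consistency measure (max-min ICC volume across a subject's repeat scans) is a simple aggregation; a sketch with Pandas, which the pipeline described earlier uses for result analysis (the subject IDs and volumes below are invented):

```python
import pandas as pd

# hypothetical per-scan ICC volume estimates (ml) for longitudinal subjects
scans = pd.DataFrame({
    "subject": ["s1", "s1", "s1", "s2", "s2"],
    "icc_ml":  [1450.2, 1452.9, 1449.8, 1398.0, 1401.5],
})

# model consistency: max - min ICC volume across each subject's repeat scans
vol_range = scans.groupby("subject")["icc_ml"].agg(lambda v: v.max() - v.min())
print(vol_range)
```

Since the ICC is constant over time, smaller ranges indicate a more consistent segmentation method.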
Our second analysis examined the ability of CNN brain structural edge predictions
to enhance the sensitivity of brain atrophy rate computations. In previous arti-
cles, we showed that supplementing structural MRI images with estimates of edge
presence increased our sensitivity and localization of atrophy maps, leading to aug-
mented statistical power for detecting differences in atrophy rates between impaired
and normal cohorts.16,17 In the current chapter, we investigate whether CNN edge
predictions further enhanced these characteristics.
The CNN architecture for edge recognition was modified from that for brain segmen-
tation described above (see Table 1 for reference to the segmentation architecture).
We reduced the number of stages from 5 to 3, with the intention of loosening the
context restrictions imposed by later stages in the mask segmentation architecture:
we wanted edge patterns learned in one part of the brain to be recognizable in
other regions. We doubled the number of filters in both layers of the first stage, to
increase the network capacity for recognizing variations in small detail, given that
structural edges have finer and more varied characteristics than whole brain masks.
Ground truth training data consisted of 10,910 edge-labeled structural MRI images,
previously skull-stripped, from the ADNI, ADC, Framingham and VCID cohorts.
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu)
was launched in 2003 as a public-private partnership. The primary goal of ADNI
has been to test whether serial MRI, positron emission tomography, other biological
markers, and neuropsychological assessment can be combined to measure the
progression of mild cognitive impairment and early AD. The principal investigator
of ADNI is Michael Weiner, MD, VA Medical Center and University of California,
San Francisco. For current information, see www.adni-info.org.
For testing, we used a further set of 1,070 subjects from the ADNI cohort which
had two serial longitudinal scans with at least one year interscan separation, since
the goal was not only to evaluate qualities of edge prediction, but their effectiveness
when used to augment longitudinal registration. Edge labeling for ground truth was
performed by an in-house implementation of 3D Canny edge detection,18 aimed at
delineating brain tissue boundaries between white matter (WM) and gray matter
(GM), and GM with cerebrospinal fluid (CSF).
For ground truth, the Canny edge labels were scaled to the interval [0,1] and then
thresholded at 0.1 to produce a binary edge mask. The only modification to the
training protocol for brain masks (see Sec. 2.2.3) was to increase the number of
training steps to 30,000. Testing was performed both by visual inspection of the
CNN-predicted edge maps and by evaluating their ability to enhance longitudinal
image registration (see Sec. 2.3.5 and Sec. 2.3.7 below).
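The ground-truth labeling step just described can be sketched as follows (a gradient-magnitude map stands in for the in-house 3D Canny implementation, which is not public; the synthetic volume and names are ours):

```python
import numpy as np

def binary_edge_labels(edge_strength, threshold=0.1):
    """Scale edge strengths to [0, 1] and threshold at 0.1 (the scaling and
    threshold from the text; the edge detector itself is a stand-in)."""
    e = np.asarray(edge_strength, dtype=float)
    span = e.max() - e.min()
    scaled = (e - e.min()) / span if span > 0 else np.zeros_like(e)
    return scaled > threshold

# hypothetical volume: a sharp boundary between two "tissues"
vol = np.zeros((16, 16, 16))
vol[8:] = 100.0
gz, gy, gx = np.gradient(vol)
strength = np.sqrt(gx**2 + gy**2 + gz**2)
edges = binary_edge_labels(strength)
```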
The study of brain change over time requires precise computations of local volume
change rates (tissue atrophy or CSF cavity expansions). These are accomplished
using non-linear registration between two structural MRI scans of the same subject
separated by a time interval. Local volume changes are computed from the logarithm
of the determinant of the 3 × 3 jacobian matrix of the deformation field's
partial derivatives (log-jacobians). The jacobian determinant yields the volume change
as a multiplicative factor at each image voxel, and the log-transform converts it to
a distribution centered at 0, with negative values indicating contraction (atrophy)
and positive values expansion. For small magnitudes of the determinant, the log-jacobian
approximates the local percentage volume change. Deformation fields are
computed via Navier-Stokes equations driven by a force generated from image mis-
match; solutions are velocities which are integrated to yield spatial deformations
needed to register the images. A full explanation is provided in our previous arti-
cles.16,17
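The log-jacobian computation described above can be sketched in NumPy (a synthetic displacement field stands in for the Navier-Stokes registration output; all names are ours):

```python
import numpy as np

def log_jacobian_map(disp, spacing=1.0):
    """Log of the Jacobian determinant of phi(x) = x + u(x).

    disp: displacement field u with shape (3, Z, Y, X).
    Negative values indicate local contraction (atrophy), positive values
    expansion; near 0 the value approximates percent volume change.
    """
    ndim, *shape = disp.shape
    jac = np.empty((ndim, ndim, *shape))
    for i in range(ndim):
        grads = np.gradient(disp[i], spacing)
        for j in range(ndim):
            jac[i, j] = grads[j]
            if i == j:
                jac[i, j] += 1.0   # d(phi_i)/dx_i = 1 + du_i/dx_i
    # move the 3x3 tensor axes last so det runs per voxel
    dets = np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))
    return np.log(dets)

# uniform 1% expansion along z: u_z = 0.01 * z
z = np.arange(8, dtype=float)
u = np.zeros((3, 8, 8, 8))
u[0] = 0.01 * z[:, None, None]
lj = log_jacobian_map(u)   # should be log(1.01) everywhere
```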
There we demonstrated that the accuracy and resulting statistical power of
the log-jacobian atrophy maps could be improved by incorporating estimates of
tissue or structural boundary likelihood at each point into the force field, since the
computed deformations rely heavily on mismatches of edges that have moved.16,17
In brief, the driving force F in the Navier-Stokes equation at each voxel is a weighted
sum of components derived from the gradient F1 of the image intensity mismatch
metric and the gradient F2 of a modulating penalty function. The weights use the
probability P (edge) of a structural boundary at a voxel to 1) allow strong effects of
the mismatch gradient while minimizing the penalty at highly likely edge locations,
and 2) dampen the driving force, which might be incorrectly high due to image
noise, while allowing full strength of the penalty gradient in areas more likely to be
inside homogeneous tissue:
F = P F1 + λ(1 − P )F2 , (2)
where λ is a penalty weighting factor.
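Eq. (2) is a voxelwise weighted combination; a minimal sketch (the toy fields below are random, and the edge-probability plane is hypothetical):

```python
import numpy as np

def driving_force(F1, F2, p_edge, lam=1.0):
    """Eq. (2): F = P*F1 + lambda*(1-P)*F2 at each voxel.

    F1, F2: vector force components, shape (3, Z, Y, X).
    p_edge: per-voxel structural-boundary probability, shape (Z, Y, X).
    """
    P = p_edge[None, ...]   # broadcast over the vector components
    return P * F1 + lam * (1.0 - P) * F2

rng = np.random.default_rng(1)
F1 = rng.standard_normal((3, 4, 4, 4))
F2 = rng.standard_normal((3, 4, 4, 4))
p = np.zeros((4, 4, 4))
p[2] = 1.0   # certain edge on one plane, no edge elsewhere
F = driving_force(F1, F2, p, lam=0.5)
```

At certain edges (P = 1) the mismatch gradient acts at full strength; deep inside homogeneous tissue (P = 0) only the damped penalty gradient remains, as the text describes.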
We compared atrophy rates among cognitively normal (CN), mild cognitive impairment
(MCI) and Alzheimer's disease (AD) groups, over statistically defined regions of interest
(statROIs),22 defined as brain regions that best characterize atrophy difference
rates between two groups. We computed statROIs for the pairs AD vs. CN and
AD vs. MCI using non-parametric cluster size permutation testing21 with 1000 it-
erations, to find significant (p < 0.05, corrected) clusters of voxels whose atrophy
difference T-values were at least 5 for the pair of groups. The statistical power es-
timate was based on the minimum sample size n80 needed to detect a 25% increase
in atrophy in an impaired cohort beyond that of CN with 80% probability:22,23
n80 = 2σimpaired²(z0.975 + z0.80)² / (0.25(μimpaired − μCN))²,   (3)
where μ is the mean statROI atrophy and σ the standard deviation for a given
group (impaired or CN), and for b = 0.975 or 0.80, zb is the value defined by
P(Z < zb) = b in the standard normal distribution.
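With Eq. (3) and standard-normal quantiles from the Python standard library, the computation can be sketched as follows (the atrophy means and standard deviation below are hypothetical, not the study's values):

```python
from statistics import NormalDist
import math

def n80(mu_impaired, mu_cn, sigma_impaired, power=0.80, alpha=0.05):
    """Eq. (3): minimum sample size to detect a 25% increase in atrophy
    in an impaired cohort beyond CN with the given power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # z_0.975
    z_b = NormalDist().inv_cdf(power)           # z_0.80
    effect = 0.25 * (mu_impaired - mu_cn)
    return 2 * sigma_impaired**2 * (z_a + z_b) ** 2 / effect**2

# hypothetical statROI atrophy rates (%/yr): AD mean -2.0, CN mean -0.5, sd 1.2
n = n80(-2.0, -0.5, 1.2)
print(math.ceil(n))  # → 161 subjects per group for these invented numbers
```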
3. Results
This section presents results of our current applications of CNN learning to the areas
of brain segmentation and brain structural edge labeling for improving longitudinal
image registration.
Fig. 3. Whisker plot distributions of the DSC between CNN masks and ground truth grouped by
cohort. The upper heavily dashed line at DSC = 0.984 is our mean estimated level of human inter-
rater performance vs consensus. The lower (lightly dashed) line at 0.977 is the best previously
reported mean of prediction mask vs. ground truth using LPBA40 and OASIS data.12
Because our protocol segments the ICC, whose volume is constant over time, the mask
volume predictions of sequential scans in one subject should be the same. Their
variability is a measure of the consistency of the segmentation method.
Fig. 4. Repeat scan ICC volume ranges (max-min per subject) of 117 subjects across 259 scans,
for ground truth masks finished by human QC (left) and CNN masks (right).
Table 3. Per-scan processing times (minutes): multi-atlas pipeline vs. CNN.

Step              Multi-atlas    CNN
Preprocessing     -              2
Mask generation   900-1600       0.15
Human QC          45-75          10
Totals            945-1675       12
Fig. 5. (a) Averaged CNN edge label predictions: thresholded map (p > 0.30) on template brain.
Red indicates p-values near 0.30; yellow, near 0.85. (b) Average CNN edge labels overlaid by
significant differences with Canny labels. Warm colors: CNN > Canny. Cool: Canny > CNN.
Regions of greater computed atrophy for the Grad-Enhanced method in white matter
(blue-purple in Fig. 6(c)) are less biologically plausible and may be a consequence
of that method's weaker localization to gray matter structures.
Fig. 6. (a) Average atrophy rates computed using CNN edge predictions. Blue indicates severe
(3-4% biannual loss) and green modest atrophy (around 1%). (b) Average atrophy computed using
Grad-Enhanced edge estimates. Same color scales and coding as in (a). (c) Significant atrophy
differences for CNN vs. Grad-Enhanced edges. Coding: CNN atrophy > Grad-Enhanced (warm) and
Grad-Enhanced > CNN (cool).
We computed the n80 minimum sample size measure (Sec. 2.3.7, Eq. (3)) using
statROIs capturing the regions of significant atrophy difference between AD vs.
CN, and AD vs. MCI cohorts. We did not compare CN and MCI cohorts because
we found little difference between these via either the CNN or Grad-Enhanced
methods.
Although the statROIs for each group difference covered broadly the same re-
gions in both methods, the CNN-derived statROIs showed both more precision to
small structures and more consistent coverage of regions whose atrophy is known to
be associated with cognitive decline. Resulting minimum sample sizes are displayed
in Table 4. For each comparison, the CNN method of atrophy calculation shows a
smaller sample size.
4. Discussion
We have outlined the methods and presented results for two applications of CNN-
based deep learning to structural recognition tasks in 3D MRI brain images. With
minor modifications, our CNN architectures were successfully adapted to segment-
ing the brain from the whole head, and to labeling structural boundary edges.
The products of each application demonstrated a robustness and consistency that
exceeded previous efforts, making them useful both for fast and consistent image
processing (in brain segmentation) and statistically powerful computations of lon-
gitudinal atrophy rates. These illustrate the flexibility and adaptability of deep
learning in disparate areas of brain imaging.
Brain segmentation is a necessary and routine step of any image processing pipeline.
Many solutions have been developed over a 20-year period (Ref. 9 provides a good
review). Prior to the application of deep learning to brain segmentation, most ap-
proaches required numerical parameter inputs in order to accommodate the charac-
teristics of particular scans. For example, Brain Extraction Tool (BET)8 has user
inputs for “fractional intensity threshold” that affects the size of the output seg-
mented mask and “threshold gradient” that influences the relative size of estimates
at the top and bottom of the brain. While default values can be used, they do not
work on every scan and systematic errors can result. Although computation times
for many algorithmic approaches are fast (for BET, about 2 seconds; see Table III of
Ref. 12 for an overview of CNN and non-CNN segmentation times), accuracies are
variable and the search continues for an algorithm that is robust across scanners.
An alternative to which we have turned in our laboratory is multi-atlas matching,7
which gives consistently acceptable results when paired with human QC. But it is
resource-intensive, thereby posing a limitation for processing large amounts of data.
In response, we aimed to develop a deep learning implementation with increases
of throughput over our current methods while maintaining robust generalizations
across cohorts and longitudinal consistency. The results of this chapter suggest
that these goals are feasible. Other recent efforts have demonstrated the utility of
CNN approaches to brain segmentation.11,12 However, these used relatively small
training sets which were of limited variety and segmentation quality. One of the
limitations for any deep learning application in medical imaging has been a dearth
of high quality ground truth. In this regard, our laboratory archives afford a rare
instance of large numbers (see Table 2) of high-quality GT generated from a com-
bination of multi-atlas matching and human QC, using our established protocol of
segmenting the ICC. This has given us a close-to-optimal setting for deep learn-
ing. Drawn from multiple imaging cohorts throughout the United States, our data
have provided sufficient numbers and variability to train our CNN segmentation
architecture across different scanners and subject populations. We documented its
capacity for generalization in Fig. 3. We further showed that the CNN segmenta-
tions improved upon the longitudinal consistency produced by our multi-atlas and
human QC method (Fig. 4). In sum, our CNN learning combines computation
times that are comparable to those of fast non-CNN approaches with improvements
over multi-atlas matching for the measures of consistency and human QC (Table 3).
This will allow us to process very large data sets that we anticipate coming online
in our laboratory over the next three years.
In future work, we aim to address the first limitation using iterations of training
in which our previous atlas-matching masks are replaced by CNN-generated masks
that show more consistency due to less human variability in the QC phase. To
be clear, the CNN masks still require human QC, but this is far less than for the
atlas-matching masks (Table 3), thus reducing the risks of QC analysts introducing
variability at the brain edges. Other work in the near future will include adapting
our volumetric techniques to segmentation of important brain sub-structures such
as the hippocampus24 (for which we now also use multi-atlas segmentation) and
subcortical nuclei25 whose boundaries tend to be indistinct in structural MRI.
Our application of deep learning to structural edge labeling was aimed at improv-
ing localization, biological accuracy and statistical power of voxel-based registration.
Nonlinear registration is a powerful technique for studying local brain variability.
In the cross-sectional setting, where many images are registered to a template, it
is necessarily imperfect since brains have individual cortical signatures that cannot
be completely matched. This is not the case in longitudinal registration of two
scans for the same subject; however, random noise in the form of image artifacts
typically still causes false indications of change over time. Common measures to
address this problem have involved penalty functions and levels of smoothing,26,27
but these risk degrading the level of local detail that otherwise might be possible.
Our previous work in this area16,17 aimed to support localization by incorporating
auxiliary estimates of edge likelihoods, in the form of hand-crafted algorithms us-
ing one or a few pre-selected features, to reinforce change gradients in likely edge
locations but inhibit indications of “change” in areas not associated with edges.
The pattern recognition capabilities of deep learning, going beyond the use of a
handful of features, suggested the possibility to do better. Results in this chapter
have indicated that CNN-generated edge predictions can in fact extend the power of
our previous methods. Registrations generated with CNN edge predictions showed
increased detail regarding structural boundaries (compare Fig. 6(a) and Fig. 6(b)),
with regions of computed atrophy rates that were both larger in some areas and
smaller in others compared to our previous algorithm, in biologically plausible ways
(Fig. 6(c)). These led to improved statistical power (Table 4) in the form of reduced
sample sizes needed to detect change.
Applications of deep-learning methods to brain structural edge recognition are
of course not limited to the one presented in this chapter. Tasks such as tissue
classification28 and the segmentation of brain structures,25 including subcortical
nuclei whose boundaries are difficult to localize with certainty, can benefit from the
edge prediction capability we demonstrated here. This will be the subject of future
work.
5. Conclusion
We have demonstrated deep learning CNN applications in two areas of brain struc-
tural image processing. One application focused on improving production and ro-
bustness in brain segmentation, a routine, essential image processing task that
suffers from variable reliability among available non-CNN approaches, and a dearth
of training GT data for previous deep learning efforts. The other aimed at improv-
ing edge prediction, leading to greater biological accuracy and statistical power for
analysis of large data sets. Our results suggest that we have attained these goals.
This demonstrates the flexibility and broad applicability of CNN learning in medical
image processing and analysis.
References
1. Dinggang Shen, G. Wu, and H.-I. Suk, Deep Learning in Medical Image Analysis,
Annual Review of Biomedical Engineering. 19(1), 221–248, (2017). ISSN 1523-9829.
doi: 10.1146/annurev-bioeng-071516-044442. URL http://www.annualreviews.org/
doi/10.1146/annurev-bioeng-071516-044442.
2. Z. Akkus, A. Galimzianova, A. Hoogi, D. L. Rubin, and B. J. Erickson, Deep Learning
for Brain MRI Segmentation: State of the Art and Future Directions, Journal of Digi-
tal Imaging. 30(4), 449–459, (2017). ISSN 1618727X. doi: 10.1007/s10278-017-9983-4.
3. Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine
Learning. 2(1), 1–127, (2009). doi: 10.1561/2200000006.
4. Z. Zhang, F. Xing, H. Su, X. Shi, and L. Yang, Recent Advances in the Applications of
Convolutional Neural Networks to Medical Image Contour Detection, arXiv preprint
arXiv:1708.07281. (2017). URL http://arxiv.org/abs/1708.07281.
5. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A.
W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, A Survey on Deep Learning
in Medical Image Analysis, Medical Image Analysis. 42, 60–88, (2017). ISSN 1361-8423.
doi: 10.1016/j.media.2017.07.005. URL http://dx.doi.org/10.1016/j.media.2017.07.005.
6. A. S. Lundervold and A. Lundervold, An overview of deep learning in medical imag-
ing focusing on MRI, Zeitschrift fur Medizinische Physik. 29(2), 102–127, (2019).
ISSN 18764436. doi: 10.1016/j.zemedi.2018.11.002. URL https://doi.org/10.1016/
j.zemedi.2018.11.002.
7. P. Aljabar, R. Heckemann, A. Hammers, J. Hajnal, and D. Rueckert, Multi-atlas based
segmentation of brain images: atlas selection and its effect on accuracy., NeuroImage.
46(3), 726–38 (jul, 2009). ISSN 1095-9572. doi: 10.1016/j.neuroimage.2009.02.018.
URL http://www.ncbi.nlm.nih.gov/pubmed/19245840.
8. S. M. Smith, Fast robust automated brain extraction, Human Brain Mapping.
17(3), 143–155, (2002). ISSN 1097-0193. doi: 10.1002/hbm.10062. URL https:
//onlinelibrary.wiley.com/doi/abs/10.1002/hbm.10062.
9. S. F. Eskildsen, P. Coupé, V. Fonov, J. V. Manjón, K. K. Leung, N. Guizard, S. N.
Wassef, L. R. Østergaard, and D. L. Collins, BEaST: Brain extraction based on nonlo-
cal segmentation technique, NeuroImage. 59(3), 2362–2373 (Feb., 2012). ISSN 1053-
8119. doi: 10.1016/j.neuroimage.2011.09.012. URL http://www.sciencedirect.com/
science/article/pii/S1053811911010573.
10. J. E. Iglesias, C. Liu, P. M. Thompson, and Z. Tu, Robust Brain Extraction Across
Datasets and Comparison With Publicly Available Methods, IEEE Transactions on
Medical Imaging. 30(9), 1617–1634 (Sept., 2011). ISSN 0278-0062. doi: 10.1109/TMI.
2011.2138152.
11. J. Kleesiek, G. Urban, A. Hubert, D. Schwarz, K. Maier-Hein, M. Bendszus, and
A. Biller, Deep MRI brain extraction: A 3D convolutional neural network for
skull stripping, NeuroImage. 129, 460–469, (2016). ISSN 10959572. doi: 10.1016/
j.neuroimage.2016.01.024. URL http://dx.doi.org/10.1016/j.neuroimage.2016.
01.024.
12. S. S. M. Salehi, D. Erdogmus, and A. Gholipour, Auto-context Convolutional Neural
Network for Geometry-Independent Brain Extraction in Magnetic Resonance Imaging,
IEEE Transactions on Medical Imaging. 36(11), 2319–2330, (2017). ISSN 0278-0062.
doi: 10.1109/TMI.2017.2721362. URL http://arxiv.org/abs/1703.02083.
13. J. Merkow, D. Kriegman, A. Marsden, and Z. Tu. Dense Volume-to-Volume Vascular
Boundary Detection. In eds. S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and
W. Wells, MICCAI 2016, vol. 9902, pp. 1–8. Springer, (2016).
14. I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization
and momentum in deep learning, 30th International Conference on Machine Learning,
ICML 2013. (PART 3), 2176–2184, (2013).
15. L. R. Dice, Measures of the Amount of Ecologic Association Between Species, Ecology.
26(3), 297–302, (1945).
16. E. Fletcher, A. Knaack, B. Singh, E. Lloyd, E. Wu, O. Carmichael, and C. De-
Carli, Combining boundary-based methods with tensor-based morphometry in the
measurement of longitudinal brain change., IEEE transactions on medical imag-
ing. 32(2), 223–36 (feb, 2013). ISSN 1558-254X. doi: 10.1109/TMI.2012.2220153.
URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3775845&tool=pmcentrez&rendertype=abstract.
17. E. Fletcher, Using Prior Information To Enhance Sensitivity of Longitudinal Brain
Change Computation, In ed. C. H. Chen, Frontiers of Medical Imaging, chapter 4,
pp. 63–81. World Scientific, (2014). doi: 10.1142/9789814611107_0004. URL
http://www.worldscientific.com/doi/abs/10.1142/9789814611107_0004.
18. J. Canny, A computational approach to edge detection., IEEE transactions on pattern
analysis and machine intelligence. 8(6), 679–698, (1986). ISSN 0162-8828. doi: 10.
1109/TPAMI.1986.4767851.
19. P. Kochunov, J. L. Lancaster, P. Thompson, R. Woods, J. Mazziotta, J. Hardies,
and P. Fox, Regional Spatial Normalization: Toward an Optimal Target, Journal of
Computer Assisted Tomography. 25(5), 805–816, (2001).
20. D. Rueckert, P. Aljabar, R. A. Heckemann, J. V. Hajnal, A. Hammers, R. Larsen,
M. Nielsen, and J. Sporring. Diffeomorphic registration using b-splines. In MICCAI
2006, vol. LNCS 4191, pp. 702–709. Springer-Verlag, (2006).
21. T. Nichols and A. P. Holmes, Nonparametric permutation tests for functional neu-
roimaging: a primer with examples, Human Brain Mapping. 15(1), 1–25, (2001).
22. X. Hua, B. Gutman, C. P. Boyle, P. Rajagopalan, A. D. Leow, I. Yanovsky,
A. R. Kumar, A. W. Toga, C. R. Jack Jr, N. Schuff, G. E. Alexander, K. Chen,
E. M. Reiman, M. W. Weiner, P. M. Thompson, and the Alzheimer's Disease
Neuroimaging Initiative, Accurate measurement of brain changes in longitudinal
MRI scans using tensor-based morphometry, NeuroImage. 57(1), 5–14 (jul, 2011).
ISSN 1095-9572. doi: 10.1016/j.neuroimage.2011.01.079.
URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3394184&tool=pmcentrez&rendertype=abstract.
CHAPTER 2.5
1. Introduction
Intravascular ultrasound (or IVUS) allows us to see the coronary artery from the
inside out. This unique picture, generated in real time, yields information that is
not obtainable with routine imaging methods or even non-invasive multi-slice CT
scans. A growing number of cardiologists think that the new information yielded by
272 A. G. Gangidi and C. H. Chen
IVUS can make a significant difference in how a patient is treated and can provide
more accurate information that will reduce complications and the incidence of heart
disease.
Intravascular ultrasound is a catheter-based technique that provides
high-resolution images, allowing precise tomographic assessment of lumen area.
IVUS uses high-frequency sound waves (ultrasound) to provide a moving picture of
the heart. These pictures come from inside the heart rather than through the
chest wall.
In a typical IVUS image (Fig. 1a), the lumen is typically a dark echo-free area
adjacent to the imaging catheter, and the coronary artery vessel wall mainly appears
in three layers: intima, media, and adventitia. Fig. 1b defines the layers. As the
two inner layers are of principal concern in clinical research, segmentation of
IVUS images is necessary to isolate the intima-media and lumen, which provides
important information about the degree of vessel obstruction as well as the shape
and size of plaques. Such segmentation can be performed manually by a human
expert, but it is very time-consuming and costly. Computer-based analysis, and
in fact fully automatic image segmentation, is much needed.
Fig. 1a. A typical IVUS image. Fig. 1b. The three layers of the vessel wall.
There are several factors (artifacts) that significantly reduce the accuracy of
segmentation and ultimately cause difficulty in interpretation:
1. The ever-present speckle noise in ultrasonic images, particularly of human
tissues.
2. The guide wire with its reverberation.
3. Reflection from the sheath surrounding the transducer.
4. A barely identifiable lumen-intima boundary.
Automatic Segmentation of Intravascular Ultrasound Images 273
A large amount of effort has been made on IVUS segmentation in recent years,
including the use of automated contour models, machine learning and other methods
(see e.g. [1-12]). Though deep learning may provide better performance with a very
large data set, it is not used in this work because the available data set, in our
view, is not large enough.
In this chapter we present an automated segmentation method making use of both
temporal and spatial information, with the discrete wavelet frame decomposition
used to extract texture information and to initialize the contours. Radial basis
functions are used to construct the final contours in a few iterative steps. The
proposed method is tested on a database provided by the Brigham and Women's
Hospital in Boston, with very encouraging results.
Table 1
The data provided by the Brigham and Women's Hospital for our academic research is
in envelope file format. The files are converted into PC MATLAB format, with
256×256 pixels in polar format. The data available is listed in Table 1. There are
15 pullback sequences from 9 patients. A total of 2,293 gated image frames have
been manually segmented and are useful for training and validation purposes. A
total of 57,098 image frames provides us a large data set for algorithm testing.
Although many studies on IVUS image segmentation have been conducted, with
different but limited degrees of success, none has employed such a large database.
We believe IVUS image segmentation is a problem in pattern recognition and
computer vision. Considering the successes of many pattern recognition and
computer vision problems in the last 55 years and their great impact on modern
society, we are confident that an effective automated segmentation process can be
developed, as demonstrated in this chapter.
Step 1: Pre-Processing:
The IVUS images, represented in polar coordinates by I, include not only tissue
and blood regions, but also the outer boundary of the catheter itself. The latter
defines a dead zone, of radius equal to that of the catheter, where no useful
information is contained. Knowing the diameter D of the catheter, these catheter-
induced artifacts are easily removed by setting

I(r, θ) = 0 for r ≤ D/2 + e,   (1)

where D is the diameter of the catheter and e is a small constant term.
In the above equation, I(m, n)(t + n) is the (t + n)-th frame and Ig(m, n)(t) is the
gradient image obtained corresponding to frame number t.
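A sketch of the dead-zone removal of Eq. (1), assuming a polar image with radius along the rows (the pixel-unit diameter and margin value below are hypothetical choices):

```python
import numpy as np

def remove_catheter_ring(img_polar, catheter_diameter_px, e=2):
    """Eq. (1): zero out the dead zone r <= D/2 + e in a polar IVUS frame.

    img_polar: rows indexed by radius r, columns by angle theta.
    e is the small constant margin (value here is a hypothetical choice).
    """
    out = img_polar.copy()
    r_dead = int(catheter_diameter_px / 2 + e)
    out[:r_dead + 1, :] = 0
    return out

# hypothetical 256x256 polar frame, all ones for illustration
img = np.ones((256, 256))
cleaned = remove_catheter_ring(img, catheter_diameter_px=20, e=2)
```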
Fig. 3: The difference between the DWT and the DWFT; due to the absence of
subsampling in the DWFT, all the resulting images are of the same size.
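The size-preserving property noted in Fig. 3 can be illustrated with a one-dimensional "à trous" filtering step (a simplified stand-in for the chapter's discrete wavelet frame filter bank):

```python
import numpy as np

def dwf_level(signal, level):
    """One a-trous low-pass step: a Haar-like averaging filter dilated by
    2**(level-1), applied WITHOUT subsampling, so the output keeps the
    input length (the key difference from the decimated DWT)."""
    gap = 2 ** (level - 1)
    padded = np.pad(signal, (gap, 0), mode="edge")
    return 0.5 * (padded[:-gap] + padded[gap:])

x = np.arange(8, dtype=float)
y1 = dwf_level(x, 1)   # level 1: same length as x
y2 = dwf_level(x, 2)   # level 2: still same length as x
```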
Fig. 4: The different textures that can be obtained from the DWF.
Fig. 5: DWF applied to IVUS images stage by stage, to serve as a basis for texture
segmentation.
The normalized texture-intensity image and the contour initialization set are

I′int(r, θ) = 255 · Iint(r, θ) / max(r,θ){Iint(r, θ)},

Iint(r, θ) = Σ k∈{7,8,10,11} |Ik(r, θ) − I(r, θ)|   (texture intensity),   (4)

cint,t = { pint,t[ρ, θ] : Iint,t(ρ, θ) > T and Iint,t(r, θ) ≤ T for all r < ρ },

defining a lumen contour function Cint,t(θ) = ρ.
Only significant edges are saved as a set of contour initialization points; these are later used to obtain the actual contour approximation by applying radial basis functions. The choice of the images Ik in the above formulae, employed in this initialization process, was made by visual evaluation of all K generated images and is in line with the aforementioned observations regarding the intensity and texture properties of the lumen and wall areas, in combination with the characteristics of the filter bank used for the generation of the images Ik. The threshold T used for initialization was found to perform best at T = 42 for images in the range [0, 255].
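The first-crossing rule above — take, for each angle, the smallest radius ρ whose texture intensity exceeds T while all smaller radii stay at or below T — can be sketched as follows. This is an illustrative pure-Python version with hypothetical names, not the authors' implementation:

```python
def init_lumen_contour(texture_img, T=42):
    """For each angle theta (a column), return the smallest radius rho with
    texture_img[rho][theta] > T and texture_img[r][theta] <= T for r < rho."""
    n_theta = len(texture_img[0])
    contour = [None] * n_theta      # None: no threshold crossing at this angle
    for theta in range(n_theta):
        for r, row in enumerate(texture_img):
            if row[theta] > T:
                contour[theta] = r  # first crossing encountered radially
                break
    return contour
```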
for the wall inside of the lumen area. For the media–adventitia border we take the sum of the coarsest texture image obtained from the discrete wavelet frames and the intensity image. In this sum image we look for the maxima beyond the lumen border with respect to a threshold, yielding one pixel for each angle θ.
This process can be represented by the following operator.
$$ c_{\mathrm{ext}} \;=\; \{\, p_{\mathrm{ext}}[\mu,\theta] \;:\; I_{LL}(\mu,\theta) = \max_{r > \rho}\{ I_{LL}(r,\theta) \} \,\} \tag{5} $$
Selecting, according to Eq. (5), the pixels at which the intensity of the low-pass filtered image is maximized serves the purpose of identifying the most dominant low-frequency detail in the image, in case low-pass filtering has failed to suppress all other higher-frequency information. The selected pixels correspond to those on the boundary between the adventitia and the media regions.
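The per-angle maximum selection of Eq. (5) can be sketched as below, assuming the low-pass sum image is stored row-by-radius and that the lumen radius ρ is already known for every angle (the names are ours):

```python
def init_media_adventitia(sum_img, lumen_contour):
    """Per Eq. (5): for each angle theta, pick the radius r > rho maximizing
    the low-pass sum image, where rho = lumen_contour[theta]."""
    n_r = len(sum_img)
    contour = []
    for theta, rho in enumerate(lumen_contour):
        # argmax of the sum image over radii strictly beyond the lumen border
        best = max(range(rho + 1, n_r), key=lambda r: sum_img[r][theta])
        contour.append(best)
    return contour
```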
Define
$$ r_{\max} = \max_{\theta}\{C(\theta)\} + 1, \qquad r_{\min} = \min_{\theta}\{C(\theta)\} - 1 \tag{6} $$
$$ f(\theta, C(\theta)) = 0, \qquad f(\theta, r \ne C(\theta)) = r - C(\theta) \tag{7} $$
where $C(\theta)$ here denotes either $C_{\mathrm{int}}(\theta)$ or $C_{\mathrm{ext}}(\theta)$.
Following the definition of f, the FastRBF library (FarField, 2012) [13] is used to
generate the smooth contour approximation in the following three steps.
Step A: Duplicate points where f has been defined (i.e., points in the 2D space located within a specified minimum distance of other input points) are removed; the remaining points serve as the centers of the RBFs.
Step B: An RBF is fitted to these data using the spline-smoothing technique, chosen because it does not require prior estimation of the noise measure associated with each input data point, as opposed to other fitting options such as error-bar fitting.
Step C: The fitted RBF is evaluated in order to find the points which correspond
to zero value; the latter define the contour approximation C’.
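Steps A–C can be illustrated with a small stand-in. The FastRBF library is proprietary, so the sketch below substitutes a ridge-regularized Gaussian-RBF least-squares fit of the contour samples for the spline-smoothing fit; it mirrors the three steps but is not the FastRBF algorithm:

```python
import math

def _solve(M, b):
    """Solve M w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (A[r][n] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def rbf_fit(thetas, radii, sigma=0.8, lam=1e-3):
    """Fit s(t) = sum_j w_j * exp(-(t - c_j)^2 / (2 sigma^2)) to contour samples."""
    # Step A stand-in: thin near-duplicate sample points to obtain RBF centers.
    centers = []
    for t in sorted(thetas):
        if not centers or t - centers[-1] > 1e-9:
            centers.append(t)
    phi = lambda d: math.exp(-d * d / (2.0 * sigma * sigma))
    A = [[phi(t - c) for c in centers] for t in thetas]
    n = len(centers)
    # Step B: ridge-regularized least squares, (A^T A + lam I) w = A^T y.
    M = [[sum(A[k][i] * A[k][j] for k in range(len(thetas)))
          + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    rhs = [sum(A[k][i] * radii[k] for k in range(len(thetas))) for i in range(n)]
    w = _solve(M, rhs)
    # Step C: evaluate the fitted RBF (the smoothed contour approximation).
    return lambda t: sum(wj * phi(t - c) for wj, c in zip(w, centers))
```

With a small regularization term the fit nearly interpolates the input contour; increasing `lam` trades fidelity for smoothness, loosely analogous to the spline-smoothing setting.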
Fig. 6: Initialized contour before applying the radial basis function.
Fig. 8: Lumen and media–adventitia contours before and after applying the radial basis function.
Fig. 9
The objective of this research is to segment the lumen and the EEM. These results are benchmarked against the manual lumen and EEM delineations. To allow a visual comparison for each frame, the manual lumen and EEM contours are plotted against the predicted lumen and EEM on the IVUS image as follows:
5. Concluding Remarks
A novel algorithm has been proposed to auto-detect the lumen and the EEM; it reliably performs contour prediction within clinically acceptable limits, with an average prediction error under 0.13 mm. The proposed approach does not require manual initialization of the contours, a common requirement of several prior approaches to IVUS image segmentation.
The experiments conducted with the combination of temporal analysis, contour initialization and contour refinement methods proposed in this work demonstrated the usefulness of the employed texture features for IVUS image analysis, as well as the contribution of the radial-basis-function approximation technique to the overall analysis outcome. The comparative evaluation of the different alternative approaches revealed that the use of temporal-texture-based initialization and 2D RBF-based approximation results in reliable and quick IVUS segmentation, comparable to manual segmentation and to alternative segmentation algorithms.
Our automated segmentation algorithm has several clinical applications. It could facilitate plaque morphometric analysis, i.e., planimetric, volumetric and wall-thickness calculations, contributing to rapid, and potentially on-site, decision-making. Similarly, our method could be utilized to evaluate plaque progression or regression in serial studies investigating the effect of drugs on atherosclerosis.
References
12. S. Balocco, M. A. Zuluaga, G. Zahnd, S.-L. Lee and S. Demirci, editors, "Computing and Visualization for Intravascular Imaging and Computer-Assisted Stenting", Elsevier, 2017.
13. R. Krasny and L. Wang, "Fast evaluation of multiquadric RBF sums by a Cartesian treecode", SIAM J. Scientific Computing, vol. 33, no. 5, pp. 2341–2355, 2011.
March 12, 2020 15:42 ws-rv975x65 HBPRCV-6th Edn.–11573 main page 287
CHAPTER 2.6
This chapter gives an overview of the state of the art and recent methods in the area of historical document analysis. Historical documents differ from ordinary documents due to the presence of different artifacts. Issues such as the poor condition of the documents, texture, noise and degradation, large variability of page layout, page skew, random alignment, variety of fonts, presence of embellishments, variations in spacing between characters, words, lines, paragraphs and margins, overlapping object boundaries, superimposition of information layers, etc., make their analysis complex. Most current methods rely on deep learning, including Convolutional Neural Networks and Long Short-Term Memory Networks. In addition to the overview of the state of the art, this chapter describes a recently introduced idea for the detection of graphical elements in historical documents and an ongoing effort towards the creation of a large database.
(a) British Library, Harley MS 2970, f.6v. (b) British Library, Add MS 5153A, f.1r. (c) Inst. du Clergé Patriarcal de Bzommar, BzAr 39, 38. (d) Monastery of Mor Gabriel, MGMT 298, 5r. (e) New York Public Library, Spencer Collection. (f) FamilySearch, sample from ICDAR HDRC-2019.1
Fig. 1. Samples of historical documents in five different languages. (a) Greek: Readings for Easter and the Bright Week. (b) Latin: lectionary of the 11th century (the 'Odalricus Peccator Gospel Lectionary'). (c) Armenian: New Testament Lectionary of the 18th or 19th century. (d) Syriac: Gospel Lectionary written in 1833. (e) Syriac: Lectionary for Holy Week, in Coptic and Arabic. Egypt, 1948. (f) Chinese: Family record from the 19th century.
speech processing and achieved remarkable results. These models have also been
successfully applied in handwriting recognition,13 where the error was reduced from
35 % to 18 %.
The deep learning paradigm is being explored for high-level image analysis such as semantic segmentation, object detection and localization,20 document image captioning and summarization, word spotting,20 and visual question answering.21
on first aligning transcripts of texts of the royal French chancery and the royal Dutch archives with the glyphs, and then training an LSTM to analyze non-transcribed texts of the same periods with Latin scripts. Online annotation tools such as SALSAH30 and Transcribe Bentham31 exist; however, they have only limited automatic DIA support. An online tool that includes computerized tools for script analysis and OCR is Monk.32 However, the Monk tools are not publicly available and the methods cannot easily be integrated into other workflows or tools. A publicly accessible Web interface is provided in the course of the Genizah project33 for the analysis of fragments. Users can semi-automatically investigate all the available data; however, they cannot directly add additional information such as transcriptions.
Fig. 2. Example of a query (a) with the expert annotated results in order (b, c, d, e) where (b) is
the most relevant. Our system retrieves these results among the first six ranks in the order (e, d, b,
c). Note that the system is not affected by the different reproduction techniques (a: radiography,
b/d: rubbing, c/e: tracing).
tion more difficult. Therefore, rubbings and radiography reproductions were also
included in this data set.
In watermark research, there exist very complex classification systems for the motifs depicted by watermarks. The classification system plays a major role in both the retrieval and the label assignment of a watermark. The user must be able to determine uniquely the correct class for a given watermark. This is not always easy, especially with rare watermarks. Intuitively, the labelling scheme directly controls the rarity of a class, i.e., the more precise the classification system, the fewer the samples for each class. This section uses the WZIS classification system, as it is a widespread standard in the considered domain.39
It is partially based on the Bernstein classification system35 and it is built in
a mono-hierarchical∗ structure.41 It contains 12 super-classes (Anthropomorphic
figures, Fauna, Fabulous Creature, Flora, Mountains/Luminaries, Artefacts, Sym-
bols/Insignia, Geometrical figures, Coat of Arms, Marks, Letters/Digits, Undefined
marks) with around 5 to 20 sub-classes each.39 The super-classes are purposely abstract and are only useful as an entry point for classifying an instance of a watermark.
∗ In practice, this means that regardless of the level of depth of class specification, there can be only one unique parent class.
For example, for the watermark represented in Figure 2(c), the following hierarchy
applies:
Fauna
  Bull's head
    detached, with sign above
      with eyes
      ...
The actual definition is complete only at the final level. This kind of terminology is not trivial to deal with. Moreover, the user needs special knowledge not only about the individual terms but also about their usage in different scenarios. In particular, the order of the descriptions can become a problem if different users (or libraries) prefer different orders. To overcome this problem, an automatic motif comparison would be beneficial.
2.3. Experiments
The goal of this section is to develop a system that can help researchers in the
humanities perform content-based image retrieval in the context of historical watermarks. Therefore, this section performs experiments in three different ways: classification, tagging, and similarity matching. All experiments use the DeepDIVA
experimental framework.42
As architecture, this section builds on a recent extension of CNNs that includes residual connections in order to avoid the vanishing gradient problem,43 and in particular on dense connections (DenseNet), which connect each layer to every other layer in a feed-forward fashion.44 These two architectural paradigms – namely skip-connections for ResNet and dense-connections for DenseNet – are the state of the art for computer vision tasks. In this section the experiments were run using two different networks, one for each paradigm: specifically, the vanilla implementations of ResNet-18 and DenseNet-121, as they are freely available in the PyTorch vision package.†
The first approach is to treat the problem as an object classification task. The rest of the hierarchical structure is discarded and only the 12 super-classes are used as labels for the watermarks. The network is trained to minimize the cross-entropy loss function shown below:
loss function shown below:
exy
L(x, y) = −log |x| (1)
xi
i=0 e
† https://github.com/pytorch/vision/
where x is a vector of size 12 representing the output of the network and y ∈ {0, ..., 11} is a scalar representing the label of a given watermark. Instead of using randomly initialized versions of these models, variants that had been trained for the classification task on the ImageNet dataset from the ImageNet Large Scale Visual Recognition Challenge45 (ILSVRC) were used. Afzal et al.46 and Singh et al.47 have previously demonstrated that ImageNet pre-training can be beneficial in several other domains, including document analysis tasks. All input images are pre-processed by removing the whitespace around the image and then resized to a standard input size of 224 × 224. The dataset is then split into subsets of size 63 626 for training, 7 082 for validation and 34 825 for testing. The performance of the systems was evaluated using the accuracy metric, which is standard for single-label classification tasks. From Table 1, one can see that the DenseNet outperforms the ResNet by a small but significant margin on all subsets of the dataset.
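Eq. (1) is the standard softmax cross-entropy. A small, numerically stable pure-Python sketch (our own helper, separate from the PyTorch models used in the experiments):

```python
import math

def cross_entropy(x, y):
    """Eq. (1): L(x, y) = -log( exp(x_y) / sum_i exp(x_i) )."""
    m = max(x)                                   # shift for numerical stability
    log_z = m + math.log(sum(math.exp(xi - m) for xi in x))
    return log_z - x[y]
```

For a uniform 12-class output the loss equals log 12; a confidently correct prediction drives it towards 0.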
The second approach is to treat the problem as a multi-label tagging task. In this case each image can be assigned one or more corresponding labels. Here, this approach is motivated by the hierarchical structure of the labels, which, although very noisy, could provide additional information and thus increase the performance of the network. To avoid the ordering problem (see Sec. 2.1), each level of the hierarchy is treated as a "tag", thus addressing the problem of one label appearing at different levels in the hierarchy. The network is trained to minimize the binary cross-entropy loss function shown below:
$$ L(x, y) = -\sum_{i=0}^{n} \left[ y_i \cdot \log(\sigma(x_i)) + (1 - y_i) \cdot \log(1 - \sigma(x_i)) \right] \tag{2} $$
where n is the number of different labels being used, x is a vector of size n representing the output of the network, and y is a multi-hot vector of size n, i.e., a vector whose values are 1 when the corresponding tag is present in the image and 0 when it is not. The training setup is similar to that of the classification task, just with multiple labels. The performance of the systems is evaluated using the Jaccard index,48 also referred to as the mean Intersection over Union (IoU). This metric is used to compare the similarity of sets, which makes it suitable for multi-label classification problems. In Table 2, both networks achieve a high degree of performance on the tagging task, with DenseNet performing significantly better on all subsets of the dataset. This result is quite significant, as the IoU accounts for imbalance in the labels: for a dataset with n classes, even if a class is significantly over-represented, it contributes only 1/n towards the final score. Note that n equals 622 in this case.
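Both the loss of Eq. (2) and the Jaccard index can be transcribed directly; the two pure-Python helpers below are illustrative only (the experiments themselves use PyTorch):

```python
import math

def binary_cross_entropy(x, y):
    """Eq. (2): L = -sum_i [y_i log sigma(x_i) + (1 - y_i) log(1 - sigma(x_i))]."""
    sigma = lambda z: 1.0 / (1.0 + math.exp(-z))
    return -sum(yi * math.log(sigma(xi)) + (1 - yi) * math.log(1.0 - sigma(xi))
                for xi, yi in zip(x, y))

def jaccard_index(predicted, target):
    """Jaccard index (IoU) between two tag sets: |P & T| / |P | T|."""
    p, t = set(predicted), set(target)
    return len(p & t) / len(p | t) if (p | t) else 1.0
```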
Finally, the problem is also treated as an image similarity task, i.e., given an image, produce an embedding in a high-dimensional space in which similar images are embedded close together and dissimilar images far apart. This is intuitively the formulation closest to the end goal of image retrieval. There are different approaches to formulating this task in the literature. The triplet loss approach49,50 is chosen, as it has been shown to outperform two-channel networks51
                             ResNet18   DenseNet121
Baseline                       0.885       0.928
Classification Pre-trained     0.929       0.951
Tagging Pre-trained            0.923       0.952
Fig. 3. The first step is to create modern electronic printed documents from a LaTeX specification document. The second step involves using a deep neural network to learn a mapping function to transform the modern printed document into a historical handwritten document.
and advanced applications of the Siamese approach such as MatchNet52 as well. The triplet loss operates on a tuple of three watermarks {a, p, n}, where a is the anchor (reference watermark), p is the positive sample (a watermark from the same class) and n is the negative sample (a watermark from another class). The neural network is then trained to minimize the loss function defined as
$$ L(a, p, n) = \max(\delta_{+} - \delta_{-} + \mu,\ 0), $$
where δ+ and δ− are the Euclidean distances between the anchor–positive and anchor–negative pairs in the feature space and μ is the margin used. Models pre-trained on the classification task and the multi-label tagging task were used. The results
are reported in terms of mean Average Precision (mAP), which is a well established
metric in the context of similarity matching. Table 3 clearly demonstrates that the
classification and tagging pre-trained networks outperform the ImageNet baseline
networks by a significant margin on all subsets of the dataset. Similarly to the
other two tasks, one can see here that the DenseNet outperforms the ResNet by a
significant margin.
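The triplet objective max(δ₊ − δ₋ + μ, 0) described above can be transcribed directly as a sketch operating on already-computed embeddings with plain Euclidean distances (the names are ours):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(d(a, p) - d(a, n) + margin, 0)."""
    dist = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5
    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so training pushes same-class watermarks together and different-class watermarks apart.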
Unfortunately, in the field of Historical DIA, labelled image datasets are a scarce
resource. This lack of labelled training data makes it challenging to take advan-
tage of several deep learning breakthroughs. This section shows how it is possible
to go one step further.‡ The approach takes advantage of recent advancements in
the design of Generative Adversarial Networks (GANs) and Neural Style Transfer
Algorithms (NST). It learns a mapping function that goes from the Source Domain
S (modern printed electronic document) to the Target Domain T (historical
handwritten document). The primary goal is to generate synthetic historical
‡ This section is an updated version of our original work published in.53
handwritten document images which look like other historical handwritten documents from T. The secondary goal is to demonstrate a new, promising and more straightforward approach to creating a large amount of complex synthetic handwritten historical documents based on different ground-truthed electronic documents. The proposed cycleGAN framework achieves these goals and performs the generation task in an integrated manner.
3.1. Task
As shown in Figure 3, we tackle the problem of historical document image synthesis
in two steps. The first step of generating source domain documents is achieved with
a LaTeX framework, as described in previous work.53 In the second step, we train a neural network to learn a mapping function between the source domain (modern document) and the target domain (historical handwritten document). The second step can be further divided into three tasks: a reconstruction task and a classification task, which are used to pre-train the networks, and the final generation task.
• Random Crop: A random crop of size 256 × 256 pixels is taken from the central portion of the document (to avoid blank border regions). In this scenario, the fine detail of the pages is preserved, and crops have almost the same number of lines and words in both the historical and modern domains.
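The cropping step can be sketched as follows; the `border` margin that keeps crops away from the (possibly blank) page edges is our own illustrative parameter, not a value given in the chapter:

```python
import random

def random_central_crop(page, crop=256, border=64):
    """Take a random crop of size crop x crop from the central portion of a
    page (a list of pixel rows), staying `border` pixels from every edge."""
    h, w = len(page), len(page[0])
    # choose the crop's top-left corner inside the allowed central region
    y = random.randint(border, h - crop - border)
    x = random.randint(border, w - crop - border)
    return [row[x:x + crop] for row in page[y:y + crop]]
```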
The historical dataset used in this work does not contain paired images between the source and target domains. Thus, this work uses a variant of GANs, called the cycleGAN,54 that relies on the cycle consistency loss. The cycleGAN architecture performs a transformation of the images from the source to the target domain and vice versa. The cycle consistency loss with the bi-directional mapping function, coupled with the L1 distance loss, increases the learning stability of the adversarial framework in an unpaired image setting.55 When training the cycleGAN from
scratch, it runs for 50 epochs with a batch size of 1. The learning rate is 0.0002
with a linear decay starting from epoch 25. It is useful to use a history buffer that
stores the 50 most recently generated images. This history buffer is used to update
the discriminator and reduce model oscillations during training. When training the
cycleGAN in the pre-trained scenario, the generator and discriminator are initialized
with weights obtained from the reconstruction and classification tasks respectively.
The weights of the encoder part of the generators are initialized with the weights
of the encoder component of an auto-encoder that is trained for reconstruction on
the datasets.
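The history buffer can be sketched as below. The 50/50 replace-or-pass policy follows the common image-buffer recipe for GAN training and is an assumption on our part; the text above only states that the 50 most recently generated images are stored:

```python
import random

class ImageHistoryBuffer:
    """Buffer of recently generated images used to update the discriminator
    and damp oscillations during adversarial training."""

    def __init__(self, size=50):
        self.size = size
        self.images = []

    def query(self, image):
        # While filling up, store the new image and return it unchanged.
        if len(self.images) < self.size:
            self.images.append(image)
            return image
        # Once full: half the time, swap the new image for a stored one and
        # show the discriminator the older image instead.
        if random.random() < 0.5:
            idx = random.randrange(self.size)
            old, self.images[idx] = self.images[idx], image
            return old
        return image
```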
For the Neural Style Transfer, the VGG-19 Convolutional Neural Network implementation56 is used, where the goal is to minimize the content and style losses jointly. This work uses the VGG-19-based NST model in two different settings: using ImageNet weights, and using weights from a pre-trained model. When using the model with the ImageNet weights, only the last layers of the network are reinitialized, and then the NST procedure is applied to the images. For pre-training, the VGG-19 is first trained on the dataset for a classification task. The network is trained for 25 epochs with a batch size of 4, a learning rate of 0.001 and momentum of 0.9. The weights of this model are then used for the NST procedure.
3.4. Results
(a) Target Domain Samples. (b) Source Domain Samples. (c) cycleGAN Synthetic Samples. (d) NST Synthetic Samples.
Fig. 4. Examples of images generated by the cycleGAN and the NST after training on the complete document images. The first and second columns contain samples from the Target Domain and the Source Domain, respectively. The third column contains samples generated by the cycleGAN trained from scratch. The samples generated by the NST model (pre-trained on the PDD dataset) are shown in the fourth column. Every sample contains a zoomed-in view to show the quality of the generated pages.
the complete document images. The synthetic images generated by the cycleGAN appear significantly better than those generated with the NST. Regarding the semantic content (font shape, readability of words and letters, marginal annotations), we can notice many similarities between the target domain samples and the synthetic samples. The overall style content of the target domain (background colour, texture, paper degradation, style of the initials) is well expressed. However, from a structural point of view (column mode, number and presence of initials, textual artifacts), the initials are not well detected and expressed, and the two-column mode is not expressed at all. When considering the synthetic documents produced with the NST, the structural content is better preserved. However, the style is mixed and standardized over the entire synthetic document, leading to the presence of many coloured artifacts.
ing the global and local generation procedures to produce documents that have the
correct global structure as well as fine-grained details.
These techniques, combined with the techniques mentioned in Sec. 1.3, can be used to generate a massive database of historical structured documents, including epic texts, religious texts, demographic reports, and economic reports. Existing databases can serve as input for the GAN-based document generation framework. In our future work, we plan to create a database containing several million document images together with GT for logical layout analysis, text extraction, and OCR. With this growing dataset we plan to host a series of novel and challenging public competitions, in which researchers worldwide can participate with their methods.¶
References
Trans. on Pattern Analysis and Machine Intelligence. 31(5), 855–868, (2009). doi:
10.1109/TPAMI.2008.137.
14. S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation.
9(8), 1735–1780, (1997).
15. P. Voigtlaender, P. Doetsch, and H. Ney. Handwriting recognition with large multi-
dimensional long short-term memory recurrent neural networks. In 2016 15th Inter-
national Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 228–233.
IEEE, (2016).
16. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ImageNet Large Scale Visual
Recognition Challenge, International Journal of Computer Vision (IJCV). 115(3),
211–252, (2015). doi: 10.1007/s11263-015-0816-y.
17. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In eds. F. Pereira, C. J. C. Burges, L. Bottou, and
K. Q. Weinberger, Advances in Neural Information Processing Systems 25, pp. 1097–
1105. Curran Associates, Inc., (2012).
18. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-
houcke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the
IEEE CVPR, pp. 1–9, (2015).
19. A. Graves, N. Jaitly, and A.-r. Mohamed. Hybrid speech recognition with deep bidi-
rectional lstm. In 2013 IEEE workshop on automatic speech recognition and under-
standing, pp. 273–278. IEEE, (2013).
20. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and
C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on
computer vision, pp. 740–755. Springer, (2014).
21. P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Bal-
ancing and answering binary visual questions. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 5014–5022, (2016).
22. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. (MIT Press, 2016).
23. A. Rozantsev, V. Lepetit, and P. Fua, On rendering synthetic images for training an
object detector, Computer Vision and Image Understanding. 137, 24 – 37, (2015).
ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2014.12.006.
24. N. Journet, M. Visani, B. Mansencal, K. Van-Cuong, and A. Billy, Doccreator: A new
software for creating synthetic ground-truthed document images, Journal of imaging.
3(4), 62, (2017).
25. D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura,
J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. Icdar 2015 competition
on robust reading. In 2015 13th International Conference on Document Analysis and
Recognition (ICDAR), pp. 1156–1160. IEEE, (2015).
26. T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun. An end-to-end textspotter
with explicit alignment and attention. In Proceedings of the IEEE CVPR, pp. 5020–
5029, (2018).
27. T. M. Breuel. High performance text recognition using a hybrid convolutional-lstm
implementation. In 14th Int. Conf. on Document Analysis and Recognition (ICDAR),
vol. 01, pp. 11–16, (2017).
28. J.-C. Burie, J. Chazalon, M. Coustaty, S. Eskenazi, M. M. Luqman, M. Mehri,
N. Nayef, J.-M. Ogier, S. Prum, and M. Rusiñol. Icdar2015 competition on smartphone
document capture and ocr (smartdoc). In 2015 13th Int. Conf. Document Analysis and
Recognition (ICDAR), pp. 1161–1165. IEEE, (2015).
29. P. Krishnan and C. Jawahar, Generating synthetic data for text recognition, arXiv
preprint arXiv:1608.04224. (2016).
30. T. Schweizer and L. Rosenthaler. Salsah – eine virtuelle Forschungsumgebung für die Geisteswissenschaften. In EVA, pp. 147–153, (2011).
31. T. Causer and V. Wallace, Building a volunteer community: results and findings from
transcribe bentham, Digital Humanities Quarterly. 6, (2012).
32. S. He, P. Sammara, J. Burgers, and L. Schomaker. Towards style-based dating of
historical documents. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th
int. conf. on, pp. 265–270 (Sept, 2014). doi: 10.1109/ICFHR.2014.52.
33. L. Wolf, R. Littman, N. Mayer, T. German, N. Dershowitz, R. Shweka, and Y. Choueka, Identifying join candidates in the Cairo Genizah, Int. Journal of Computer Vision. 94(1), 118–135, (2011).
34. P. Rückert, Ochsenkopf und Meerjungfrau. Wasserzeichen des Mittelalters. (Haupt-
staatsarchiv, Stuttgart, 2006).
35. E. Wenger. Metasuche in wasserzeichendatenbanken (bernstein-projekt): Heraus-
forderungen für die zusammenführung heterogener wasserzeichen-metadaten. In eds.
W. Eckhardt, J. Neumann, T. Schwinger, and A. Staub, Wasserzeichen - Schreiber -
Provenienzen: neue Methoden der Erforschung und Erschließung von Kulturgut im digitalen Zeitalter: zwischen wissenschaftlicher Spezialdisziplin und Catalog enrichment,
pp. 289–297. Vittorio Klostermann, Frankfurt am Main, (2016).
36. E. Wenger and M. Ferrando Cusi, How to make and organize a watermark database
and how to make it accessible from the bernstein portal: a practical example: Ivc+r,
Paper history. 17, 16–21, (2013).
37. S. Limbeck, Digitalisierung von Wasserzeichen als Querschnittsaufgabe. Überlegungen
zu einer gemeinsamen Wasserzeichendatenbank der Handschriftenzentren, Das Mitte-
lalter Perspektiven mediävistischer Forschung. 14(2), 146–155, (2009).
38. N. F. Palmer. Verbalizing watermarks : the question of a multilingual database. In eds.
P. Rückert and G. Maier, Piccard-Online. Digitale Präsentationen von Wasserzeichen
und ihre Nutzung, pp. 73–90. Kohlhammer, Stuttgart, (2007).
39. E. Frauenknecht. Papiermühlen in Württemberg. Forschungsansätze am Beispiel
der Papiermühlen in Urach und Söflingen. In eds. C. Meyer, S. Schultz, and
B. Schneidmüller, Papier im mittelalterlichen Europa. Herstellung und Gebrauch, pp.
93–114. De Gruyter, Berlin, Boston, (2015).
40. V. Pondenkandath, M. Alberti, R. Ingold, and M. Liwicki. Identifying Cross-Depicted
Historical Motifs. In 2018 16th International Conference on Frontiers in Handwriting
Recognition (ICFHR), Niagara Falls, USA (aug, 2018).
41. E. Frauenknecht. Von Wappen und Ochsenköpfen: zum Umgang mit großen Motivgruppen im Wasserzeichen-Informationssystem (WZIS). In eds. W. Eckhardt, J. Neumann, T. Schwinger, and A. Staub, Wasserzeichen - Schreiber - Provenienzen: neue Methoden der Erforschung und Erschließung von Kulturgut im digitalen Zeitalter: zwischen wissenschaftlicher Spezialdisziplin und Catalog enrichment, pp. 271–287. Vittorio
Klostermann, Frankfurt am Main, (2016).
42. M. Alberti, V. Pondenkandath, M. Würsch, R. Ingold, and M. Liwicki. DeepDIVA:
A Highly-Functional Python Framework for Reproducible Experiments. In 2018 16th
International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara
Falls, USA (aug, 2018).
43. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, (2016).
44. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected
convolutional networks. In CVPR, pp. 4700–4708, (2017).
45. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ImageNet Large Scale Visual
Recognition Challenge, International Journal of Computer Vision (IJCV). 115(3),
211–252, (2015). doi: 10.1007/s11263-015-0816-y.
46. M. Z. Afzal, S. Capobianco, M. I. Malik, S. Marinai, T. M. Breuel, A. Dengel, and
M. Liwicki. DeepDocClassifier : Document Classification with Deep Convolutional
Neural Network. In 13th International Conference on Document Analysis and Recog-
nition, pp. 1111–1115. IEEE, (2015). ISBN 9781479918058.
47. M. S. Singh, V. Pondenkandath, B. Zhou, P. Lukowicz, and M. Liwicki. Transforming sensor data to the image domain for deep learning: an application to footstep detection.
In Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 2665–2672.
IEEE, (2017).
48. M. Levandowsky and D. Winter, Distance between sets, Nature. 234(5323), 34, (1971).
49. E. Hoffer and N. Ailon. Deep metric learning using triplet network. In Lecture Notes
in Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), vol. 9370, pp. 84–92, (2015). ISBN 9783319242606.
doi: 10.1007/978-3-319-24261-3 7.
50. V. Balntas, Learning local feature descriptors with triplets and shallow convolutional
neural networks, Bmvc. 33(1), 119.1–119.11, (2016). doi: 10.5244/C.30.119.
51. S. Zagoruyko and N. Komodakis, Learning to compare image patches via convolutional
neural networks, Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. 07-12-June(i), 4353–4361, (2015). ISSN 10636919.
doi: 10.1109/CVPR.2015.7299064.
52. X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature
and metric learning for patch-based matching. In Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, vol. 07-12-June, pp.
3279–3286, (2015). ISBN 9781467369640. doi: 10.1109/CVPR.2015.7298948.
53. V. Pondenkandath, M. Alberti, M. Diatta, R. Ingold, and M. Liwicki. Historical docu-
ment synthesis with generative adversarial networks. In 2019 International Conference
on Document Analysis and Recognition Workshops (ICDARW), vol. 5, pp. 146–151.
IEEE, (2019).
54. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Transla-
tion using Cycle-Consistent Adversarial Networks. In Computer Vision (ICCV), 2017
IEEE International Conference on, (2017).
55. H. Huang, P. S. Yu, and C. Wang, An Introduction to Image Synthesis with Generative
Adversarial Nets, arXiv preprint arXiv:1803.04469. (2018).
56. L. A. Gatys, A. S. Ecker, and M. Bethge. Image Style Transfer Using Convolutional
Neural Networks. In The IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR) (June, 2016).
This page intentionally left blank
March 12, 2020 16:8 ws-rv961x669 HBPRCV-6th Edn.–11573 riesen˙signature page 305
CHAPTER 2.7
1. Introduction
The first use of handwritten signatures can be traced back to the fourth century,
when signatures were used to protect the Talmud (i.e. the central text in Rabbinic
Judaism) from possible changes. Since then and to this day, handwritten signatures
have been used as a biometric authentication and verification measure in a wide range
of business and legal transactions worldwide.
With the widespread use of signatures, the interest in, and necessity of, verifying
the authenticity of signatures has grown. Signature verification is often synonymous
with the process of comparing a questioned signature with a set of reference signa-
tures in order to distinguish between genuine and forged signatures.1 Traditionally,
this task is performed by human experts within the framework of graphology, i.e. the
study of handwriting. However, signature verification turns out to be a demanding
task, as the decision has to be made on the basis of only a few original samples.
This motivated the research and development of automatic signature verification
systems.
Offline signature verification systems are typically based on the following three
processing steps.
Fig. 1. Image preprocessing illustrated on the first signature image of user 3941 from the
GPDSsynthetic dataset.26
Definition 1 (Graph). Let LV and LE be finite or infinite label sets for nodes
and edges, respectively. A graph g is a four-tuple g = (V, E, μ, ν), where V is the
finite set of nodes, E ⊆ V × V is the set of edges, μ : V → LV is the node labeling
function, and ν : E → LE is the edge labeling function.
In keypoint graphs, the nodes represent keypoints on the handwriting and the
node labels are the coordinates of these points, i.e. LV = ℝ × ℝ. The edges are
unlabeled and undirected, i.e. LE = ∅ and (u, v) ∈ E ⇐⇒ (v, u) ∈ E, and connect
two nodes if their corresponding points are directly connected on the handwriting.
The nodes and edges are extracted from the skeleton image of the signature. The
keypoints are selected iteratively. First, junction points and end points are added
to the set of keypoints. Second, the left-most pixel of each circular structure that
does not yet contain a keypoint is added to the set. Then, additional points
are added by sampling the skeleton: starting at already selected keypoints, the
skeleton is traced, and whenever the distance traveled without meeting a keypoint
is greater than or equal to a user-defined threshold D, a new keypoint is added.
In order to formalize the relationship between nodes, we use undirected edges to
connect neighboring keypoints on the skeleton.
The node labels are finally normalized to make the graph representation invari-
ant to translation by subtracting the average node label of this particular graph
from each node label in the graph. Thus, the nodes of a graph are always centered
around the origin (0, 0) in a two-dimensional plane.
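As a concrete sketch of this normalization step (the function name and NumPy usage are ours, not part of the original system), the node labels can be centered as follows:

```python
import numpy as np

def center_node_labels(coords):
    """Translate 2-D node labels so that their mean lies at the origin (0, 0).

    coords: (n, 2) array of keypoint coordinates mu(v) = (x, y).
    """
    coords = np.asarray(coords, dtype=float)
    return coords - coords.mean(axis=0)

# Toy graph with three keypoints: after centering, the mean label is (0, 0).
nodes = center_node_labels([[10.0, 4.0], [12.0, 8.0], [14.0, 6.0]])
```

Centering makes the representation invariant to translation only; rotation and scale are not normalized by this step.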
An example of a keypoint graph is shown in Fig. 2. In this chapter, a graph is
termed gR if it is based on a signature image R.
In our signature verification system, the decision of whether an unseen signature
is a genuine signature of the claimed user is based on a set R of known genuine
signatures graphs from that user, termed references. An unseen signature T (rep-
resented by gT ) is compared with all reference signature graphs gR ∈ R, and a
signature verification score is calculated. Formally, we match the corresponding
graphs by a certain graph matching procedure and compute several graph dissimi-
Fig. 2. Example keypoint graph generated from the first signature of user 3941 from the
GPDSsynthetic dataset.26
This score is used to normalize the dissimilarity scores of each user. Formally, we
define \hat{d}(g_R, g_T, R) as the reference-normalized score:

\hat{d}(g_R, g_T, R) = \frac{d(g_R, g_T)}{\delta(R)} ,    (2)

and the final verification score as the minimum over all references:

d(R, g_T) = \min_{g_R \in R} \hat{d}(g_R, g_T, R) = \frac{\min_{g_R \in R} d(g_R, g_T)}{\delta(R)} .    (3)
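The score computation of Eqs. (2) and (3) can be sketched as follows. The definition of the user-specific factor δ(R) is not repeated in this excerpt; purely for illustration we assume it is the average pairwise dissimilarity among the references:

```python
def verification_score(d, g_T, references):
    """Minimum reference-normalized dissimilarity d(R, g_T) of Eqs. (2)-(3).

    d          : graph dissimilarity function, e.g. d_BP or d_HED
    g_T        : graph of the questioned signature
    references : list of reference graphs g_R of the claimed user
    """
    # Assumed stand-in for delta(R): average pairwise reference dissimilarity.
    pairwise = [d(a, b) for i, a in enumerate(references)
                for b in references[i + 1:]]
    delta = sum(pairwise) / len(pairwise)
    return min(d(g_R, g_T) for g_R in references) / delta
```

With a toy dissimilarity d(a, b) = |a - b| and references {0, 2, 4}, a questioned "signature" 5 yields a score of 1 / (8/3) = 0.375.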
Obviously, the dissimilarity computation d(·, ·) between two graphs constitutes the
fundamental building block of the complete verification procedure. In the following
section, the process of calculating a graph dissimilarity is described in more detail.
In particular, we review two well-known graph matching algorithms – the bipartite
approximation of graph edit distance20 as well as the Hausdorff edit distance.19
where μR (u) = (xu , yu ) and μT (v) = (xv , yv ) are the node labels of nodes u ∈ VR
and v ∈ VT , respectively.
For both deletions and insertions of nodes, we use a constant cost cnode . For-
mally,
c(eR → eT ) = 0, (6)
where eR ∈ ER and eT ∈ ET , while the cost of both edge deletion and insertion is
set to a constant value cedge .
The edit distance of two graphs can now be defined by the minimum cost edit
path between two graphs.
where Υ(g1 , g2 ) denotes the set of all edit paths transforming g1 into g2 and c
denotes the cost function measuring the strength c(ei ) of edit operation ei .
The minimal cost edit path found in Υ(g1 , g2 ) corresponding to dGED is termed
λmin from now on.
Optimal algorithms for computing the edit distance of graphs are typically based
on combinatorial search procedures that explore the space of all possible mappings of
the nodes and edges of g1 to the nodes and edges of g2 . Such an exploration is often
conducted by means of A* based search techniques32 using some heuristics.33,34
A major drawback of A* based search techniques for graph edit distance computa-
tion is their computational complexity. In fact, the problem of minimizing the graph
edit distance can be reformulated as an instance of a Quadratic Assignment Problem
(QAP ).35 QAPs belong to the most difficult combinatorial optimization problems
for which only exponential run time algorithms are known to date. The graph edit
distance approximation framework introduced in Ref. 36 reduces the QAP of graph
edit distance computation to an instance of a Linear Sum Assignment Problem
(LSAP ). For solving LSAPs, a large number of quite efficient algorithms exist.37
LSAPs are concerned with the problem of finding the best bijective assignment
between the independent entities of two sets S_1 = \{s_1^{(1)}, \ldots, s_n^{(1)}\} and
S_2 = \{s_1^{(2)}, \ldots, s_n^{(2)}\} of equal size. In order to assess the quality of an assign-
ment of two entities, a cost c_{ij} is commonly defined that measures the suitability
of assigning the i-th element s_i^{(1)} \in S_1 to the j-th element s_j^{(2)} \in S_2 (resulting in
n \times n cost values c_{ij}, i, j = 1, \ldots, n).
\min_{(\varphi_1, \ldots, \varphi_n) \in \mathcal{S}_n} \sum_{i=1}^{n} c_{i\varphi_i}
Definition 4 (Cost matrix C). Based on the node sets V1 = {u1 , . . . , un } and
V2 = {v1 , . . . , vm } of g1 and g2 , respectively, a (n + m) × (n + m) cost matrix C is
established as follows.
C = \begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1m} & c_{1\varepsilon} & \infty & \cdots & \infty \\
c_{21} & c_{22} & \cdots & c_{2m} & \infty & c_{2\varepsilon} & \ddots & \vdots \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \infty \\
c_{n1} & c_{n2} & \cdots & c_{nm} & \infty & \cdots & \infty & c_{n\varepsilon} \\
c_{\varepsilon 1} & \infty & \cdots & \infty & 0 & 0 & \cdots & 0 \\
\infty & c_{\varepsilon 2} & \ddots & \vdots & 0 & 0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \infty & \vdots & \ddots & \ddots & 0 \\
\infty & \cdots & \infty & c_{\varepsilon m} & 0 & \cdots & 0 & 0
\end{bmatrix}    (9)
Entry cij thereby denotes the cost of a node substitution (ui → vj ), ciε denotes
the cost of a node deletion (ui → ε), and cεj denotes the cost of a node insertion
(ε → vj ).
Obviously, the left upper corner of the cost matrix C = (cij ) represents the costs
of all possible node substitutions, the diagonals of the right upper and left bottom
corners the costs of all possible node deletions and node insertions, respectively. As
every node can be deleted or inserted at most once, any non-diagonal element of
the right-upper and left-lower part is set to ∞. Substitutions of the form (ε → ε)
should not cause any cost (thus the bottom right corner of C is set to zero).
Given the cost matrix C = (c_{ij}), the LSAP optimization consists in finding a
permutation (\varphi_1, \ldots, \varphi_{n+m}) of the integers (1, 2, \ldots, (n+m)) that minimizes the
overall assignment cost \sum_{i=1}^{n+m} c_{i\varphi_i}. In order to solve the LSAP on our specific
cost matrix, the Hungarian algorithm39 with cubic time complexity is applied.
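A sketch of Definition 4 and the subsequent LSAP solution, assuming a substitution-cost matrix is already given (SciPy's linear_sum_assignment is used in place of a hand-written Hungarian solver, and a large finite value stands in for the ∞ entries):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def square_cost_matrix(sub, c_node):
    """Build the (n+m) x (n+m) matrix C of Definition 4 from a substitution
    cost matrix `sub` (n x m) and a constant deletion/insertion cost c_node."""
    n, m = sub.shape
    INF = 1e9  # stands in for the infinity entries
    C = np.full((n + m, n + m), INF)
    C[:n, :m] = sub                    # substitutions (u_i -> v_j)
    C[n:, m:] = 0.0                    # (eps -> eps), zero cost
    for i in range(n):                 # deletions on the upper-right diagonal
        C[i, m + i] = c_node
    for j in range(m):                 # insertions on the lower-left diagonal
        C[n + j, j] = c_node
    return C

# Toy example: two nodes vs. one node, label distances given as `sub`.
sub = np.array([[1.0], [5.0]])
C = square_cost_matrix(sub, c_node=2.0)
rows, cols = linear_sum_assignment(C)  # optimal bijective assignment
cost = C[rows, cols].sum()             # node assignment cost (here: 1 + 2 + 0)
```

Note that this sketch covers only the node part of the matrix; in the actual framework, each entry c_ij additionally carries the implied local edge costs described below.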
The solution of the LSAP is an assignment ψ of the nodes of g1 to the nodes of
g2. Note that assignment ψ includes node assignments of the form (ui → vj),
(ui → ε), (ε → vj), and (ε → ε) (the latter can be dismissed, of course).
In fact, so far, the cost matrix C = (cij ) considers the nodes of both graphs only,
and thus mapping ψ does not take any structural constraints into account. In order
to integrate knowledge about the graph structure, to each entry cij , i.e. to each cost
of a node edit operation (ui → vj ), the minimum sum of edge edit operation costs,
implied by the corresponding node operation, is added. This particular encoding of
the minimum matching cost arising from the local edge structure enables the LSAP
to consider information about the local, yet not global, edge structure of a graph.
The LSAP optimization finds an assignment ψ in which every node of g1 is either
assigned to a unique node of g2 or deleted. Likewise, every node of g2 is either
assigned to a unique node of g1 or inserted. Note, moreover, that edit operations
on edges are always defined by the edit operations on their adjacent nodes. That
is, whether an edge (u, v) is substituted, deleted, or inserted, depends on the edit
operations actually performed on both adjacent nodes u and v.
Hence, given the node assignment ψ, the edge edit operations can be completely
(and globally consistently) inferred from ψ resulting in an admissible edit path be-
tween the graphs under consideration. The corresponding cost of this edit path can
be interpreted as an approximate graph edit distance. We denote this suboptimal
edit distance with dBP(g1, g2) (or dBP for short).
The solution ψ found by the Hungarian algorithm may not be optimal with
respect to graph edit distance and the corresponding edit path cost derived may be
higher than the cost of the optimal edit path λmin . Hence, the distance measure dBP
provides an upper bound for the exact graph edit distance, that is dGED (g1 , g2 ) ≤
dBP (g1 , g2 ).
In contrast to the bipartite approximation, this second approach takes all possible
node matchings into account conjointly. In this extension, each node of the first graph
is compared with each node of the second graph similar to comparing subsets of a
metric space using the Hausdorff distance.41 Accordingly, the proposed Hausdorff
edit distance (HED) can be calculated in quadratic time, that is O(nm).
We start our formalization with the definition of the Hausdorff distance of two
subsets A, B of a metric space
H(A, B) = \max\Big( \sup_{a \in A} \inf_{b \in B} d(a, b), \; \sup_{b \in B} \inf_{a \in A} d(a, b) \Big) ,    (10)

with respect to the metric d(a, b). In the case of finite sets the Hausdorff distance is

H(A, B) = \max\Big( \max_{a \in A} \min_{b \in B} d(a, b), \; \max_{b \in B} \min_{a \in A} d(a, b) \Big) ,    (11)

i.e. the maximum among all nearest-neighbor distances between A and B. As it is
prone to outliers, the maximum operator can be replaced with the sum

H'(A, B) = \sum_{a \in A} \min_{b \in B} d(a, b) + \sum_{b \in B} \min_{a \in A} d(a, b) ,    (12)
taking all distances into account. Finally, the Hausdorff edit distance (HED) dHED
between two graphs g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) is

d_{HED}(g_1, g_2) = \sum_{u \in V_1} \min_{v \in V_2 \cup \{\varepsilon\}} N(u, v) + \sum_{v \in V_2} \min_{u \in V_1 \cup \{\varepsilon\}} N(u, v) ,    (13)

N(u, v) = \begin{cases}
c_u + \sum_{p \in P} \frac{c_p}{2} & \text{for node deletion } (u \to \varepsilon) \\
c_v + \sum_{q \in Q} \frac{c_q}{2} & \text{for node insertion } (\varepsilon \to v) \\
\frac{c_{uv} + \frac{d_{HED}(P, Q)}{2}}{2} & \text{for node substitution } (u \to v)
\end{cases}    (14)

where P and Q are the edges adjacent to u and v, respectively, and c_u, c_v, c_{uv}
(resp. c_p, c_q) denote the respective costs for deletion, insertion, and substitution.
Only half of the substitution costs are considered because HED does not enforce
bidirectional assignments. Only half of the edge costs are considered because each
edge connects two nodes.
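For finite point sets, the two set distances of Eqs. (11) and (12) can be sketched directly with NumPy (function names are ours):

```python
import numpy as np

def hausdorff(A, B):
    """Classic Hausdorff distance H(A, B) of Eq. (11) for finite point sets."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all d(a, b)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def hausdorff_sum(A, B):
    """Outlier-robust sum variant H'(A, B) of Eq. (12)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return D.min(axis=1).sum() + D.min(axis=0).sum()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [3.0, 0.0]])
```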
The edge matching cost dHED (P, Q) is defined differently for HED when com-
pared with the bipartite framework. Instead of solving an assignment problem for
the adjacent edges of two nodes, Hausdorff matching is performed on the edges in
the same way as for the nodes (see Ref. 19 for details).
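A minimal sketch of Eqs. (13)-(14), simplified by ignoring the edge terms (P = Q = ∅), so that each node is either matched to its cheapest counterpart at half the substitution cost or deleted/inserted at a constant cost:

```python
def hed_nodes_only(V1, V2, c_sub, c_node):
    """Hausdorff edit distance of Eq. (13), restricted to node costs only.

    V1, V2 : node label lists of the two graphs
    c_sub  : substitution cost function on node labels
    c_node : constant node deletion/insertion cost
    """
    def best(u, others):
        # Best case for u: cheapest half-substitution or deletion (u -> eps).
        return min([c_sub(u, v) / 2.0 for v in others] + [c_node])
    return (sum(best(u, V2) for u in V1) +
            sum(best(v, V1) for v in V2))
```

Because every node independently picks its best case, the result never exceeds the exact edit distance, matching the underestimation property discussed next.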
In summary, HED considers the best case for matching each node and edge in-
dividually and hence underestimates the true edit distance. That is, dHED (g1 , g2 ) ≤
dGED (g1 , g2 ). In order to constrain the underestimation, a lower bound can be used
for dHED (g1 , g2 ), which asserts a minimum amount of deletion and insertion costs
if the two matched graphs differ in size (see Ref. 19 for more details). Moreover, in
Ref. 42 an improvement of this approximation has been proposed which involves a
larger context of individual nodes.
In the following, we use d as a placeholder for both graph edit distance approxima-
tions, i.e. dBP or dHED .
Previous publications have shown that it is crucial to apply a normalization
when using a dissimilarity measure for signature verification. We normalize our
graph edit distance with what we refer to as maximal graph edit distance, i.e. the
cost of completely deleting the first graph and then inserting the complete second
graph.
Formally, given two graphs gR = (VR , ER , μR , νR ) and gT = (VT , ET , μT , νT )
and a cost function c, we define dmax as
d_{max}(g_R, g_T) = \sum_{u \in V_R} c(u \to \varepsilon) + \sum_{e \in E_R} c(e \to \varepsilon) + \sum_{v \in V_T} c(\varepsilon \to v) + \sum_{e \in E_T} c(\varepsilon \to e)    (15)
When using the cost function defined in Section 2, with constant node cost c_node
and constant edge cost c_edge, this equation can be simplified to

d_{max}(g_R, g_T) = c_{node} (|V_R| + |V_T|) + c_{edge} (|E_R| + |E_T|) .    (16)
We now define the normalized graph edit distance of two signature images R
and T as

d_{norm}(g_R, g_T) = \frac{d(g_R, g_T)}{d_{max}(g_R, g_T)} ,    (17)

where gR and gT are the keypoint graphs for the signature images R and T respec-
tively, and d(gR, gT) is either dBP or dHED.
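Under constant node and edge costs, the quantities of Eqs. (15) and (17) reduce to simple counting; a sketch (function and parameter names are ours):

```python
def d_max(n_R, e_R, n_T, e_T, c_node, c_edge):
    """Maximum graph edit distance of Eq. (15) under constant costs:
    delete all nodes/edges of g_R, then insert all nodes/edges of g_T."""
    return c_node * (n_R + n_T) + c_edge * (e_R + e_T)

def d_norm(d, n_R, e_R, n_T, e_T, c_node, c_edge):
    """Normalized graph edit distance of Eq. (17)."""
    return d / d_max(n_R, e_R, n_T, e_T, c_node, c_edge)
```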
4. Experimental Evaluation
• GPDS-last100: Containing the last 100 users of the dataset (users 3901 to
4000).
• GPDS-75: Containing the first 75 users of the dataset (users 1 to 75).
The GPDS-last100 dataset is used as the training set for both structural meth-
ods. That is, we tune the parameters of our graph-based methods on this subset
exclusively.
UTSig is a Persian signature dataset.44 It consists of 115 users with 27 genuine
signatures, 3 opposite-hand signaturesb , and 42 forgeries for each user. The users
have been instructed to sign within six differently sized bounding boxes to simulate
different conditions. The resulting signatures have been scanned with 600 dpi.
MCYT-75 is an offline signature dataset within the MCYT baseline corpus.9,45
It consists of 75 users with 15 genuine signatures and 15 forgeries for each user. The
users signed in a 127mm × 97mm box and each signature has been scanned at 600
dpi.
b The opposite-hand signatures are treated as forgeries as suggested by the authors of the dataset.
Table 1. Number of users, genuine and forged signatures, as well as dpi during
scanning for all datasets.

Name             Users   Genuine   Forgeries   dpi   used for tuning   used for testing
GPDS-last100^26    100      24        30       600          x
GPDS-75^26          75      24        30       600                            x
MCYT-75^9           75      15        15       600                            x
UTSig^44           115      27        45       600                            x
CEDAR^46            55      24        24       300                            x
• Skilled forgeries (SF): The target's genuine signature is known to the forger,
and usually, the forger has time to practice it. This often leads to forgeries that
bear a high resemblance to their genuine counterparts.
• Random forgeries (RF): Genuine signatures of other users are used in a brute-
force attack on the verification system. Another rationale is that a forger may use
their own signature when they have no knowledge of the target's signature.
In our experiments, we use one genuine signature from each other user as a
random forgery.
On all datasets but the UTSig data, we use the first 10 genuine signatures
for each user as reference (on UTSig, the first 12 signatures are employed). The
remaining genuine signatures are used as positive samples for the evaluation.
We evaluate the performance of our graph-based verification systems using the
equal error rate (EER). The EER is the error rate when the false rejection rate
(FRR) is equal to the false acceptance rate (FAR). The FRR refers to the percentage
of genuine signatures that are rejected by the system, and the FAR refers to the
percentage of forgeries accepted by the system. In order to determine FRR and
FAR directly, we have to decide on a (global) decision threshold (applied for all
users).
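A simple way to estimate the EER from genuine and forgery dissimilarity scores is to sweep a global threshold and locate where FRR and FAR cross (a sketch; real evaluations typically interpolate between discrete thresholds):

```python
import numpy as np

def equal_error_rate(genuine, forged):
    """Return the EER: the error rate where FRR (genuine rejected)
    equals FAR (forgeries accepted), scanning every observed score
    as a candidate global decision threshold."""
    thresholds = np.sort(np.concatenate([genuine, forged]))
    best = None
    for t in thresholds:
        frr = np.mean(genuine > t)    # genuine above threshold: rejected
        far = np.mean(forged <= t)    # forgeries below threshold: accepted
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0)
    return best[1]

eer = equal_error_rate(np.array([0.1, 0.2, 0.3, 0.4]),
                       np.array([0.35, 0.5, 0.6, 0.7]))
```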
We report the minimum, average, and maximum number of nodes in the graphs
for a given D (we show the results for D ∈ {25, 50, 100}). As expected, the number
of nodes increases when the threshold D is lowered.
In Table 3, we report the average runtimec on GPDS-last100 using both dBP
and dHED . The expected speed-up of dHED when compared with dBP is particularly
significant when using graph representations with more nodes.
The next question we want to answer is whether the distance measure dBP or
dHED performs better for the task of signature verification. In Table 4 we report the
EER on all datasets in both scenarios (RF and SF) achieved with dBP and dHED .
Regarding the random forgery scenario, we observe that the bipartite approximation
performs better than the Hausdorff approximation on all datasets. In the skilled
forgery scenario, however, dHED performs slightly better than the bipartite
counterpart on two datasets (UTSig and MCYT-75). On GPDS-75, the results
achieved with dBP are substantially better than those obtained with dHED,
while on CEDAR both dissimilarities yield the same EER.
In summary, we can conclude that the bipartite distance dBP leads in general to a
(slightly) better EER in this particular experiment. On the other hand, we observe
substantially better runtimes when using dHED as basic dissimilarity measure rather
than dBP .
c Runtime is with respect to a Java implementation and AMD Opteron 2354 nodes with 2.2 GHz
CPU.
Table 4. Equal error rates on four data sets in a random (RF) and skilled (SF)
forgery scenario using dBP and dHED.

                    RF                 SF
Dataset         dBP     dHED       dBP     dHED
GPDS-75^26      3.80    3.89       6.67    9.33
UTSig^44        4.90    4.00      18.96   17.33
MCYT-75^9       2.67    3.87      13.24   12.71
CEDAR^46        5.05    5.93      17.50   17.50
The present chapter reviews a recent line of research concerned with graph-based
signature verification. We review the core processes actually needed in a signature
verification framework, viz. preprocessing of signature images and graph extraction
as well as the graph dissimilarity computation by means of two approximation
algorithms (bipartite graph edit distance and Hausdorff edit distance).
In a non-exhaustive experimental evaluation, we compare the two baseline meth-
ods with each other on four benchmark datasets. While the Hausdorff approach
leads to substantially lower matching times (due to the lower complexity of the
algorithm), the bipartite approach achieves slightly better verification accuracy
(measured via equal error rates).
The system presented in this chapter actually builds the core for various subse-
quent verification systems. In Refs. 21 and 23, for instance, this system is extended
to an ensemble method that allows combining metric learning by means of a deep
CNN47 with the triplet loss function48 with the fast graph edit distance approxi-
mations described in this chapter.
Combining the present structural approach and statistical models has signif-
icantly improved the signature verification performance on the MCYT-75 and
GPDS-75 benchmark datasets.21 The structural model based on approximate graph
edit distance achieved better results on skilled forgeries, while the statistical model
based on metric learning with deep triplet networks achieved better results against
a brute-force attack with random forgeries. The proposed system was able to com-
bine these complementary strengths and has proven to generalize well to unseen
users, which have not been used for model training and parameter optimization.
In Refs. 22 and 24, the basic framework presented in this chapter is combined
with a tree-based inkball model. Inkball models are another recent structural ap-
proach for handwriting analysis, proposed by Howe in Ref. 49. Originally, this
approach was introduced as a technique for segmentation-free word spotting
that requires little training data. In addition to keyword spotting, inkball models
have been used for handwriting recognition as a complex feature in conjunction with
HMMs.50 Inkball models are visually similar to keypoint graphs since they use
very similar points on the handwriting as nodes. However, inkballs are connected in
a rooted tree that is matched directly with a skeleton image using an efficient algo-
rithm. The complementary aspects of the two dissimilarity measures are exploited
to achieve better verification results using a linear combination of the two dissim-
ilarity scores. The systems are evaluated individually as well as in combination,
and it is shown empirically that graph-based signature verification is able to reach
and, in some cases, surpass the current state of the art in signature verification,
motivating further research on structural approaches to signature verification.
References
1. D. Impedovo and G. Pirlo, Automatic signature verification: The state of the art,
IEEE Trans. on Systems, Man and Cybernetics Part C: Applications and Reviews. 38
(5), 609–635, (2008).
2. L. G. Hafemann, R. Sabourin, and L. S. Oliveira. Offline handwritten signature ver-
ification - literature review. In Proc of Int. Conf. on Image Processing Theory, Tools
and Applications (IPTA), pp. 1–8 (Nov, 2017).
3. M. Diaz, M. A. Ferrer, D. Impedovo, M. I. Malik, G. Pirlo, and R. Plamondon, A
perspective analysis of handwritten signature technology, ACM Comput. Surv. 51
(6), 117:1–117:39 (Jan, 2019). ISSN 0360-0300. doi: 10.1145/3274658. URL
http://doi.acm.org/10.1145/3274658.
4. P. S. Deng, H.-Y. M. Liao, C. W. Ho, and H.-R. Tyan, Wavelet-Based Off-Line Hand-
written Signature Verification, Computer Vision and Image Understanding. 76(3),
173–190, (1999).
5. A. Gilperez, F. Alonso-Fernandez, S. Pecharroman, J. Fierrez, and J. Ortega-Garcia.
Off-line signature verification using contour features. In International Conference on
Frontiers in Handwriting Recognition. Concordia University, (2008).
6. M. A. Ferrer, J. Alonso, and C. Travieso, Offline geometric parameters for auto-
matic signature verification using fixed-point arithmetic, IEEE Transactions on Pat-
tern Analysis and Machine Intelligence. 27(6), 993–997, (2005).
7. A. Piyush Shanker and A. Rajagopalan, Off-line signature verification using DTW,
Pattern Recognition Letters. 28(12), 1407–1414 (9, 2007).
8. F. Alonso-Fernandez, M. Fairhurst, J. Fierrez, and J. Ortega-Garcia. Automatic Mea-
sures for Predicting Performance in Off-Line Signature. In IEEE International Con-
ference on Image Processing, pp. I–369–I–372. IEEE, (2007).
9. J. Fierrez-Aguilar, N. Alonso-Hermira, G. Moreno-Marquez, and J. Ortega-Garcia. An
off-line signature verification system based on fusion of local and global information.
In Biometric Authentication, pp. 295–306. Springer, (2004).
10. M. B. Yilmaz, B. Yanikoglu, C. Tirkaz, and A. Kholmatov. Offline signature verifi-
cation using classifier combination of HOG and LBP features. In International Joint
Conference on Biometrics, pp. 1–7. IEEE, (2011).
11. M. A. Ferrer, J. F. Vargas, A. Morales, and A. Ordonez, Robustness of Offline Signa-
ture Verification Based on Gray Level Features, IEEE Transactions on Information
Forensics and Security. 7(3), 966–977 (jun, 2012). ISSN 1556-6013.
12. S. Dey, A. Dutta, J. I. Toledo, S. K. Ghosh, J. Llados, and U. Pal. SigNet: Convolu-
tional Siamese Network for Writer Independent Offline Signature Verification. (2017).
13. L. G. Hafemann, R. Sabourin, and L. S. Oliveira, Learning features for offline hand-
written signature verification using deep convolutional neural networks, Pattern Recog-
nition. 70, 163–176, (2017).
14. A. Soleimani, B. N. Araabi, and K. Fouladi, Deep multitask metric learning for offline
signature verification, Pattern Recognition Letters. 80, 84–90, (2016).
15. M. Stauffer, P. Maergner, A. Fischer, and K. Riesen, Polar Graph Embedding for
Handwriting Applications, Pattern Analysis and Applications. Submitted, (2019).
16. R. Sabourin, R. Plamondon, and L. Beaumier, Structural interpretation of handwrit-
ten signature images, Int. Journal of Pattern Recognition and Artificial Intelligence.
8(3), 709–748, (1994).
17. A. Bansal, B. Gupta, G. Khandelwal, and S. Chakraverty, Offline signature verification
using critical region matching, Int. Journal of Signal Processing, Image Processing and
Pattern. 2(1), 57–70, (2009).
18. T. Fotak, M. Baca, and P. Koruga, Handwritten signature identification using basic
concepts of graph theory, WSEAS Transactions on Signal Processing. 7(4), 145–157,
(2011).
19. A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke, Approximation of graph
edit distance based on Hausdorff matching, Pattern Recognition. 48(2), 331–343 (2,
2015).
20. K. Riesen and H. Bunke, Approximate graph edit distance computation by means of
bipartite graph matching, Image and Vision Computing. 27(7), 950–959 (6, 2009).
21. P. Maergner, V. Pondenkandath, M. Alberti, M. Liwicki, K. Riesen, R. Ingold, and
A. Fischer. Offline Signature Verification by Combining Graph Edit Distance and
Triplet Networks. In International Workshop on Structural, Syntactic, and Statistical
Pattern Recognition, pp. 470–480. Springer, (2018).
22. P. Maergner, N. Howe, K. Riesen, R. Ingold, and A. Fischer. Offline Signature Ver-
ification Via Structural Methods: Graph Edit Distance and Inkball Models. In In-
ternational Conference on Frontiers in Handwriting Recognition, pp. 163–168. IEEE,
(2018).
23. P. Maergner, V. Pondenkandath, M. Alberti, M. Liwicki, K. Riesen, R. Ingold, and
A. Fischer. Combining graph edit distance and triplet networks for offline signature
verification, Pattern Recognition Letters. 125, 527–533, (2019).
24. P. Maergner, N. Howe, K. Riesen, R. Ingold, and A. Fischer. Graph-Based Offline
Signature Verification, arXiv preprint arXiv:1906.10401, (2019).
25. T. Y. Zhang and C. Y. Suen, A fast parallel algorithm for thinning digital patterns,
Communications of the ACM. 27(3), 236–239, (1984).
26. M. A. Ferrer, M. Diaz-Cabrera, and A. Morales, Static Signature Synthesis: A Neu-
romotor Inspired Approach for Biometrics, IEEE Transactions on Pattern Analysis
and Machine Intelligence. 37(3), 667–680 (mar, 2015). ISSN 0162-8828.
27. A. Fischer, M. Diaz, R. Plamondon, and M. A. Ferrer. Robust score normalization
for DTW-based on-line signature verification. In Proc. of International Conference on
Document Analysis and Recognition (ICDAR), pp. 241–245. IEEE (8, 2015).
28. D. Conte, P. Foggia, C. Sansone, and M. Vento, Thirty years of graph matching in
pattern recognition, Int. Journal of Pattern Recognition and Artificial Intelligence. 18
(3), 265–298, (2004).
29. P. Foggia, G. Percannella, and M. Vento, Graph Matching and Learning in Pattern
Recognition in the last 10 Years, International Journal of Pattern Recognition and
Artificial Intelligence. 28(01), 1450001, (2014).
30. H. Bunke and G. Allermann, Inexact graph matching for structural pattern recogni-
tion, Pattern Recognition Letters. 1(4), 245–253 (5, 1983).
31. K. Riesen, Structural Pattern Recognition with Graph Edit Distance. Advances in
Computer Vision and Pattern Recognition, (Springer International Publishing, 2015).
32. P. Hart, N. Nilsson, and B. Raphael, A formal basis for the heuristic determination
of minimum cost paths, IEEE Transactions of Systems, Science, and Cybernetics. 4
(2), 100–107, (1968).
33. L. Gregory and J. Kittler. Using graph search techniques for contextual colour re-
trieval. In eds. T. Caelli, A. Amin, R. Duin, M. Kamel, and D. de Ridder, Proc. of the
Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern
Recognition, LNCS 2396, pp. 186–194, (2002).
34. S. Berretti, A. Del Bimbo, and E. Vicario, Efficient matching and indexing of graph
models in content-based retrieval, IEEE Trans. on Pattern Analysis and Machine
Intelligence. 23(10), 1089–1105, (2001).
35. X. Cortés, F. Serratosa, and A. Solé. Active graph matching based on pairwise prob-
abilities between nodes. In eds. G. Gimel’farb, E. Hancock, A. Imiya, A. Kuijper,
M. Kudo, O. S., T. Windeatt, and K. Yamad, Proc. 14th Int. Workshop on Structural
and Syntactic Pattern Recognition, LNCS 7626, pp. 98–106, (2012).
36. K. Riesen and H. Bunke, Approximate graph edit distance computation by means of
bipartite graph matching, Image and Vision Computing. 27(4), 950–959, (2009).
37. R. Burkard, M. Dell’Amico, and S. Martello, Assignment Problems. (Society for In-
dustrial and Applied Mathematics, Philadelphia, PA, USA, 2009). ISBN 0898716632,
9780898716634.
38. K. Riesen, Structural Pattern Recognition with Graph Edit Distance. (Springer, 2016).
39. J. Munkres, Algorithms for the Assignment and Transportation Problems, Journal of
the Society for Industrial and Applied Mathematics. 5(1), 32–38, (1957).
40. A. Fischer, C. Suen, V. Frinken, K. Riesen, and H. Bunke. A fast matching algorithm
for graph-based handwriting recognition. In eds. W. Kropatsch, N. Artner, Y. Hax-
himusa, and X. Jiang, Proc. 8th Int. Workshop on Graph Based Representations in
Pattern Recognition, LNCS 7877, pp. 194–203, (2013).
41. D. P. Huttenlocher, G. A. Klanderman, G. A. Kl, and W. J. Rucklidge, Comparing
images using the Hausdorff distance, IEEE Trans. PAMI. 15, 850–863, (1993).
42. A. Fischer, S. Uchida, V. Frinken, K. Riesen, and H. Bunke. Improving hausdorff
edit distance using structural node context. In eds. C. Liu, B. Luo, W. Kropatsch,
and J. Cheng, Proc. 10th Int. Workshop on Graph Based Representations in Pattern
Recognition, LNCS 9069, pp. 148–157, (2015).
43. M. A. Ferrer. GPDSsyntheticSignature database website, (2016). URL http://www.
gpds.ulpgc.es/downloadnew/download.htm. accessed on Jan 28, 2019.
44. A. Soleimani, K. Fouladi, and B. N. Araabi, UTSig: A Persian offline signature dataset,
IET Biometrics. 6(1), 1–8, (2016).
45. J. Ortega-Garcia, J. Fierrez-Aguilar, D. Simon, J. Gonzalez, M. Faundez-Zanuy, V. Es-
pinosa, A. Satue, I. Hernaez, J.-J. Igarza, C. Vivaracho, D. Escudero, and Q.-I. Moro,
MCYT baseline corpus: a bimodal biometric database, IEEE Proceedings-Vision, Im-
age and Signal Processing. 150(6), 395–401, (2003).
46. M. K. Kalera, S. Srihari, and A. Xu, Offline signature verification and identification
using distance statistics, International Journal of Pattern Recognition and Artificial
Intelligence. 18(07), 1339–1360, (2004).
47. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proc of Conf. on Computer Vision and Pattern Recognition, pp. 770–778, (2016).
48. E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International
Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, (2015).
49. N. Howe. Part-structured inkball models for one-shot handwritten word spotting. In
Proc. of International Conference on Document Analysis and Recognition (ICDAR),
(2013).
50. N. Howe, A. Fischer, and B. Wicht. Inkball models as features for handwriting recog-
nition. In Proc. of International Conference on Frontiers in Handwriting Recognition
(ICFHR), (2016).
CHAPTER 2.8
A cellular neural network (CNN) is adopted for seismic pattern recognition. We design the CNN to behave as an associative memory according to the stored patterns and complete the training process of the network. We then use this associative memory to recognize seismic testing patterns. In the experiments, the analyzed seismic patterns are the bright spot pattern and the right and left pinch-out patterns, which have the structure of gas and oil sand zones. The recognition results show that noisy seismic patterns can be recovered, and a comparison of the experimental results shows that the CNN has better recovery capacity than the Hopfield model. We also run experiments on seismic images: by moving a window over the image, the bright spot pattern and the horizon pattern can be detected. The results of seismic pattern recognition using CNN are good and can assist the analysis and interpretation of seismic data.
1. Introduction
In 1988, Chua and Yang proposed the theory and applications of the cellular neural network (CNN) [1]–[3]. Subsequent studies addressed the discrete-time CNN (DT-CNN) [4]–[6], and several papers discussed stability analysis and attractivity analysis [7]–[10]. CNNs have been used in many applications, such as the detection of geological lineaments on Radarsat images and seismic horizon picking [11], [12].
Here the DT-CNN is used as the associative memory [5], [6]. Each memory pattern corresponds to a unique, globally asymptotically stable equilibrium point of the network. We use the motion equation of a cellular neural network to realize an associative memory, and then use the associative memory to recognize patterns.
The seismic pattern recognition system using CNN is shown in Fig. 1. The
process of seismic pattern recognition is composed of two parts. In the training
part, the training seismic patterns can be used to construct the auto-associative
324 K. Y. Huang and W. H. Hsieh
memory using DT-CNN. In the recognition part, the input testing seismic pattern
can be recognized by the auto-associative memory.
Fig. 1. The seismic pattern recognition system using CNN. Training: training seismic patterns → cellular neural network → auto-associative memory. Recognition: testing seismic pattern → auto-associative memory → recovered seismic pattern.
The primary element of a CNN is the cell, shown in Fig. 2. Each cell has an input, a threshold, and an output. The cells of a CNN are usually arranged in a two-dimensional array, as shown in Fig. 3. In a CNN, every cell is directly influenced only by its neighboring cells, not by all other cells: the input of one cell comes from the inputs and outputs of its near neighboring cells only.
Fig. 2. The element of a cellular neural network, cell C_ij, with input u_ij, state x_ij, output y_ij, and threshold I_ij.

Fig. 3. A 6×10 cell array. With radius r = 1, the 3×3 sphere of influence of cell C_ij covers cell C_ij and its neighboring cells.
Cellular Neural Network for Seismic Pattern Recognition 325
N_ij(r) represents the set of neighboring cells C_kl of cell C_ij, where the radius r is a positive integer; N_ij(r) is a (2r+1) × (2r+1) cell array. For simplicity and convenience, we omit r and write N_ij(r) as N_ij. For example, for r = 1 the range of cell C_ij and its neighboring cells is of size 3 × 3, as shown in Fig. 3 [1]: the cells inside the grey square in Fig. 3 form N_ij(1), a 3 × 3 cell array.
To illustrate the propagation of the inputs and outputs of the neighboring cells, the cell array of a CNN is shown in Fig. 4. The left and right networks are the same as the middle network; they are drawn separately only for ease of interpretation. In a CNN, the next state of a cell is influenced by the inputs and outputs of the cells near it; these all feed back to the cell and are regarded as its inputs. The cell together with its neighboring connection relation is shifted over the array and is called a template.
Fig. 4. Standard CNN (A, B, I): neighboring outputs are multiplied by feedback template A and neighboring inputs by control template B.
Each cell has its basic circuit structure [1]. Starting from the continuous-time equation of motion, Harrer and Nossek derived the discrete-time CNN (DT-CNN) [4]. Grassi then designed DT-CNNs for associative memories [5], [6].
Here we use Grassi's method in the analysis. Consider a DT-CNN with a two-dimensional M × N cell array [6]. For each cell (i, j), the equations of motion of the DT-CNN are as follows:

x_ij(t+1) = Σ_{C_kl ∈ S_ij} A(i, j; k, l) y_kl(t) + Σ_{C_kl ∈ S_ij} B(i, j; k, l) u_kl + I_ij,   (1)

y_ij(t+1) = f(x_ij(t+1)) = 1 for x_ij(t+1) ≥ 0, and −1 for x_ij(t+1) < 0,   (2)

1 ≤ i ≤ M, 1 ≤ j ≤ N,
where A(i, j; k, l) is the output weight from cell (k, l) to cell (i, j); B(i, j; k, l) is the input weight from cell (k, l) to cell (i, j); I_ij is the external input to cell (i, j); y_kl(t) is the output of cell (k, l) at time t; u_kl is the input of the neighboring cell (k, l); C_kl is cell (k, l); S_ij is the set of neighboring cells of cell (i, j); and f(·) is the hard-limiter activation function.
Equations (1) and (2) can be written in vector form [6]:
x(t+1) = A y(t) + B u + e (3)
y(t) = f(x(t)) (4)
In (3) and (4), the symbols are as follows:
x = [x_1 x_2 … x_n]^T ∈ R^{n×1}, the vector of cell states;
y = [y_1 y_2 … y_n]^T ∈ R^{n×1}, the vector of cell outputs;
u = [u_1 u_2 … u_n]^T ∈ R^{n×1}, the vector of cell inputs;
e = [I_1 I_2 … I_n]^T ∈ R^{n×1}, the bias (threshold) vector.
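As a minimal illustration of Eqs. (2)–(4), one update step can be sketched in Python; the 3-cell network, template values, and input below are illustrative, not taken from the chapter.

```python
import numpy as np

def hard_limiter(x):
    # Eq. (2): y = 1 if x >= 0, else -1
    return np.where(x >= 0.0, 1.0, -1.0)

def dtcnn_step(A, B, u, e, y):
    # Eq. (3): next state from the current outputs and the constant input
    x_next = A @ y + B @ u + e
    # Eq. (4): new output through the activation function
    return x_next, hard_limiter(x_next)

A = 0.1 * np.eye(3)                # weak self-feedback template
B = np.eye(3)                      # identity input template
u = np.array([1.0, -1.0, 1.0])     # constant input pattern
e = np.zeros(3)                    # zero bias
y = np.array([-1.0, -1.0, -1.0])   # arbitrary initial output

for _ in range(10):
    x, y = dtcnn_step(A, B, u, e, y)
# y settles to the sign pattern of the input: [1, -1, 1]
```

Because the feedback is weak, the input term dominates and the output converges in a few iterations.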
Fig. 5. The connection relation of cells for designing associative memories with r = 1 and n = 16: a 4 × 4 cell array with cells numbered 1–16 row by row.
If there are n cells, then the feedback coefficients of the cells can be expressed
with the following matrix A.
a_11 represents the self-feedback coefficient of the 1st cell, a_12 represents the feedback coefficient of the next cell (the 2nd cell) of the 1st cell in clockwise order, and a_21 represents the feedback coefficient of the next cell (the 1st cell) of the 2nd cell in counterclockwise order. Since the feedback template is a one-dimensional space-invariant template, a_11 = a_22 = … = a_nn = α_1 represents the self-feedback coefficient of each cell; a_12 = a_23 = … = a_(n−1)n = a_n1 = α_2 represents the feedback coefficient of the next cell in clockwise order; a_1n = a_21 = a_32 = … = a_n(n−1) = α_n represents the feedback coefficient of the next cell in counterclockwise order; a_13 = a_24 = … = a_(n−2)n = a_(n−1)1 = a_n2 = α_3 represents the feedback coefficient of the second next cell in clockwise order; and so on. So matrix A can be expressed as

A = [ α_1   α_2   …   α_n
      α_n   α_1   …   α_(n−1)
       ⋮          ⋱    ⋮
      α_2   α_3   …   α_1 ]
Therefore A is a circulant matrix. The following one-dimensional space-invariant template is considered:

[ a(0)  a(1)  …  a(r)  0  …  0  a(−r)  …  a(−1) ].   (6)
Eq. (6) is the first row of matrix A. The last element of the first row becomes the first element of the second row, and the remaining elements of the first row are cyclically shifted right by one position to form the second through last elements of the second row. Similarly, each new row is the previous row cyclically right-shifted once; in this way matrix A is defined.
A = [ a(0)    a(1)   …   a(r)   0   …   0   a(−r)   …   a(−1)
      a(−1)   a(0)   …   a(r)   0   …   0   a(−r)   …   a(−2)
       ⋮                                                 ⋮
      a(1)    a(2)   …   …   …   …   …      a(−1)       a(0) ]   (7)
When matrix A is designed as a circulant matrix, only its first row needs to be designed; each subsequent row is the previous row cyclically right-shifted once. The number of 0s in Eq. (6) is determined by the radius r and by n. A ∈ R^{n×n}: each row of matrix A has n elements, and there are n rows in A. If n = 9, there are nine elements in each row. When r = 1, the one-dimensional template, arranged according to Eq. (6), is

[ a(0)  a(1)  0  0  0  0  0  0  a(−1) ].   (8)

In Eq. (8) there are six 0s in the middle of the template; together with the other three elements, there are nine elements in the row.
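To illustrate the construction, the circulant matrix A of Eq. (7) can be generated from the one-dimensional template row of Eq. (8); the template values below are illustrative small numbers, not from the text.

```python
import numpy as np

def circulant_from_row(first_row):
    # each row is the previous row cyclically right-shifted once
    n = len(first_row)
    return np.array([np.roll(first_row, k) for k in range(n)])

n = 9
first_row = np.zeros(n)
first_row[0] = 0.05    # a(0), self-feedback
first_row[1] = 0.02    # a(1), next cell in clockwise order
first_row[-1] = 0.02   # a(-1), next cell in counterclockwise order

A = circulant_from_row(first_row)  # n x n circulant matrix as in Eq. (7)
```

Only the first row is specified; the shift structure supplies the rest.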
2.4. Stability
|F(2πq/n)| < 1,   q = 0, 1, 2, …, n−1,   (9)

where F(2πq/n) = Σ_{k=−r}^{r} a(k) e^{−j2πqk/n} are the eigenvalues of the circulant matrix A.
The stability criterion (9) can easily be satisfied by choosing small values for the elements of the one-dimensional space-invariant template. In particular, the larger the network dimension n is, the smaller the element values must be, by (10). On the other hand, the feedback values cannot all be zero, since the stability properties considered here require that (3) be a dynamical system. These observations help the designer set the values of the feedback parameters: the lower bound is zero, whereas the upper bound is related to the network dimension.
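Since the eigenvalues of a circulant matrix are exactly the DFT values of its first row, criterion (9) can be checked numerically; the template below is illustrative.

```python
import numpy as np

# first row of the circulant A: a(0) and a(1) at the front, a(-1) at the back
first_row = np.zeros(9)
first_row[0], first_row[1], first_row[-1] = 0.05, 0.02, 0.02

# F(2*pi*q/n), q = 0..n-1, are the DFT values of the first row
eig_mags = np.abs(np.fft.fft(first_row))
stable = bool(np.all(eig_mags < 1.0))
# small template values keep every |F(2*pi*q/n)| well below 1
```

Here the largest eigenvalue magnitude is a(0) + a(1) + a(−1) = 0.09, so the criterion holds.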
Given the m training patterns u^i, i = 1, 2, …, m, for each u^i there is only one equilibrium point x^i satisfying the motion equation (3):

x^1 = A y^1 + B u^1 + e
x^2 = A y^2 + B u^2 + e   (11)
 ⋮
x^m = A y^m + B u^m + e
d^i = [d_1^i d_2^i … d_n^i]^T ∈ R^{n×1},   i = 1, …, m,

R = [U^T h] ∈ R^{m×(n+1)}, where h is the m × 1 all-ones vector:

[U^T h] = [ u_1^1  u_2^1  …  u_n^1  1
            u_1^2  u_2^2  …  u_n^2  1
             ⋮             ⋱        ⋮
            u_1^m  u_2^m  …  u_n^m  1 ]
From (13), BU + J = X − A_y, where B = [b_ij] ∈ R^{n×n}, U = [u^1 u^2 … u^m] ∈ R^{n×m}, and J = [e e … e] ∈ R^{n×m}, i.e. every column of J equals the bias vector e = [I_1 I_2 … I_n]^T.
w̃_j^T = R̃_j^+ (X_j^T − A_{y,j}^T),   j = 1, 2, …, n,   (16)

where R̃_j^+ denotes the pseudoinverse of R̃_j. R̃_j is obtained from R according to the connection relation between the input of the jth cell and the inputs of the other cells: we express this connection relation by a matrix S, and R̃_j is formed by taking the columns of R that correspond to the inputs connected to the jth cell, together with the last column h, so that R̃_j has h_j = (Σ_{i=1}^n s_ji) + 1 columns. S ∈ R^{n×n} is the matrix representing the connection relation of the cells' inputs:

s_ij = 1, if the ith cell's input and the jth cell's input have a connection relation;
s_ij = 0, if the ith cell's input and the jth cell's input have no connection relation.
For example, for the 4 × 4 cell array with radius r = 1 in Fig. 5, S is:

S = [ 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0
      1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0
      0 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0
      0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0
      1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0
      1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 0
      0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0
      0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0
      0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0
      0 0 0 0 1 1 1 0 1 1 1 0 1 1 1 0
      0 0 0 0 0 1 1 1 0 1 1 1 0 1 1 1
      0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1
      0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0
      0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0
      0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1
      0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 ]
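The 16 × 16 matrix above can be generated rather than typed out; a minimal sketch, numbering the cells row by row as in Fig. 5 (the function name is ours):

```python
import numpy as np

def connection_matrix(M, N, r):
    # s_ij = 1 iff cell j lies inside the (2r+1) x (2r+1) window of cell i
    n = M * N
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        ri, ci = divmod(i, N)          # row/column position of the ith cell
        for j in range(n):
            rj, cj = divmod(j, N)
            if abs(ri - rj) <= r and abs(ci - cj) <= r:
                S[i, j] = 1
    return S

S = connection_matrix(4, 4, 1)
# S[0] reproduces the first row above: 1 1 0 0 1 1 0 0 0 ... 0
```

The construction also makes the symmetry of S evident: input connectivity is mutual.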
R̃_j may not be unique, so w̃_j^T may not be unique, and hence B may not be unique; in that case B may not conform to the interconnecting structure of the network inputs. We therefore use the matrix S to represent the interconnecting structure of the network inputs and use the above derivation to calculate matrix B. The steps of using a CNN to design associative memories are summarized in the following.
Method (training):
(1) Set up matrix U from the training patterns u^i: U = [u^1 u^2 … u^m].
(2) Set Y = U.
(3) Set up S: s_ij = 1 if the ith cell's input and the jth cell's input have a connection relation, and s_ij = 0 otherwise.
(4) Design matrix A as a circulant matrix that satisfies the globally asymptotically stable condition.
(5) Set the value of α (α > 1), and calculate X = αY.
(6) Calculate A_y = AY.
(7) For j = 1 to n:
    Extract X_j from X.
    Extract A_{y,j} from A_y.
    Calculate R = [U^T h].
    Establish matrix R̃_j from matrix S and matrix R.
    Calculate the pseudoinverse R̃_j^+ of R̃_j.
    Calculate w̃_j^T = R̃_j^+ (X_j^T − A_{y,j}^T).
    Recover w_j from w̃_j^T.
End
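The design steps above can be sketched compactly in code; the two training patterns and the template are illustrative, a fully connected S is assumed for brevity, and `design_memory` is our own helper name.

```python
import numpy as np

def design_memory(U, S, first_row, alpha=3.0):
    n, m = U.shape
    Y = U                                                     # step (2)
    A = np.array([np.roll(first_row, k) for k in range(n)])   # step (4)
    X = alpha * Y                                             # step (5)
    Ay = A @ Y                                                # step (6)
    R = np.hstack([U.T, np.ones((m, 1))])                     # R = [U^T h]
    B, e = np.zeros((n, n)), np.zeros(n)
    for j in range(n):                                        # step (7)
        cols = np.flatnonzero(S[j])                # inputs connected to cell j
        Rj = R[:, np.append(cols, n)]              # connected columns plus h
        wj = np.linalg.pinv(Rj) @ (X[j] - Ay[j])   # Eq. (16)
        B[j, cols], e[j] = wj[:-1], wj[-1]
    return A, B, e

U = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]], dtype=float)  # two 4-cell patterns
S = np.ones((4, 4), dtype=int)
A, B, e = design_memory(U, S, first_row=np.array([0.1, 0.05, 0.0, 0.05]))
# each stored pattern satisfies the equilibrium equation x^i = A y^i + B u^i + e = alpha * u^i
```

With Y = U and X = αU, the pseudoinverse step makes every stored pattern an exact equilibrium of (11).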
Method (recognition):
(1) Set up the initial output vector y; its element values are all in the interval [−1, 1].
(2) Input the testing pattern u, together with A, B, e, and y, into the equation of motion to get x(t+1):
    x(t+1) = A y(t) + B u + e.
(3) Input x(t+1) into the activation function to get the new output y(t+1). The activation function is:
    y = 1, if x > 1;
    y = x, if −1 ≤ x ≤ 1;
    y = −1, if x < −1.
(4) Compare the new output y(t+1) with y(t). If they are the same, stop; otherwise feed the new output y(t+1) into the equation of motion again. Repeat Steps (2) to (4) until the output y no longer changes.
End
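The recall procedure can be sketched as follows; `recall` is our own helper name, and the tiny network below is illustrative rather than a trained seismic memory.

```python
import numpy as np

def activation(x):
    # piecewise-linear activation of Step (3), saturating at +/-1
    return np.clip(x, -1.0, 1.0)

def recall(A, B, e, u, y0, max_iter=100):
    y = y0
    for _ in range(max_iter):
        x = A @ y + B @ u + e          # Step (2): motion equation
        y_new = activation(x)          # Step (3): new output
        if np.array_equal(y_new, y):   # Step (4): stop when unchanged
            return y_new
        y = y_new
    return y

y = recall(A=0.1 * np.eye(2), B=np.eye(2), e=np.zeros(2),
           u=np.array([2.0, -2.0]), y0=np.zeros(2))
# the output settles at [1, -1]
```

The loop terminates as soon as two successive outputs coincide, matching the stopping rule of Step (4).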
4. Experiments
We have two kinds of experiments. The first is on simulated seismic pattern recognition; the analyzed seismic patterns are the bright spot pattern and the right and left pinch-out patterns, which have the structure of gas and oil sand zones [17]. The second is on simulated seismic images; the analyzed seismic patterns are the bright spot pattern and the horizon pattern, and we use a moving window to detect the patterns.
Fig. 9. Training seismic patterns: (a) bright spot, (b) right pinch-out, (c) left pinch-out.
Fig. 10. Noisy testing seismic patterns: (a) bright spot, (b) right pinch-out, (c) left pinch-out.
For r = 2, the recovered patterns are shown in Fig. 11(a), (b), and (c); they are not the correct output patterns. Fig. 11(d), (e), and (f) show the energy versus iteration. We therefore set the neighborhood radius to r = 3 and tested again. For r = 3, the recovered patterns are shown in Fig. 12(a), (b), and (c); Fig. 12(b) is not the correct output pattern. Fig. 12(d), (e), and (f) show the energy versus iteration. We therefore set the neighborhood radius to r = 4 and tested again. For r = 4, the recovered pattern, shown in Fig. 13(a), is the correct output pattern; Fig. 13(b) shows the energy versus iteration.
Fig. 11. For r = 2, (a) output of Fig. 10(a), (b) output of Fig. 10(b), (c) output of Fig. 10(c),
(d) energy curve of Fig. 10(a), (e) energy curve of Fig. 10(b), (f) energy curve of Fig. 10(c).
Fig. 12. For r = 3, (a) output of Fig. 10(a), (b) output of Fig. 10(b), (c) output of Fig. 10(c),
(d) energy curve of Fig. 10(a), (e) energy curve of Fig. 10(b), (f) energy curve of Fig. 10(c).
(a) (b)
Fig. 13. For r=4, (a) output of Fig. 10(b), (b) energy curve of Fig. 10(b).
Next, we apply the DT-CNN associative memory without matrix S to Fig. 10(a), (b), and (c), with α = 3 and neighborhood radius r = 1. The output recovered patterns are the same as Fig. 9(a), (b), and (c), respectively.
Fig. 14. Results of Hopfield associative memory: (a) output of Fig. 10(a), (b) output of Fig. 10(b),
(c) output of Fig. 10(c).
(a) (b)
Fig. 15. Two training seismic patterns: (a) bright spot pattern, (b) horizon pattern.
We have three testing seismic images, shown in Figs. 16(a), 17(a), and 18(a). Their size, 64 × 64, is larger than the 16 × 50 size of the training patterns. We use a window to extract the testing pattern from the seismic image; the size of this window equals the size of the training pattern, and the window is shifted from left to right and top to bottom over the testing seismic image. Whenever the output pattern of the network equals one of the training patterns, we record the coordinate of the upper-left corner of the window. After the window has been shifted to the last position on the testing seismic image and all testing patterns have been recognized, we calculate, for each kind of training pattern, the center coordinate of all recorded coordinates, and then use that center coordinate to recover the detected training pattern.
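The window-moving detection described above can be sketched as follows; `detect` is our own helper, and the associative-memory recall is abstracted as the function argument `recall_fn`.

```python
import numpy as np

def detect(image, pattern, recall_fn):
    ph, pw = pattern.shape
    H, W = image.shape
    hits = []
    # shift the training-pattern-sized window left-to-right, top-to-bottom
    for top in range(H - ph + 1):
        for left in range(W - pw + 1):
            window = image[top:top + ph, left:left + pw]
            if np.array_equal(recall_fn(window), pattern):
                hits.append((top, left))       # record the upper-left corner
    if not hits:
        return None
    # center coordinate of all recorded upper-left corners
    return tuple(np.mean(hits, axis=0))

image = np.zeros((6, 6))
image[2:4, 3:5] = 1.0                          # embedded 2 x 2 "pattern"
center = detect(image, np.ones((2, 2)), recall_fn=lambda w: w)
# center == (2.0, 3.0), the upper-left corner of the embedded pattern
```

In the chapter's experiments, `recall_fn` would be the trained DT-CNN associative memory rather than the identity mapping used here.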
We set the neighborhood radius r = 1 to process Fig. 16(a) and Fig. 17(a), and r = 3 to process Fig. 18(a). For the first image, in Fig. 16(a), the horizon is short, so the detected pattern in Fig. 16(c) is only the bright spot. For the second image, in Fig. 17(a), the horizon is long; the detected patterns in Fig. 17(c) are the horizon and the bright spot. For the third image, in Fig. 18(a), the horizon and bright spot patterns have discontinuities, but both kinds of patterns can still be detected, as shown in Fig. 18(c).
5. Conclusions
Acknowledgements
This work was supported in part by the National Science Council, Taiwan, under
NSC92-2213-E-009-095 and NSC93-2213-E-009-067.
References
1. L. O. Chua and Lin Yang, “Cellular Neural Networks: Theory,” IEEE Trans. on CAS, vol. 35, no. 10, pp. 1257-1272, 1988.
2. L. O. Chua and Lin Yang, “Cellular Neural Networks: Applications,” IEEE Trans. on CAS,
vol.35, no.10, pp. 1273-1290, 1988.
3. Leon O. Chua, CNN: A paradigm for complexity, World Scientific, 1998.
4. H. Harrer and J. A. Nossek, “Discrete-time Cellular Neural Networks,” International Journal
of Circuit Theory and Applications, vol. 20, pp. 453-468, 1992.
5. G. Grassi, “A new approach to design cellular neural networks for associative memories,”
IEEE Trans. Circuits Syst. I, vol. 44, pp. 835–838, Sept. 1997.
6. G. Grassi, “On discrete-time cellular neural networks for associative memories,” IEEE Trans.
Circuits Syst. I, vol. 48, pp. 107–111, Jan. 2001.
7. Liang Hu, Huijun Gao, and Wei Xing Zheng, “Novel stability of cellular neural networks with
interval time-varying delay,” Neural Networks, vol. 21, no. 10, pp. 1458-1463, Dec. 2008.
8. Lili Wang and Tianping Chen, “Complete stability of cellular neural networks with
unbounded time-varying delays,” Neural Networks, vol. 36, pp. 11-17, Dec. 2012.
9. Wu-Hua Chen, and Wei Xing Zheng, “A new method for complete stability analysis of
cellular neural networks with time delay,” IEEE Trans. on Neural Networks, vol. 21, no. 7, pp.
1126-1139, Jul. 2010.
10. Zhenyuan Guo, Jun Wang, and Zheng Yan, “Attractivity analysis of memristor-based cellular
neural networks with time-varying delays,” IEEE Trans. on Neural Networks, vol. 25, no. 4,
pp. 704-717, Apr. 2014.
11. R. Lepage, R. G. Rouhana, B. St-Onge, R. Noumeir, and R. Desjardins, “Cellular neural
network for automated detection of geological lineaments on radarsat images,” IEEE Trans. on
Geoscience and Remote Sensing, vol. 38, no. 3, pp. 1224-1233, May 2000.
12. Kou-Yuan Huang, Chin-Hua Chang, Wen-Shiang Hsieh, Shan-Chih Hsieh, Luke K. Wang, and Fan-Ren Tsai, “Cellular neural network for seismic horizon picking,” The 9th IEEE International Workshop on Cellular Neural Networks and Their Applications, CNNA 2005, May 28-30, Hsinchu, Taiwan, 2005, pp. 219-222.
13. R. Perfetti, “Frequency domain stability criteria for cellular neural networks,” Int. J. Circuit
Theory Appl., vol. 25, no. 1, pp. 55–68, 1997.
14. K. Y. Huang, K. S. Fu, S. W. Cheng, and T. H. Sheen, "Image processing of seismogram: (A)
Hough transformation for the detection of seismic patterns (B) Thinning processing in the
seismogram," Pattern Recognition, vol.18, no.6, pp. 429-440, 1985.
15. J. J. Hopfield and D. W. Tank, “Neural” computation of decisions in optimization problems,”
Biolog. Cybern., 52, pp. 141-152, 1985.
16. J. J. Hopfield and D. W. Tank, “Computing with neural circuits: A model,” Science, 233, pp.
625-633, 1986.
17. M. B. Dobrin and C. H. Savit, Introduction to Geophysical Prospecting, New York: McGraw-
Hill Book Co., 1988.
March 16, 2020 11:51 ws-rv961x669 HBPRCV-6th Edn.–11573 chap18 page 343
CHAPTER 2.9
Face sketches are able to capture the spatial topology of a face while lacking some facial
attributes such as race, skin, or hair color. Existing sketch-photo recognition and synthesis
approaches have mostly ignored the importance of facial attributes. This chapter introduces
two deep learning frameworks to train a Deep Coupled Convolutional Neural Network
(DCCNN) for facial attribute guided sketch-to-photo matching and synthesis. Specifically,
for sketch-to-photo matching, an attribute-centered loss is proposed which learns several
distinct centers, in a shared embedding space, for photos and sketches with different com-
binations of attributes. Similarly, a conditional CycleGAN framework is introduced which
forces facial attributes, such as skin and hair color, on the synthesized photo and does not
need a set of aligned face-sketch pairs during its training.
1. Introduction
Automatic face sketch-to-photo identification has always been an important topic in com-
puter vision and machine learning due to its vital applications in law enforcement.1,2 In
criminal and intelligence investigations, in many cases, the facial photograph of a suspect
is not available, and a forensic hand-drawn or computer generated composite sketch follow-
ing the description provided by the testimony of an eyewitness is the only clue to identify
possible suspects. Depending on the existence or absence of the suspect’s photo in the law enforcement database, either an automatic matching algorithm or a sketch-to-photo synthesis is needed.
Automatic Face Verification: An automatic matching algorithm is necessary for a
quick and accurate search of the law enforcement face databases or surveillance cameras
using a forensic sketch. The forensic or composite sketches, however, encode only limited
information of the suspects’ appearance such as the spatial topology of their faces while
the majority of the soft biometric traits, such as skin, race, or hair color, are left out.
Traditionally, sketch recognition algorithms were of two categories, namely generative
and discriminative approaches. Generative algorithms map one of the modalities into the
other and perform the matching in the second modality.3,4 On the contrary, discriminative
approaches learn to extract useful and discriminative common features to perform the veri-
fication, such as the Weber’s local descriptor (WLD),5 and scale-invariant feature transform
343
(SIFT).6 Nonetheless, these features are not always optimal for a cross-modal recognition
task.7 More recently, deep learning-based approaches have emerged as a general solution to the problem of cross-domain face recognition, enabled by their ability to learn a common latent embedding between the two modalities.8,9 Despite all this success, employing deep learning techniques for the sketch-to-photo recognition problem is still challenging compared to other, single-modality domains, as it requires a large number of data samples to avoid over-fitting the training data or stopping at local minima. Furthermore, the majority of sketch-photo datasets include only a few pairs of corresponding sketches and photos.
Existing state-of-the-art methods primarily focus on mapping the semantic representations of the two domains into a single shared subspace, while the lack of soft-biometric information in the sketch modality is completely ignored. Despite the impressive results of recent sketch-photo recognition algorithms, conditioning the matching process on soft biometric traits has not been adequately investigated. Manipulating facial attributes in photos has been an active research topic for years.10 The application of soft biometric traits to person re-identification has also been studied in the literature.11,12 A direct suspect identification framework based solely on descriptive facial attributes is introduced in Ref. 13; however, it completely neglects sketch images. In recent work, Mittal et al.14 employed facial attributes (e.g. ethnicity, gender, and skin color) to reorder the list of ranked identities, fusing multiple sketches of a single identity to boost the performance of their algorithm.
In this chapter, we introduce a facial attribute-guided cross-modal face verification
scheme conditioned on relevant facial attributes. To this end, a new loss function, namely
attribute-centered loss, is proposed to help the network in capturing the similarity of identi-
ties that have the same facial attributes combination. This loss function is defined based on
assigning a distinct centroid (center point), in the embedding space, to each combination of
facial attributes. Then, a deep neural network can be trained using a pair of sketch-attribute.
The proposed loss function encourages the DCNN to map a photo and its corresponding
sketch-attribute pair into a shared latent sub-space in which they have similar representa-
tions. Simultaneously, the proposed loss forces the distance of all the photos and sketch-
attribute pairs to their corresponding centers to be less than a pre-specified margin. This helps the network filter out subjects that have a facial structure similar to the query but share only a limited number of facial attributes. Finally, the learned centers are trained to keep
a distance related to their number of contradictory attributes. The justification behind the
latter is that it is more likely that a victim misclassifies a few facial attributes of the suspect
than most of them.
Sketch-to-Photo Synthesis: In law enforcement, the photo of the person of interest
is not always available in the police database. Here, an automatic face sketch-to-photo synthesis comes in handy, enabling investigators to produce suspects’ photos from the drawn forensic
sketches. The majority of the current research works in the literature of sketch-based photo
synthesis have tackled the problem using pairs of sketches and photos that are captured un-
der highly controlled conditions, i.e., neutral expression and frontal pose. Different tech-
tributes. To this end, we developed a conditional version of the CycleGAN which we refer
to as the cCycleGAN and trained it by an extra discriminator to force the desired facial
attributes on the synthesized images.
The center loss is defined as

L_c = (1/2) Σ_{i=1}^m ‖x_i − c_{y_i}‖₂²,   (1)

where m denotes the number of samples in a mini-batch, x_i ∈ R^d denotes the ith sample's feature embedding, belonging to class y_i, c_{y_i} ∈ R^d denotes the y_i th class center of the embedded features, and d is the feature dimension. To train a deep neural network, a joint supervision of the center loss and the cross-entropy loss is adopted:
L = Ls + λLc , (2)
where L_s is the softmax (cross-entropy) loss. The center loss, as defined in Eq. (1), is deficient in that it only penalizes the compactness of intra-class variations without considering inter-class separation. To address this issue, a contrastive-center loss has been proposed in Ref. 35 as
L_ct−c = (1/2) Σ_{i=1}^m ‖x_i − c_{y_i}‖₂² / ( Σ_{j=1, j≠y_i}^k ‖x_i − c_j‖₂² + δ ),   (3)
where δ is a constant preventing a zero denominator, and k is the number of classes. This
loss function not only penalizes the intra-class variations but also maximizes the distance
between each sample and all the centers belonging to the other classes.
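A numpy sketch of the contrastive-center loss of Eq. (3); the embeddings, labels, and centers below are illustrative, and the function name is ours.

```python
import numpy as np

def contrastive_center_loss(x, y, centers, delta=1e-6):
    # Eq. (3): intra-class distance over the summed inter-class distances
    loss = 0.0
    for i in range(x.shape[0]):
        num = np.sum((x[i] - centers[y[i]]) ** 2)              # ||x_i - c_{y_i}||^2
        den = sum(np.sum((x[i] - c) ** 2)                      # sum over other centers
                  for j, c in enumerate(centers) if j != y[i]) + delta
        loss += num / den
    return 0.5 * loss

x = np.array([[0.1, 0.0], [1.0, 0.9]])       # two embedded samples
labels = [0, 1]
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
loss_val = contrastive_center_loss(x, labels, centers)
# small value: each sample is near its own center and far from the other
```

Shrinking the numerator or growing the denominator lowers the loss, which is exactly the compactness-plus-separation behavior described above.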
center to each identity as in Refs. 32 and 35. Here, however, we assign centers to different combinations of facial attributes; in other words, the number of centers equals the number of possible facial attribute combinations. To define our attribute-centered loss, it is important to briefly describe the overall structure of the recognition network.
Fig. 1.: Coupled deep neural network structure. Photo-DCNN (upper network) and sketch-attribute-
DCNN (lower network) map the photos and sketch-attribute pairs into a common latent subspace.
In the problem of facial-attribute guided sketch-photo recognition, one can consider dif-
ferent combinations of facial attributes as distinct classes. With this intuition in mind, the
first task of the network is to learn a set of discriminative features for inter-class (between
different combinations of facial attributes) separability. However, the second goal of our
network differs from the two previous works,32,35 which were looking for a compact representation of intra-class variations. On the contrary, here intra-class variations represent faces with different geometrical properties, or more specifically, different identities.
Consequently, the coupled DCNN should be trained to keep the separability of the identities
as well. To this end, we define the attribute-centered loss function as
Lac = Lattr + Lid + Lcen , (4)
where Lattr is a loss to minimize the intra-class distances of photos or sketch-attribute
pairs which share similar combination of facial attributes, Lid denotes the identity loss for
intra-class separability, and Lcen forces the centers to keep distance from each other in
the embedding subspace for better inter-class discrimination. The attribute loss Lattr is
formulated as
L_attr = (1/2) Σ_{i=1}^m [ max(‖p_i − c_{y_i}‖₂² − ε_c, 0) + max(‖s_i^g − c_{y_i}‖₂² − ε_c, 0) + max(‖s_i^im − c_{y_i}‖₂² − ε_c, 0) ],   (5)

where ε_c is a margin promoting convergence, and p_i is the feature embedding of the input photo by the photo-DCNN with the attribute combination represented by y_i. Also, s_i^g and s_i^im (see
Figure 1) are the feature embeddings of two sketches with the same combination of attributes as p_i but with the same (genuine pair) or different (impostor pair) identity, respectively. In contrast to the center loss (1), the attribute loss does not try to push the samples all the way to the center, but keeps them around the center within a margin of radius ε_c (see Figure 2). This gives the network the flexibility to learn a discriminative feature space inside the margin for intra-class separability. This intra-class discriminative representation is learned by the network through the identity loss L_id, which is defined as
L_id = (1/2) Σ_{i=1}^m [ ‖p_i − s_i^g‖₂² + max(ε_d − ‖p_i − s_i^im‖₂, 0)² ],   (6)
which is a contrastive loss33 with a margin of d that pushes the photos and sketches of the
same identity toward each other, within their center's margin c, and takes the photos and
sketches of different identities apart. Obviously, the contrastive margin d should be less
than twice the attribute margin c, i.e. d < 2 × c (see Figure 2). However, from a
theoretical point of view, the minimization of identity loss, Lid , and attribute loss, Lattr ,
has a trivial solution if all the centers converge to a single point in the embedding space.
This solution can be prevented by pushing the centers to keep a minimum distance. For
this reason, we define another loss term formulated as
Lcen = (1/2) Σ_{j=1}^{n_c} Σ_{k=1, k≠j}^{n_c} max(m_{jk} − ‖c_j − c_k‖₂², 0),    (7)
where n_c is the total number of centers, c_j and c_k denote the jth and kth centers, and m_{jk} is
the associated distance margin between c_j and c_k. In other words, this loss term enforces
a minimum distance m_{jk} between each pair of centers, which is related to the number
of contradictory attributes between two centers c_j and c_k. Consequently, two centers which
differ in only a few attributes are closer to each other than those with a larger number of dissimilar
attributes. The intuition behind the similarity-related margin is that eyewitnesses may
misjudge one or two attributes, but it is less likely that they mix up more than that. Therefore,
during testing, it is very probable that the top-ranked suspects have a few contradictory
attributes when compared with the attributes provided by the victim. Figure 2 visualizes
the overall concept of the attribute-centered loss.
Fig. 2.: Visualization of the shared latent space learned by utilizing the attribute-centered loss.
Centers with fewer contradictory attributes are closer to each other in this space.
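Taken together, the three terms of Eq. 4 are simple to state numerically. The following is a minimal NumPy sketch with toy embedding sizes (an illustration only, not the authors' implementation; `margins` stands in for the pairwise center margins):

```python
import numpy as np

def attribute_loss(p, s_g, s_im, centers, y, c):
    """L_attr (Eq. 5): keep each photo/sketch embedding within a margin c
    of its attribute center c_{y_i}."""
    cy = centers[y]  # (m, d) attribute center for each sample
    total = 0.0
    for e in (p, s_g, s_im):
        sq_dist = np.sum((e - cy) ** 2, axis=1)  # squared L2 distance to center
        total += np.maximum(sq_dist - c, 0.0)
    return 0.5 * np.sum(total)

def identity_loss(p, s_g, s_im, d):
    """L_id (Eq. 6): contrastive loss pulling genuine photo-sketch pairs
    together and pushing impostor pairs at least d apart."""
    pos = np.sum((p - s_g) ** 2, axis=1)
    neg = np.maximum(d - np.linalg.norm(p - s_im, axis=1), 0.0) ** 2
    return 0.5 * np.sum(pos + neg)

def center_loss(centers, margins):
    """L_cen (Eq. 7): enforce a pairwise minimum distance between centers."""
    total = 0.0
    for j in range(len(centers)):
        for k in range(len(centers)):
            if j != k:
                sq_dist = np.sum((centers[j] - centers[k]) ** 2)
                total += max(margins[j, k] - sq_dist, 0.0)
    return 0.5 * total
```

Summed as in Eq. 4, these terms let samples settle inside their center's margin while the identity term organizes genuine and impostor pairs within it.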
input sketch and its corresponding attributes. This fusion process is an essential step for
the network to learn the mapping from each sketch-attribute pair to the vicinity of its center. As
shown in Figure 1, in this scheme the sketch and n binary attributes, a_{i=1,...,n}, are passed
to the network as an (n + 1)-channel input. Each attribute-dedicated channel is constructed
by repeating the value assigned to that attribute. This fusion algorithm uses the information
provided by the attributes to compensate for information that cannot be extracted
from the sketch (such as hair color) or that is lost while drawing the sketch.
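The channel stacking described above is easy to make concrete; a hypothetical NumPy helper (array shapes are assumptions):

```python
import numpy as np

def fuse_sketch_attributes(sketch, attributes):
    """Build the (n + 1)-channel input of the sketch-attribute-DCNN:
    channel 0 is the gray-scale sketch, and each remaining channel is a
    constant map holding one binary attribute value."""
    h, w = sketch.shape
    attr_channels = [np.full((h, w), float(a)) for a in attributes]
    return np.stack([sketch] + attr_channels, axis=0)
```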
We deployed a deep coupled CNN to learn the attribute-guided shared representation be-
tween the forensic sketch and the photo modalities by employing the proposed attribute-
centered loss. The overall structure of the coupled network is illustrated in Figure 1.
The structures of both photo and sketch DCNNs are the same and are adopted from the
VGG16.36 However, for the sake of parameter reduction, we replaced the last three convolutional
layers of VGG16 with two convolutional layers of depth 256 and one convolutional
layer of depth 64. We also replaced the last max pooling with a global average
pooling, which results in a feature vector of size 64. We also added batch-normalization
to all the layers of VGG16. The photo-DCNN takes an RGB photo as its input and the
sketch-attribute-DCNN gets a multi-channel input. The first input channel is a gray-scale
sketch and there is a specific channel for each binary attribute filled with 0 or 1 based on
the presence or absence of that attribute in the person of interest.
We make use of hand-drawn sketch and digital image pairs from CUHK Face Sketch
Dataset (CUFS)37 (containing 311 pairs), IIIT-D Sketch dataset38 (containing 238 viewed
pairs, 140 semi-forensic pairs, and 190 forensic pairs), unviewed Memory Gap Database
(MGDB)3 (containing 100 pairs), as well as composite sketch and digital image pairs from
PRIP Viewed Software-Generated Composite database (PRIP-VSGC)39 and extended-
PRIP Database (e-PRIP)14 for our experiments. We also utilized the CelebFaces Attributes
Dataset (CelebA),40 which is a large-scale face attributes dataset with more than 200K
celebrity images with 40 attribute annotations, to pre-train the network. To this end, we
generated a synthetic sketch by applying the XDoG41 filter to every image in the CelebA
dataset. We selected 12 facial attributes, namely black hair, brown hair, blond hair, gray
hair, bald, male, Asian, Indian, White, Black, eyeglasses, sunglasses, out of the available
40 attribute annotations in this dataset. We categorized the selected attributes into four at-
tribute categories of hair (5 states), race (4 states), glasses (2 states), and gender (2 states).
For each category, except the gender category, we also considered an extra state for any case
in which the provided attribute does not exist for that category. Employing this attribute
setup, we ended up with 180 centers (different combinations of the attributes). Since none
of the aforementioned sketch datasets includes facial attributes, we manually labeled all of
the datasets.
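The 180-center figure follows from the category states (each category except gender gains an extra "none" state, giving 6 × 5 × 3 × 2); a quick enumeration confirms it:

```python
from itertools import product

# 5 hair colors + "none", 4 races + "none", 2 glasses types + "none",
# and 2 gender states (no extra state for gender).
hair = ["black hair", "brown hair", "blond hair", "gray hair", "bald", "none"]
race = ["Asian", "Indian", "White", "Black", "none"]
glasses = ["eyeglasses", "sunglasses", "none"]
gender = ["male", "female"]

centers = list(product(hair, race, glasses, gender))
print(len(centers))  # 6 * 5 * 3 * 2 = 180
```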
We pre-trained our deep coupled neural network using synthetic sketch-photo pairs from
the CelebA dataset. We followed the same approach as in32 to update the centers based on
mini-batches. The network pre-training process terminated when the attribute-centered
loss stopped decreasing. The final weights are employed to initialize the network in all the
training scenarios.
Since deep neural networks with a huge number of trainable parameters are prone to
overfitting on a relatively small training dataset, we employed multiple augmentation tech-
niques (see Figure 3):
• Deformation: Since sketches are not geometrically matched with their photos, we
employed the Thin Plate Spline Transformation (TPS)42 to help the network learn
more robust features while preventing overfitting on small training sets.
To this end, we deformed images, i.e. sketches and photos, by randomly translat-
ing 25 preselected points. Each point is translated with random magnitude and
direction. The same approach has been successfully applied for fingerprint distor-
tion rectification.43
• Scale and crop: Sketches and photos are upscaled to a random size without
keeping the original width-height ratio. Then, a 250×200 crop is sampled from
the center of each image. This introduces a ratio deformation, which is a common
mismatch between sketches and their ground truth photos.
• Flipping: Images are randomly flipped horizontally.
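The scale-and-crop step can be sketched without an image library. A hypothetical NumPy version using nearest-neighbor resizing (the maximum scale factor is an assumption; the text does not state the sampling range):

```python
import numpy as np

def random_scale_center_crop(img, rng, out_h=250, out_w=200, max_scale=1.3):
    """Upscale to a random size without preserving the aspect ratio,
    then take a 250x200 center crop."""
    h, w = img.shape[:2]
    new_h = int(out_h * rng.uniform(1.0, max_scale))
    new_w = int(out_w * rng.uniform(1.0, max_scale))
    # nearest-neighbor resize via index maps (avoids an image-library dependency)
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    resized = img[rows][:, cols]
    top = (new_h - out_h) // 2
    left = (new_w - out_w) // 2
    return resized[top:top + out_h, left:left + out_w]
```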
2.4. Evaluation
The proposed algorithm performs identification using a probe image, the provided facial
attributes, and a gallery of mugshots. In this section, we compare our algorithm with multiple
attribute-guided techniques as well as with those that do not utilize any extra information.
Table 1.: Experiment setup. The last three columns show the number of identities in the train set,
the test gallery, and the test probe set.

Setup | Test | Train | Train # | Gallery # | Probe #
P1 | e-PRIP | e-PRIP | 48 | 75 | 75
P2 | e-PRIP | e-PRIP | 48 | 1500 | 75
P3 | IIIT-D Semi-forensic; MGDB Unviewed | CUFS, IIIT-D Viewed, CUFSF, e-PRIP | 1968 | 1500 | 135; 100
The method of Mittal et al.14 achieves rank-10 accuracies of 58.4% and 53.1% on the sets
of sketches generated by the Indian (Faces) and Asian (IdentiKit) users, respectively. They
utilized an algorithm called attribute feedback to incorporate facial attributes into the
identification process. However, SGR-DA46 reported a better performance of 70% on the
IdentiKit dataset without utilizing any facial attributes. In comparison, our proposed
attribute-centered loss resulted in accuracies of 73.2% and 72.6% on Faces and IdentiKit,
respectively. For the sake of evaluation, we also trained the same coupled deep neural
network under the sole supervision of the contrastive loss. This attribute-unaware network
achieves accuracies of 65.3% and 64.2% on Faces and IdentiKit, respectively, which
demonstrates the effectiveness of the attributes' contribution to our proposed algorithm.
Figure 4 visualizes the effect of the attribute-centered loss on the top five ranks of the P1
experiment's test results. The first row shows the results of our attribute-unaware network, while the
Table 2.: Rank-10 identification accuracy (%) on the e-PRIP composite sketch database.
Algorithm Faces (In) IdentiKit (As)
Mittal et al.47 53.3 ± 1.4 45.3 ± 1.5
Mittal et al.48 60.2 ± 2.9 52.0 ± 2.4
Mittal et al.14 58.4 ± 1.1 53.1 ± 1.0
SGR-DA46 - 70
Ours without attributes 68.6 ± 1.6 67.4 ± 1.9
Ours with attributes 73.2 ± 1.1 72.6 ± 0.9
second row shows the top ranks for the same sketch probe using our proposed network
trained by the attribute-centered loss. Considering the attributes removes many of the false
matches from the ranked list and the correct subject moves to a higher rank.
To evaluate the robustness of our algorithm in the presence of a relatively large gallery
of mugshots, the same experiments are repeated but on an extended gallery of 1500 subjects.
Figure 5a shows the performance of our algorithm as well as the state-of-the-art
algorithm on the Indian user (Faces) dataset. When exploiting facial attributes, the proposed
algorithm outperforms14 by almost 11% at rank 50. Since the results for IdentiKit were not
reported in,14 we compared our algorithm with SGR-DA46 (see Figure 5b). Even though
SGR-DA outperformed our attribute-unaware network in the P1 experiment, its results
were not as robust as those of our proposed attribute-aware deep coupled neural network.
Finally, Figure 6 demonstrates the results of the proposed algorithm on the P3 experiment.
The network is trained on 1968 sketch-photo pairs and then tested on two completely un-
seen datasets, i.e. IIIT-D Semi-forensic and MGDB Unviewed. The gallery of this experi-
ment was also extended to 1500 mugshots.
Fig. 4.: The effect of considering facial attributes in sketch-photo matching. The first line shows the
results for a network trained with attribute-centered loss, and the second line depicts the result of a
network trained using contrastive loss.
(a) (b)
Fig. 5.: CMC curves of the proposed and existing algorithms for the extended gallery experiment:
(a) results on the Indian data subset compared to Mittal et al.14 and (b) results on the Identi-Kit data
subset compared to SGR-DA.46
Fig. 6.: CMC curves of the proposed algorithm for experiment P3. The results confirm the robustness
of the network to different sketch styles.
GANs22 are a group of generative models which learn to map a random noise vector z to an output
image y: G(z) : z −→ y. They can be extended to a conditional GAN (cGAN) if the generator
model, G (and usually the discriminator), is conditioned on some extra information,
x, such as an image or class labels. In other words, cGAN learns a mapping from an input
x and a random noise z to the output image y: G(x, z) : {x, z} −→ y. The generator
model is trained to generate an image which is not distinguishable from “real” samples by
a discriminator network, D. The discriminator is trained adversarially to discriminate be-
tween the “fake” generated images by the generator and the real samples from the training
dataset. Both the generator and the discriminator are trained simultaneously following a
two-player min-max game.
The objective function of cGAN is defined as:
lGAN(G, D) = E_{x,y∼p_data}[log D(x, y)] + E_{x,z∼p_z}[log(1 − D(x, G(x, z)))],    (9)
where G attempts to minimize it and D tries to maximize it. Previous works in the literature
have found it beneficial to add an extra L2 or L1 distance term to the objective function
which forces the network to generate images which are near the ground truth. Isola et al.23
found L1 to be a better candidate as it encourages less blurring in the generated output. In
summary, the generator model is trained as follows:
G∗ = arg min_G max_D lGAN(G, D) + λ lL1(G),    (10)
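From the generator's side, Eq. 10 is the adversarial term plus a weighted L1 reconstruction term. A toy NumPy rendering (the default λ = 100 is an assumption borrowed from common pix2pix practice, not stated here):

```python
import numpy as np

def generator_objective(d_fake, fake, target, lam=100.0):
    """Generator-side value of Eq. 10: the adversarial term
    log(1 - D(x, G(x, z))) plus lambda times the L1 distance to the
    ground truth (both averaged over the batch)."""
    adv = np.mean(np.log(1.0 - d_fake + 1e-12))  # epsilon for stability
    l1 = np.mean(np.abs(target - fake))
    return adv + lam * l1
```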
3.2. CycleGAN
The main goal of CycleGAN27 is to train two generative models, Gx and Gy . These two
models learn the mapping functions between two domains x and y. The model, as illus-
trated in Figure 7, includes two generators; the first one maps x to y: Gy (x) : x −→ y and
the other does the inverse mapping y to x: Gx (y) : y −→ x. There are two adversarial dis-
criminators Dx and Dy , one for each generator. More precisely, Dx distinguishes between
“real” x samples and its generated “fake” samples Gx (y), and similarly, Dy discriminates
between “real” y and the “fake” Gy (x). Therefore, there is a distinct adversarial loss in
CycleGAN for each of the two (Gx , Dx ) and (Gy , Dy ) pairs. Notice that the adversarial
losses are defined as in Eq. 9.
A high-capacity network trained using only the adversarial loss may map the same set
of inputs to any random permutation of images in the target domain. In other words,
the adversarial loss alone is not enough to guarantee that the trained network generates
the desired output. This is the reason for having an extra
L1 distance term in the objective function of cGAN, as shown in Eq. 10. As shown in
Figure 7, in the case of CycleGAN there are no paired images between the source and
target domains, which is the main feature of CycleGAN over cGAN. Consequently, the L1
distance loss cannot be applied to this problem. To tackle this issue, a cycle consistency
loss was proposed in27 which forces the learned mapping functions to be cycle-consistent.
Particularly, the following conditions should be satisfied
x −→ Gy (x) −→ Gx (Gy (x)) ≈ x, y −→ Gx (y) −→ Gy (Gx (y)) ≈ y. (12)
To this end, a cycle consistency loss is defined as
lcyc(Gx, Gy) = E_{x∼p_data}[‖x − Gx(Gy(x))‖₁] + E_{y∼p_data}[‖y − Gy(Gx(y))‖₁].    (13)
Taken together, the full objective function is
l(Gx, Gy, Dx, Dy) = lGAN(Gx, Dx) + lGAN(Gy, Dy) + λ lcyc(Gx, Gy),    (14)
where λ is a weighting factor to control the importance of the objectives and the whole
model is trained as follows
G∗x, G∗y = arg min_{Gx,Gy} max_{Dx,Dy} l(Gx, Gy, Dx, Dy).    (15)
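The cycle-consistency loss of Eq. 13 is straightforward to express with the two generators passed in as callables; a minimal NumPy sketch:

```python
import numpy as np

def cycle_consistency_loss(x, y, G_x, G_y):
    """l_cyc of Eq. 13: L1 error of the round trip through both generators,
    x -> G_y(x) -> G_x(G_y(x)) and y -> G_x(y) -> G_y(G_x(y))."""
    forward = np.mean(np.abs(x - G_x(G_y(x))))
    backward = np.mean(np.abs(y - G_y(G_x(y))))
    return forward + backward
```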
From now on, we use x for our source domain which is the sketch domain and y for the
target domain or the photo domain.
3.2.1. Architecture
The two generators, Gx and Gy, adopt the same architecture,27 consisting of six convolutional
layers and nine residual blocks49 (see27 for details). The output of the discriminator
is of size 30×30. Each output pixel corresponds to a patch of the input image and
classifies whether that patch is real or fake. More details are reported in.27
3.4. Architecture
Our proposed cCycleGAN adopts the same architecture as CycleGAN. However, to
condition the generator and the discriminator on the facial attributes, we slightly modified
the architecture. The generator which transforms photos into sketches, Gx (y), and its
corresponding discriminator, Dx , are left unchanged, as there is no attribute to enforce in
the sketch generation phase. However, in the sketch-photo generator, Gy (x), we insert the
desired attributes before the fifth residual block of the bottleneck (Figure 8). To this end,
each attribute is repeated 4096 (64×64) times and then reshaped to a matrix of size 64×64.
Then all of these attribute feature maps and the output feature maps of the fourth residual
block are concatenated in depth and passed to the next block, as shown in Figure 9. The
Fig. 8.: cCycleGAN architecture, including Sketch-Photo cycle (top) and Photo-Sketch cycle (bot-
tom).
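The depth-wise attribute injection described above can be sketched as follows (the bottleneck depth in the example is arbitrary; only the 64×64 spatial size comes from the text):

```python
import numpy as np

def inject_attributes(bottleneck, attributes):
    """Insert attributes before the fifth residual block: each attribute is
    expanded to a constant 64x64 map and concatenated depth-wise with the
    output feature maps of the fourth residual block."""
    attr_maps = np.stack([np.full((64, 64), float(a)) for a in attributes])
    return np.concatenate([bottleneck, attr_maps], axis=0)
```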
We follow the same training procedure as in Section 3.1.1 for the photo-sketch generator.
However, for the sketch-photo generator, we need a different training mechanism to force
the desired facial attributes to be present in the generated photo. Therefore, we define a
new type of negative sample for the attribute discriminator, Da , which is defined as a real
photo from the target domain but with a wrong set of attributes, ā. The training mech-
anism forces the sketch-photo generator to produce faces with the desired attributes. At
each training step, this generator synthesizes a photo with the same attributes, a, as the real
photo. Both the corresponding sketch-photo discriminator, Dy , and attribute discriminator,
Da , are supposed to detect the synthesized photo as a fake sample. The attribute discriminator,
Da , is also trained with two other pairs: a real photo with the correct attributes as a real
sample, and a real photo with a wrong set of attributes as a fake sample. Simultaneously, the
sketch-photo generator attempts to fool both of the discriminators.
Fig. 10.: Sketch-based photo synthesis of hand-drawn test sketches (FERET dataset). Our network
adapts the synthesis results to satisfy different skin colors (white, brown, black).
channel, a, for the skin color. The sketch images are normalized to lie in the [−1, 1] range.
Similarly, the skin color attribute is set to −1, 0, and 1 for the black, brown, and white skin
colors, respectively. Figure 10 shows the results of the cCycleGAN after 200 epochs on
the test data. The three skin color classes are not represented equally in the dataset,
which visibly biased the results towards lighter skins.
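The two-channel generator input for the skin-color experiment can be sketched as follows (the 8-bit normalization constant 127.5 is an assumption):

```python
import numpy as np

SKIN_CODE = {"black": -1.0, "brown": 0.0, "white": 1.0}

def make_generator_input(sketch_u8, skin_color):
    """Two-channel sketch-photo generator input: the sketch scaled to
    [-1, 1] plus a constant channel coding the skin-color attribute."""
    x = sketch_u8.astype(np.float32) / 127.5 - 1.0
    a = np.full_like(x, SKIN_CODE[skin_color])
    return np.stack([x, a], axis=0)
```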
Preliminary results reveal that CycleGAN training can become unstable when there are
significant differences, such as differences in scale and face pose, between the source and target
datasets. The resulting ease of the discriminator's task in differentiating between the synthesized and
real photos could account for this instability. Consequently, we generated a
synthetic sketch dataset as a replacement for the FERET dataset. Among the 40 attributes
provided in the CelebA dataset, we have selected the six most relevant ones in terms of
the visual impacts on the sketch-photo synthesis, including black hair, blond hair, brown
hair, gray hair, male, and pale skin. Therefore, the input to the sketch-photo generator has
seven channels, including a gray-scale sketch image, x, and six attribute channels, a. The
attributes in the CelebA dataset are binary; we chose −1 for a missing attribute and 1 for
an attribute which is supposed to be present in the synthesized photo. Figure 11 shows the
results of the cCycleGAN after 50 epochs on the test data. The trained network can follow
the desired attributes and force them on the synthesized photo.
For the sake of evaluation, we utilized a VGG16-based face verifier pre-trained on the
CMU Multi-PIE dataset. To evaluate the proposed algorithm, we first selected the identities
which had more than one photo in the testing set. Then, for each identity, one photo is
randomly added to the test gallery, and a synthetic sketch corresponding to another photo
Fig. 11.: Attribute-guided sketch-based photo synthesis of synthetic test sketches from the CelebA
dataset. Our network can adapt the synthesis results to satisfy the desired attributes.
Table 3.: Verification performance of the proposed cCycleGAN vs. the CycleGAN.

Method | CycleGAN | cCycleGAN
Accuracy (%) | 61.34 ± 1.05 | 65.53 ± 0.93
of the same identity is added to the test probe set. Finally, every probe synthetic sketch is given to
our attribute-guided sketch-photo synthesizer, and the resulting synthesized photos are used
for face verification against the entire test gallery. This evaluation process was repeated
10 times. Table 3 reports the face verification accuracies of the proposed attribute-guided
approach and of the original CycleGAN on the CelebA dataset. The results of our
proposed network improve significantly on the original CycleGAN, which uses no attribute
information.
4. Discussion
In this chapter, two distinct frameworks are introduced to enable the use of facial attributes
in cross-modal face verification and synthesis. The experimental results show the superiority
of the proposed attribute-guided frameworks compared to state-of-the-art techniques.
To incorporate facial attributes in cross-modal face verification, we introduced an
attribute-centered loss to train a coupled deep neural network that learns a shared embedding
space between the two modalities, in which both geometrical and facial attribute
information cooperate in the similarity score calculation. To this end, a distinct center point is
constructed for every combination of the facial attributes; the sketch-attribute-DCNN
leverages the facial attributes of the suspect provided by the victims, and the photo-DCNN
learns to map its inputs close to their corresponding attribute centers. To incorporate
facial attributes in the unpaired face sketch-photo synthesis problem, an additional auxiliary
attribute discriminator was proposed with an appropriate loss to force the desired facial
attributes on the output of the generator. The pair of a real face photo from the training
data with a set of false attributes defines a new fake input to the attribute discriminator,
in addition to the pair of the generator's output and a set of random attributes.
References
1. X. Wang and X. Tang, Face photo-sketch synthesis and recognition, IEEE Transactions on Pat-
tern Analysis and Machine Intelligence. 31(11), 1955–1967, (2009).
2. Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma. A nonlinear approach for face sketch synthesis and
recognition. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer
Society Conference on, vol. 1, pp. 1005–1010. IEEE, (2005).
3. S. Ouyang, T. M. Hospedales, Y.-Z. Song, and X. Li. Forgetmenot: memory-aware forensic
facial sketch matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 5571–5579, (2016).
4. Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras, Face relighting from a
single image under arbitrary unknown lighting conditions, IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence. 31(11), 1968–1984, (2009).
5. H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa, Memetically optimized mcwld for matching
sketches with digital face images, IEEE Transactions on Information Forensics and Security. 7
(5), 1522–1535, (2012).
6. B. Klare and A. K. Jain, Sketch-to-photo matching: a feature-based approach, Proc. Society of
Photo-Optical Instrumentation Engineers Conf. Series. 7667, (2010).
7. W. Zhang, X. Wang, and X. Tang. Coupled information-theoretic encoding for face photo-sketch
recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on,
pp. 513–520. IEEE, (2011).
8. C. Galea and R. A. Farrugia, Forensic face photo-sketch recognition using a deep learning-based
architecture, IEEE Signal Processing Letters. 24(11), 1586–1590, (2017).
9. S. Nagpal, M. Singh, R. Singh, M. Vatsa, A. Noore, and A. Majumdar, Face sketch matching via
coupled deep transform learning, arXiv preprint arXiv:1710.02914. (2017).
10. Y. Zhong, J. Sullivan, and H. Li. Face attribute prediction using off-the-shelf CNN features. In
Biometrics (ICB), 2016 International Conference on, pp. 1–7. IEEE, (2016).
11. A. Dantcheva, P. Elia, and A. Ross, What else does your biometric data reveal? a survey on soft
biometrics, IEEE Transactions on Information Forensics and Security. 11(3), 441–467, (2016).
12. H. Kazemi, M. Iranmanesh, A. Dabouei, and N. M. Nasrabadi. Facial attributes guided deep
sketch-to-photo synthesis. In Applications of Computer Vision (WACV), 2018 IEEE Workshop
on. IEEE, (2018).
13. B. F. Klare, S. Klum, J. C. Klontz, E. Taborsky, T. Akgul, and A. K. Jain. Suspect identifica-
tion based on descriptive facial attributes. In Biometrics (IJCB), 2014 IEEE International Joint
Conference on, pp. 1–8. IEEE, (2014).
14. P. Mittal, A. Jain, G. Goswami, M. Vatsa, and R. Singh, Composite sketch recognition using
saliency and attribute feedback, Information Fusion. 33, 86–99, (2017).
15. W. Liu, X. Tang, and J. Liu. Bayesian tensor inference for sketch-based facial photo hallucina-
tion. pp. 2141–2146, (2007).
16. X. Gao, N. Wang, D. Tao, and X. Li, Face sketch–photo synthesis and retrieval using sparse
representation, IEEE Transactions on circuits and systems for video technology. 22(8), 1213–
1226, (2012).
17. J. Zhang, N. Wang, X. Gao, D. Tao, and X. Li. Face sketch-photo synthesis based on support
vector regression. In Image Processing (ICIP), 2011 18th IEEE International Conference on,
pp. 1125–1128. IEEE, (2011).
18. N. Wang, D. Tao, X. Gao, X. Li, and J. Li, Transductive face sketch-photo synthesis, IEEE
transactions on neural networks and learning systems. 24(9), 1364–1376, (2013).
19. B. Xiao, X. Gao, D. Tao, and X. Li, A new approach for face recognition by sketches in photos,
Signal Processing. 89(8), 1576–1588, (2009).
20. Y. Güçlütürk, U. Güçlü, R. van Lier, and M. A. van Gerven. Convolutional sketch inversion. In
European Conference on Computer Vision, pp. 810–824. Springer, (2016).
21. L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang. End-to-end photo-sketch generation via fully
convolutional representation learning. In Proceedings of the 5th ACM on International Confer-
ence on Multimedia Retrieval, pp. 627–634. ACM, (2015).
22. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems,
pp. 2672–2680, (2014).
23. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with conditional adver-
sarial networks, arXiv preprint arXiv:1611.07004. (2016).
24. P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays, Scribbler: Controlling deep image synthesis with
sketch and color, arXiv preprint arXiv:1612.00835. (2016).
25. J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the
natural image manifold. In European Conference on Computer Vision, pp. 597–613. Springer,
(2016).
26. D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward
synthesis of textures and stylized images. In ICML, pp. 1349–1357, (2016).
27. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired image-to-image translation using cycle-
consistent adversarial networks, arXiv preprint arXiv:1703.10593. (2017).
28. J. Zhu, S. Liao, D. Yi, Z. Lei, and S. Z. Li. Multi-label CNN based pedestrian attribute learning
for soft biometrics. In Biometrics (ICB), 2015 International Conference on, pp. 535–540. IEEE,
(2015).
29. Q. Guo, C. Zhu, Z. Xia, Z. Wang, and Y. Liu, Attribute-controlled face photo synthesis from
simple line drawing, arXiv preprint arXiv:1702.02805. (2017).
30. X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from
visual attributes. In European Conference on Computer Vision, pp. 776–791. Springer, (2016).
31. G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez, Invertible conditional GANs for
image editing, arXiv preprint arXiv:1611.06355. (2016).
32. Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face
recognition. In European Conference on Computer Vision, pp. 499–515. Springer, (2016).
33. R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant map-
ping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on,
vol. 2, pp. 1735–1742. IEEE, (2006).
34. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recogni-
tion and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 815–823, (2015).
35. C. Qi and F. Su, Contrastive-center loss for deep neural networks, arXiv preprint
arXiv:1707.07391. (2017).
36. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recog-
nition, arXiv preprint arXiv:1409.1556. (2014).
37. X. Tang and X. Wang. Face sketch synthesis and recognition. In Computer Vision, 2003. Pro-
ceedings. Ninth IEEE International Conference on, pp. 687–694. IEEE, (2003).
38. H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa. Memetic approach for matching sketches
with digital face images. Technical report, (2012).
39. H. Han, B. F. Klare, K. Bonnen, and A. K. Jain, Matching composite sketches to face photos:
A component-based approach, IEEE Transactions on Information Forensics and Security. 8(1),
191–204, (2013).
40. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings
of International Conference on Computer Vision (ICCV), (2015).
41. H. Winnemöller, J. E. Kyprianidis, and S. C. Olsen, XDoG: An extended difference-of-Gaussians
compendium including advanced image stylization, Computers & Graphics. 36(6), 740–753,
(2012).
42. F. L. Bookstein, Principal warps: Thin-plate splines and the decomposition of deformations,
IEEE Transactions on pattern analysis and machine intelligence. 11(6), 567–585, (1989).
43. A. Dabouei, H. Kazemi, M. Iranmanesh, and N. M. Nasrabadi. Fingerprint distortion rectifica-
tion using deep convolutional neural networks. In Biometrics (ICB), 2018 International Confer-
ence on. IEEE, (2018).
44. Biometrics and Identification Innovation Center, WVU multi-modal dataset. Available at http://biic.wvu.edu/.
45. A. P. Founds, N. Orlans, W. Genevieve, and C. I. Watson, NIST Special Database 32 - Multiple
Encounter Dataset II (MEDS-II), NIST Interagency/Internal Report (NISTIR)-7807. (2011).
46. C. Peng, X. Gao, N. Wang, and J. Li, Sparse graphical representation based discriminant analysis
for heterogeneous face recognition, arXiv preprint arXiv:1607.00137. (2016).
47. P. Mittal, A. Jain, G. Goswami, R. Singh, and M. Vatsa. Recognizing composite sketches with
digital face images via ssd dictionary. In Biometrics (IJCB), 2014 IEEE International Joint Con-
ference on, pp. 1–6. IEEE, (2014).
48. P. Mittal, M. Vatsa, and R. Singh. Composite sketch recognition via deep network-a transfer
learning approach. In Biometrics (ICB), 2015 International Conference on, pp. 251–256. IEEE,
(2015).
49. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, (2016).
50. P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, The FERET evaluation methodology for face-recognition
algorithms, IEEE Transactions on pattern analysis and machine intelligence. 22(10),
1090–1104, (2000).
51. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings
of the IEEE International Conference on Computer Vision, pp. 3730–3738, (2015).
CHAPTER 2.10
Connected and Autonomous Vehicles (CAVs) are typically equipped with multiple
advanced on-board sensors generating a massive amount of data. Utilizing and
processing such data to improve the performance of CAVs is a current research
area. Machine learning techniques are effective ways of exploiting such data in
many applications with many demonstrated success stories. In this chapter, first,
we provide an overview of recent advances in applying machine learning in the
emerging area of CAVs, including particular applications, and highlight several
open issues in the area. Second, as a case study and a particular application, we
present a novel deep learning approach to control the steering angle for coopera-
tive self-driving cars capable of integrating both local and remote information. In
that application, we tackle the problem of utilizing multiple sets of images shared
between two autonomous vehicles to improve the accuracy of controlling the steer-
ing angle by considering the temporal dependencies between the image frames.
This problem has not been widely studied in the literature. We present and
study a new deep architecture to predict the steering angle automatically. Our
deep architecture is an end-to-end network that utilizes Convolutional Neural
Network (CNN), Long Short-Term Memory (LSTM) and fully connected (FC)
layers; it processes both present and future images (shared by a vehicle ahead via
Vehicle-to-Vehicle (V2V) communication) as input to control the steering angle.
In our simulations, we demonstrate that using a combination of perception and
communication systems can improve robustness and safety of CAVs. Our model
demonstrates the lowest error when compared to the other existing approaches
in the literature.
1. Introduction
It is estimated that by the end of the next decade, most vehicles will be equipped with
powerful sensing capabilities and on-board units (OBUs) enabling multiple commu-
nication types including in-vehicle communications, vehicle-to-vehicle (V2V) com-
munications and vehicle-to-infrastructure (V2I) communications. As vehicles be-
Fig. 1. The overview of our proposed vehicle-assisted end-to-end system. Vehicle 2 (V2) sends
its information to Vehicle 1 (V1) over V2V communication. V1 combines that information with
its own information to control the steering angle. The prediction is made through our
CNN+LSTM+FC network (see Fig. 2 for the details of our network).
come more aware of their environments and as they evolve towards full autonomy,
the concept of connected and autonomous vehicles (CAVs) becomes more crucial.
Recently, CAVs have gained substantial momentum to bring a new level of connectivity
to vehicles. Along with novel on-board computing and sensing technologies,
CAVs serve as a key enabler for Intelligent Transport Systems (ITS) and smart
cities.
CAVs are increasingly equipped with a wide variety of sensors, such as engine
control units, radar, light detection and ranging (LiDAR), and cameras to help a
vehicle perceive the surrounding environment and monitor its own operation status
in real-time. By utilizing high-performance computing and storage facilities, CAVs
can continuously generate, collect, share and store large volumes of data. Such data
can be exploited to improve the robustness and safety of CAVs. Artificial intelligence (AI)
is an effective approach to exploit, analyze and use such data. However, how to
mine such data effectively remains an open challenge and an active research
direction.
Among the current existing issues, robust control of the steering angle is one of
the most difficult and important problems for autonomous vehicles [1–3]. Recent
computer vision-based approaches to control the steering angle in autonomous cars
mostly focus on improving the driving accuracy with the local data collected from
the sensors on the same vehicle and as such, they consider each car as an isolated
unit gathering and processing information locally. However, as the availability and
the utilization of V2V communication increases, real-time data sharing becomes
more feasible among vehicles [4–6]. As such, new algorithms and approaches are
needed that can utilize the potential of cooperative environments to improve the
accuracy for controlling the steering angle automatically [7].
One objective of this chapter is to bring more attention to this emerging field
since the research on applying AI in CAVs is still a growing area. We identify and
discuss major challenges and applications of AI in perception/sensing, in communi-
cations and in user experience for CAVs. In particular, we discuss in greater detail
and present a deep learning-based approach that differs from other approaches. It
utilizes two sets of images: one coming from the on-board sensors and one coming
from a vehicle ahead over V2V communication, to control the steering angle
in self-driving vehicles automatically (see Fig. 1). Our proposed deep architecture
contains a convolutional neural network (CNN) followed by a Long Short-Term
Memory (LSTM) network and a fully connected (FC) network. Unlike older approaches
that manually decompose the autonomous driving problem into different components,
as in [8, 9], the end-to-end model can directly steer the vehicle from the camera
data and has been proven to operate effectively in previous works [1, 10]. We
compare our proposed deep architecture to multiple existing algorithms in the
literature on the Udacity dataset. Our experimental results demonstrate that our proposed
CNN-LSTM-based model yields state-of-the-art results. Our main contributions
are: (1) we provide a survey of AI applications in the emerging area of CAVs and
highlight several open issues for further research; (2) we propose an end-to-end
vehicle-assisted steering angle control system for cooperative vehicles using a large
sequence of images; (3) we introduce a new deep architecture that yields the state-
of-the-art results on the Udacity dataset; (4) we demonstrate that integrating the
data obtained from other vehicles via V2V communication system improves the
accuracy of predicting the steering angle for CAVs.
2. AI Applications in CAVs

As the automotive industry transforms, data remains at the heart of CAVs' evolution [11, 12]. To take advantage of the data, efficient methods are needed to interpret
and mine massive amounts of data and to improve the robustness of self-driving cars.
Much of the attention is given to AI-based techniques, as many recent
deep learning-based techniques have demonstrated promising performance in a wide variety
of applications in vision, speech recognition and natural language areas,
significantly improving the state-of-the-art performance [13, 14].
Similar to many other areas, deep learning-based techniques are also providing
promising and improved results in the area of CAVs [14, 15]. For instance, the
problem of navigating a self-driving car with the acquired sensory data has been
studied in the literature with and without using end-to-end approaches [16]. The
earlier works such as the ones from [17] and [18] use multiple components for rec-
ognizing objects of safe-driving concerns including lanes, vehicles, traffic signs, and
pedestrians. The recognition results are then combined to give a reliable world
representation, which is used with an AI system to make decisions and control the
car. More recent approaches focus on using deep learning-based techniques. For ex-
ample, Ng et al. [19] utilized a CNN for vehicle and lane detection. Pomerleau [20]
used a NN to automatically train a vehicle to drive by observing the input from a
camera. Dehghan et al. [21] presents a vehicle color recognition (MMCR) system,
that relies on a deep CNN.
To automatically control the steering angle, recent works focus on using neural
networks-based end-to-end approaches [22]. The Autonomous Land Vehicle in a
Neural Network (ALVINN) system was one of the earliest such systems, utilizing a multilayer
perceptron [23] in 1989. More recently, CNNs have been commonly used, as in the DAVE-2
Project [1]. In [3], the authors proposed an end-to-end trainable C-LSTM network
that uses a LSTM network at the end of the CNN network. A similar approach was
taken by the authors in [24], where the authors designed a 3D CNN model with
residual connections and LSTM layers. Other researchers also implemented different
variants of convolutional architectures for end-to-end models, as in [25-27]. Another
widely used approach for controlling the vehicle steering angle in autonomous systems is
sensor fusion, where image data are combined with other sensor data, such as LiDAR,
radar and GPS, to improve the accuracy of autonomous operations [28, 29]. For
instance, in [26], the authors designed a fusion network using both image features
and LiDAR features based on VGGNet.
There has been significant progress by using AI on several cooperative and con-
nected vehicle related issues including network congestion, intersection collision
warning, wrong-way driving warning, remote diagnostics of vehicles, etc. For instance,
a centrally controlled approach to manage network congestion control at
intersections has been presented by [30] with the help of a specific unsupervised
learning algorithm, k-means clustering. The approach basically addresses the con-
gestion problem when vehicles stop at a red light in an intersection, where the road
side infrastructures observe the wireless channels to measure and control chan-
nel congestion. CoDrive [31] proposes an AI cooperative system for an open-car
ecosystem, where cars collaborate to improve the positioning of each other. Co-
Drive results in precise reconstruction of a traffic scene, preserving both its shape
and size. The work in [32] uses the received signal strength of the packets received
by the Road Side Units (RSUs) and sent by the vehicles on the roads to predict the
position of vehicles. To predict the position of vehicles they adopted a cooperative
machine-learning methodology, they compare three widely recognized techniques:
K Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest.
In connected vehicle systems (CVS), drivers' behaviors and online, real-time decision making on
the road directly affect the performance of the system. However, the behaviors of
human drivers are highly unpredictable compared to pre-programmed driving assistance
systems, which makes it hard for CAVs to make predictions based on human
behavior; this creates another important issue in the area. Many AI applications
have been proposed to tackle that issue including [5, 6, 33]. For example, in [33],
a two-stage data-driven approach has been proposed: (I) classify driving patterns
of on-road surrounding vehicles using the Gaussian mixture models (GMM); and
(II) predict vehicles’ short-term lateral motions based on real-world vehicle mobility
data. Sekizawa et al. [34] developed a stochastic switched auto-regressive exogenous
model to predict the collision avoidance behavior of drivers using simulated driving
data in a virtual reality system. Chen et al., in 2018, designed a visibility-based
collision warning system that uses neural networks to build four models predicting vehicle rear-end
collisions under low-visibility environments [35]. With historical traffic data, Jiang
and Fei, in 2016, employed neural network models to predict average traffic speeds
of road segments and a forward-backward algorithm on Hidden Markov models to
predict speeds of an individual vehicle [36].
For the prediction of drivers' maneuvers, Yao et al., in 2013, developed a parametric
lane change trajectory prediction approach based on real human lane change
data. This method generates a similar parametric trajectory according to the
k-nearest real lane change instances [37]. In [38], an online learning-based
approach is proposed to predict lane change intention, incorporating SVM and Bayesian
filtering. Liebner et al. developed a prediction approach for lateral motion at urban
intersections with and without the presence of preceding vehicles [39]. The study
focused on the parameter of the longitudinal velocity and the appearance of pre-
ceding vehicles. In [40] is proposed a multilayer perceptron approach to predict the
probability of lane changes by surrounding vehicles and trajectories based on the
history of the vehicles’ position and their current positions. Woo et al., in 2017,
constructed a lane change prediction method for surrounding vehicles. The method
employed SVM to classify driver intention classes based on a feature vector and
used the potential field method to predict trajectory [41].
With the purpose of improving the driver experience in CAVs, works have been
proposed on predictive maintenance, automotive insurance (to speed up the
process of filing claims when accidents occur), car manufacturing improved by AI, and
driver behavior monitoring, identification, recognition and alerting [42]. Other works
focus on eye gaze, eye openness, and distracted driving detection, alerting the driver
to keep their eyes on the road [43]. Some advanced AI facial recognition algorithms
are used to allow access to the vehicle and detect which driver is operating the
vehicle, the system can automatically adjust the seat, mirrors, and temperature to
suit the individual. For example, [44] presents a deep face detection vehicle system
for driver identification that can be used in access control policies. These systems
have been devised to provide customers greater user experience and to ensure safety
on the roads.
Other works focus on traffic flow prediction, traffic congestion alleviation, fuel
consumption reduction, and various location-based services. For instance, in [45], a
probabilistic graphical model, Poisson regression trees (PRT), has been used for two
correlated tasks: the LTE communication connectivity prediction and the vehicular
3. Relevant Issues
The notion of a “self-driving vehicle” has been around for quite a long time now,
yet the absence of a fully automated vehicle available for sale has created some
confusion [47-49]. To put the concept on a measurable scale, the United States'
Department of Transportation’s (USDOT) National Highway Traffic Safety Ad-
ministration (NHTSA) defines 6 levels of automation [47]. They released this
classification to support standardization and to measure the safety ratings for AVs.
The levels span from 0 to 5, where level 0 refers to no automation at all: the
human driver does all the control and maneuvering. In level 1 of automation, an
Advanced Driver Assistance System (ADAS) helps the human driver with either
control (i.e. accelerating, braking) or maneuvering (steering) under certain circumstances,
though not both simultaneously. Adaptive Cruise Control (ACC) falls under this
level of automation, as it can vary the power to maintain a user-set speed, but the
automated control is limited to maintaining the speed, not the lateral movement. At
the next level of automation (Level 2, Partial Automation), the ADAS is capable of
controlling and maneuvering simultaneously, but only under certain circumstances, so
the human driver still has to monitor the vehicle's surroundings and perform the
rest of the controls when required. At level 3 (Conditional Automation), the ADAS
does not require the human driver to monitor the environment all the time. Under
certain circumstances, the ADAS is fully capable of performing all parts of the
driving task. The range of safe-automation scenarios is larger at this level than at
level 2. However, the human driver should still be ready to regain control when the
system asks for it in such circumstances. In all other scenarios, the control is left to the
human driver. Level 4 of automation is called “High Automation” as the ADAS
at level 4 can take control of the vehicle in most scenarios, and a human driver is not
essentially required to take control from the system. But in severe weather, where
the sensor information might be noisier (e.g. in rain or snow), the system may
disable the automation for safety reasons, requiring the human driver to perform
all of the driving tasks [47-49]. Currently, many private sector car companies and
investors are testing and analyzing their vehicles at the level 4 standard, but putting a
safety driver behind the wheel effectively brings the safety testing down to
levels 2 and 3. All the automakers and investors are currently directing their research
and development efforts toward eventually reaching level 5, which refers to full automation
where the system is capable of performing all of the driving tasks without requiring
any human takeover under any circumstance [47, 48].
The applications presented in Section 2 show a promising future for data-driven
deep learning algorithms in CAVs. However, the jump from level 2 to levels 3, 4
and 5 is substantial from the AI perspective and naively applying existing deep
learning methods is currently insufficient for full automation due to the complex
and unpredictable nature of CAVs [48, 49]. For example, in level 3 (Conditional
Driving Automation) the vehicles need to have intelligent environmental detection
capabilities, able to make informed decisions for themselves, such as accelerating
past a slow-moving vehicle; in level 4 (High Driving Automation) the vehicles have
full automation in specific controlled areas and can even intervene if things go
wrong or there is a system failure and finally in level 5 (Full Driving Automation)
the vehicles do not require human attention at all [42, 47]. Therefore, how to adapt
existing solutions to better handle such requirements remains a challenging task.
In this section, we identify some research topics for further investigation and in
particular we propose one new approach to the control of the steering angle for
CAVs.
There are various open research problems in the area [11, 42, 48, 49]. For
instance, further work can be done on the detection of a driver's physical movements
and posture, such as eye gaze, eye openness, and head position, to detect and alert
a distracted driver with lower latency [11, 12, 50]. Upper body detection can
determine the driver's posture, and in case of a crash, airbags can be deployed in a
manner that will reduce injury based on how the driver is sitting [51]. Similarly,
detecting the driver’s emotion can also help with the decision making [52].
Connected vehicles can use an Autonomous Driving Cloud (ADC) platform
that allows data to have need-based availability [11, 14, 15]. The ADC can use
AI algorithms to make meaningful decisions. It can act as the control policy or the
brain of the autonomous vehicle. This intelligent agent can also be connected to a
database which acts as a memory where past driving experiences are stored [53].
This data along with the real-time input coming in through the autonomous vehi-
cle about the immediate surroundings will help the intelligent agent make accurate
driving decisions. In the vehicular network side, AI can exploit multiple sources of
data generated and stored in the network (e.g. vehicle information, driver behavior
patterns, etc.) to learn the dynamics in the environment [4–6] and then extract ap-
propriate features to use for the benefit of many tasks for communications purposes,
such as signal detection, resource management, and routing [11, 15]. However, it is a
non-trivial task to extract semantic information from the huge amount of accessible
data, which might be contaminated by noise or redundancy, and thus information
extraction needs to be performed [11, 53]. In addition, in vehicular networks,
data are naturally generated and stored across different units in the network [15, 54]
(e.g., RVs, RSUs, etc.). This brings challenges to the applicability of most existing
machine learning algorithms that have been developed under the assumption that
data are centrally controlled and easily accessible [11, 15]. As a result, distributed
learning methods are desired in CAVs that act on partially observed data and have
the ability to exploit information obtained from other entities in the network [7, 55].
Furthermore, additional overheads incurred by the coordination and sharing of in-
formation among various units in vehicular networks for distributed learning shall
be properly accounted for to make the system work effectively [11].
In particular, an open area for further research is how to integrate the informa-
tion from local sensors (perception) and remote information (cooperative) [7]. We
present a novel application in this chapter that steps into this domain. For instance,
for controlling the steering angle, all of the above-listed works focus on utilizing data
obtained from the on-board sensors and do not consider assisted data that
comes from another car. In the following section, we demonstrate that using additional
data that comes from the vehicle ahead helps us obtain better accuracy in
controlling the steering angle. In our approach, we utilize the information that is
available to a vehicle ahead of our car to control the steering angle.
4. Our Approach

We consider the control of the steering angle as a regression problem where the
input is a stack of images and the output is the steering angle. Our approach can
also process each image individually. Considering multiple frames in a sequence
can benefit us in situations where the present image alone is affected by noise or
contains little useful information, such as when the current image is largely washed out by
direct sunlight. In such situations, the correlation between the current frame and
the past frames can be used to decide the next steering value. We use an LSTM to
utilize multiple images as a sequence. The LSTM has a recursive structure acting as a
memory, through which the network can keep some past information to predict the
output based on the dependency of the consecutive frames [56, 57].
Our proposed idea in this chapter relies on the fact that the condition of the
road ahead has already been seen by another vehicle recently and we can utilize
that information to control the steering angle of our car as discussed above. Fig.
1 illustrates our approach. In the figure, Vehicle 1 receives a set of images from
Vehicle 2 over V2V communication and keeps the data at the on-board buffer. It
combines the received data with the data obtained from the on-board camera and
processes those two sets of images on-board to control the steering angle via an
end-to-end deep architecture. This method enables the vehicle to look ahead of its
current position at any given time.
Our deep architecture is presented in Fig. 2. The network takes the set of
images coming from both vehicles as input and it predicts the steering angle as
the regression output. The details of our deep architecture are given in Table 1.
Since we construct this problem as a regression problem with a single unit at the
end, we use the Mean Squared Error (MSE) loss function in our network during the
training.
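To make the architecture concrete, the following PyTorch sketch mirrors the CNN+LSTM+FC network described above. Since Table 1 is not reproduced here, the convolution filter sizes, strides and the 66x200 input resolution are our own assumptions; only the overall structure (five convolutional layers, three 64-unit LSTM layers, FC layers of 100, 50, 10 and 1 units, and the MSE loss) follows the chapter.

```python
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    """CNN + LSTM + FC sketch: frames from both vehicles form one temporal
    sequence; the CNN encodes each frame, the LSTM models the temporal
    dependencies, and the FC layers regress the steering angle."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # 5 convolutional layers (filter sizes/strides are assumptions).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # 3 stacked LSTM layers with 64 units each (as in Fig. 2).
        self.lstm = nn.LSTM(feat_dim, 64, num_layers=3, batch_first=True)
        # 4 FC layers: 100 -> 50 -> 10 -> 1 (single regression unit).
        self.fc = nn.Sequential(
            nn.Linear(64, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),
        )

    def forward(self, own_seq, remote_seq):
        # own_seq, remote_seq: (batch, x, 3, H, W); concatenate along time.
        seq = torch.cat([remote_seq, own_seq], dim=1)
        b, t = seq.shape[:2]
        feats = self.cnn(seq.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])  # angle predicted from the last time step

model = SteeringNet()
loss_fn = nn.MSELoss()  # regression loss used during training
own = torch.randn(2, 8, 3, 66, 200)     # x = 8 frames from the own camera
remote = torch.randn(2, 8, 3, 66, 200)  # x = 8 frames shared via V2V
pred = model(own, remote)               # shape: (2, 1)
```

Here the remote frames are placed before the local frames along the time axis, so the LSTM processes the shared look-ahead context first; this ordering is one of several reasonable fusion choices.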
Fig. 2. CNN + LSTM + FC Image sharing model. Our model uses 5 convolutional layers,
followed by 3 LSTM layers, followed by 4 FC layers. See Table 1 for further details of our proposed
architecture.
5. Experiment Setup
In this section, we elaborate on the dataset as well as the data pre-processing
and evaluation metrics. We conclude the section with details of our implementation.
5.1. Dataset
In order to compare our results to existing work in the literature, we used the self-
driving car dataset by Udacity. The dataset has a wide variation of 100K images
from simultaneous center, left and right cameras on a vehicle, collected in sunny
and overcast weather; 33K of the images belong to the center camera. The dataset contains
the data of 5 different trips with a total drive time of 1694 seconds. The test vehicle
Fig. 3. The angle distribution within the entire Udacity dataset (angle in radians vs. total number
of frames), just angles between -1 and 1 radians are shown.
has 3 cameras mounted as in [1], collecting images at a rate of nearly 20 Hz. Steering
wheel angle, acceleration, brake, and GPS data were also recorded. The distribution of
the steering wheel angles over the entire dataset is shown in Fig. 3. As shown
in Fig. 3, the dataset distribution includes a wide range of steering angles. The
image size is 480x640x3 pixels, and the total dataset is 3.63 GB. Since no
dataset with V2V communication images is currently available, we simulate the
environment by creating a virtual vehicle that is moving ahead of the autonomous
vehicle and sharing camera images, using the Udacity dataset.
The Udacity dataset has been widely used in the recent relevant literature [24, 58],
and we also use it in this chapter to compare our results to the existing
techniques in the literature. Along with the steering angle, the dataset contains
spatial (latitude, longitude, altitude) and dynamic (angle, torque, speed) informa-
tion labelled with each image. The data format for each image is: index, timestamp,
width, height, frame id, filename, angle, torque, speed, latitude, longitude, altitude.
For our purpose, we are only using the sequence of center-camera images.
The images in the dataset are recorded at a rate of around 20 frames per second.
Therefore, there is usually a large overlap between consecutive frames. To avoid
overfitting, we used image augmentation to get more variance in our image dataset.
Our image augmentation technique randomly adjusts brightness and contrast to change
pixel values. We also tested image cropping to exclude possibly redundant information
that is not relevant to our application. However, in our tests the models
performed better without cropping.
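The brightness and contrast augmentation described above can be sketched as follows; the gain and offset ranges below are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def augment(image, rng):
    """Randomly perturb contrast (multiplicative gain) and brightness
    (additive offset) of an HxWx3 uint8 image."""
    gain = rng.uniform(0.8, 1.2)    # contrast factor (assumed range)
    offset = rng.uniform(-30, 30)   # brightness shift (assumed range)
    out = image.astype(np.float32) * gain + offset
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(66, 200, 3), dtype=np.uint8)
augmented = augment(frame, rng)  # same shape and dtype, perturbed values
```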
For the sequential model implementation, we preprocessed the data in a different
way. Since we want to keep the visual sequential relevance in the series of frames
while avoiding overfitting, we shuffle the dataset while keeping track of the sequential
information. We then train our model with 80% of the images of each sequence from
the subsets and validate on the remaining 20%.
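The sequence-aware shuffle and split can be sketched as below; the chapter does not give the exact bookkeeping, so the fixed-window scheme here is our own assumption.

```python
import numpy as np

def sequence_split(n_frames, window, train_frac=0.8, seed=0):
    """Shuffle fixed-length windows of consecutive frames (each window's
    internal order is preserved) and split them into train/validation."""
    starts = np.arange(0, n_frames - window + 1, window)
    rng = np.random.default_rng(seed)
    rng.shuffle(starts)  # shuffle whole windows, never individual frames
    n_train = int(train_frac * len(starts))

    def to_windows(s):
        return [list(range(int(i), int(i) + window)) for i in s]

    return to_windows(starts[:n_train]), to_windows(starts[n_train:])

train, val = sequence_split(n_frames=100, window=10)
# each window is a run of consecutive frame indices, e.g. [40, 41, ..., 49]
```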
Fig. 4. An overview of the used baseline models in this chapter. The details of each model can
be found in their respective source paper.
Table 2 lists the comparison of the RMSE values for multiple end-to-end models
after training them on the Udacity dataset. In addition to the five baseline models
listed in Section IV-E, we also include two models of ours: Model F and Model G.
Model F is our proposed approach with x = 8 for each vehicle. Model G
uses x = 10 time-steps for each vehicle instead of 8. Since the RMSE
values on Udacity dataset were not reported for Model D and Model E in [58], we
re-implemented those models to compute the RMSE values on Udacity Dataset and
reported the results from our implementation in Table 2.
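The RMSE and MAE metrics reported in Tables 2 and 3 can be computed as follows; the angle arrays below are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between ground-truth and predicted angles."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and predicted angles."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))

truth = [0.10, -0.20, 0.05]   # illustrative ground-truth angles (radians)
pred = [0.12, -0.25, 0.00]    # illustrative predictions
print(round(rmse(truth, pred), 4))  # 0.0424
print(round(mae(truth, pred), 4))   # 0.04
```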
Table 3 lists the MAE values computed for our implementations of the models
A, D, E, F, and G. Models A, B, C, D, and E do not report their individual MAE
Fig. 5. Training and Validation steps for our best model with x = 8.
Fig. 6. RMSE vs. x value. We trained our algorithm at various x values and computed the
respective RMSE value. As shown in the figure, the minimum value is obtained at x = 8.
Table 2. RMSE values of the end-to-end models on the Udacity dataset.

Model:       A      B      C      D      E      F      G
Source:      [1]    [24]   [24]   [58]   [58]   Ours   Ours
Training     0.099  0.113  0.077  0.061  0.177  0.034  0.044
Validation   0.098  0.112  0.077  0.083  0.149  0.042  0.044
We varied the x value, setting it to 1, 2, 4, 6, 8, 10, 12, 14, 20, and computed the RMSE value for both the training
and validation data at each x value. The results are plotted in Fig. 6.
As shown in the figure, we obtained the lowest RMSE value for both training and
validation data at x = 8, where RMSE = 0.042 for the validation
data. The figure also shows that choosing the appropriate x value is important to
obtain the best performance from the model. As Fig. 6 shows, the number of
images used in the input affects the performance. Next, we study how changing the
Δt value affects the performance of our end-to-end system in terms of RMSE value
during testing, once the algorithm is trained at a fixed Δt.
Changing Δt corresponds to varying the distance between the two vehicles. For
that purpose, we first set Δt = 30 frames (i.e., a 1.5-second gap between the vehicles)
and trained the algorithm accordingly (with x = 10). Once our model was trained
and had learned the relation between the given input image stacks and the corresponding
output value at Δt = 30, we studied the robustness of the trained system as the
distance between the two vehicles changes during testing. Fig. 7 demonstrates
how the RMSE value changes as we change the distance between the
vehicles during testing. For that, we ran the trained model over the entire
validation data, where the input obtained from the validation data was formed at Δt
values varying between 0 and 95 in increments of 5 frames, and we computed the
RMSE value at each of those Δt values.
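Simulating the vehicle ahead from a single recorded stream, as described above, amounts to pairing each local stack of x frames with a stack shifted Δt frames into the future; a minimal sketch follows (the index bookkeeping is our own assumption).

```python
def make_pairs(frames, x, dt):
    """For each valid time t, pair the x local frames ending at t with the x
    'remote' frames ending at t + dt, simulating a vehicle dt frames ahead."""
    pairs = []
    for t in range(x - 1, len(frames) - dt):
        own = frames[t - x + 1 : t + 1]               # frames t-x+1 .. t
        remote = frames[t - x + 1 + dt : t + 1 + dt]  # same window shifted by dt
        pairs.append((own, remote))
    return pairs

frames = list(range(100))            # stand-in for 100 recorded image frames
pairs = make_pairs(frames, x=10, dt=30)
own, remote = pairs[0]               # own: frames 0..9, remote: frames 30..39
```

Evaluating robustness to vehicle spacing then reduces to regenerating the pairs with a different `dt` at test time while keeping the trained model fixed.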
Fig. 7. RMSE value vs. the number of frames ahead (Δt) over the validation data. The
model is trained at Δt = 30 and x = 10. Between the Δt values of 13 and 37 (the red area),
the change in RMSE value remains small, and the algorithm yields almost the same minimum value at
Δt = 20, which differs from the training value.
Table 3. MAE values of our implementations of the models on the Udacity dataset.

Model:       A      D      E      F      G
Source:      [1]    [58]   [58]   Ours   Ours
Training     0.067  0.038  0.046  0.022  0.031
Validation   0.062  0.041  0.039  0.033  0.036
Fig. 8. Steering angle (in radians) vs. the index of each image frame in the data sequence is
shown for the Udacity Dataset. This data forms the ground-truth for our experiments. The upper
and lower red lines highlight the maximum and minimum angle values respectively in the figure.
7. Concluding Remarks
Fig. 9. Individual error values (shown in radians) made at each time frame are plotted for four
models, namely: Model A, Model D, Model E and Model F. The dataset is the Udacity Dataset.
Ground-truth is shown in Fig. 8. The upper and lower red lines highlight the maximum and
minimum errors made by each algorithm. The error for each frame (the y axis) for Models A, D
and F is plotted in the range of [-1.5, +1.5], and the error for Model E is plotted in the range of
[-4.3, +4.3].
In this chapter, we discuss the use of machine learning in CAVs and highlight several open
issues for further research. We present a new approach that shares images between
cooperative self-driving vehicles to improve the accuracy of steering-angle control. Our
end-to-end deep model combines CNN, LSTM, and FC layers, and takes as input the on-board
data together with the images received from another vehicle. Using shared images, the
proposed model yields the lowest RMSE when compared to the other existing models in the
literature.
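As a minimal sketch of the data-sharing input format only (the chapter's actual model is a full CNN+LSTM+FC network; the shapes, names, and simple time-axis concatenation here are our own illustrative assumptions), the ego vehicle's frame sequence and the frames received from a cooperating vehicle can be stacked into a single time-ordered tensor before being fed to the network:

```python
import numpy as np

def build_input(onboard, shared):
    """Concatenate the ego vehicle's frame sequence with frames shared by a
    cooperating vehicle along the time axis, yielding one (T, H, W, C) tensor
    that a CNN+LSTM model could consume frame by frame."""
    onboard = np.asarray(onboard)
    shared = np.asarray(shared)
    assert onboard.shape[1:] == shared.shape[1:], "frame sizes must match"
    return np.concatenate([onboard, shared], axis=0)

ego = np.zeros((10, 66, 200, 3))    # x = 10 on-board frames (hypothetical size)
recv = np.zeros((10, 66, 200, 3))   # 10 frames received over V2V
x = build_input(ego, recv)          # shape: (20, 66, 200, 3)
```

Other fusion schemes (channel-wise stacking, late feature fusion) are equally plausible; the point is only that the shared frames enter the model as additional input, not as a separate system.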
Unlike previous work, which focuses only on local information obtained from a single
vehicle, we propose a system in which the vehicles communicate with each other and share
data. Our experiments demonstrate that the proposed end-to-end model with data sharing in
cooperative environments yields better performance than previous approaches that rely
solely on the data obtained and used on the same vehicle. Our end-to-end model learns to
predict accurate steering angles without manual decomposition into road or lane-marking
detection.
One potentially strong argument against image sharing is that one could instead use the
geo-spatial information along with the steering angle recorded by the vehicle ahead and
simply apply the same angle value at that position. We argue, however, that using GPS
makes the prediction dependent on location data which, like many other sensor types, can
report faulty values for various reasons, and such errors can force the algorithm to use
the wrong image sequence as input.
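The GPS failure mode described above can be illustrated with a small nearest-neighbour lookup; the track geometry, the noise level, and the function name are our own assumptions for this sketch, not anything from the chapter.

```python
import numpy as np

def frame_index_from_gps(positions, query, noise_std=0.0, rng=None):
    """Return the index of the frame whose logged GPS position is closest
    to a (possibly noisy) query position."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = np.asarray(query, float) + rng.normal(0.0, noise_std, size=2)
    dists = np.linalg.norm(positions - noisy, axis=1)
    return int(np.argmin(dists))

# 100 frames logged 1 m apart along a straight road.
track = np.stack([np.arange(100.0), np.zeros(100)], axis=1)

exact = frame_index_from_gps(track, [50.0, 0.0])                 # noise-free
noisy = frame_index_from_gps(track, [50.0, 0.0], noise_std=5.0)  # 5 m GPS error
```

With a few metres of position error, `noisy` can differ from the true index 50, handing the downstream model the wrong image sequence; image sharing avoids tying the prediction to this lookup.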
More work and analysis are needed to improve the robustness of our proposed model. While
this chapter relies on simulated data (the data sharing between the vehicles is simulated
from the Udacity Dataset), we are in the process of collecting real data from multiple
cars communicating over V2V and will perform a more detailed analysis on that new data.
Acknowledgement
This work was done as part of the CAP5415 Computer Vision class in Fall 2018
at UCF. We gratefully acknowledge the support of NVIDIA Corporation with the
donation of the GPU used for this research.
References
[47] NHTSA. Automated vehicles for safety, (2019). URL https://www.nhtsa.gov/
technology-innovation/automated-vehicles-safety.
[48] J. M. Anderson, K. Nidhi, K. D. Stanley, P. Sorensen, C. Samaras, and O. A. Oluwa-
tola, Autonomous vehicle technology: A guide for policymakers. (Rand Corporation,
2014).
[49] W. J. Kohler and A. Colbert-Taylor, Current law and potential legal issues pertaining
to automated, autonomous and connected vehicles, Santa Clara Computer & High
Tech. LJ. 31, 99, (2014).
[50] Y. Liang, M. L. Reyes, and J. D. Lee, Real-time detection of driver cognitive distrac-
tion using support vector machines, IEEE Transactions on Intelligent Transportation
Systems. 8(2), 340–350, (2007).
[51] Y. Abouelnaga, H. M. Eraqi, and M. N. Moustafa, Real-time distracted driver posture
classification, arXiv preprint arXiv:1706.09498. (2017).
[52] M. Grimm, K. Kroschel, H. Harris, C. Nass, B. Schuller, G. Rigoll, and T. Moosmayr.
On the necessity and feasibility of detecting a driver's emotional state while driving.
In International Conference on Affective Computing and Intelligent Interaction, pp.
126–138. Springer, (2007).
[53] M. Gerla, E.-K. Lee, G. Pau, and U. Lee. Internet of vehicles: From intelligent grid
to autonomous cars and vehicular clouds. In 2014 IEEE world forum on internet of
things (WF-IoT), pp. 241–246. IEEE, (2014).
[54] B. Toghi, M. Saifuddin, H. N. Mahjoub, M. O. Mughal, Y. P. Fallah, J. Rao, and
S. Das. Multiple access in cellular v2x: Performance analysis in highly congested
vehicular networks. In 2018 IEEE Vehicular Networking Conference (VNC), pp. 1–8
(Dec, 2018). doi: 10.1109/VNC.2018.8628416.
[55] K. Passino, M. Polycarpou, D. Jacques, M. Pachter, Y. Liu, Y. Yang, M. Flint, and
M. Baum. Cooperative control for autonomous air vehicles. In Cooperative control
and optimization, pp. 233–271. Springer, (2002).
[56] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to forget: Continual pre-
diction with LSTM, Neural Computation. (2000). ISSN 08997667. doi: 10.1162/
089976600300015015.
[57] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber,
LSTM: A Search Space Odyssey, IEEE Transactions on Neural Networks and Learn-
ing Systems. (2017). ISSN 21622388. doi: 10.1109/TNNLS.2016.2582924.
[58] D. Choudhary and G. Bansal. Convolutional Architectures for Self-Driving Cars.
Technical report, (2017).
[59] B. Toghi, M. Saifuddin, Y. P. Fallah, and M. O. Mughal, Analysis of Distributed
Congestion Control in Cellular Vehicle-to-everything Networks, arXiv e-prints. art.
arXiv:1904.00071 (Mar, 2019).
[60] B. Toghi, M. Mughal, M. Saifuddin, and Y. P. Fallah, Spatio-temporal dynam-
ics of cellular v2x communication in dense vehicular networks, arXiv preprint
arXiv:1906.08634. (2019).
[61] G. Shah, R. Valiente, N. Gupta, S. Gani, B. Toghi, Y. P. Fallah, and S. D. Gupta,
Real-time hardware-in-the-loop emulation framework for dsrc-based connected vehicle
applications, arXiv preprint arXiv:1905.09267. (2019).
[62] M. Islam, M. Chowdhury, H. Li, and H. Hu, Vision-based Navigation of Au-
tonomous Vehicle in Roadway Environments with Unexpected Hazards, arXiv
preprint arXiv:1810.03967. (2018).
[63] S. Wu, S. Zhong, and Y. Liu, ResNet, CVPR. (2015). ISSN 15737721. doi: 10.1002/
9780470551592.ch2.