
HANDBOOK OF

PATTERN RECOGNITION
AND COMPUTER VISION
6th Edition


editor
C H Chen
University of Massachusetts Dartmouth, USA

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO



Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

HANDBOOK OF PATTERN RECOGNITION AND COMPUTER VISION


Sixth Edition
Copyright © 2020 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or
mechanical, including photocopying, recording or any information storage and retrieval system now known or to
be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center,
Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from
the publisher.

ISBN 978-981-121-106-5 (hardcover)


ISBN 978-981-121-107-2 (ebook for institutions)
ISBN 978-981-121-108-9 (ebook for individuals)

For any available supplementary material, please visit


https://www.worldscientific.com/worldscibooks/10.1142/11573#t=suppl

Printed in Singapore



The book is dedicated to the memory of the following pioneers
of pattern recognition and computer vision

Prof. K.S. Fu,


Dr. Pierre A. Devijver,
Prof. Azriel Rosenfeld,
Prof. Thomas M. Cover,
Dr. C.K. Chow,
Prof. Roger Mohr, and
Prof. Jack Sklansky

PREFACE TO THE 6TH EDITION

Motivated by the re-emergence of artificial intelligence, big data, and machine
learning in the last six years, which have impacted many areas of pattern recognition
and computer vision, this new edition of the Handbook is intended to cover both new
developments involving deep learning and more traditional approaches. The book
is divided into two parts: part 1 on theory and part 2 on applications.
Statistical pattern recognition is of fundamental importance to the
development of pattern recognition. The book starts with Chapter 1.1, Optimal
Statistical Classification, by Profs. Dougherty and Dalton, which reviews the
optimal Bayes classifier in a broader context: an optimal classifier designed
from sample data when the feature-label distribution is unknown, an optimal
classifier that possesses minimal expected error relative to the posterior, etc.
Though optimality includes a degree of subjectivity, it always incorporates the
aim and knowledge of the designer. The chapter also deals with the topic of
optimal Bayesian transfer learning, where the training data are augmented with
data from a different source. From my observation of the last half century, I
must say that it is amazing that the Bayesian theory of inference has such long-
lasting value. Chapter 1.2 by Drs. Shi and Gong, on Deep Discriminative Feature
Learning Method for Object Recognition, presents the entropy-orthogonality
loss and Min-Max loss to improve the within-class compactness and between-
class separability of the convolutional neural network classifier for better object
recognition. Chapter 1.3 by Prof. Bouwmans et al. on Deep Learning Based
Background Subtraction: A Systematic Survey, provides a full review of recent
advances on the use of deep neural networks applied to background subtraction
for detection of moving objects in video taken by a static camera. Readers
may be interested in reading a related chapter on Statistical Background Modeling
for Foreground Detection: A Survey, also by Prof. Bouwmans, et al. in the 4th
edition of the handbook series. Chapter 1.4 by Prof. Ozer, on Similarity Domains
Network for Modeling Shapes and Extracting Skeleton Without Large Datasets,
introduces a novel shape modeling algorithm, Similarity Domain Network
(SDN), based on Radial Basis Networks, which are a particular type of neural
network that utilizes radial basis functions as the activation function in the hidden
layer. The algorithm effectively computes similarity domains for shape modeling
and skeleton extraction using only one image sample as data.
As a tribute to Prof. C.C. Li, who recently retired from the University of
Pittsburgh after over 50 years of dedicated research and teaching in pattern
recognition and computer vision, his chapter in the 5th edition of the handbook


series is revised as Chapter 1.5 entitled, On Curvelet-Based Texture Features for


Pattern Classification. The chapter provides a concise introduction to the
curvelet transform which is still a relatively new method for sparse representation
of images with rich edge structure. The curvelet-based texture features are very
useful for the analysis of medical MRI organ tissue images, classification of
critical Gleason grading of prostate cancer histological images and other medical
as well as non-medical images. Chapter 1.6 by Dr. Wang is entitled, An Overview
of Efficient Deep Learning on Embedded Systems. It is now evident that the
superior accuracy of deep learning neural networks comes at the cost of high
computational complexity. Implementing deep learning on embedded systems
with limited hardware resources is a critical and difficult problem. The chapter
reviews some of the methods that can be used to improve energy efficiency
without sacrificing accuracy within cost-effective hardware. The quantization,
pruning, and network structure optimization issues are also considered. As
pattern recognition needs to deal with complex data, such as data from different
sources (as in autonomous vehicles, for example) or from different feature
extractors, learning from these types of data is called multi-view learning, and
each modality/set of features is called a view. Chapter 1.7, Random Forest for
Dissimilarity-Based Multi-View Learning, by Dr. Bernard, et al. employs random
forest (RF) classifiers for measuring dissimilarities. RFs embed a (dis)similarity
measure that takes the class membership into account in such a way that
instances from the same class are similar. A Dynamic View Selection method is
proposed to better combine the view-specific dissimilarity representations.
Chapter 1.8, A Review of Image Colourisation, by Dr. Rosin, et al. brings us
to a different theoretical yet practical problem of adding colour to a given
grayscale image. Three classes of colourisation, including colourisation by deep
learning, are reviewed in the chapter. Chapter 1.9 on speech recognition is
presented by Drs. Li and Yu, Recent Progress of Deep Learning for Speech
Recognition. The authors note that recent advances in automatic speech
recognition (ASR) have been mostly due to the advent of using deep learning
algorithms to build hybrid ASR systems with deep acoustic models like feed-
forward deep neural networks, convolutional neural networks, and recurrent neural
networks. A summary of progress is presented in the two areas where
significant efforts have been made for ASR, namely, E2E (end-to-end) modeling
and robust modeling.
Part 2 begins with Chapter 2.1, Machine Learning in Remote Sensing by Dr.
Ronny Hänsch, providing an overview of remote sensing problems and
sensors. It then focuses on two machine learning approaches, one based on
random forest theory and the other on convolutional neural networks with
examples based on synthetic aperture radar image data. While much progress has

been made on information processing for hyperspectral images in remote


sensing, the spectral unmixing problem presents a challenge. Chapter 2.2 by Kizel
and Benediktsson is on Hyperspectral and Spatially Adaptive Unmixing for
Analytical Reconstruction of Fraction Surfaces from Data with Corrupted Pixels.
Analysis of the spectral mixture is important for a reliable interpretation of
spectral image data. The information provided by spectral images allows for
distinguishing between different land cover types. However, due to the typical
low spatial resolution in remotely sensed data, many pixels in the image
represent a mixture of several materials within the area of the pixel. Therefore,
subpixel information is needed in different applications; it is extracted by
estimating the fractional abundances that correspond to pure signatures, known as
endmembers. The unmixing problem has typically been solved by using spectral
information only. In this chapter, a new methodology is presented based on a
modification of the spectral unmixing method called Gaussian-based spatially
adaptive unmixing (GBSAU). The problem of spatially adaptive unmixing is
similar to fitting a function to gridded data. An advantage of the
GBSAU framework is that it provides a novel solution for unmixing images with both
low SNR and discontinuities due to the presence of corrupted pixels. Remote
sensing readers may also be interested in the excellent chapter in the second
edition of the handbook series, Statistical and Neural Network Pattern
Recognition Methods for Remote Sensing Application also by Prof. Benediktsson.
Chapter 2.3, Image Processing for Sea Ice Parameter Identification from Visual
Images, by Dr. Zhang introduces novel sea ice image processing algorithms to
automatically extract useful ice information such as ice concentration, ice types,
and ice floe size distribution, which are important in various fields of ice
engineering. It is noted that the gradient vector flow snake algorithm is particularly
useful in ice boundary-based segmentation. More details on the chapter are
available in the author’s recent book, Sea Ice Image Processing with Matlab
(CRC Press 2018).
The next chapter (2.4), by Drs. Evan Fletcher and Alexander Knaack, is
Applications of Deep Learning to Brain Segmentation and Labeling of MRI
Brain Structures. The authors successfully demonstrate deep learning
convolutional neural network (CNN) applications in two areas of brain structural
image processing. One application focused on improving production and
robustness in brain segmentation. The other aimed at improving edge
recognition, leading to greater biological accuracy and statistical power for
computing longitudinal atrophy rates. The authors have also carefully presented a
detailed experimental set-up for the complex brain medical image processing
using deep learning and a large archive of MRIs for training and testing. While
there has been much increased interest in brain research and brain image
processing, readers may be interested in other recent work by Dr. Fletcher
reported in a chapter on Using Prior Information to Enhance Sensitivity of
Longitudinal Brain Change Computation, in Frontiers of Medical Imaging
(World Scientific Publishing, 2015). Chapter 2.5, Automatic Segmentation of
IVUS Images Based on Temporal Texture Analysis, is devoted to a more traditional
approach to intravascular ultrasound image analysis, using both textural and
spatial (or multi-image) information for the analysis and delineation of lumen
and external elastic membrane boundaries. The use of multiple images in a
sequence, processed by the discrete wavelet transform, clearly provides better
segmentation results than many of those reported in the literature. We take this
traditional approach as the available data set for the study is limited.
Chapter 2.6 by F. Liwicki and Prof. M. Liwicki on Deep Learning for
Historical Document Analysis provides an overview of the state of the art and recent
methods in the area of historical document analysis, especially those using deep
learning and Long Short-Term Memory networks. Historical documents differ
from ordinary documents due to the presence of different artifacts. Their idea
of detecting graphical elements in historical documents and their ongoing
efforts towards the creation of large databases are also presented. Graphs
allow us to simultaneously model the local features and the global structure of a
handwritten signature in a natural and comprehensive way. Chapter 2.7 by Drs.
Maergner, Riesen et al. thoroughly reviews two standard graph matching
algorithms that can be readily integrated into an end-to-end signature verification
framework. The system presented in the chapter is able to combine the complementary
strengths of the structural approach and statistical models to improve
signature verification performance. The reader may also be interested in the
chapter in the 5th edition of the Handbook series, also by Prof. Riesen, on Graph
Edit Distance: Novel Approximation Algorithms.
Chapter 2.8 by Prof. Huang and Dr. Hsieh is on Cellular Neural Network for
Seismic Pattern Recognition. The discrete-time cellular neural network (DT-
CNN) is used as an associative memory, and the associative memory is then used to
recognize seismic patterns. The seismic patterns are the bright spot pattern and the
right and left pinch-out patterns, which have the structure of gas and oil sand zones. In
comparison with the Hopfield associative memory, the DT-CNN has
better recovery capacity. The results of seismic image interpretation using DT-
CNN are also good. An automatic matching algorithm is necessary for a quick
and accurate search of the law enforcement face databases or surveillance
cameras using a forensic sketch. In Chapter 2.9, Incorporating Facial Attributes
in Cross-Modal Face Verification and Synthesis, by H. Kazemi et al., two deep
learning frameworks are introduced to train a Deep Coupled Convolutional Neural
Network for facial attribute-guided sketch-to-photo matching and synthesis. The

experimental results show the superiority of the proposed attribute-guided


frameworks compared to the state-of-the-art techniques.
Finally in Chapter 2.10, Connected and Autonomous Vehicles in the Deep
Learning Era: A Case Study On Computer-Guided Steering, by Drs. Valiente,
Ozer, et al., the challenging problem of machine learning in self-driving vehicles
is examined in general and a specific case study is presented. The authors
consider the control of the steering angle as a regression problem where the input
is a stack of images and the output is the steering angle of the vehicle.
Considering multiple frames in a sequence helps to deal with noise and
occasionally corrupted images, such as those caused by sunlight. The new deep
architecture that is used to predict the steering angle automatically consists of
convolutional, Long Short-Term Memory (LSTM), and fully
connected layers. It processes both present and future images (shared by a
vehicle ahead via vehicle-to-vehicle communication) as input to control the
steering angle.
With a handbook of this size, or even ten times the size, it is clearly difficult
to capture the full development of the field of pattern recognition and computer
vision. Unlike a journal special issue, the book covers key progress in theory
and application of pattern recognition and computer vision. I hope the readers
will examine all six volumes of the Handbook series, which reflect the advances of
nearly three decades, to gain a better understanding of this highly
dynamic field. With the support of the Information Research Foundation, free
access to vols. 1–4 of the Handbook series has been available since early
July 2018. For your convenience, the URL links
are as follows:
Vol. 1: https://www.worldscientific.com/worldscibooks/10.1142/1802#t=toc
Vol. 2: https://www.worldscientific.com/worldscibooks/10.1142/3414#t=toc
Vol. 3: https://www.worldscientific.com/worldscibooks/10.1142/5711#t=toc
Vol. 4: https://www.worldscientific.com/worldscibooks/10.1142/7297#t=toc
I would like to take this opportunity to thank all chapter authors throughout the
years for their important contributions to the Handbook series. My very special
thanks go to all chapter authors of the current volume.

C.H. Chen
February 3, 2020
CONTENTS

Dedication v
Preface vii

PART 1: THEORY, TECHNOLOGY AND SYSTEMS 1

A Brief Introduction to Part 1 (by C.H. Chen) 2
Chapter 1.1 Optimal Statistical Classification 7
Edward R. Dougherty, Jr. and Lori Dalton
Chapter 1.2 Deep Discriminative Feature Learning Method for Object Recognition 31
Weiwei Shi and Yihong Gong
Chapter 1.3 Deep Learning Based Background Subtraction: A Systematic Survey 51
Jhony H. Giraldo, Huu Ton Le, and Thierry Bouwmans
Chapter 1.4 Similarity Domains Network for Modeling Shapes and Extracting Skeletons without Large Datasets 75
Sedat Ozer
Chapter 1.5 On Curvelet-Based Texture Features for Pattern Classification (Reprinted from Chapter 1.7 of 5th HBPRCV) 87
Ching-Chung Li and Wen-Chyi Lin
Chapter 1.6 An Overview of Efficient Deep Learning on Embedded Systems 107
Xianju Wang
Chapter 1.7 Random Forest for Dissimilarity-Based Multi-View Learning 119
Simon Bernard, Hongliu Cao, Robert Sabourin and Laurent Heutte
Chapter 1.8 A Review of Image Colourisation 139
Bo Li, Yu-Kun Lai, and Paul L. Rosin
Chapter 1.9 Recent Progress of Deep Learning for Speech Recognition 159
Jinyu Li and Dong Yu

PART 2: APPLICATIONS 183

A Brief Introduction to Part 2 (by C.H. Chen) 184
Chapter 2.1 Machine Learning in Remote Sensing 187
Ronny Hänsch
Chapter 2.2 Hyperspectral and Spatially Adaptive Unmixing for Analytical Reconstruction of Fraction Surfaces from Data with Corrupt Pixels 209
Fadi Kizel and Jon Atli Benediktsson
Chapter 2.3 Image Processing for Sea Ice Parameter Identification from Visual Images 231
Qin Zhang
Chapter 2.4 Applications of Deep Learning to Brain Segmentation and Labeling of MRI Brain Structures 251
Evan Fletcher and Alexander Knaack
Chapter 2.5 Automatic Segmentation of IVUS Images Based on Temporal Texture Analysis 271
A. Gangidi and C.H. Chen
Chapter 2.6 Deep Learning for Historical Document Analysis 287
Foteini Simistira Liwicki and Marcus Liwicki
Chapter 2.7 Signature Verification via Graph-Based Methods 305
Paul Maergner, Kaspar Riesen, Rolf Ingold, and Andreas Fischer
Chapter 2.8 Cellular Neural Network for Seismic Pattern Recognition 323
Kou-Yuan Huang and Wen-Hsuan Hsieh
Chapter 2.9 Incorporating Facial Attributes in Cross-modal Face Verification and Synthesis 343
Hadi Kazemi, Seyed Mehdi Iranmanesh and Nasser M. Nasrabadi
Chapter 2.10 Connected and Autonomous Vehicles in the Deep Learning Era: A Case Study on Computer-Guided Steering 365
Rodolfo Valiente, Mahdi Zaman, Yaser P. Fallah and Sedat Ozer

Index 385
PART 1

THEORY, TECHNOLOGY AND SYSTEMS



A BRIEF INTRODUCTION

From my best recollection, the effort toward making machines as intelligent as
human beings shifted in the late 50’s to a more realistic goal of automating the
human recognition process, which was soon followed by computer processing of
pictures. Statistical pattern classification emerged as a major approach to pattern
recognition and even now, some sixty years later, is still an active research area with
focus more toward classification tree methods. Feature extraction has been
considered a key problem in pattern recognition and is not yet a well-solved
problem. It is still an important problem despite the use of neural networks and
deep learning for classification. In statistical pattern recognition, both parametric
and nonparametric methods were extensively investigated in the 60’s and 70’s.
The nearest neighbor decision rule for classification had well over a thousand
publications. Other theoretical pattern recognition approaches have been
developed since the late sixties, including syntactical (grammatical) pattern
recognition and structural pattern recognition.
For waveforms, effective features extracted from the spectral, temporal and
statistical domains have been limited. For images, texture features and local edge
detectors have been quite effective. Good features are still much needed and can
make good use of human ingenuity. Machine learning has been an essential part
of pattern recognition and computer vision. In the 60’s and 70’s machine learning
in pattern recognition was focused on improving parameter estimates of a
distribution or nonparametric estimates of probability densities with supervised and
unsupervised learning samples. The re-introduction of artificial neural networks in
the mid-80’s has had a tremendous impact on machine learning in pattern recognition.
The feature extraction problem has received less attention recently as neural
networks can work with large feature dimensions. Obviously the major advances
in computing using personal computers have greatly improved the automated
recognition capability with both neural networks and the more traditional non-
neural-network approaches. It is noted that there has not been conclusive evidence
that the best neural networks can perform better than the Bayes decision rules for real
data. However, accurate class statistics may not be established from limited real
data. The most cited textbooks in pattern recognition, in my view, are Fukunaga [1],
Duda et al. [2] and Devijver et al. [3]. The most cited textbook for neural networks
is by Haykin [4]. Contextual information is important for pattern recognition and
there was extensive research in the use of context in pattern recognition and
computer vision (see e.g. [5,6]). Feature evaluation and error estimation were
other hot topics in the 70’s (see e.g. [1,7]). For many years, researchers have
considered the so called “Hughes phenomenon” (see e.g. [8]), which states that for
finite training sample size, there is a peak in mean recognition accuracy. A large
feature dimension, however, may imply better separability among pattern classes.
The support vector machine is one way to increase the number of features for better
classification.
Syntactic pattern recognition is a very different approach that has different
feature extraction and decision making processes. It consists of string grammar-
based methods, tree grammar-based methods and graph grammar-based methods.
The most important book is by Fu [9]. More recent books include those by Bunke
et al. [10] and Flasinski [11], the latter of which has over 1,000 entries in its bibliography.
Structural pattern recognition (see e.g. [12]) can be more related to signal/image
segmentation and can be closely linked to syntactic pattern recognition.
In more recent years, much research effort in pattern recognition has been in sparse
representation (see e.g. [13]) and in tree classifiers such as random
forests, as well as various forms of machine learning involving neural networks.
In connection with sparse representation, compressive sensing (not data
compression) has been very useful in some complex image and signal recognition
problems (see e.g. [14]).
The development of computer vision largely evolved from digital image
processing with early frontier work by Rosenfeld [15] and many of his subsequent
publications. A popular textbook on digital image processing is by Gonzalez and
Woods [16]. Digital image processing by itself can only be considered as low level
to mid-level computer vision. While image segmentation and edge extraction can
be loosely considered as middle level computer vision, the high level computer
vision which is supposed to be like human vision has not been well defined.
Among many textbooks in computer vision is the work of Haralick et al. listed in
[17,18]. There have been many advances in computer vision, especially in the last
20 years (see e.g. [19]).
Machine learning has been fundamental to both pattern
recognition and computer vision. In pattern recognition, many supervised, semi-
supervised and unsupervised learning approaches have been explored. The neural
network approaches are particularly suitable for machine learning for pattern
recognition. The multilayer perceptron using the back-propagation training algorithm,
kernel methods for support vector machines, self-organizing maps and
dynamically driven recurrent networks represent much of what neural networks
have contributed to machine learning [4].
The recently popular deep learning neural networks started with a
complex extension of the multilayered neural network by LeCun et al. [20] and
expanded into various versions of convolutional neural networks (see e.g. [21]). Deep
learning implies a lot of learning with many parameters (weights) on a large data
set. As expected, some performance improvement over traditional neural network
methods can be achieved. As an emphasis of this Handbook edition, we have
included several chapters dealing with deep learning. Clearly, deep learning, as a
renewed effort in neural networks since the mid-nineties, is among the important
steps toward mature artificial intelligence. However, we take a balanced view in
this book by placing as much importance on past work in pattern recognition
and computer vision as on new approaches like deep learning. We believe that any
work that is built on a solid mathematical and/or physical foundation will have long-
lasting value. Examples are the Bayes decision rule, nearest-neighbor decision
rule, snake-based image segmentation models, etc.
Though theoretical work on pattern recognition and computer vision has
moved at a fairly slow but steady pace, software and hardware development has
progressed much faster, thanks to ever-increasing computer power. MATLAB
alone, for example, has served software needs so well that it has diminished the need
for dedicated software systems. Rapid development in powerful sensors and
scanners has made possible many real-time or near real-time uses of pattern
recognition and computer vision. Throughout this Handbook series, we have
included several chapters on hardware development. Perhaps continued and
increased commercial and non-commercial needs have driven the rapid progress
in the hardware as well as software development.

References
1. K. Fukunaga, “Introduction to Statistical Pattern Recognition”, second edition, Academic Press
1990.
2. R. Duda, P. Hart, and D. G. Stork, “Pattern Classification”, second edition, Wiley 2001.
3. P.A. Devijver and J. Kittler, “Pattern Recognition: A Statistical Approach”, Prentice-Hall 1982.
4. S. Haykin, “Neural Networks and Learning Machines”, third edition, 2008.
5. K.S. Fu and T.S. Yu, “Statistical Pattern Classification using Contextual Information”, Research
Studies Press, a Division of Wiley, 1976.
6. G. Toussaint, “The use of context in pattern recognition”, Pattern Recognition, Vol. 10, pp. 189-
204, 1978.
7. C.H. Chen, “On information and distance measures, error bounds, and feature selection”,
Information Sciences Journal, Vol. 10, 1976.
8. D. Landgrebe, “Signal Theory Methods in Multispectral Remote Sensing”, Wiley 2003.
9. K.S. Fu, “Syntactic Pattern Recognition and Applications”, Prentice-Hall 1982.
10. H.O. Bunke, A. Sanfeliu, editors, “Syntactic and Structural Pattern Recognition-theory and
Applications”, World Scientific Publishing, 1992.
11. M. Flasinski, “Syntactic Pattern Recognition”, World Scientific Publishing, March 2019.
12. T. Pavlidis, “Structural Pattern Recognition”, Springer, 1977.

13. Y. Chen, T.D. Tran and N.M. Nasrabadi, “Sparse representation for target detection and
classification in hyperspectral imagery”, Chapter 19 of “Signal and Image Processing for
Remote Sensing”, second edition, edited by C.H. Chen, CRC Press 2012.
14. M.L. Mekhalfi, F. Melgani, et al., “Land use classification with sparse models”, Chapter 14 of
“Compressive sensing of Earth Observations”, edited by C.H. Chen, CRC Press 2017.
15. A. Rosenfeld, “Picture Processing by Computer”, Academic Press 1969.
16. R.C. Gonzalez and R.E. Woods, “Digital Image Processing”, 4th edition, Prentice-Hall 2018.
17. R.M. Haralick and L.G. Shapiro, “Computer and Robot Vision”, Vol. 1, Addison-Wesley
Longman 2002.
18. R.M. Haralick and L.G. Shapiro, “Computer and Robot Vision”, Vol. 2, Addison-Wesley
Longman 2002.
19. C.H. Chen, editor, “Emerging Topics in Computer Vision”, World Scientific Publishing 2012.
20. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, Vol. 521, no. 7553, pp. 436-444,
2015.
21. I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, MIT Press, Cambridge, MA, 2016.

CHAPTER 1.1

OPTIMAL STATISTICAL CLASSIFICATION


Edward R. Dougherty¹ and Lori Dalton²
¹Department of Electrical and Computer Engineering, Texas A&M University
²Department of Electrical and Computer Engineering, Ohio State University
¹Email: edward@ece.tamu.edu

Typical classification rules input sample data and output a classifier. This is
different from the engineering paradigm in which an optimal operator is derived
based on a model and cost function. If the model is uncertain, one can incor-
porate prior knowledge of the model with data to produce an optimal Bayesian
operator. In classification, the model is a feature-label distribution and, if this
is known, then a Bayes classifier provides optimal classification relative to the
classification error. This chapter reviews optimal Bayesian classification, in
which there is an uncertainty class of feature-label distributions governed by a
prior distribution, a posterior distribution is derived by conditioning the prior
on the sample, and the optimal Bayesian classifier possesses minimal expected
error relative to the posterior. The chapter covers binary and multi-class clas-
sification, prior construction from scientific knowledge, and optimal Bayesian
transfer learning, where the training data are augmented with data from a
different source.

1 Introduction

The basic structure of engineering is to operate on a system to achieve some objec-


tive. Engineers design operators to control, perturb, filter, compress, and classify
systems. In the classical paradigm initiated by the Wiener-Kolmogorov theory for
linearly filtering signals, the signals are modeled as random functions, the opera-
tors are modeled as integral operators, and the accuracy of a filter is measured by
the mean-square error between the true signal and the filtered observation signal.
The basic paradigm consists of four parts: (1) a scientific (mathematical) model
describing the physical system, (2) a class of operators to choose from, (3) a cost
function measuring how well the objective is being achieved, and (4) optimization
to find an operator possessing minimum cost. Data are important in the process
because system parameters must be estimated. What may appear to be a big data
set might actually be very small relative to system complexity. And even if a system
is not complex, there may be limited access to data. Should there be insufficient
data for accurate parameter estimation, the system model will be uncertain.
Suppose the scientific model is uncertain, and the true model belongs to an


uncertainty class Θ of models determined by a parameter vector θ composed of the


unknown parameters. As in the classical setting there is a cost function C and a class
Ψ of operators on the model whose performances are measured by the cost function.
For each operator ψ ∈ Ψ there is a cost Cθ (ψ) of applying ψ on model θ ∈ Θ. An
intrinsically Bayesian robust (IBR) operator minimizes the expected value of the
cost with respect to a prior probability distribution π(θ) over Θ.1,2 An IBR operator
is robust in the sense that on average it performs well over the whole uncertainty
class. The prior distribution reflects our existing knowledge. If, in addition to a
prior distribution coming from existing knowledge, there is a data sample S, the
prior distribution conditioned on the sample yields a posterior distribution π ∗ (θ)
= π(θ|S). An IBR operator for the posterior distribution is called an optimal
Bayesian operator. For the general theory applied to other operator classes, such
as filters and clusterers, see Ref. 3.
The Wiener-Kolmogorov theory for linear filters was introduced in the 1930s,
Kalman-Bucy recursive filtering in the 1960s, and optimal control and classification
in the 1950s. In all areas, it was recognized that often the scientific model would
not be known. Whereas this led to the development of adaptive linear/Kalman
filters and adaptive controllers, classification became dominated by rules that did
not estimate the feature-label distribution. Control theorists delved into Bayesian
robust control for Markov decision processes in the 1960s,4,5 but computation was
prohibitive and adaptive methods prevailed. Minimax optimal linear filtering was
approached in the 1970s.6,7 Suboptimal design of filters and classifiers in the context
of a prior distribution occurred in the early 2000s.8,9 IBR design for nonlinear/linear
filtering,2 Kalman filtering,10 and classification11,12 has been achieved quite recently.
This chapter focuses on optimal Bayesian classification.

2 Optimal Bayesian Classifier

Binary classification involves a feature vector X = (X_1, X_2, ..., X_d) ∈ ℝ^d composed
of random variables (features), a binary random variable Y, and a classifier ψ :
ℝ^d → {0, 1} to serve as a predictor of Y, meaning Y is predicted by ψ(X). The
features X_1, X_2, ..., X_d can be discrete or real-valued. The values, 0 or 1, of Y are
treated as class labels. Classification is characterized by the probability distribution
f(x, y) of the feature-label pair (X, Y), which is called the feature-label distribution.
The error ε[ψ] of ψ is the probability of erroneous classification: ε[ψ] = P(ψ(X) ≠ Y).
An optimal classifier ψ_bay, called a Bayes classifier, is one having minimal
error among the collection of all classifiers on ℝ^d. The error ε_bay of a Bayes classifier
is called the Bayes error. The Bayes classifier and its error can be found from the
feature-label distribution.
In practice, the feature-label distribution is unknown and classifiers are designed
from sample data. A classification rule takes sample data as input and outputs a
classifier. A random sample refers to a sample whose points are independent and
identically distributed according to the feature-label distribution. The stochastic

process that generates the random sample constitutes the sampling distribution.
A classifier is optimal relative to a feature-label distribution and a collection C
of classifiers if it is in C and its error is minimal among all classifiers in C:
ψ_opt = arg min_{ψ∈C} ε[ψ].  (1)
Suppose the feature-label distribution is unknown, but we know that it is charac-
terized by an uncertainty class Θ of parameter vectors corresponding to feature-label
distributions f_θ(x, y) for θ ∈ Θ. Now suppose we have scientific knowledge regard-
ing the features and labels, and this allows us to construct a prior distribution π(θ)
governing the likelihood that θ ∈ Θ parameterizes the true feature-label distribu-
tion, where we assume the prior is uniform if we have no knowledge except that
the true feature-label distribution lies in the uncertainty class. Then the optimal
classifier, known as an intrinsically Bayesian robust classifier (IBRC), is defined by

ψ^Θ_IBR = arg min_{ψ∈C} E_π[ε_θ[ψ]],  (2)

where ε_θ[ψ] is the error of ψ relative to f_θ(x, y) and E_π is expectation relative to
π.11,12 The IBRC is optimal on average over the uncertainty class, but it will not
be optimal for any particular feature-label distribution unless it happens to be a
Bayes classifier for that distribution.
Going further, suppose we have a random sample S_n = {(X_1, Y_1), . . . , (X_n, Y_n)}
of vector-label pairs drawn from the actual feature-label distribution. The posterior
distribution is defined by π*(θ) = π(θ|S_n) and the optimal classifier, known as an
optimal Bayesian classifier (OBC), denoted ψ^Θ_OBC, is defined by Eq. 2 with π* in
place of π.11 An OBC is an IBRC relative to the posterior, and an IBRC is an OBC
with a null sample. Because we are generally interested in design using samples, we
focus on the OBC. For both the IBRC and the OBC, we omit the Θ in the notation
if the uncertainty class is clear from the context. Given our prior knowledge and
the data, the OBC is the best classifier to use.
A sample-dependent minimum-mean-square-error (MMSE) estimator ε̂(S_n) of
ε_θ[ψ] minimizes E_{π,S_n}[|ε_θ[ψ] − ξ(S_n)|²] over all Borel measurable functions ξ(S_n),
where E_{π,S_n} denotes expectation with respect to the prior distribution and the sam-
pling distribution. According to classical estimation theory, ε̂(S_n) is the conditional
expectation given S_n. Thus,

ε̂(S_n) = E_π[ε_θ[ψ]|S_n] = E_{π*}[ε_θ[ψ]].  (3)

In this light, E_{π*}[ε_θ[ψ]] is called the Bayesian MMSE error estimator (BEE) and
is denoted by ε̂_Θ[ψ; S_n].13,14 The OBC can be reformulated as

ψ^Θ_OBC(S_n) = arg min_{ψ∈C} ε̂_Θ[ψ; S_n].  (4)

Besides minimizing E_{π,S_n}[|ε_θ[ψ] − ξ(S_n)|²], the BEE is also an unbiased estimator
of ε_θ[ψ] over the distribution of θ and S_n:

E_{S_n}[ε̂_Θ[ψ; S_n]] = E_{S_n}[E_π[ε_θ[ψ]|S_n]] = E_{π,S_n}[ε_θ[ψ]].  (5)
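Since Eq. 3 rarely has a closed form outside the models treated below, one way to approximate the BEE is to average the true error of a fixed classifier over draws from the posterior. The following Python sketch only illustrates Eq. 3 as a posterior expectation; the helpers posterior_sampler and error_fn are hypothetical and not part of the chapter.

```python
import numpy as np

def bee_estimate(classifier, posterior_sampler, error_fn, n_draws=1000, rng=None):
    """Monte Carlo approximation of the Bayesian MMSE error estimator (Eq. 3).

    posterior_sampler(rng) -> one parameter vector theta drawn from pi*(theta)  (hypothetical)
    error_fn(classifier, theta) -> true error of the classifier under theta     (hypothetical)
    """
    rng = np.random.default_rng(rng)
    draws = [error_fn(classifier, posterior_sampler(rng)) for _ in range(n_draws)]
    return float(np.mean(draws))
```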

2.1 OBC Design


Two issues must be addressed in OBC design: representation of the BEE and min-
imization. In binary classification, θ is a random vector composed of three parts:
the parameters of the class-0 and class-1 conditional distributions, θ0 and θ1 , re-
spectively, and the class-0 prior probability c = c0 (with c1 = 1 − c for class 1).
Let Θy denote the parameter space for θy , y = 0, 1, and write the class-conditional
distribution as fθy (x|y). The marginal prior densities are π(θy ), y = 0, 1, and π(c).
To facilitate analytic representations, we assume that c, θ0 and θ1 are all indepen-
dent prior to observing the data. This assumption allows us to separate the prior
density π(θ) and ultimately to separate the BEE into components representing the
error contributed by each class.
Given the independence of c, θ0 and θ1 prior to sampling, they remain indepen-
dent given the data: π ∗ (θ) = π ∗ (c)π ∗ (θ0 )π ∗ (θ1 ), where π ∗ (θ0 ), π ∗ (θ1 ), and π ∗ (c)
are the marginal posterior densities for θ0 , θ1 , and c, respectively.13
Focusing on c, and letting n0 be the number of class-0 points, since n0 ∼
Binomial(n, c) given c,

π*(c) = π(c|n_0) ∝ π(c) f(n_0|c) ∝ π(c) c^{n_0} (1 − c)^{n_1}.  (6)

If π(c) is beta(α, β) distributed, then π ∗ (c) is still a beta distribution,


π*(c) = c^{n_0+α−1} (1 − c)^{n_1+β−1} / B(n_0 + α, n_1 + β),  (7)

where B is the beta function, and E_{π*}[c] = (n_0 + α)/(n + α + β). If c is known, then E_{π*}[c] = c.
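As a small numerical illustration of this beta-posterior update (a sketch using SciPy, which the chapter does not reference; the function name is illustrative):

```python
from scipy.stats import beta

def posterior_c(n0, n1, a=1.0, b=1.0):
    """Posterior over the class-0 prior probability c (Eq. 7).

    With a beta(a, b) prior and n0 class-0 / n1 class-1 points, the posterior is
    beta(n0 + a, n1 + b); its mean is the E_pi*[c] used throughout the BEE.
    """
    post = beta(n0 + a, n1 + b)
    return post, post.mean()

# Example: 7 class-0 and 13 class-1 points under a uniform beta(1, 1) prior
dist, exp_c = posterior_c(7, 13)   # exp_c = (7 + 1) / (20 + 2)
```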
The posteriors for the parameters are found via Bayes’ rule,

π*(θ_y) = f(θ_y|S_{n_y}) ∝ π(θ_y) f(S_{n_y}|θ_y) = π(θ_y) ∏_{i: y_i=y} f_{θ_y}(x_i|y),  (8)

where ny is the number of y-labeled points (xi , yi ) in the sample, Sny is the subset
of sample points from class y, and the constant of proportionality can be found by
normalizing the integral of π ∗ (θy ) to 1. The term f (Sny |θy ) is called the likelihood
function.
Although we call π(θy ), y = 0, 1, the “prior probabilities,” they are not required
to be valid density functions. A prior is called “improper” if the integral of π(θy )
is infinite. When improper priors are used, Bayes’ rule does not apply. Hence,
assuming the posterior is integrable, we take Eq. 8 as the definition of the posterior
distribution, normalizing it so that its integral is equal to 1.
Owing to the posterior independence between c, θ_0 and θ_1, and the fact that
ε_θ^y[ψ], the error on class y, is a function of θ_y only, the BEE can be expressed as

ε̂_Θ[ψ; S_n] = E_{π*}[c ε_θ^0[ψ] + (1 − c) ε_θ^1[ψ]]
            = E_{π*}[c] E_{π*}[ε_θ^0[ψ]] + (1 − E_{π*}[c]) E_{π*}[ε_θ^1[ψ]],  (9)

where

E_{π*}[ε_θ^y[ψ]] = ∫_{Θ_y} ε_{θ_y}^y[ψ] π*(θ_y) dθ_y  (10)

is the posterior expectation for the error contributed by class y. Letting ε̂_Θ^y[ψ; S_n] =
E_{π*}[ε_θ^y[ψ]], Eq. 9 takes the form

ε̂_Θ[ψ; S_n] = E_{π*}[c] ε̂_Θ^0[ψ; S_n] + (1 − E_{π*}[c]) ε̂_Θ^1[ψ; S_n].  (11)

We evaluate the BEE via effective class-conditional densities, which for y = 0, 1,


are defined by11


f_Θ(x|y) = ∫_{Θ_y} f_{θ_y}(x|y) π*(θ_y) dθ_y.  (12)

The following theorem provides the key representation for the BEE.
Theorem 1 [11]. If ψ(x) = 0 if x ∈ R_0 and ψ(x) = 1 if x ∈ R_1, where R_0 and
R_1 are measurable sets partitioning ℝ^d, then, given random sample S_n, the BEE is
given by

ε̂_Θ[ψ; S_n] = E_{π*}[c] ∫_{R_1} f_Θ(x|0) dx + (1 − E_{π*}[c]) ∫_{R_0} f_Θ(x|1) dx
           = ∫_{ℝ^d} (E_{π*}[c] f_Θ(x|0) I_{x∈R_1} + (1 − E_{π*}[c]) f_Θ(x|1) I_{x∈R_0}) dx,  (13)

where I denotes the indicator function, 1 or 0, depending on whether the condition
is true or false. Moreover, for y = 0, 1,

ε̂_Θ^y[ψ; S_n] = E_{π*}[ε_θ^y[ψ]] = ∫_{ℝ^d} f_Θ(x|y) I_{x∈R_{1−y}} dx.  (14)

In the unconstrained case in which the OBC is over all possible classifiers, The-
orem 1 leads to pointwise expression of the OBC by simply minimizing Eq. 13.
Theorem 2 [11]. The optimal Bayesian classifier over the set of all classifiers is
given by


ψ^Θ_OBC(x) = 0 if E_{π*}[c] f_Θ(x|0) ≥ (1 − E_{π*}[c]) f_Θ(x|1), and 1 otherwise.  (15)

The representation in the theorem is the representation for the Bayes classifier
for the feature-label distribution defined by class-conditional densities fΘ (x|0) and
fΘ (x|1), and class-0 prior probability Eπ∗ [c]; that is, the OBC is the Bayes classifier
for the effective class-conditional densities. We restrict our attention to the OBC
over all possible classifiers.
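In code, Eq. 15 is a single comparison once the effective class-conditional densities and E_π*[c] are available from a particular model; the sketch below treats them as supplied callables (the names are illustrative, not from the chapter).

```python
def obc_predict(x, eff_density_0, eff_density_1, exp_c):
    """Pointwise OBC rule of Eq. 15 (a sketch; f_Theta(x|y) and exp_c = E_pi*[c]
    must be supplied by the chosen model, e.g. Eq. 18 or Eq. 36)."""
    score0 = exp_c * eff_density_0(x)          # E_pi*[c] f_Theta(x|0)
    score1 = (1.0 - exp_c) * eff_density_1(x)  # (1 - E_pi*[c]) f_Theta(x|1)
    return 0 if score0 >= score1 else 1
```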

3 OBC for the Discrete Model

If the range of X is finite, then there is no loss in generality in assuming a single


feature X taking values in {1, . . . , b}. This discrete classification problem is de-
fined by the class-0 prior probability c0 and the class-conditional probability mass
functions p_i = P(X = i|Y = 0), q_i = P(X = i|Y = 1), for i = 1, . . . , b. Since
p_b = 1 − Σ_{i=1}^{b−1} p_i and q_b = 1 − Σ_{i=1}^{b−1} q_i, the classification problem is determined
by a (2b − 1)-dimensional vector (c_0, p_1, . . . , p_{b−1}, q_1, . . . , q_{b−1}) ∈ ℝ^{2b−1}. We con-
sider an arbitrary number of bins with beta class priors and define the parameters
for each class to contain all but one bin probability: θ_0 = [p_1, p_2, . . . , p_{b−1}] and
θ_1 = [q_1, q_2, . . . , q_{b−1}]. Each parameter space is defined as the set of all valid bin
probabilities. For example, [p_1, p_2, . . . , p_{b−1}] ∈ Θ_0 if and only if 0 ≤ p_i ≤ 1 for
i = 1, . . . , b − 1 and Σ_{i=1}^{b−1} p_i ≤ 1. We use the Dirichlet priors

π(θ_0) ∝ ∏_{i=1}^b p_i^{α_i^0 − 1}  and  π(θ_1) ∝ ∏_{i=1}^b q_i^{α_i^1 − 1},  (16)

where α_i^y > 0. These are conjugate priors, meaning that the posteriors take the
same form. Increasing a specific α_i^y has the effect of biasing the corresponding bin
with α_i^y samples from the corresponding class before observing the data.
The posterior distributions are again Dirichlet and are given by
π*(θ_y) = [Γ(n_y + Σ_{i=1}^b α_i^y) / ∏_{k=1}^b Γ(U_k^y + α_k^y)] ∏_{i=1}^b p_i^{U_i^y + α_i^y − 1}  (17)

for y = 0 and a similar expression with p replaced by q for y = 1, where U_i^y is the
number of observations in bin i for class y.13 The effective class-conditional densities
are given by13

f_Θ(j|y) = (U_j^y + α_j^y) / (n_y + Σ_{i=1}^b α_i^y).  (18)
From Eq. 13,
ε̂_Θ[ψ; S_n] = Σ_{j=1}^b [ E_{π*}[c] (U_j^0 + α_j^0)/(n_0 + Σ_{i=1}^b α_i^0) I_{ψ(j)=1}
             + (1 − E_{π*}[c]) (U_j^1 + α_j^1)/(n_1 + Σ_{i=1}^b α_i^1) I_{ψ(j)=0} ].  (19)

In particular,

ε̂_Θ^y[ψ; S_n] = Σ_{j=1}^b (U_j^y + α_j^y)/(n_y + Σ_{i=1}^b α_i^y) I_{ψ(j)=1−y}.  (20)

From Eq. 15, using the effective class-conditional densities in Eq. 18,11

ψ^Θ_OBC(j) = 1 if E_{π*}[c] (U_j^0 + α_j^0)/(n_0 + Σ_{i=1}^b α_i^0) < (1 − E_{π*}[c]) (U_j^1 + α_j^1)/(n_1 + Σ_{i=1}^b α_i^1), and 0 otherwise.  (21)

From Eq. 13, the expected error of the OBC is

 
ε_OBC = Σ_{j=1}^b min{ E_{π*}[c] (U_j^0 + α_j^0)/(n_0 + Σ_{i=1}^b α_i^0), (1 − E_{π*}[c]) (U_j^1 + α_j^1)/(n_1 + Σ_{i=1}^b α_i^1) }.  (22)

The OBC minimizes the BEE by minimizing each term in the sum of Eq. 19 by
assigning ψ(j) the class with the smaller constant scaling the indicator function.
The OBC is optimal on average across the posterior distribution, but its behavior
for any specific feature-label distribution is not guaranteed. Generally speaking, if
the prior is concentrated in the vicinity of the true feature-label distribution, then
results are good. But there is risk. If one uses a tight prior that is concentrated away
from the true feature-label distribution, results can be very bad. Correct knowledge
helps; incorrect knowledge hurts. Thus, prior construction is very important, and
we will return to that issue in a subsequent section.
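A compact sketch of the discrete-model computations in Eqs. 18–22, assuming the bin counts and Dirichlet hyperparameters are supplied as arrays (function and variable names are illustrative, not from the chapter):

```python
import numpy as np

def discrete_obc(U0, U1, alpha0, alpha1, exp_c):
    """OBC for the discrete (bin) model, Eqs. 18-22 (illustrative sketch).

    U0, U1         : length-b arrays of bin counts for class 0 and class 1
    alpha0, alpha1 : length-b Dirichlet hyperparameters for each class
    exp_c          : posterior mean E_pi*[c] of the class-0 prior probability

    Returns the OBC label for each bin and the BEE of the OBC (Eq. 22).
    """
    U0, U1 = np.asarray(U0, float), np.asarray(U1, float)
    alpha0, alpha1 = np.asarray(alpha0, float), np.asarray(alpha1, float)
    n0, n1 = U0.sum(), U1.sum()

    # Effective class-conditional densities, Eq. 18
    f0 = (U0 + alpha0) / (n0 + alpha0.sum())
    f1 = (U1 + alpha1) / (n1 + alpha1.sum())

    score0 = exp_c * f0            # weighted class-0 effective density
    score1 = (1.0 - exp_c) * f1    # weighted class-1 effective density
    labels = np.where(score0 < score1, 1, 0)   # Eq. 21
    bee = np.minimum(score0, score1).sum()     # Eq. 22
    return labels, bee
```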
Following an example in Ref. 3, suppose the true distribution is discrete with
c = 0.5,

p1 = p2 = p3 = p4 = 3/16,
p5 = p6 = p7 = p8 = 1/16,
q1 = q2 = q3 = q4 = 1/16,
q5 = q6 = q7 = q8 = 3/16.

Consider five Dirichlet priors π1 , π2 , ..., π5 with c = 0.5,

α_1^{j,0} = α_2^{j,0} = α_3^{j,0} = α_4^{j,0} = a_{j,0},
α_5^{j,0} = α_6^{j,0} = α_7^{j,0} = α_8^{j,0} = b_{j,0},
α_1^{j,1} = α_2^{j,1} = α_3^{j,1} = α_4^{j,1} = a_{j,1},
α_5^{j,1} = α_6^{j,1} = α_7^{j,1} = α_8^{j,1} = b_{j,1},

for j = 1, 2, ..., 5, where aj,0 = 1, 1, 1, 2, 4 for j = 1, 2, ..., 5, respectively, bj,0 =


4, 2, 1, 1, 1 for j = 1, 2, ..., 5, respectively, aj,1 = 4, 2, 1, 1, 1 for j = 1, 2, ..., 5, re-
spectively, and bj,1 = 1, 1, 1, 2, 4 for j = 1, 2, ..., 5, respectively. For n = 5 through
n = 30, 100,000 samples of size n are generated. For each of these we design a
histogram classifier, which assigns to each bin the majority label in the bin, and
five OBCs corresponding to the five priors. Figure 1 shows average errors, with the
Bayes error for the true distribution marked by small circles. Whereas the OBC
from the uniform prior (prior 3) performs slightly better than the histogram rule,
putting more prior mass in the vicinity of the true distribution (priors 4 and 5) gives
greatly improved performance. The risk in leaving uniformity is demonstrated by
priors 1 and 2, whose masses are concentrated away from the true distribution.

[Figure 1 here: plot of average true error versus sample size (n = 5 to 30) for the histogram classifier and the OBCs under priors 1–5, with the Bayes error shown for reference; c = 0.5.]

Fig. 1. Average true errors for the histogram classifier and OBCs based on different prior distributions. [Reprinted from Dougherty, Optimal Signal Processing Under Uncertainty, SPIE Press, 2018.]

4 OBC for the Gaussian Model

For y ∈ {0, 1}, assume an ℝ^d Gaussian distribution with parameters θ_y = [μ_y, Λ_y],


where μy is the mean of the class-conditional distribution and Λy is a collection
of parameters determining the covariance matrix Σy of the class. We distinguish
between Λy and Σy to enable us to impose a structure on the covariance. In Refs.
13 and 14, three types of models are considered: a fixed covariance (Σy = Λy is
known perfectly), a scaled identity covariance having uncorrelated features with
equal variances (Λ_y = σ_y² is a scalar and Σ_y = σ_y² I_d, where I_d is the d × d identity
matrix), and a general (unconstrained, but valid) random covariance matrix, Σy =
Λ_y. The parameter space of μ_y is ℝ^d. The parameter space of Λ_y, denoted Λ_y,
must permit only valid covariance matrices. We write Σy without explicitly showing
its dependence on Λy . A multivariate Gaussian distribution with mean μ and
covariance Σ is denoted by fμ,Σ (x), so that the parameterized class-conditional
distributions are fθy (x|y) = fμy ,Σy (x).
In the independent covariance model, c, θ0 = [μ0 , Λ0 ] and θ1 = [μ1 , Λ1 ] are
independent prior to the data, so that π(θ) = π(c)π(θ0 )π(θ1 ). Assuming π(c)
and π ∗ (c) have been established, we require priors π(θy ) and posteriors π ∗ (θy ) for
both classes. We begin by specifying conjugate priors for θ0 and θ1 . Let ν be
a non-negative real number, m a length d real vector, κ a real number, and S a

symmetric positive semi-definite d × d matrix. Define


f_m(μ; ν, m, Λ) = |Σ|^{−1/2} exp(−(ν/2) (μ − m)^T Σ^{−1} (μ − m)),  (23)

f_c(Λ; κ, S) = |Σ|^{−(κ+d+1)/2} exp(−(1/2) trace(S Σ^{−1})),  (24)
where Σ is a function of Λ. If ν > 0, then fm is a (scaled) Gaussian distribution with
mean m and covariance Σ/ν. If Σ = Λ, κ > d − 1, and S is positive definite, then
fc is a (scaled) inverse-Wishart(κ, S) distribution. To allow for improper priors, we
do not necessarily require fm and fc to be normalizable.
For y = 0, 1, assume Σy is invertible and priors are of the form
π(θy ) = π(μy |Λy )π(Λy ), (25)
where

π(μy |Λy ) ∝ fm (μy ; νy , my , Λy ), (26)


π(Λy ) ∝ fc (Λy ; κy , Sy ). (27)
If νy > 0, then π(μy |Σy ) is proper and Gaussian with mean my and covariance
Σy /νy . The hyperparameter my can be viewed as a target for the mean, where the
larger νy is the more localized the prior is about my .
In the general covariance model where Σy = Λy , π(Σy ) is proper if κy > d − 1
and Sy is positive definite. If in addition νy > 0, then π(θy ) is a normal-inverse-
Wishart distribution, which is the conjugate prior for the mean and covariance
when sampling from normal distributions.15,16 Then Eπ [Σy ] = (κy − d − 1)−1 Sy , so
that Sy can be viewed as a target for the shape of the covariance, where the actual
expected covariance is scaled. If Sy is scaled appropriately, then the larger κy is the
more certainty we have about Σy . At the same time, increasing κy while fixing the
other hyperparameters defines a prior favoring smaller |Σy |.
The model allows for improper priors. Some useful examples of improper priors
occur when Sy = 0 and νy = 0. In this case, π(θy ) ∝ |Σy |−(κy +d+2)/2 . If κy +d+2 =
0, then we obtain flat priors. If Λy = Σy , then with κy = 0 we obtain Jeffreys’ rule
prior, which is designed to be invariant to differentiable one-to-one transformations
of the parameters,17,18 and with κy = −1 we obtain Jeffreys’ independence prior,
which uses the same principle as the Jeffreys’ rule prior but also treats the mean
and covariance matrix as independent parameters.
Theorem 3 [14]. In the independent covariance model, the posterior distribu-
tions possess the same form as the priors:

π*(θ_y) ∝ f_m(μ_y; ν_y^*, m_y^*, Λ_y) f_c(Λ_y; κ_y^*, S_y^*),  (28)

with updated hyperparameters ν_y^* = ν_y + n_y, κ_y^* = κ_y + n_y,

m_y^* = (ν_y m_y + n_y μ̂_y) / (ν_y + n_y),  (29)

S_y^* = S_y + (n_y − 1) Σ̂_y + [ν_y n_y / (ν_y + n_y)] (μ̂_y − m_y)(μ̂_y − m_y)^T,  (30)

where μ̂_y and Σ̂_y are the sample mean and sample covariance for class y.
Similar results are found in Ref. 15.
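The hyperparameter updates of Theorem 3 are straightforward to implement; the sketch below computes ν*, m*, κ*, S* for one class from its sample points (an illustrative sketch under the stated model; the function name is not from the chapter).

```python
import numpy as np

def niw_posterior(X_y, nu, m, kappa, S):
    """Posterior hyperparameters for one class in the independent covariance
    Gaussian model (Theorem 3, Eqs. 29-30).

    X_y : (n_y, d) array of class-y sample points
    nu, m, kappa, S : prior hyperparameters (nu_y, m_y, kappa_y, S_y)
    """
    X_y = np.asarray(X_y, float)
    m = np.asarray(m, float)
    S = np.asarray(S, float)
    n_y, d = X_y.shape
    mu_hat = X_y.mean(axis=0)
    # Sample covariance with the (n_y - 1) divisor; (n_y - 1) * Sigma_hat is the scatter matrix
    Sigma_hat = np.atleast_2d(np.cov(X_y, rowvar=False)) if n_y > 1 else np.zeros((d, d))

    nu_star = nu + n_y
    kappa_star = kappa + n_y
    m_star = (nu * m + n_y * mu_hat) / (nu + n_y)          # Eq. 29
    diff = (mu_hat - m).reshape(-1, 1)
    S_star = S + (n_y - 1) * Sigma_hat + (nu * n_y / (nu + n_y)) * (diff @ diff.T)  # Eq. 30
    return nu_star, m_star, kappa_star, S_star
```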
The posterior can be expressed as

π ∗ (θy ) = π ∗ (μy |Λy )π ∗ (Λy ), (31)


where
π*(μ_y|Λ_y) = f_{m_y^*, Σ_y/ν_y^*}(μ_y),  (32)

π*(Λ_y) ∝ |Σ_y|^{−(κ_y^*+d+1)/2} exp(−(1/2) trace(S_y^* Σ_y^{−1})).  (33)
Assuming at least one sample point, νy∗ > 0, so π ∗ (μy |Λy ) is always valid. The
validity of π ∗ (Λy ) depends on the definition of Λy .
Improper priors are acceptable but the posterior must always be a valid proba-
bility density.
Since the effective class-conditional densities are found separately, a different
covariance model may be used for each class. Going forward, to simplify notation
we denote hyperparameters without subscripts. In the general covariance model,
Σy = Λy , the parameter space contains all positive definite matrices, and π ∗ (Σy )
has an inverse-Wishart distribution,
π*(Σ_y) = [|S^*|^{κ^*/2} / (2^{κ^* d/2} Γ_d(κ^*/2))] |Σ_y|^{−(κ^*+d+1)/2} exp(−(1/2) trace(S^* Σ_y^{−1})),  (34)

where Γd is the multivariate gamma function. For a proper posterior, we require


ν ∗ > 0, κ∗ > d − 1, and S ∗ positive definite.
Theorem 4. [11]. For a general covariance matrix, assuming ν ∗ > 0, κ∗ > d−1,
and S ∗ positive definite, the effective class-conditional density is a multivariate
student’s t-distribution,

f_Θ(x|y) = [Γ((κ^*+1)/2) / ((κ^*−d+1)^{d/2} π^{d/2} |(ν^*+1)/((κ^*−d+1)ν^*) S^*|^{1/2} Γ((κ^*−d+1)/2))]
           × [1 + (1/(κ^*−d+1)) (x − m^*)^T ((ν^*+1)/((κ^*−d+1)ν^*) S^*)^{−1} (x − m^*)]^{−(κ^*+1)/2},  (35)

with location vector m^*, scale matrix (ν^*+1)/((κ^*−d+1)ν^*) S^*, and κ^*−d+1 degrees of freedom.
It is proper if κ^* > d, the mean of this distribution is m^*, and if κ^* > d + 1 the
covariance is (ν^*+1)/((κ^*−d−1)ν^*) S^*.
Rewriting Eq. 35 with ν ∗ = νy∗ , m∗ = my∗ , κ∗ = κ∗y , and ky = κ∗y − d + 1 degrees


of freedom for y ∈ {0, 1}, the effective class-conditional densities are
 
f_Θ(x|y) = [Γ((k_y+d)/2) / (k_y^{d/2} π^{d/2} |Ψ_y|^{1/2} Γ(k_y/2))] × [1 + (1/k_y) (x − m_y^*)^T Ψ_y^{−1} (x − m_y^*)]^{−(k_y+d)/2},  (36)

[Figure 2 here: level curves of the two class-conditional distributions (level curve 0 and level curve 1) in the (x_1, x_2) feature plane, together with the PI, OBC, and IBR decision boundaries.]

Fig. 2. Classifiers for a Gaussian model with two features.

where Ψy is the scale matrix in Eq. 35. The OBC discriminant becomes
D_OBC(x) = K [1 + (1/k_0) (x − m_0^*)^T Ψ_0^{−1} (x − m_0^*)]^{k_0+d}
         − [1 + (1/k_1) (x − m_1^*)^T Ψ_1^{−1} (x − m_1^*)]^{k_1+d},  (37)

where

K = [(1 − E_{π*}[c]) / E_{π*}[c]]² (k_0/k_1)^d (|Ψ_0|/|Ψ_1|) [Γ(k_0/2) Γ((k_1+d)/2) / (Γ((k_0+d)/2) Γ(k_1/2))]².  (38)
ψOBC (x) = 0 if and only if DOBC (x) ≤ 0. This classifier has a polynomial decision
boundary as long as k0 and k1 are integers, which is true if κ0 and κ1 are integers.
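A numerical sketch of Eqs. 37–38 follows, with the constant K evaluated in log space to avoid overflow of the gamma functions. The inputs k_y, m_y^*, Ψ_y and E_π*[c] are assumed to have been computed already from the posterior hyperparameters; the function names are illustrative, not from the chapter.

```python
import numpy as np
from scipy.special import gammaln

def obc_discriminant(x, exp_c, k0, m0, Psi0, k1, m1, Psi1):
    """OBC discriminant for the general-covariance Gaussian model (Eqs. 37-38).

    exp_c is E_pi*[c]; k_y, m_y, Psi_y are the degrees of freedom, location
    vectors, and scale matrices of the effective t-densities (Eq. 36).
    Returns D_OBC(x); the label is 0 iff D_OBC(x) <= 0.
    """
    x, m0, m1 = (np.asarray(v, float) for v in (x, m0, m1))
    Psi0, Psi1 = np.asarray(Psi0, float), np.asarray(Psi1, float)
    d = x.size

    # log of the constant K in Eq. 38
    log_K = (2.0 * (np.log(1.0 - exp_c) - np.log(exp_c))
             + d * (np.log(k0) - np.log(k1))
             + np.linalg.slogdet(Psi0)[1] - np.linalg.slogdet(Psi1)[1]
             + 2.0 * (gammaln(k0 / 2) + gammaln((k1 + d) / 2)
                      - gammaln((k0 + d) / 2) - gammaln(k1 / 2)))

    q0 = (x - m0) @ np.linalg.solve(Psi0, x - m0)   # (x - m0)^T Psi0^{-1} (x - m0)
    q1 = (x - m1) @ np.linalg.solve(Psi1, x - m1)
    return np.exp(log_K) * (1 + q0 / k0) ** (k0 + d) - (1 + q1 / k1) ** (k1 + d)

def obc_label(x, *args):
    return 0 if obc_discriminant(x, *args) <= 0 else 1
```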
Consider a synthetic Gaussian model with d = 2 features, independent general
covariance matrices, and a proper prior defined by known c = 0.5 and hyperpa-
rameters ν0 = κ0 = 20d, m0 = [0, . . . , 0], ν1 = κ1 = 2d, m1 = [1, . . . , 1], and
Sy = (κy − d − 1)Id . We assume that the true model corresponds to the means
of the parameters, and take a stratified sample of 10 randomly chosen points from
each true class-conditional distribution. We find both the IBRC ψIBR and the OBC
ψOBC relative to the family of all classifiers. We also consider a plug-in classifier ψPI ,
which is the Bayes classifier corresponding to the means of the parameters. ψPI is
linear. Figure 2 shows ψOBC , ψIBR , and ψPI . Level curves for the class-conditional
distributions corresponding to the expected parameters are also shown.

For the Gaussian and discrete models discussed herein, the OBC can be solved
analytically; however, in many real-world situations Gaussian models are not
suitable. Shortly after the introduction of the OBC, Markov-chain-Monte-Carlo
(MCMC) methods were utilized for RNA-Seq applications.19,20 Other MCMC-based
OBC applications include liquid chromatography-mass spectrometry data,21 selected
reaction monitoring data,22 and classification based on dynamical measure-
ments of single-gene expression,23 the latter using an IBR classifier because no
sample data were included. Another practical issue pertains to missing values,
which are common in many applications, such as genomic classification. The OBC
has been reformulated to take into account missing values.24 Finally, let us note
that, while random sampling is a common assumption in classification theory, non-
random sampling can be beneficial for classifier design.25 In the case of the OBC,
optimal sampling has been considered under different scenarios.3,26

5 Multi-class Classification

In this section, we generalize the BEE and OBC to treat multiple classes with arbi-
trary loss functions. We present the analogous concepts of Bayesian risk estimator
(BRE) and optimal Bayesian risk classifier (OBRC), and show that the BRE and
OBRC can be represented in the same form as the expected risk and Bayes decision
rule with unknown true densities replaced by effective densities. We consider M
classes, y = 0, . . . , M − 1, let f (y | c) be the probability mass function of Y parame-
terized by a vector c, and for each y let f (x | y, θy ) be the class-conditional density
function for X parameterized by θy . Let θ be composed of the θy .
Let L(i, y) be a loss function quantifying a penalty in predicting label i when
the true label is y. The conditional risk in predicting label i for a given point x is
defined by R(i, x, c, θ) = E[L(i, Y ) | x, c, θ]. A direct calculation yields
$$R(i, x, c, \theta) = \frac{\sum_{y=0}^{M-1} L(i, y)\, f(y\,|\,c)\, f(x\,|\,y, \theta_y)}{\sum_{y=0}^{M-1} f(y\,|\,c)\, f(x\,|\,y, \theta_y)}. \tag{39}$$
The expected risk of an M -class classifier ψ is given by


$$R(\psi, c, \theta) = E[R(\psi(X), X, c, \theta)\,|\,c, \theta] = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, f(y\,|\,c)\, \varepsilon_{i,y}(\psi, \theta_y), \tag{40}$$

where the classification probability

$$\varepsilon_{i,y}(\psi, \theta_y) = \int_{R_i} f(x\,|\,y, \theta_y)\, dx = P(X \in R_i\,|\,y, \theta_y) \tag{41}$$
is the probability that a class y point will be assigned class i by ψ, and the Ri =
{x : ψ(x) = i} partition the feature space into decision regions.
A Bayes decision rule (BDR) minimizes expected risk, or equivalently, the con-
ditional risk at each fixed point x:

$$\psi_{\mathrm{BDR}}(x) = \arg\min_{i \in \{0,\ldots,M-1\}} R(i, x, c, \theta) = \arg\min_{i \in \{0,\ldots,M-1\}} \sum_{y=0}^{M-1} L(i, y)\, f(y\,|\,c)\, f(x\,|\,y, \theta_y). \tag{42}$$
We break ties with the lowest index, i ∈ {0, . . . , M − 1}, minimizing R(i, x, c, θ).
In the binary case with the zero-one loss function, L(i, y) = 0 if i = y and
L(i, y) = 1 if i ≠ y, the expected risk reduces to the classification error so that the
BDR is a Bayes classifier.
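As an illustration of Eq. 42, the short sketch below implements the Bayes decision rule for known model parameters with an arbitrary loss matrix; Gaussian class-conditional densities are used only as an example choice, and np.argmin breaks ties with the lowest index, as required.

```python
# Sketch: multi-class Bayes decision rule of Eq. 42 with a general loss function.
# Gaussian class-conditional densities are an illustrative choice only.
import numpy as np
from scipy.stats import multivariate_normal

def bayes_decision(x, L, class_probs, means, covs):
    """L[i, y] is the loss for predicting i when the true label is y."""
    M = len(class_probs)
    f_xy = np.array([multivariate_normal.pdf(x, mean=means[y], cov=covs[y])
                     for y in range(M)])        # f(x | y, theta_y)
    cond = L @ (class_probs * f_xy)             # sum_y L(i,y) f(y|c) f(x|y,theta_y)
    return int(np.argmin(cond))                 # ties broken by the lowest index

# Example: three classes in two dimensions with zero-one loss
M, d = 3, 2
L = 1.0 - np.eye(M)                             # zero-one loss
c = np.array([0.5, 0.3, 0.2])
means = [np.zeros(d), np.ones(d), np.array([2.0, 0.0])]
covs = [np.eye(d)] * M
print(bayes_decision(np.array([0.9, 1.1]), L, c, means, covs))
```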
With uncertainty in the multi-class framework, we assume that c is the probabil-
ity mass function of Y , that is, c = {c0 , . . . , cM −1 } ∈ ΔM −1 , where f (y | c) = cy and
ΔM −1 is the standard M − 1 simplex defined by cy ∈ [0, 1] for y ∈ {0, . . . , M − 1}
and $\sum_{y=0}^{M-1} c_y = 1$. Also assume θy ∈ Θy for some parameter space Θy , and
θ ∈ Θ = Θ0 × . . . × ΘM −1 . Let C and T denote random vectors for parameters
c and θ. We assume that C and T are independent prior to observing data, and
assign prior probabilities π(c) and π(θ). Note the change of notation: up until now,
c and θ have denoted both the random variables and the parameters. The change
is being made to avoid confusion regarding the expectations in this section.
Let Sn be a random sample, xiy the ith sample point in class y, and ny the
number of class-y sample points. Given Sn , the priors are updated to posteriors:


$$\pi^*(c, \theta) = f(c, \theta\,|\,S_n) \propto \pi(c)\, \pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_{iy}, y\,|\,c, \theta_y), \tag{43}$$

where the product on the right is the likelihood function. Since f (xiy , y | c, θy ) =
cy f (xiy | y, θy ), we may write π ∗ (c, θ) = π ∗ (c)π ∗ (θ), where


$$\pi^*(c) = f(c\,|\,S_n) \propto \pi(c) \prod_{y=0}^{M-1} (c_y)^{n_y} \tag{44}$$

and

$$\pi^*(\theta) = f(\theta\,|\,S_n) \propto \pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_{iy}\,|\,y, \theta_y) \tag{45}$$

are marginal posteriors of C and T. Independence between C and T is preserved


in the posterior. If the prior is proper, this all follows from Bayes’ theorem; other-
wise, Eq. 44 and Eq. 45 are taken as definitions, with proper posteriors required.
Given a Dirichlet prior on C with hyperparameters αy , with random sampling the
posterior on C is Dirichlet, with hyperparameters αy∗ = αy + ny .
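As a small illustration, the Dirichlet update and the resulting posterior mean of c (which serves as the effective class probability in the next subsection) are one line each in code; the prior hyperparameters and class counts below are made up.

```python
# Sketch: Dirichlet posterior update for the class-probability vector c.
# alpha and the class counts n_y are illustrative values.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # prior hyperparameters alpha_y
n_y = np.array([4, 7, 9])                # observed class counts in S_n
alpha_post = alpha + n_y                 # alpha_y* = alpha_y + n_y
E_c = alpha_post / alpha_post.sum()      # posterior mean E_pi*[c_y]
print(alpha_post, E_c)
```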
5.1 Optimal Bayesian Risk Classification


We define the Bayesian risk estimator (BRE) to be the MMSE estimate of the
expected risk, or equivalently, the conditional expectation of the expected risk given
the observations. Given a sample Sn and a classifier ψ that is not informed by θ,
owing to posterior independence between C and T, the BRE is given by
$$\hat{R}(\psi, S_n) = E[R(\psi, C, T)\,|\,S_n] = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, E[f(y\,|\,C)\,|\,S_n]\, E[\varepsilon_{i,y}(\psi, T)\,|\,S_n]. \tag{46}$$
The effective density fΘ (x | y) is in Eq. 12. We also have an effective density

$$f_\Theta(y) = \int f(y\,|\,c)\, \pi^*(c)\, dc. \tag{47}$$

The effective densities are expressed via expectation by

$$f_\Theta(y) = E_c[f(y\,|\,C)\,|\,S_n] = E[c_y\,|\,S_n] = E_{\pi^*}[c_y], \tag{48}$$
$$f_\Theta(x\,|\,y) = E_{\theta_y}[f(x\,|\,y, T)\,|\,S_n]. \tag{49}$$

We may thus write the BRE in Eq. 46 as

$$\hat{R}(\psi, S_n) = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, f_\Theta(y)\, \hat{\varepsilon}_{i,y}(\psi, S_n), \tag{50}$$

where

$$\hat{\varepsilon}_{i,y}(\psi, S_n) = E[\varepsilon_{i,y}(\psi, T)\,|\,S_n] = \int_{R_i} f_\Theta(x\,|\,y)\, dx. \tag{51}$$
Note that fΘ (y) and fΘ (x | y) play roles analogous to f (y | c) and f (x | y, θy ) in
Bayes decision theory.
Various densities and conditional densities are involved in the theory, generally
denoted by f . For instance, we may write the prior and posterior as π(θ) = f (θ)
and π ∗ (θ) = f (θ|Sn ). We also consider f (y|Sn ) and f (x|y, Sn ). By expressing these
as integrals over Θ, we see that f (y|Sn ) = fΘ (y) and f (x|y, Sn ) = fΘ (x | y).
Whereas the BRE addresses overall classifier performance across the entire fea-
ture space, we may consider classification at a fixed point. The Bayesian conditional
risk estimator (BCRE) for class i ∈ {0, . . . , M −1} at point x is the MMSE estimate
of the conditional risk given the sample Sn and the test point X = x:

$$\hat{R}(i, x, S_n) = E[R(i, X, C, T)\,|\,S_n, X = x] = \sum_{y=0}^{M-1} L(i, y)\, E[P(Y = y\,|\,X, C, T)\,|\,S_n, X = x]. \tag{52}$$
The expectations are over a posterior on C and T updated with both Sn and the
unlabeled point x. It is proven in Ref. 27 that
$$\hat{R}(i, x, S_n) = \frac{\sum_{y=0}^{M-1} L(i, y)\, f_\Theta(y)\, f_\Theta(x\,|\,y)}{\sum_{y=0}^{M-1} f_\Theta(y)\, f_\Theta(x\,|\,y)}. \tag{53}$$
This is analogous to Eq. 39 in Bayes decision theory.


Furthermore, given a classifier ψ with decision regions R0 , . . . , RM −1 ,

$$E\big[\hat{R}(\psi(X), X, S_n)\,\big|\,S_n\big] = \sum_{i=0}^{M-1} \int_{R_i} \hat{R}(i, x, S_n)\, f(x\,|\,S_n)\, dx, \tag{54}$$

where the expectation is over X (not C or T) given Sn . Calculation shows that27

$$E\big[\hat{R}(\psi(X), X, S_n)\,\big|\,S_n\big] = \hat{R}(\psi, S_n). \tag{55}$$
Hence, the BRE of ψ is the mean of the BCRE across the feature space.
For binary classification, $\hat{\varepsilon}_{i,y}(\psi, S_n)$ has been solved in closed form as compo-
nents of the BEE for both discrete models under arbitrary classifiers and Gaussian
models under linear classifiers, so the BRE with an arbitrary loss function is avail-
able in closed form for these models. When closed-form solutions for $\hat{\varepsilon}_{i,y}(\psi, S_n)$ are
not available, approximation may be employed.27
We define the optimal Bayesian risk classifier to minimize the BRE:

$$\psi_{\mathrm{OBRC}} = \arg\min_{\psi \in \mathcal{C}} \hat{R}(\psi, S_n), \tag{56}$$

where $\mathcal{C}$ is a family of classifiers. If $\mathcal{C}$ is the set of all classifiers with measurable
decision regions, then ψOBRC exists and is given for any x by

$$\psi_{\mathrm{OBRC}}(x) = \arg\min_{i \in \{0,\ldots,M-1\}} \hat{R}(i, x, S_n) = \arg\min_{i \in \{0,\ldots,M-1\}} \sum_{y=0}^{M-1} L(i, y)\, f_\Theta(y)\, f_\Theta(x\,|\,y). \tag{57}$$

The OBRC minimizes the average loss weighted by fΘ (y)fΘ (x | y). The OBRC
has the same functional form as the BDR with fΘ (y) substituted for the true class
probability f (y | c), and fΘ (x | y) substituted for the true density f (x | y, θy ) for all
y. Closed-form OBRC representation is available for any model in which fΘ (x | y)
has been found, including discrete and Gaussian models. For binary classification,
the BRE reduces to the BEE and the OBRC reduces to the OBC.
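The structural parallel between Eq. 57 and the BDR can be seen in a short sketch: once the effective densities are available (passed in here as a vector and a callable, assumed to be supplied by the chosen model), the OBRC is the same argmin with fΘ(y) and fΘ(x | y) in place of the true class probabilities and class-conditional densities.

```python
# Sketch: OBRC of Eq. 57, assuming the effective densities f_Theta(y) and
# f_Theta(x | y) are supplied by the chosen model (here as inputs).
import numpy as np

def obrc(x, L, f_theta_y, f_theta_x_given_y):
    """L[i, y]: loss; f_theta_y[y]: effective class probability;
    f_theta_x_given_y(x, y): effective class-conditional density."""
    M = len(f_theta_y)
    weights = np.array([f_theta_y[y] * f_theta_x_given_y(x, y) for y in range(M)])
    risks = L @ weights                  # sum_y L(i,y) f_Theta(y) f_Theta(x|y)
    return int(np.argmin(risks))         # ties broken by the lowest index

# Usage with stand-in effective densities (illustrative only)
f_theta_y = np.array([0.6, 0.4])
f_x_y = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))   # placeholder density
L = 1.0 - np.eye(2)
print(obrc(np.array([0.2, 0.1]), L, f_theta_y, f_x_y))
```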

6 Prior Construction

In 1968, E. T. Jaynes remarked,28 “Bayesian methods, for all their advantages, will
not be entirely satisfactory until we face the problem of finding the prior probability
squarely.” Twelve years later, he added,29 “There must exist a general formal the-
ory of determination of priors by logical analysis of prior information — and that
to develop it is today the top priority research problem of Bayesian theory.” The
problem is to transform scientific knowledge into prior distributions.
Historically, prior construction has usually been treated independently of real
prior knowledge. Subsequent to Jeffreys’ non-informative prior,17 objective-based
methods were proposed.30 These were followed by information-theoretic and sta-
tistical approaches.31 In all of these methods, there is a separation between prior
knowledge and observed sample data. Several specialized methods have been pro-
posed for prior construction in the context of the OBC. In Ref. 32, data from unused
features is used to construct a prior. In Refs. 19 and 20, a hierarchical Poisson prior
is employed that models cellular mRNA concentrations using a log-normal distri-
bution and then models where the uncertainty is on the feature-label distribution.
In the context of phenotype classification, knowledge concerning genetic signaling
pathways has been integrated into prior construction.33-35
Here, we outline a general paradigm for prior formation involving an optimiza-
tion constrained by incorporating existing scientific knowledge augmented by slack-
ness variables.36 The constraints tighten the prior distribution in accordance with
prior knowledge, while at the same time avoiding inadvertent over-restriction of the
prior. Two definitions provide the general framework.
Given a family of proper priors π(θ, γ) indexed by γ ∈ Γ, a maximal knowledge-
driven information prior (MKDIP) is a solution to the optimization

$$\arg\min_{\gamma \in \Gamma}\; E_{\pi(\theta,\gamma)}[C_\theta(\xi, \gamma, D)], \tag{58}$$

where Cθ (ξ, γ, D) is a cost function depending on (1) the random vector θ parame-
terizing the uncertainty class, (2) the parameter γ, and (3) the state ξ of our prior
knowledge and part of the sample data D. When the cost function is additively
decomposed into costs on the hyperparameters and the data, it takes the form

$$C_\theta(\xi, \gamma, D) = (1-\beta)\, g_\theta^{(1)}(\xi, \gamma) + \beta\, g_\theta^{(2)}(\xi, D), \tag{59}$$

where β ∈ [0, 1] is a regularization parameter, and $g_\theta^{(1)}$ and $g_\theta^{(2)}$ are cost functions.
Various cost functions in the literature can be adapted for the MKDIP.36
A maximal knowledge-driven information prior with constraints takes the form
of the optimization in Eq. 58 subject to the constraints $E_{\pi(\theta,\gamma)}[g_{\theta,i}^{(3)}(\xi)] = 0$, $i =
1, 2, \ldots, n_c$, where $g_{\theta,i}^{(3)}$, $i = 1, 2, \ldots, n_c$, are constraints resulting from the state ξ of
our knowledge, via a mapping

$$T: \xi \mapsto \left( E_{\pi(\theta,\gamma)}[g_{\theta,1}^{(3)}(\xi)], \ldots, E_{\pi(\theta,\gamma)}[g_{\theta,n_c}^{(3)}(\xi)] \right). \tag{60}$$
A nonnegative slackness variable εi can be considered for each constraint for the
MKDIP to make the constraint structure more flexible, thereby allowing potential
error or uncertainty in prior knowledge (allowing inconsistencies in prior knowledge).
Slackness variables become optimization parameters, and a linear function times a
regulatory coefficient is added to the cost function of the optimization in Eq. 58,
so that the optimization in Eq. 58 relative to Eq. 59 becomes

$$\arg\min_{\gamma \in \Gamma,\, \varepsilon \in \mathcal{E}}\; E_{\pi(\theta,\gamma)}\!\left[\lambda_1\left[(1-\beta)\, g_\theta^{(1)}(\xi, \gamma) + \beta\, g_\theta^{(2)}(\xi, D)\right]\right] + \lambda_2 \sum_{i=1}^{n_c} \varepsilon_i, \tag{61}$$
subject to $-\varepsilon_i \le E_{\pi(\theta,\gamma)}[g_{\theta,i}^{(3)}(\xi)] \le \varepsilon_i$, $i = 1, 2, \ldots, n_c$, where λ1 and λ2 are nonneg-
ative regularization parameters, and ε = (ε1 , ..., εnc ) and E represent the vector of
all slackness variables and the feasible region for slackness variables, respectively.
Each slackness variable determines a range — the more uncertainty regarding a
constraint, the greater the range for the corresponding slackness variable.
Scientific knowledge is often expressed in the form of conditional probabilities
characterizing conditional relations. For instance, if a system has m binary random
variables $X_1, X_2, \ldots, X_m$, then potentially there are $m2^{m-1}$ probabilities for which
a single variable is conditioned by the other variables:

$$P(X_i = k_i\,|\,X_1 = k_1, \ldots, X_{i-1} = k_{i-1}, X_{i+1} = k_{i+1}, \ldots, X_m = k_m) = a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m). \tag{62}$$

Keeping in mind that constraints are of the form $E_{\pi(\theta,\gamma)}[g_{\theta,i}^{(3)}(\xi)] = 0$, in this setting,

$$g_{\theta,i}^{(3)}(\xi) = P_\theta(X_i = k_i\,|\,X_1 = k_1, \ldots, X_{i-1} = k_{i-1}, X_{i+1} = k_{i+1}, \ldots, X_m = k_m) - a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m). \tag{63}$$

When slackness variables are introduced, the optimization constraints take the form

$$a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m) - \varepsilon_i(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m) \le E_{\pi(\theta,\gamma)}\!\left[P_\theta(X_i = k_i\,|\,X_1 = k_1, \ldots, X_{i-1} = k_{i-1}, X_{i+1} = k_{i+1}, \ldots, X_m = k_m)\right] \le a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m) + \varepsilon_i(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m). \tag{64}$$

Not all constraints will be used, depending on our prior knowledge. In fact, the
general conditional probabilities conditioned on all expressions Xj = kj , j ≠ i,
will not likely be used because they will likely not be known when there are many
random variables, so that conditioning will be on subsets of these expressions.
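The constrained optimization of Eq. 61 with the slackness bounds above can be prototyped numerically. The sketch below is a deliberately simplified, hypothetical instance: a Beta(γ1, γ2) prior on a single Bernoulli parameter, one knowledge constraint on Eπ[Pθ(X = 1)], the negative log prior-predictive of the data D standing in for g(2) (with β = 1), and SLSQP handling the slackness constraint; it is not the cost construction of Ref. 36.

```python
# Sketch: a toy MKDIP-style optimization (Eq. 61) for a Beta(g1, g2) prior on a
# single Bernoulli parameter theta. Hypothetical setup: g^(2) is the negative log
# prior-predictive of D (beta = 1), and one knowledge constraint
# E_pi[P_theta(X = 1)] ~ a with slackness eps. Not the cost construction of Ref. 36.
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

k, n = 7, 10          # observed successes / trials in D (illustrative)
a = 0.5               # prior knowledge: E_pi[P_theta(X = 1)] should be near 0.5
lam1, lam2 = 1.0, 10.0

def objective(p):
    g1, g2, eps = p
    # negative log prior-predictive (Beta-Bernoulli marginal likelihood)
    neg_log_pred = -(betaln(g1 + k, g2 + n - k) - betaln(g1, g2))
    return lam1 * neg_log_pred + lam2 * eps

def constraint_hi(p):   # eps - (E_pi[theta] - a) >= 0
    g1, g2, eps = p
    return eps - (g1 / (g1 + g2) - a)

def constraint_lo(p):   # eps + (E_pi[theta] - a) >= 0
    g1, g2, eps = p
    return eps + (g1 / (g1 + g2) - a)

res = minimize(objective, x0=np.array([1.0, 1.0, 0.1]), method="SLSQP",
               bounds=[(1e-3, None), (1e-3, None), (0.0, None)],
               constraints=[{"type": "ineq", "fun": constraint_hi},
                            {"type": "ineq", "fun": constraint_lo}])
print(res.x)   # fitted (gamma_1, gamma_2, eps)
```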
Regardless of how the prior is constructed, the salient point regarding optimal
Bayesian operator design (including the OBC) is that uncertainty is quantified rel-
ative to the scientific model (the feature-label distribution for classification). The
prior distribution is on the physical parameters. This differs from the common
method of placing prior distributions on the parameters of the operator. For in-
stance, if we compare optimal Bayesian regression37 to standard Bayesian linear
regression models,38-40 in the latter, the connection of the regression function and
prior assumptions with the underlying physical system is unclear. As noted in Ref.
37, there is a scientific gap in constructing operator models and making prior as-
sumptions on them. In fact, operator uncertainty is a consequence of uncertainty
in the physical system and is related to the latter via the optimization procedure
that produces an optimal operator. A key reason why the MKDIP approach works
is because the prior is on the scientific model, and therefore scientific knowledge
can be applied directly in the form of constraints.
7 Optimal Bayesian Transfer Learning

The standard assumption in classification theory is that training and future data
come from the same feature-label distribution. In transfer learning, the training
data from the actual feature-label distribution, called the target, are augmented
with data from a different feature-label distribution, called the source.41 The key
issue is to quantify domain relatedness. This can be achieved by extending the
OBC framework so that transfer learning from source to target domain is via a
joint prior probability density function for the model parameters of the feature-
label distributions of the two domains.42 The posterior distribution of the target
model parameters can be updated via the joint prior probability distribution func-
tion in conjunction with the source and target data. We use π to denote a joint
prior distribution and p to denote a conditional distribution involving uncertainty
parameters. As usual, a posterior distribution refers to a distribution of uncertainty
parameters conditioned on the data.
We consider L common classes in each domain. Let Ss and St denote samples
from the source and target domains with sizes of Ns and Nt , respectively. For $l = 1, 2, \ldots, L$, let $S_s^l = \{x_{s,1}^l, x_{s,2}^l, \cdots, x_{s,n_s^l}^l\}$ and $S_t^l = \{x_{t,1}^l, x_{t,2}^l, \cdots, x_{t,n_t^l}^l\}$. Moreover, $S_s = \cup_{l=1}^L S_s^l$, $S_t = \cup_{l=1}^L S_t^l$, $N_s = \sum_{l=1}^L n_s^l$, and $N_t = \sum_{l=1}^L n_t^l$. Since the feature
spaces are the same in both domains, $x_s^l$ and $x_t^l$ are d-vectors for d features of the
source and target domains, respectively. Since in transfer learning there is no joint
sampling of the source and target domains, we cannot use a general joint sampling
model, but instead assume that there are two datasets separately sampled from the
source and target domains. Transferability (relatedness) is characterized by how
we define a joint prior distribution for the source and target precision matrices, Λls
and Λlt , l = 1, 2, ..., L.
We employ a Gaussian model for the feature-label distribution, $x_z^l \sim N(\mu_z^l, (\Lambda_z^l)^{-1})$, for $l \in \{1, \ldots, L\}$, where $z \in \{s, t\}$ denotes the source s or target t domain, $\mu_s^l$ and $\mu_t^l$ are mean vectors in the source and target domains for label l, respectively, $\Lambda_s^l$ and $\Lambda_t^l$ are the d × d precision matrices in the source and target domains for label l, respectively, and we employ a joint Gaussian-Wishart distribution as a prior for the mean and precision matrices of the Gaussian models. The joint prior distribution for $\mu_s^l$, $\mu_t^l$, $\Lambda_s^l$, and $\Lambda_t^l$ takes the form

$$\pi\left(\mu_s^l, \mu_t^l, \Lambda_s^l, \Lambda_t^l\right) = p\left(\mu_s^l, \mu_t^l\,|\,\Lambda_s^l, \Lambda_t^l\right) \pi\left(\Lambda_s^l, \Lambda_t^l\right). \tag{65}$$

Assuming that, for any l, $\mu_s^l$ and $\mu_t^l$ are conditionally independent given $\Lambda_s^l$ and $\Lambda_t^l$ results in conjugate priors. Thus,

$$\pi\left(\mu_s^l, \mu_t^l, \Lambda_s^l, \Lambda_t^l\right) = p\left(\mu_s^l\,|\,\Lambda_s^l\right) p\left(\mu_t^l\,|\,\Lambda_t^l\right) \pi\left(\Lambda_s^l, \Lambda_t^l\right), \tag{66}$$

and both $p(\mu_s^l\,|\,\Lambda_s^l)$ and $p(\mu_t^l\,|\,\Lambda_t^l)$ are Gaussian, $\mu_z^l\,|\,\Lambda_z^l \sim N\!\left(m_z^l, (\kappa_z^l \Lambda_z^l)^{-1}\right)$, where
$m_z^l$ is the mean vector of $\mu_z^l$, and $\kappa_z^l$ is a positive scalar hyperparameter.
A key issue is the structure of the joint prior governing the target and source
precision matrices. We employ a family of joint priors that falls out naturally from
a collection of partitioned Wishart random matrices.
Based on a theorem in Ref. 43, we define the joint prior distribution π(Λls , Λlt )
in Eq. 66 of the precision matrices of the source and target domains for class l:

$$\pi(\Lambda_t^l, \Lambda_s^l) = K^l\, \mathrm{etr}\!\left(-\frac{1}{2}\left[\left(M_t^l\right)^{-1} + (F^l)^T C^l F^l\right] \Lambda_t^l\right) \mathrm{etr}\!\left(-\frac{1}{2}\left(C^l\right)^{-1} \Lambda_s^l\right) \left|\Lambda_s^l\right|^{\frac{\nu^l - d - 1}{2}} \left|\Lambda_t^l\right|^{\frac{\nu^l - d - 1}{2}}\, {}_0F_1\!\left(\frac{\nu^l}{2}; \frac{1}{4} G^l\right), \tag{67}$$

where etr(A) = exp(tr(A)),

$$M^l = \begin{bmatrix} M_t^l & M_{ts}^l \\ (M_{ts}^l)^T & M_s^l \end{bmatrix} \tag{68}$$

is a 2d × 2d positive definite scale matrix, $\nu^l \ge 2d$ denotes degrees of freedom, ${}_pF_q$ is the generalized hypergeometric function,44 and

$$C^l = M_s^l - (M_{ts}^l)^T \left(M_t^l\right)^{-1} M_{ts}^l, \tag{69}$$
$$F^l = \left(C^l\right)^{-1} (M_{ts}^l)^T \left(M_t^l\right)^{-1}, \tag{70}$$
$$G^l = \left(\Lambda_s^l\right)^{\frac{1}{2}} F^l \Lambda_t^l (F^l)^T \left(\Lambda_s^l\right)^{\frac{1}{2}}, \tag{71}$$
$$\left(K^l\right)^{-1} = 2^{d\nu^l}\, \Gamma_d^2\!\left(\frac{\nu^l}{2}\right) \left|M^l\right|^{\frac{\nu^l}{2}}. \tag{72}$$

Based upon a theorem in Ref. 45, Λlt and Λls possess Wishart marginal distributions:
Λlz ∼ Wd (Mlz , ν l ), for l ∈ {1, ..., L} and z ∈ {s, t}.
We need the posterior distribution of the parameters of the target domain upon
observing the source and target samples. The likelihoods of the samples St and
Ss are conditionally independent given the parameters of the target and source
domains. The dependence between the two domains is due to the dependence of the
prior distributions of the precision matrices. Within each domain, the likelihoods of
the classes are conditionally independent given the class parameters. Under these
conditions, and assuming that the priors of the parameters in different classes are
independent, the joint posterior can be expressed as a product of the individual
class posteriors:42

$$\pi(\mu_t, \mu_s, \Lambda_t, \Lambda_s\,|\,S_t, S_s) = \prod_{l=1}^L \pi(\mu_t^l, \mu_s^l, \Lambda_t^l, \Lambda_s^l\,|\,S_t^l, S_s^l), \tag{73}$$

where

$$\pi(\mu_t^l, \mu_s^l, \Lambda_t^l, \Lambda_s^l\,|\,S_t^l, S_s^l) \propto p(S_t^l\,|\,\mu_t^l, \Lambda_t^l)\, p(S_s^l\,|\,\mu_s^l, \Lambda_s^l)\, p\!\left(\mu_s^l\,|\,\Lambda_s^l\right) p\!\left(\mu_t^l\,|\,\Lambda_t^l\right) \pi\!\left(\Lambda_s^l, \Lambda_t^l\right). \tag{74}$$

The next theorem gives the posterior for the target domain.
Theorem 5 [42]. Given the target St and source Ss samples, the posterior
distribution of target mean μlt and target precision matrix Λlt for class l has a
Gaussian-hypergeometric-function distribution

$$\pi(\mu_t^l, \Lambda_t^l\,|\,S_t, S_s) = A^l \left|\Lambda_t^l\right|^{\frac{1}{2}} \exp\!\left(-\frac{\kappa_{t,n}^l}{2}\left(\mu_t^l - m_{t,n}^l\right)^T \Lambda_t^l \left(\mu_t^l - m_{t,n}^l\right)\right) \left|\Lambda_t^l\right|^{\frac{\nu^l + n_t^l - d - 1}{2}} \mathrm{etr}\!\left(-\frac{1}{2}\left(T_t^l\right)^{-1} \Lambda_t^l\right) {}_1F_1\!\left(\frac{\nu^l + n_s^l}{2}; \frac{\nu^l}{2}; \frac{1}{2} F^l \Lambda_t^l (F^l)^T T_s^l\right), \tag{75}$$

where, if $F^l$ is full rank or null, $A^l$ is the constant of proportionality,

$$\left(A^l\right)^{-1} = \left(\frac{2\pi}{\kappa_{t,n}^l}\right)^{\frac{d}{2}} 2^{\frac{d(\nu^l + n_t^l)}{2}}\, \Gamma_d\!\left(\frac{\nu^l + n_t^l}{2}\right) \left|T_t^l\right|^{\frac{\nu^l + n_t^l}{2}} {}_2F_1\!\left(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l}{2}; \frac{\nu^l}{2}; T_s^l F^l T_t^l (F^l)^T\right), \tag{76}$$

and $\kappa_{t,n}^l = \kappa_t^l + n_t^l$, $m_{t,n}^l = (\kappa_t^l m_t^l + n_t^l \bar{x}_t^l)(\kappa_{t,n}^l)^{-1}$,

$$\left(T_t^l\right)^{-1} = \left(M_t^l\right)^{-1} + (F^l)^T C^l F^l + (n_t^l - 1)\hat{S}_t^l + \frac{\kappa_t^l n_t^l}{\kappa_t^l + n_t^l}\left(m_t^l - \bar{x}_t^l\right)\left(m_t^l - \bar{x}_t^l\right)^T,$$
$$\left(T_s^l\right)^{-1} = \left(C^l\right)^{-1} + (n_s^l - 1)\hat{S}_s^l + \frac{\kappa_s^l n_s^l}{\kappa_s^l + n_s^l}\left(m_s^l - \bar{x}_s^l\right)\left(m_s^l - \bar{x}_s^l\right)^T, \tag{77}$$
x̄zl and Ŝlz being the sample mean and covariance for z ∈ {s, t} and l.
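The quantities in Theorem 5 other than the hypergeometric factor are simple matrix updates. The sketch below computes C, F, κt,n, mt,n, Tt, and Ts for one class from target and source samples, assuming the prior hyperparameters (the blocks of M, κt, κs, mt, ms) are given; evaluating the posterior itself additionally requires the matrix-argument hypergeometric functions of Eqs. 75 and 76, which are not computed here.

```python
# Sketch: posterior hyperparameter updates of Theorem 5 (Eq. 77) for one class l.
# Assumes the prior hyperparameters (blocks of M^l, kappa_t, kappa_s, m_t, m_s)
# are given; the hypergeometric factors in Eqs. 75-76 are not computed here.
import numpy as np

def obtl_posterior_params(Xt, Xs, Mt, Mts, Ms, kappa_t, kappa_s, m_t, m_s):
    """Xt, Xs: (n_t, d) and (n_s, d) target/source samples for class l."""
    nt, d = Xt.shape
    ns = Xs.shape[0]
    C = Ms - Mts.T @ np.linalg.solve(Mt, Mts)              # Eq. 69
    F = np.linalg.solve(C, Mts.T @ np.linalg.inv(Mt))      # Eq. 70
    xbar_t, xbar_s = Xt.mean(axis=0), Xs.mean(axis=0)
    St = np.cov(Xt, rowvar=False)                          # sample covariance (ddof = 1)
    Ss = np.cov(Xs, rowvar=False)
    kappa_tn = kappa_t + nt
    m_tn = (kappa_t * m_t + nt * xbar_t) / kappa_tn
    dt, ds = (m_t - xbar_t)[:, None], (m_s - xbar_s)[:, None]
    Tt_inv = (np.linalg.inv(Mt) + F.T @ C @ F + (nt - 1) * St
              + kappa_t * nt / (kappa_t + nt) * (dt @ dt.T))
    Ts_inv = (np.linalg.inv(C) + (ns - 1) * Ss
              + kappa_s * ns / (kappa_s + ns) * (ds @ ds.T))
    return C, F, kappa_tn, m_tn, np.linalg.inv(Tt_inv), np.linalg.inv(Ts_inv)

# Illustrative call with arbitrary hyperparameters (d = 2)
rng = np.random.default_rng(0)
d = 2
out = obtl_posterior_params(rng.normal(size=(10, d)), rng.normal(size=(50, d)),
                            Mt=np.eye(d), Mts=0.5 * np.eye(d), Ms=np.eye(d),
                            kappa_t=1.0, kappa_s=1.0,
                            m_t=np.zeros(d), m_s=np.zeros(d))
```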
The effective class-conditional density for class l is

$$f_{\mathrm{OBTL}}(x|l) = \int_{\mu_t^l, \Lambda_t^l} f(x\,|\,\mu_t^l, \Lambda_t^l)\, \pi^*(\mu_t^l, \Lambda_t^l)\, d\mu_t^l\, d\Lambda_t^l, \tag{78}$$

where π ∗ (μlt , Λlt ) = π(μlt , Λlt |Stl , Ssl ) is the posterior of (μlt , Λlt ) upon observation of
Stl and Ssl . We evaluate it.
Theorem 6 [42]. If Fl is full rank or null, then the effective class-conditional
density in the target domain for class l is given by

$$f_{\mathrm{OBTL}}(x|l) = \pi^{-\frac{d}{2}} \left(\frac{\kappa_{t,n}^l}{\kappa_x^l}\right)^{\frac{d}{2}} \Gamma_d\!\left(\frac{\nu^l + n_t^l + 1}{2}\right) \Gamma_d^{-1}\!\left(\frac{\nu^l + n_t^l}{2}\right) \left|T_x^l\right|^{\frac{\nu^l + n_t^l + 1}{2}} \left|T_t^l\right|^{-\frac{\nu^l + n_t^l}{2}} \times {}_2F_1\!\left(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l + 1}{2}; \frac{\nu^l}{2}; T_s^l F^l T_x^l (F^l)^T\right) {}_2F_1^{-1}\!\left(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l}{2}; \frac{\nu^l}{2}; T_s^l F^l T_t^l (F^l)^T\right), \tag{79}$$
where $\kappa_x^l = \kappa_{t,n}^l + 1 = \kappa_t^l + n_t^l + 1$, and

$$\left(T_x^l\right)^{-1} = \left(T_t^l\right)^{-1} + \frac{\kappa_{t,n}^l}{\kappa_{t,n}^l + 1}\left(m_{t,n}^l - x\right)\left(m_{t,n}^l - x\right)^T. \tag{80}$$

A Dirichlet prior is assumed for the prior probabilities $c_t^l$ that the target sample
belongs to class l: $c_t = (c_t^1, \cdots, c_t^L) \sim \mathrm{Dir}(L, \xi_t)$, where $\xi_t = (\xi_t^1, \cdots, \xi_t^L)$ is the
vector of concentration parameters, and $\xi_t^l > 0$ for $l \in \{1, \ldots, L\}$. As the Dirichlet
distribution is a conjugate prior for the categorical distribution, upon observing
$n = (n_t^1, \ldots, n_t^L)$ sample points for the classes in the target domain, the posterior $\pi^*(c_t) =
\pi(c_t\,|\,n)$ has a Dirichlet distribution $\mathrm{Dir}(L, \xi_t + n)$.
The optimal Bayesian transfer learning classifier (OBTLC) in the target domain
relative to the uncertainty class $\Theta_t = \{c_t^l, \mu_t^l, \Lambda_t^l\}_{l=1}^L$ is given by

$$\psi_{\mathrm{OBTL}}(x) = \arg\max_{l \in \{1, \cdots, L\}} E_{\pi^*}[c_t^l]\, f_{\mathrm{OBTL}}(x|l). \tag{81}$$

If there is no interaction between the source and target domains in all the classes,
then the OBTLC reduces to the OBC in the target domain. Specifically, if Mlts = 0
for all l ∈ {1, ..., L}, then ψOBTL = ψOBC .
Figure 3 shows simulation results comparing the OBC (trained only with target
data) and the OBTL classifier for two classes and ten features (see Ref. 42 for
simulation details). α is a parameter measuring the relatedness between the source
and target domains: α = 0 when the two domains are not related, and α close to
1 indicates greater relatedness. Part (a) shows average classification error versus
the number of source points, with the number of target points fixed at 10, and part

Fig. 3. Average classification error for the target-only OBC and the OBTL classifier with α = 0.6, 0.8, and 0.925: (a) average classification error versus the number of source training data per class, (b) average classification error versus the number of target training data per class.
(b) shows average classification error versus the number of target points with the
number of source points fixed at 200.

8 Conclusion

Optimal Bayesian classification provides optimality with respect to both prior


knowledge and data; the greater the prior knowledge, the less data are needed
to obtain a given level of performance. Its formulation lies within the classical
operator-optimization framework, adapted to take into account both the opera-
tional objective and the state of our uncertain knowledge.3 Perhaps the salient
issue for OBC applications is the principled transformation of scientific knowledge
into the prior distribution. Although a general paradigm has been proposed in Ref.
36, it depends on certain cost assumptions. Others could be used. Indeed, all opti-
mizations depend upon the assumption of an objective and a cost function. Thus,
optimality always includes a degree of subjectivity. Nonetheless, an optimization
paradigm encapsulates the aims and knowledge of the engineer, and it is natural to
optimize relative to these.

References

[1] Yoon, B-J., Qian, X., and E. R. Dougherty, Quantifying the objective cost of uncer-
tainty in complex dynamical systems, IEEE Trans Signal Processing, 61, 2256-2266,
(2013).
[2] Dalton, L. A., and E. R. Dougherty, Intrinsically optimal Bayesian robust filtering,
IEEE Trans Signal Processing, 62, 657-670, (2014).
[3] Dougherty, E. R., Optimal Signal Processing Under Uncertainty, SPIE Press, Belling-
ham, (2018).
[4] Silver, E. A., Markovian decision processes with uncertain transition probabilities or
rewards, Technical report, Defense Technical Information Center, (1963).
[5] Martin, J. J., Bayesian Decision Problems and Markov Chains, Wiley, New York,
(1967).
[6] Kuznetsov, V. P., Stable detection when the signal and spectrum of normal noise
are inaccurately known, Telecommunications and Radio Engineering, 30-31, 58-64,
(1976).
[7] Poor, H. V., On robust Wiener filtering, IEEE Trans Automatic Control, 25, 531-536,
(1980).
[8] Grigoryan, A. M. and E. R. Dougherty, Bayesian robust optimal linear filters, Signal
Processing, 81, 2503-2521, (2001).
[9] Dougherty, E. R., Hua, J., Z. Xiong, and Y. Chen, Optimal robust classifiers, Pattern
Recognition, 38, 1520-1532, (2005).
[10] Dehghannasiri, R., Esfahani, M. S., and E. R. Dougherty, Intrinsically Bayesian ro-
bust Kalman filter: an innovation process approach, IEEE Trans Signal Processing,
65, 2531-2546, (2017).
[11] Dalton, L. A., and E. R. Dougherty, Optimal classifiers with minimum expected
error within a Bayesian framework–part I: discrete and Gaussian models, Pattern
Recognition, 46, 1288-1300, (2013).
[12] Dalton, L. A., and E. R. Dougherty, Optimal classifiers with minimum expected error
within a Bayesian framework–part II: properties and performance analysis, Pattern
Recognition, 46, 1301-1314, (2013).
[13] Dalton, L. A., and E. R. Dougherty, Bayesian minimum mean-square error estimation
for classification error–part I: definition and the Bayesian MMSE error estimator for
discrete classification, IEEE Trans Signal Processing, 59, 115-129, (2011).
[14] Dalton, L. A. , and E. R. Dougherty, Bayesian minimum mean-square error estimation
for classification error–part II: linear classification of Gaussian models, IEEE Trans
Signal Processing, 59, 130-144, (2011).
[15] DeGroot, M. H., Optimal Statistical Decisions, McGraw-Hill, New York, (1970).
[16] Raiffa, H., and R. Schlaifer, Applied Statistical Decision Theory, MIT Press, Cam-
bridge, (1961).
[17] Jeffreys, H., An invariant form for the prior probability in estimation problems, Proc
Royal Society of London. Series A, Mathematical and Physical Sciences, 186, 453-461,
(1946).
[18] Jeffreys, H., Theory of Probability, Oxford University Press, London, (1961).
[19] Knight, J., Ivanov, I., and E. R. Dougherty, MCMC Implementation of the optimal
Bayesian classifier for non-Gaussian models: model-based RNA-seq classification,
BMC Bioinformatics, 15, (2014).
[20] Knight, J., Ivanov, I., Chapkin, R., and E. R. Dougherty, Detecting multivariate
gene interactions in RNA-seq data using optimal Bayesian classification, IEEE/ACM
Trans Computational Biology and Bioinformatics, 15, 484-493, (2018).
[21] Nagaraja, K., and U. Braga-Neto, Bayesian classification of proteomics biomarkers
from selected reaction monitoring data using an approximate Bayesian computation–
Markov chain monte carlo approach, Cancer Informatics, 17, (2018).
[22] Banerjee, U., and U. Braga-Neto, Bayesian ABC-MCMC classification of liquid
chromatography–mass spectrometry data, Cancer Informatics, 14, (2015).
[23] Karbalayghareh, A., Braga-Neto, U. M., and E. R. Dougherty, Intrinsically Bayesian
robust classifier for single-cell gene expression time series in gene regulatory networks,
BMC Systems Biology, 12, (2018).
[24] Dadaneh, S. Z., Dougherty, E. R., and X. Qian, Optimal Bayesian classification with
missing values, IEEE Trans Signal Processing, 66, 4182-4192, (2018).
[25] Zollanvari, A., Hua, J., and E. R. Dougherty, Analytic study of performance of lin-
ear discriminant analysis in stochastic settings, Pattern Recognition, 46, 3017-3029,
(2013).
[26] Broumand, A., Yoon, B-J., Esfahani, M. S., and E. R. Dougherty, Discrete optimal
Bayesian classification with error-conditioned sequential sampling, Pattern Recogni-
tion, 48, 3766-3782, (2015).
[27] Dalton, L. A., and M. R. Yousefi, On optimal Bayesian classification and risk es-
timation under multiple classes, EURASIP J. Bioinformatics and Systems Biology,
(2015).
[28] Jaynes, E. T., Prior Probabilities, IEEE Trans Systems Science and Cybernetics, 4,
227-241, (1968).
[29] Jaynes, E., What is the question? in Bayesian Statistics, J. M. Bernardo et al., Eds.,
Valencia University Press, Valencia, (1980).
[30] Kashyap, R., Prior probability and uncertainty, IEEE Trans Information Theory,
IT-17, 641-650, (1971).
[31] Rissanen, J., A universal prior for integers and estimation by minimum description
length, Annals of Statistics, 11, 416-431, (1983).
[32] Dalton, L. A., and E. R. Dougherty, Application of the Bayesian MMSE error esti-
mator for classification error to gene-expression microarray data, Bioinformatics, 27,
1822-1831, (2011).
[33] Esfahani, M. S., Knight, J., Zollanvari, A., Yoon, B-J., and E. R. Dougherty, Classifier
design given an uncertainty class of feature distributions via regularized maximum
likelihood and the incorporation of biological pathway knowledge in steady-state phe-
notype classification, Pattern Recognition, 46, 2783-2797, (2013).
[34] Esfahani, M. S., and E. R. Dougherty, Incorporation of biological pathway knowledge
in the construction of priors for optimal Bayesian classification, IEEE/ACM Trans
Computational Biology and Bioinformatics, 11, 202-218, (2014).
[35] Esfahani, M. S., and E. R. Dougherty, An optimization-based framework for the
transformation of incomplete biological knowledge into a probabilistic structure and
its application to the utilization of gene/protein signaling pathways in discrete phe-
notype classification, IEEE/ACM Trans Computational Biology and Bioinformatics,
12, 1304-1321, (2015).
[36] Boluki, S., Esfahani, M. S., Qian, X., and E. R. Dougherty, Incorporating biological
prior knowledge for Bayesian learning via maximal knowledge-driven information
priors, BMC Bioinformatics, 18, (2017).
[37] Qian, X., and E. R. Dougherty, Bayesian regression with network prior: optimal
Bayesian filtering perspective, IEEE Trans Signal Processing, 64, 6243-6253, (2016).
[38] Bernardo, J., and A. Smith, Bayesian Theory, Wiley, Chichester, U.K., (2000).
[39] Bishop, C., Pattern Recognition and Machine Learning. Springer-Verlag, New York,
(2006).
[40] Murphy, K., Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge,
(2012).
[41] Pan, S. J., and Q.Yang, A survey on transfer learning, IEEE Trans Knowledge and
Data Engineering, 22, 1345-1359, (2010).
[42] Karbalayghareh, A., Qian, X., and E. R. Dougherty, Optimal Bayesian transfer learn-
ing, IEEE Trans Signal Processing, 66, 3724-3739, (2018).
[43] Halvorsen, K., Ayala, V., and E. Fierro, On the marginal distribution of the diagonal
blocks in a blocked Wishart random matrix, Int. J. Anal., 2016, 1-5, (2016).
[44] Nagar, D. K., and J. C. Mosquera-Benítez, Properties of matrix variate hypergeo-
metric function distribution, Appl. Math. Sci., 11, 677-692, (2017).
[45] Muirhead, R. J., Aspects of Multivariate Statistical Theory, Wiley, Hoboken, (2009).
CHAPTER 1.2

DEEP DISCRIMINATIVE FEATURE LEARNING METHOD


FOR OBJECT RECOGNITION

Weiwei Shi1 and Yihong Gong2


1 School of Computer Science and Engineering, Xi’an University of Technology,
Xi’an 710049, China.
2 Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University,
Xi’an 710049, China.
1 wshi@xaut.edu.cn, 2 ygong@mail.xjtu.edu.cn

This chapter introduces two deep discriminative feature learning methods for
object recognition without the need to increase the network complexity, one based
on entropy-orthogonality loss, and another one based on Min-Max loss. These
two losses can enforce the learned feature vectors to have better within-class
compactness and between-class separability. Therefore the discriminative ability
of the learned feature vectors is highly improved, which is very essential to object
recognition.

1. Introduction

Recent years have witnessed the bloom of convolutional neural networks (CNNs) in
many pattern recognition and computer vision applications, including object recog-
nition,1–4 object detection,5–8 face verification,9,10 semantic segmentation,6 object
tracking,11 image retrieval,12 image enhancement,13 image quality assessment,14
etc.
These impressive accomplishments mainly benefit from the three factors below:
(1) the rapid progress of modern computing technologies represented by GPGPUs
and CPU clusters has allowed researchers to dramatically increase the scale and
complexity of neural networks, and to train and run them within a reasonable time
frame, (2) the availability of large-scale datasets with millions of labeled training
samples has made it possible to train deep CNNs without a severe overfitting,
and (3) the introduction of many training strategies, such as ReLU,1 Dropout,1
DropConnect,15 and batch normalization,16 can help produce better deep models
by the back-propagation (BP) algorithm.
Recently, a common and popular method to improve object recognition perfor-
mance of CNNs is to develop deeper network structures with higher complexities
and then train them with large-scale datasets. However, this strategy is unsustain-
able and is inevitably reaching its limit. This is because training very deep CNNs is

becoming more and more difficult to converge, and also requires GPGPU/CPU clus-
ters and complex distributed computing platforms. These requirements go beyond
the limited budgets of many research groups and many real applications.
Learned features with good discriminative ability are essential to
object recognition.17–21 Discriminative features are the features with better within-
class compactness and between-class separability. Many discriminative feature
learning methods22–27 that are not based on deep learning have been proposed.
However, constructing a highly efficient discriminative feature learning method for
CNN is non-trivial. Because the BP algorithm with mini-batch is used to train
CNN, a mini-batch cannot very well reflect the global distribution of the training
set. Owing to the large scale of the training set, it is unrealistic to input the whole
training set in each iteration. In recent years, contrastive loss10 and triplet loss28 are
proposed to strengthen the discriminative ability of the features learned by CNN.
However, both of them suffer from dramatic data expansion when composing the
sample pairs or triplets from the training set. Moreover, it has been reported that
the way of constituting pairs or triplets of training samples can significantly affect
the performance accuracy of a CNN model by a few percentage points.17,28 As a
result, using such losses may lead to a slower model convergence, higher computa-
tional cost, increased training complexity and uncertainty.
For almost all visual tasks, the human visual system (HVS) is always superior
to current machine visual systems. Hence, developing a system that simulates some
properties of the HVS will be a promising research direction. Actually, existing
CNNs are well known for their local connectivity and shared weight properties that
originate from discoveries in visual cortex research.
Research findings in the areas of neuroscience, physiology, psychology, etc,29–31
have shown that, object recognition in human visual cortex (HVC) is accomplished
by the ventral stream, starting from the V1 area through the V2 area and V4 area,
to the inferior temporal (IT) area, and then to the prefrontal cortex (PFC) area.
By this hierarchy, raw input stimulus from the retina are gradually transformed
into higher level representations that have better discriminative ability for speedy
and accurate object recognition.
In this chapter, we introduce two deep discriminative feature learning methods
for object recognition by drawing lessons from HVC object recognition mechanisms,
one inspired by the class-selectivity of the neurons in the IT area, and another one
inspired by the “untangling” mechanism of HVC.
In the following, we first introduce the class-selectivity of the neurons in the IT
area and “untangling” mechanism of HVC, respectively.
Class-selectivity of the neurons in the IT area. Research findings30 have
revealed the class-selectivity of the neurons in the IT area. Specifically, the response
of an IT neuron to visual stimulus is sparse with respect to classes, i.e., it only
responds to very few classes. The class-selectivity implies that the feature vectors
from different classes can be easily separated.
“Untangling” mechanism of human visual cortex. Works in the fields of


psychology, neuroscience, physiology, etc.29,30,32 have revealed that object recogni-
tion in human brains is accomplished by the ventral stream that includes four lay-
ers, i.e., V1, V2, V4 and IT. If an object is transformed by any identity-preserving
transformations (such as a shift in position, changes in pose, viewing angle, overall
shape), it leads to different neuron population activities which can be viewed as the
corresponding feature vectors describing the object (see Fig. 1). In feature space, a
low-dimension manifold is formed by these feature vectors which correspond to all
possible identity-preserving transformations of the object. At V1 layer, manifolds
from different object categories are highly curved, and “tangled” with each other.
From V1 layer to IT layer, neurons gradually gain the recognition ability for differ-
ent object classes, implying that different manifolds will be gradually untangled. At
IT layer, each manifold corresponding to an object category is very compact, while
the distances among different manifolds are very large, and hence the discriminative
features are learned (see Fig. 1).

Fig. 1. (color online) In the beginning, manifolds corresponding to different object classes are
highly curved and “tangled”. For instance, a chair manifold (see blue manifold) and all other non-
chair manifolds (where, the black manifold is just one example). After a series of transformations,
in the end, each manifold corresponding to an object category is very compact, and the distances
between different manifolds are very large, and then the discriminative features are learned.30,33

Inspired by the class-selectivity of the neurons in the IT area,30 the entropy-


orthogonality loss based deep discriminative feature learning method is proposed.34
Inspired by the “untangling” mechanism of human visual cortex,30 the Min-Max
loss based deep discriminative feature learning method is proposed.20,33 In the
following two section, we will introduce them, respectively.
2. Entropy-Orthogonality Loss Based Deep Discriminative Feature


Learning Method

Inspired by the class-selectivity of the neurons in the IT area, Shi et al.34 pro-
posed to improve the discriminative feature learning of CNN models by enabling
the learned feature vectors to have class-selectivity. To achieve this, a novel loss
function, termed entropy-orthogonality loss (EOL), is proposed to modulate the
neuron outputs (i.e., feature vectors) in the penultimate layer of a CNN model.
The EOL explicitly enables the feature vectors learned by a CNN model to have
the following properties: (1) each dimension of the feature vectors only responds
strongly to as few classes as possible, and (2) the feature vectors from different
classes are as orthogonal as possible. Hence this method makes an analogy between
the CNN’s penultimate layer neurons and the IT neurons, and the EOL measures
the degree of discrimination of the learned features. The EOL and the softmax
loss have the same training requirement without the need to carefully recombine
the training sample pairs or triplets. Accordingly, the training of CNN models is
more efficient and easier-to-implement. When combined with the softmax loss, the
EOL not only can enlarge the differences in the between-class feature vectors, but
also can reduce the variations in the within-class feature vectors. Therefore the dis-
criminative ability of the learned feature vectors is highly improved, which is very
essential to object recognition. In the following, we will introduce the framework of
the EOL-based deep discriminative feature learning method.

2.1. Framework
Assume that $\mathcal{T} = \{X_i, c_i\}_{i=1}^n$ is the training set, where Xi represents the ith training
sample (i.e., input image), ci ∈ {1, 2, · · · , C} refers to the ground-truth label of Xi ,
C refers to the number of classes, and n refers to the number of training samples
in T . For the input image Xi , we denote the output^a of the penultimate layer of a
CNN by xi , and view xi as the feature vector of Xi learned by the CNN.
This method improves discriminative feature learning of a CNN by embedding
the entropy-orthogonality loss (EOL) into the penultimate layer of the CNN during
training. For an L-layer CNN model, embedding the EOL into the layer L − 1 of
the CNN, the overall objective function is:

$$\min_{W} \mathcal{L} = \sum_{i=1}^n \ell(W, X_i, c_i) + \lambda M(F, c), \tag{1}$$

where $\ell(W, X_i, c_i)$ is the softmax loss for sample Xi , W denotes the total layer
parameters of the CNN model, $W = \{W^{(l)}, b^{(l)}\}_{l=1}^L$, $W^{(l)}$ represents the filter
weights of the lth layer, and $b^{(l)}$ refers to the corresponding biases. M(F, c) denotes
the EOL, $F = [x_1, \cdots, x_n]$, and $c = \{c_i\}_{i=1}^n$. Hyperparameter λ adjusts the balance
between the softmax loss and the EOL.
a Assume that the output has been reshaped into a column vector.
F directly depends on $\{W^{(l)}, b^{(l)}\}_{l=1}^{L-1}$. Hence M(F, c) can directly modulate
all the layer parameters from the 1st to the (L − 1)th layer by the BP algorithm during the
training process. It is noteworthy that the EOL is independent of, and able to be
applied to, different CNN structures. Next, we will introduce the details of the EOL.

2.2. Entropy-Orthogonality Loss (EOL)


In this subsection, we introduce an entropy and orthogonality based loss function,
termed entropy-orthogonality loss (EOL), which measures the degree of discrimina-
tive ability of the learned feature vectors. For simplicity, assuming that the feature
vector xi is a d-dimensional column vector (xi ∈ Rd×1 ).
We call the k th (k = 1, 2, · · · , d) dimension of feature vector “class-sharing”
if it is nonzero on many samples belonging to many classes (we call these classes
“supported classes” of this dimension). Similarly, the k th dimension of feature
vector is called “class-selective” if it is nonzero on samples only belonging to a
few classes. The class-selectivity of the k th dimension increases as the number of
its supported classes decreases. Naturally, we can define the entropyb of the k th
dimension to measure the degree of its class-selectivity as:

$$E(k) = -\sum_{c=1}^C P_{kc} \log_C(P_{kc}), \tag{2}$$

$$P_{kc} = \frac{\sum_{j\in\pi_c} |x_j(k)|}{\sum_{i=1}^n |x_i(k)|} = \frac{\sum_{j\in\pi_c} |x_{kj}|}{\sum_{i=1}^n |x_{ki}|}, \tag{3}$$

where, xki (i.e., xi (k)) refers to the k th dimension of xi , πc represents the index set
of the samples belonging to the cth class.
The maximum possible value for E(k) is 1 when ∀c, Pkc = 1/C, which means that
the set of supported classes of dimension k includes all the classes and, therefore,
dimension k is not class-selective at all (it is extremely “class-sharing”). Similarly,
the minimum possible value of E(k) is 0 when ∃c, Pkc = 1 and ∀c′ ≠ c, Pkc′ = 0,
which means that the set of supported classes of dimension k includes just one class
c and, therefore, dimension k is extremely class-selective. For dimension k, the
degree of its class-selectivity is determined by the value of E(k) (between 0 and 1).
As the value of E(k) decreases, the class-selectivity of dimension k increases.
According to the discussions above, the entropy loss E(F, c) can be defined as:

$$E(F, c) = \sum_{k=1}^d E(k), \tag{4}$$

where $F = [x_1, \cdots, x_n]$ and $c = \{c_i\}_{i=1}^n$.


Minimizing the entropy loss is equivalent to enforcing that each dimension of the
feature vectors should only respond strongly to as few classes as possible. However,
b In the definition of entropy, 0 logC (0) = 0.
the entropy loss does not consider the connection between different dimensions,
which is problematic. Take 3-dimensional feature vector as an example. If we
have six feature vectors from 3 different classes, x1 and x2 come from class 1, x3
and x4 come from class 2, x5 and x6 come from class 3. For the feature vector
matrix F = [x1 , x2 , x3 , x4 , x5 , x6 ], when it takes the following value of A and B,
respectively, E(A, 
c) = E(B,  c), where  c = {1, 1, 2, 2, 3, 3}. However, the latter
one can not be classified at all, this is because x2 , x4 and x6 have the same value.
Although the situation can be partially avoided by the softmax loss, it can still cause
contradiction to the softmax loss and therefore affect the discriminative ability of
the learned features.
⎡ ⎤ ⎡ ⎤
11 00 11 101000
A = ⎣ 0 0 1 1 1 1 ⎦, B = ⎣ 0 1 0 1 0 1 ⎦ (5)
1 1 1
2 1 2 1 2 1 1 0 0 0 1 0
To address this problem, we need to promote orthogonality (i.e., minimize dot
products) between the feature vectors of different classes. Specifically, we need to
introduce the following orthogonality loss O(F, c):
 n
O(F, c) = (x 
i xj − φij ) = F F − ΦF ,
2 2
(6)
i,j=1
where,
1 , if ci = cj ,
φij = (7)
0 , else ,
Φ = (φij )n×n ,  · F denotes the Frobenius norm of a matrix, and the superscript 
denotes the transpose of a matrix. Minimizing the orthogonality loss is equivalent
to enforcing that (1) the feature vectors from different classes are as orthogonal as
possible, (2) the L2 -norm of each feature vector is as close as possible to 1, and (3)
the distance between any two feature vectors belonging to the same class is as small
as possible.
Based on the above discussions and definitions, the entropy-orthogonality loss
(EOL) M(F, c) can be obtained by integrating Eq. (4) and Eq. (6):
$$M(F, c) = \alpha E(F, c) + (1-\alpha)\, O(F, c) = \alpha \sum_{k=1}^d E(k) + (1-\alpha)\left\|F^\top F - \Phi\right\|_F^2, \tag{8}$$
where α is the hyperparameter to adjust the balance between the two terms.
Combining Eq. (8) with Eq. (1), the overall objective function becomes:

$$\min_{W} \mathcal{L}(W, \mathcal{T}) = \sum_{i=1}^n \ell(W, X_i, c_i) + \lambda\alpha E(F, c) + \lambda(1-\alpha)\, O(F, c) = \sum_{i=1}^n \ell(W, X_i, c_i) + \lambda_1 E(F, c) + \lambda_2 O(F, c), \tag{9}$$
where, λ1 = λα, λ2 = λ(1 − α). Next, we will introduce the optimization algorithm
for Eq. (9).
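A compact NumPy sketch of the two terms of the EOL, following Eqs. (2)–(8), is given below; F holds the d-dimensional penultimate-layer features of a mini-batch as columns, and the small constant is only there to respect the 0 log 0 = 0 convention numerically.

```python
# Sketch: entropy-orthogonality loss (EOL) of Eq. (8) for a mini-batch.
# F is d x n (features as columns); labels are in {0, ..., C-1}.
import numpy as np

def entropy_loss(F, labels, C, eps=1e-12):
    A = np.abs(F)                                   # |x_ki|
    S = A.sum(axis=1, keepdims=True) + eps          # sum_i |x_ki| per dimension k
    E = 0.0
    for c in range(C):
        P = A[:, labels == c].sum(axis=1, keepdims=True) / S   # P_kc, Eq. (3)
        E += -(P * np.log(P + eps) / np.log(C)).sum()          # Eq. (2), summed over k
    return E                                        # Eq. (4)

def orthogonality_loss(F, labels):
    Phi = (labels[:, None] == labels[None, :]).astype(float)   # Eq. (7)
    return np.sum((F.T @ F - Phi) ** 2)                        # Eq. (6)

def eol(F, labels, C, alpha=0.5):
    return alpha * entropy_loss(F, labels, C) + (1 - alpha) * orthogonality_loss(F, labels)

# Illustrative mini-batch: d = 4 features, n = 6 samples, C = 3 classes
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 6))
labels = np.array([0, 0, 1, 1, 2, 2])
print(eol(F, labels, C=3))
```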
Fig. 2. The flowchart of the training process in an iteration for the EOL-based deep discriminative
feature learning method.34 The CNN shown in this figure consists of 3 convolutional (conv) layers
and 2 fully connected (fc) layers, i.e., it is a 5-layer CNN model. The last layer fc2 outputs a
C-dimensional prediction vector, C is the number of classes. The penultimate layer in this model
is fc1, so the entropy-orthogonality loss (EOL) is applied to layer fc1. The EOL is independent of
the CNN structure.

Algorithm 1 Training algorithm for the EOL-based deep discriminative feature
learning method with an L-layer CNN model.
Input: Training set T , hyperparameters λ1 , λ2 , maximum number of iterations
Imax , and counter iter = 0.
Output: $W = \{W^{(l)}, b^{(l)}\}_{l=1}^L$.
1: Select a training mini-batch from T .
2: Perform the forward propagation, for each sample computing the activations
of all layers.
3: Perform the back-propagation from layer L to L − 1, sequentially computing
the error flows of layers L and L − 1 from the softmax loss by the BP algorithm.
4: Compute ∂E(F, c)/∂xi by Eq. (10), then scale them by λ1 .
5: Compute ∂O(F, c)/∂xi by Eq. (14), then scale them by λ2 .
6: Compute the total error flows of layer L − 1, which is the summation of the
above different items.
7: Perform the back-propagation from layer L − 1 to layer 1, sequentially
computing the error flows of layers L − 1, · · · , 1, by the BP algorithm.
8: According to the activations and error flows of all layers, compute ∂L/∂W by the BP
algorithm.
9: Update W by the gradient descent algorithm.
10: iter ← iter + 1. If iter < Imax , go to step 1.
2.3. Optimization
We employ the BP algorithm with mini-batch to train the CNN model. The over-
all objective function is Eq. (9). Hence, we need to compute the gradients of L
with respect to (w.r.t.) the activations of all layers, which are called the error
flows of the corresponding layers. The gradient calculation of the softmax loss is
straightforward. In the following, we focus on obtaining the gradients of the E(F, c)
and O(F, c) w.r.t. the feature vectors $x_i = [x_{1i}, x_{2i}, \cdots, x_{di}]^\top$ $(i = 1, 2, \cdots, n)$,
respectively.
The gradient of E(F, c) w.r.t. xi is

$$\frac{\partial E(F, c)}{\partial x_i} = \left[\frac{\partial E(1)}{\partial x_{1i}}, \frac{\partial E(2)}{\partial x_{2i}}, \cdots, \frac{\partial E(d)}{\partial x_{di}}\right]^\top, \tag{10}$$

$$\frac{\partial E(k)}{\partial x_{ki}} = -\sum_{c=1}^C \frac{1 + \ln(P_{kc})}{\ln(C)} \cdot \frac{\partial P_{kc}}{\partial x_{ki}}, \tag{11}$$

$$\frac{\partial P_{kc}}{\partial x_{ki}} = \begin{cases} \dfrac{\sum_{j\notin\pi_c} |x_{kj}|}{\left(\sum_{j=1}^n |x_{kj}|\right)^2} \times \mathrm{sgn}(x_{ki}), & i \in \pi_c, \\[2ex] -\dfrac{\sum_{j\in\pi_c} |x_{kj}|}{\left(\sum_{j=1}^n |x_{kj}|\right)^2} \times \mathrm{sgn}(x_{ki}), & i \notin \pi_c, \end{cases} \tag{12}$$

where sgn(·) is the sign function.
The O(F, c) can be written as:

$$O(F, c) = \left\|F^\top F - \Phi\right\|_F^2 = \mathrm{Tr}\!\left((F^\top F - \Phi)^\top (F^\top F - \Phi)\right) = \mathrm{Tr}(F^\top F F^\top F) - 2\,\mathrm{Tr}(\Phi F^\top F) + \mathrm{Tr}(\Phi^\top \Phi), \tag{13}$$

where Tr(·) refers to the trace of a matrix. The gradient of O(F, c) w.r.t. xi is

$$\frac{\partial O(F, c)}{\partial x_i} = 4F\left(F^\top F - \Phi\right)_{(:,i)}, \tag{14}$$

where the subscript (:, i) represents the ith column of a matrix.
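The analytic gradient in Eq. (14) is easy to sanity-check numerically; the sketch below compares 4F(F⊤F − Φ) against central finite differences of O(F, c), and the entropy gradient of Eqs. (10)–(12) can be verified in exactly the same way.

```python
# Sketch: numerical check of the orthogonality-loss gradient, Eq. (14).
import numpy as np

rng = np.random.default_rng(1)
d, n, C = 3, 5, 2
F = rng.normal(size=(d, n))
labels = rng.integers(0, C, size=n)
Phi = (labels[:, None] == labels[None, :]).astype(float)

O = lambda F: np.sum((F.T @ F - Phi) ** 2)          # Eq. (6)
grad_analytic = 4.0 * F @ (F.T @ F - Phi)           # Eq. (14), all columns at once

grad_numeric = np.zeros_like(F)
h = 1e-6
for k in range(d):
    for i in range(n):
        Fp, Fm = F.copy(), F.copy()
        Fp[k, i] += h
        Fm[k, i] -= h
        grad_numeric[k, i] = (O(Fp) - O(Fm)) / (2 * h)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be small
```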
Fig. 2 shows the flowchart of the training process in an iteration for the EOL-
based deep discriminative feature learning method. Based on the above derivatives,
the training algorithm for this method is listed in Algorithm 1.

3. Min-Max Loss Based Deep Discriminative Feature Learning


Method

Inspired by the “untangling” mechanism of human visual cortex,30 the Min-Max


loss based deep discriminative feature learning method is proposed.20,33 The Min-
Max loss enforces the following properties for the features learned by a CNN model:
(1) each manifold corresponding to an object category is as compact as possible,
and (2) the margins (distances) between different manifolds are as large as possible.
In principle, the Min-Max loss is independent of any CNN structures, and can be
applied to any layers of a CNN model. The experimental evaluations20,33 show that
applying the Min-Max loss to the penultimate layer is most effective for improving
the model’s object recognition accuracies. In the following, we will introduce the
framework of the Min-Max loss based deep discriminative feature learning method.

3.1. Framework

Let $\{X_i, c_i\}_{i=1}^n$ be the set of input training data, where Xi denotes the ith raw
input data, ci ∈ {1, 2, · · · , C} denotes the corresponding ground-truth label, C is
the number of classes, and n is the number of training samples. The goal of training
CNN is to learn filter weights and biases that minimize the classification error from
the output layer. A recursive function for an M -layer CNN model can be defined
as follows:
$$X_i^{(m)} = f\!\left(W^{(m)} * X_i^{(m-1)} + b^{(m)}\right), \tag{15}$$
$$i = 1, 2, \cdots, n; \quad m = 1, 2, \cdots, M; \quad X_i^{(0)} = X_i, \tag{16}$$

where W(m) denotes the filter weights of the mth layer to be learned, b(m) refers to
the corresponding biases, ∗ denotes the convolution operation, f (·) is an element-
wise nonlinear activation function such as ReLU, and $X_i^{(m)}$ represents the feature
maps generated at layer m for sample Xi . The total parameters of the CNN model
can be denoted as W = {W(1) , · · · , W(M ) ; b(1) , · · · , b(M ) } for simplicity.
This method improves discriminative feature learning of a CNN model by embedding
the Min-Max loss into a certain layer of the model during the training process.
Embedding this loss into the kth layer is equivalent to using the following cost
function to train the model:

$$\min_{W} \mathcal{L} = \sum_{i=1}^n \ell(W, X_i, c_i) + \lambda L(\mathcal{X}^{(k)}, c), \tag{17}$$

where $\ell(W, X_i, c_i)$ is the softmax loss for sample Xi , and $L(\mathcal{X}^{(k)}, c)$ denotes the Min-
Max loss. The input to it includes $\mathcal{X}^{(k)} = \{X_1^{(k)}, \cdots, X_n^{(k)}\}$, which denotes the set
of produced feature maps at layer k for all the training samples, and c = {ci }ni=1
which is the set of corresponding labels. Hyper-parameter λ controls the balance
between the classification error and the Min-Max loss.
Note that X (k) depends on W(1) , · · · , W(k) . Hence directly constraining X (k)
will modulate the filter weights from the 1st to the kth layer (i.e., W(1) , · · · , W(k) ) by
feedback propagation during the training phase.

3.2. Min-Max Loss


In the following, we will introduce two Min-Max losses, i.e., Min-Max loss on intrin-
sic and penalty graphs, and Min-Max loss based on within-manifold and between-
manifold distances, respectively.
Fig. 3. The adjacency relationships of (a) within-manifold intrinsic graph and (b) between-
manifold penalty graph for the case of two manifolds. For clarity, the left intrinsic graph only
includes the edges for one sample in each manifold.20

3.2.1. Min-Max Loss Based on Intrinsic and Penalty Graphs


For $\mathcal{X}^{(k)} = \{X_1^{(k)}, \cdots, X_n^{(k)}\}$, we denote by $x_i$ the column expansion of $X_i^{(k)}$. The
goal of the Min-Max loss is to enforce both the compactness of each object man-
ifold, and the max margin between different manifolds. The margin between two
manifolds is defined as the Euclidian distance between the nearest neighbors of the
two manifolds. Inspired by the Marginal Fisher Analysis research from,35 we can
construct an intrinsic and a penalty graph to characterize the within-manifold com-
pactness and the margin between the different manifolds, respectively, as shown in
Fig. 3. The intrinsic graph shows the node adjacency relationships for all the object
manifolds, where each node is connected to its k1 -nearest neighbors within the same
manifold. Meanwhile, the penalty graph shows the between-manifold marginal node
adjacency relationships, where the marginal node pairs from different manifolds are
connected. The marginal node pairs of the cth (c ∈ {1, 2, · · · , C}) manifold are the
k2 -nearest node pairs between manifold c and other manifolds.
Then, from the intrinsic graph, the within-manifold compactness can be char-
acterized as:

$$L_1 = \sum_{i,j=1}^n G_{ij}^{(I)} \left\|x_i - x_j\right\|^2, \tag{18}$$

$$G_{ij}^{(I)} = \begin{cases} 1, & \text{if } i \in \tau_{k_1}(j) \text{ or } j \in \tau_{k_1}(i), \\ 0, & \text{else}, \end{cases} \tag{19}$$

where $G_{ij}^{(I)}$ refers to element (i, j) of the intrinsic graph adjacency matrix $G^{(I)} = (G_{ij}^{(I)})_{n\times n}$, and τk1 (i) indicates the index set of the k1 -nearest neighbors of xi in
the same manifold as xi .
From the penalty graph, the between-manifold margin can be characterized as:

$$L_2 = \sum_{i,j=1}^n G_{ij}^{(P)} \left\|x_i - x_j\right\|^2, \tag{20}$$
$$G_{ij}^{(P)} = \begin{cases} 1, & \text{if } (i, j) \in \zeta_{k_2}(c_i) \text{ or } (i, j) \in \zeta_{k_2}(c_j), \\ 0, & \text{else}, \end{cases} \tag{21}$$

where $G_{ij}^{(P)}$ denotes element (i, j) of the penalty graph adjacency matrix $G^{(P)} = (G_{ij}^{(P)})_{n\times n}$, $\zeta_{k_2}(c)$ is a set of index pairs that are the k2 -nearest pairs among the set $\{(i, j)\,|\,i \in \pi_c, j \notin \pi_c\}$, and πc denotes the index set of the samples belonging to the
cth manifold.
Based on the above descriptions, the Min-Max loss on intrinsic and penalty graphs can be expressed as:

    L = L_1 - L_2 .    (22)

Obviously, minimizing this Min-Max loss is equivalent to enforcing the learned features to form compact object manifolds and large margins between different manifolds simultaneously. Combining Eq. (22) with Eq. (17), the overall objective function becomes:

    \min_{W} L = \sum_{i=1}^{n} \ell(W, X_i, c_i) + \lambda (L_1 - L_2) .    (23)
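To make the construction above concrete, the following sketch computes L_1, L_2 and the Min-Max loss of Eq. (22) for a mini-batch of column-expanded feature maps. It is a minimal NumPy illustration under our own assumptions (the function name, the hyper-parameters k1 and k2, and the symmetric filling of the adjacency matrices are not taken from the authors' implementation):

    # Minimal sketch of the Min-Max loss on intrinsic and penalty graphs.
    import numpy as np

    def minmax_graph_loss(X, labels, k1=5, k2=10):
        """X: (n, d) column-expanded feature maps x_i; labels: (n,) class ids."""
        n = X.shape[0]
        # Pairwise squared Euclidean distances Phi_ij = ||x_i - x_j||^2.
        sq = np.sum(X ** 2, axis=1)
        Phi = sq[:, None] + sq[None, :] - 2.0 * X @ X.T

        same = labels[:, None] == labels[None, :]
        G_I = np.zeros((n, n))                 # intrinsic graph, Eq. (19)
        for i in range(n):
            d = np.where(same[i], Phi[i], np.inf)
            d[i] = np.inf
            for j in np.argsort(d)[:k1]:       # k1-nearest neighbors in the same manifold
                if np.isfinite(d[j]):
                    G_I[i, j] = G_I[j, i] = 1.0

        G_P = np.zeros((n, n))                 # penalty graph, Eq. (21)
        for c in np.unique(labels):
            in_c = np.where(labels == c)[0]
            out_c = np.where(labels != c)[0]
            pair_d = Phi[np.ix_(in_c, out_c)]
            for f in np.argsort(pair_d, axis=None)[:k2]:   # k2-nearest marginal pairs
                i, j = np.unravel_index(f, pair_d.shape)
                G_P[in_c[i], out_c[j]] = G_P[out_c[j], in_c[i]] = 1.0

        L1 = np.sum(G_I * Phi)                 # within-manifold compactness, Eq. (18)
        L2 = np.sum(G_P * Phi)                 # between-manifold margin, Eq. (20)
        return L1 - L2                         # Min-Max loss, Eq. (22)

In the full objective of Eq. (23), this quantity would be weighted by λ and added to the softmax loss before back-propagation.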

3.2.2. Min-Max Loss Based on Within-Manifold and Between-Manifold Distances
In this subsection, we implement the Min-Max loss by minimizing the within-manifold distance while maximizing the between-manifold distance of the learned feature maps of the layer to which the Min-Max loss is applied. Denote the column expansion of X_i^{(k)} by x_i, and the index set of the samples belonging to class c by \pi_c. Then the mean vector of the k-th layer feature maps belonging to class c can be represented as

    m_c = \frac{1}{n_c} \sum_{i \in \pi_c} x_i ,    (24)

where n_c = |\pi_c|. Similarly, the overall mean vector is

    m = \frac{1}{n} \sum_{i=1}^{n} x_i ,    (25)

where n = \sum_{c=1}^{C} |\pi_c|.
The within-manifold distance S_c^{(W)} for class c can be represented as

    S_c^{(W)} = \sum_{i \in \pi_c} (x_i - m_c)^{\top} (x_i - m_c) .    (26)

The total within-manifold distance S^{(W)} can be computed as

    S^{(W)} = \sum_{c=1}^{C} S_c^{(W)} .    (27)

Minimizing S^{(W)} is equivalent to enforcing the within-manifold compactness. The total between-manifold distance S^{(B)} can be expressed as

    S^{(B)} = \sum_{c=1}^{C} n_c (m_c - m)^{\top} (m_c - m) .    (28)

Maximizing S^{(B)} is equivalent to enlarging the between-manifold distances.

Using the above notation, the Min-Max loss based on within-manifold and between-manifold distances can be defined as follows:

    L(X^{(k)}, c) = \frac{S^{(B)}}{S^{(W)}} .    (29)
Obviously, maximizing this Min-Max loss is equivalent to enforcing the learned
features to form compact object manifolds and large distances between different
manifolds simultaneously. Combining Eq. (29) with Eq. (17), the overall objective
function becomes as follows:

    \min_{W} L = \sum_{i=1}^{n} \ell(W, X_i, c_i) - \lambda \frac{S^{(B)}}{S^{(W)}} .    (30)
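As an illustration, the ratio S^{(B)}/S^{(W)} of Eq. (29) can be computed per mini-batch as in the following minimal NumPy sketch; the names and the batch layout are our own assumptions rather than the authors' code:

    # Minimal sketch of the within/between-manifold trace-ratio loss.
    import numpy as np

    def minmax_trace_ratio(X, labels):
        """X: (n, d) feature maps of one mini-batch; labels: (n,) class ids."""
        m = X.mean(axis=0)                           # overall mean, Eq. (25)
        S_W, S_B = 0.0, 0.0
        for c in np.unique(labels):
            Xc = X[labels == c]
            mc = Xc.mean(axis=0)                     # class mean, Eq. (24)
            S_W += np.sum((Xc - mc) ** 2)            # Eqs. (26)-(27)
            S_B += len(Xc) * np.sum((mc - m) ** 2)   # Eq. (28)
        return S_B / S_W                             # Eq. (29)

In the overall objective of Eq. (30), this ratio is subtracted, with weight λ, from the softmax loss.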

3.3. Optimization
We use the back-propagation method to train the CNN model, which is carried out with mini-batches. Therefore, we need to calculate the gradients of the overall objective function with respect to the features of the corresponding layers. Because the softmax loss is used as the first term of Eqs. (23) and (30), its gradient calculation is straightforward. In the following, we focus on obtaining the gradient of the Min-Max loss with respect to the feature maps x_i of the corresponding layer.

3.3.1. Optimization for Min-Max Loss Based on Intrinsic and Penalty Graphs
Let G = (G_{ij})_{n \times n} = G^{(I)} - G^{(P)}; then the Min-Max objective can be written as:

    L = \sum_{i,j=1}^{n} G_{ij} \| x_i - x_j \|^2 = 2\,\mathrm{Tr}(H \Psi H^{\top}) ,    (31)

where H = [x_1, \dots, x_n], \Psi = D - G, D = \mathrm{diag}(d_{11}, \dots, d_{nn}) with d_{ii} = \sum_{j=1, j \neq i}^{n} G_{ij}, i = 1, 2, \dots, n, i.e. \Psi is the Laplacian matrix of G, and \mathrm{Tr}(\cdot) denotes the trace of a matrix.
The gradient of L with respect to x_i is

    \frac{\partial L}{\partial x_i} = 2 H (\Psi + \Psi^{\top})_{(:,i)} = 4 H \Psi_{(:,i)} ,    (32)

where \Psi_{(:,i)} denotes the i-th column of the matrix \Psi.
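A minimal NumPy sketch of Eqs. (31)-(32), assuming the combined graph G = G^{(I)} - G^{(P)} has already been built for the mini-batch (variable names are illustrative):

    # Minimal sketch of the graph form of the objective and its gradient.
    import numpy as np

    def minmax_graph_gradients(H, G):
        """H: (d, n) matrix [x_1, ..., x_n]; G: (n, n) symmetric graph G^(I) - G^(P)."""
        D = np.diag(G.sum(axis=1) - np.diag(G))   # d_ii = sum_{j != i} G_ij
        Psi = D - G                               # Laplacian of G
        L = 2.0 * np.trace(H @ Psi @ H.T)         # Eq. (31)
        grad = 4.0 * H @ Psi                      # column i is dL/dx_i, Eq. (32)
        return L, grad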

3.3.2. Optimization for Min-Max Loss Based on Within-Manifold and Between-Manifold Distances
Let \mathbf{S}^{(W)} and \mathbf{S}^{(B)} be the within-manifold scatter matrix and the between-manifold scatter matrix, respectively; then we have

    S^{(W)} = \mathrm{Tr}(\mathbf{S}^{(W)}) , \quad S^{(B)} = \mathrm{Tr}(\mathbf{S}^{(B)}) .    (33)
According to Refs. 36 and 37, the scatter matrices \mathbf{S}^{(W)} and \mathbf{S}^{(B)} can be calculated by:

    \mathbf{S}^{(W)} = \frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(W)} (x_i - x_j)(x_i - x_j)^{\top} ,    (34)

    \mathbf{S}^{(B)} = \frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(B)} (x_i - x_j)(x_i - x_j)^{\top} ,    (35)
where n is the number of inputs in a mini-batch, and \Omega_{ij}^{(W)} and \Omega_{ij}^{(B)} are respectively elements (i, j) of the within-manifold adjacency matrix \Omega^{(W)} = (\Omega_{ij}^{(W)})_{n \times n} and the between-manifold adjacency matrix \Omega^{(B)} = (\Omega_{ij}^{(B)})_{n \times n}, computed from the features X^{(k)} (i.e., the feature maps generated at the k-th layer, X^{(k)} = \{x_1, \dots, x_n\}) of one mini-batch of training data as:

    \Omega_{ij}^{(W)} = \begin{cases} \frac{1}{n_c}, & \text{if } c_i = c_j = c, \\ 0, & \text{otherwise}, \end{cases} \qquad \Omega_{ij}^{(B)} = \begin{cases} \frac{1}{n} - \frac{1}{n_c}, & \text{if } c_i = c_j = c, \\ \frac{1}{n}, & \text{otherwise}. \end{cases}    (36)
Based on the above descriptions and Ref. 38, the Min-Max loss based on within-manifold and between-manifold distances L can be written as:

    L = \frac{\mathrm{Tr}(\mathbf{S}^{(B)})}{\mathrm{Tr}(\mathbf{S}^{(W)})} = \frac{\frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(B)} \| x_i - x_j \|^2}{\frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(W)} \| x_i - x_j \|^2} = \frac{\mathbf{1}_n^{\top} (\Omega^{(B)} \odot \Phi) \mathbf{1}_n}{\mathbf{1}_n^{\top} (\Omega^{(W)} \odot \Phi) \mathbf{1}_n} ,    (37)

where \Phi = (\Phi_{ij})_{n \times n} is an n \times n matrix with \Phi_{ij} = \| x_i - x_j \|^2, \odot denotes the element-wise product, and \mathbf{1}_n \in \mathbb{R}^n is a column vector with all elements equal to one.
The gradients of \mathrm{Tr}(\mathbf{S}^{(W)}) and \mathrm{Tr}(\mathbf{S}^{(B)}) with respect to x_i are:

    \frac{\partial \mathrm{Tr}(\mathbf{S}^{(W)})}{\partial x_i} = (x_i \mathbf{1}_n^{\top} - H)(\Omega^{(W)} + \Omega^{(W)\top})_{(:,i)} ,    (38)

    \frac{\partial \mathrm{Tr}(\mathbf{S}^{(B)})}{\partial x_i} = (x_i \mathbf{1}_n^{\top} - H)(\Omega^{(B)} + \Omega^{(B)\top})_{(:,i)} ,    (39)

where H = [x_1, \dots, x_n], and the subscript (:, i) denotes the i-th column of a matrix.
Then the gradient of the Min-Max loss with respect to the features x_i is

    \frac{\partial L}{\partial x_i} = \frac{\mathrm{Tr}(\mathbf{S}^{(W)}) \frac{\partial \mathrm{Tr}(\mathbf{S}^{(B)})}{\partial x_i} - \mathrm{Tr}(\mathbf{S}^{(B)}) \frac{\partial \mathrm{Tr}(\mathbf{S}^{(W)})}{\partial x_i}}{[\mathrm{Tr}(\mathbf{S}^{(W)})]^2} .    (40)
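The following minimal NumPy sketch puts Eqs. (36)-(40) together for one mini-batch; the function name and the dense loop over columns are our own simplifications, not the authors' implementation:

    # Minimal sketch of the trace-ratio loss and its per-feature gradient.
    import numpy as np

    def minmax_trace_ratio_grad(H, labels):
        """H: (d, n) matrix [x_1, ..., x_n]; labels: (n,) class ids of one mini-batch."""
        d, n = H.shape
        same = labels[:, None] == labels[None, :]
        counts = np.array([np.sum(labels == c) for c in labels])      # n_{c_j} per sample
        W = np.where(same, 1.0 / counts[None, :], 0.0)                # Omega^(W), Eq. (36)
        B = np.where(same, 1.0 / n - 1.0 / counts[None, :], 1.0 / n)  # Omega^(B), Eq. (36)

        sq = np.sum(H ** 2, axis=0)
        Phi = sq[:, None] + sq[None, :] - 2.0 * H.T @ H               # Phi_ij = ||x_i - x_j||^2
        trSW = 0.5 * np.sum(W * Phi)                                  # Tr(S^(W))
        trSB = 0.5 * np.sum(B * Phi)                                  # Tr(S^(B))

        ones = np.ones(n)
        # Eqs. (38)-(39): d Tr(S)/d x_i = (x_i 1^T - H)(Omega + Omega^T)_{(:,i)}
        gW = np.stack([(np.outer(H[:, i], ones) - H) @ (W + W.T)[:, i] for i in range(n)], axis=1)
        gB = np.stack([(np.outer(H[:, i], ones) - H) @ (B + B.T)[:, i] for i in range(n)], axis=1)
        grad = (trSW * gB - trSB * gW) / (trSW ** 2)                  # Eq. (40), column i = dL/dx_i
        return trSB / trSW, grad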
Incremental mini-batch training procedure

In practice, when the number of classes is large relative to the mini-batch size, there is no guarantee that each mini-batch will contain training samples from all the classes, so the above gradient must be calculated in an incremental fashion. Firstly, the mean vector of the c-th class can be updated as

    m_c(t) = \frac{\sum_{i \in \pi_c(t)} x_i(t) + N_c(t-1)\, m_c(t-1)}{N_c(t)} ,    (41)

where (t) indicates the t-th iteration, N_c(t) represents the cumulative total number of c-th class training samples, \pi_c(t) denotes the index set of the samples belonging to the c-th class in the mini-batch, and n_c(t) = |\pi_c(t)|. Accordingly, the overall mean vector m(t) can be updated by

    m(t) = \frac{1}{n} \sum_{c=1}^{C} n_c(t)\, m_c(t) ,    (42)

where n = \sum_{c=1}^{C} |\pi_c(t)|, i.e. n is the number of training samples in a mini-batch.
In such a scenario, at the t-th iteration, the within-manifold distance S_c^{(W)}(t) for class c can be represented as

    S_c^{(W)}(t) = \sum_{i \in \pi_c(t)} (x_i(t) - m_c(t))^{\top} (x_i(t) - m_c(t)) ,    (43)

the total within-manifold distance S^{(W)}(t) can be denoted as

    S^{(W)}(t) = \sum_{c=1}^{C} S_c^{(W)}(t) ,    (44)

and the total between-manifold distance S^{(B)}(t) can be expressed as

    S^{(B)}(t) = \sum_{c=1}^{C} n_c(t) (m_c(t) - m(t))^{\top} (m_c(t) - m(t)) .    (45)
Then the gradients of S^{(W)}(t) and S^{(B)}(t) with respect to x_i(t) become:

    \frac{\partial S^{(W)}(t)}{\partial x_i(t)} = \sum_{c=1}^{C} I(i \in \pi_c(t)) \frac{\partial S_c^{(W)}(t)}{\partial x_i(t)} = 2 \sum_{c=1}^{C} I(i \in \pi_c(t)) \left[ (x_i(t) - m_c(t)) + \frac{n_c(t)\, m_c(t) - \sum_{j \in \pi_c(t)} x_j(t)}{N_c(t)} \right] ,    (46)

and

    \frac{\partial S^{(B)}(t)}{\partial x_i(t)} = \frac{\partial \sum_{c=1}^{C} n_c(t) (m_c(t) - m(t))^{\top} (m_c(t) - m(t))}{\partial x_i(t)} = 2 \sum_{c=1}^{C} I(i \in \pi_c(t)) \frac{n_c(t) (m_c(t) - m(t))}{N_c(t)} ,    (47)
where I(\cdot) refers to the indicator function, which equals 1 if the condition is satisfied and 0 otherwise. Accordingly, the gradient of the Min-Max loss with respect to the features x_i(t) is

    \frac{\partial L}{\partial x_i(t)} = \frac{S^{(W)}(t) \frac{\partial S^{(B)}(t)}{\partial x_i(t)} - S^{(B)}(t) \frac{\partial S^{(W)}(t)}{\partial x_i(t)}}{[S^{(W)}(t)]^2} .    (48)
The total gradient with respect to x_i is the sum of the gradient from the softmax loss and that from the Min-Max loss.
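A minimal sketch of the incremental update of Eqs. (41)-(42), where running class means and cumulative counts are carried across mini-batches; the state dictionaries used here are an assumed bookkeeping choice, not the authors' code:

    # Minimal sketch of the incremental class-mean update across mini-batches.
    import numpy as np

    def update_means(batch_X, batch_labels, means, counts):
        """means: dict c -> running m_c; counts: dict c -> cumulative N_c."""
        for c in np.unique(batch_labels):
            Xc = batch_X[batch_labels == c]                  # class-c samples in this batch
            N_new = counts.get(c, 0) + len(Xc)
            prev = means.get(c, np.zeros(batch_X.shape[1]))
            # Eq. (41): m_c(t) = (sum_batch x_i + N_c(t-1) m_c(t-1)) / N_c(t)
            means[c] = (Xc.sum(axis=0) + counts.get(c, 0) * prev) / N_new
            counts[c] = N_new
        # Eq. (42): overall mean from the classes present in this mini-batch
        n = len(batch_labels)
        return sum(np.sum(batch_labels == c) * means[c] for c in np.unique(batch_labels)) / n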

4. Experiments with Image Classification Task

4.1. Experimental Setups


The performance evaluations are conducted using one shallow model, QCNN,39 and two well-known deep models, NIN40 and AlexNet.1 During training, the EOL (or Min-Max loss) is applied to the penultimate layer of the models without changing the network structures.20,33,34 For the hyperparameters, including dropout ratio, learning rate, weight decay and momentum, we abide by the original network settings. The hardware used in the experiments is one NVIDIA K80 GPU and one Intel Xeon E5-2650v3 CPU. The software used in the experiments is the Caffe platform.39 All models are trained from scratch without pre-training. In the following, we use Min-Max∗ and Min-Max to denote the Min-Max loss based on intrinsic and penalty graphs and the Min-Max loss based on within-manifold and between-manifold distances, respectively.

4.2. Datasets
The CIFAR10,41 CIFAR100,41 MNIST42 and SVHN43 datasets are chosen for the performance evaluations. CIFAR10 and CIFAR100 are natural image datasets. MNIST is a dataset of hand-written digit (0-9) images. SVHN is collected from house numbers in Google Street View images; an SVHN image may contain more than one digit, but the task is to classify the digit at the image center. Table 1 lists the details of the CIFAR10, CIFAR100, MNIST and SVHN datasets. These four datasets are very popular in the image classification research community because they contain a large number of small images, which enables models to be trained in reasonable time frames on computers with moderate configurations.
Table 1. Details of the CIFAR10, CIFAR100, MNIST and SVHN datasets.

Dataset     #Classes   #Samples   Size and Format     Split
CIFAR10     10         60000      32×32 RGB           training/test: 50000/10000
CIFAR100    100        60000      32×32 RGB           training/test: 50000/10000
MNIST       10         70000      28×28 gray-scale    training/test: 60000/10000
SVHN        10         630420     32×32 RGB           training/test/extra: 73257/26032/531131

4.3. Experiments using QCNN Model


First, the “quick” CNN model from the official Caffe package39 is selected as the baseline (termed QCNN). It consists of 3 convolutional (conv) layers and 2 fully connected (fc) layers. We evaluated the QCNN model on CIFAR10, CIFAR100 and SVHN. MNIST cannot be used to evaluate the QCNN model, because the input size of QCNN must be 32×32, but the images in MNIST are 28×28 in size.

Table 2. Comparisons of the test error rates (%) on the CIFAR10, CIFAR100 and SVHN datasets using QCNN.

Method              CIFAR10   CIFAR100   SVHN
QCNN (Baseline)     23.47     55.87      8.92
QCNN+EOL34          16.74     50.09      4.47
QCNN+Min-Max∗20     18.06     51.38      5.42
QCNN+Min-Max33      17.54     50.80      4.80

Table 3. Comparisons of the test error rates (%) on the CIFAR10, CIFAR100, MNIST and SVHN datasets using NIN.

Method              CIFAR10   CIFAR100   MNIST   SVHN
NIN40               10.41     35.68      0.47    2.35
DSN44               9.78      34.57      0.39    1.92
NIN (Baseline)      10.20     35.50      0.47    2.55
NIN+EOL34           8.41      32.54      0.30    1.70
NIN+Min-Max∗20      9.25      33.58      0.32    1.92
NIN+Min-Max33       8.83      32.95      0.30    1.80
Table 2 shows the test-set top-1 error rates on CIFAR10, CIFAR100 and SVHN. It can be seen that training QCNN with the EOL or the Min-Max loss effectively improves performance over the baseline. These remarkable performance improvements clearly reveal the effectiveness of the EOL and the Min-Max loss.

4.4. Experiments using NIN Model

Next, we apply the EOL or Min-Max loss to the NIN model.40 NIN consists of 9 conv layers without any fc layers. The four datasets, CIFAR10, CIFAR100, MNIST and SVHN, are used in this evaluation. For fairness, we complied with the same training/testing protocols and data preprocessing as in Refs. 40 and 44. Table 3 provides the comparison of test-set top-1 error rates for the four datasets. For the NIN baseline, to be fair, we report the evaluation results from both our own experiments and the original paper.40 We also include the results of DSN44 in this table; DSN is also based on NIN, with layer-wise supervision. These results again reveal the effectiveness of the EOL and the Min-Max loss.

Fig. 4. Feature visualization of the CIFAR10 test set with (a) QCNN, (b) QCNN+EOL, (c) QCNN+Min-Max∗ and (d) QCNN+Min-Max. Each dot denotes an image; different colors denote different classes.

Fig. 5. Feature visualization of the CIFAR10 test set with (a) NIN, (b) NIN+EOL, (c) NIN+Min-Max∗ and (d) NIN+Min-Max.

4.5. Feature Visualization


We utilize t-SNE45 to visualize the learned feature vectors extracted from the penultimate layer of the QCNN and NIN models on the CIFAR10 test set. Figs. 4 and 5 show the respective feature visualizations for the two models. It can be observed that the EOL and the Min-Max loss make the learned feature vectors exhibit better between-class separability and within-class compactness compared to the respective baselines. Therefore, the discriminative ability of the learned feature vectors is greatly improved.
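For reference, the visualization step can be reproduced along the following lines, assuming the penultimate-layer features have already been extracted into an (n, d) array; scikit-learn's t-SNE and matplotlib are used in this sketch, which is our own illustration rather than the plotting code used for Figs. 4 and 5:

    # Minimal sketch of the t-SNE feature visualization.
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(features, labels, out_path="tsne.png"):
        emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
        plt.figure(figsize=(6, 6))
        plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab10")  # one dot per image
        plt.axis("off")
        plt.savefig(out_path, dpi=200)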

5. Discussions

All the experiments in Section 4 indicate the superiority of the EOL and the Min-Max loss. The reasons why better within-class compactness and between-class separability lead to better discriminative ability of the learned feature vectors are as follows:

(1) Almost all data clustering methods,46–48 discriminant analysis methods,35,49,50 etc., use this principle to learn discriminative features that better accomplish the task. Data clustering can be regarded as unsupervised data classification. Therefore, by analogy, learning features that possess the above property will also improve the accuracy of supervised data classification.

(2) As described in the Introduction, the human visual cortex employs a similar mechanism to accomplish the goal of discriminative feature extraction. This discovery serves as an additional justification for this principle.

References

1. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems,
pp. 1097–1105, (2012).
2. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image
recognition, arXiv preprint arXiv:1409.1556. (2014).
3. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-
houcke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, (2015).
4. K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition,
arXiv preprint arXiv:1512.03385. (2015).
5. C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, Scalable, high-quality object detec-
tion, arXiv preprint arXiv:1412.1441. (2014).
6. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 580–587, (2014).
7. R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Com-
puter Vision, pp. 1440–1448, (2015).
8. S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object de-
tection with region proposal networks. In Advances in Neural Information Processing
Systems, pp. 91–99, (2015).
9. J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification
in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1875–1882, (2014).
10. Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint
identification-verification. In Advances in Neural Information Processing Systems, pp.
1988–1996, (2014).
11. N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual
tracking. In Advances in Neural Information Processing Systems, pp. 809–817, (2013).
12. J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning
for content-based image retrieval: A comprehensive study. In Proc. ACM Int. Conf.
on Multimedia, pp. 157–166, (2014).
13. C. Dong, C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image
super-resolution. In Proceedings of the European Conference on Computer Vision, pp.
184–199, (2014).
14. L. Kang, P. Ye, Y. Li, and D. Doermann, Convolutional neural networks for no-
reference image quality assessment, Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 1733–1740, (2014).
15. L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural
networks using dropconnect. In Proceedings of the international conference on machine
learning, pp. 1058–1066, (2013).
16. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In Proceedings of the International Conference on
Machine Learning, pp. 448–456, (2015).

17. Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach
for deep face recognition. In Proceedings of the European Conference on Computer
Vision, pp. 499–515, (2016).
18. W. Shi, Y. Gong, J. Wang, and N. Zheng. Integrating supervised laplacian objective
with cnn for object recognition. In Pacific Rim Conference on Multimedia, pp. 64–73,
(2016).
19. G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, When deep learning meets metric
learning: Remote sensing image scene classification via learning discriminative cnns,
IEEE Transactions on Geoscience and Remote Sensing. (2018). doi: 10.1109/TGRS.
2017.2783902.
20. W. Shi, Y. Gong, and J. Wang. Improving cnn performance with min-max objective.
In Proceedings of the International Joint Conference on Artificial Intelligence, pp.
2004–2010, (2016).
21. W. Shi, Y. Gong, X. Tao, and N. Zheng, Training dcnn by combining max-margin,
max-correlation objectives, and correntropy loss for multilabel image classification,
IEEE Transactions on Neural Networks and Learning Systems. 29(7), 2896–2908,
(2018).
22. C. Li, Q. Liu, W. Dong, F. Wei, X. Zhang, and L. Yang, Max-margin-based dis-
criminative feature learning, IEEE Transactions on Neural Networks and Learning
Systems. 27(12), 2768–2775, (2016).
23. G.-S. Xie, X.-Y. Zhang, X. Shu, S. Yan, and C.-L. Liu. Task-driven feature pooling for
image classification. In Proceedings of the IEEE International Conference on Computer
Vision, pp. 1179–1187, (2015).
24. G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, Sde: A novel selective, discriminative
and equalizing feature representation for visual recognition, International Journal of
Computer Vision. 124(2), 145–168, (2017).
25. G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, Hybrid cnn and dictionary-based
models for scene recognition and domain adaptation, IEEE Transactions on Circuits
and Systems for Video Technology. 27(6), 1263–1274, (2017).
26. J. Tang, Z. Li, H. Lai, L. Zhang, S. Yan, et al., Personalized age progression with bi-
level aging dictionary learning, IEEE Transactions on Pattern Analysis and Machine
Intelligence. 40(4), 905–917, (2018).
27. G.-S. Xie, X.-B. Jin, Z. Zhang, Z. Liu, X. Xue, and J. Pu, Retargeted multi-view
feature learning with separate and shared subspace uncovering, IEEE Access. 5,
24895–24907, (2017).
28. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face
recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 815–823, (2015).
29. T. Serre, A. Oliva, and T. Poggio, A feedforward architecture accounts for rapid
categorization, Proceedings of the National Academy of Sciences. 104(15), 6424–6429,
(2007).
30. J. J. DiCarlo, D. Zoccolan, and N. C. Rust, How does the brain solve visual object
recognition?, Neuron. 73(3), 415–434, (2012).
31. S. Zhang, Y. Gong, and J. Wang. Improving dcnn performance with sparse category-
selective objective function. In Proceedings of the International Joint Conference on
Artificial Intelligence, pp. 2343–2349, (2016).
32. N. Pinto, N. Majaj, Y. Barhomi, E. Solomon, D. Cox, and J. DiCarlo. Human versus
machine: comparing visual object recognition systems on a level playing field. In
Computational and Systems Neuroscience, (2010).
33. W. Shi, Y. Gong, X. Tao, J. Wang, and N. Zheng, Improving cnn performance accu-
racies with min-max objective, IEEE Transactions on Neural Networks and Learning
Systems. 29(7), 2872–2885, (2018).
34. W. Shi, Y. Gong, D. Cheng, X. Tao, and N. Zheng, Entropy and orthogonality based
deep discriminative feature learning for object recognition, Pattern Recognition. 81,
71–80, (2018).
35. S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, Graph embedding and
extensions: a general framework for dimensionality reduction, IEEE Transactions on
Pattern Analysis and Machine Intelligence. 29(1), 40–51, (2007).
36. M. Sugiyama. Local fisher discriminant analysis for supervised dimensionality reduc-
tion. In Proc. Int. Conf. Mach. Learn., pp. 905–912, (2006).
37. G. S. Xie, X. Y. Zhang, Y. M. Zhang, and C. L. Liu. Integrating supervised subspace
criteria with restricted boltzmann machine for feature extraction. In Int. Joint Conf.
on Neural Netw., (2014).
38. M. K. Wong and M. Sun, Deep learning regularized fisher mappings, IEEE Transac-
tions on Neural Networks. 22, 1668–1675, (2011).
39. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Pro-
ceedings of the ACM International Conference on Multimedia, pp. 675–678, (2014).
40. M. Lin, Q. Chen, and S. Yan, Network in network, arXiv preprint arXiv:1312.4400.
(2013).
41. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images,
Master’s thesis, University of Toronto. (2009).
42. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE. 86(11), 2278–2324, (1998).
43. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in
natural images with unsupervised feature learning. In Neural Information Processing
Systems (NIPS) workshop on deep learning and unsupervised feature learning, vol.
2011, p. 5, (2011).
44. C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In
Artificial Intelligence and Statistics, pp. 562–570, (2015).
45. L. Van der Maaten and G. Hinton, Visualizing data using t-sne, Journal of Machine
Learning Research. 9, 2579–2605, (2008).
46. A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: a review, ACM computing
surveys (CSUR). 31(3), 264–323, (1999).
47. U. Von Luxburg, A tutorial on spectral clustering, Statistics and computing. 17(4),
395–416, (2007).
48. S. Zhou, Z. Xu, and F. Liu, Method for determining the optimal number of clusters
based on agglomerative hierarchical clustering, IEEE Transactions on Neural Net-
works and Learning Systems. (2016).
49. R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of
eugenics. 7(2), 179–188, (1936).
50. J. L. Andrews and P. D. Mcnicholas, Model-based clustering, classification, and dis-
criminant analysis via mixtures of multivariate t-distributions, Statistics and Com-
puting. 22(5), 1021–1029, (2012).
CHAPTER 1.3

DEEP LEARNING BASED BACKGROUND SUBTRACTION: A SYSTEMATIC SURVEY

Jhony H. Giraldo1, Huu Ton Le2 and Thierry Bouwmans1,*

1 Lab. MIA, La Rochelle Univ., Avenue M. Crépeau, 17000 La Rochelle, France
2 ICTLab/USTH, Hanoi, Vietnam
* E-mail: tbouwman@univ-lr.fr

Machine learning has been widely applied to the detection of moving objects from static cameras. Recently, many methods using deep learning for background subtraction have been reported, with very promising performance. This chapter provides a survey of the different deep-learning based background subtraction methods. First, a comparison of the architecture of each method is provided, followed by a discussion of specific application requirements such as spatio-temporal and real-time constraints. After analyzing the strategies of each method and showing their limitations, a comparative evaluation on the large-scale CDnet 2014 dataset is provided. Finally, we conclude with some potential future research directions.

1. Introduction

Background subtraction is an essential process in several applications that need to model the background as well as to detect the moving objects in a scene, such as video
surveillance [1], optical motion capture [2] and multimedia [3]. We can name
different machine learning models which have been used for background
modeling and foreground detection such as Support Vector Machine (SVM)
models [4][5][6], fuzzy learning models [7][8][9], subspace learning models
[10][11][12], and neural network models [13][14][15]. Deep learning methods based on Deep Neural Networks (DNNs) with Convolutional Neural Networks (CNNs, also called ConvNets) have the ability to alleviate the disadvantages of parameter setting inherent in conventional neural networks. Although CNNs have existed for a long time, their application to computer vision was limited for a long period due to the lack of large training datasets, the size of the networks that could be considered, and the available computation power. One of the first breakthroughs was made in 2012 by Krizhevsky et al. [27], with the
supervised training of a CNN with 8 layers and millions of parameters; the training dataset was the ImageNet database with 1 million training images [28], the largest image dataset at that time. Since this research, along with the progress of storage devices and GPU computation power, even larger and deeper networks have become trainable. DNNs have also been applied to background/foreground separation in videos taken by a fixed camera. The deployment of DNNs brought a large performance improvement for background generation [17][31][36][40][41][42][43][44], background subtraction [59][60][61][62][63], ground-truth generation [64], and deep learned features [122][123][124][125][126]. The rest of this chapter is organized as follows. Section 2 reviews background subtraction models based on deep neural networks, comparing the different network architectures and discussing their adequacy for this task. A comparative evaluation on the large-scale ChangeDetection.Net (CDnet) 2014 dataset is given in Section 3. Finally, conclusions are given in Section 4.

2. Background Subtraction

The goal of background subtraction is to label pixels as background or foreground by comparing the background image with the current image. DNN-based methods dominate the performance ranking on the CDnet 2014 dataset, with six supervised models as follows: 1) FgSegNet_M [59] and its variants FgSegNet_S [60] and FgSegNet_V2 [61], 2) BSGAN [62] and its variant BSPVGAN [63], and 3) Cascaded CNNs [64]. These works were inspired by three unsupervised approaches based on multi-features/multi-cues or semantic methods (IUTIS-3 [65], IUTIS-5 [65], SemanticBGS [66]). Background subtraction is essentially a classification task and can therefore be solved successfully with DNNs.

2.1. Convolutional Neural Networks


One of the first attempts to use Convolutional Neural Networks (CNNs) for background subtraction was made by Braham and Van Droogenbroeck [67]. The model, named ConvNet, has a structure borrowed from LeNet-5 [68] with a few modifications. The subsampling is performed with max-pooling instead of averaging, and the hidden sigmoid activation function is replaced with rectified linear units (ReLU) for faster training. The approach can be divided into four stages: background image extraction via a temporal median in grey scale, scene-specific dataset generation, network training, and background subtraction. In practice, the background model is specific to each scene. For every frame in a video sequence, Braham and Van Droogenbroeck [67] extract an image patch around each pixel and combine it with the corresponding patch from the background model. The patch size in this work is 27*27. These combined patches are then used as input to the neural network to predict the probability of a pixel being foreground or background. The authors use 5*5 local receptive fields, and 3*3 non-overlapping receptive fields for all pooling layers. The numbers of feature maps of the first two convolutional layers are 6 and 16, respectively. The first fully connected layer consists of 120 neurons and the output layer is a single sigmoid unit. There are 20,243 parameters, which are trained using back-propagation with a cross-entropy loss function. For training, the algorithm needs the foreground results of a previous segmentation algorithm (IUTIS [65]) or the ground-truth information provided in CDnet 2014 [19]. The CDnet 2014 dataset was divided into two halves: one for training and one for testing. ConvNet shows a performance very similar to other state-of-the-art methods. Moreover, it significantly outperforms all other methods when the ground-truth information is used, especially on videos with hard shadows and night videos. The F-Measure score of ConvNet on the CDnet 2014 dataset is 0.9046. DNN approaches have also been applied to other applications such as vehicle detection [69] and pedestrian detection [127]. More precisely, Yan et al. [127] used a similar scheme to detect pedestrians with both visible and thermal images. The inputs of the network consist of the visible frame (RGB), the thermal frame (IR), the visible background (RGB) and the thermal background (IR), which sums up to an input size of 64*64*8. This method shows a great improvement on the OCTBVS dataset in comparison with T2F-MOG, SuBSENSE, and DECOLOR.
Remarks: ConvNet is one of the simplest approaches to model the differences between the background and the foreground using CNNs. A key merit of the study of Braham and Van Droogenbroeck [67] is that it was the first application of deep learning to background subtraction. For this reason, it can be used as a reference for comparison in terms of performance improvement. However, it has several limitations. First, it is difficult to learn high-level information through patches [93]. Second, due to the over-fitting caused by using highly redundant data for training, the network is scene-specific: in practice, the model can only process a given scene and needs to be retrained for other video scenes. For many applications where the camera is fixed and always captures a similar scene, this is not a serious problem, but it may be in other applications, as discussed by Hu et al. [71]. Third, ConvNet processes each pixel independently, so the foreground mask may contain isolated false positives and false negatives. Fourth, this method requires the extraction of a large number of patches from each frame of the video, which is computationally very expensive, as pointed out by Lim and Keles [59]. Fifth, the method requires pre- or post-processing of the data, and thus is not suitable for an end-to-end learning framework. In addition, the long-term dependencies of the input video sequences are not considered, since ConvNet uses only a few frames as input. Finally, ConvNet is a deep encoder-decoder (generator) network, and one of the disadvantages of classical generator networks is that they are unable to preserve object edges because they minimize a classical loss function (e.g., the Euclidean distance) between the predicted output and the ground truth [93], which leads to blurry foreground regions. Since this first valuable study, subsequent methods have been introduced to alleviate these limitations.
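To make the patch-based pipeline described above concrete, the following minimal NumPy sketch estimates a grey-scale background with a temporal median and scores every pixel from a pair of 27*27 patches; predict_foreground_prob stands for any trained patch classifier (e.g., a small CNN) and is an assumed placeholder, not the network of Braham and Van Droogenbroeck [67]:

    # Minimal sketch of a patch-based background subtraction pipeline.
    import numpy as np

    def temporal_median_background(frames):
        """frames: (T, H, W) grey-scale frames of the initialization period."""
        return np.median(frames, axis=0)

    def segment_frame(frame, background, predict_foreground_prob, patch=27, thr=0.5):
        half = patch // 2
        f = np.pad(frame, half, mode="reflect")
        b = np.pad(background, half, mode="reflect")
        mask = np.zeros(frame.shape, dtype=np.uint8)
        for y in range(frame.shape[0]):
            for x in range(frame.shape[1]):
                patches = np.stack([f[y:y + patch, x:x + patch],
                                    b[y:y + patch, x:x + patch]])  # (2, 27, 27) input
                prob = predict_foreground_prob(patches)            # P(pixel is foreground)
                mask[y, x] = 255 if prob > thr else 0
        return mask

The per-pixel loop above also makes explicit why this scheme is computationally expensive: one network evaluation is needed for every pixel of every frame.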

2.2. Multi-scale and Cascaded CNNs


A brief review of multi-scale and cascaded CNNs is given in this section. Wang et al. [64] targeted the problem of ground-truth generation in the context of background modeling algorithm validation, and they introduced a deep learning method for iterative generation of the ground truth. First, Wang et al. [64] extract a local patch of size 31*31 in each RGB channel of each pixel; this image patch is fed into a basic CNN and a multi-scale CNN. The CNN is built using 4 convolutional layers and 2 fully connected layers. The first 2 convolutional layers are each followed by a 2*2 max-pooling layer. The filter size of the convolutional layers is 7*7 and the authors use the Rectified Linear Unit (ReLU) as the activation function. Wang et al. [64] considered the CNN output as a likelihood probability, and a cross-entropy loss function is used for training. This model processes images of size 31*31; as a result, the algorithm is limited to patches of the same size or smaller. This limitation is alleviated by the multi-scale CNN model, which generates outputs at three different scales that are further combined at the original size. In order to model the dependencies among adjacent pixels and to enforce spatial coherence, thereby avoiding isolated false positives and false negatives in the foreground mask, Wang et al. [64] introduced a cascaded architecture called Cascaded CNN. Experiments showed that this CNN architecture has the advantage of learning its own features, which may be more discriminative than hand-designed features. The foreground objects in video frames are manually annotated and used to train the CNN to learn the foreground features. After the training step, the CNN generalizes to segment the remaining frames of the video. A scene-specific network, trained with 200 manually selected frames, was proposed by Wang et al. [64]. The Cascaded CNN achieves an F-Measure score of 0.9209 on the CDnet 2014 dataset. The CNN model was built with the Caffe library and MatConvNet. The Cascaded CNN suffers from several limitations: 1) the model is more suitable for ground-truth generation than for an automated background/foreground separation application, and 2) it is computationally expensive.
In another study, Lim and Keles [59] proposed a method based on a triplet CNN with a Transposed Convolutional Neural Network (TCNN) attached at the end of it in an encoder-decoder structure. The model, called FgSegNet_M, reuses the four blocks of the pre-trained VGG-16 [73] under a triplet framework as a multi-scale feature encoder. At the end of the network, a decoder network is integrated to map the features to a pixel-level foreground probability map. Finally, the binary segmentation labels are generated by applying a threshold to this map. Similar to the method proposed by Wang et al. [64], the network is trained with only a few frames (from 50 up to 200). Experimental results [59] show that FgSegNet_M outperforms both ConvNet [67] and Cascaded CNN [64]. In addition, it obtained an overall F-Measure score of 0.9770, which outperformed all the reported methods. A variant of FgSegNet_M, called FgSegNet_S, was introduced by Lim and Keles [60] by adding a feature pooling module (FPM) operating on top of the final encoder (CNN) layer. Lim and Keles [61] further improved the model by proposing a modified FPM with feature fusion; the resulting FgSegNet_V2 achieves the highest performance on the CDnet 2014 dataset. A common drawback of these methods is that they require a large amount of densely labeled video training data. To solve this problem, a novel strategy to train a multi-scale cascaded scene-specific (MCSS) CNN was proposed by Liao et al. [119]. The network is constructed by joining the ConvNet [67] and the multi-scale cascaded architecture [64], with a training procedure that takes advantage of the balance of positive and negative training samples. Experimental results demonstrate that MCSS obtains a score of 0.904 on the CDnet 2014 dataset (excluding the PTZ category), which outperforms Deep CNN [72], TCNN [95] and SFEN [104].
A multi-scale CNN-based background subtraction method was introduced by Liang et al. [128]. A specific CNN model is trained for each video to ensure accuracy, but the authors manage to avoid manual labeling. First, Liang et al. [128] use the SuBSENSE algorithm to generate an initial foreground mask. This initial foreground mask is not accurate enough to be used directly as ground truth; instead, it is used to select reliable pixels to guide the CNN training. A simple strategy to automatically select informative frames for guided learning is also proposed. Experiments on the CDnet 2014 dataset show that this Guided Multi-scale CNN outperforms DeepBS and SuBSENSE, with an F-Measure score of 0.759.

2.3. Fully CNNs


Cinelli [74] explored the advantages of Fully Convolutional Neural Networks (FCNNs) to reduce the computational requirements, and proposed a method similar to that of Braham and Van Droogenbroeck [67]. The fully connected layers of traditional convolutional networks are replaced by convolutional layers in the FCNN to remove the disadvantages caused by fully connected layers. The FCNN is tested with both the LeNet5 [68] and ResNet [75] architectures. Since ResNet [75] offers a higher degree of freedom in hyper-parameter setting (namely the size of the model and even the organization of layers) than LeNet5 [68], Cinelli [74] explored various features of the ResNet architecture in order to optimize them for background/foreground separation. The author used the networks designed for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which take 224*224 pixel images as input, and also those for the CIFAR-10 and CIFAR-100 datasets, which work with 32*32 pixel images. From this study, the two most accurate models on the CDnet 2014 dataset are the 32-layer CIFAR-derived dilated network and the pre-trained 34-layer ILSVRC-based dilated model adapted by direct substitution. However, only visual results, without F-Measure scores, were provided by Cinelli [74].
The idea of using an FCNN was also adopted by Yang et al. [76]. The authors introduced a network with a structure of shortcut-connected blocks with multiple branches. Each block provides four different branches: the first three branches extract different features by using atrous convolutions [78] with different rates, while the last branch is the shortcut connection. To integrate spatial information, atrous convolution [78] is employed instead of common convolution, which expands the receptive fields without missing considerable detail. The authors also use the Parametric Rectified Linear Unit (PReLU) [77], which introduces a learned parameter to transform the values less than zero. Yang et al. [76] further employed Conditional Random Fields (CRF) to refine the results. The authors show that the proposed method obtains better results than traditional background subtraction methods (MOG [79] and Codebook [80]) as well as more recent state-of-the-art methods (ViBe [81], PBAS [82] and P2M [83]) on the CDnet 2012 dataset. However, the experiments were evaluated on only 6 subsets of the CDnet 2012 dataset instead of all the categories of CDnet 2014, making a comparison with other DNN methods more difficult.
Alikan [84] designed a Multi-View receptive field Fully CNN (MV-FCN), which combines a fully convolutional structure, inception modules [85], and residual networking. In practice, MV-FCN is based on Unet [46] with inception modules [84] that apply convolutions with multiple filters at various scales on the same input, and it integrates two Complementary Feature Flows (CFF) and a Pivotal Feature Flow (PFF). The author also exploited intra-domain transfer learning in order to improve the accuracy of foreground region prediction. In MV-FCN, the inception modules are employed at early and late stages with three different sizes of receptive fields to capture invariance at different scales. To enhance the spatial representation, the features learned in the encoding phase are fused with the appropriate feature maps in the decoding phase through residual connections. These multi-view receptive fields, together with the residual feature connections, provide generalized features which improve the performance of pixel-wise foreground region identification. Alikan [84] evaluated the MV-FCN model on the CDnet 2014 dataset in comparison with classical neural networks (Stacked Multi-Layer [87], Multi-Layered SOM [26]) and two deep learning approaches (SDAE [88], Deep CNN [72]). However, only results on selected sequences are reported, which makes the comparison less complete.
Zeng and Zhu [89] targeted moving object detection in infrared videos and designed a Multiscale Fully Convolutional Network (MFCN). MFCN does not require the extraction of background images. The network takes as input frames from different video sequences and generates a probability map. The authors borrow the architecture of the VGG-16 net and use an input size of 224*224. The VGG-16 network consists of five blocks; each block contains several convolution and max-pooling operations. The deeper blocks have a lower spatial resolution and contain more high-level local features, whilst the lower blocks contain more low-level global features at a higher resolution. After the output feature layer, a contrast layer is added based on an average pooling operation with a kernel size of 3*3. Zeng and Zhu [89] proposed a set of deconvolution operations to upsample the features, creating an output probability map with the same size as the input, in order to exploit multiscale features from multiple layers. The cross-entropy is used to compute the loss function. The network uses pretrained VGG-16 weights for the corresponding layers, whilst the other weights are randomly initialized with a truncated normal distribution and then trained using the Adam optimizer. MFCN obtains the best score in the thermal (THM) category of the CDnet 2014 dataset with an F-Measure score of 0.9870, whereas Cascaded CNN [64] obtains 0.8958. Over all the categories, the F-Measure score of MFCN is 0.96. In a further study, Zeng and Zhu [90] introduced a method called CNN-SFC which fuses the results produced by different background subtraction algorithms (SuBSENSE [86], FTSG [91], and CwisarDH+ [92]) and achieves even better performance. This method outperforms its direct competitor IUTIS [65] on the CDnet 2014 dataset.

Lin et al. [93] proposed a deep Fully Convolutional Semantic Network (FCSN) for the background subtraction task. First, the FCN can learn the global differences between the foreground and the background. Second, the SuBSENSE algorithm [86] is used to generate a robust background image, which is concatenated with the video frame to form the input of the network. The weights of FCSN are initialized by partially using pre-trained weights of FCN-VGG16 [94], since these weights were obtained for semantic segmentation. By doing so, FCSN can retain the semantic information of images and converge faster. Experimental results show that, with the help of pre-trained weights, FCSN uses less training data and obtains better results.

2.4. Deep CNNs


Babaee et al. [72] designed a deep CNN for moving object detection. The approach consists of the following components: an algorithm to initialize the background via a temporal median model in RGB, a CNN model for background subtraction, and a post-processing step applied to the network output using a spatial median filter. The foreground and background pixels are first classified with the SuBSENSE algorithm [86]; then only the background pixel values are used to obtain the background median model. Babaee et al. [72] also used the Flux Tensor with Split Gaussian models (FTSG) [91] algorithm to obtain an adaptive memory length based on the motion of the camera and of the objects in the video frames. The CNN is trained with background images obtained by the SuBSENSE algorithm [86]. The network is trained with pairs of RGB image patches (triplets of size 37*37) from video frames, background frames and the respective ground-truth segmentation patches, using around 5% of the CDnet 2014 dataset. Babaee et al. [72] trained their model by combining training frames from various video sequences, including 5% of the frames from each video sequence; for this reason, their model is not scene-specific. In addition, the authors employ the same training procedure as ConvNet [67]. Image patches are combined with background patches before being fed to the network. The network consists of 3 convolutional layers and a 2-layer Multi-Layer Perceptron (MLP). Babaee et al. [72] use the Rectified Linear Unit (ReLU) as the activation function of each convolutional layer, whilst the last fully connected layer uses the sigmoid function. Moreover, in order to reduce the effect of overfitting and to allow higher learning rates for training, the authors use batch normalization before each activation layer. The post-processing step is implemented with spatial median filtering. This network generates a more accurate foreground mask than ConvNet [67] and is not very prone to outliers in the presence of dynamic backgrounds. Experimental results show that this deep CNN-based background subtraction outperforms existing algorithms when the challenge does not lie in background model maintenance. The F-Measure score of Deep CNN on the CDnet 2014 dataset is 0.7548. However, Deep CNN suffers from the following limitations: 1) it does not handle camouflaged regions within foreground objects very well, 2) it performs poorly on the PTZ video category, and 3) due to the corruption of the background images, it provides poor performance in the presence of large changes in the background.
In another work, Zhao et al. [95] designed an end-to-end two-stage deep CNN (TS-CNN) framework. The network consists of two stages: a convolutional encoder-decoder followed by a Multi-Channel Fully Convolutional sub-Network (MCFVN). The target of the first stage is to reconstruct the background images and encode rich prior knowledge of background scenes, whilst the second stage aims to accurately detect the foreground. The authors jointly optimize the reconstruction loss and the segmentation loss. In practice, the encoder consists of a set of convolutions which represent the input image as a latent feature vector; the feature vectors are used by the decoder to restore the background image. The l2 distance is used to compute the reconstruction loss. The encoder-decoder network learns from the training data to separate the background from the input image and to restore a clean background image. After training, the second network can learn the semantic knowledge of the foreground and background. Therefore, the model is able to handle various challenges such as night light, shadows and camouflaged foreground objects. Experimental results [95] show that TS-CNN obtains an F-Measure score of 0.7870 and is more accurate than SuBSENSE [86], PAWCS [99], FTSG [91] and SharedModel [100] in the case of night videos, camera jitter, shadows, thermal imagery and bad weather. The joint TS-CNN achieves a score of 0.8124 on the CDnet 2014 dataset.
Li et al. [101] proposed to predict object locations in a surveillance scene with an adaptive deep CNN (ADCNN). First, the generic CNN-based classifier is transferred to the surveillance scene by selecting useful kernels. After that, a regression model is employed to learn the context information of the surveillance scene in order to obtain accurate location predictions. ADCNN obtains very promising performance on several surveillance datasets for pedestrian detection and vehicle detection. However, ADCNN focuses on object detection and thus does not use the principle of background subtraction. Moreover, the performance of ADCNN was reported on the CUHK square dataset [102], the MIT traffic dataset [103] and PETS 2007 instead of the CDnet 2014 dataset.
In another study, Chen et al. [104] proposed to detect moving objects by using pixel-level semantic features with an end-to-end deep sequence learning network. The authors used a deep convolutional encoder-decoder network to extract pixel-level semantic features from video frames. For the experiments, VGG-16 [73] is used as the encoder-decoder network, but other frameworks, such as GoogLeNet [85] and ResNet50 [75], can also be used. An attention long short-term memory model, named Attention ConvLSTM, is employed to model the pixel-wise changes over time. After that, Chen et al. [104] combined a Spatial Transformer Network (STN) model with a Conditional Random Fields (CRF) layer to reduce the sensitivity to camera motion and to smooth the foreground boundaries. The proposed method achieves results similar to ConvNet [67] overall, whilst outperforming it on the "Night videos", "Camera jitter", "Shadow" and "Turbulence" categories of the CDnet 2014 dataset. Using VGG-16, the Attention ConvLSTM obtained an F-Measure of 0.8292; with GoogLeNet and ResNet50, the F-Measure scores are 0.7360 and 0.8772, respectively.

2.5. Structured CNNs


Lim et al. [105] designed an encoder-decoder structured CNN (Struct-CNN) for background subtraction. The proposed approach includes the following components: background image extraction with a temporal median in RGB, network training, background subtraction, and foreground extraction based on super-pixel processing. The architecture is similar to the VGG-16 network [73] except for the fully connected layers. The encoder takes 3-channel (RGB) images of size 336*336 pixels as input and generates, through convolutional and max-pooling layers, a 21*21*512 feature vector. After that, the decoder uses deconvolutional and unpooling layers to convert the feature vector into a 1-channel image of size 336*336 pixels providing the foreground mask. This encoder-decoder structured network is trained in an end-to-end manner using CDnet 2014. The network involves 6 deconvolutional layers and 4 unpooling layers. The authors used the Parametric Rectified Linear Unit (PReLU) [78] as the activation function, and batch normalization is employed for all the deconvolutional layers except the last one. The last deconvolutional layer can be considered as the prediction layer; it uses the sigmoid activation function to normalize the outputs and provide the foreground mask. Lim et al. [105] used a 5*5 kernel for all convolutional layers and a 3*3 kernel for the prediction layer. Super-pixel information obtained by an edge detector was also used to suppress incorrect boundaries and holes in the foreground mask. Experimental results on CDnet 2014 show that Struct-CNN outperforms SuBSENSE [86], PAWCS [99], FTSG [91] and SharedModel [100] in the case of bad weather, camera jitter, low frame rate, intermittent object motion and thermal imagery. The F-Measure score excluding the "PTZ" category is 0.8645. The authors excluded this category, arguing that they focused only on static cameras.

2.6. 3D CNNs
Sakkos et al. [106] proposed an end-to-end 3D-CNN to track temporal changes in video sequences without using a background model for the training. For this reason, the 3D-CNN is able to process multiple scenes without further fine-tuning. The network architecture is inspired by the C3D branch [107]. In practice, the 3D-CNN outperforms ConvNet [67] and Deep CNN [72]. Furthermore, the evaluation on the ESI dataset [108], which contains extreme and sudden illumination changes, shows that the 3D-CNN obtains higher scores than two background subtraction methods specifically designed to be illumination invariant (Universal Multimode Background Subtraction (UMBS) [109] and ESI [108]). On the CDnet 2014 dataset, the proposed framework achieved an average F-Measure of 0.9507.
Yu et al. [117] designed a spatio-temporal attention-based 3D ConvNet to jointly learn the appearance and motion of objects of interest in a video, with a Relevant Motion Event detection Network (ReMotENet). Similar to the work of Sakkos et al. [106], the architecture of the proposed network is borrowed from the C3D branch [107]. However, instead of using max pooling both spatially and temporally, the authors separated the spatial and temporal max pooling in order to capture fine-grained temporal information and to make the network deeper so that it learns better representations. Experimental results show that ReMotENet obtains results comparable to the object-detection-based method while being three to four orders of magnitude faster. With a model size of less than 1 MB, it is able to detect relevant motion in a 15-second video in 4-8 milliseconds on a GPU and in a fraction of a second on a CPU.
In another work, Hu et al. [71] developed a 3D atrous CNN model which can learn deep spatial-temporal features without losing resolution information. The authors combined this model with two convolutional long short-term memory (ConvLSTM) networks to capture both short-term and long-term spatio-temporal information of the input frames. In addition, the 3D atrous ConvLSTM does not require any pre- or post-processing of the data, but processes the data in a completely end-to-end manner. Experimental results on the CDnet 2014 dataset show that the 3D atrous CNN outperforms SuBSENSE, Cascaded CNN and DeepBS.

2.7. Generative Adversarial Networks (GANs)


Bakkay et al. [110] designed a model named BScGAN, a background subtraction method based on a conditional Generative Adversarial Network (cGAN). The proposed model involves two successive networks: a generator and a discriminator. The former models the mapping from the background and the current image to the foreground mask, whilst the latter learns a loss function to train this mapping by comparing the ground truth and the predicted output, given the input image and background. The architecture of BScGAN is similar to the encoder-decoder architecture of the Unet network with skip connections [46]. In practice, the authors built the encoder using down-sampling layers that decrease the size of the feature maps, followed by convolutional filters. It consists of 8 convolutional layers. The first layer uses a 7*7 convolution which generates 64 feature maps, and the last convolutional layer computes 512 feature maps of size 1*1. Before training, their weights are randomly initialized. The 6 middle convolutional layers are six ResNet blocks. Bakkay et al. [110] used Leaky-ReLU non-linearities as the activation function of all encoder layers. The decoder generates an output image with the same resolution as the input one, using up-sampling layers followed by deconvolutional filters. Its architecture is similar to that of the encoder, but with a reverse layer ordering and with the down-sampling layers replaced by up-sampling layers. The architecture of the discriminator network includes 4 convolutional and down-sampling layers. The convolutional layers use a filter size of 3*3 with randomly initialized weights. The first layer generates 64 feature maps, whilst the last layer computes 512 feature maps of size 30*30. Leaky ReLU functions are employed as activation functions. Experimental results on the CDnet 2014 dataset demonstrate that BScGAN obtains higher scores than ConvNet [67], Cascaded CNN [64], and Deep CNN [76], with an average F-Measure score of 0.9763 excluding the PTZ category.
Zheng et al. [112] proposed a Bayesian GAN (BGAN) network. The authors first use a median filter to extract the background, and then train a Bayesian generative adversarial network to classify each pixel, which makes the model robust to the challenges of sudden and slow illumination changes, non-stationary backgrounds, and ghosts. In practice, the generator and the discriminator of the Bayesian generative adversarial network are constructed by adopting deep convolutional neural networks. In a further study, Zheng et al. [113] improved the performance of BGAN with a parallel version named BPVGAN.

Bahri et al. [114] introduced Neural Unsupervised Moving Object Detection (NUMOD), which is an end-to-end framework. The network is built based on the batch method named ILISD [115]. Thanks to the parameterization with the generative neural network, NUMOD is able to work in both online and batch mode. Each video frame is decomposed into three components: background, foreground and illumination changes. The background model is generated by finding a low-dimensional manifold for the background of the image sequence, using a fully connected generative neural network. The architecture of NUMOD consists of Generative Fully Connected Networks (GFCN). The first one, named Net1, estimates the background image from the input image whilst the second one, named Net2, generates the background image from the illumination invariant image. Net1 and Net2 share the same architecture. First, the input to GFCN is an optimizable low-dimensional latent vector. Then, two fully connected hidden layers are employed with the ReLU nonlinearity as the activation function. The second hidden layer is fully connected to the output layer, which is activated by the sigmoid function. A loss term is computed to constrain the output of GFCN to be similar to the current input frame. In practice, GFCN can be considered as the decoder part of an auto-encoder with a small modification: in GFCN, the low-dimensional latent code is a free parameter that can be optimized and is the input to the network, rather than being learned by an encoder as in the case of an auto-encoder. The performance of GFCN, evaluated on a subset of the CDnet 2014 dataset, shows that GFCN is more robust to illumination changes than GRASTA [55], COROLA [116] and DAN with Adaptive Tolerance Measure [43].
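The key design point of GFCN, a free low-dimensional latent code that is optimized directly instead of being produced by an encoder, can be sketched as follows; the layer widths, the latent size and the reconstruction loss are illustrative assumptions rather than NUMOD's exact configuration.

```python
# Sketch of a generative fully connected network (GFCN-style) whose input is a
# free latent vector optimized jointly with the weights (PyTorch assumed).
import torch
import torch.nn as nn

class GFCN(nn.Module):
    def __init__(self, latent_dim, out_dim, hidden=256):
        super().__init__()
        # The latent code is a trainable parameter, not an encoder output.
        self.z = nn.Parameter(torch.zeros(latent_dim))
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.Sigmoid())   # image values in [0, 1]

    def forward(self):
        return self.net(self.z)

# Fit the generated "background" to one flattened gray-level frame.
frame = torch.rand(32 * 32)                     # stand-in for a real video frame
model = GFCN(latent_dim=8, out_dim=frame.numel())
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((model() - frame) ** 2)   # similarity-to-frame loss term
    loss.backward()
    opt.step()
background = model().detach().reshape(32, 32)
```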

3. Experimental Results

To have a fair comparison, we present the results obtained on the well-known publicly available CDnet 2014 dataset, which was developed as part of the Change Detection Workshop challenge (CDW 2014). CDnet 2014 contains 22 additional camera-captured videos providing 5 new categories compared to CDnet 2012. These additional videos are used to incorporate challenges that were not addressed in the 2012 dataset. The categories are listed as follows: baseline, dynamic backgrounds, camera jitter, shadows, intermittent object motion, thermal, challenging weather, low frame-rate, night videos, PTZ and turbulence. In CDnet 2014, the ground truths of only the first half of every video in the 5 new categories are made publicly available for testing, unlike CDnet 2012, which publishes the ground truth of all video frames. However, the evaluation is reported for all frames. All the challenges of these different categories have different spatial and temporal properties.
The F-measures obtained by the different DNN algorithms are compared with the F-measures of other representative background subtraction algorithms over the complete evaluation dataset: (1) two conventional statistical models (MOG [79], RMOG [132]), and (2) three advanced non-parametric models (SuBSENSE [126], PAWCS [127], and Spectral-360 [114]). The evaluation of deep learning-based background separation models is reported on the following categories:
• Pixel-wise algorithms: The algorithms in this category were directly applied by the authors to background/foreground separation without considering spatial and temporal constraints. Thus, they may introduce isolated false positives and false negatives. We compare two algorithms: FgSegNet (multi-scale) [59] and BScGAN [110].
• Temporal-wise algorithms: These algorithms model the dependencies among adjacent temporal pixels and thus enforce temporal coherence. We compare one algorithm: 3D CNN [106].
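All the comparisons below rely on the F-measure computed from binary foreground masks; a minimal sketch of that metric is given here for reference. The per-category numbers in Table 1 are, in essence, averages of such per-video scores.

```python
# F-measure of a binary foreground mask against its ground truth (NumPy assumed).
import numpy as np

def f_measure(pred, gt):
    """pred, gt: boolean arrays of the same shape (True = foreground)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: a detection of a 20x20 square shifted by two pixels.
gt = np.zeros((64, 64), dtype=bool); gt[20:40, 20:40] = True
pred = np.zeros_like(gt);            pred[22:42, 20:40] = True
print(round(f_measure(pred, gt), 3))   # -> 0.9
```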
Table 1 groups the different F-measures, which come either from the corresponding papers or from the CDnet 2014 website. In the same way, Table 2 shows some visual results obtained using SuBSENSE [126], FgSegNet-V2 [61], and BPVGAN [63].
Table 1. F-measure metric over the 11 categories of CDnet 2014, namely Baseline (BSL), Dynamic Background (DBG), Camera Jitter (CJT), Intermittent Object Motion (IOM), Shadows (SHD), Thermal (THM), Bad Weather (BDW), Low Frame Rate (LFR), Night Videos (NVD), PTZ, and Turbulence (TBL). In bold, the best score in each algorithm category. The top 10 methods are indicated with their rank. There are three groups of leading methods: the FgSegNet group, the 3D-CNN group and the GAN group.
Algorithms (Authors) BSL DBG CJT IOM SHD THM BDW LFR NVD PTZ TBL Average F-Measure
Basic statistical models
MOG [79] 0.8245 0.6330 0.5969 0.5207 0.7156 0.6621 0.7380 0.5373 0.4097 0.1522 0.4663 0.5707
RMOG [132] 0.7848 0.7352 0.7010 0.5431 0.7212 0.4788 0.6826 0.5312 0.4265 0.2400 0.4578 0.5735
Advanced non parametric models
SuBSENSE [126] 0.9503 0.8117 0.8152 0.6569 0.8986 0.8171 0.8619 0.6445 0.5599 0.3476 0.7792 0.7408
PAWCS [127] 0.9397 0.8938 0.8137 0.7764 0.8913 0.8324 0.8152 0.6588 0.4152 0.4615 0.645 0.7403
Spectral-360 [114] 0.933 0.7872 0.7156 0.5656 0.8843 0.7764 0.7569 0.6437 0.4832 0.3653 0.5429 0.7054
Multi-scale or/and cascaded CNNs
FgSegNet-M (Spatial-wise) [59] 0.9973 0.9958 0.9954 0.9951 0.9937 0.9921 0.9845 0.8786 0.9655 0.9843 0.9648 0.9770 Rank 3
FgSegNet-S (Spatial-wise) [60] 0.9977 0.9958 0.9957 0.9940 0.9927 0.9937 0.9897 0.8972 0.9713 0.9879 0.9681 0.9804 Rank 2
FgSegNet-V2 (Spatial-wise) [61] 0.9978 0.9951 0.9938 0.9961 0.9955 0.9938 0.9904 0.9336 0.9739 0.9862 0.9727 0.9847 Rank 1
3D CNNs
3D CNN (Temporal-wise) [106] 0.9691 0.9614 0.9396 0.9698 0.9706 0.9830 0.9509 0.8862 0.8565 0.8987 0.8823 0.9507 Rank 7
3D Atrous CNN (Spatial/Temporal-wise) [71] 0.9897 0.9789 0.9645 0.9637 0.9813 0.9833 0.9609 0.8994 0.9489 0.8582 0.9488 0.9615 Rank 5
FC3D (Spatial/Temporal-wise) [133] 0.9941 0.9775 0.9651 0.8779 0.9881 0.9902 0.9699 0.8575 0.9595 0.924 0.9729 0.9524 Rank 6
MFC3D (Spatial/Temporal-wise) [133] 0.9950 0.9780 0.9744 0.8835 0.9893 0.9924 0.9703 0.9233 0.9696 0.9287 0.9773 0.9619 Rank 4
Generative Adversarial Networks
BScGAN (Pixel-wise) [110] 0.9930 0.9784 0.9770 0.9623 0.9828 0.9612 0.9796 0.9918 0.9661 - 0.9712 0.9763 Rank 10
BGAN (Pixel-wise) [62] 0.9814 0.9763 0.9828 0.9366 0.9849 0.9064 0.9465 0.8472 0.8965 0.9194 0.9118 0.9339 Rank 9
BPVGAN (Pixel-wise) [63] 0.9837 0.9849 0.9893 0.9366 0.9927 0.9764 0.9644 0.8508 0.9001 0.9486 0.9310 0.9501 Rank 8

Table 2. Visual results on the CDnet 2014 dataset. From left to right: original images, ground-truth images, SuBSENSE [126], FgSegNet-V2 [61], BPVGAN [63]. The rows show one sequence per category: Bad Weather (Skating, in002349), Baseline (Pedestrian, in000490), Camera Jitter (Badminton, in001123), Dynamic Background (Fall, in002416) and Intermittent Object Motion (Sofa, in001314).
4. Conclusion

In this chapter, we have presented a full review of recent advances in the use of deep neural networks applied to background subtraction for the detection of moving objects in video taken by a static camera. The experiments reported on the large-scale CDnet 2014 dataset show the performance gain obtained by supervised deep neural network methods in this field. Although applying deep neural networks to the background subtraction problem has received significant attention in the last two years, since the paper of Braham and Van Droogenbroeck [67], many important issues remain unsolved. Researchers need to answer the question: what is the most suitable type of deep neural network, and its corresponding architecture, for background initialization, background subtraction and deep-learned features in the presence of complex backgrounds? Several authors avoid experiments on the "PTZ" category and, when the F-Measure is provided, the score is not always very high. Thus, it seems that the deep neural networks tested so far have difficulties in the case of moving cameras. In the field of background subtraction, only convolutional neural networks and generative adversarial networks have been employed. Thus, future directions may investigate the adequacy of deep belief neural networks, deep restricted kernel neural networks [129], probabilistic neural networks [130] and fuzzy neural networks [131], in the case of static cameras as well as moving cameras.

References
[1] S. Cheung, C. Kamath, “Robust Background Subtraction with Foreground Validation for
Urban Traffic Video”, Journal of Applied Signal Processing, 14, 2330-2340, 2005.
[2] J. Carranza, C. Theobalt. M. Magnor, H. Seidel, “Free-Viewpoint Video of Human Actors”,
ACM Transactions on Graphics, 22 (3), 569-577, 2003.
[3] F. El Baf, T. Bouwmans, B. Vachon, “Comparison of Background Subtraction Methods for
a Multimedia Learning Space”, SIGMAP 2007, Jul. 2007.
[4] I. Junejo, A. Bhutta, H Foroosh, “Single Class Support Vector Machine (SVM) for Scene
Modeling”, Journal of Signal, Image and Video Processing, May 2011.
[5] J. Wang, G. Bebis, M. Nicolescu, M. Nicolescu, R. Miller, “Improving target detection by
coupling it with tracking”, Machine Vision and Application, pages 1-19, 2008.
[6] A. Tavakkoli, M. Nicolescu, G. Bebis, “A Novelty Detection Approach for Foreground
Region Detection in Videos with Quasi-stationary Backgrounds”, ISVC 2006, pages 40-49,
Lake Tahoe, NV, November 2006.
[7] F. El Baf, T. Bouwmans, B. Vachon, “Fuzzy integral for moving object detection”, IEEE
FUZZ-IEEE 2008, pages 1729–1736, June 2008.
[8] F. El Baf, T. Bouwmans, B. Vachon, “Type-2 fuzzy mixture of Gaussians model:
Application to background modeling”, ISVC 2008, pages 772–781, December 2008.
[9] T. Bouwmans, “Background Subtraction for Visual Surveillance: A Fuzzy Approach”, Chapter 5, Handbook on Soft Computing for Video Surveillance, Taylor and Francis Group, pages 103–139, March 2012.
[10] N. Oliver, B. Rosario, A. Pentland, “A Bayesian computer vision system for modeling
human interactions”, ICVS 1999, January 1999.
[11] Y. Dong, G. DeSouza, “Adaptive learning of multi-subspace for foreground detection under
illumination changes”, Computer Vision and Image Understanding, 2010.
[12] D. Farcas, C. Marghes, T. Bouwmans, “Background subtraction via incremental maximum
margin criterion: A discriminative approach”, Machine Vision and Applications,
23(6):1083–1101, October 2012.
[13] M. Chacon-Muguia, S. Gonzalez-Duarte, P. Vega, “Simplified SOM-neural model for
video segmentation of moving objects”, IJCNN 2009, pages 474-480, 2009.
[14] M. Chacon-Murguia, G. Ramirez-Alonso, S. Gonzalez-Duarte, “Improvement of a neural-
fuzzy motion detection vision model for complex scenario conditions”, International Joint
Conference on Neural Networks, IJCNN 2013, August 2013.
[15] M. Molina-Cabello, E. Lopez-Rubio, R. Luque-Baena, E. Domínguez, E. Palomo,
"Foreground object detection for video surveillance by fuzzy logic based estimation of pixel
illumination states", Logic Journal of the IGPL, September 2018.
[16] E. Candès, X. Li, Y. Ma, J. Wright, “Robust principal component analysis?”, Journal of the ACM, 58(3), May 2011.
[17] P. Xu, M. Ye, Q. Liu, X. Li, L. Pei, J. Ding, “Motion Detection via a Couple of Auto-
Encoder Networks”, IEEE ICME 2014, 2014.
[18] N. Goyette, P. Jodoin, F. Porikli, J. Konrad, P. Ishwar, “Changedetection.net: A new change
detection benchmark dataset”, IEEE Workshop on Change Detection, CDW 2012 in
conjunction with CVPR 2012, June 2012.
[19] Y. Wang, P. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, P. Ishwar, “CDnet 2014: an
expanded change detection benchmark dataset”, IEEE Workshop on Change Detection,
CDW 2014 in conjunction with CVPR 2014, June 2014.
[20] A. Schofield, P. Mehta, T. Stonham, “A system for counting people in video images using
neural networks to identify the background scene”, Pattern Recognition, 29:1421–1428,
1996.
[21] P. Gil-Jimenez, S. Maldonado-Bascon, R. Gil-Pita, H. Gomez-Moreno, “Background pixel
classification for motion detection in video image sequences”, IWANN 2003, 2686:718–
725, 2003.
[22] L. Maddalena, A. Petrosino, “A self-organizing approach to detection of moving patterns
for real-time applications”, Advances in Brain, Vision, and Artificial Intelligence,
4729:181–190, 2007.
[23] L. Maddalena, A. Petrosino, “Multivalued background/foreground separation for moving
object detection”, WILF 2009, pages 263–270, June 2009.
[24] L. Maddalena, A. Petrosino, “The SOBS algorithm: What are the limits?”, IEEE Workshop
on Change Detection, CVPR 2012, June 2012.
[25] L. Maddalena, A. Petrosino, “The 3dSOBS+ algorithm for moving object detection”, CVIU
2014, 122:65–73, May 2014.
[26] G. Gemignani, A. Rozza, “A novel background subtraction approach based on multi-
layered self organizing maps”, IEEE ICIP 2015, 2015.
[27] A. Krizhevsky, I. Sutskever, G. Hinton, “ImageNet: Classification with Deep Convolutional
Neural Networks”, NIPS 2012, pages 1097–1105, 2012.
[28] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, “Imagenet: A large-scale hierarchical
image database”, IEEE CVPR 2009, 2009.
[29] T. Bouwmans, L. Maddalena, A. Petrosino, “Scene Background Initialization: A Taxonomy”, Pattern Recognition Letters, January 2017.
[30] P. Jodoin, L. Maddalena, A. Petrosino, Y. Wang, “Extensive Benchmark and Survey of
Modeling Methods for Scene Background Initialization”, IEEE Transactions on Image
Processing, 26(11):5244– 5256, November 2017.
[31] I. Halfaoui, F. Bouzaraa, O. Urfalioglu, “CNN-Based Initial Background Estimation”, ICPR
2016, 2016.
[32] S. Javed, A. Mahmood, T. Bouwmans, S. Jung, “Background- Foreground Modeling Based
on Spatio-temporal Sparse Subspace Clustering”, IEEE Transactions on Image Processing,
26(12):5840– 5854, December 2017.
[33] B. Laugraud, S. Pierard, M. Van Droogenbroeck, “A method based on motion detection for
generating the background of a scene”, Pattern Recognition Letters, 2017.
[34] B. Laugraud, S. Pierard, M. Van Droogenbroeck,"LaBGen-P-Semantic: A First Step for
Leveraging Semantic Segmentation in Background Generation", MDPI Journal of Imaging
Volume 4, No. 7, Art. 86, 2018.
[35] T. Bouwmans, E. Zahzah, “Robust PCA via principal component pursuit: A review for a
comparative evaluation in video surveillance”, CVIU 2014, 122:22–34, May 2014.
[36] R. Guo, H. Qi, “Partially-sparse restricted Boltzmann machine for background modeling
and subtraction”, ICMLA 2013, pages 209–214, December 2013.
[37] T. Haines, T. Xiang, “Background subtraction with Dirichlet processes”, European
Conference on Computer Vision, ECCV 2012, October 2012.
[38] A. Elgammal, L. Davis, “Non-parametric model for background subtraction”, European
Conference on Computer Vision, ECCV 2000, pages 751–767, June 2000.
[39] Z. Zivkovic, “Efficient adaptive density estimation per image pixel for the task of
background subtraction”, Pattern Recognition Letters, 27(7):773–780, January 2006.
[40] L. Xu, Y. Li, Y. Wang, E. Chen, “Temporally adaptive restricted Boltzmann machine for
background modeling”, AAAI 2015, January 2015.
[41] A. Sheri, M. Rafique, M. Jeon, W. Pedrycz, “Background subtraction using Gaussian
Bernoulli restricted Boltzmann machine”, IET Image Processing, 2018.
[42] A. Rafique, A. Sheri, M. Jeon, “Background scene modeling for PTZ cameras using RBM”,
ICCAIS 2014, pages 165–169, 2014.
[43] P. Xu, M. Ye, X. Li, Q. Liu, Y. Yang, J. Ding, “Dynamic Background Learning through
Deep Auto-encoder Networks”, ACM International Conference on Multimedia, Orlando,
FL, USA, November 2014.
[44] Z. Qu, S. Yu, M. Fu, “Motion background modeling based on context-encoder”, IEEE
ICAIPR 2016, September 2016.
[45] Y. Tao, P. Palasek, Z. Ling, I. Patras, “Background modelling based on generative Unet”,
IEEE AVSS 2017, September 2017.
[46] O. Ronneberger, P. Fischer, T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
[47] M. Gregorio, M. Giordano, “Background modeling by weightless neural networks”, SBMI
2015 Workshop in conjunction with ICIAP 2015, September 2015.
[48] G. Ramirez, J. Ramirez, M. Chacon, “Temporal weighted learning model for background
estimation with an automatic re-initialization stage and adaptive parameters update”, Pattern
Recognition Letters, 2017.
[49] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin,
M. Cohen, “Interactive digital photomontage”, ACM Transactions on Graphics, 23(1):294–
302, 2004.
[50] B. Laugraud, S. Pierard, M. Van Droogenbroeck, “LaBGen-P: A pixel-level stationary background generation method based on LaBGen”, Scene Background Modeling Contest in conjunction with ICPR 2016, 2016.
[51] I. Goodfellow et al., “Generative adversarial networks”, NIPS 2014, 2014.
[52] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen., “Improved
techniques for training GANs”, NIPS 2016, 2016.
[53] M. Sultana, A. Mahmood, S. Javed, S. Jung, “Unsupervised deep context prediction for
background estimation and foreground segmentation”. Preprint, May 2018.
[54] X. Guo, X. Wang, L. Yang, X. Cao, Y. Ma, “Robust foreground detection using smoothness
and arbitrariness constraints”, European Conference on Computer Vision, ECCV 2014,
September 2014.
[55] J. He, L. Balzano, J. Luiz, “Online robust subspace tracking from partial information”, IT
2011, September 2011.
[56] J. Xu, V. Ithapu, L. Mukherjee, J. Rehg, V. Singh, “GOSUS: Grassmannian online subspace
updates with structured-sparsity”, IEEE ICCV 2013, September 2013.
[57] T. Zhou, D. Tao, “GoDec: randomized low-rank and sparse matrix decomposition in noisy
case”, International Conference on Machine Learning, ICML 2011, 2011.
[58] X. Zhou, C. Yang, W. Yu, “Moving object detection by detecting contiguous outliers in the
low-rank representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence,
35:597-610, 2013.
[59] L. Lim, H. Keles, “Foreground Segmentation using a Triplet Convolutional Neural Network
for Multiscale Feature Encoding”, Preprint, January 2018.
[60] K. Lim, L. Ang, H. Keles, “Foreground Segmentation Using Convolutional Neural
Networks for Multiscale Feature Encoding”, Pattern Recognition Letters, 2018.
[61] K. Lim, L. Ang, H. Keles, “Learning Multi-scale Features for Foreground Segmentation”,
arXiv preprint arXiv:1808.01477, 2018.
[62] W. Zheng, K. Wang, and F. Wang. Background subtraction algorithm based on bayesian
generative adversarial networks. Acta Automatica Sinica, 2018.
[63] W. Zheng, K. Wang, and F. Wang. A novel background subtraction algorithm based on
parallel vision and Bayesian GANs. Neurocomputing, 2018.
[64] Y. Wang, Z. Luo, P. Jodoin, “Interactive deep learning method for segmenting moving
objects”, Pattern Recognition Letters, 2016.
[65] S. Bianco, G. Ciocca, R. Schettini, “How far can you get by combining change detection
algorithms?” CoRR, abs/1505.02921, 2015.
[66] M. Braham, S. Pierard, M. Van Droogenbroeck, "Semantic Background Subtraction", IEEE
ICIP 2017, September 2017.
[67] M. Braham, M. Van Droogenbroeck, “Deep background subtraction with scene-specific
convolutional neural networks”, International Conference on Systems, Signals and Image
Processing, IWSSIP2016, Bratislava, Slovakia, May 2016.
[68] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, 86:2278–2324, November 1998.
[69] C. Bautista, C. Dy, M. Manalac, R. Orbe, M. Cordel, “Convolutional neural network for
vehicle detection in low resolution traffic videos”, TENCON 2016, 2016.
[70] C. Lin, B. Yan, W. Tan, “Foreground detection in surveillance video with fully
convolutional semantic network”, IEEE ICIP 2018, pages 4118-4122, October 2018.
[71] Z. Hu, T. Turki, N. Phan, J. Wang, “3D atrous convolutional long short-term memory
network for background subtraction”, IEEE Access, 2018
[72] M. Babaee, D. Dinh, G. Rigoll, “A deep convolutional neural network for background
subtraction”, Pattern Recognition, September 2017.
[73] K. Simonyan, A. Zisserman, “Very deep convolutional networks for large-scale image
recognition”, arXiv preprint arXiv:1409.1556, 2014.
[74] L. Cinelli, “Anomaly Detection in Surveillance Videos using Deep Residual Networks",
Master Thesis, Universidade de Rio de Janeiro, February 2017.
[75] K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition", IEEE CVPR 2016, June 2016.
[76] L. Yang, J. Li, Y. Luo, Y. Zhao, H. Cheng, J. Li, "Deep Background Modeling Using Fully
Convolutional Network", IEEE Transactions on Intelligent Transportation Systems, 2017.
[77] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, “DeepLab: Semantic image
segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs”,
Tech. Rep., 2016.
[78] K. He, X. Zhang, S. Ren, J. Sun, "Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification", IEEE ICCV 2015, pages 1026–1034, 2015.
[79] C. Stauffer, W. Grimson, “Adaptive background mixture models for real-time tracking”,
IEEE CVPR 1999, pages 246-252, 1999.
[80] K. Kim, T. H. Chalidabhongse, D. Harwood, L. Davis, “Background Modeling and
Subtraction by Codebook Construction”, IEEE ICIP 2004, 2004
[81] O. Barnich, M. Van Droogenbroeck, “ViBe: a powerful random technique to estimate the
background in video sequences”, ICASSP 2009, pages 945-948, April 2009.
[82] M. Hofmann, P. Tiefenbacher, G. Rigoll, "Background Segmentation with Feedback: The
Pixel-Based Adaptive Segmenter", IEEE Workshop on Change Detection, CVPR 2012,
June 2012
[83] L. Yang, H. Cheng, J. Su, X. Li, “Pixel-to-model distance for robust background
reconstruction", IEEE Transactions on Circuits and Systems for Video Technology, April
2015.
[84] T. Akilan, "A Foreground Inference Network for Video Surveillance using Multi-View
Receptive Field", Preprint, January 2018.
[85] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, A. Rabinovich,
"Going deeper with convolutions", IEEE CVPR 2015, pages 1-9, 2015.
[86] P. St-Charles, G. Bilodeau, R. Bergevin, "Flexible Background Subtraction with Self-
Balanced Local Sensitivity", IEEE CDW 2014, June 2014.
[87] Z. Zhao, X. Zhang, Y. Fang, “Stacked multilayer self-organizing map for background
modeling” IEEE Transactions on Image Processing, Vol. 24, No. 9, pages. 2841–2850,
2015.
[88] Y. Zhang, X. Li, Z. Zhang, F. Wu, L. Zhao, “Deep learning driven bloc-kwise moving
object detection with binary scene modeling”, Neurocomputing, Vol. 168, pages 454-463,
2015.
[89] D. Zeng, M. Zhu, "Multiscale Fully Convolutional Network for Foreground Object
Detection in Infrared Videos", IEEE Geoscience and Remote Sensing Letters, 2018.
[90] D. Zeng, M. Zhu, “Combining Background Subtraction Algorithms with Convolutional
Neural Network”, Preprint, 2018.
[91] R. Wang, F. Bunyak, G. Seetharaman, K. Palaniappan, “Static and moving object detection
using flux tensor with split Gaussian model”, IEEE CVPR 2014 Workshops, pages 414–
418, 2014.
[92] M. De Gregorio, M. Giordano, “CwisarDH+: Background detection in RGBD videos by
learning of weightless neural networks”, ICIAP 2017, pages 242–253, 2017.
[93] C. Lin, B. Yan, W. Tan, "Foreground Detection in Surveillance Video with Fully
Convolutional Semantic Network", IEEE ICIP 2018, pages 4118-4122, Athens, Greece,
October 2018.
[94] J. Long, E. Shelhamer, T. Darrell, “Fully convolutional networks for semantic
segmentation,” IEEE CVPR 2015, pages 3431-3440, 2015.
[95] X. Zhao, Y. Chen, M. Tang, J. Wang, "Joint Background Reconstruction and Foreground
Segmentation via A Two-stage Convolutional Neural Network", Preprint, 2017.
[96] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. Efros, "Context encoders: Feature
learning by inpainting", arXiv preprint arXiv:1604.07379, 2016.
[97] A. Radford, L. Metz, S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks", Computer Science, 2015.
[98] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, “Deeplab: Semantic image
segmentation with deep convolutional nets, atrous convolution and fully connected CRFs,”
arXiv preprint arXiv:1606.00915, 2016.
[99] P. St-Charles, G. Bilodeau, R. Bergevin, “A Self-Adjusting Approach to Change Detection
Based on Background Word Consensus", IEEE Winter Conference on Applications of
Computer Vision, WACV 2015, 2015.
[100] Y. Chen, J. Wang, H. Lu, “Learning sharable models for robust background subtraction”,
IEEE ICME 2015, pages 1-6, 2015.
[101] X. Li, M. Ye, Y. Liu, C. Zhu, “Adaptive Deep Convolutional Neural Networks for Scene-
Specific Object Detection”, IEEE Transactions on Circuits and Systems for Video
Technology, September 2017.
[102] M. Wang, W. Li, X. Wang, "Transferring a generic pedestrian detector towards specific scenes", IEEE CVPR 2012, pages 3274-3281, 2012.
[103] X. Wang, X. Ma, W Grimson, "Unsupervised activity perception in crowded and
complicated scenes using hierarchical Bayesian models", IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 31, No. 3, pages 539-555, March 2009.
[104] Y. Chen, J. Wang, B. Zhu, M. Tang, H. Lu, "Pixel-wise Deep Sequence Learning for
Moving Object Detection", IEEE Transactions on Circuits and Systems for Video
Technology, 2017.
[105] K. Lim, W. Jang, C. Kim, "Background subtraction using encoder-decoder structured
convolutional neural network", IEEE AVSS 2017, Lecce, Italy, 2017
[106] D. Sakkos, H. Liu, J. Han, L. Shao, “End-to-end video background subtraction with 3D
convolutional neural networks”, Multimedia Tools and Applications, pages 1-19, December
2017.
[107] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Palur, "C3D: generic features for video
analysis", IEEE ICCV 2015, 2015.
[108] L. Vosters, C. Shan, T. Gritti, “Real-time robust background subtraction under rapidly
changing illumination conditions”, Image Vision and Computing, 30(12):1004-1015, 2012.
[109] H. Sajid, S. Cheung. “Universal multimode background subtraction”, IEEE Transactions on
Image Processing, 26(7):3249–3260, May 2017.
[110] M. Bakkay, H. Rashwan, H. Salmane, L. Khoudour, D. Puig, Y. Ruichek, "BSCGAN:
Deep Background Subtraction with Conditional Generative Adversarial Networks", IEEE
ICIP 2018, Athens, Greece, October 2018.
[111] P. Isola, J. Zhu, T. Zhou, A. Efros, “Image-to-image translation with conditional adversarial
networks”, arXiv preprint, 2017.
[112] W. Zheng, K. Wang, F. Wang, "Background Subtraction Algorithm based on Bayesian
Generative Adversarial Networks", Acta Automatica Sinica, 2018.
[113] W. Zheng, K. Wang, F. Wang, "A Novel Background Subtraction Algorithm based on
Parallel Vision and Bayesian GANs", Neurocomputing, 2018.
[114] F. Bahri, M. Shakeri, N. Ray, "Online Illumination Invariant Moving Object Detection by
Generative Neural Network", Preprint, 2018.
[115] M. Shakeri, H. Zhang, “Moving object detection in time-lapse or motion trigger image
sequences using low-rank and invariant sparse decomposition”, IEEE ICCV 2017, pages
5133–5141, 2017.
[116] M. Shakeri, H. Zhang, “COROLA: A sequential solution to moving object detection using
low-rank approximation”, Computer Vision and Image Understanding, 146:27-39, 2016.
[117] R. Yu, H. Wang, L. Davis, "ReMotENet: Efficient Relevant Motion Event Detection for
Large-scale Home Surveillance Videos", Preprint, January 2018.
[118] X. Liang, S. Liao, X. Wang, W. Liu, Y. Chen, S. Li, "Deep Background Subtraction with
Guided Learning", IEEE ICME 2018 San Diego, USA, July 2018.
[119] J. Liao, G. Guo, Y. Yan, H. Wang, "Multiscale Cascaded Scene-Specific Convolutional Neural Networks for Background Subtraction", Pacific Rim Conference on Multimedia, PCM 2018, pages 524-533, 2018.
[120] S. Lee, D. Kim, "Background Subtraction using the Factored 3-Way Restricted Boltzmann
Machines", Preprint, 2018.
[121] P. Fischer, A. Dosovitskiy, E. Ilg, P. Hausser, C. Hazirbas¸, V. Golkov, P. Smagt,
D. Cremers, T. Brox, “Flownet: Learning optical flow with convolutional networks”, arXiv
preprint arXiv:1504.06852, 2015.
[122] Y. Zhang, X. Li, Z. Zhang, F. Wu, L. Zhao, “Deep Learning Driven Blockwise Moving
Object Detection with Binary Scene Modeling”, Neurocomputing, June 2015.
[123] M. Shafiee, P. Siva, P. Fieguth, A. Wong, “Embedded Motion Detection via Neural
Response Mixture Background Modeling”, CVPR 2016, June 2016.
[124] M. Shafiee, P. Siva, P. Fieguth, A. Wong, “Real-Time Embedded Motion Detection via
Neural Response Mixture Modeling”, Journal of Signal Processing Systems, June 2017.
[125] T. Nguyen, C. Pham, S. Ha, J. Jeon, "Change Detection by Training a Triplet Network for
Motion Feature Extraction", IEEE Transactions on Circuits and Systems for Video
Technology, January 2018.
[126] S. Lee, D. Kim, "Background Subtraction using the Factored 3-Way Restricted Boltzmann
Machines", Preprint, 2018.
[127] Y. Yan, H. Zhao, F. Kao, V. Vargas, S. Zhao, J. Ren, "Deep Background Subtraction of
Thermal and Visible Imagery for Pedestrian Detection in Videos", BICS 2018, 2018.
[128] X. Liang, S. Liao, X. Wang, W. Liu, Y. Chen, S. Li, "Deep Background Subtraction with
Guided Learning", IEEE ICME 2018, July 2018.
[129] J. Suykens, “Deep Restricted Kernel Machines using Conjugate Feature Duality”, Neural
Computation, Vol. 29, pages 2123-2163, 2017.
[130] J. Gast, S. Roth, “Lightweight Probabilistic Deep Networks”, Preprint, 2018.
[131] Y. Deng, Z. Ren, Y. Kong, F. Bao, Q. Dai, “A Hierarchical Fused Fuzzy Deep Neural
Network for Data Classification”, IEEE Transactions on Fuzzy Systems, Vol. 25, No. 4,
pages 1006-1012, 2017.
[132] V. Sriram, P. Miller, and H. Zhou. “Spatial mixture of Gaussians for dynamic background
modelling.” IEEE International Conference on Advanced Video and Signal Based
Surveillance, 2013.
[133] Y. Wang, Z. Yu, L. Zhu, "Foreground Detection with Deeply Learned Multi-scale Spatial-
Temporal Features", MDPI Sensors, 2018.
[134] V. Mondéjar-Guerra, J. Rouco, J. Novo, M. Ortega, "An end-to-end deep learning approach
for simultaneous background modeling and subtraction", British Machine Vision
Conference, September 2019.
CHAPTER 1.4

SIMILARITY DOMAINS NETWORK FOR MODELING SHAPES AND
EXTRACTING SKELETONS WITHOUT LARGE DATASETS

Sedat Ozer
Bilkent University, Ankara, Turkey
sedatist@gmail.com

In this chapter, we present a method to model and extract the skeleton of a shape with the recently proposed similarity domains network (SDN). SDN is especially useful when only one image sample is available and when no additional pre-trained model is available. SDN is a neural network with one hidden layer and explainable kernel parameters. Kernel parameters have a geometric meaning within the SDN framework, which is encapsulated with similarity domains (SDs) within the feature space. We model the SDs with Gaussian kernel functions. A similarity domain is a d-dimensional sphere in the d-dimensional feature space centered at an important data sample; any other data sample that falls inside the similarity domain of that important sample is considered similar to it, and they share the same class label. In this chapter, we first demonstrate how SDN can help us model a pixel-based image in terms of SDs and then demonstrate how those learned SDs can be used to extract the skeleton from a shape.

1. Introduction

Recent advances in deep learning have moved the attention of many researchers to neural network based solutions for shape understanding, shape analysis and parametric shape modeling. While a significant amount of research on skeleton extraction and modeling from shapes has been done in the past, recent advances in deep learning and their improved success in object detection and classification applications have also moved the attention of researchers towards neural network based solutions for skeleton extraction and modeling. In this chapter, we introduce a novel shape modeling algorithm based on Radial Basis Networks (RBNs), which are a particular type of neural network that utilizes radial basis functions (RBFs) as activation functions in its hidden layer. RBFs have been used in the literature for many classification tasks, including the original LeNet architecture [1]. Even though RBFs are useful in modeling surfaces and in various classification tasks as in [2–8], when the goal is modeling a shape and extracting a skeleton, many challenges appear associated with utilizing RBFs in neural networks. Two such challenges are: (I) estimating the optimal number of RBFs used in the network (e.g., the number of yellow circles in our 2D image examples) along with their optimal locations (their centroid values), and (II) estimating the optimal parameters of RBFs by relating them to shapes geometrically.


(a) Binary input image (b) Altered output image using SDs

(c) Visualization of all the SDs (d) Visualization of only the foreground SDs
Fig. 1. This figure demonstrates how the shape parameters of SDN can be utilized on shapes. (a) Original binary
input image is shown. (b) The altered image by utilizing the SDN’s shape parameters. Each object is scaled and
shifted at different and individual scales. We first used a region growing algorithm to isolate the kernel parameters
for each object and then individually scaled and shifted them. (c) All the computed shape parameters of the input
binary image are visualized. (d) Only the foreground parameters are visualized.

The kernel parameters are typically known as the scale or the shape parameter (representing the radius of a circle in the figures), and the two terms are used interchangeably in the literature. The standard RBNs as defined in [9] apply the same kernel parameter value to each basis function
used in the network architecture. Recent literature focused on using multiple kernels with
individual and distinct kernel parameters as in [10] and [11]. While the idea of utilizing
different kernels with individual parameters has been heavily studied in the literature under
the “Multiple Kernel Learning” (MKL) framework as formally modeled in [11], there are
not many efficient approaches and available implementations focusing on utilizing multiple
kernels with their own parameters in RBNs for shape modeling. Recently, the work in [12] combined the optimization advances achieved in the kernel machines domain with radial basis networks and introduced a novel algorithm for shape analysis. In this chapter, we refer to that algorithm as the “Similarity Domains Network” (SDN) and discuss its benefits from both the shape modeling (see Figure 1) and the skeleton extraction perspectives. As we demonstrate in this chapter, the computed similarity domains of SDN can be used not only for obtaining parametric models of shapes but also for obtaining models of their skeletons.

2. Related Work

Skeleton extraction has been widely studied in the literature, as in [13–16]. However, in this chapter, we focus on how to utilize SDs that are obtained by a novel and recently introduced algorithm, the Similarity Domains Network, and demonstrate how to obtain parametric models for shapes and how to extract the skeleton of a shape. SDs retain only a portion of the entire data; thus, they provide a means to reduce the complexity of computations for skeleton extraction and shape modeling. Our proposed algorithm, SDN, is related to both radial basis networks and kernel machines. However, in this chapter, we mainly discuss and present our novel algorithm from the neural networks perspective and relate it to radial basis networks (RBNs). In the past, RBN related research mainly focused on computing the optimal kernel parameter (i.e., the scale or shape parameter) that was used in all of the RBFs, as in [17, 18]. While the parameter computation for multiple kernels has been heavily studied under the MKL framework in the literature (for examples, see the survey papers [19, 20]), the computation of multiple kernel parameters in RBNs has been mostly studied under two main approaches: using optimization or using heuristic methods. For example, in [21], the authors proposed using multiple scales as opposed to using a single scale value in RBNs. Their approach first computes the standard deviation of each cluster (after applying a k-means like clustering on the data) and then uses a scaled version of the standard deviation of each cluster as the shape parameter for each RBF in the network. The work in [22] used a similar approach by using the root-mean-square-deviation (RMSD) value between the RBF centers and the data values for each RBF in the network. The authors used a modified orthogonal least squares (OLS) algorithm to select the RBF centers. The work in [10] used the k-means algorithm on the training data to choose k centers and used those centers as RBF centers. Then it used separate optimizations for computing the kernel parameters and the kernel weights (see the next section for the formal definitions). Using additional optimization steps for different sets of parameters is costly and makes it harder to interpret those parameters and to relate them to shapes geometrically and accurately. As an alternative solution, the work in [12] proposed a geometric approach by using the distance between the data samples as a geometric constraint. In [12], we did not use the well known MKL model. Instead, we defined the interpretable similarity domains concept using RBFs and developed our own optimization approach with geometric constraints, similar to the original Sequential Minimal Optimization (SMO) algorithm [23].
In this chapter, we demonstrate using SDN for parametric shape modeling and skeleton extraction. Unlike the existing work on radial basis networks, instead of applying an initial k-means or OLS algorithm to compute the kernel centers separately or using multiple cost functions, SDN chooses the RBF centers and their number automatically via its sparse modeling and uses a single cost function to be optimized with its geometric constraint. That is where SDN differs from other similar RBN works, as they would have issues in computing all those parameters within a single optimization step while automatically adjusting the number of RBFs used in the network sparsely.

Fig. 2. An illustration of SDN as a radial basis network. The network contains a single hidden layer.
The input layer (d dimensional input vector) is connected to n radial basis functions. The output is
the weighted sum of the radial basis functions’ outputs.

3. Similarity Domains

A similarity domain [12] is a geometric concept that defines a local similarity around a particular data sample, where that sample represents the center of a similarity sphere (i.e., similarity domain) in the Euclidean space. Through similarity domains, we can define a unified optimization problem in which the kernel parameters are computed automatically and geometrically. We formalize the similarity domain of $x_i \in \mathbb{R}^d$ as the sphere in $\mathbb{R}^d$ whose center is the support vector (SV) $x_i$ and whose radius is $r_i$. The radius $r_i$ is defined as follows:
For any (+1) labelled support vector $x_i^+$, where $x_i^+ \in \mathbb{R}^d$ and the superscript (+) represents the (+1) class:
$$ r_i = \min(\| x_i^+ - x_1^- \|, \ldots, \| x_i^+ - x_k^- \|)/2 \qquad (1) $$
where the superscript (-) denotes the (-1) class.
For any (-1) labelled support vector $x_i^-$:
$$ r_i = \min(\| x_i^- - x_1^+ \|, \ldots, \| x_i^- - x_k^+ \|)/2. \qquad (2) $$
In this work, we use the Gaussian kernel function to represent similarities and similarity domains as follows:
$$ K_{\sigma_i}(x, x_i) = \exp(- \| x - x_i \|^2 / \sigma_i^2) \qquad (3) $$
where $\sigma_i$ is the kernel parameter for SV $x_i$. The similarity (kernel) function takes its maximum value where $x = x_i$. The relation between $r_i$ and $\sigma_i$ is $r_i^2 = a\sigma_i^2$, where $a$ is a domain specific scalar (constant). In our image experiments, the value of $a$ is found via a grid search and we observed that setting $a = 2.85$ suffices for all the images used in our experiments.
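As a quick illustration of Eqs. (1)-(3), the sketch below computes the similarity-domain radii and the Gaussian kernel for a toy 2D data set in NumPy; it only restates the definitions above and is not part of the SDN training procedure.

```python
# Similarity-domain radii (Eqs. 1-2) and the Gaussian kernel (Eq. 3) for a toy
# 2D data set; NumPy assumed. Each support vector's radius is half the distance
# to the closest sample of the opposite class.
import numpy as np

def sd_radii(X, y):
    """X: (n, d) samples, y: labels in {-1, +1}. Returns one radius per sample."""
    radii = np.empty(len(X))
    for i, (xi, yi) in enumerate(zip(X, y)):
        other = X[y != yi]                      # samples of the opposite class
        radii[i] = np.min(np.linalg.norm(other - xi, axis=1)) / 2.0
    return radii

def gaussian_kernel(x, xi, sigma2):
    return np.exp(-np.sum((x - xi) ** 2) / sigma2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 1.0]])
y = np.array([+1, +1, -1, -1])
r = sd_radii(X, y)              # e.g. r[1] = ||(1,0)-(4,0)|| / 2 = 1.5
sigma2 = r ** 2 / 2.85          # using r_i^2 = a * sigma_i^2 with a = 2.85
print(r, gaussian_kernel(X[0], X[1], sigma2[1]))
```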
Note that, in contrast to [24, 25], our similarity domain definition differs from the term “minimal enclosing sphere”. In our approach, we define the term similarity domain as the dominant region of a SV in which the SV is the centroid and all the points within the domain are similar to the SV. The boundary of the similarity domain of a SV is defined based on its distance to the closest point from the other class. Thus, any given vector within a similarity domain (a region) will be similar to the associated SV of that similarity domain. We will use the similarity domain concept to define a kernel machine that computes its kernel parameters automatically and geometrically in the next section.

(a) The input image (b) All $r_i$ (c) Foreground $r_i$

Fig. 3. (color online) Visualization of the SDN kernel parameters at T = 0.05 with zero pixel error learning. The blue area represents the background and the yellow area represents the foreground. The red dots are the RBF centers and the yellow circles around them show the boundaries of SDs. The green lines are the radii ($r_i$) of the SDs. The $r_i$ are obtained from the computed $\sigma_i$. (a) Original image: 141x178 pixels. (b) Visualization of all the $r_i$ from both background and foreground, with a total of 1393 centers. (c) Visualization of only the $r_i$ for the object, with a total of 629 foreground centers (i.e., using only 2.51% of all image pixels). All images are resized to fit into the figure.

4. Similarity Domains Network

A typical radial basis network (RBN) includes a single hidden layer and uses a radial basis function (RBF) as the activation function in each neuron of that hidden layer (i.e., the hidden layer uses n RBFs). Similar to an RBN, a similarity domains network also uses a single hidden layer where the activation functions are radial basis functions. Unlike a typical RBN, which uses the same kernel parameter in all the radial basis functions in the hidden layer, SDN uses a different kernel parameter for each RBF used in the hidden layer. The illustration of SDN as a radial basis network is given in Figure 2. While the number of RBFs in the hidden layer is decided by different algorithms in RBNs (as discussed in the previous section), SDN assigns an RBF to each training sample. In the figure, the hidden layer uses all of the n training data as RBF centers and then, through its sparse optimization, it selects a subset of the training data (e.g., a subset of pixels for shape modeling) and reduces that number n to k, where n ≥ k. SDN models the decision boundary as a weighted combination of Similarity Domains (SDs). A similarity domain is a d-dimensional sphere in the d-dimensional feature space. Each similarity domain is centered at an RBF center and modeled with a Gaussian RBF in SDN. SDN estimates the label $y$ of a given input vector $x$ as $\hat{y}$ as shown below:
$$ \hat{y} = \mathrm{sign}(f(x)) \quad \text{and} \quad f(x) = \sum_{i=1}^{k} \alpha_i y_i K_{\sigma_i}(x, x_i), \qquad (4) $$
where the scalar $\alpha_i$ is a nonzero weight for the RBF center $x_i$, $y_i \in \{-1, +1\}$ is the class label of the training data and $k$ is the total number of RBF centers. $K(\cdot)$ is the Gaussian RBF kernel defined as:
$$ K_{\sigma_i}(x, x_i) = \exp(- \| x - x_i \|^2 / \sigma_i^2) \qquad (5) $$

where $\sigma_i$ is the shape parameter for the center $x_i$. The centers are automatically selected among the training data during the training via the following cost function:
$$ \max_{\alpha} Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K_{\sigma_{ij}}(x_i, x_j), \qquad (6) $$
$$ \text{subject to: } \sum_{i=1}^{n} \alpha_i y_i = 0, \quad C \geq \alpha_i \geq 0 \ \text{ for } i = 1, 2, ..., n, $$
$$ \text{and } K_{\sigma_{ij}}(x_i, x_j) < T, \ \text{ if } y_i y_j = -1, \ \forall i, j $$
where $T$ is a constant scalar value assuring that the RBF function yields a smaller value for any given pair of samples from different classes. The shape parameter $\sigma_{ij}$ is defined as $\sigma_{ij} = \min(\sigma_i, \sigma_j)$. For a given closest pair of vectors $x_i$ and $x_j$ for which $y_i y_j = -1$, we can define the kernel parameters as follows:
$$ \sigma_i^2 = \sigma_j^2 = \frac{- \| x_i - x_j \|^2}{\ln(K(x_i, x_j))} \qquad (7) $$
As a result, the decision function takes the form:
$$ f(x) = \sum_{i=1}^{k} \alpha_i y_i \exp\!\left(-\frac{\| x - x_i \|^2}{\sigma_i^2}\right) - b \qquad (8) $$
where $k$ is the total number of support vectors. In our algorithm, the bias value $b$ is constant and equal to 0.
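Given trained centers, weights and per-center kernel parameters, evaluating the decision function of Eq. (8) (with b = 0) is straightforward; the sketch below assumes those quantities are already available and does not implement the constrained optimization of Eq. (6).

```python
# Evaluating the SDN decision function of Eq. (8): a weighted sum of Gaussian
# RBFs with per-center kernel parameters and zero bias (NumPy assumed).
import numpy as np

def sdn_decision(x, centers, alphas, labels, sigma2):
    """centers: (k, d); alphas, labels, sigma2: length-k arrays; x: (d,)."""
    dist2 = np.sum((centers - x) ** 2, axis=1)
    return np.sum(alphas * labels * np.exp(-dist2 / sigma2))

def sdn_predict(x, centers, alphas, labels, sigma2):
    return 1 if sdn_decision(x, centers, alphas, labels, sigma2) >= 0 else -1

# Toy usage with two hand-set centers, one per class.
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([+1, -1])
sigma2 = np.array([1.0, 1.0])
print(sdn_predict(np.array([0.5, 0.0]), centers, alphas, labels, sigma2))  # -> +1
```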
Discussion: Normally, avoiding the term $b$ in the decision function eliminates the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$ in the optimization problem. However, since $y_i \in \{-1, +1\}$, the sum in that constraint can be rewritten as $\sum_{i=1}^{n} \alpha_i y_i = \sum_{i=1}^{m_1} \alpha_i - \sum_{i=1}^{m_2} \alpha_i = 0$, where $m_1 + m_2 = n$. This means that if the $\alpha_i$ values are around the value of 1 (or equal to 1) $\forall i$, then this constraint also means that the total number of support vectors from each class should be equal or similar to each other, i.e., $m_1 \approx m_2$. That is why we keep the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$ in our algorithm as that would help us to compute SVs from both classes with comparable numbers.
The decision function $f(x)$ can be expressed as $f(x) = \sum_{i=1}^{k_1} \alpha_i y_i K_i(x, x_i) + \sum_{j=1}^{k_2} \alpha_j y_j K_j(x, x_j)$, where $k_1$ is the total number of SVs near the vector $x$, i.e., those for which the Euclidean norm satisfies $\| x_i - x \|^2 - \sigma_i^2 \approx 0$ (or less), and $k_2$ is the total number of SVs for which $\| x_j - x \|^2 - \sigma_j^2 \gg 0$. Notice that $k_1 + k_2 = k$. This property suggests that local predictions can be made by the approximated decision function $f(x) \approx \sum_{i=1}^{k_1} \alpha_i y_i K_i(x, x_i)$. This approach can simplify the computations on large datasets as, in this approach, we do not require access to all of the available SVs. Further details on SDs and the SDN formulation can be found in [12].
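The local approximation above can be sketched with the same quantities by restricting the sum to the centers whose similarity domains are close to the query point; the slack factor used below to decide which centers count as near is an illustrative choice.

```python
# Local approximation of f(x): sum only over centers close to the query point,
# since far centers contribute kernel values close to zero (NumPy assumed).
import numpy as np

def sdn_decision_local(x, centers, alphas, labels, sigma2, slack=3.0):
    dist2 = np.sum((centers - x) ** 2, axis=1)
    near = dist2 <= slack * sigma2          # keep only the "near" centers
    return np.sum(alphas[near] * labels[near] * np.exp(-dist2[near] / sigma2[near]))
```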

5. Parametric Shape Modeling with SDN

Sparse and parametric shape modeling is a challenge in the literature. For shape modeling, we propose using SDNs. SDN can model shapes sparsely with its computed kernel parameters. For that, we first train SDN to learn the shape as a decision boundary from the given binary image: we label the shape (e.g., the white region in Fig. 3a) as foreground and label everything else (e.g., the black region in Fig. 3a) as background, while using each pixel's 2D coordinates as features. Once the image is learned by SDN, the computed kernel parameters of SDN along with their 2D coordinates are used to model the shape with our one-class classifier without performing any re-training.
As mentioned earlier, we can use Gaussian RBFs and their shape parameters (i.e., the kernel parameters) to model shapes parametrically within the SDN framework. For that purpose, we can save and use only the foreground (the shape's) RBF (or SD) centers and their shape parameters to obtain a one-class classifier. The computed RBF centers of SDN can be grouped for the foreground and for the background as $C_1 = \{x_i\}_{i=1, y_i = +1}^{s_1}$ and $C_2 = \{x_i\}_{i=1, y_i = -1}^{s_2}$, where $s_1 + s_2 = k$, $s_1$ is the total number of centers from the (+1) class and $s_2$ is the total number of centers from the (-1) class. Since the Gaussian kernel functions (RBFs) now represent local SDs geometrically, the original decision function $f(x)$ can now be approximated by using only $C_1$ (or by using only $C_2$). Therefore, we define the one-class approximation, using only the centers and their associated kernel parameters from $C_1$, for any given $x$ as follows:
$$ \hat{y} = +1, \ \text{if } \| x - x_i \| < \sqrt{a\sigma_i^2}, \ \exists x_i \in C_1; \quad \text{otherwise } \hat{y} = -1, \qquad (9) $$
where the SD radius for the $i$th center $x_i$ is defined as $\sqrt{a\sigma_i^2}$ and $a$ is a domain specific constant. One-class approximation examples are given in Figure 1b, where we used only the SDs from the foreground to reconstruct the altered image.
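In code, the one-class rule of Eq. (9) is a simple membership test against the foreground SDs; the sketch below uses a = 2.85 as in our experiments, with all other values chosen only for illustration.

```python
# One-class approximation of Eq. (9): a point is foreground if it falls inside
# the similarity domain of at least one foreground center (NumPy assumed).
import numpy as np

def one_class_predict(x, fg_centers, fg_sigma2, a=2.85):
    """Return +1 if x lies inside any foreground SD, else -1."""
    radii = np.sqrt(a * fg_sigma2)                    # SD radius per center
    dist = np.linalg.norm(fg_centers - x, axis=1)
    return 1 if np.any(dist < radii) else -1

# Example: two foreground SDs; the first query point falls inside the first SD.
fg_centers = np.array([[1.0, 1.0], [5.0, 5.0]])
fg_sigma2 = np.array([0.5, 0.5])
print(one_class_predict(np.array([1.2, 1.0]), fg_centers, fg_sigma2))   # -> +1
print(one_class_predict(np.array([3.0, 3.0]), fg_centers, fg_sigma2))   # -> -1
```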

6. Extracting the Skeleton from SDs

The parametric and geometric properties of SDN provide new parameters to analyze a shape via its similarity domains. Furthermore, while typical neural network based applications for skeleton estimation focus on learning from multiple images, SDN can learn a shape's parameters from only the given single image, without requiring any additional data set or a pre-trained model. Therefore, SDN is advantageous especially in cases where the data is very limited or only one sample shape is available.
Once learned and computed by the SDN, the similarity domains (SDs) can be used to extract the skeleton of a given shape. When computed by considering only the existing SDs, the process of skeleton extraction requires only a subset of pixels (i.e., the SDs) during the computation. To extract the skeleton, we first bin the computed shape parameters ($\sigma_i^2$) into m bins (in our experiments, m is set to 10). Since the majority of the similarity domains typically lie around the object (or shape) boundary, they have small values.

(a) All of the foreground $r_i$ (b) Skeleton for $\sigma_i^2 > 29.12$ (c) Skeleton for $\sigma_i^2 > 48.32$
(d) Skeleton for $\sigma_i^2 > 67.51$ (e) Skeleton for $\sigma_i^2 > 86.71$ (f) Skeleton for $\sigma_i^2 > 105.90$
Fig. 4. The results of filtering the shape parameters at different thresholds are visualized for the image shown in Figure 3a. The remaining similarity domains after thresholding and the skeletons extracted from those similarity domains are visualized: (b) for $\sigma_i^2 > 29.12$; (c) for $\sigma_i^2 > 48.32$; (d) for $\sigma_i^2 > 67.51$; (e) for $\sigma_i^2 > 86.71$; (f) for $\sigma_i^2 > 105.90$.

With a simple thresholding process, we can eliminate the small SDs from the subset in which we search for the skeleton. Eliminating them first gives us a smaller number of SDs to consider for skeleton extraction. After eliminating those small SDs and their computed parameters with a simple thresholding process, we connect the centers of the remaining SDs by tracing the overlapping SDs. If non-overlapping SDs exist within the same shape after the thresholding process, we perform linear estimation and connect the closest SDs. We interpolate a line between those closest SDs to visualize the skeleton in our figures. Thresholding the kernel parameters of the SDs at different values yields different sets of SDs and, therefore, we obtain different skeletons, as shown in Figure 4.
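A naive version of this thresholding-and-linking procedure can be sketched as follows. The binning into m = 10 bins and the overlap test follow the description above, while the simple pairwise linking of overlapping centers is an illustrative implementation choice rather than the exact procedure used for the figures.

```python
# Naive skeleton extraction from similarity domains (NumPy assumed): threshold
# the shape parameters, keep the large SDs, and link centers whose SDs overlap.
import numpy as np

def skeleton_from_sds(centers, sigma2, a=2.85, m=10, keep_from_bin=1):
    # Bin the shape parameters into m bins and keep SDs above a bin threshold.
    bin_edges = np.histogram_bin_edges(sigma2, bins=m)
    threshold = bin_edges[keep_from_bin]            # e.g. drop the smallest bin
    keep = sigma2 > threshold
    kept_centers, kept_r = centers[keep], np.sqrt(a * sigma2[keep])

    # Link every pair of kept centers whose similarity domains overlap.
    segments = []
    for i in range(len(kept_centers)):
        for j in range(i + 1, len(kept_centers)):
            if np.linalg.norm(kept_centers[i] - kept_centers[j]) < kept_r[i] + kept_r[j]:
                segments.append((kept_centers[i], kept_centers[j]))
    return kept_centers, segments                   # segments approximate the skeleton

# Toy example: three large SDs along a line plus many small boundary SDs.
rng = np.random.default_rng(1)
centers = np.vstack([[[10, 10], [20, 10], [30, 10]], rng.uniform(0, 40, (50, 2))])
sigma2 = np.concatenate([[40.0, 40.0, 40.0], rng.uniform(0.5, 2.0, 50)])
pts, segs = skeleton_from_sds(centers, sigma2)
print(len(pts), len(segs))
```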

7. Experiments

In this section, we demonstrate how to use SDN for parametric shape learning from a given single input image without requiring any additional dataset. Since it is hard to model shapes with the standard RBNs, and since no good RBN implementation was available to us, we did not use any other RBN in our experiments for comparison. As discussed in the earlier sections, the standard RBNs have many issues and multiple individual steps to compute the RBN parameters, including the total number of RBF centers, the center values, and the shape parameters at those centers. However, a comparison of kernel machines (SVMs) and SDN on shape modeling has already been presented in the literature (see [12]). Therefore, in this section, we focus on parametric shape modeling and skeleton extraction from SDs by using SDNs. In the figures, all the images are resized to fit into the figures.

7.1. Parametric Shape Modeling with SDs

Here, we first demonstrate visualizing the computed shape parameters of SDN on a given sample image in Figure 3. Figure 3a shows the original input image. We used each pixel's 2D coordinates in the image as the features in the training data, and each pixel's color (being black or white) as the training labels. SDN is trained at T = 0.05. SDN learned and modeled the shape and reconstructed it with zero pixel error by using 1393 SDs. The pixel error is the total number of wrongly classified pixels in the image. Figure 3b visualizes all the computed shape parameters of the RBF centers of SDN as circles and Figure 3c visualizes the ones for the foreground only. The radius of a circle in all figures is computed as $\sqrt{a\sigma_i^2}$, where $a = 2.85$. We found the value of $a$ through a heuristic search and noticed that 2.85 suffices for all the shape experiments that we had. There are a total of 629 foreground RBF centers computed by SDN (only 2.51% of all the input image pixels).
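The training set used in this experiment is built directly from the binary image, with pixel coordinates as features and the pixel colour as the ±1 label; a short sketch of that construction is given below (the SDN training step itself is left out).

```python
# Building SDN training data from a single binary image (NumPy assumed):
# every pixel's (row, col) coordinate is a feature vector and its colour
# (foreground/background) gives the +1/-1 class label.
import numpy as np

def image_to_training_set(binary_image):
    """binary_image: 2D array, nonzero = foreground. Returns X (n, 2), y (n,)."""
    rows, cols = np.indices(binary_image.shape)
    X = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    y = np.where(binary_image.ravel() > 0, 1, -1)
    return X, y

# Toy 5x5 image with a 3x3 foreground square.
img = np.zeros((5, 5), dtype=int)
img[1:4, 1:4] = 1
X, y = image_to_training_set(img)
print(X.shape, (y == 1).sum())     # -> (25, 2) features, 9 foreground pixels
```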

7.2. Skeleton Extraction From the SDs

Next, we demonstrate the skeleton extraction from the computed similarity domains as a proof of concept. Extracting the skeleton from the SDs, as opposed to extracting it from the pixels, simplifies the computations, as the SDs are only a small portion of the total number of pixels (reducing the search space). To extract the skeleton from the computed SDs, we first quantize the shape parameters of the object into 10 bins and then, starting from the largest bin, we select the most useful bin value to threshold the shape parameters. The remaining SD centers are connected based on their overlapping similarity domains. If multiple SDs overlap inside the same SD, we look at their centers and ignore the SDs whose centers fall within the same SD (keeping only the original SD center). That is why some points of the shape are not considered as a part of the skeleton in Figure 4. Figure 4 shows the remaining (thresholded) SD centers and their radii at various thresholds in yellow. In the figure, the skeletons (shown as blue lines) are extracted by considering only the SDs remaining after thresholding, as explained in Section 6. Another example is shown in Figure 5. The input binary image is shown in Figure 5a. Figure 5b shows all the foreground similarity domains. The learned SDs are thresholded and the corresponding skeleton extracted from the remaining SDs is visualized as a blue line in Figure 5c.
One benefit of using only SDs to re-compute skeletons is that, as the SDs are a subset of the training data, the number of SDs is considerably smaller than the total number of pixels that would need to be considered for skeleton computation. While the skeleton extraction algorithm shown here is a naive and basic one, the goal is to show the use of SDs for the (re)computation of skeletons instead of using all the pixels of the shape.

Table 1. Bin centers for the quantized foreground shape parameters ($\sigma_i^2$) and the total number of shape parameters that fall in each bin for the image in Fig. 3a.
Bin Center: 9.93 29.12 48.32 67.51 86.71 105.90 125.09 144.29 163.48 182.68
Total Counts: 591 18 7 3 2 4 0 0 1 3

(a) Input Image (b) $\sigma_i^2 > 0$ (c) for $\sigma_i^2 > 6.99$
Fig. 5. (color online) Visualization of the skeleton (shown as a blue line) extracted from SDs on another image. (a) Input image: 64 x 83 pixels. (b) Foreground SDs. (c) Skeleton for $\sigma_i^2 > 6.99$.

8. Conclusion

In this chapter, we introduced how the computed SDs of the SDN algorithm can be used to extract skeletons from shapes, as a proof of concept. Instead of using and processing all the pixels to extract the skeleton of a given shape, we propose to use the SDs of the shape to extract the skeleton. SDs are a subset of the training data (i.e., a subset of all the pixels); thus, using SDs can considerably reduce the cost of (re)computing skeletons at different parameters. SDs and their parameters are obtained by SDN after the training steps. The RBF shape parameters of SDN are used to define the size of the SDs, and they can be used to model a shape as described in Section 5 and as visualized in our experiments. While the presented skeleton extraction algorithm is a naive solution to demonstrate the use of SDs, future work will focus on presenting more elegant solutions to extract the skeleton from SDs. SDN is a novel classification algorithm and has potential in many shape analysis applications besides skeleton extraction. The SDN architecture contains a single hidden layer and it uses RBFs as activation functions in that hidden layer. Each RBF has its own kernel parameter.
The optimization algorithm plays an important role in obtaining meaningful SDs with SDN for skeleton extraction. We use a modified version of the Sequential Minimal Optimization (SMO) algorithm [23] to train SDN. While we have not tested its performance with other optimization techniques yet, we do not expect other standard batch or stochastic gradient based algorithms to yield the same results as we obtain with our algorithm. Future work will focus on the optimization part and will perform a more detailed analysis from the optimization perspective.
A shape can be modeled parametrically by using SDNs via its similarity domains, where the SDs are modeled with radial basis functions. A further reduction in parameters can be obtained with the one-class classification approximation of SDN, as shown in Eq. (9). SDN can parametrically model a given single shape without requiring or using large datasets. Therefore, it can be efficiently used to learn and model a shape even when only one image is available and no additional dataset or model can be provided.

Future work may include a better skeleton extraction algorithm utilizing SDs. The cur-
rent naive technique relies on manual thresholding; a future technique may eliminate
such manual operation to extract the skeleton.

Acknowledgement

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the
Quadro P6000 GPU used for this research. The author would like to thank Prof. Chi Hau
Chen for his valuable comments and feedback.

References

[1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document
recognition, Proceedings of the IEEE. 86(11), 2278–2324, (1998).
[2] S. Ozer, D. L. Langer, X. Liu, M. A. Haider, T. H. van der Kwast, A. J. Evans, Y. Yang, M. N.
Wernick, and I. S. Yetik, Supervised and unsupervised methods for prostate cancer segmenta-
tion with multispectral mri, Medical physics. 37(4), 1873–1883, (2010).
[3] L. Jiang, S. Chen, and X. Jiao, Parametric shape and topology optimization: A new level set
approach based on cardinal basis functions, International Journal for Numerical Methods in
Engineering. 114(1), 66–87, (2018).
[4] S.-H. Yoo, S.-K. Oh, and W. Pedrycz, Optimized face recognition algorithm using radial basis
function neural networks and its practical applications, Neural Networks. 69, 111–125, (2015).
[5] M. Botsch and L. Kobbelt. Real-time shape editing using radial basis functions. In Computer
graphics forum, vol. 24, pp. 611–621. Blackwell Publishing, Inc Oxford, UK and Boston, USA,
(2005).
[6] S. Ozer, M. A. Haider, D. L. Langer, T. H. van der Kwast, A. J. Evans, M. N. Wernick, J. Tracht-
enberg, and I. S. Yetik. Prostate cancer localization with multispectral mri based on relevance
vector machines. In Biomedical Imaging: From Nano to Macro, 2009. ISBI’09. IEEE Interna-
tional Symposium on, pp. 73–76. IEEE, (2009).
[7] S. Ozer, On the classification performance of support vector machines using chebyshev kernel
functions, Master’s Thesis, University of Massachusetts, Dartmouth. (2007).
[8] S. Ozer, C. H. Chen, and H. A. Cirpan, A set of new chebyshev kernel functions for support
vector machine pattern classification, Pattern Recognition. 44(7), 1435–1447, (2011).
[9] R. P. Lippmann, Pattern classification using neural networks, IEEE communications magazine.
27(11), 47–50, (1989).
[10] L. Fu, M. Zhang, and H. Li, Sparse rbf networks with multi-kernels, Neural processing letters.
32(3), 235–247, (2010).
[11] F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the
smo algorithm. In Proceedings of the twenty-first international conference on Machine learn-
ing, p. 6. ACM, (2004).
[12] S. Ozer, Similarity domains machine for scale-invariant and sparse shape modeling, IEEE
Transactions on Image Processing. 28(2), 534–545, (2019).
[13] N. D. Cornea, D. Silver, and P. Min, Curve-skeleton properties, applications, and algorithms,
IEEE Transactions on Visualization & Computer Graphics. (3), 530–548, (2007).
[14] H. Sundar, D. Silver, N. Gagvani, and S. Dickinson. Skeleton based shape matching and re-
trieval. In 2003 Shape Modeling International., pp. 130–139. IEEE, (2003).
[15] P. K. Saha, G. Borgefors, and G. S. di Baja, A survey on skeletonization algorithms and their
applications, Pattern Recognition Letters. 76, 3–12, (2016).

[16] I. Demir, C. Hahn, K. Leonard, G. Morin, D. Rahbani, A. Panotopoulou, A. Fondevilla, E. Balashova,
B. Durix, and A. Kortylewski, SkelNetOn 2019 Dataset and Challenge on Deep Learning
for Geometric Shape Understanding, arXiv e-prints, (2019).
[17] M. Mongillo, Choosing basis functions and shape parameters for radial basis function methods,
SIAM undergraduate research online. 4(190-209), 2–6, (2011).
[18] J. Biazar and M. Hosami, An interval for the shape parameter in radial basis function approxi-
mation, Applied Mathematics and Computation. 315, 131–149, (2017).
[19] S. S. Bucak, R. Jin, and A. K. Jain, Multiple kernel learning for visual object recognition: A
review, Pattern Analysis and Machine Intelligence, IEEE Transactions on. 36(7), 1354–1369,
(2014).
[20] M. Gönen and E. Alpaydın, Multiple kernel learning algorithms, The Journal of Machine
Learning Research. 12, 2211–2268, (2011).
[21] N. Benoudjit, C. Archambeau, A. Lendasse, J. A. Lee, M. Verleysen, et al. Width optimiza-
tion of the gaussian kernels in radial basis function networks. In ESANN, vol. 2, pp. 425–432,
(2002).
[22] M. Bataineh and T. Marler, Neural network for regression problems with reduced training sets,
Neural networks. 95, 1–9, (2017).
[23] J. Platt, Fast training of support vector machines using sequential minimal optimization, Ad-
vances in kernel methods support vector learning. 3, (1999).
[24] C. J. Burges, A tutorial on support vector machines for pattern recognition, Data mining and
knowledge discovery. 2(2), 121–167, (1998).
[25] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis. (Cambridge university
press, 2004).
CHAPTER 1.5

ON CURVELET-BASED TEXTURE FEATURES FOR PATTERN CLASSIFICATION

Ching-Chung Li and Wen-Chyi Lin

University of Pittsburgh, Pittsburgh, PA 15261
E-mail: ccl@pitt.edu

This chapter presents an exploration of the curvelet-based approach to image


texture analysis for pattern recognition. A concise introduction to the curvelet
transform is given, which is a relatively new method for sparse representation
of images with rich edge structures. Its application to the multi-resolution
texture feature extraction is discussed. Merits of this approach have been
reported in recent years in several application areas, for example, on analysis
of medical MRI organ tissue images, classification of critical Gleason grading
of prostate cancer histological images and grading of mixture aggregate
material, etc. A bibliography is provided at the end for further reading.

1. Introduction

Image texture may be considered as an organized pattern of some simple


primitives and their spatial relationships described in a statistical sense. The
development of methodologies for texture analysis and classification over the
past fifty years has led to a number of successful applications in, for example,
remote sensing, astrophysical and geophysical data analyses, biomedical
imaging, biometric informatics, document retrieval and material inspection, etc.1-3
Classical methods began with the notion of the co-occurrence matrix on gray
levels of two neighboring pixels, the intriguing Laws' simple masks, banks of
spatial filters, and fractal models, followed by multi-resolution approaches
with wavelet transforms4-5, Gabor wavelet filter banks6 in a layered pyramid,
the ridgelet transform and, more recently, the curvelet transform7-9. The curvelet
transform developed by Candes and Donoho10-12 is a generalization of the wavelet
transform for optimal sparse representation of a class of continuous functions
with curvilinear singularity, i.e., discontinuity along a curve with bounded
curvature. With regard to 2-dimensional image processing, the development
of the curvelet transform has gone through two generations. The first generation
attempted to extend the ridgelet transform in small blocks of smoothly
partitioned subband-filtered images to obtain piecewise line segments
approximating a curve segment at each scale. It suffered from the
accuracy problem of using partitioned blocks of small size to compute the local
ridgelet transform. The second generation adopts a new formulation through
curvelet design in the frequency domain that results in the equivalent outcome
attaining the same curve characteristics in the spatial domain. A fast digital
implementation of the discrete curvelet transform is also available for use in
applications.11,13 The curvelet transform has been applied to image de-noising,
estimation, contrast enhancement, image fusion, texture classification, inverse
problems and sparse sensing.2, 14-17
There is an excellent paper, written by J. Ma and G. Plonka,18 as a review and
tutorial on the curvelet transform for engineering and information scientists. The
subject has also been discussed in sections of S. Mallat’s book4 on signal
processing and of the book by J-L Starck19, et al., on sparse image and signal
processing. This chapter provides a concise description of the second generation
curvelet transform and feature extraction in images based on multi-scale curvelet
coefficients for image texture pattern classification.

2. The Method of Curvelet Transform

A curvelet in the 2-dimensional space is a function φ(x₁, x₂) of two spatial
variables x₁ and x₂ that is defined primarily over a narrow rectangular region of
short width along the x₁ axis and longer length along the x₂ axis, following the parabolic
scaling rule, i.e., the width scaling is equal to the square of the length scaling, as
illustrated in Fig. 1. It changes rapidly along x₁ and is smooth along x₂, so its Fourier
transform φ̂_j(ω₁, ω₂) occupies a broad frequency band along ω₁ and is limited to a
narrow low-frequency band along ω₂; that is, φ̂_j(ω₁, ω₂) is compactly
supported over a narrow sector in the 2-d frequency domain (ω₁, ω₂), where ^
denotes the Fourier transform. The curvelet is orientation sensitive. If φ(x₁, x₂) is
rotated by an angle θ and expressed in the spatial polar coordinates (ρ, θ), a
similar frequency support will appear along the radial frequency axis r in the
polar frequency plot.
With both a shift (k₁, k₂) in (x₁, x₂) and a rotation θ, the curvelet at scale j ≥ 0 is
given by

$$\varphi_{j,\theta,k}(x_1, x_2) = 2^{-3j/4}\,\varphi\big(R_\theta\,[\,2^{-j}(x_1 - k_1),\ 2^{-j/2}(x_2 - k_2)\,]^T\big)$$

and

$$\hat{\varphi}_{j,\theta,k}(\omega_1, \omega_2) = 2^{3j/4}\,\hat{\varphi}\big(R_\theta\,[\,2^{j}\omega_1,\ 2^{j/2}\omega_2\,]^T\big)\, e^{-i(\omega_1 k_1 + \omega_2 k_2)} \qquad (1)$$

where the subscript k denotes (k₁, k₂), [ , ]^T denotes a column vector, and R_θ is the
rotation matrix

$$R_\theta = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.$$

The set {φ_{j,θ,k}(x₁, x₂)} is a tight frame that can be used to give an optimal
representation of a function f(x₁, x₂) by a linear combination of the curvelets
{φ_{j,θ,k}(x₁, x₂)} with coefficients {c_{j,θ,k}}, which are the set of inner products

$$f = \sum_{j,\theta,k} c_{j,\theta,k}\, \varphi_{j,\theta,k}(x_1, x_2)$$

and

$$c_{j,\theta,k} = \langle f, \varphi_{j,\theta,k}\rangle = \Big(\frac{1}{2\pi}\Big)^{2} \langle \hat{f}, \hat{\varphi}_{j,\theta,k}\rangle. \qquad (2)$$

In the discrete curvelet transform, let us consider the design of the curvelet φ_j(x₁, x₂)
through its Fourier transform φ̂_j(ω₁, ω₂) in the polar frequency domain (r, θ)
via a pair of windows, a radial window W(r) and an angular window V(t), in polar
coordinates, where r ∈ (1/2, 2) and t ∈ [−1, 1]. Note that r is the normalized
radial frequency variable with the normalization constant π, and the angular
variable θ is normalized by 2π to give the parameter t, which may vary around the
normalized orientation θ_l in the range [−1, 1]. Both W(r) and V(t) are
smooth, non-negative, real-valued window functions and are subject to the admissibility
conditions

$$\sum_{j=-\infty}^{\infty} W^{2}(2^{j} r) = 1, \quad r \in \Big(\frac{3}{4}, \frac{3}{2}\Big); \qquad (3)$$

$$\sum_{l=-\infty}^{\infty} V^{2}(t - l) = 1, \quad t \in \Big(-\frac{1}{2}, \frac{1}{2}\Big). \qquad (4)$$

Let U_j be defined by

$$U_j(r, \theta_l) = 2^{-3j/4}\, W(2^{-j} r)\, V\!\Big(\frac{2^{\lfloor j/2 \rfloor}\,\theta}{2\pi}\Big), \qquad (5)$$

where l indexes the normalized orientation θ_l at scale j (j ≥ 0). With the symmetry property of the
Fourier transform, the range of θ is now (−π/2, π/2) and thus the resolution unit
can be reduced to half size.
Let U_j be the polar wedge defined with the support of W and V,

$$U_j(r, \theta_l) = 2^{-3j/4}\, W(2^{-j} r)\, V\!\Big(\frac{2^{\lfloor j/2 \rfloor}\,\theta}{2\pi}\Big), \quad \theta \in (-\pi/2, \pi/2), \qquad (6)$$

where ⌊j/2⌋ denotes the integer part of j/2. This is illustrated by the shaded
sector in Figure 2. In the frequency domain, the scaled curvelet at scale j without
shift can be chosen with the polar wedge U_j given by

$$\hat{\varphi}_{j,l}(\omega_1, \omega_2) = U_j(r, \theta);$$

with the shift k, it becomes

$$\hat{\varphi}_{j,l,k}(\omega_1, \omega_2) = U_j(r, \theta - \theta_l)\, e^{-i(\omega_1 k_1 + \omega_2 k_2)}, \qquad (7)$$

where θ_l = l · 2π · 2^{−⌊j/2⌋}, with l = 0, 1, 2, … such that 0 ≤ θ_l < π. Then,
through Plancherel's theorem, the curvelet coefficients can be obtained by using
the inner product in the frequency domain:

$$c(j, l, k) := \frac{1}{(2\pi)^{2}} \int \hat{f}(\omega)\, \overline{\hat{\varphi}_{j,l,k}(\omega)}\, d\omega
= \frac{1}{(2\pi)^{2}} \int \hat{f}(\omega)\, U_j(r, \theta - \theta_l)\, e^{\,i\langle x_k^{(j,l)},\, \omega\rangle}\, d\omega$$
$$= \frac{1}{(2\pi)^{2}} \int \hat{f}(\omega_1, \omega_2)\, U_j(r, \theta - \theta_l)\, e^{\,i(k_1\omega_1 + k_2\omega_2)}\, d\omega_1\, d\omega_2. \qquad (8)$$

The discrete curvelet coefficients can be computed more efficiently through
the inner product in the frequency domain, as shown by Eq. (8) and in Fig. 1,
where, for one scale j, the same curvelet function at different orientations is
tiled in a circular shell or corona. Conceptually, it is straightforward: we can
compute the inner product of the Fourier transform of the image and that
of the curvelet in each wedge, put them together, and then take the inverse Fourier
transform to obtain the curvelet coefficients for that scale. However, the FFT of
the image is in rectangular coordinates, while the wedges are in polar
coordinates, and the square region and the circular region do not fully
overlap.
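The following minimal sketch illustrates this frequency-domain inner-product idea for a single wedge using NumPy FFTs; it is a conceptual illustration under stated assumptions (a precomputed wedge window U_wedge on the FFT grid), not the CurveLab implementation, and it omits the coordinate handling and wrapping discussed next.

import numpy as np

def wedge_coefficients(f, U_wedge):
    """Conceptual sketch: curvelet-style coefficients for one frequency wedge.

    f        -- 2-D image array
    U_wedge  -- hypothetical frequency-domain window U_j(r, theta - theta_l),
                sampled on the same grid as the FFT of f (assumed given)
    Returns a complex array whose entries play the role of c(j, l, k).
    """
    F = np.fft.fft2(f)                      # image spectrum on the rectangular grid
    windowed = F * np.conj(U_wedge)         # inner product with the wedge window
    return np.fft.ifft2(windowed)           # back to the spatial index k = (k1, k2)

# Toy usage with a random image and a crude band-pass "wedge"
rng = np.random.default_rng(0)
f = rng.standard_normal((64, 64))
wx = np.fft.fftfreq(64)[:, None]
wy = np.fft.fftfreq(64)[None, :]
U = ((np.hypot(wx, wy) > 0.1) & (np.hypot(wx, wy) < 0.25)).astype(float)
c = wedge_coefficients(f, U)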

Fig. 1. A narrow rectangular support for a curvelet in the spatial domain is shown on the right; its
width and length have two different scales according to the parabolic scaling rule. Also shown are its
shift and rotation in the spatial domain. On the left, a 2-dimensional frequency plane in polar
coordinates is shown with radial windows in a circular corona supporting a curvelet in different
orientations; the shaded sector illustrates a radial wedge with parabolic scaling.

Let the wedges be extended to concentric squares as illustrated in Fig. 2; the
wedges then take different trapezoidal shapes, and the incremental orientations of
successive wedges are far from uniform. Special care must be taken to facilitate
the computation. There are two different algorithms for the fast digital curvelet
transform: one is called the unequispaced FFT approach, and the other is called the
frequency wrapping approach. The concept of the frequency wrapping approach
may be briefly explained by the sketch in Fig. 3. Let us examine a digital wedge
at scale j shown by the shaded area. Under a simple shearing process, the
trapezoidal wedge can be mapped into one with a parallelepiped-shaped
support enclosing the trapezoidal support, which also contains some data samples
from the two neighboring trapezoidal wedges. It will then be mapped into a
rectangular region centered at the origin of the frequency plane. It turns out that
the data in the parallelepiped wedge can be properly mapped into the rectangular
wedge by a wrapping process, as illustrated by the shaded parts enclosed in the
rectangle. The tiling of the parallelepipeds is geometrically periodic in
either the vertical or the horizontal direction, and each parallelepiped contains the identical curvelet
information; the wrapping therefore folds this information into the rectangular wedge so that it carries the
same frequency content as the parallelepiped and, thus, as the original
wedge, allowing the inner product with the given image to be computed. Although the wrapped wedge appears to contain broken pieces
of the data, it is actually just a re-indexing of the components of the original
data. In this way, the inner product can be computed for each wedge and
immediately followed by the inverse FFT to obtain the contribution to the
curvelet coefficients from that original wedge with the trapezoidal support.
Pooling contributions from all the wedges gives the final curvelet coefficients
at that scale. Software for both algorithms is freely available from Candes'
laboratory;13 we have used the second algorithm in our study of curvelet-
based texture pattern classification of prostate cancer tissue images.

Fig. 2. The digital coronae in the frequency plane with pseudo-polar coordinates; the trapezoidal
wedges shown also satisfy the parabolic scaling rule.

Fig. 3. A schematic diagram to illustrate the concept of the wrapping algorithm for computing
digital curvelet coefficients at a given scale. A shaded trapezoidal wedge in a digital corona in the
frequency plane is sheared into a parallelepiped wedge, and then mapped into a rectangular
wedge by a wrapping process, so that it has the identical frequency information content by
virtue of the periodization. Although the wrapped wedge appears to contain broken pieces of
the data, it is actually just a re-indexing of the components of the original data.

The computed curvelet coefficients of sample images are illustrated in the
following four figures. The strength of the coefficients is indicated by the brightness
at their corresponding locations (k1, k2) with reference to the original image spatial
coordinates; thus the plot at a low scale appears coarse. Coefficients of
different orientations are pooled together in one plot for each scale. Fig. 4 shows
a section of a yeast image; the exterior contours and the contours of the inner core
material are sparsely represented by the curvelet coefficients at each scale.
Fig. 5 shows the curvelet coefficients of a cloud image from coarse to fine in six
scales; the coefficient texture patterns in scales 3 to 6 provide more
manageable texture features for pattern analysis. Fig. 6 shows the curvelet coefficients
in four scales of an iris image, providing a view comparable with the well-known
representation by the Gabor wavelet.36 Fig. 7 shows the multiscale curvelet
coefficients of prostate cancer tissue images of four Gleason scores. The reliable
recognition of the tissue score is a very important problem in clinical urology.
We will describe in the following our current work on this classification problem
based on the curvelet texture representation.

Fig. 4. Fuzzy yeast cells in the dark background. The original image27 has very smooth intensity
variation. Scales 2–4 illustrate the integrated curvelets extracted from the original image.

Fig. 5. NASA satellite view of cloud pattern over the South Atlantic Ocean (NASA courtesy/Jeff
Schmaltz).

Fig. 6. Iris image. The scale 2–5 curvelet coefficient patterns illustrate different texture
distributions of the original image.28

Fig. 7. TMA prostate images of Gleason grades P3S3, P3S4, P4S3 and P4S4, along with their
respective curvelet patterns of scales 2–5, demonstrate the transitions from the benign class,
through the critical intermediate classes, to the carcinoma class.

3. Curvelet-based Texture Features

The value of the curvelet coefficient c_{j,l,k} at position (k1, k2) under scale j denotes the
strength of the curvelet component oriented at angle θ in the representation of an
image function f(x1, x2). It contains information on edginess coordinated
along a short path of connected pixels in that orientation. Intuitively, it would therefore be
advantageous to extract texture features in the curvelet coefficient space. One
may use standard statistical measures, such as the entropy, energy, mean,
standard deviation, and 3rd- and 4th-order moments of an estimated marginal
distribution (histogram) of curvelet coefficients as texture features,20 and also
measures of the co-occurrence of curvelet coefficients. The correlation of coefficients across
orientations and across scales may also be utilized. These features may provide more
discriminative power in texture classification than features extracted by the
classical approaches.
Dettori, Semler and Zayed7-8 studied the use of curvelet-based texture features
in recognition of the normality of organ sections in CT images and reported a
significant improvement in recognition accuracy in comparison with the results
obtained using wavelet-based and ridgelet-based features. Arivazhagan,
Ganesan and Kumar studied texture classification on a set of natural images
from the VisTex dataset33-34 using curvelet-based statistical and co-occurrence
features; they also obtained superior classification results. Alecu, Munteanu, et
al.35 conducted an information-theoretic analysis of the correlations of curvelet
coefficients within a scale, between orientations, and across scales; they showed that
the generalized Gaussian density function gives a better fit for the marginal
probability density functions of curvelet coefficients. This is due to the sparse
representation by curvelet coefficients: there are fewer significant
coefficients, and the histogram at a given scale appears more peaked and
has a long tail in general. Following that notion, Gomez and Romero developed
a new set of curvelet-based texture descriptors under the generalized Gaussian
model for the marginal density functions and demonstrated
their success in a classification experiment using a set of natural images from the
KTH-TIPS dataset.32 Murtagh and Starck also considered the generalized
Gaussian model for the histograms of curvelet coefficients at each scale and selected
the second-, third- and fourth-order moments as statistical texture
features in classifying and grading aggregate mixtures, with superb
experimental results.19-20 Rotation-invariant curvelet features were used in
studies of region-based image retrieval by Zhang, Islam and Sumana, by
Cavusoglu, and by Zand, Doraisamy, Halin and Mustaffa, all giving superior
results in their comparative studies.29-31

We have conducted research on applying the discrete curvelet transform to
extract texture features from prostate cancer tissue images for differentiating the
disease grade. This is described in the following to illustrate our experience
with advances in curvelet transform applications.

4. A Sample Application Problem

This section gives a brief discussion of the development, in collaboration with
the Pathology/Urology Departments of Johns Hopkins University, of applying
the curvelet transform to the analysis of prostate pathological images of critical
Gleason scores for computer-aided classification,17 which could serve as a
potential predictive tool for urologists to anticipate prognosis and provide
suggestions for adequate treatment. The Gleason grading system is a standard for
interpreting prostate cancer established by expert pathologists based on
microscopic tissue images from needle biopsies.21-27 Gleason grades are categorized
from 1 to 5 in increasing order, based on the cumulative loss of regular glandular structure,
which reflects the degree of the malignant, aggressive phenotype. The Gleason
score (GS) is the sum of the primary grade and the secondary grade,
ranging from 5 to 10, in which a total score of 6 is regarded as a
slower-growing cancer, 7 (3+4) as medium-grade, and 4+3, 8, 9, or 10 as more
aggressive carcinoma. Gleason scores 6 and 7 mark the mid-point
between low-grade (less aggressive) and intermediate-grade
carcinoma, and they also generate the greatest lack of agreement in second-
opinion assessments.
A set of Tissue MicroArray (TMA) images has been used as the database.
Each TMA image of 1670 × 1670 pixels contains a core
image of 0.6 mm in diameter at 20× magnification. The available 224 prostate
histological images consist of 4 classes, P3S3, P3S4, P4S3 and P4S4,
each class with 56 images from 16 cases. We used 32 images of each class for
training and the remaining 24 images of each class for testing. The curvelet
transform was applied to 25 half-overlapping patches of subimages, taken columnwise
and rowwise to cover each image area, and the class labels of the patches
are pooled together to make a majority decision for the class assignment of the
image. A two-level tree classifier consisting of three Gaussian-kernel support
vector machines (SVMs) has been developed, where the first machine primarily
decides whether an input patch belongs to Grade 3 (GG3 designates the
inclusion of P3S3 and P3S4) or Grade 4 (GG4 stands for the inclusion of P4S3
and P4S4) and then makes a majority decision over the multiple patches in the image.
One SVM at the next level differentiates P3S3 from P3S4 and the
other classifies P4S3 and P4S4.
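A minimal sketch of this two-level, patch-voting scheme is shown below, assuming scikit-learn and precomputed per-patch feature vectors; the array names and the simple voting logic are illustrative assumptions rather than the exact pipeline used in the study.

import numpy as np
from sklearn.svm import SVC

# Hypothetical inputs: X_train is an array of per-patch feature vectors,
# y_train holds patch labels in {"P3S3", "P3S4", "P4S3", "P4S4"}.
def train_tree_classifier(X_train, y_train):
    grade = np.where(np.isin(y_train, ["P3S3", "P3S4"]), "GG3", "GG4")
    svm_grade = SVC(kernel="rbf").fit(X_train, grade)            # level 1: GG3 vs GG4
    m3 = np.isin(y_train, ["P3S3", "P3S4"])
    svm_g3 = SVC(kernel="rbf").fit(X_train[m3], y_train[m3])     # level 2: P3S3 vs P3S4
    svm_g4 = SVC(kernel="rbf").fit(X_train[~m3], y_train[~m3])   # level 2: P4S3 vs P4S4
    return svm_grade, svm_g3, svm_g4

def classify_image(patches, svm_grade, svm_g3, svm_g4):
    """patches: array of shape (25, n_features) for one tissue image."""
    grades = svm_grade.predict(patches)
    majority = "GG3" if (grades == "GG3").sum() >= len(grades) / 2 else "GG4"
    second = svm_g3 if majority == "GG3" else svm_g4
    votes = second.predict(patches)
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                              # majority vote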
A central area of 768 × 768 pixels of the tissue image was taken, which should
be sufficient to cover the biometrics of the prostate cells and the glandular
structure characteristics. Sampled by the aforementioned approach, each patch
of 256 × 256 pixels then undergoes the fast discrete curvelet transform, using the
CurveLab Toolbox software,13 to generate curvelet coefficients cj,l,k in 4
scales. The prostate cellular and ductal structures contained therein are
represented by the texture characteristics of the image region where the curvelet-
based analysis is performed. As shown in Fig. 7, four patches were taken from the four
prostate patterns, including the two critical in-between grades P3S4 and P4S3; the
curvelet coefficients at each scale are displayed in the lower part to illustrate the
edge information integrated over all orientations.
The "scale number" used here for curvelet coefficients corresponds to the
subband index number considered in the discrete frequency domain. For a 256 ×
256 image, scale 5 refers to the highest-frequency subband, that is, subband 5,
and scales 4, 3 and 2 refer to the successively lower-frequency subbands.
Their statistical measures, including the mean μj, variance σj², entropy ej, and
energy Ej of the curvelet coefficients at each scale j for each patch, are computed as
textural features. Nine features have been selected to form a 9-dimensional
feature vector for use in pattern classification: entropy in scales
3–4, energy in scales 2–4, mean in scales 2–3 and variance in scales 2–3.
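As a rough illustration (not the authors' exact code), the per-scale statistics can be computed from the coefficient arrays as follows; curvelet_coeffs is assumed to be a list, indexed by scale, of 2-D arrays of coefficients pooled over orientations, and the entropy is estimated from a normalized magnitude histogram.

import numpy as np

def scale_statistics(curvelet_coeffs, n_bins=64):
    """curvelet_coeffs: list of 2-D arrays, one per scale (assumed given).
    Returns a dict of per-scale mean, variance, entropy and energy."""
    feats = {}
    for j, c in enumerate(curvelet_coeffs):
        c = np.abs(np.asarray(c)).ravel()
        hist, _ = np.histogram(c, bins=n_bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        feats[j] = {
            "mean": float(c.mean()),
            "variance": float(c.var()),
            "entropy": float(-(p * np.log2(p)).sum()),   # histogram-based entropy
            "energy": float((c ** 2).sum()),
        }
    return feats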

Table 1. Jackknife Cross-validation results

Sensitivity Specificity Accuracy


GG3 vs GG4 94.53% 96.88% 95.70%
P3S3 vs P3S4 97.81% 97.50% 97.65%
P4S3 vs P4S4 98.43% 97.80% 98.13%
Overall 93.68%

All three kernel SVMs were successfully trained, together with the
majority decision rule, giving a trained tree classifier of the 4 classes of critical
Gleason scoring with no training error. Leave-one-image-out cross-
validation was applied to assess the trained classifier. The 10% jackknife cross-
validation tests were carried out for 100 realizations for all three SVMs, and the
statistical results are listed in Table 1, with above 95.7% accuracy for the individual
machines and an overall accuracy of 93.68% for the 4 classes. The trained classifier
was tested with 96 images (24 images per class). The result given in Table 2
shows remarkable testing accuracy in classifying tissue images of the four critical
Gleason scores (GS) 3+3, 3+4, 4+3 and 4+4, as compared to the published
results that we are aware of. The lowest correct classification rate (87.5%)
was obtained for the intermediate grade P4S3, owing to its position between
P3S4 and P4S4, where subtle textural characteristics are difficult to
differentiate.

Table 2. Test results of the 4-class tree classifier

Grade Accuracy
P3S3 95.83%
P3S4 91.67%
P4S3 87.50%
P4S4 91.67%

5. Summary and Discussion

We have discussed the exciting development over the past decade of applying
the curvelet transform to texture feature extraction in several biomedical imaging,
material grading and document retrieval problems9, 37-43. One may consider the
curvelet as a sophisticated "texton"; the image texture is then characterized in terms of its
dynamic and geometric distributions over multiple scales. With the implication of
sparse representation, this leads to efficient and effective multiresolution texture
descriptors capable of providing enhanced pattern classification performance.
Much more work needs to be done to explore its full potential in various
applications. A closely related method, the wave atoms44 representation of oscillatory
patterns, may guide new joint developments in image texture characterization in
different fields of practical application.

Appendix

This appendix provides a summary of the recent work45-46 on applying the


curvelet-based texture analysis to grading the critical Gleason patterns of prostate
cancer tissue histological images exemplified in Section 4. With respect to all
orientations at each location of an image at a given scale, the selection of curvelet
coefficients with significant magnitude yields the dominant curve segments in
the reconstruction, where both positive edges and the corresponding negative edges in
the opposite direction always exist. This enables a sparser representation of the
boundary information of nuclei and glandular structures; the histogram of these
curvelet coefficients at a given scale shows a bimodal distribution. Reversing the
sign of the negative coefficients and merging them into the pool of positive
curvelet coefficients of significant magnitude, we obtain a single-mode
distribution of the curvelet magnitude which is nearly rotation invariant. Thus,
for a given scale, at each location, the maximum curvelet coefficient is defined
by

$$c_{j,k} = \max_{\theta}\, c_{j,k,\theta}\,. \qquad (9)$$

Based upon histograms of the maximum curvelet coefficients at different scales,


the statistical texture features are computed.
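A hedged sketch of this max-over-orientation pooling and the associated higher-order statistics is given below; coeffs_by_orientation is an assumed list of same-shaped 2-D coefficient arrays for one scale, and SciPy is used only for the skewness and kurtosis estimates.

import numpy as np
from scipy.stats import skew, kurtosis

def max_coefficient_features(coeffs_by_orientation):
    """coeffs_by_orientation: list of 2-D arrays c_j(k, theta), one per orientation
    at a fixed scale j (assumed given). Implements the Eq. (9)-style pooling:
    take |c| at every location and keep the maximum over orientations."""
    stack = np.abs(np.stack(coeffs_by_orientation))   # shape (n_orient, H, W)
    c_max = stack.max(axis=0).ravel()                 # maximum curvelet coefficient map
    return {
        "variance": float(c_max.var()),
        "energy": float((c_max ** 2).sum()),
        "skewness": float(skew(c_max)),
        "kurtosis": float(kurtosis(c_max)),
    }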
A two-layer tree classifier with two Gaussian-kernel SVMs has been
trained to classify each tissue image into one of the four critical patterns of
Gleason grading: GS 3+3, GS 3+4, GS 4+3 and GS 4+4. In addition to the variance
σj², energy Enerj, and entropy Sj, the skewness γj and kurtosis kurtj of the maximum
curvelet coefficients at scale j were also considered for texture feature selection.
The statistical descriptors were evaluated from all training images and rank-
ordered by applying the Kullback–Leibler divergence measure.
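The rank-ordering step can be sketched as follows, under the assumption that the divergence is estimated from per-feature class-conditional histograms; this is an illustrative implementation, not necessarily the exact estimator used in the study.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def rank_features_by_kl(X_class1, X_class2, n_bins=32):
    """Rank feature columns by the KL divergence between their class-conditional
    histograms (larger divergence = more discriminative). X_*: (n_samples, n_features)."""
    scores = []
    for f in range(X_class1.shape[1]):
        lo = min(X_class1[:, f].min(), X_class2[:, f].min())
        hi = max(X_class1[:, f].max(), X_class2[:, f].max())
        h1, _ = np.histogram(X_class1[:, f], bins=n_bins, range=(lo, hi))
        h2, _ = np.histogram(X_class2[:, f], bins=n_bins, range=(lo, hi))
        scores.append(kl_divergence(h1.astype(float), h2.astype(float)))
    return np.argsort(scores)[::-1]    # feature indices, most discriminative first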
Eight features were selected for the first SVM 1 for classifying GS 3+3 and
GS 4+4, as given below:

$$\big[\,S_4,\ \gamma_4,\ Ener_4,\ \gamma_5,\ Ener_5,\ kurt_4,\ \sigma_3^2,\ kurt_3\,\big]^T,$$

Patches in images of GS 3+4 and GS 4+3 may have textures that are a mixture of grade
G3 and grade G4, instead of either G3 or G4; thus a refinement of the feature
vector for use in SVM 2 at the second level was obtained by adding some fine-
scale feature components and re-ranking the features to enhance the
differentiation between GS 3+4 and GS 4+3, which resulted in the selection of the set
of 10 feature components given below:

$$\big[\,S_4,\ Ener_4,\ \gamma_4,\ \sigma_4^2,\ kurt_5,\ S_5,\ \gamma_5,\ Ener_5,\ kurt_3,\ \sigma_3^2\,\big]^T.$$
The performance of this curvelet-based classifier is summarized below: one of the 100 realizations of
the validation test of SVM 1 at the first level and of SVM 2 at the second level,
and the testing result of the tree classifier, are given in Table 3, Table 4, and
Table 5, respectively. A comparison, in terms of cross-validation, with the
classifier of Fehr et al.48 and that of Mosquera-Lopez et al.47 is given in
Table 6.

Table 3. Validation test of SVM 1 (based on image patches)


Test Label           GS 3+3   GS 4+4   Error    Overall Accuracy
Class 1  GS 3+3      789      11       1.37%    98.44%
Class 2  GS 4+4      14       786      1.75%
Average              98.63%   98.25%   1.56%

Level 1 classifier: SVM 1 + patch voting
(input with training samples of GS 3+3 and 4+4, and with samples of GS 3+4 and 4+3)

Test Label           GS 3+3 (Class 1)   GS 4+4 (Class 2)   GS 3+4 (Class 3)   GS 4+3 (Class 3)   Error
Class 1  GS 3+3      32                 0                  0                  0                  0%
Class 2  GS 4+4      0                  32                 0                  0                  0%
Class 3  GS 3+4      0                  0                  32                 0                  0%
Class 3  GS 4+3      0                  0                  0                  32                 0%

Table 4. Validation results of SVM 2

Test Label    GS 3+4   GS 4+3   Indecision (GS 3+4)   Indecision (GS 4+3)
GS 3+4        31                1
GS 4+3                 31                             1
Average       96.88%   96.88%   3.12%                 3.12%

Table 5. Testing Result of the Tree Classifier


Test Label        GS 3+3 (Score 6)   GS 4+4 (Score 8)   GS 3+4 (Score 7)   GS 4+3 (Score 7)   Indecision   Overall
GS6  GS 3+3       24                                                                                       24
GS8  GS 4+4                          23                                                       1            24
GS7  GS 3+4                                             22                                    2            24
GS7  GS 4+3                                                                23                 1            24
Accuracy          100%               95.83%             91.67%             95.83%                          96
Average Accuracy  95.83%

Table 6. Comparison of cross-validation of different approaches for G3 vs G4 and the 4 critical Gleason grades

Method                                      Dataset                                Grade 3 vs Grade 4   GS 7 (3+4) vs GS 7 (4+3)
Quaternion wavelet transform (QWT),         30 grade 3, 30 grade 4                 98.83%
quaternion ratios, and modified LBP47       and 11 grade 5 images
Texture features from combining             34 GS 3+3 vs 159 GS ≥7;                93.00%               92.00%
diffusion coefficient and T2-weighted       114 GS 3+4 vs 26 GS 4+3
MRI images48                                (the 159 GS ≥7 include: 114 GS 3+4,
                                            26 GS 4+3, 19 GS ≥8)
Our two-level classifier using maximum      32 GS 3+3, 32 GS 3+4,                  98.88%               95.58%
curvelet coefficient-based texture          32 GS 4+3, and 32 GS 4+4
features46                                  images (20×)

References

1. M. Tuceryan and A. K. Jain, “Texture Analysis,” in Handbook of Pattern Recognition and


Computer Vision, 2nd ed., Eds., C. H. Chen, L. f. Pau and P. S. P. Wang, Chap. 2.1., World
Scientific, (1999).
2. C. V. Rao, J. Malleswara, A. S. Kumar, D. S. Jain and V. K. Dudhwal, “Satellite Image
Fusion using Fast Discrete Curvelet Transform,” Proc. IEEE Intern. Advance Computing
Conf., pp. 252-257, (2014).
3. M. V. de Hoop, H. F. Smith, G. Uhlmann and R. D. van der Hilst, “Seismic Imaging with
the Generalized Radon Transform, A Curvelet Transform Perspective,” Inverse Problems,
vol. 25, 025005, (2009).
4. S. Mallat, “A Wavelet Tour of Signal Processing, the Sparse Way”, 3rd ed., Chap. 5,
(2009).
5. C. H. Chen and G. G. Lee, “On Multiresolution Wavelet Algorithm using Gaussian Markov
Random Field Models, in Handbook of Pattern Recognition and Computer Vision, 2nd ed.,
Eds., C. H. Chen. L. F. Pau and P.S.P. Wang, Chap. 1.5., World Scientific, (1999).
6. A.K. Jain and F. Farrokhnia, “Unsupervised Texture Segmentation using Gabor Filters,”
Pattern Recognition, vol. 34, pp. 1167-1186, (1991).
7. L. Dettori and A. I. Zayed, “Texture Identification of Tissues using Directional Wavelet,
Ridgelet and Curvelet Transforms,” in Frames and Operator Theory in Image and Signal
Processing, ed., D. R. Larson, et al., Amer. Math. Soc., pp. 89-118, (2008).
8. L. Dettori and L. Semler, “A comparison of Wavelet Ridgelet and Curvelet Texture
Classification Algorithms in Computed tomography,” Computer Biology & Medicine, vol.
37, pp. 486-498, (2007).
9. G. Castellaro, L. Bonilha, L. M. Li and F. Cendes, “Multiresolution Analysis using wavelet,
ridgelet and curvelet Transforms for Medical Image Segmentation,” Intern. J. Biomed.
Imaging, v. 2011, Article ID 136034, (2011).
10. E. J. Candes and D. L. Donoho, “New Tight Frames of Curvelets and Optimal
Representation of Objects with Piecewise Singularities,” Commun. Pure Appl. Math., vol.
57, no. 2, pp. 219-266, (2004).
11. E. J. Candes, L. Demanet, D. L. Donoho, and L. Ying, “Fast Discrete Curvelet Transform,”
Multiscale Modeling & Simulations, vol. 5, no. 3, pp. 861-899, (2006).
12. E. J. Candes and D. L. Donoho, “Continuous Curvelet Transform: II. Discretization and
Frames,” Appl. Comput. Harmon. Anal., vol. 19, pp. 198-222, (2005).
13. E. J. Candes, L. Demanet, D. L. Donoho, and L. Ying, “Curvelab Toolbox, version 2.0,”
CIT, (2005).
14. J-L. Starck, E. Candes and D. L. Donoho, “The Curvelet Transform for Image Denoising,”
IEEE Trans. IP, vol. 11, pp. 131-141, (2002).
15. K. Nguyen, A. K. Jain and B. Sabata, “Prostate Cancer Detection: Fusion of Cytological
and Textural Features,” Jour. Pathology Informatics, vol. 2, (2011).
16. L. Guo, M. Dai and M. Zhu, Multifocus Color Image Fusion based on Quaternion Curvelet
Transform,” Optic Express, vol. 20, pp. 18846-18860, (2012).
17. Wen-Chyi Lin, Ching-Chung Li, Christhunesa S. Christudass, Jonathan I. Epstein and
Robert W. Veltri, “Curvelet-based Classification of Prostate Cancer Histological Images of
Critical Gleason Scores,” In Biomedical Imaging (ISBI), 2015 IEEE 12th International
Symposium on, pp. 1020-1023 (2015).

18. Jianwei Ma and Gerlind Plonka. "The curvelet transform." Signal Processing Magazine,
IEEE 27.2, pp.118-133 (2010).
19. J-C. Starck, F. Murtagh and J. M. Fadili, “Sparse Image and Signal Processing”, Cambridge
University Press, Chap. 5, (2010).
20. F. Murtagh and J-C. Starck, “Wavelet and Curvelet Moments for Image Classification:
Application to Aggregate Mixture Grading,” Pattern Recognition Letters, vol. 29, pp. 1557-
1564, (2008).
21. C. Mosquera-Lopez, S. Agaian, A. Velez-Hoyos and I. Thompson, “Computer-aided
Prostate Cancer Diagnosis from Digitized Histopathology: A Review on Texture-based
Systems,” IEEE Review in Biomedical Engineering, v.8, pp. 98-113, (2015).
22. D. F. Gleason, and G. T. Mellinger, “The Veterans Administration Cooperative Urological
Research Group: Prediction of Prognosis for Prostatic Adenocarcinoma by Combined
Histological Grading and Clinical Staging,” J Urol, v.111, pp. 58-64, (1974).
23. Luthringer, D. J., and Gross, M.,"Gleason Grade Migration: Changes in Prostate Cancer
Grade in the Contemporary Era," PCRI Insights, vol. 9, pp. 2-3, (August 2006).
24. J.I. Epstein, "An update of the Gleason grading system," J Urology, v. 183, pp. 433-440,
(2010).
25. Pierorazio PM, Walsh PC, Partin AW, and Epstein JI., “Prognostic Gleason grade grouping:
data based on the modified Gleason scoring system,” BJU International, (2013).
26. D. F. Gleason, and G. T. Mellinger, “The Veterans Administration Cooperative Urological
Research Group: Prediction of Prognosis for Prostatic Adenocarcinoma by Combined
Histological Grading and Clinical Staging,” J Urol, v.111, pp. 58-64, (1974).
27. Gonzalez, Rafael C., and Richard E. Woods. "Digital image processing 3rd edition." (2007).
28. John Daugman, University of Cambridge, Computer Laboratory. [Online]
http://www.cl.cam.ac.uk/~jgd/sampleiris.jpg.
29. Cavusoglu, “Multiscale Texture Retrieval based on Low-dimensional and Rotation-
invariant Features of Curvelet Transform,” EURASIP Jour. On Image and Video
Processing, paper 2014:22, (2014).
30. Zhang, M. M. Islam, G. Lu and I. J. Sumana, “Rotation Invariant Curvelet Features for
Region Based Image Retrieval,” Intern. J. Computer Vision, vo. 98, pp. 187-201, (2012).
31. Zand, Mohsen, et al. "Texture classification and discrimination for region-based image
retrieval." Journal of Visual Communication and Image Representation 26, pp. 305-316.
(2015).
32. F. Gomez and E. Romero, “Rotation Invariant Texture Classification using a Curvelet based
Descriptor,” Pattern Recognition Letter, vol. 32, pp. 2178-2186, (2011).
33. S. Arivazhagan and T. G. S. Kumar, “Texture Classification using Curvelet Transform,”
Intern. J. Wavelets, Multiresolution & Inform Processing, vol. 5, pp. 451-464, (2007).
34. S. Arivazhagan, L. Ganesan and T. G. S. Kumar, “Texture Classification using Curvelet
Statistical and Co-occurrence Features,” Proc. IEEE ICPR’06, pp. 938-941, (2006).
35. Alecu, A. Munteanu, A. Pizurica, W. P. Y. Cornelis and P. Schelkeus, “Information-
Theoretic Analysis of Dependencies between Curvelet Coefficients,” Proc. IEEE ICOP, pp.
1617-1620, (2006).
36. J. Daugman, “How Iris Recognition Works,” IEEE Trans. On Circuits & Systems for Video
Technology, vo. 14, pp. 121-130, (2004).
37. L. Shen and Q. Yin, “Texture Classification using Curvelet Transform,” Proc. ISIP’09,
China, pp. 319-324, (2009).

38. H. Chang and C. C. J. Kuo, “Texture analysis and Classification with Tree-structured
Wavelet Transform,” IEEE Trans. Image Proc., vol. 2, pp. 429-444, (1993).
39. M. Unser and M. Eden, “Multiresolution Texture Extraction and Selection for Texture
Segmentation,” IEEE Trans. PAMI, vol. 11, pp. 717-728, (1989).
40. M. Unser, “Texture Classification and Segmentation using Wavelet Frames,” IEEE Trans.
IP, vol. 4, pp. 1549-1560, (1995).
41. A. Laine and J. Fan, “Texture Classification by Wavelet Packet Signatures,” IEEE Trans.
PAMI, vol. 15, pp. 1186-1191, (1993).
42. A. Laine and J. Fan, “Frame Representation for Texture Segmentation,” IEEE Trans. IP, vol. 5,
pp. 771-780, (1996).
43. B. Nielsen, F. Albregtsen and H. E. Danielsen, “Statistical Nuclear Texture Analysis in Cancer Research:
A Review of Methods and Applications,” Critical Review in Oncogenesis, vol. 14, pp. 89-
164, (2008).
44. L. Demanet and L. Ying, “Wave Atoms and Sparsity of Oscillatory Patterns,” Appl.
Comput. Harmon. Anal., vol. 23, pp. 368-387, (2007).
45. Lin, Wen-Chyi, Ching-Chung Li, Jonathan I. Epstein, and Robert W. Veltri. "Curvelet-
based texture classification of critical Gleason patterns of prostate histological images." In
Computational Advances in Bio and Medical Sciences (ICCABS), 2016 IEEE 6th
International Conference on, pp. 1-6, (2016).
46. Lin, Wen-Chyi, Ching-Chung Li, Jonathan I. Epstein, and Robert W. Veltri. "Advance on
curvelet application to prostate cancer tissue image classification." In 2017 IEEE 7th
International Conference on Computational Advances in Bio and Medical Sciences
(ICCABS), pp. 1-6, (2017).
47. C. Mosquera-Lopez, S. Agaian and A. Velez-Hoyos. "The development of a multi-stage
learning scheme using new descriptors for automatic grading of prostatic carcinoma." Proc.
IEEE ICASSP, pp. 3586-3590, (2014).
48. D. Fehr, H. Veeraraghavan, A. Wibmer, T. Gondo, K. Matsumoto, H. A. Vargas, E. Sala,
H. Hricak, and J. O. Deasy. "Automatic classification of prostate cancer Gleason scores
from multiparametric magnetic resonance images," Proceedings of the National Academy
of Sciences 112, no. 46, pp.6265-6273, (2015).
CHAPTER 1.6
AN OVERVIEW OF EFFICIENT DEEP LEARNING
ON EMBEDDED SYSTEMS

Xianju Wang
Bedford, MA, USA
wxjzyw@gmail.com

Deep neural networks (DNNs) have exploded in the past few years, particularly in the
area of visual recognition and natural language processing. At this point, they have
exceeded human levels of accuracy and have set new benchmarks in several tasks.
However, the complexity of the computations requires specific thought about the network
design, especially when the applications need to run on low-latency, energy-efficient
embedded devices.
In this chapter, we provide a high-level overview of DNNs along with specific
architectural constructs of DNNs like convolutional neural networks (CNNs) which are
better suited for image recognition tasks. We detail the design choices that deep-learning
practitioners can use to get DNNs running efficiently on embedded systems. We
introduce chips most commonly used for this purpose, namely microprocessor, digital
signal processor (DSP), embedded graphics processing unit (GPU), field-programmable
gate array (FPGA) and application specific integrated circuit (ASIC), and the specific
considerations to keep in mind for their usage. Additionally, we detail some
computational methods to gain more efficiency such as quantization, pruning, network
structure optimization (AutoML), Winograd and Fast Fourier transform (FFT) that can
further optimize ML networks after making the choice of network and hardware.

1. Introduction

Deep learning has evolved into the state-of-the-art technique for artificial
intelligence (AI) tasks since early 2010. Since the breakthrough application of
deep learning for image recognition and natural language processing (NLP), the
number of applications that use deep learning has increased significantly.
In many applications, deep neural networks (DNNs) are now able to surpass
human levels of accuracy. However, the superior accuracy of DNNs comes at
the cost of high computational complexity. DNNs are both computationally
intensive and memory intensive, making them difficult to deploy and run on
embedded devices with limited hardware resources [1,2,3].
Deep neural network (DNN) processing comprises two tasks, training and
inference, which have different computational needs. Training is the stage in
which the network tries to learn from the data, while inference is the phase in
which a trained model is used to make predictions on real samples.
Network training often requires a large dataset and significant computational
resources. In many cases, training a DNN model still takes several hours to days
to complete and thus is typically executed in the cloud. For the inference, it is
desirable to have its processing near the sensor and on the embedded systems to
reduce latency and improve privacy and security. In many applications, inference
requires high speed and low power consumption. Thus, implementing deep
learning on embedded systems becomes more critical and difficult. In this
chapter, we are going to focus on reviewing different methods to implement deep
learning inference on embedded systems.

2. Overview of Deep Neural Networks

At the highest level, one can think of a DNN as a series of smooth geometric
transformations from an input space to a desired output space. The “series” are
the layers that are stacked one after the other, with the output of each layer
fed as input to the next. The input space could contain mathematical
representations of images, language or any other feature set, while the output is
the desired “answer” that is fed to the network during the training phase and is
predicted during inference. The geometric transformations can take several
forms, and the choice often depends on the nature of the problem that needs to be
solved. In the world of image and video processing, the most common form is
the convolutional neural network (CNN). CNNs in particular have contributed to
the rapid growth in computer vision due to the nature of their connections and the
computational efficiency they offer compared to other types of networks.

2.1. Convolutional Neural Networks

As previously mentioned, CNNs are particularly suited to analyzing images. A
convolution is a “learned filter” that is used to extract specific features from
images, for example, edges in the earlier layers and complex shapes in the deeper
ones. The computer sees an image as a matrix of pixels arranged as width ×
height × depth. The resolution of the image determines the width and the height,
while the depth is usually expressed in 3 color channels – R, G and B. A convolution
filter operation performs a dot product between the filter and the corresponding
pixel values within the width and the height of the input image. The filter is
then moved in a sliding-window fashion to generate an output feature map that is
transformed using an activation function and then fed into the deeper layers.
In order to reduce the number of parameters, a form of sub-sampling called
“pooling” is often applied that smooths neighboring pixels while reducing the
spatial dimensions as the layers get deeper. CNNs are incredibly efficient and, unlike
other feed-forward networks, are spatially “aware” and can handle rotation and
translation invariance remarkably well. There are several well-known
architectural examples of CNNs, chief among them AlexNet, VGGNet and
ResNet, which differ in their accuracy vs. speed trade-offs. Some of these CNNs are
compared later.
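As a rough illustration of the sliding-window dot product described above (a didactic sketch, not an optimized implementation), the following NumPy code convolves a single-channel image with one filter and applies a ReLU activation:

import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Naive valid-mode 2-D convolution (really cross-correlation, as in most
    deep-learning frameworks) of a single-channel image with one filter."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)      # dot product with the filter
    return np.maximum(out, 0.0)                     # ReLU activation

# Example: a 3x3 vertical-edge filter applied to a toy 8x8 image
img = np.tile(np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float), (8, 1))
edge_filter = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
feature_map = conv2d_single(img, edge_filter)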

2.2. Computational Costs of Networks


The accuracy of CNN models have been increasing since their breakthrough in
2012. However, the accuracy comes at the price of a high computational cost.
Popular CNN models along with their computational costs are shown in Table 1.
Table 1. Computational costs for popular CNN models [4]

Model AlexNet GoogleNet VGG16 Resnet50


Conv layers 5 57 13 53
Conv MACs 666M 1.58G 15.3G 3.86G
Conv parameters 2.33M 5.97M 14.7M 23.5M
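To relate these counts to the layer structure, the MACs of a single standard convolutional layer can be estimated as output height × output width × output channels × (kernel height × kernel width × input channels). The short sketch below applies this formula to a hypothetical 3×3 layer; the layer shape is illustrative and is not one of the layers behind Table 1.

def conv_macs(h_out, w_out, c_out, k_h, k_w, c_in):
    """Multiply-accumulate operations for one standard convolutional layer."""
    return h_out * w_out * c_out * (k_h * k_w * c_in)

# Hypothetical layer: 56x56 output, 128 output channels, 3x3 kernel, 64 input channels
print(conv_macs(56, 56, 128, 3, 3, 64))   # 231,211,008 MACs (~0.23 G)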

3. Hardware for DNN Processing

Typical embedded processing platforms include microprocessors, embedded graphics
processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays
(FPGAs) and application-specific integrated circuits (ASICs). There are three major
metrics for measuring the efficiency of DNN hardware: the processing
throughput, the power consumption and the cost of the processor. The processing
throughput is the most important metric for comparing performance and is
usually measured as the number of floating-point operations per second, or
FLOP/s [2].

3.1. Microprocessors

For many years microprocessors have been applied as the only efficient way to
implement embedded systems. Advanced RISC machine (ARM) processors use
reduced instruction sets and require fewer transistors than those with a complex
instruction set computing (CISC) architecture (such as the x86 processors used in
most personal computers), which reduces the size and complexity, while

lowering the power consumption. ARM processors have been extensively used
in consumer electronic devices such as smart phones, tablets, multimedia players
and other mobile devices.
Microprocessors are extremely flexible in terms of programmability, and all
workloads can be run reasonably well on them. While ARM is quite powerful, it
is not a good choice for massive data parallel computations, and is only used for
low speed or low-cost applications. Recently, Arm Holdings developed Arm NN,
an open-source network machine learning (ML) software, and NXP
Semiconductors released the eIQ™ machine learning software development
environment. Both of these include inference engines, neural network compilers
and optimized libraries, which help users develop and deploy machine learning
and deep learning systems with ease.

3.2. DSPs

DSPs are well known for their high computation performance, low-power
consumption and relatively small size. DSPs have highly parallel architectures
with multiple functional units, VLIW/SIMD features and pipeline capability,
which allow complex arithmetic operations to be performed efficiently.
Compared to microprocessors, one of the primary advantages of DSP is its
capability of handling multiple instructions at the same time [5] without
significantly increasing the size of the hardware logic. DSPs are suitable for
accelerating computationally intensive tasks on embedded devices and have been
used in many real time signal and image processing systems.

3.3. Embedded GPUs

GPUs are currently the most widely used hardware option for machine and deep
learning. GPUs are designed for high data parallelism and memory bandwidth
(i.e. can transport more “stuff” from memory to the compute cores). A typical
NVIDIA GPU has thousands of cores, allowing for fast execution of the same
operation across multiple cores. Graphics processing units (GPUs) are widely
used in network training. Although extremely capable, GPUs have had trouble
gaining traction in the embedded space given the power, size and cost constraints
often found in embedded applications.

3.4. FPGAs

FPGAs were developed for digital embedded systems, based on the idea of using
arrays of reconfigurable complex logic blocks (LBs) with a network of
programmable interconnects surrounded by a perimeter of I/O blocks (IOBs).
FPGAs allow the design of custom circuits that implement hardware specific
time-consuming computation. The benefit of an FPGA is the great flexibility in
logic, providing extreme parallelism in data flow and processing vision
applications, especially at the low and intermediate levels where they are able to
employ the parallelism inherent in images. For example, 640 parallel
accumulation buffers and ALUs can be created, summing up an entire 640x480
image in just 480 clock cycles [6, 7]. In many cases, FPGAs have the potential to
exceed the performance of a single DSP or even multiple DSPs. However, a notable
disadvantage is their power efficiency.
FPGAs have more recently become a target appliance for machine learning
researchers, and big companies like Microsoft and Baidu have invested heavily in
FPGAs. It is apparent that FPGAs offer much higher performance/watt than
GPUs, because even though they cannot compete on pure performance, they use
much less power. Generally, FPGA is about an order of magnitude less efficient
than ASIC. However, modern FPGAs contain hardware resources, such as DSPs
for arithmetic operations and on-chip memories located next to DSPs which
increase the flexibility and reduce the efficiency gap between FPGA and ASIC
[2].

3.5. ASICs

An application-specific integrated circuit is the least flexible, but highest


performing, hardware option. They are also the most efficient in terms of
performance/dollar and performance/watt, but require huge investment and NRE
(non-recurring engineering) costs which make them cost-effective only in large
quantities. ASICs can be designed for either training or inference, as the
functionality of the ASIC is designed and hard-coded (it can’t be changed).
While GPUs and FPGAs perform far better than CPUs for AI related tasks, a
factor of up to 10 in efficiency may be gained with a more specific design with
ASIC. Google is the best example of successful machine learning ASIC
deployments. Released in 2018, its inference-targeted edge TPU (Coral) can
achieve a peak performance of 4.0 TOPS. Intel Movidius Neural Compute is
another good example, which boasts high performance with easy-to-use
deployment solutions.

4. The Methods for Efficient DNN Inference

As discussed earlier, the superior accuracy of DNNs comes at the cost of
high computational complexity. DNNs are both computationally intensive and
memory intensive, making them difficult to deploy and run on embedded devices
with limited hardware resources. In the past few years, several methods have been
proposed to implement efficient inference.

4.1. Reduce Number of Operations and Model Size

4.1.1. Quantization

Network quantization compresses the network by reducing the number of bits used to
represent each weight. During training, the default precision on programmable
platforms such as CPUs and GPUs is often 32- or 64-bit floating point. During
inference, the predominant numerical format is 32-bit floating point, or FP32.
However, the desire for savings in energy and increases
in throughput of deep learning models has led to the use of lower-precision
numerical formats. It has been extensively demonstrated that weights and
activations can be represented using 8-bit integers (or INT8) without incurring
significant loss in accuracy. The use of even lower bit-widths has shown great
progress in the past few years. Such schemes usually employ one or more of the methods
below to preserve model accuracy [8]:
• Training / re-training / iterative quantization
• Changing the activation function
• Modifying the network structure to compensate for the loss of information
• Conservative quantization of the first and last layers
• Mixed weight and activation precision

The simplest form of quantization involves directly quantizing a model


without re-training; this method is commonly referred to as post-training
quantization. In order to minimize the loss of accuracy from quantization, the
model can be trained in a way that considers the quantization. This means
training with quantization of weights and activations "baked" into the training
procedure; this is commonly referred to as quantization-aware training. The latency and
accuracy results of the different quantization methods can be found in Table 2 below [9],
and a small numerical sketch follows the table.

Table 2. Benefits of model quantization for several CNN models

Model                                          Mobilenet-v1-1-224   Mobilenet-v2-1-224   Inception-v3   Resnet-v2-101
Top-1 Accuracy (Original)                      0.709                0.719                0.78           0.77
Top-1 Accuracy (Post-Training Quantized)       0.657                0.637                0.772          0.768
Top-1 Accuracy (Quantization-Aware Training)   0.7                  0.709                0.775          N/A
Latency (ms) (Original)                        124                  89                   1130           3973
Latency (ms) (Post-Training Quantized)         112                  98                   845            2868
Latency (ms) (Quantization-Aware Training)     65                   54                   543            N/A
Size (MB) (Original)                           16.9                 14                   95.7           178.3
Size (MB) (Optimized)                          4.3                  3.6                  23.9           44.9
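To make the idea concrete, the following sketch performs a simple symmetric post-training INT8 quantization of a weight tensor and measures the round-trip error; it is a conceptual NumPy illustration, not the per-channel, calibration-based schemes used by production toolchains.

import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of a float tensor to INT8."""
    scale = np.abs(w).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)   # fake FP32 weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())          # small relative to weight range
print("size ratio:", q.nbytes / w.nbytes)                 # 0.25 (8-bit vs 32-bit)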

4.1.2. Network Pruning

Network pruning has been widely used to compress CNN and recurrent neural
network (RNN) models. Neural network pruning is an old concept, dating back to
1990 [10]. The main idea is that, among many parameters in the network, most
are redundant and do not contribute much to the output. It has been proven to be
an effective way to reduce the network complexity and over-fitting [11,12,13].
As shown in Figure 1, before pruning, every neuron in each layer has a
connection to the following layer and there are a lot of multiplications to execute.

Fig. 1. Pruning a deep neural network [3].

After pruning, the network becomes sparse and only connects each neuron to a
few others, which saves a lot of multiplication computations.
As shown in Figure 2, pruning usually includes a three-step process: training
connectivity, pruning connections and retraining the remaining weights. It starts
by learning the connectivity via normal network training. Next, it prunes the
small-weight connections: all connections with weights below a threshold are
removed from the network. Finally, it retrains the network to learn the final
weights for the remaining sparse connections. This is the most straightforward
method of pruning and is called one-shot pruning. Song Han et al. show that this
is surprisingly effective and can usually reduce the number of connections by 2x
without losing accuracy. They also noticed that, by applying pruning followed by
retraining repeatedly, they can achieve much better results with higher sparsity
at no accuracy loss; they called this iterative pruning. We can think of iterative
pruning as repeatedly learning which weights are important, removing the least
important weights, and then retraining the model to let it "recover" from the
pruning by adjusting the remaining weights. With this approach, the number of
parameters was reduced by 9x and 12x for the AlexNet and VGG-16 models
respectively [2,10].

Fig. 2. Pruning pipeline: train connectivity, prune connections, retrain weights.
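
The core of this pipeline can be sketched in a few lines of NumPy (illustrative only, not the method's reference implementation; magnitude_prune is a hypothetical helper): weights whose magnitude falls below a threshold are zeroed out by a binary mask, and keeping the mask allows the pruned connections to stay at zero during retraining.

import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude entries of w.

    sparsity: fraction of weights to remove (e.g. 0.8 removes 80%).
    Returns the pruned weights and the binary mask that was applied.
    """
    threshold = np.quantile(np.abs(w), sparsity)   # magnitude cut-off
    mask = (np.abs(w) > threshold).astype(w.dtype)
    return w * mask, mask

# One-shot pruning of a random weight matrix to 80% sparsity.
w = np.random.randn(256, 256).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.8)
print("remaining weights:", int(mask.sum()), "of", mask.size)

# During retraining, the gradients of pruned weights would be masked as well
# (e.g. grad *= mask), so that removed connections stay removed.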

Pruning makes the network weights sparse. While this reduces model size and
computation, it also reduces the regularity of the computations, which makes them
harder to parallelize on most embedded systems. To avoid the need for custom
hardware such as FPGAs, structured pruning has been developed; it prunes groups
of weights, such as kernels, filters, and even entire feature maps. The resulting
weights align better with the data-parallel architectures (e.g., SIMD) found in
existing embedded hardware, such as microprocessors, GPUs and DSPs, which
results in more efficient processing [15].
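
As a counterpart to the unstructured sketch above, the following NumPy sketch (again illustrative only; prune_filters_l1 is a hypothetical helper) removes entire convolution filters ranked by their L1 norm, producing a smaller dense weight tensor rather than a sparse one.

import numpy as np

def prune_filters_l1(w, keep_ratio):
    """Structured pruning: drop whole output filters with the smallest L1 norm.

    w: convolution weights of shape (out_channels, in_channels, kh, kw).
    keep_ratio: fraction of output filters to keep.
    Returns the smaller dense weight tensor and the indices that were kept.
    """
    l1 = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)   # one score per filter
    n_keep = max(1, int(round(keep_ratio * w.shape[0])))
    keep = np.sort(np.argsort(l1)[-n_keep:])             # indices of strongest filters
    return w[keep], keep

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_small, kept = prune_filters_l1(w, keep_ratio=0.5)
print(w.shape, "->", w_small.shape)   # (64, 32, 3, 3) -> (32, 32, 3, 3)

Note that in a real network the input channels of the following layer would also have to be sliced consistently with the removed filters.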

4.1.3. Compact Network Architectures

The network computations can also be reduced by improving the network
architecture itself. When designing recent DNN architectures, filters with a
smaller width and height are used more frequently, because stacking several of
them can emulate a larger filter: for example, one 5x5 convolution can be
replaced by two 3x3 convolutions. Alternatively, a 3-D convolution can be
replaced by a set of 2-D convolutions followed by 1x1 3-D convolutions; this is
also called depth-wise separable convolution [2].
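
A minimal PyTorch sketch of this replacement is shown below (illustrative only): a per-channel 3x3 convolution (obtained with the groups argument) followed by a 1x1 point-wise convolution produces the same output shape as a standard 3x3 convolution with far fewer parameters.

import torch
import torch.nn as nn

in_ch, out_ch = 64, 128

# Standard 3x3 convolution.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depth-wise separable replacement: per-channel 3x3 conv + 1x1 point-wise conv.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depth-wise
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # point-wise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print("standard params :", count(standard))    # ~74k
print("separable params:", count(separable))   # ~9k

x = torch.randn(1, in_ch, 32, 32)
print(standard(x).shape, separable(x).shape)   # same output shape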
Another way to reduce the computations is low-rank approximation, which exploits
the fact that some filters are separable. For example, a separable 2-D filter of
size m×n has rank 1 and can be expressed as two successive 1-D filters; a
separable 2-D convolution then requires m+n multiplications per output element,
whereas a standard 2-D convolution requires m×n. However, only a small
proportion of the filters in CNN models are separable. To increase this
proportion, one method is to force the convolution kernels to be separable by
penalizing high-rank filters when training the network. The other approach is to
use a small set of low-rank filters, implemented as successive separable filters,
to approximate standard convolutions [2,4].
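
The separability argument can be made explicit with a small NumPy sketch (illustrative only; rank1_approximation is a hypothetical helper): the truncated SVD of a 2-D kernel gives its best rank-1 approximation, i.e. a column filter and a row filter that can be applied as two successive 1-D convolutions.

import numpy as np

def rank1_approximation(kernel):
    """Best rank-1 approximation of a 2-D kernel via truncated SVD.

    Returns a column filter u (m,) and a row filter v (n,) such that
    np.outer(u, v) approximates the kernel.
    """
    U, s, Vt = np.linalg.svd(kernel)
    u = U[:, 0] * np.sqrt(s[0])
    v = Vt[0, :] * np.sqrt(s[0])
    return u, v

# A truly separable kernel (e.g. a Sobel-like filter) is recovered exactly.
k = np.outer(np.array([1.0, 2.0, 1.0]), np.array([-1.0, 0.0, 1.0]))
u, v = rank1_approximation(k)
print(np.allclose(np.outer(u, v), k))   # True: 3 + 3 multiplications instead of 9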

4.2. Optimize Network Structure

Several of the optimizations described above can be automated by recasting them
as machine learning problems: one can imagine an ML system that learns the right
optimizations (e.g., pruning, network size, quantization) given a desired
accuracy and speed. This approach is called automated machine learning (AutoML)
and has become a hot topic in the last few years.
AutoML provides methods and processes to make ML accessible to non-ML experts,
to improve the efficiency of ML, and to accelerate research on and applications
of ML. Neural architecture search is one of the most important areas of AutoML;
it tries to find a well-performing DNN (for example, high accuracy with less
computation) by optimizing the number of layers, the number of neurons, the
number and size of filters, the number of channels, the type of activations and
many other design decisions [17].

4.3. Winograd Transform and Fast Fourier Transform

The Winograd minimal filtering algorithm was introduced in 1980 by Shmuel
Winograd [16]; it is a computational transform that can be applied to
convolutions when the stride is 1. Winograd convolutions are particularly
efficient when processing small kernel sizes (k <= 3).
The fast Fourier transform (FFT) is a well-known algorithm that transforms 2-D
convolutions into multiplications in the frequency domain. Using the FFT to
process a 2-D convolution reduces the arithmetic complexity to O(W^2 log2 W),
where W is the feature-map width. The FFT is therefore of particular interest
for convolutions with large kernel sizes (k > 5), in contrast to the small
kernel sizes (k <= 3) targeted by the Winograd algorithm [4].
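
The frequency-domain route can be illustrated with a short NumPy sketch (illustrative only; fft_conv2d is a hypothetical helper): both signals are zero-padded, their FFTs are multiplied, and the inverse transform is checked against SciPy's direct 2-D convolution.

import numpy as np
from scipy.signal import convolve2d

def fft_conv2d(image, kernel):
    """2-D linear convolution via FFT (equivalent to 'full' direct convolution)."""
    out_shape = (image.shape[0] + kernel.shape[0] - 1,
                 image.shape[1] + kernel.shape[1] - 1)
    F_img = np.fft.fft2(image, out_shape)     # zero-padded transforms
    F_ker = np.fft.fft2(kernel, out_shape)
    return np.real(np.fft.ifft2(F_img * F_ker))

image = np.random.randn(64, 64)
kernel = np.random.randn(7, 7)               # large kernels favour the FFT route
print(np.allclose(fft_conv2d(image, kernel),
                  convolve2d(image, kernel, mode="full")))   # True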

5. Summary

While DNNs have seen vast growth in the past few years and have surpassed the
levels of accuracy that humans can achieve in many tasks, the computational
complexity and demands make them sub-optimal for efficient embedded
processing. Consequently, techniques that enable efficiency in processing and
throughput are critical to expanding DNNs to serve within embedded platforms.
In this chapter, we reviewed some of the methods that can be used to improve
energy efficiency without sacrificing accuracy within cost-effective hardware,
such as quantization, pruning, network structure optimization (AutoML),
Winograd and FFT. These methods are useful for increasing and diversifying the
capabilities of DNNs, which makes DNNs more accessible to end-users. We
expect that research and commercial applications in this area will continue to
grow over the next few years.

References

1. Song Han, Huizi Mao, William J. Dally, Deep Compression: Compressing deep neural
networks with pruning, trained quantization and Huffman coding, ICLR, San Juan, Puerto
Rico, October 2016.
2. Vivienne Sze, Tien-Ju Yang, Yu-Hsin Chen, Joe Emer, Efficient Processing of Deep Neural
Networks: A Tutorial and Survey, Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329,
December 2017.
3. Song Han, Efficient Methods and Hardware for Deep Learning, Ph.D. thesis, Stanford
University, USA, 2017.
4. Kamel Abdelouahab, Maxime Pelcat, Francois Berry, Jocelyn Serot, Accelerating CNN
inference on FPGAs: Survey, 2018, https://hal.archives-ouvertes.fr/hal-01695375/document
5. http://www.ti.com/lit/an/sprabf2/sprabf2.pdf

6. Branislav Kisacanin, Shuvra S. Bhattacharyya, Sek Chai, Embedded Computer Vision
(Advances in Computer Vision and Pattern Recognition), 2008.
7. Donald G. Bailey, Design for Embedded Image Processing on FPGAs, 2011
8. https://nervanasystems.github.io/distiller/quantization.html
9. https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize
10. Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, “Optimal brain damage.”
in Advances in Neural Information Processing Systems (NIPS), vol. 2, pp. 598–605, 1989
11. Hanson, Stephen Jose and Pratt, Lorien Y. Comparing biases for minimal network
construction with back-propagation. In Advances in neural information processing systems,
pp. 177–185, 1989.
12. Strom, Nikko. Phoneme probability estimation with dynamic sparsely connected artificial
neural networks. The Free Speech Journal, 1(5):1–41, 1997.
13. https://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-
network.pdf
14. https://nervanasystems.github.io/distiller/pruning.html#pruning
15. https://nervanasystems.github.io/distiller/algo_pruning.html#structure-pruners
16. Shmuel Winograd, Arithmetic complexity of computation, vol. 33., Siam, 1980
17. https://www.ml4aad.org/automl/

CHAPTER 1.7

RANDOM FOREST FOR DISSIMILARITY-BASED MULTI-VIEW LEARNING

Simon Bernard¹, Hongliu Cao¹,², Robert Sabourin² and Laurent Heutte¹

¹ Normandie Univ, UNIROUEN, UNIHAVRE, INSA Rouen, LITIS, 76000 Rouen, France
² LIVIA, École de Technologie Supérieure (ÉTS), Université du Québec, Montreal, QC, Canada
simon.bernard@univ-rouen.fr, caohongliu@gmail.com,
robert.sabourin@etsmtl.ca, laurent.heutte@univ-rouen.fr

Many classification problems are naturally multi-view, in the sense that their data
are described through multiple heterogeneous descriptions. For such tasks, dissimilar-
ity strategies are effective ways to make the different descriptions comparable and
to easily merge them, by (i) building intermediate dissimilarity representations
for each view and (ii) fusing these representations by averaging the dissimilari-
ties over the views. In this work, we show that the Random Forest proximity
measure can be used to build the dissimilarity representations, since this mea-
sure reflects similarities between features but also class membership. We then
propose a Dynamic View Selection method to better combine the view-specific
dissimilarity representations. This allows the decision for each instance to
predict to be taken using only the most relevant views for that instance. Experiments are
conducted on several real-world multi-view datasets, and show that the Dynamic
View Selection offers a significant improvement in performance compared to the
simple average combination and two state-of-the-art static view combinations.

1. Introduction

In many real-world pattern recognition problems, the available data are complex
in that they cannot be described by a single numerical representation. This may
be due to multiple sources of information, as for autonomous vehicles for example,
where multiple sensors are jointly used to recognize the environment.1 It may also
be due to the use of several feature extractors, such as in image recognition tasks,
often based on multiple families of features, such as color, shape, texture descriptors,
etc.2
Learning from these types of data is called multi-view learning and each modal-
ity/set of features is called a view. For this type of task, it is assumed that the views
convey different types of information, each of which can contribute to the pattern
recognition task. Therefore, the challenge is generally to carry out the learning task
taking into account the complementarity of the views. However, the difficulty with
this is that these views can be very different from each other in terms of dimension,
nature and meaning, and therefore very difficult to compare or merge. In a recent
work,2 we proposed to use dissimilarity strategies to overcome this issue. The idea
is to use a dissimilarity measure to build intermediate representations from each
view separately, and to merge them afterward. By describing the instances with
their dissimilarities to other instances, the merging step becomes straightforward
since the intermediate dissimilarity representations are fully comparable from one
view to another.
For using dissimilarities in multi-view learning, two questions must be addressed:
(i) how to measure and exploit the dissimilarity between instances for building the
intermediate representation? and (ii) how to combine the view-specific dissimilarity
representations for the final prediction?
In our preliminary work,2 the first question has been addressed with Random
Forest (RF) classifiers. RF are known to be versatile and accurate classifiers3,4
but they are also known to embed a (dis)similarity measure between instances.5
The advantage of such a mechanism in comparison to traditional similarity mea-
sures is that it takes the classification/regression task into account for computing
the similarities. For classification for example, the instances that belong to the
same class are more likely to be similar according to this measure. Therefore, a
RF trained on a view can be used to measure the dissimilarities between instances
according to the view, and according to their class membership as well. The way
this measure is used to build the per-view intermediate representations is by cal-
culating the dissimilarity of a given instance x to all the n training instances. By
doing so, x can be represented by a new feature vector of size n, or in other words
in a n-dimensional space where each dimension is the dissimilarity to one of the
training instances. This space is called the dissimilarity space6,7 and is used as the
intermediate representation for each view.
As for the second question, we addressed the combination of the view-specific
dissimilarity representations by computing the average dissimilarities over all the
views. That is to say, for an instance x, all the view-specific dissimilarity vectors are
computed and averaged to obtain a final vector of size n. Each value in this vector is
thus the average dissimilarity between x and one of the n training instances over the
views. This is a simple, yet meaningful way to combine the information conveyed
by each view. However, one could find it a little naive when considering the true
rationale behind multi-view learning. Indeed, even if the views are expected to be
complementary to each other, they are likely to contribute in very different ways
to the final decision. One view in particular is likely to be less informative than
another, and this contribution is even likely to be very different from an instance to
predict to another. In that case, it is desirable to estimate and take this importance
into account when merging the view-specific representations. This is the goal we
follow in the present work.
In a nutshell, our preliminary work2 has validated the generic framework ex-
plained above, with the two following key steps: (i) building the dissimilarity space
with the RF dissimilarity mechanism and (ii) combining the views afterward by
averaging the dissimilarities. In the present work, we deepen the second step by
investigating two methods to better combine the view-specific dissimilarities:

(1) combining the view-specific dissimilarities with a static weighted average, so
that the views contribute differently to the final dissimilarity representation;
in particular, we propose an original weight calculation method based on an
analysis of the RF classifiers used to compute the view-specific dissimilarities;
(2) combining the view-specific dissimilarities with a dynamic combination, for
which the views are solicited differently from one instance to predict to another;
this dynamic combination is based on the definition of a region of competence
for which the performance of the RF classifiers is assessed and used for a view
selection step afterward.

The rest of this chapter is organized as follows. The Random Forest dissimi-
larity measure is firstly explained in Section 2. The way it is used for multi-view
classification is detailed in Section 3. The different strategies for combining the
dissimilarity representations are given in Section 4, along with our two proposals
for static and dynamic view combinations. Finally, the experimental validation is
presented in Section 5.

2. Random Forest Dissimilarity

To fully understand the way a RF classifier can be used to compute dissimilarities
between instances, it is first necessary to understand how an RF is built and how
it gives a prediction for each new instance.

2.1. Random Forest


In this work, the name “Random Forest” refers to Breiman’s reference method.3
Let us briefly recall its procedure to build a forest of M Decision Trees, from a
training set T . First, a bootstrap sample is built by random draw with replacement
of n instances, amongst the n training instances available in T . Each of these
bootstrap samples is then used to build one tree. During this induction phase, at
each node of the tree, a splitting rule is designed by selecting a feature over mtry
features randomly drawn from the m available features. The feature retained for
the splitting rule at a given node is the one among the mtry that maximizes the
splitting criterion. At last, the trees in RF classifiers are grown to their maximum
depth, that is to say when all their terminal nodes (also called leaves) are pure. The
resulting RF classifier is typically noted as:

H(x) = {hk (x), k = 1, . . . , M } (1)



where hk (x) is the k th Random Tree of the forest, built using the mechanisms ex-
plained above.3,8 Note however that there exist many other RF learning methods
that differ from Breiman’s method by the use of different randomization tech-
niques for growing the trees.9

For predicting the class of a given instance x with a Random Tree, x goes down
the tree structure from its root to one of its leaves. The descending path followed by
x is determined by successive tests on the values of its features, one per node along
the path. The prediction is given by the leaf in which x has landed. More informa-
tion about this procedure can be found in the recently published RF reviews.8–10
The key point here is that, if two test instances land in the same terminal node,
they are likely to belong to the same class and they are also likely to share simi-
larities in their feature vectors, since they have followed the same descending path.
This is the main motivation behind using RF for measuring dissimilarities between
instances.
Note that the final prediction of a RF classifier is usually obtained via majority
voting over the component trees. Here again, there exist alternatives to majority
voting,9 but this latter remains the most used as far as we know.

2.2. Using Random Forest for measuring dissimilarities


The RF dissimilarity measure is the opposite measure of the RF proximity (or sim-
ilarity) measure defined in Breiman’s work,2,3,10 the latter being noted pH (xi , xj )
in the following.
The RF dissimilarity measure is inferred from a RF classifier H, learned from
T . Let us firstly define the dissimilarity measure inferred by a single Random Tree
hk , noted dk : let Lk denote the set of leaves of hk , and let lk (x) denote a function
from the input domain X to Lk , that returns the leaf of hk where x lands when one
wants to predict its class. The dissimilarity measure dk is defined as in Equation 2:
if two training instances xi and xj land in the same leaf of hk , then the dissimilarity
between both instances is set to 0, else it is equal to 1.
dk(xi, xj) = { 0, if lk(xi) = lk(xj); 1, otherwise }    (2)
The dk measure is the strict opposite of the tree proximity measure pk ,3,10 i.e.
dk (xi , xj ) = 1 − pk (xi , xj ).
Now, the measure dH (xi , xj ) derived from the whole forest consists in calculating
dk for each tree in the forest, and in averaging the resulting dissimilarity values over
the M trees, as follows:
dH(xi, xj) = (1/M) Σ_{k=1}^{M} dk(xi, xj)    (3)
Similarly to the way the predictions are given by a forest, the rationale is that
the accuracy of the dissimilarity measure dH relies essentially on the averaging
over a large number of trees. Moreover, this measure is a pairwise function
dH : X × X → R+ that satisfies the reflexivity property (dH(xi, xi) = 0), the
non-negativity property (dH(xi, xj) ≥ 0) and the symmetry property
(dH(xi, xj) = dH(xj, xi)). Note however that it does not satisfy the last two
properties of distance functions, namely the definiteness property
(dH(xi, xj) = 0 does not imply xi = xj) and the triangle inequality (dH(xi, xk)
is not necessarily less than or equal to dH(xi, xj) + dH(xj, xk)).
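
In practice, this measure is easy to obtain from common RF implementations. The sketch below (scikit-learn based, illustrative only and not the authors' code; rf_dissimilarity is a hypothetical helper) compares the leaf indices returned by RandomForestClassifier.apply: two instances are dissimilar in a tree whenever they land in different leaves, and the values are averaged over the trees as in Equation 3.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(forest, X_a, X_b):
    """Pairwise RF dissimilarity between the rows of X_a and X_b (Equation 3).

    leaves_*[i, k] is the index of the leaf reached by instance i in tree k.
    """
    leaves_a = forest.apply(X_a)          # shape (n_a, M)
    leaves_b = forest.apply(X_b)          # shape (n_b, M)
    # d[i, j] = fraction of trees in which the two instances land in different leaves
    return (leaves_a[:, None, :] != leaves_b[None, :, :]).mean(axis=2)

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
D = rf_dissimilarity(forest, X, X)        # n x n dissimilarity matrix
print(D.shape, D.diagonal().max())        # (150, 150) 0.0  (reflexivity)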
As far as we know, only few variants of this measure have been proposed in the
literature.5,11 These variants differ from the measure explained above in the way
they infer the dissimilarity value from a tree structure. The motivation is to design a
finer way to measure the dissimilarity than the coarse binary value given in Equation
2. This coarse value may seem intuitively too superficial to measure dissimilarities,
especially considering that a tree structure can provide richer information about
the way two instances are similar to each other.
The first variant5 modifies the pH measure by using the path length from one leaf
to another when two instances land in different leaf nodes. In this way, pk (xi , xj )
does not take its value in {0, 1} anymore but is computed as follows:

pH(xi, xj) = (1/M) Σ_{k=1}^{M} pk(xi, xj) = (1/M) Σ_{k=1}^{M} 1/exp(w · gijk)    (4)
k=1 k=1

where gijk is the number of tree branches between the two terminal nodes occupied
by xi and xj in the k th tree of the forest, and where w is a parameter to control the
influence of g in the computation. When lk (xi ) = lk (xj ), dk (xi , xj ) is still equal to
0, but in the opposite situation the resulting value is in ]0, 1].
A second variant,11 noted RFD in the following, leans on a measure of instance
hardness, namely the κ-Disagreeing Neighbors (κDN) measure,12 that estimates the
intrinsic difficulty to predict an instance as follows:
κDN(xi) = |{xk : xk ∈ κNN(xi), yk ≠ yi}| / κ    (5)
where κNN(xi) is the set of the κ nearest neighbors of xi. This value is used for
measuring the dissimilarity d̂H(x, xi) between any instance x and any of the
training instances xi, as follows:

d̂H(x, xi) = [ Σ_{k=1}^{M} (1 − κDNk(xi)) × dk(x, xi) ] / [ Σ_{k=1}^{M} (1 − κDNk(xi)) ]    (6)

where κDNk(xi) is the κDN measure computed in the subspace formed by the
sole features used in the k th tree of the forest.
Any of these variants could be used to compute the dissimilarities in our frame-
work. However, we choose to use the RFD variant in the following, since it has been
shown to give very good results when used for building dissimilarity representations
for multi-view learning.11

3. The Dissimilarity Representation for Multi-view Learning

3.1. The dissimilarity space


Among the different dissimilarity strategies for classification, the most popular is the
dissimilarity representation approach.6 It consists in using a set R of m reference
instances, to build an n × m dissimilarity matrix. The elements of this matrix are the
dissimilarities between the n training instances in T and the m reference instances
in R:
D(T, R) = [ d(x1, p1)  d(x1, p2)  . . .  d(x1, pm)
            d(x2, p1)  d(x2, p2)  . . .  d(x2, pm)
            . . .      . . .      . . .  . . .
            d(xn, p1)  d(xn, p2)  . . .  d(xn, pm) ]    (7)
where d stands for a dissimilarity measure, xi are the training instances and pj
are the reference instances. Even if T and R can be disjoint sets of instances, the
most common choice is to take R as a subset of T, or even as T itself. In this work, for
simplification purposes, and to avoid the selection of reference instances from T, we
choose R = T . As a consequence the dissimilarity matrix D is always a symmetric
n × n matrix.
Once such a squared dissimilarity matrix is built, there exist two main ways
to use it for classification: the embedding approach and the dissimilarity space
approach.6 The embedding approach consists in embedding the dissimilarity matrix
in a Euclidean vector space such that the distances between the objects in this
space are equal to the given dissimilarities. Such an exact embedding is possible
for every symmetric dissimilarity matrix with zeros on the diagonal.6 In practice,
if the dissimilarity matrix can be transformed in a positive semi-definite (p.s.d.)
similarity matrix, this can be done with kernel methods. This p.s.d. matrix is
used as a pre-computed kernel, also called a kernel matrix. This method has been
successfully applied with RFD along with SVM classifiers.2,13
The second approach, the dissimilarity space strategy, is more versatile and
does not require the dissimilarity matrix to be transformed into a p.s.d. similarity
matrix. It simply consists in using the dissimilarity matrix as a new training set.
Indeed, each row i of the matrix D can be seen as the projection of a training
instance xi into a dissimilarity space, where the j th dimension is the dissimilarity
with the training instance xj . As a consequence, the matrix D(T, T ) can be seen
as the projection of the training set T into this dissimilarity space, and can be fed
afterward to any learning procedure. This method is much more straightforward
than the embedding approach as it can be used with any dissimilarity measurement,
regardless of its reflexivity or symmetry properties, and without transforming it into
a p.s.d. similarity matrix.
In the following, the dissimilarity matrices built with the RFD measure are
called RFD matrices and are noted DH for short. It can be proven that the matrices
derived from the initial RF proximity measure3,10 are p.s.d. and can be used as
pre-computed kernels in SVM classifiers,2 following the embedding approach.
However, the proof does not apply if the matrices are obtained using the RFD measure.11
This is the main reason we use the dissimilarity space strategy in this work, as it
allows more flexibility.

3.2. Using dissimilarity spaces for multi-view learning


In traditional supervised learning tasks, each instance is described by a single vec-
tor of m features. For multi-view learning tasks, each instance is described by Q
different vectors. As a consequence, the task is to infer a model h:
h : X (1) × X (2) × · · · × X (Q) → Y (8)
where the X (q) are the Q input domains, i.e. the views. These views are generally
of different dimensions, noted m1 to mQ . For such learning tasks, the training set
T is actually made up of Q training subsets:

T(q) = { (x1(q), y1), (x2(q), y2), . . . , (xn(q), yn) },  ∀q = 1..Q    (9)

The key principle of the proposed framework is to compute the RFD matrices
DH(q) from each of the Q training subsets T(q). For that purpose, each T(q) is fed
to the RF learning procedure, resulting in Q RF classifiers noted H(q), ∀q = 1..Q.
The RFD measure is then used to compute the Q RFD matrices DH(q), ∀q = 1..Q.
Once these RFD matrices are built, they have to be merged in order to build the
joint dissimilarity matrix DH that will serve as a new training set for an additional
learning phase. This additional learning phase can be realized with any learning
algorithm, since the goal is to address the classification task. For simplicity and
because they are as accurate as they are versatile, the same Random Forest method
used to calculate the dissimilarities is also used in this final learning stage.
Regarding the merging step, which is the main focus of the present work, it can
be straightforwardly done by a simple average of the Q RFD matrices:

DH = (1/Q) Σ_{q=1}^{Q} DH(q)    (10)

The whole RFD based multi-view learning procedure is summarized in Algorithm 1
and illustrated in Figure 1.
As for the prediction phase, the procedure is very similar. For any new instance
x to predict:
(1) Compute dH(q)(x, xi), ∀xi ∈ T(q), ∀q = 1..Q, to form Q n-sized dissimilarity
vectors for x. These vectors are the dissimilarity representations for x, from
each of the Q views.
(2) Compute dH(x, xi) = (1/Q) Σ_{q=1}^{Q} dH(q)(x, xi), ∀i = 1..n, to form the n-sized vector
that corresponds to the projection of x in the joint dissimilarity space.
(3) Predict the class of x with the classifier trained on DH .

Algorithm 1: The RFD multi-view learning procedure

Input: T(q), ∀q = 1..Q: the Q training sets, composed of n instances
Input: RF(.): The Breiman's RF learning procedure
Input: RFD(., .|.): the RFD dissimilarity measure
Output: H(q): Q RF classifiers
Output: Hfinal: the final RF classifier
 1 for q = 1..Q do
 2     H(q) = RF(T(q))
       // Build the n × n RFD matrix DH(q):
 3     forall xi ∈ T(q) do
 4         forall xj ∈ T(q) do
 5             DH(q)[i, j] = RFD(xi, xj | H(q))
 6         end
 7     end
 8 end
   // Build the n × n average RFD matrix DH:
 9 DH = (1/Q) Σ_{q=1}^{Q} DH(q)
   // Train the final classifier on DH:
10 Hfinal = RF(DH)

Fig. 1. The RFD framework for multi-view learning.
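
A compact Python rendering of this procedure is sketched below (scikit-learn based, illustrative only, not the authors' implementation; rf_dissimilarity, rfd_multiview_fit and rfd_multiview_predict are hypothetical helpers): one forest per view produces a view-specific dissimilarity matrix, the matrices are averaged as in Equation 10, and a final forest is trained on the joint matrix.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(forest, X_a, X_b):
    """Pairwise RF dissimilarity (fraction of trees assigning different leaves)."""
    la, lb = forest.apply(X_a), forest.apply(X_b)
    return (la[:, None, :] != lb[None, :, :]).mean(axis=2)

def rfd_multiview_fit(views, y, n_trees=512):
    """views: list of Q arrays, each of shape (n, m_q), describing the same n instances."""
    forests = [RandomForestClassifier(n_estimators=n_trees).fit(X, y) for X in views]
    D_views = [rf_dissimilarity(f, X, X) for f, X in zip(forests, views)]
    D_joint = np.mean(D_views, axis=0)                 # Equation 10
    final = RandomForestClassifier(n_estimators=n_trees).fit(D_joint, y)
    return forests, final

def rfd_multiview_predict(forests, final, train_views, test_views):
    """Project test instances into the joint dissimilarity space and predict."""
    D_test = np.mean([rf_dissimilarity(f, Xt, Xtr)
                      for f, Xt, Xtr in zip(forests, test_views, train_views)], axis=0)
    return final.predict(D_test)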

4. Combining Views with Weighted Combinations

The average dissimilarity is a simple, yet meaningful way to merge the dissimilarity
representations built from all the views. However, it intrinsically considers that
all the views are equally relevant with regard to the task and that the resulting
dissimilarities are as reliable as each other. This is likely to be wrong from our
point of view. In multi-view learning problems, the different views are meant to be
complementary in some ways, that is to say to convey different types of information
regarding the classification task. These different types of information may not
have the same contribution to the final predictions. That is the reason why it
may be important to differentiate these contributions, for example with a weighted
combination in which the weights would be defined according to the view reliability.
The calculation of these weights can be done following two paradigms: static
weighting and dynamic weighting. The static weighting principle is to weight the
views once and for all, with the assumption that the importance of each view is the same
for all the instances to predict. The dynamic weighting principle, on the other hand,
aims at setting different weights for each instance to predict, with the assumption
that the contribution of each view to the final prediction is likely to be different
from one instance to another.

4.1. Static combination

Given a set of dissimilarity matrices {D(1) , D(2) , . . . , D(Q) } built from Q different
views, our goal is to find the best set of non-negative weights {w(1) , w(2) , . . . , w(Q) },
so that the joint dissimilarity matrix is:


D = Σ_{q=1}^{Q} w(q) D(q)    (11)

where w(q) ≥ 0 and Σ_{q=1}^{Q} w(q) = 1.

There exist several ways, proposed in the literature, to compute the weights
of such a static combination of dissimilarity matrices. The most natural one is to
deduce the weights from a quality score measured on each view. For example, this
principle has been used for multi-scale image classification14 where each view is a
version of the image at a given scale, i.e. the weights are derived directly from the
scale factor associated with the view. Obviously, this only makes sense with regard
to the application, for which the scale factor gives an indication of the reliability
for each view.
Another, more generic and classification-specific approach, is to evaluate the
quality of the dissimilarity matrix using the performance of a classifier. This makes
it possible to estimate whether a dissimilarity matrix sufficiently reflects class mem-
bership.14,15 For example, one can train a SVM classifier from each dissimilarity
matrix and use its accuracy as an estimation of the corresponding weights.14 kNN
classifiers are also very often used for that purpose.15,16 The reason is that a good
dissimilarity measure is expected to propose good neighborhoods, or in other words
the most similar instances should belong to the same class.
Since kernel matrices can be viewed as similarity matrices, there are also a few
solutions in the kernel methods literature that could be used to estimate the
quality of a dissimilarity matrix. The most notable is the Kernel Alignment (KA)
estimate17 A(K1 , K2 ), for measuring the similarity between two kernel matrices K1
and K2:

A(K1, K2) = ⟨K1, K2⟩F / √( ⟨K1, K1⟩F ⟨K2, K2⟩F )    (12)
where Ki is a kernel matrix and ⟨·, ·⟩F is the Frobenius inner product.17
In order to use the KA measure to estimate the quality of a given kernel matrix, a
target matrix must be defined beforehand. This target matrix is an ideal theoretical
similarity matrix, regarding the task. For example, for binary classification, the
ideal target matrix is usually defined as K∗ = yyT , where y = {y1 , y2 , . . . , yn } are
the true labels of the training instances, in {−1, +1}. Thus, each value in K∗ is:

K*ij = { 1, if yi = yj; −1, otherwise }    (13)
In other words, the ideal matrix is the similarity matrix in which instances are
considered similar (K∗ij = 1) if and only if they belong to the same class. This
estimate is transposed to multi-class classification problems as follows:18

K*ij = { 1, if yi = yj; −1/(C−1), otherwise }    (14)

where C is the number of classes.

Both kNN and KA methods presented above are used in the experimental part
for comparison purposes (cf. Section 5). However, in order to use the KA method
for our problem, some adaptations are required. Firstly, the dissimilarity matrices
need to be transformed into similarity matrices by S(q) = 1 − D(q) . The following
heuristic is then used to deduce the weight from the KA measure:19
w(q) = A(S(q), yyT) / Σ_{h=1}^{Q} A(S(h), yyT)    (15)

Strictly speaking, for the similarity matrices S(q) to be considered as kernel matrices,
it must be proven that they are p.s.d. When such matrices are proven to be p.s.d,
the KA estimate is necessarily non-negative, and the corresponding w(q) are also
non-negative.17,19 However, as it is not proven that our matrices S(q) built from
RF D are p.s.d., we propose to use the softmax function to normalize the weights
and to ensure they are strictly positive:
w(q) = exp(A(S(q), K*)) / Σ_{h=1}^{Q} exp(A(S(h), K*))    (16)
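
A small NumPy sketch of this weighting scheme is given below (illustrative only, not the authors' code; kernel_alignment and ka_softmax_weights are hypothetical helpers): the alignment of each similarity matrix S(q) = 1 − D(q) with the ideal target K* (Equations 13 and 14) is computed as in Equation 12, and the weights follow the softmax normalization of Equation 16.

import numpy as np

def kernel_alignment(K1, K2):
    """Kernel alignment A(K1, K2) between two square matrices (Equation 12)."""
    num = np.sum(K1 * K2)                               # Frobenius inner product
    return num / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def ka_softmax_weights(D_views, y):
    """One weight per view from the alignment with the ideal target matrix."""
    C = len(np.unique(y))
    same = (y[:, None] == y[None, :])
    K_star = np.where(same, 1.0, -1.0 / (C - 1))        # Equations 13/14
    scores = np.array([kernel_alignment(1.0 - D, K_star) for D in D_views])
    w = np.exp(scores)                                  # Equation 16 (softmax)
    return w / w.sum()

# Toy usage: three random symmetric dissimilarity matrices for 10 instances.
rng = np.random.default_rng(0)
y = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
D_views = [(lambda A: (A + A.T) / 2)(rng.random((10, 10))) for _ in range(3)]
print(ka_softmax_weights(D_views, y))                   # three weights summing to 1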

The main drawback of the methods mentioned above is that they evaluate the
quality of the dissimilarity matrices based solely on the training set. This is the
very essence of these methods, which are designed to evaluate (dis)similarity matri-
ces built from a sample, e.g. the training set. However, this may cause overfitting
issues when these dissimilarity matrices are used for classification purposes as it is
the case in our framework. Ideally, the weights should be set from the quality of the
dissimilarity representations estimated on an independent validation dataset. Ob-
viously, this requires additional labeled instances. The method we propose
in this section makes it possible to estimate the quality of the dissimilarity
representations without the use of additional validation instances.
The idea behind our method is that the relevance of a RFD space is reflected by
the accuracy of the RF classifier used to build it. This accuracy can be efficiently
estimated with a mechanism called the Out-Of-Bag (OOB) error. This OOB error
is an estimate supplied by the Bagging principle, known to be a reliable estimate of
the generalization error.3 Since the RF classifiers in our framework are built with
the Bagging principle, the OOB error can be used to estimate their generalization
error without the need of an independent validation dataset.
Let us briefly explain here how the OOB error is obtained from an RF: let B
denote a Bootstrap sample formed by randomly drawing p instances from T , with
replacement. When p = n, n being the number of instances in T , it can be proven
that about one third of T , in average, will not be drawn to form B.3 These instances
are called the OOB instances of B. Using Bagging for growing a RF classifier, each
tree in the forest is trained on a Bootstrap sample, that is to say using only about
two thirds of the training instances. Similarly, each training instance x is used for
growing about two thirds of the trees in the forest. The remaining trees are called
the OOB trees of x. The OOB error is the error rate measured on the whole training
set by only using the OOB trees of each training instance.
Therefore, the method we propose to use consists in using the OOB error of the
RF classifier trained on a view directly as its weight in the weighted combination.
This method is noted SWOOB in the following.
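
In practice, the OOB estimate is directly available from common RF implementations. The sketch below (scikit-learn based, illustrative only and not the authors' code; per_view_oob_error is a hypothetical helper) trains one bagged forest per view with OOB scoring enabled and returns the per-view OOB errors, which can then be turned into the weights of Equation 11 as described above.

from sklearn.ensemble import RandomForestClassifier

def per_view_oob_error(views, y, n_trees=512):
    """OOB error of one RF per view, usable to weight the view-specific RFD matrices.

    views: list of Q arrays of shape (n, m_q); y: labels of the n instances.
    """
    errors = []
    for X in views:
        rf = RandomForestClassifier(n_estimators=n_trees,
                                    bootstrap=True,      # required for OOB estimation
                                    oob_score=True).fit(X, y)
        errors.append(1.0 - rf.oob_score_)               # oob_score_ is OOB accuracy
    return errors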

4.2. Dynamic combination

In contrast to static weighting, dynamic weighting aims at assigning different
weights to the views for each instance to predict.20 The intuition behind using
dynamic weighting in our framework is that the prediction for different instances
may rely on different types of information, i.e. different views. In that case, it is
crucial to use different weights for building the joint dissimilarity representation
from one instance to predict to another.
However, such a dynamic weighting process is particularly complex in our frame-
work. Let us recall that the framework we propose to use in this work is composed of
two stages: (i) inferring the dissimilarity matrix from each view, and (ii) combining
the per-view dissimilarity matrices to form a new training set. The weights we want
to determine are the weights used to compute the final joint dissimilarity matrix in
stage (ii). As a consequence, if these weights change for each instance to predict,
the joint dissimilarity matrix must be completely recalculated and a new classifier
must also be re-trained afterwards. This means that, for every new instance to
predict, a whole training procedure has to be performed. This is computationally
expensive and quite inefficient from our point of view.
To overcome this problem, we propose to use Dynamic Classifier Selection (DCS)
instead of dynamic weighting. DCS is a generic strategy, amongst the most success-
ful ones in the Multiple Classifier Systems literature.20 It typically aims at selecting
one classifier in a pool of candidate classifiers, for each instance to predict. This
is essentially done through two steps:21 (i) the generation of a pool of candidate
classifiers and (ii) the selection of the most competent classifier in this pool for
the instance to predict. The solutions we propose for these steps are illustrated in
Figure 2, the first step in the upper part and the second step in the lower part. The
whole procedure is also detailed in Algorithm 2 and described in the following.

4.2.1. Generation of the pool of classifiers

The generation of the pool is the first key step of DCS. As the aim is to select the
most competent classifier on the fly for each given test instance, the classifiers in
the pool must be as diverse and as individually accurate as possible. In our case,
the challenge is not to create the diversity in the classifiers, since they are trained
on different joint dissimilarity matrices, generated with different sets of weights.
The challenge is rather to generate these different weight tuples used to compute
the joint dissimilarity matrices. For such a task, a traditional grid search strategy
could be used. However, the number of candidate solutions increases exponentially
with respect to the number of views. For example, Suppose that we sample the
weights with 10 values in [0, 1]. For Q views, it would result in 10Q different weight
tuples. Six views would thus imply to generate 1 million weight tuples and to train
1 million classifiers afterwards. Here again, this is obviously highly inefficient.
The alternative approach we propose is to select a subset of views for every
candidate in the pool, instead of considering a weighted combination of all of them.
By doing so, for each instance to predict, only the views that are considered infor-
mative enough are expected to be used for its prediction. The selected views are
then combined by averaging. For example, if a problem is described with six views,
there are 26 − 1 = 63 possible combinations (the situation in which none of the
views is selected is obviously ignored), which will result in a pool of 63 classifiers
H = {H1 , H2 , . . . , H63 }. Lines 1 to 6 of Algorithm 2 give a detailed implementation
of this procedure.

4.2.2. Evaluation and selection of the best classifier

The selection of the most competent classifier is the second key step of DCS. Gen-
erally speaking, this selection is made through two steps:20 (i) the definition of a
region of competence for the instance to predict and (ii) the evaluation of each clas-
sifier in the pool for this region of competence, in order to select the most competent
one.
The region of competence Θt of each instance xt is the region used to estimate
the competence of the classifiers for predicting that instance. The usual way to do
so is to rely on clustering methods or to identify the k nearest neighbors (kNN)
of xt . For clustering,22 the principle is usually to define the region of competence
as the closest cluster of xt , according to the distances of xt to the centroids of the
clusters. As the clusters are fixed once for all, many different instances might share
the same region of competence. In contrast, kNN methods give different regions of
competence from one instance to another, which allows for more flexibility but at
the expense of a higher computational cost.23
The most important part of the selection process is to define the criterion to
measure the competence level of each classifier in the pool. There are a lot of
methods for doing so, that differ in the way they estimate the competence, using
for example a ranking, the classifier accuracies, a data complexity measure, etc.20
Nevertheless, the general principle is most of the time the same: calculating the
measure on the region of competence exclusively. We do not give an exhaustive
survey of the way it can be done here, but briefly explain the most representative
method, namely the Local Classifier Accuracy (LCA) method,24 as an illustration.
The LCA method measures the local accuracy of a candidate classifier Hi , with
respect to the prediction ŷt of a given instance xt :

wi,t = [ Σ_{xk ∈ Θt,ŷt} I(Hi(xk) = ŷt) ] / [ Σ_{xk ∈ Θt} I(yk = ŷt) ]    (17)

where Θt = {x1 , . . . xk , . . . , xK } is the region of competence for xt , and Θt,ŷt is


the set of instances from Θt that belong to the same class as ŷt . Therefore, wi,t
represents the percentage of correct classifications within the region of competence,
by only considering the instances for which the classifier predicts the same class as
for xt . In this calculation, the instances in Θt generally come from a validation set,
independent of the training set T .20
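
For concreteness, a direct NumPy transcription of Equation 17 is sketched below (illustrative only; local_classifier_accuracy is a hypothetical helper): among the neighbors of the region of competence whose true label matches the class predicted for the query, it measures the fraction that the candidate classifier also predicts correctly.

import numpy as np

def local_classifier_accuracy(neighbor_labels, neighbor_preds, y_hat):
    """LCA competence of one classifier (Equation 17).

    neighbor_labels: true labels of the instances in the region of competence.
    neighbor_preds:  predictions of the candidate classifier on those instances.
    y_hat:           class predicted by that classifier for the query instance.
    """
    neighbor_labels = np.asarray(neighbor_labels)
    neighbor_preds = np.asarray(neighbor_preds)
    same_class = neighbor_labels == y_hat            # the subset Θ_{t, y_hat}
    if not same_class.any():
        return 0.0
    return np.mean(neighbor_preds[same_class] == y_hat)

# Toy example: 7 neighbors, the classifier predicts class 1 for the query.
print(local_classifier_accuracy([1, 0, 1, 1, 2, 1, 0],
                                [1, 0, 1, 0, 2, 1, 1], y_hat=1))   # 0.75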
The alternative method we propose here is to use a selection criterion that does
not rely on an independent validation set, but rather relies on the OOB estimate. To
do so, the region of competence is formed by the k nearest neighbors of xt , amongst
the training instances. These nearest neighbors are determined in the joint dissimi-
larity space with the RFD measure (instead of the traditional Euclidean distance).
This is related to the fact that each candidate classifier is trained in this dissimi-
larity space, but also because the RFD measure is more robust to high dimensional
spaces, contrary to traditional distance measures. Finally, the competence of each
classifier is estimated with its OOB error on the k nearest neighbors of xt . Lines 7
to 15 of Algorithm 2 give all the details of this process.
To sum it up, the key mechanisms of the DCS method we proposed, noted
DCSRFD and detailed in Algorithm 2, are:

• Create the pool of classifiers by using all the possible subsets of views, to avoid
the expensive grid search for the weights generation (lines 4-5 of Algorithm 2).
• Define the region of competence in the dissimilarity space by using the RFD
dissimilarity measure, to circumvent the issues that arise from high dimensional
spaces (lines 12-13 of Algorithm 2).
• Evaluate the competence of each candidate classifier with its OOB error rate,
so that no additional validation instances are required (line 14 of Algorithm 2).
• Select the best classifier for xt (lines 16-17 of Algorithm 2).

These steps are also illustrated in Figure 2 with the generation of the pool of
classifiers in the upper part, and with the evaluation and selection of the classifier
in the lower part of the figure. For illustration purposes, the classifier ultimately
selected for predicting the class of xt is assumed to be the second candidate (in
red).

Fig. 2. The DCSRFD procedure, with the training and prediction phases. The best candidate
classifier that gives the final prediction for xt is H[2] in this illustration (in red).
Algorithm 2: The DCSRFD method

Input: T(q), ∀q = 1..Q: the Q training sets, each composed of n instances
Input: D(q), ∀q = 1..Q: Q n × n RFD matrices, built from the Q views
Input: H(q), ∀q = 1..Q: the Q RF classifiers used to build the D(q)
Input: RF(.): The RF learning procedure
Input: RFD(., .|.): the RFD measure
Input: k: the number of neighbors to define the region of competence
Input: xt: an instance to predict
Output: ŷ: the prediction for xt
   // 1 - Generate the pool of classifiers:
 1 {w0, w1, . . . , w2^Q−1} = all the possible Q-sized 0/1 vectors
 2 H = an empty pool of classifiers
 3 for i = 1..2^Q − 1 do
       // The ith candidate classifier in the pool, wi[q] being the qth value of wi,
       // either equal to 1 or 0:
 4     Di = (1/Q) Σ_{q=1}^{Q} D(q) . wi[q]
 5     H[i] = RF(Di)
 6 end
   // 2 - Evaluate the candidate classifiers for xt
 7 for q = 1..Q do
       // The qth dissimilarity representation of xt:
 8     dxt(q) = RFD(xt, xj | H(q)), ∀xj ∈ T(q)
 9 end
10 D = an empty set of dissimilarity representations of xt
11 for i = 1..2^Q − 1 do
       // The averaged dissimilarity representation of xt:
12     D[i] = (1/Q) Σ_{q=1}^{Q} dxt(q) . wi[q]
       // The region of competence, Di[j, .] being the jth row of Di:
13     θt,i = the kNN according to RFD(D[i], Di[j, .] | H[i]), ∀j = 1..n
       // The competence of H[i] on θt,i:
14     St,i = OOBerr(H[i], θt,i)
15 end
   // 3 - Select the best classifier for xt and predict its class
16 m = arg maxi St,i
17 ŷ = H[m](D[m])

5. Experiments

5.1. Experimental protocol


Both the SWOOB and the DCSRFD methods are evaluated on several real-world
multi-view datasets in the following, and compared to state-of-the-art methods: the
simple average of the view-specific dissimilarity matrices as a baseline method and
the two static weighting methods presented in Section 4.1, namely the 3NN and
the KA methods.
The multi-view datasets used in this experiment are described in Table 1. All
these datasets are real-world multi-view datasets, supplied with several views of
the same instances: NonIDH1, IDHcodel, LowGrade and Progression are medical
imaging classification problems, with different families of features extracted from
different types of radiographic images; LSVT and Metabolomic are two other med-
ical related classification problems, the first one for Parkinson’s disease recognition
and the second one for colorectal cancer detection; BBC and BBCSport are text
classification problems from news articles; Cal7, Cal20, Mfeat, NUS-WIDE2, NUS-
WIDE3, AWA8 and AWA15 are image classification problems made up with dif-
ferent families of features extracted from the images. More details about how these
datasets have been constituted can be found in the paper (and references therein)
cited in the caption of Table 1.

Table 1. Real-world multi-view datasets2

              features  instances  views  classes  IR(a)
AWA8             10940        640      6        8   1
AWA15            10940       1200      6       15   1
BBC              13628       2012      2        5   1.34
BBCSport          6386        544      2        5   3.16
Cal7              3766       1474      6        7   25.74
Cal20             3766       2386      6       20   24.18
IDHcodel          6746         67      5        2   2.94
LowGrade          6746         75      5        2   1.4
LSVT               309        126      4        2   2
Metabolomic        476         94      3        2   1
Mfeat              649        600      6       10   1
NonIDH1           6746         84      5        2   3
NUS-WIDE2          639        442      5        2   1.12
NUS-WIDE3          639        546      5        3   1.43
Progression       6746         84      5        2   1.68

(a) Imbalanced Ratio, i.e. the number of instances from the majority class over the
number of instances from the minority class.

All the methods used in these experiments include the same first stage, i.e.
building the RF classifiers from each view and building then the view-specific RFD
matrices. Therefore, for a fair comparison on each dataset, all the methods use
the exact same RF classifiers, made up with the same 512 trees.2 As for the other
important parameters of the RF learning procedure, the mtry parameter is set to

mq , where mq is the dimension of the q th view, and all the trees are grown to
their maximum depth (i.e. with no pre-pruning).
The methods compared in this experiment differ in the way they combine the
view-specific RFD matrices afterwards. We recall below these differences:

• Avg denotes the baseline method for which the joint dissimilarity representation
is formed by a simple average of the view-specific dissimilarity representations.
• SW3NN and SWKA both denote static weighting methods for determining Q
weights, one per view. The first one derives the weights from the performance
of a 3NN classifier applied on each RFD matrix; the second one uses the KA
method to estimate the relevancy of each RFD matrix in regards to the classi-
fication problem.
• SWOOB is the static weighting method we propose in this work and presented
in Section 4.1; it computes the weights of each view from the OOB error rate
of its RF classifier.
• DCSRFD is the dynamic selection method we propose in this work and pre-
sented in Section 4.2; it computes different combinations of the RFD matrices
for each instance to predict based on its k nearest neighbors, with k = 7 fol-
lowing the recommendation in the literature.20

After each method determine a set of Q weights, the joint RFD matrix is computed.
This matrix is then used as a new training set for a RF classifier learnt with the

same parameters as above (512 trees, mtry = n with n the number of training
instances, fully grown trees).
As for the pre-processing of the datasets, a stratified random splitting procedure
is repeated 10 times, with 50% of the instances for training and 50% for testing.
The mean accuracy, with standard deviations, are computed over the 10 runs and
reported in Table 2. Bold values in this table are the best average performance
obtained on each dataset.
Table 2. Accuracy (mean ± standard deviation) and average ranks

Avg    SW3NN    SWKA    SWOOB    DCSRFD

AWA8 56.22% ± 1.01 56.22% ± 0.99 56.12% ± 1.42 56.59% ± 1.41 57.28% ± 1.49
AWA15 38.23% ± 0.83 38.13% ± 0.87 38.27% ± 1.05 38.23% ± 1.26 38.82% ± 1.56
BBC 95.46% ± 0.65 95.52% ± 0.64 95.36% ± 0.74 95.46% ± 0.60 95.42% ± 0.59
BBCSport 90.18% ± 1.96 90.29% ± 1.83 90.26% ± 1.78 90.26% ± 1.95 90.44% ± 1.89
Cal7 96.03% ± 0.53 96.10% ± 0.57 96.11% ± 0.60 96.10% ± 0.60 94.65% ± 1.09
Cal20 89.76% ± 0.80 89.88% ± 0.82 89.77% ± 0.68 90.00% ± 0.71 89.15% ± 0.97
IDHCodel 76.76% ± 3.59 77.06% ± 3.43 77.35% ± 3.24 76.76% ± 3.82 77.65% ± 3.77
LowGrade 63.95% ± 5.62 62.56% ± 6.10 63.95% ± 3.57 63.95% ± 5.01 65.81% ± 5.31
LSVT 84.29% ± 3.51 84.29% ± 3.65 84.60% ± 3.54 84.76% ± 3.63 84.44% ± 3.87
Metabolomic 69.17% ± 5.80 68.54% ± 5.85 70.00% ± 4.86 70.00% ± 6.12 70.21% ± 4.85
Mfeat 97.53% ± 1.00 97.53% ± 1.09 97.53% ± 1.09 97.57% ± 1.01 97.63% ± 0.99
NonIDH1 80.70% ± 3.76 80.47% ± 3.32 80.00% ± 3.15 80.93% ± 4.00 79.77% ± 2.76
NUS-WIDE2 92.82% ± 1.93 92.86% ± 1.88 92.60% ± 2.12 92.97% ± 1.72 93.30% ± 1.58
NUS-WIDE3 80.32% ± 1.95 79.95% ± 2.40 80.09% ± 2.07 80.14% ± 2.20 80.77% ± 2.06
Progression 65.79% ± 4.71 65.79% ± 4.71 65.79% ± 4.99 66.32% ± 4.37 66.84% ± 5.29
Avg rank 3.67 3.50 3.30 2.40 2.13

5.2. Results and discussion


The first observation one can make from the results gathered in Table 2 is that the
best performance is obtained with one of the two proposed methods on 13 of the
15 datasets. This is confirmed by the average ranks that place these two methods
in the first two positions. To better assess the extent to which these differences are
significant, a pairwise analysis based on the Sign test is computed on the number of
wins, ties and losses between the baseline method Avg and all the other methods.
The result is shown in Figure 3.

Fig. 3. Pairwise comparison between each method and the baseline Avg. The vertical lines are
the level of statistical significance according to the Sign test.

From this statistical test, one can observe that none of the static weighting
methods allows to reach the significance level of wins over the baseline method. It
indicates that the simple average combination, when using dissimilarity representa-
tions for multi-view learning, is a quite strong baseline. It also underlines that all
views are globally relevant for the final classification task. There is no view that is
always irrelevant, for all the predictions.
Figure 3 shows also that the dynamic selection method proposed in this work
is the only method that predominantly improves the accuracy over this baseline,
till reaching the level of statistical significance. From our point of view, it shows
that all the views do not participate in the same extent to the good prediction
of every instance. Some instances are better recognized when the dissimilarities
are computed by relying on some views more than on the others. These views are
certainly not the same ones from one instance to another, and some instances may
need the dissimilarity information from all the views at some point. Nevertheless,
this highlights that the confusion between the classes is not always consistent from
one view to another. In that sense, the views complement each other, and this can
be efficiently exploited for multi-view learning provided that we can identify the
views that are the most reliable for every instance, one by one.

6. Conclusion

Multi-view data are now very common in real world applications. Whether they
arise from multiple sources or from multiple feature extractors, the different views
are supposed to provide a more accurate and complete description of objects than
a single description would do. Our proposal in this work was to address multi-
view classification tasks using dissimilarity strategies, which give an efficient way to
handle the heterogeneity of the multiple views.


The general framework we proposed consists in building an intermediate dis-
similarity representation for each view, and in combining these representations af-
terwards for learning. The key mechanism is to use Random Forest classifiers to
measure the dissimilarities. Random Forests embed a (dis)similarity measure that
takes the class membership into account in such a way that instances from the
same class are similar. The resulting dissimilarity representations can be efficiently
merged since they are fully comparable from one view to another.
Using this framework, our main contribution was to propose a dynamic view
selection method that provides a better way of merging the per-view dissimilarity
representations: a subset of views is selected for each instance to predict, in order
to take the decision on the most relevant views while at the same time ignoring as
much as possible the irrelevant views. This subset of views is potentially different
from one instance to another, because all the views do not contribute to the same
extent to the prediction of each instance. This has been confirmed on several real-
world multi-view datasets, for which the dynamic combination of views has allowed
to obtain much better results than static combination methods.
However, in its current form, the dynamic selection method proposed in this
chapter strongly depends on the number of candidate classifiers in the pool. To allow
for more versatility, it could be interesting to decompose each view into several sub-
views. This could be done, for example, by using Bagging and Random Subspaces
principles before computing the view-specific dissimilarities. In this way, the
dynamic combination could select only specific parts of each view, instead of
considering the views as a whole.

Acknowledgement

This work is part of the DAISI project, co-financed by the European Union with
the European Regional Development Fund (ERDF) and by the Normandy Region.

References

1. X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection net-
work for autonomous driving. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 6526–6534, (2017).
2. H. Cao, S. Bernard, R. Sabourin, and L. Heutte, Random forest dissimilarity based
multi-view learning for radiomics application, Pattern Recognition. 88, 185–197,
(2019).
3. L. Breiman, Random forests, Machine Learning. 45(1), 5–32, (2001).
4. M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, Do we need hundreds
of classifiers to solve real world classification problems?, Journal of Machine Learning
Research. 15, 3133–3181, (2014).
5. C. Englund and A. Verikas, A novel approach to estimate proximity in a random
forest: An exploratory study, Expert Systems with Applications. 39(17), 13046–13050,
(2012).
6. E. Pekalska and R. P. W. Duin, The Dissimilarity Representation for Pattern Recogni-
tion: Foundations And Applications (Machine Perception and Artificial Intelligence).
(World Scientific Publishing Co., Inc., 2005).
7. Y. M. G. Costa, D. Bertolini, A. S. Britto, G. D. C. Cavalcanti, and L. E. S. de Oliveira,
The dissimilarity approach: a review, Artificial Intelligence Review. pp. 1–26, (2019).
8. G. Biau and E. Scornet, A random forest guided tour, TEST. 25, 197–227, (2016).
9. L. Rokach, Decision forest: Twenty years of research, Information Fusion. 27, 111–
125, (2016).
10. A. Verikas, A. Gelzinis, and M. Bacauskiene, Mining data with random forests: A
survey and results of new tests, Pattern Recognition. 44(2), 330 – 349, (2011).
11. H. Cao. Random Forest For Dissimilarity Based Multi-View Learning: Application To
Radiomics. PhD thesis, University of Rouen Normandy, (2019).
12. M. R. Smith, T. Martinez, and C. Giraud-Carrier, An instance level analysis of data
complexity, Machine Learning. 95(2), 225–256, (2014).
13. K. R. Gray, P. Aljabar, R. A. Heckemann, A. Hammers, and D. Rueckert, Random
forest-based similarity measures for multi-modal classification of alzheimer’s disease,
NeuroImage. 65, 167–175, (2013).
14. Y. Li, R. P. Duin, and M. Loog. Combining multi-scale dissimilarities for image classi-
fication. In International Conference on Pattern Recognition (ICPR), pp. 1639–1642.
IEEE, (2012).
15. R. P. Duin and E. Pekalska, The dissimilarity space: Bridging structural and statistical
pattern recognition, Pattern Recognition Letters. 33(7), 826–832, (2012).
16. D. Li and Y. Tian, Survey and experimental study on metric learning methods, Neural
Networks. (2018).
17. N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target
alignment. In Advances in Neural Information Processing Systems (NeurIPS), pp.
367–373, (2002).
18. J. E. Camargo and F. A. González. A multi-class kernel alignment method for im-
age collection summarization. In Proceedings of the 14th Iberoamerican Conference
on Pattern Recognition: Progress in Pattern Recognition, Image Analysis, Computer
Vision, and Applications (CIARP), pp. 545–552. Springer-Verlag, (2009).
19. S. Qiu and T. Lane, A framework for multiple kernel support vector regression and its
applications to sirna efficacy prediction, IEEE/ACM Transactions on Computational
Biology and Bioinformatics (TCBB). 6(2), 190–199, (2009).
20. R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, Dynamic classifier selection: Recent
advances and perspectives, Information Fusion. 41, 195–216, (2018).
21. A. S. Britto Jr, R. Sabourin, and L. E. Oliveira, Dynamic selection of classifiers - a
comprehensive review, Pattern Recognition. 47(11), 3665–3680, (2014).
22. R. G. Soares, A. Santana, A. M. Canuto, and M. C. P. de Souto. Using accuracy
and diversity to select classifiers to build ensembles. In IEEE International Joint
Conference on Neural Network (IJCNN), pp. 1310–1316. IEEE, (2006).
23. M. C. De Souto, R. G. Soares, A. Santana, and A. M. Canuto. Empirical comparison
of dynamic classifier selection methods based on diversity and accuracy for building
ensembles. In IEEE International Joint Conference on Neural Networks (IJCNN), pp.
1480–1487. IEEE, (2008).
24. K. Woods, W. P. Kegelmeyer, and K. Bowyer, Combination of multiple classifiers
using local accuracy estimates, IEEE transactions on pattern analysis and machine
intelligence. 19(4), 405–410, (1997).
CHAPTER 1.8

A REVIEW OF IMAGE COLOURISATION

Bo Li, Yu-Kun Lai and Paul L. Rosin∗


School of Mathematics and Information Science,
Nanchang Hangkong University, Nanchang, China, 330063
School of Computer Science and Informatics, Cardiff University, Cardiff, UK

This chapter reviews the recent development of image colourisation, which aims
at adding colour to a given greyscale image. There are numerous applications
involving image colourisation, such as converting black and white photos or movies
to colour, restoring historic photographs to improve the aesthetics of the image, as
well as colourising many other types of images lacking colour (e.g. medical images,
infrared night time images). According to the source where the colours come
from, the existing methods can be categorised into three classes: colourisation by
reference, colourisation by scribbles and colourisation by deep learning. In this
chapter, we introduce the basic idea and survey the typical algorithms of each
type of method.

1. Introduction

The first monochrome (black and white) photograph was captured in 1839, and
until the mid-20th century the majority of photography remained monochrome. In
order to produce more realistic images, photographers and artists attempted to add
colours to black and white images. From the mid-19th century to the mid-20th
century, monochrome photographs were hand-coloured manually, such as shown
in Fig. 1. However, hand-colouring of photographs requires expert knowledge
and is time consuming. In 1970, the term colourisation was first introduced
by Wilson Markle1 to describe the computer-assisted process for adding colours to
black and white movies. The technique arose from colouring classic black and white
photos and videos, and has since been applied in various fields, such as hyperspectral
image visualisation, designing cartoons, and 3-dimensional data rendering.
Human beings can imagine a plausible colourisation of a greyscale image almost
immediately. For computers, however, finding a reasonable colourisation is far less
direct, since it requires predicting the R, G and B colours of each pixel from the
given intensity alone.
∗ B.Li is with the School of Mathematics and Information Science, Nanchang Hangkong University,
Nanchang, China, and also with the School of Educational Information Technology, Central China
Normal University, Wuhan, China. e-mail: bolimath@gmail.com. Y.-K. Lai and P. L. Rosin are
with the School of Computer Science and Informatics, Cardiff University, Cardiff, UK.

Fig. 1. A hand-coloured print from the same negative, hand-coloured by Stillfried & Andersen
between 1875 and 1885. https://en.wikipedia.org/wiki/Hand-colouring_of_photographs

Mathematically, image colourisation can be formulated as follows. Given a greyscale
image $L \in \mathbb{R}^{m \times n}$, where $m$ and $n$ are the width and height of the image, image
colourisation aims at finding a mapping function $f$ from the intensity image $L$ to its
corresponding colour version $C \in \mathbb{R}^{m \times n \times 3}$,

$$f : L \in \mathbb{R}^{m \times n} \rightarrow C \in \mathbb{R}^{m \times n \times 3} \qquad (1)$$

Image colourisation attempts to extrapolate the data from one dimension to three
dimensions, and is a typical ill-posed problem which does not have a unique so-
lution. In order to reduce the linear dependence between luminance and chromi-
nance, the CIELAB or CIEYUV colour space2 is typically adopted rather than the
RGB colour space.
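
As a concrete illustration of this choice of colour space, the following sketch (assuming scikit-image; the function names are illustrative) separates a colour image into a luminance channel and two chrominance channels, which is the decomposition used throughout this chapter:

```python
import numpy as np
from skimage import color

def split_luminance_chrominance(rgb):
    """Convert an RGB image (floats in [0, 1]) to CIELAB and separate the
    luminance channel L from the chrominance channels (a, b)."""
    lab = color.rgb2lab(rgb)            # shape (m, n, 3)
    return lab[..., 0], lab[..., 1:]    # L in [0, 100]; a, b roughly in [-128, 127]

def merge_luminance_chrominance(L, ab):
    """Recombine a (possibly predicted) chrominance field with the known luminance."""
    lab = np.concatenate([L[..., None], ab], axis=-1)
    return color.lab2rgb(lab)
```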
In order to produce plausible colourisation images, numerous methods have been
studied. According to the source where the colours come from, the existing methods
can be categorised into three classes: colourisation by reference, colourisation by
scribbles and colourisation by deep learning.
Colourisation by reference refers to transferring colour from a colour example
image to the target greyscale image. This type of method is fully automatic. Given
a target greyscale image, the user is only required to provide a colour reference
image which has similar content to the target image, and then the colour will be
transferred from the reference image to the target image automatically. However,
one of the main problems is that the spatial consistency of such colourisation results
is often poor. Colourisation by scribbles aims at propagating the colour scribbles
specified by the user to the whole image automatically. User interaction is required
to produce the colour strokes. Scribble-based methods can produce smooth colour
images via the diffusion process; however, it may produce colour bleeding effects
around boundaries, and the performance is highly dependent on the accuracy and
amount of user interactions. Benefitting from the development of artificial intelli-
gence and neural networks, the colour components can be effectively learned from a
large number of training images, and numerous deep learning based image colouri-
sation methods have been proposed. Despite the powerful learning ability of deep
neural networks, it is difficult to control the deep model to generate user desired
colourful images due to its black-box property.
This chapter is organised as follows. We will introduce the basic idea and re-
view the corresponding typical algorithms of each type of method respectively in
sections 2–5, and finally draw conclusions in section 6.

2. Colourisation by Reference Image

Colourisation by reference image means that, given a target greyscale image and
a colour reference image, the colour will be transferred automatically from the
reference image to the grey image to produce a colourisation result. The basic
pipeline of colourisation by reference image is shown in Fig. 2. Given a colour
reference and greyscale destination pair of images, the first step is feature extraction
for both images. Next, for each pixel in the destination image, the most similar pixel
in the reference image will be found by feature matching, and then the chrominance
information will be transferred to the destination image according to the matching
results to form the initial colourisation image. Finally, a propagation process is
performed to produce a smooth colour image.

Fig. 2. The pipeline of colourisation by reference.

The pioneering work of colourisation by reference image was proposed by Welsh
et al.3 Motivated by colour transfer, the method transfers the colour from the refer-
ence colour image to the target greyscale image based upon independent pixelwise
matches. The algorithm proposed by Welsh et al.3 is composed of three steps.
First, both target image and reference image are converted into the CIELAB colour
space;2 then for each pixel in the target grey image, the best matching pixel in
the reference image is selected according to a similarity measurement based on the
intensity; finally, the colour will be transferred from the reference image to the tar-
get greyscale image according to the matching results. In order to allow more user
interaction and improve the matching accuracy in the colour transfer procedure,
some user-provided swatches are used to limit the feature matching.
The method3 is user friendly and automatic; however, the produced colourisation
results lack spatial consistency since each pixel in the target image is processed in
isolation, and numerous neighbouring pixels with similar intensity can be mismatched
to different colours.
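
The following rough sketch illustrates this kind of independent pixel-wise transfer (names are illustrative; for brevity it matches on luminance only, whereas the original method also uses neighbourhood statistics and optional user swatches):

```python
import numpy as np

def transfer_colour(target_L, ref_L, ref_ab, n_samples=2000, seed=0):
    """For every target pixel, find the closest (in luminance) randomly sampled
    reference pixel and copy its chrominance; returns the predicted ab channels."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, ref_L.size, size=n_samples)   # random reference samples
    samples_L = ref_L.reshape(-1)[idx]
    samples_ab = ref_ab.reshape(-1, 2)[idx]

    # Independent nearest-luminance match for each target pixel.
    best = np.abs(target_L.reshape(-1, 1) - samples_L[None, :]).argmin(axis=1)
    return samples_ab[best].reshape(*target_L.shape, 2)
```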
Instead of relying on a series of independent pixel-level decisions, a new strategy
accounting for the higher-level context of each pixel was proposed by Irony et al.4
The pipeline of the proposed method is shown in Fig. 3. Given a target greyscale
image and a reference colour image, the method first segments the reference colour
image by using a robust supervised classification scheme. Next, each pixel in the
target image is mapped to one segment. As pixel classification can lead to a vast
number of misclassified pixels, a voting post-processing step is conducted to enhance
locality consistency; the pixels with a sufficiently high confidence level are then
provided as colour strokes. Finally, a colour propagation scheme is used to diffuse
the colours from these strokes to the whole image. The work exploits higher level
features which can discriminate between different regions rather than processing
each pixel independently, and guarantees spatial consistency by adopting a voting
process and global diffusion. However, the performance of this method is highly
reliant upon the image segmentation stage.


Fig. 3. Colourisation by example.4

A cascaded feature matching scheme was proposed in Gupta et al.5 In this
work, both the target image and reference image are first segmented into superpix-
els. On one hand, the use of a superpixel representation can reduce computational
complexity. On the other hand, it can also enhance spatial consistency compared
with processing each pixel independently. Instead of using a combination of dif-
ferent kinds of features, a fast cascade feature matching scheme is adopted to find
correspondences between superpixels of the reference and target images. To further
enforce the spatial coherence of these initial colour assignments, an image space
voting framework is used to correct invalid colour assignments.
An automatic feature selection and fusion based image colourisation method
was proposed in Li et al.6 Specifically, image regions can be generally classified
as uniform background or non-uniform textures. Different regions have different
characteristics and hence different features may work more effectively. Based on
the above observation, the distribution of intensity deviation for uniform and non-
uniform regions is learned, and the probability of a given region being assigned a
uniform or non-uniform label is estimated by using Bayesian inference, which is
then used for selecting suitable features. Instead of making individual decisions
locally, a Markov Random Field (MRF) model is adopted to improve the labelling
consistency which can be solved effectively by the graph cut algorithm (Fig. 4).
Fig. 4. Image colourisation by automatic feature selection and fusion.6

In order to enhance locality consistency, an image colourisation method based
on sparse representation learning was proposed in Li et al.7 In this work, the task
of colourisation is reformulated as a dictionary-based sparse reconstruction prob-
lem. Based on the assumption that superpixels with similar spatial location and/or
feature representation are likely to match spatially close regions from the reference
image, a new regularisation term was proposed to enhance the locality consistency.
Although the locality consistent regularisation can improve the matching accuracy,
the initial colourisation images obtained by feature matching often fail to preserve
the edges. In order to improve colour coherence while preserving sharp boundaries,
a new luminance guided joint filter was proposed. The joint filter attempts to en-
sure the edge structure of chrominance images is similar to the luminance channel,
and the optimisation problem can be effectively solved by a typical screened Poisson
equation.
Arbelot et al.8 proposed a new edge-aware image texture descriptor by utilis-
ing spatial coherence around image structures. First, region covariance matrices
are computed to characterise the local texture. Since region covariances only
describe second-order statistics and it is difficult to measure their similarity directly,
the covariance matrices are then transformed into vectors by using Cholesky decom-
position. Then a multiscale gradient descent is conducted to keep the features
from being blurred. Finally, a luminance-guided bilateral filter is utilised to improve
the results while keeping sharp edges.
In the feature matching process of example-based image colourisation, the scale
of features may vary between the reference image and the destination grey image.
An automatic colourisation method based on a location-aware cross-scale matching
algorithm and a simple combination of different scale features was proposed in Li
et al.9 First, the image pyramids are constructed for both the reference image and
the destination image, then a cross-scale texture matching strategy is conducted,
and the final fusion of the matching results is obtained by a global Markov Random
Field optimisation. Since only low-level features are used to find the optimal match-
ing, some semantically unreasonable matching errors will appear. A novel up-down
location-aware semantic detector was proposed to automatically find these match-
ing errors and correct the colourisation results. Finally, a nonlocal $\ell_1$ optimisation
framework along with confidence weighting is used to suppress artefacts caused by
wrong matchings while avoiding over-smoothing edges.
A variational colourisation model was proposed in Bugeau et al.10 In this work,
the optimal feature matching and colour propagation are simultaneously solved by
a variational energy minimisation problem. First, for each pixel in the destination
image, some candidate matching pixels in the reference image are selected by fast
feature matching. Then a variational energy function is designed to choose the best
candidate and produce a smooth colourisation result by minimising the colour vari-
ance in the interior region while keeping the edges as sharp as possible. However,
the total variation regularisation term used in Bugeau et al.10 is only composed of
chrominance channels, which results in obvious halo effects around strong bound-
aries. Pierre et al.11 proposed a new non-convex variational framework based on
total variation defined on both luminance and chrominance channels to reduce the
halo effects. With the regularisation of the luminance image, the method produces
more spatially consistent results whilst preserving image contours. In addition, the
authors prove the convergence of the proposed non-convex model.
Instead of local pixelwise prediction, Charpiat et al.12 tried to solve the prob-
lem by learning multimodal probability distributions of colours, and finally a global
graph cut is used for automatic colour assignment. In this paper, image colourisa-
tion is stated as a global optimisation problem with an explicit energy. First, the
probability of every possible colour at each pixel is predicted, which can be seen
as a conditional probability of colours given the intensity feature. Then a spatial
coherence criterion is learned, and finally a global graph cut algorithm is used to
find the optimal colourisation result. The method operates at the global level, and
is more robust to texture noise and local prediction errors with the help
of graph cuts.
A global histogram regression based colourisation method was proposed in Liu
et al.13 The basic assumption is that the final colourisation image and the reference
image should have similar colour distributions. First, a locally weighted linear re-
gression on the luminance histograms of both source and target images is performed.
Next, zero-points (i.e., local maxima and minima) of the approximated histogram
can be detected and adjusted to match between target and reference images. Then,
the luminance-colour correspondence for the target image can be solved by calcu-
lating the average colour from the source image. Finally, the colourisation result is
achieved by directly mapping this luminance-colour correspondence with the target
image. However, due to the fact that the method does not take the structural in-
formation of the target image into account, it may produce many colour bleeding
effects around boundaries.
In order to remove the influences of illumination, Liu et al.14 proposed an
illumination-independent intrinsic image colourisation algorithm (Fig. 5). First,
both reference images and destination image are decomposed into reflectance
(albedo) components and illumination (shading) components. In order to obtain
robust intrinsic decomposition, multiple reference images containing a similar scene
to the destination image are collected from the Internet. Then the colours from the
reference reflectance image will be transferred to the pixels of the grey destination
image with high confidence in the reference decomposition result. An optimisa-
tion model is conducted to propagate the colour through the whole destination
reflectance image to enhance spatial consistency. Finally, the illumination compo-
nent of the destination image is put back to produce the final colourisation result.

Fig. 5. Intrinsic colourisation.14
By leveraging the rich image content on the internet, a semantic colourisation
method using Internet images was proposed in Chia et al.15 First, the user needs
to provide semantic labels and segmentation clues for the foreground objects in the
destination image, then for each foreground object, numerous similar colour images
are collected from the Internet. In order to find the suitable candidates, an image
filtering algorithm based on the spatial distribution of local and regional features
is used to refine the collected images. Finally, the colours are transferred from the
reference images to the destination image by using a graph-based optimisation algo-
rithm. The method can produce multiple plausible colourisation results, although
user effort is needed to assist in segmentation and label specification.

3. Colourisation by Scribbles

Given a destination greyscale image with some pre-scribbled colour strokes, colouri-
sation by scribbles attempts to propagate the colour from the desired colour strokes
to the whole image automatically based on the assumption that neighbouring pixels
with similar intensity features should have similar colours. The performance of the
colourisation is dependent on the construction of the affinity matrix, and how to
reduce the colour bleeding effects around boundaries is another crucial problem for
the scribble-based colourisation.
The first colourisation model by scribbles was proposed in Levin et al.16 They
assume that neighbouring pixels that have similar intensity features should have
similar colours. Based on the above basic assumption, the colourisation is resolved
by an optimisation process. The algorithm is composed of three steps. First, the
user must paint some colour scribbles in the interior of various regions, such as
shown in Fig. 6. Then an affinity matrix W is constructed, where each element
of the affinity matrix $\omega_{r,s}$ measures the similarity between pixels $r$ and $s$. Finally,
the colours from the scribbles will be propagated to the whole image by minimising
the following quadratic energy function which measures the difference between the
colour $u_r$ at pixel $r$ and the weighted average of the colours at its neighbouring pixels,

$$\min_{u}\ \sum_{r}\Big(u_r - \sum_{s \in N_r}\omega_{r,s}\,u_s\Big)^2, \quad \text{s.t.}\quad u_r = u_{0,r},\ r \in \Omega \qquad (2)$$

where $u$ denotes a chrominance channel, $N_r$ is the set of neighbours of pixel $r$, $u_{0,r}$ is
the scribbled colour, and $\Omega$ denotes the set of user-scribbled pixels. As problem (2) is a
smooth convex optimisation, it can be solved effectively by traditional methods.

Fig. 6. Colourisation using optimisation.16
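
A minimal sketch of how problem (2) can be solved in practice is given below (assuming NumPy/SciPy; the Gaussian affinity width sigma and the function name are illustrative, and the scribble constraints are imposed by fixing the corresponding rows of the linear system):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def colourise_by_optimisation(L, scribble_ab, mask, sigma=0.05):
    """Propagate scribbled chrominance over an intensity image L (values in [0, 1]).
    Each unscribbled pixel is forced to equal the affinity-weighted average of its
    4-neighbours, while scribbled pixels (mask == True) keep their given ab values."""
    h, w = L.shape
    n = h * w
    grid = np.arange(n).reshape(h, w)

    rows, cols, vals = [], [], []
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r = grid[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)].ravel()
        s = grid[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)].ravel()
        wgt = np.exp(-(L.ravel()[r] - L.ravel()[s]) ** 2 / (2 * sigma ** 2))
        rows.append(r); cols.append(s); vals.append(wgt)

    W = sp.csr_matrix((np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
                      shape=(n, n))
    W = sp.diags(1.0 / np.asarray(W.sum(axis=1)).ravel()) @ W   # row-normalised affinities
    A = sp.identity(n, format="csr") - W
    fixed = mask.ravel().astype(float)
    A = sp.diags(1.0 - fixed) @ A + sp.diags(fixed)             # identity rows at scribbles

    out = np.zeros((n, 2))
    for c in range(2):                                          # solve the a and b channels
        b = np.where(mask.ravel(), scribble_ab[..., c].ravel(), 0.0)
        out[:, c] = spsolve(A.tocsc(), b)
    return out.reshape(h, w, 2)
```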
However, the performance of Levin et al.’s method16 is highly dependent on the
accuracy and amount of user scribbles. For images with complex textures, a very
large number of strokes are required to guarantee high quality colourisation results.
In addition, there are obvious colour bleeding effects around boundaries due to the
characteristic of isotropic diffusion defined in (2).
In order to reduce the burden of users, an efficient interactive colourisation
algorithm was proposed in Luan et al.17 Compared with Levin et al.’s method,16
only a small number of colour strokes are required. The algorithm consists of two
stages, colour labelling stage and colour mapping stage. In the first stage, the image
will be segmented into coherent regions according to the intensity similarity to a
small number of user-provided colour strokes. Instead of propagating colours from
strokes to the neighbourhood pixels directly, this method first groups pixels with
similar texture features which should have similar colours. The amount of colour
strokes is reduced dramatically by this strategy. In the colour mapping stage, the
user needs to assign the colour for a few pixels with significant luminance in each
coherent region, and then the colour of the rest of the pixels will be produced by a
simple linear blending by piece-wise linear mapping in the luminance channel.
Xu et al.18 proposed to reduce the colour strokes from a feature space perspec-
tive. The method adaptively determines the impact of each colour stroke in the
feature space composed of spatial location, image structure, and spatial distance.
Each stroke is confined to control a subset of pixels via a Laplacian weighted global
optimisation. Numerous regularisation terms can be incorporated with the global
optimisation to enhance edge-preserving property.
In Ding et al.,19 an automatic scribble generation and colourisation method
was proposed. Instead of assigning colour strokes by users, the authors propose
to generate scribbles automatically by distinguishing the pixels where the spatial
distribution entropy achieves locally extreme values. Then the colourisation will
be conducted by computing quaternion wavelet phases along equal-phase lines, and
a contour strength model is also established in scale space to guide the colour
propagation while preserving the edge structure.
In order to fix the artefacts of colour bleeding around boundaries, an adaptive
edge detection based colourisation algorithm was proposed in Huang et al.20 First
the reliable edge information is extracted from the greyscale image, and then the
similar propagation method to Levin et al.’s method16 is conducted with the as-
sistance of the edge structure. In the work by Anagnostopoulos et al.,21 salient
contours are utilised to alleviate the colour bleeding artefacts caused by weak ob-
ject boundaries. Their method is composed of two stages. In the first stage, the
user-provided scribble image is enhanced with the assistance of salient contours
automatically detected in the destination greyscale image. Meanwhile, the image
will be segmented into homogeneous colour regions of high confidence and critical
attention-needing regions. For pixels in the homogeneous regions, the colour will
be diffused by the model proposed in Levin et al.’s method,16 while for the pix-
els in attention-needing regions, a second edge-preserving diffusion stage will be
performed with the guidance of salient contours.
In order to reduce the complexity of optimisation-based colour propagation,
a fast image colourisation algorithm using chrominance blending was proposed in
Yatziv et al.22 Based on the basic observation that most of the time is spent on the
iterative solution of the optimisation model defined in Levin et al.,16 a non-iterative
method was proposed in this paper. The proposed scheme is based on the concept
of weighted colour blending derived from the geodesic distance between different
pixels computed in the luminance channel. The method is fast and permits the
user to interactively get the desired results promptly after providing a reduced set
of chrominance scribbles.
Scribble-based image colourisation can also be solved by sparse representation
learning.23 First, an over-complete dictionary in chrominance space is trained on
numerous sample colour patches to explore the low-dimensional subspace manifold
structure. Given a greyscale image with a small subset of colour strokes, the image
is first segmented into overlapping patches, and then the sparse coefficients of each
patch on the pretrained dictionary can be learned using a sparse representation
based on the luminance and the given chrominance within the patch. Once the
sparse coefficients are solved, the colour of each patch can be generated by a sparse
linear combination of the colour dictionary. A large dataset which can cover the
variation of target images is required to train the dictionary, and each patch is
processed independently without considering the locality consistency.
Benefitting from the strong theories and tractable computations of matrix re-
covery, Wang et al.24 made the first attempt to reformulate the task of image
colourisation as a matrix completion problem. Each chrominance image can be
seen as a corrupted matrix with reliable values only on the locations of scribbles,
then the task of image colourisation is formulated to complete the chrominance
matrix with a semi-supervised learning method. Based on the basic assumption
that any natural image can be effectively approximated by a low-rank matrix plus a
sparse matrix, a low-rank subspace learning method which can be effectively solved
by the augmented Lagrange multiplier algorithm is utilised to complete the colour
matrix.
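
For intuition, the two shrinkage operators at the heart of such augmented Lagrange multiplier schemes are sketched below (a generic illustration of the low-rank-plus-sparse machinery, not the authors' exact algorithm):

```python
import numpy as np

def svd_shrink(M, tau):
    """Singular-value soft-thresholding: the proximal operator of the nuclear norm,
    used to update the low-rank component."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(M, tau):
    """Element-wise shrinkage: the proximal operator of the l1 norm,
    used to update the sparse component."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)
```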
However, the image matrix cannot be guaranteed to be low-rank for images with complex
textures. In this case, a colourisation algorithm by patch-based local low-rank
matrix completion was proposed in Yao et al.25 Instead of assuming that the whole
image matrix is low-rank, the method first divides the image into small patches, and
assumes that the subspace formed by these patches has a low-rank structure.
Then a local low-rank matrix factorisation algorithm was proposed to complete the
colour image, together with an efficient solver based on the alternating direction
method of multipliers.
An image colourisation method based on colour propagation and low-rank min-
imisation was proposed in Ling et al.26 Given a greyscale image and a few colour
strokes, the paper first propagates the colour from the colour strokes to the neigh-
bourhood pixels according to the Chi-square distance of local texture features, and
meanwhile a confidence map is computed. The initial colourisation result computed
by propagation is not accurate enough, and so a rank optimisation constrained by
the previously computed confidence map was proposed to improve the performance.

4. Colourisation by Deep Learning

Deep learning methods27 have achieved breakthroughs in numerous research fields,
such as image classification,28–30 image segmentation31–33 and speech recogni-
tion,34,35 etc. Deep learning models are good at learning or approximating a non-
linear mapping between different domains by training the parameters on a large
dataset. For the task of image colourisation, the deep learning methods attempt to
learn a mapping from the luminance channel to the chrominance channel. The input
of the network is the luminance channel and the output is the chrominance channel,
which when concatenated with the input luminance image produces the colourised
image. We note that any colour image can be separated into its luminance and
colour components, and in this manner, we can collect as many training samples as
we want in order to train a neural network for image colourisation. Although we
have enough training data, this learning problem is less straightforward than one
may expect.
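
As a minimal illustration of this formulation (a toy sketch, not any specific published architecture; assuming PyTorch), a network can be trained to regress the two chrominance channels from the single luminance channel:

```python
import torch
import torch.nn as nn

class SimpleColourNet(nn.Module):
    """Toy encoder-decoder mapping a 1-channel luminance image to 2-channel chrominance."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1), nn.Tanh(),  # ab scaled to [-1, 1]
        )

    def forward(self, luminance):
        return self.decoder(self.encoder(luminance))

# Training pairs come for free from any colour image: split it into (L, ab) and
# regress the chrominance, e.g. loss = nn.functional.l1_loss(model(L_batch), ab_batch).
```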
Fig. 7. Deep colourisation.36

The first deep learning based image colourisation method was proposed by Cheng
et al.36 In this paper, image colourisation is reformulated as a regression problem
and is solved by a regular fully-connected deep neural network (Fig. 7). Finally, a
joint bilateral filtering is utilised as a post-processing step to reduce the artefacts.
The model utilised in this paper is a three-layer fully connected neural network.
Three levels of features are utilised, including the raw image patch, DAISY features,
and semantic features. Given the features as the input, the output of the network
is the prediction of the corresponding chrominance. 2344 training images are used
to train the network. In addition, the model requires handcrafted features as the
input of the network rather than learning features solely from the input images
themselves.
Instead of relying on hand-crafted features, a fully automatic end-to-end image
colourisation model was proposed by Larsson et al.37 A pretrained VGG network
is utilised to generate features of different scales. For each pixel, a hypercolumn
feature is extracted by concatenating the features at its spatial location in all layers,
which incorporates the semantic information and localisation property. Taking into
account that some objects (such as clothing) may be drawn from many suitable
colours, this paper treats colour prediction as a histogram estimation task rather
than as regression, and a KL-divergence based loss function is designed to measure
the prediction accuracy.
Due to the underlying uncertainty of image colourisation, regression based learn-
ing methods often result in desaturated colourisation. Zhang et al.38 proposed a
novel classification based colourisation network. In order to model the multimodal
nature of image colourisation, the authors attempt to predict a distribution of pos-
sible colours for each pixel rather than a fixed colour. Due to the distribution of
chrominance values in natural images being strongly biased, a class rebalancing pro-
cess is utilised to emphasise rare colours. Finally, vibrant and realistic colourisation
results are produced by taking the “annealed mean” of the distribution. The main
contribution of this work is designing an appropriate objective function that han-
dles the multimodal uncertainty of the colourisation problem and captures a wide
diversity of colours. In addition, this paper proposes a novel framework for testing
colourisation results.
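
A sketch of the annealed mean step is given below (assuming NumPy; prob is the per-pixel distribution over K quantised ab bins and bin_centres their (a, b) values; T = 0.38 is the temperature reported by Zhang et al.38):

```python
import numpy as np

def annealed_mean(prob, bin_centres, T=0.38):
    """Re-sharpen a predicted distribution over quantised ab bins with temperature T,
    then take its expectation to obtain a single (a, b) value per pixel."""
    logits = np.log(np.clip(prob, 1e-8, None)) / T
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    z = np.exp(logits)
    z /= z.sum(axis=-1, keepdims=True)
    return z @ bin_centres                          # shape (..., 2)
```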
A novel end-to-end framework which combines both global priors and local image
features was proposed in Iizuka et al.39 The proposed architecture can extract
local, mid-level and global features jointly from an image, which can then be fused
for predicting the final colourisation. In addition, a global semantic class label is
utilised during the training process to learn more discriminative global features. The
proposed model is composed of four main components: a low-level feature network,
a mid-level feature network, a global feature network and a colourisation network.
First, a 6-layer convolutional neural network is used to learn the low-level features
from the image, and then mid-level and high-level features are learned based on the
shared low-level features. Next, a fusion layer is designed to incorporate the global
features into local mid-level features, and then the fused features are processed by a
set of convolutions and upsampling layers to generate the final colourisation results.
In order to incorporate global semantic priors, a global classification branch is added
to help learn the global context of the image. In addition, the model can directly
transfer the style of an image into the colourisation of another.
Colourisation is an ambiguous problem, with multiple plausible colourisation
results being possible for a single grey-level image. For example, a tree can be
green, yellow, brown or red. However, the above end-to-end deep learning methods
can only produce a single colourisation.
A user-guided deep image colourisation method was proposed in Zhang et al.40
Compared with traditional optimisation-based interactive colourisation methods,16
the proposed deep neural network propagates user edits by fusing low-level cues
along with high-level semantic information, learned from a million images rather
than using hand-defined rules. The proposed network learns how to propagate
sparse user hints by training a deep network to directly predict the mapping from
a greyscale image and randomly generated user colour hints to a full colour image
on a large dataset. In addition, a data-driven colour palette is designed to suggest
colours for each pixel.
Another user-guided deep colourisation method was proposed in He et al.41
In this paper, a reference colour image is used to guide the output of the deep
colour model rather than using user-provided local colour hints as in Zhang et
al.40 It is the first deep learning approach for exemplar-based local colourisation.
The proposed network is composed of two sub-networks, a similarity sub-net and
a colourisation sub-net. The similarity sub-net can be seen as a pre-processing
step for the colourisation sub-net. It computes the semantic similarities between
the reference and the target image by using a pretrained VGG-19 network, and
generates a dense bidirectional mapping function by using a deep image analogy
technique. Then the greyscale target image, the colour reference image and the
learned bidirectional mapping functions are fed into the colourisation sub-net. The
architecture of the colourisation sub-net is a typical multi-task learning framework
consisting of two branches. The first branch is used to measure the chrominance
loss, which encourages the propagated colourisation results to be as close as possible
to the ground truth chrominance. A high-level perceptual loss is introduced in the
second branch, which is used to predict perceptually plausible colours even without
a proper reference. In addition, a novel image retrieval algorithm was proposed to
automatically recommend good references to the user.
Another way to produce diverse colourisation results for a grey-scale target image
is to learn a conditional probability model in a low dimensional embedding of the
colour fields. Given a target grey-scale image G, a conditional probability P (C|G)
for the chrominance field C is learned from a large scale dataset. Then diverse
colourisation results can be generated by drawing samples from the learned model,
$\{C_k\}_{k=1}^{N} \sim P(C|G)$. However, it is difficult to learn such a conditional distribution
in high-dimensional colour spaces.
A variational autoencoder model which aims to learn a low dimensional embed-
ding of colour spaces was proposed in Deshpande et al.42 Instead of learning the
conditional distribution in the original high dimensional colour spaces, the method
attempts to find a low dimensional feature representation of colour fields which is
useful to build such a prediction model. Loss functions consist of specificity, colour-
fulness and gradient terms which are designed to avoid the over-smoothing and
washed out colour effects. Finally, the samples from the learned conditional model
result in diverse colourisation results.
A probabilistic image colourisation model was proposed in Royer et al.43 The
proposed network consists of two sub-nets. The first feed-forward network learns
a low-dimensional embedding encoding information about plausible image colours
based on the high-level features learned from the greyscale image. Then the em-
bedding is fed to an autoregressive PixelCNN network44 to predict a proper dis-
tribution of the image chromaticity conditioned on the greyscale input. In order
to enhance the structural consistency, a conditional random field based variational
auto-encoder formulation was proposed in Messaoud et al.45 The method attempts
to produce diverse colourisation results while taking structural consistency into ac-
count. Moreover, the method also incorporates external constraints from diverse
sources including a user interface.
A pixel recursive colourisation method was proposed in Guadarrama et al.46
Based on the observation that image colourisation is robust to the scale of the
chrominance image, a conditional PixelCNN44 is first trained to generate a low
resolution colour image for a given greyscale image with relatively higher resolution.
Then, a second convolutional neural network is trained to generate a high-resolution
colourisation with the input of the original greyscale image and the low resolution
colour image learned in the first stage.
Image colourisation was solved by a conditional generative adversarial network
in Isola et al.47 Given an input grey-scale image, a colourisation will be generated
conditioned on the input by a generative adversarial network. Instead of designing
an effective loss function manually with convolutional neural network based meth-
ods, a high-level semantic loss function which can make the output indistinguishable
from reality is learned automatically by a generative adversarial network, which is
then used to train the network to learn the mapping from the input image to the
output image. In addition, the proposed architecture can learn a loss that adapts
to the data, which avoids designing different loss functions for specific tasks.
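
The combined objective can be sketched as follows (assuming PyTorch; D and G are user-supplied conditional discriminator and generator networks, names are illustrative, and lam = 100 is the L1 weight used in the pix2pix paper):

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(D, G, grey, colour, lam=100.0):
    """Conditional GAN losses for colourisation: the discriminator judges
    (input, output) pairs, while the generator combines an adversarial term with
    an L1 term that keeps the output close to the ground-truth colour image."""
    fake = G(grey)

    real_logits = D(grey, colour)
    fake_logits = D(grey, fake.detach())            # do not backprop into G here
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) + \
             F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

    adv_logits = D(grey, fake)
    g_loss = F.binary_cross_entropy_with_logits(adv_logits, torch.ones_like(adv_logits)) + \
             lam * F.l1_loss(fake, colour)
    return d_loss, g_loss
```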
The image-to-image translation network has to be trained on aligned image
pairs in Isola et al.,47 however, paired training data is not available for many tasks.
A novel unpaired image-to-image translation framework was proposed in Zhu et
al.48 by using cycle-consistent adversarial networks. The proposed cycle generative
model first learns a mapping from the source domain to the target domain, and then
an inverse mapping is introduced to learn the mapping from the target domain to
the input domain coupled with a cycle consistency loss function.
Some experimental results of deep learning based colourisation methods are
shown in Fig. 8. For popular scenes, such as shown in the first row, most of the
algorithms can produce plausible results. However, when the textures of different
objects look similar, many semantically wrong colours will be generated, such as
shown in the second and the third rows. In addition, the colourisation results gener-
ated by most of the existing methods are not colourful enough yet, such as shown
in the last three rows.

Fig. 8. Colourisation results by deep learning models. From (a) to (e): the ground truth and the
results of Larsson et al.,37 Zhang et al.,38 Iizuka et al.39 and Zhang et al.40

5. Other Related Work

Interactive outline colourisation is a special case of image colourisation. Outline
colourisation aims to generate a colour and shaded image from a black and white
outline, such as shown in Fig. 9. Different from greyscale images, outline images do
not have greyscale information. Therefore, most of the existing image colourisation
methods cannot achieve desirable performance on outline images.

Fig. 9. Outline colourization through tandem adversarial networks.50
One of the first manga colourisation methods was proposed in Qu et al.49 The
proposed method propagates colour strokes with the constraints of both pattern-
continuity and intensity-continuity. The algorithm is composed of two steps: seg-
mentation and colour-filling. In the first step, local and statistical pattern features
are extracted by Gabor wavelet filters, which are then used to guide the bound-
ary segmentation based on a level set propagation. After the segmentation, various
colour propagation techniques can be employed in the second stage for filling colours.
A deep learning based outline colourisation method was proposed in Frans.50
The model is composed of two distinct convolutional networks in tandem. The
first sub-net attempts to predict the colour based only on outlines, and the second
network is adopted to generate shadings conditioned on both outlines and a colour
scheme. Both networks are trained in tandem to produce a final colourisation.
Based on the observation that the background colours of comics are often con-
sistent but random, a consistent comic colourisation with pixel-wise background
classification was proposed in Kang et al.51 A conditional generative adversarial
network based outline colourisation was proposed in Liu et al.52 Given a black-

and-white outline image, an auto-painter module was proposed to automatically
generate plausible colours with adaptive colour control. The Wasserstein distance is
adopted to assist the training of the generative network. A user-guided deep anime
line art colourisation method with conditional adversarial networks was proposed
in Ci et al.53
A two-stage sketch colourisation method was proposed in Zhang et al.54 In
the first stage, a convolutional neural network is trained to determine the colour
composition and predict a coarse colourisation with a rich variety of colours over
the outlines. Then in the second stage, incorrect colour regions are detected and
refined with an additional set of user hints. Each stage is learned independently in
the training phase, and in the test phase they are concatenated to produce the final
colourisation.

6. Conclusion

Image colourisation is an important and difficult research topic in computer vision.
It is a typical ill-posed problem since image colourisation attempts to extrapolate
the data from 1-dimensional greyscale images to 3-dimensional colour images. The
history of the development of image colourisation is introduced in this chapter,
and most of the existing methods are reviewed and discussed. Compared with
traditional reference-based or scribble-based methods, deep learning methods can
leverage large-scale data to learn high level features, and produce robust and mean-
ingful colourisation results. However, due to the black-box property of deep learning
models, the colourisation results produced by deep learning methods are more dif-
ficult to control. Moreover, most of the existing work evaluates the colourisation
results by visual inspection, as no quantitative metric has been developed to mea-
sure the quality of the colourisation results. How to develop a meaningful metric is
an interesting and important research topic for future work.

References

1. G. C. Burns, Museum of broadcast communications: Encyclopedia of television,
http://www.museum.tv/archives/etv/index.html. (1970).
2. T. Smith and J. Guild, The CIE colorimetric standards and their use, Transactions
of the Optical Society. 33(3), 73, (1931).
3. T. Welsh, M. Ashikhmin, and K. Mueller, Transferring color to greyscale images, ACM
Transactions on Graphics. 21(3), 277–280, (2002).
4. R. Irony, D. Cohen-Or, and D. Lischinski. Colorization by example. In Eurographics
Conference on Rendering Techniques, pp. 201–210, (2005).
5. R. K. Gupta, A. Y.-S. Chia, D. Rajan, E. S. Ng, and H. Zhiyong. Image colorization
using similar images. In ACM International Conference on Multimedia, pp. 369–378,
(2012).
6. B. Li, Y.-K. Lai, and P. L. Rosin, Example-based image colorization via automatic
feature selection and fusion, Neurocomputing. 266, 687–698, (2017).
7. B. Li, F. Zhao, Z. Su, X. Liang, Y.-K. Lai, and P. L. Rosin, Example-based image
colorization using locality consistent sparse representation, IEEE Transactions on
Image Processing. 26(11), 516–525, (2017).
8. B. Arbelot, R. Vergne, T. Hurtut, and J. Thollot. Automatic texture guided color
transfer and colorization. In Expressive, (2016).
9. B. Li, Y.-K. Lai, M. John, and P. L. Rosin, Automatic example-based image colouri-
sation using location-aware cross-scale matching, IEEE Transactions on Image Pro-
cessing. (2019).
10. A. Bugeau, V.-T. Ta, and N. Papadakis, Variational exemplar-based image coloriza-
tion, IEEE Transactions on Image Processing. 23(1), 298–307, (2014).
11. F. Pierre, J.-F. Aujol, A. Bugeau, N. Papadakis, and V.-T. Ta, Luminance-
chrominance model for image colorization, SIAM J. Imaging Sciences. 8(1), 536–563,
(2015).
12. G. Charpiat, M. Hofmann, and B. Schölkopf. Automatic image colorization via multi-
modal predictions. In European Conference on Computer Vision, pp. 126–139. (2008).
13. S. Liu and X. Zhang, Automatic grayscale image colorization using histogram regres-
sion, Pattern Recognition Letters. 33(13), 1673–1681, (2012).
14. X. Liu, L. Wan, Y. Qu, T.-T. Wong, S. Lin, C.-S. Leung, and P.-A. Heng, Intrinsic
colorization, ACM Trans. Graph. 27(5), 152, (2008).
15. A. Y.-S. Chia, S. Zhuo, R. K. Gupta, Y.-W. Tai, S.-Y. Cho, P. Tan, and S. Lin,
Semantic colorization with internet images, ACM Trans. Graph. 30(6), 156, (2011).
16. A. Levin, D. Lischinski, and Y. Weiss, Colorization using optimization, ACM Trans-
actions on Graphics. 23(3), 689–694, (2004).
17. Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, and H.-Y. Shum. Natural im-
age colorization. In Eurographics Conference on Rendering Techniques, pp. 309–320.
Eurographics Association, (2007).
18. L. Xu, Q. Yan, and J. Jia, A sparse control model for image and video editing, ACM
Transactions on Graphics (TOG). 32(6), 197, (2013).
19. X. Ding, Y. Xu, L. Deng, and X. Yang, Colorization using quaternion algebra with
automatic scribble generation, Advances in Multimedia Modeling. pp. 103–114, (2012).
20. Y.-C. Huang, Y.-S. Tung, J.-C. Chen, S.-W. Wang, and J.-L. Wu. An adaptive edge
detection based colorization algorithm and its applications. In ACM Multimedia, pp.
351–354, (2005).
21. N. Anagnostopoulos, C. Iakovidou, A. Amanatiadis, Y. Boutalis, and S. A.
Chatzichristofis. Two-staged image colorization based on salient contours. In IEEE
International Conference on Imaging Systems and Techniques, pp. 381–385, (2014).
22. L. Yatziv and G. Sapiro, Fast image and video colorization using chrominance blend-
ing, IEEE Transactions on Image Processing. 15(5), 1120–1129, (2006).
23. J. Pang, O. C. Au, K. Tang, and Y. Guo. Image colorization using sparse representa-
tion. In IEEE International Conference on Acoustics, Speech and Signal Processing,
pp. 1578–1582, (2013).
24. S. Wang and Z. Zhang. Colorization by matrix completion. In AAAI Conference on
Artificial Intelligence, pp. 1169–1175, (2012).
25. Q. Yao and J. T. Kwok. Colorization by patch-based local low-rank matrix completion.
In AAAI Conference on Artificial Intelligence, pp. 1959–1965, (2015).
26. Y. Ling, O. C. Au, J. Pang, J. Zeng, Y. Yuan, and A. Zheng. Image colorization via
color propagation and rank minimization. In IEEE International Conference on Image
Processing, pp. 4228–4232, (2015).
27. Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature. 521(7553), 436, (2015).
28. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale
hierarchical image database. In IEEE Conference on Computer Vision and Pattern
Recognition, pp. 248–255, (2009).
29. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems,
pp. 1097–1105, (2012).
30. M. Tan and Q. V. Le, EfficientNet: rethinking model scaling for convolutional neural
networks, arXiv preprint arXiv:1905.11946. (2019).
31. R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pp.
1440–1448, (2015).
32. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3431–3440, (2015).
33. K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International
Conference on Computer Vision, pp. 2961–2969, (2017).
34. G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Van-
houcke, P. Nguyen, B. Kingsbury, et al., Deep neural networks for acoustic modeling
in speech recognition, IEEE Signal Processing Magazine. 29, (2012).
35. A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with deep recurrent
neural networks. In IEEE International Conference on Acoustics, Speech and Signal
Processing, pp. 6645–6649, (2013).
36. Z. Cheng, Q. Yang, and B. Sheng. Deep colorization. In IEEE International Confer-
ence on Computer Vision, pp. 415–423, (2015).
37. G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic
colorization. In European Conference on Computer Vision, pp. 577–593. Springer,
(2016).
38. R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Confer-
ence on Computer Vision, pp. 649–666. Springer, (2016).
39. S. Iizuka, E. Simo-Serra, and H. Ishikawa, Let there be color!: Joint end-to-end learn-
ing of global and local image priors for automatic image colorization with simultaneous
classification, ACM Transactions on Graphics (TOG). 35(4), 110, (2016).
40. R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, Real-
time user-guided image colorization with learned deep priors, ACM Transactions on
Graphics (TOG). 36(4), 119, (2017).
41. M. He, D. Chen, J. Liao, P. V. Sander, and L. Yuan, Deep exemplar-based colorization,
ACM Transactions on Graphics (TOG). 37(4), 47, (2018).
42. A. Deshpande, J. Lu, M.-C. Yeh, M. Jin Chong, and D. Forsyth. Learning diverse
image colorization. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 6837–6845, (2017).
43. A. Royer, A. Kolesnikov, and C. H. Lampert, Probabilistic image colorization, CoRR.
abs/1705.04258, (2017).
44. A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Condi-
tional image generation with pixelCNN decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, (2016).
45. S. Messaoud, D. Forsyth, and A. G. Schwing. Structural consistency and controllabil-
ity for diverse colorization. In Proceedings of the European Conference on Computer
Vision, pp. 596–612, (2018).
46. S. Guadarrama, R. Dahl, D. Bieber, M. Norouzi, J. Shlens, and K. Murphy. Pixcolor:
Pixel recursive colorization. In BMVC, (2017).
47. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with con-
ditional adversarial networks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1125–1134, (2017).
48. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation us-
ing cycle-consistent adversarial networks. In IEEE International Conference on Com-
puter Vision, pp. 2223–2232, (2017).
49. Y. Qu, T.-T. Wong, and P.-A. Heng. Manga colorization. In ACM Transactions on
Graphics (TOG), vol. 25, pp. 1214–1220, (2006).
50. K. Frans, Outline colorization through tandem adversarial networks, arXiv preprint
arXiv:1704.08834. (2017).
51. S. Kang, J. Choo, and J. Chang. Consistent comic colorization with pixel-wise back-
ground classification. In NIPS’17 Workshop on Machine Learning for Creativity and
Design, (2017).
52. Y. Liu, Z. Qin, T. Wan, and Z. Luo, Auto-painter: Cartoon image generation from
sketch by using conditional wasserstein generative adversarial networks, Neurocom-
puting. 311, 78–87, (2018).
53. Y. Ci, X. Ma, Z. Wang, H. Li, and Z. Luo. User-guided deep anime line art coloriza-
tion with conditional adversarial networks. In 2018 ACM Multimedia Conference on
Multimedia Conference, pp. 1536–1544, (2018).
54. L. Zhang, C. Li, T.-T. Wong, Y. Ji, and C. Liu. Two-stage sketch colorization. In
SIGGRAPH Asia 2018 Technical Papers, p. 261. ACM, (2018).
CHAPTER 1.9

RECENT PROGRESSES ON DEEP LEARNING FOR SPEECH RECOGNITION

Jinyu Li1 and Dong Yu2

1 Microsoft Speech and Language Group
2 Tencent AI Lab
We discuss two important areas in deep learning based automatic speech recog-
nition (ASR) where significant research attention has been given recently: end-
to-end (E2E) modeling and robust ASR. E2E modeling aims at simplifying the
modeling pipeline and reducing the dependency on domain knowledge by intro-
ducing sequence-to-sequence translation models. These models usually optimize
the ASR objectives end-to-end with few assumptions, and can potentially improve
the ASR performance when abundant training data is available. Robustness is
critical to, but is still less than desired in, practical ASR systems. Many new
attempts, such as teacher-student learning, adversarial training, improved speech
separation and enhancement, have been made to improve the systems’ robust-
ness. We summarize the recent progresses in these two areas with a focus on the
successful technologies proposed and the insights behind them. We also discuss
possible research directions.

1. Introduction
Recent advances in automatic speech recognition (ASR) have been mostly due to
the advent of using deep learning algorithms to build hybrid ASR systems with deep
acoustic models like feed-forward deep neural networks (DNNs), convolutional neu-
ral networks (CNNs), and recurrent neural networks (RNNs). The hybrid systems
usually contain an acoustic model which calculates the likelihood of an acoustic
signal given phonemes; a language model which calculates the probability of a word
sequence; and a lexicon model which decomposes words into phonemes. It also
requires a very complicated decoder to generate word hypotheses during runtime.
Among these components, the most important one is the acoustic model, which
generates pseudo likelihood with neural networks. Given their effectiveness and
robustness, hybrid systems are still dominating ASR services in industry.
However, hybrid systems have the limitation that many components in the sys-
tem either require expert knowledge to build or can only be trained separately. To
address this limitation, in the last few years, researchers in ASR have been devel-
oping fully end-to-end (E2E) systems [1–10]. E2E ASR systems directly translate
an input sequence of acoustic features to an output sequence of tokens (characters,

words etc). This reconciles well with the notion that ASR is inherently a sequence-
to-sequence conversion task that maps input waveforms to output token sequences.
We will describe the three most popular E2E systems in detail and discuss practical
issues to be solved in E2E systems in Section 2.
Although significant progress has been made in ASR, it is still challenging for
even well-trained ASR systems to perform well in highly mismatched environments.
Model adaptation with target domain data is one solution but usually requires
labeled data from the target domain to be effective. Teacher-student learning [11,
12] is another model adaptation technique. It has been gaining popularity in the
industry [13–15] since it can exploit large amounts of unlabeled data. Adversarial
learning [16] tackles the problem from a different angle. It aims at generating
models that are less sensitive to factors irrelevant to the task. Newly developed
speech separation and enhancement techniques, on the other hand, significantly
improve the system’s robustness when recognizing overlapping/noisy speech. We
will discuss all these technologies in Section 3.
Finally, we discuss open problems and future directions in Section 4.

2. End-to-End Models
Some widely used contemporary E2E techniques for sequence-to-sequence transduc-
tion are: (a) Connectionist Temporal Classification (CTC) [17, 18], (b) Attention-
based Encoder-Decoder (AED) [3, 19–22], and (c) RNN Transducer (RNN-T)
[23]. These approaches have been successfully applied to large-scale ASR sys-
tems [1–5, 8, 24–26]. Figure 1 illustrates these three popular E2E technologies.
It is worth noting that with the recent success in machine translation, the trans-
former model [27] may become popular for the E2E ASR modeling in the near
future [28–31].

Fig. 1. Flowchart of three popular end-to-end techniques: a) CTC; b) RNN-T; c) AED [26]
2.1. Connectionist Temporal Classification
The Connectionist Temporal Classification (CTC) technique [1, 32, 33] was designed
to map the speech input frames into an output label sequence. As the length
of output labels is smaller than that of input speech frames, the CTC path is
introduced to force the output to have the same length as the input speech frames
by adding blank as an additional label and allowing repetition of labels.
Denote $x$ as the speech input sequence, $y$ as the original label sequence, and $B^{-1}(y)$ as all of the CTC paths mapped from $y$. The CTC loss function, defined as the sum of negative log probabilities of correct labels, is

$L_{CTC} = -\ln P(y|x)$,   (1)

with

$P(y|x) = \sum_{q \in B^{-1}(y)} P(q|x)$,   (2)

where $q$ is a CTC path. With the conditional independence assumption, $P(q|x)$ can be decomposed into a product of posteriors from each frame as

$P(q|x) = \prod_{t=1}^{T} P(q_t|x)$,   (3)

where $T$ is the length of the speech sequence.

Figure 1.(a) shows the flowchart of the CTC model. When calculating the posteriors in Eq. (3), an encoder network is used to convert the acoustic feature $x_t$ into a high-level representation $h_t^{enc}$

$h_t^{enc} = f^{enc}(x_t)$,   (4)

where $t$ is the time index. The final posterior of each output token is obtained after applying the softmax operation on top of the logits vector transformed from $h_t^{enc}$.
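As a minimal illustration only (not taken from any of the systems cited above), the following PyTorch sketch wires an LSTM encoder to the built-in CTC loss; the feature dimension, layer sizes, vocabulary size, and utterance lengths are arbitrary assumptions.

```python
# Minimal CTC acoustic model sketch (illustrative; all sizes are arbitrary assumptions).
import torch
import torch.nn as nn

class CTCEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=320, vocab_size=29):  # vocab includes blank at index 0
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, vocab_size)  # logits over tokens + blank

    def forward(self, x):                 # x: (batch, T, feat_dim)
        h_enc, _ = self.encoder(x)        # Eq. (4): high-level representation per frame
        return self.output(h_enc)         # logits transformed from h_enc

model = CTCEncoder()
ctc_loss = nn.CTCLoss(blank=0)            # sums -log P(y|x) over CTC paths internally

x = torch.randn(4, 200, 80)               # 4 utterances, 200 frames each
logits = model(x).log_softmax(dim=-1)     # CTCLoss expects log-probabilities
targets = torch.randint(1, 29, (4, 30))   # label sequences (shorter than T)
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

# nn.CTCLoss expects (T, batch, vocab) log-probabilities
loss = ctc_loss(logits.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```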
Compared to the traditional cross-entropy training in the hybrid system, CTC
is harder to train without proper initialization. In [1], the long short-term mem-
ory (LSTM) network in the CTC system was initialized from the LSTM network
trained with the cross-entropy criterion. This initialization step can be circum-
vented by using very large amounts of training data which also helps to prevent
overfitting [32]. However, even with a large training set, the randomly initialized
CTC model tends to be difficult to train when presented with difficult samples.
In [34], a learning strategy called SortaGrad was proposed. With this strategy, the
system first presents the CTC network with short utterances (easy samples) and
then presents it with longer utterances (harder samples) in the early training stage.
In the later epochs, the training utterances are fed to the CTC network completely
randomly. This strategy significantly improves the convergence of CTC training.
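A SortaGrad-style curriculum can be sketched as below; this is an illustration only, under the assumption that each training example exposes its frame length, and the function name epoch_order is hypothetical.

```python
# Illustrative SortaGrad-style curriculum (assumed data format: list of (features, labels) pairs).
import random

def epoch_order(dataset, epoch):
    """Return training indices: shortest utterances first in epoch 0, fully shuffled afterwards."""
    indices = list(range(len(dataset)))
    if epoch == 0:
        # easy (short) utterances are presented first in the first epoch
        indices.sort(key=lambda i: len(dataset[i][0]))
    else:
        random.shuffle(indices)
    return indices

# usage sketch:
# for epoch in range(num_epochs):
#     for i in epoch_order(train_set, epoch):
#         features, labels = train_set[i]
#         ...train on this utterance (or form minibatches from consecutive indices)...
```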
Inspired by the CTC model, Povey et al. proposed the lattice-free maximum
mutual information (LFMMI) [35] strategy to train deep networks directly from
random initialization. This single-step training procedure has great advantage over
the popular two-step strategy, which trains the model first with the cross-entropy
criterion and then with the sequence discriminative criterion. To build a reliable
LFMMI training pipeline, Povey et al. developed many tricks, including a phoneme
HMM topology in which the first frame of a phoneme has a different label than
the rest frames, a phoneme n-gram language model used to create denominator
graph, a time-constraint that is similar to the delay-constrain in CTC [33], several
regularization methods to reduce overfitting, and framing stacking. LFMMI has
been proven effective on tasks with different scale and underlying models. The
detailed LFMMI training procedure can be found in [7].
The conditional independence assumption is the most criticized aspect of CTC. Several
attempts have been made to relax or remove this assumption. In [36,
37], attention modeling was directly integrated into the CTC framework by using
time convolution features, non-uniform attention, implicit language modeling, and
component attention. Such attention-based CTC model relaxes the conditional
independence assumption by working on the hidden layers. It does not change the
CTC objective function and training process, and hence enjoys the simplicity of
CTC modeling.

2.2. RNN Transducer
On the other hand, RNN transducer (RNN-T) [23] and RNN aligner (RNN-A) [6]
extend CTC modeling by changing the objective function and the training process
to remove the conditional independence assumption of CTC. RNN transducer, il-
lustrated in Figure 1.(b), contains an encoder network, a prediction network, and a
joint network. It was demonstrated to be effective [5, 25].
The encoder network is the same as that in CTC and is analogous to the acoustic
model in the hybrid system. The prediction network is essentially an RNN language
model. It produces a high-level representation

$h_u^{pre} = f^{pre}(y_{u-1})$,   (5)

by conditioning on the previous non-blank target $y_{u-1}$ predicted by the RNN-T model, where $u$ is the output label index. The joint network is a feed-forward network that combines the output $h_t^{enc}$ from the encoder network and the output $h_u^{pre}$ from the prediction network as

$z_{t,u} = f^{joint}(h_t^{enc}, h_u^{pre})$   (6)
$\;\;\;\;\; = \psi(U h_t^{enc} + V h_u^{pre} + b_z)$,   (7)

where $U$ and $V$ are weight matrices, $b_z$ is a bias vector, and $\psi$ is a non-linear function, e.g. Tanh or ReLU. $z_{t,u}$ is connected to the output layer with a linear transform

$h_{t,u} = W_y z_{t,u} + b_y$.   (8)
The final posterior of each output token $k$ is

$P(y_u = k \mid x_{1:t}, y_{1:u-1}) = \mathrm{softmax}(h_{t,u}^k)$.   (9)
The loss function in RNN-T is the negative log posterior of output label sequence
y given input acoustic feature x, which is calculated based on the forward-backward
algorithm described in [23].
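For illustration, a sketch of the prediction and joint networks of Eqs. (5)-(9) might look as follows; the dimensions, module names, and vocabulary size are assumptions rather than the configuration used in the cited systems.

```python
# Sketch of the RNN-T prediction and joint networks (Eqs. (5)-(9)); sizes are arbitrary assumptions.
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    def __init__(self, enc_dim=320, pred_dim=320, joint_dim=512, vocab_size=29):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, pred_dim)
        self.prediction = nn.LSTM(pred_dim, pred_dim, batch_first=True)  # f^pre, an RNN LM over labels
        self.U = nn.Linear(enc_dim, joint_dim, bias=False)               # U h_t^enc
        self.V = nn.Linear(pred_dim, joint_dim)                          # V h_u^pre + b_z
        self.out = nn.Linear(joint_dim, vocab_size)                      # W_y z_{t,u} + b_y

    def forward(self, h_enc, y_prev):
        # h_enc: (B, T, enc_dim); y_prev: (B, U) previous non-blank labels
        h_pre, _ = self.prediction(self.embed(y_prev))                   # Eq. (5)
        # broadcast onto the (T, U) grid of alignments
        z = torch.tanh(self.U(h_enc).unsqueeze(2) + self.V(h_pre).unsqueeze(1))  # Eqs. (6)-(7)
        return self.out(z).log_softmax(dim=-1)                           # Eqs. (8)-(9), a (B, T, U, vocab) tensor

# The resulting three-dimensional (T x U x vocab) grid per utterance is what makes the
# RNN-T forward-backward computation memory hungry, as discussed in the text above.
```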
Recently, RNN-T was successfully deployed on Google’s mobile devices [26]. In
spite of its recent success in industry, there is much less research on RNN-T than
on AED and CTC, possibly due to the complexity of RNN-T training [38]. For ex-
ample, the encoder and prediction networks compose a grid of alignments, and the
posteriors need to be calculated at each point in the grid to perform the forward-
backward training of RNN-T. This forms a three-dimensional tensor that requires much
more memory than what is needed in the training of other E2E models. In [38], the
forward-backward recursion used in RNN-T training was formulated in a matrix
form. With the loop skewing transformation, the forward and backward probabili-
ties can be vectorized and the recursions can be computed in a single loop instead
of two nested loops. Such an implementation significantly improves the training
efficiency of RNN-T. An additional training improvement was proposed in [39] to
reduce the training memory cost so that larger minibatches can be used to improve
the training efficiency. Note that in addition to the training speed improvement, it
is also important to boost the accuracy of RNN-T with advanced structures for the
encoder network, which was explored in [39] by 1) decoupling the target classifica-
tion task and temporal modeling task with separate modeling units [40, 41]; and 2)
exploring future context frames to generate more informative encoder output [42].

2.3. Attention-based Encoder-Decoder
The attention-based encoder-decoder (AED) (or LAS: Listen, Attend and Spell
[3, 8]) model is another type of E2E model [3, 21]. It has its roots in the successful
attention model in machine learning [20, 43] which extends the encoder-decoder
framework [19] using an attention decoder. The attention model calculates the
probability as

$P(y|x) = \prod_u P(y_u \mid x, y_{1:u-1})$,   (10)

with

$P(y_u \mid x, y_{1:u-1}) = \mathrm{AttentionDecoder}(h^{enc}, y_{1:u-1})$.   (11)

Again, the training objective is to minimize $-\ln P(y|x)$.
The flowchart of the attention-based model is shown in Figure 1.(c). Here, the encoder transforms the whole speech input sequence $x$ into a high-level hidden vector sequence $h^{enc} = (h_1^{enc}, h_2^{enc}, \ldots, h_L^{enc})$, where $L \le T$. At each step in generating an output label $y_u$, an attention mechanism selects/weighs the hidden vector sequence $h^{enc}$ so that the most related hidden vectors are used for the prediction. Comparing
Eq. (10) to Eq. (2), we can see that the attention-based model doesn’t make the
conditional independence assumption made by CTC.
The decoder network in AED has three components: a multinomial distribution
generator (13), an RNN decoder (14), and an attention network (15)-(20) as follows:
$y_u = \mathrm{Generate}(y_{u-1}, s_u, c_u)$,   (13)

$s_u = \mathrm{Recurrent}(s_{u-1}, y_{u-1}, c_u)$,   (14)

$c_u = \mathrm{Annotate}(\alpha_u, h^{enc}) = \sum_{t=1}^{T} \alpha_{u,t} h_t^{enc}$,   (15)

$\alpha_u = \mathrm{Attend}(s_{u-1}, \alpha_{u-1}, h^{enc})$.   (16)

Generate(.) is a feedforward network with a softmax operation generating the probability of the target output $p(y_u|y_{u-1}, s_u, c_u)$. Recurrent(.) is an RNN decoder operating on the output time axis indexed by $u$ and has hidden state $s_u$. Annotate(.) computes the context vector $c_u$ (also called the soft alignment) using the attention probability vector $\alpha_u$. Attend(.) computes the attention weight $\alpha_{u,t}$ using a single-layer feedforward network as

$e_{u,t} = \mathrm{Score}(s_{u-1}, \alpha_{u-1}, h_t^{enc})$,   (17)

$\alpha_{u,t} = \exp(e_{u,t}) / \sum_{t'=1}^{T} \exp(e_{u,t'})$.   (18)

Score(.) can be either content-based or hybrid. It is computed using

$e_{u,t} = v^T \tanh(U s_{u-1} + W h_t^{enc} + b)$   (content)
$e_{u,t} = v^T \tanh(U s_{u-1} + W h_t^{enc} + v f_{u,t} + b)$   (hybrid),   (19)

where $f_{u,t} = F * \alpha_{u-1}$.   (20)

The operation $*$ denotes convolution. $U$, $W$, $v$, $F$, $b$ are trainable attention parameters.
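A content-based attention step corresponding to Eqs. (15) and (17)-(18) can be sketched as below; the dimensions and the class name are illustrative assumptions.

```python
# Sketch of a content-based attention step (Eqs. (15), (17)-(18)); shapes/names are assumptions.
import torch
import torch.nn as nn

class ContentAttention(nn.Module):
    def __init__(self, dec_dim=320, enc_dim=320, att_dim=128):
        super().__init__()
        self.U = nn.Linear(dec_dim, att_dim, bias=False)
        self.W = nn.Linear(enc_dim, att_dim, bias=True)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_prev, h_enc):
        # s_prev: (B, dec_dim) decoder state s_{u-1}; h_enc: (B, L, enc_dim)
        e = self.v(torch.tanh(self.U(s_prev).unsqueeze(1) + self.W(h_enc))).squeeze(-1)  # Eq. (17), content score
        alpha = torch.softmax(e, dim=-1)                      # Eq. (18): attention weights over encoder frames
        c = torch.bmm(alpha.unsqueeze(1), h_enc).squeeze(1)   # Eq. (15): context vector c_u
        return c, alpha
```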
Many tricks have been developed to efficiently train a well performing AED
model. For example, the vanilla AED model is very expensive to train since the
hidden vectors at all time steps are used in Eq. (11). To reduce the number
of candidates used in the attention decoder, a windowing method was introduced
in [21]. In [3], a pyramid structure is used in the encoder network which cuts the
number of frames by half after each layer. Recently, Google proposed a series of
improvements on AED [8]:
• using word piece as the modeling unit [44] which balances generalization
and language model quality;
• using scheduled sampling [45] which feeds the predicted label from the
previous time step (instead of the ground truth label) during training to
make training and testing consistent;
• incorporating multi-head attention [27] in which each head can generate a
different attention distribution and provide different supporting evidence;
• applying label smoothing [46] to prevent the model from making over-
confident predictions (an illustrative sketch follows this list);
• integrating an external language model (LM) which was trained with more
text data [47];
• using minimum word error rate sequence discriminative training [48].
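As a small illustration of the label smoothing item above (not the exact recipe of [8] or [46]), the following sketch mixes the one-hot target with a uniform distribution; the smoothing weight of 0.1 is an arbitrary assumption.

```python
# Illustrative label-smoothed cross entropy, used to discourage over-confident predictions.
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, targets, smoothing=0.1):
    """logits: (N, V); targets: (N,). Mixes the one-hot target with a uniform distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)   # cross entropy against the uniform distribution
    return ((1.0 - smoothing) * nll + smoothing * uniform).mean()
```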
In [49], the AED model is optimized together with a CTC model in a multi-
task learning framework by sharing the encoder. Such a training strategy greatly
improves the convergence of the attention-based model and mitigates the alignment
issue. In [50], the system’s performance is further improved by combining the scores
from both the AED model and the CTC model during decoding.
Streaming is critical in speech recognition services in industry. However, in
the vanilla AED model the attention is designed to be applied to the whole input
utterance to achieve good performance. This introduces significant latency and
prevents it from being used in the streaming mode. Attempts have been made to
support streaming in AED models. The basic idea of these methods is to apply
AED on chunks of input audios. The difference between these attempts is the way
the chunks are determined and used for attention. In [51], monotonic chunkwise
attention (MoChA) was proposed to stream the attention by splitting the encoder
outputs into small fixed-size chunks so that the soft attention is only applied to
those small chunks instead of the whole utterance. It was later improved with
adaptive-size chunks in [52]. In [53], CTC segments are used to decide the chunks,
which are used to trigger the attention. In [54], a Continuous Integrate-and-Fire
strategy is used to simulate the integrate-and-fire behavior of neurons so that the
boundary of attention can be determined.

2.4. Practical Issues
2.4.1. Tokenization
As the goal of ASR is to generate a word sequence from the speech waveform, words
are the most natural output units for network modeling. However, a big challenge in
the word-based E2E model is the out-of-vocabulary (OOV) issue. In [24, 32, 55, 56],
only the most frequent words in the training set were used as targets whereas the
remaining words were just tagged as OOVs. All these OOV words can neither be
further modeled nor be recognized during evaluation. To solve this OOV issue in
the word-based CTC, a hybrid CTC was proposed [57] to use the output from the
word-based CTC as the primary ASR result and to consult a character-based CTC at
the segment level where the word-based CTC emits an OOV token. In [58], a spell
and recognize model was used to learn to first spell a word and then recognize it.
Whenever an OOV is detected, the decoder consults the character sequence from the
speller. However, neither method could improve the overall recognition accuracy
significantly due to the two-stage (OOV-detection and then character-sequence-
consulting) process. In [10, 37], a better solution was proposed by decomposing
the OOV word into a mixed-unit sequence of frequent words and characters at the
training stage. In [59], mixed-units are further extended to phrase units which
contain n-gram words. All these operations are done in a top-down fashion.
In contrast, in [60], word-piece, a subword unit generated in a bottom-up fash-
ion, was proposed as the modeling unit. Inspired by the byte-pair encoding (BPE)
algorithm, which is used for data compression, the authors described a simple algo-
rithm that starts from characters and iteratively collapses tokens to form subwords.
This vocabulary of sub-words can be used to segment words in a deterministic fash-
ion. Word-pieces outperformed characters as modeling units in E2E models [8].
As reported in [59], E2E models with word-piece units and mixed-units showed sim-
ilar performance. While the BPE method simply relies on the character sequence
frequencies, a pronunciation-assisted sub-word formation strategy [61] was proposed
to improve the tokenization by leveraging the word pronunciation.
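For illustration, a toy BPE-style merge loop in the spirit of [60] is sketched below; real systems rely on dedicated toolkits, and the example vocabulary and merge count are arbitrary assumptions.

```python
# Toy byte-pair-encoding-style subword learning (illustrative only): start from characters
# and iteratively collapse the most frequent adjacent pair, as described above.
from collections import Counter

def learn_bpe(word_counts, num_merges=10):
    # represent each word as a tuple of symbols, starting from characters
    vocab = {tuple(word) + ("</w>",): c for word, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():      # collapse every occurrence of the best pair
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

print(learn_bpe({"lower": 5, "low": 7, "newest": 9, "widest": 3}, num_merges=5))
```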
2.4.2. Language Model Integration
E2E model training requires paired speech and text data. In contrast, the separated
LM training can take advantage of a much larger text corpus than the one in the E2E training
set. It is desirable for the E2E model to take advantage of large amounts of text-
only data as well. It is popular to fuse an external LM trained with large amounts
of text data in the E2E model. There are three popular approaches:
• Shallow fusion [62]: The external LM and E2E model are trained sepa-
rately. The external LM is interpolated log-linearly with the E2E model at
inference time only.
• Deep fusion [62]: The external LM and E2E model are trained separately.
Then the external LM is integrated into the E2E model by fusing the ex-
ternal neural LM’s hidden states and the E2E decoder score.
• Cold fusion [63]: The E2E model is trained from scratch by integrating
with a pre-trained external LM.

In [64], a detailed comparison of these three fusion methods is reported. It
is shown that shallow fusion is the simplest yet effective method for integrating an
external LM in first-pass decoding, while cold fusion produces a lower oracle
error rate and is beneficial for rescoring in the second pass.
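As an illustration of shallow fusion, the log-linear combination applied during beam search can be sketched as follows; e2e_log_prob, lm_log_prob, and the interpolation weight are hypothetical placeholders, not a specific library API.

```python
# Sketch of shallow fusion during beam-search scoring: the E2E score and an external
# LM score are combined log-linearly at inference time only. lm_weight is a tunable
# assumption; e2e_log_prob and lm_log_prob stand for any models exposing per-token scores.
def shallow_fusion_score(e2e_log_prob, lm_log_prob, token, prefix, lm_weight=0.3):
    """Return the fused log score used to rank a candidate token for a given prefix."""
    return e2e_log_prob(token, prefix) + lm_weight * lm_log_prob(token, prefix)

# inside one beam-search step, candidates would be ranked by:
# score(prefix + [token]) = score(prefix) + shallow_fusion_score(e2e, lm, token, prefix)
```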
In addition to LM integration, there are other ways to leverage large amounts
of unpaired text to improve E2E models. A straightforward way is to generate
synthetic data using text-to-speech (TTS) techniques on unpaired text data and
augment the original paired training data with such synthetic data [65]. However,
the quality of the synthetic data relies heavily on the quality of the TTS system that
generates them. To date, it is still challenging to synthesize speech data covering a large
number of speakers in different environments. Spelling correction methods [66, 67]
were shown to be more effective by using TTS data to train a separate translation
model which is used to correct the errors made by E2E models. Because TTS data
is only used to train the spelling correction model without changing E2E models,
it is a better way to circumvent the quality limitation of TTS.

2.4.3. Context Modeling
Hybrid systems usually are equipped with an on-the-fly rescoring strategy that
dynamically adjusts the LM weights of a small number of n-grams which are relevant
to the particular recognition context, such as contacts, locations, and play lists.
Such context modeling significantly boosts the ASR accuracy for those specific
scenarios. It is thus desirable for E2E models to also support context modeling. One
solution is to add a context bias encoder in addition to the original audio encoder
into the E2E model [68]. The recognition performance on rare words in the context
can be further improved by adding a phoneme encoder for the words in the context
[69]. However, as shown in [68], it becomes challenging for the bias attention module
to focus if the biasing list is too large. Hence, a more practical way is to do shallow
fusion with the contextual biasing LM [70].
In dialog scenarios, which contain multiple related utterances, the context from
previous utterances can help the recognition of the current utterance. In [71], the
state of the previous utterance is used as the initial state of the current utterance.
In [72], a text encoder is used to embed the decoding hypothesis from the previous
utterance as additional input to the decoder of the current utterance. In [73], a
more explicit model was proposed for the two-party conversation scenario by using
a speaker-specific cross-attention mechanism that can look at the output of both
speakers to better recognize long conversations.

3. Robustness
While E2E modeling has advanced the general ASR technology, robustness contin-
ues to be a critical problem in ASR to enable natural interaction between human
and machine. Current state-of-the-art systems can achieve remarkable recognition
accuracy when the test and training conditions match, especially when both are un-
der a quiet close-talk setup. However, the performance dramatically degrades under
mismatched or complicated environments such as high noise conditions, including
music or interfering talkers, or speech with strong accents [74, 75]. The solutions
to this problem include adaptation, speech enhancement, and robust modeling.

3.1. Knowledge Transfer with Teacher-Student Learning
The most straightforward way to improve the recognition accuracy in a new domain
is to collect and label data in the new domain and fine-tune the model trained in the
source domain with the newly labeled data. Many adaptation techniques have been
proposed recently. A detailed review of these techniques can be found in [76]. While
the conventional adaptation techniques require large amounts of labeled data in
the target domain, the teacher-student (T/S) paradigm can better take advantage
of large amounts of unlabeled data and has been widely used in industrial scale
tasks [13–15].
The concept of T/S learning was originally introduced in [77] but became popu-
lar after it was used to learn a shallow network from a deep network by minimizing
the L2 distance of logits between the teacher and student networks [78]. In T/S
learning, the network of interest, the student, is trained to mimic the behavior of
a well-trained network, the teacher. There are two popular applications of T/S in
speech recognition: model compression, which aims to train a small network that
performs similarly to a large network [11, 12], and domain adaptation, which strives
at improving a model’s performance on a target-domain by learning the behavior
of a model trained on a source-domain [13, 79].
The most popular T/S learning strategy for ASR was proposed in 2014 [11].
In this work, Li et al. proposed to minimize the Kullback-Leibler (KL) divergence
between the output posterior distributions of the teacher network and the student
network. Hinton et al. [12] later suggested an interpolated version which uses a
weighted sum of the soft posteriors and the one-hot hard label to train the student
model. Their method is known as knowledge distillation, but essentially is the same
as T/S learning which transfers the teacher’s knowledge to the student. In addition
to learning from pure soft labels [11] and from interpolated labels [12], conditional
T/S learning [80] was recently proposed to selectively learn from either the soft
label or the hard label conditioned on whether the teacher can correctly predict the
hard label.
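A frame-level T/S objective in the spirit of [11] and [12] can be sketched as below; the tensor shapes and the interpolation weight are illustrative assumptions, not the exact recipes of those papers.

```python
# Sketch of frame-level teacher-student learning: minimize the KL divergence between the
# teacher's and student's output posteriors, optionally interpolated with the hard label.
import torch
import torch.nn.functional as F

def ts_loss(student_logits, teacher_logits, hard_labels=None, hard_weight=0.0):
    # KL(teacher || student), averaged over frames
    teacher_post = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(student_logp, teacher_post, reduction="batchmean")
    if hard_labels is not None and hard_weight > 0.0:
        ce = F.cross_entropy(student_logits, hard_labels)   # one-hot hard-label term
        return (1.0 - hard_weight) * kl + hard_weight * ce
    return kl
```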
The aforementioned T/S learning exploits the frame-level similarity between the
teacher and student networks. Although it was successful in many applications, it
may not be the best solution for ASR because ASR is a sequence translation problem
while the frame-based posteriors from the teacher network may not fully capture the
sequential nature of speech data. In [81], Wong and Gales proposed sequence-level
T/S learning which optimizes the student network by minimizing the KL-divergence
between lattice arc sequence posteriors from teacher and student networks. The
teacher can be an ensemble network which combines the sequence posteriors from
all the experts so that the student network can approximate the performance of the
powerful ensemble network instead of individual experts. This was further improved
in [82] where different sets of state clusters are used to compute the sequence T/S
criterion between teacher and student models. A similar work [83, 84] was conducted
on the LFMMI models.
Given the success of T/S learning, a natural question to ask is why it is superior
to the standard training with hard labels. We conjecture the following advantages:
• T/S learning with pure soft labels, no matter at frame [11] or sequence [82]
level, can leverage large amounts of unlabeled data. This is particularly
useful in an industrial setup where there are, in theory, unlimited unlabeled
data. In [13], Microsoft developed a far-field smart-speaker system which
utilizes large amounts of unlabeled data. Later, Amazon reported a similar
work which exploits up to 1 million hours of unlabeled data [14, 15]. It is
significantly more expensive to collect 1 million hours of labeled data.
• As indicated in [12], soft labels produced by the teacher carry knowledge
learned by the teacher on the difficulty of classifying each frame/sequence,
while the hard labels do not contain such information. The knowledge
contained in soft labels can help the student to put effort wisely during
training and achieve better performance after converging.
• If the student is well trained, it can approximate and approach the behavior
of the teacher. A good practice is to build a strong giant teacher by fusing
multiple models into one. In many cases, this strategy can help to train a
student that outperforms the same-structure model directly trained on hard
labels [82]. The fusion of posteriors of individual networks in the teacher
can be as simple as linear combination or combination with attention [85].
Alternatively, we may train the student with multiple teachers [86] without
an explicit fusion process.
• When the amount of training data is small, the student learns to generalize better
with soft labels [87].
Note that although the majority of T/S learning works were conducted on hybrid
models, T/S learning can easily be applied to E2E models [88–90].

3.2. Adversarial Learning
While model adaptation adapts the source model so that the system performs bet-
ter in the target domain, it is more desirable if the model, trained once, performs
robustly under various conditions and domains. Adversarial training [16] aims to
achieve this goal by building robust models during the training stage without the
need of target domain data. The original idea of adversarial learning was proposed
in the generative adversarial network by Goodfellow et al. [16] for data generation
where a generator network captures the data distribution and an auxiliary discrim-
inator network estimates the probability that a sample comes from the real data.
Later, adversarial learning was applied to unsupervised domain adaptation by gen-
erating a deep feature that is both discriminative for the main task in the source
domain and invariant to the shifts between source and target domains [91]. A gra-
dient reversal layer network was proposed to facilitate the adversarial learning. A
similar idea was then applied to domain [92–94] and speaker adaptation [95] of the
acoustic model.
Due to the inherent inter-domain variability in the speech signal, a multi-
conditional model shows high variance in its hidden and output unit distributions.
Adversarial learning effectively improves the noise robustness [96–99], reduces the
inter-speaker [99–103], inter-language [104–106] and inter-dialect [107] variability in
the acoustic models by learning a domain-invariant deep feature at the intermediate
layer of the network while performing token classification as the primary task. The
basic idea behind adversarial training is to train a model so that its hidden represen-
tation has strong discrimination ability for the main task (e.g., ASR) but contains
no or little information to identify irrelevant input conditions such as speaker and
noise level. It typically contains three components: the encoder, the recognizer, and
the domain discriminator. The encoder generates the hidden intermediate represen-
tation, which will be used by the recognizer to generate posteriors of ASR modeling
units (e.g., phoneme), and by the domain discriminator to predict the domain label
(e.g., speaker or noise level). The training objective is to maximize ASR accuracy
while minimizing the domain classification ability.
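For illustration, a gradient reversal layer in the spirit of [91] can be sketched as follows; the scaling factor and the usage pattern are assumptions.

```python
# Sketch of a gradient reversal layer for adversarial (domain-invariant) training:
# the forward pass is the identity, while the backward pass flips the sign of the
# gradient flowing from the domain discriminator into the encoder.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale=1.0):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.scale * grad_output, None

# usage sketch: the ASR branch sees the encoder output directly, while the domain
# classifier sees GradReverse.apply(encoder_output); minimizing the domain loss then
# pushes the shared encoder toward domain-invariant representations.
```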
Adversarial learning does not require any knowledge about the target domain.
However, adversarial learning can be more effective if the target domain data is
available during training.

3.3. Speech Separation
ASR systems’ performance can be dramatically degraded when multiple speakers
speak simultaneously. Unfortunately, such a condition, often called the cocktail party
problem, is widely observed in real-world situations and severely affects users’ ex-
periences. Much work has been conducted recently to attack this problem.
While the actual implementation or techniques used may be different, the at-
tempted solutions usually contain an explicit or implicit step of source separation.
Given the observed mixture signal mixed through an unspecified mixing process,
the objective of speech separation is to invert the unknown mixing process and es-
timate the individual source signals. Note that, under some conditions the mixing
mapping may be non-invertible.
Recently, many deep learning based techniques have been developed to solve
this problem, firstly under the monaural setting, and later under the multi-channel
setting. The core idea of these new techniques is converting the speech separation
problem into a supervised learning problem in which the optimization objective is
closely related to the separation task. These new techniques dramatically outper-
formed conventional approaches, such as minimum mean square error [108] suppres-
sor, computational auditory scene analysis (CASA) [109], and non-negative matrix
factorization (NMF) [110]. The performance improvements are particularly impres-
sive with the very recent techniques such as deep clustering (DPCL) [111, 112],
deep attractor network (DANet) [113, 114] and permutation invariant training
(PIT) [115, 116]. These recent techniques aim to solve the label permutation prob-
lem [111, 116], a problem that occurs when the mixing sources are symmetric and the
learner cannot predetermine the target signal for the model’s outputs, and work
very well when separating multi-talker overlapping speech. DPCL, DANet, and
PIT all can separate two- and three-talker mixed speech with comparable quality
using a single model. Comparatively, PIT is simpler to implement, easier to
integrate with other techniques, and computationally more efficient at run time.
TasNet [117, 118], which is a time-domain PIT model, has so far achieved the best
performance when measured by the signal-to-distortion ratio (SDR) improvement on
overlapping speech separation.
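As an illustration of utterance-level PIT [115, 116], the loss below evaluates all output-to-source assignments and keeps the best one per utterance; the tensor layout and the use of a spectrogram mean-squared error are assumptions.

```python
# Sketch of utterance-level permutation invariant training (PIT): compute the separation
# loss under every output-to-source assignment and keep the minimum per utterance.
import itertools
import torch

def pit_mse_loss(estimates, references):
    # estimates, references: (B, S, T, F) tensors with S separated streams / source signals
    B, S = estimates.shape[:2]
    per_utt = []
    for perm in itertools.permutations(range(S)):
        # mean squared error per utterance under this output-to-source assignment
        err = torch.stack([((estimates[:, i] - references[:, p]) ** 2).mean(dim=(1, 2))
                           for i, p in enumerate(perm)]).mean(dim=0)   # shape (B,)
        per_utt.append(err)
    # best permutation is chosen independently for each utterance
    return torch.stack(per_utt).min(dim=0).values.mean()
```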
Since monaural speech separation cannot exploit source location information,
recent works focus on a multi-channel setup, in which an array of micro-
phones provides multiple recordings of the speech mixed with the same sources.
These multi-channel recordings contain information indicative of the spatial origins
of the underlying sound sources. When sound sources are spatially separated, mi-
crophone array techniques can localize sound sources and then extract the source
from the target direction.
The multi-channel spatial features can be leveraged for speech enhancement and
speaker separation in many different ways. For example, binaural features, such
as inter-channel time difference (ITD), inter-channel phase difference (IPD), and
inter-channel level difference (ILD), extracted from individual time-frequency (T-
F) unit pairs, have been exploited in the supervised speech segregation tasks [119]
as additional information to classify speech signal in the T-F domain. Beamforming
[120] is another way of leveraging spatial information. It boosts the signal from a
specific direction and attenuates interferences from other directions through proper
array configuration. Using spatial cues or the beamformer outputs as additional
features is a straightforward extension to the deep learning models, such as DPCL,
PIT, and DANet, originally designed for monaural speech separation.
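For illustration, IPD and ILD features for a two-channel recording can be computed as sketched below; the STFT parameters and the function name are arbitrary assumptions.

```python
# Sketch of inter-channel phase and level differences (IPD/ILD) from a two-channel
# recording, used as extra spatial features for a separation network.
import torch

def spatial_features(wav_ref, wav_other, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    spec_ref = torch.stft(wav_ref, n_fft, hop, window=window, return_complex=True)
    spec_oth = torch.stft(wav_other, n_fft, hop, window=window, return_complex=True)
    ipd = torch.angle(spec_ref) - torch.angle(spec_oth)   # inter-channel phase difference
    ild = 20 * torch.log10(spec_ref.abs() + 1e-8) - 20 * torch.log10(spec_oth.abs() + 1e-8)
    # cos/sin IPD are commonly concatenated with log magnitudes as network input
    return torch.cos(ipd), torch.sin(ipd), ild
```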
The first two studies on extending single-channel deep clustering for multi-
channel separation are [121] and [122]. In [121], the ideal binary masks (IBMs)
estimated with single-channel deep clustering are employed to compute speech and
noise covariance matrices, which are leveraged to derive an enhanced beamformer
for each source for separation. In [122], Drude et al. first employ the embeddings
produced with single-channel deep clustering for spatial clustering and then com-
pute a beamformer for each source for separation.
Different from the above two approaches, the study in [123] proposes concatenat-
ing cosine IPDs and sine IPDs with log magnitudes as input to the deep clustering
networks. Chen et al. [124] proposed a cascade system consisting of multi-look fixed
beamformers followed by a single-channel anchored deep attractor network on each
look-direction. Each fixed beamformer attenuates the speakers not in the hypoth-
esized look direction. Its output is further used for monaural separation. Their
subsequent study [125] trains a beam prediction network to select the beam that
yields the highest SDR for each speaker. The beamformed signals from the selected
beams are then fed to a PIT network to obtain the final separation results. Similar
to multi-channel deep clustering, in [126] the input to PIT is the concatenation of
the magnitude spectra of all the microphones and the IPDs between a reference
microphone and each of the non-reference microphones. Following [127], robust
speaker localization is approached from the angle of monaural or two-channel sep-
aration, where a two-channel PIT network is trained in [128] to identify T-F units
dominated by the same speaker for accurate direction estimation. Speaker extrac-
tion is then achieved by using an enhancement network trained with a combination
of directional features, which are computed by compensating IPDs or by using data-
dependent beamforming, initial mask estimates from the two-channel PIT network,
and the spectral features.
As the final goal is ASR, each separated speech stream can be directly fed
into ASR systems to generate recognition hypotheses. An even better way is to
learn the separation model jointly with the speech recognition model using the
objective function for speech recognition. Since separation is just an intermediate
step, Yu et al. [129] proposed to directly optimize the cross-entropy criterion against
senone labels using PIT without having an explicit speech separation step. In
[130], PIT-ASR was further advanced by imposing a modular structure on the
network, applying progressive pretraining, and improving the objective function
with teacher-student learning and a discriminative training criterion. Even when using
the ASR criterion to guide multi-speaker separation as in PIT-ASR, current
multi-speaker ASR systems still rely on signals from the source speakers to obtain the
time alignment. Hence, the multi-speaker mixture training data is indeed artificially
generated. In [131], an attempt was made to use an end-to-end ASR training criterion
for multi-speaker ASR. The whole system is optimized with only transcription-level
labels without using the source signal from each speaker. It was further extended
in [132] without pre-training. This opens a door for using real multi-speaker mixture
data for training.

4. Summary and Future Directions
In this chapter, we summarized progresses in the two areas where significant efforts
have been made for ASR, namely, E2E modeling and robust modeling. Although
the effort of using E2E models to replace hybrid systems in industry applications
has achieved only limited success (for example, Google deployed RNN-T on devices where
the LM is weak [26]) due to E2E systems’ inability to model rarely observed samples
during training [133], E2E modeling is still a direction worth further investigation.
E2E modeling directly optimizes the objective of the task and provides additional
flexibility in choosing models. It can easily integrate various information through
multiple encoders [69, 134]. In addition, some tasks that are complicated for hybrid
models can be easily realized with E2E models. For example, code switching or
multi-lingual ASR is difficult in the hybrid system. However, it is relatively easier
in E2E models by modeling characters and sub-words from all languages with an
extended output layer [135–138]. As another example, it is challenging to identify
who spoke what and when during a conversation, a task which is usually accomplished
with separated ASR and speaker diarization systems. In [139], a single E2E model
was proposed to perform both tasks by generating speaker-decorated transcription.
Among the three E2E models introduced in Section 2, CTC has a clear disadvantage
due to the output independence assumption, while RNN-T and AED have bet-
ter potential. RNN-T usually outperforms CTC, and AED performs best among
these three models due to its powerful structure. Because RNN-T is a streaming
model and AED can achieve better accuracy, a recent study combines these two
E2E models by using RNN-T in the first-pass decoding and AED in the second-
pass rescoring. This strategy improved the recognition accuracy with reasonable
perceived latency [133]. Given its success in machine translation, the Transformer
is a promising E2E model structure.
While we may continue proposing new E2E model structures, an equally impor-
tant task is solving practical issues that prevent E2E models from being used in
industrial applications. Some of these issues have been discussed in Section 2.4. Ac-
curacy is not everything. To deploy a system, we need to trade off among accuracy,
latency, and computational cost.
While the performance of ASR systems surpassed the threshold for adoption in
matched training-test environments, the research focus has shifted to developing
robust ASR systems that perform well in challenging real-world scenarios, such as
mismatched test environments and overlapping speech. Model adaptation is the
most straightforward solution. T/S learning has been successful in industry-scale
tasks due to its effectiveness in using large amounts of unlabeled data. The chal-
lenge of T/S learning for model adaptation is its reliance on parallel data although
in many cases simulated parallel data is effective. How to apply T/S model adap-
tation to scenarios where parallel data is not available and difficult to simulate
is an interesting research topic. If there is no prior knowledge about the testing
environment, adversarial learning can be a good choice. It trains ASR models to
generate domain-invariant features. The efficacy of adversarial learning has been
reported on tasks with small datasets. Its effectiveness remains to be examined when
a huge amount of training data is available, in which case the network may learn
domain-invariant features implicitly without adversarial training.
Great progress has been made in speech separation with the introduction of
DPCL, PIT and their variants. However, there are still challenging problems to be
solved.

• The application scenario should be better defined. Currently, many sep-
aration algorithms assume that the number of speakers is known or the
overlapping speech is pre-segmented, which often does not hold in practice.
• Some works measure the performance only on overlapping speech and ignore
the non-overlapping speech. However, both conditions are important in
real-world scenarios.
• Most current research on speech separation conducts experiments on arti-
ficial databases which may be very different from real-world scenarios.

To facilitate the research of speech separation toward addressing the above challenging
problems, a database for continuous speech separation [140] was proposed recently.
We believe the scheme of multi-channel feature encoding, the integration of the
signal processing (e.g. beamforming) and machine learning approaches, and the
exploitation of multimodal cues (e.g. speaker characteristics [141–144] and visual
information [145, 146]) are encouraging directions that have attracted great atten-
tion recently. For example, the most effective method on the challenging CHiME5
task [147] is the speaker-dependent extraction [148] which uses target speaker in-
formation to guide the separation. However, the computational cost is proportional
to the number of speakers in the conversation. This issue may be solved by using
the whole speaker inventory in the conversation as the guide instead of only a single
speaker [149].

5. Acknowledgement
The first author would like to thank Dr. Zhong Meng, Dr. Jeremy Wong, and Dr.
Amit Das at Microsoft for valuable inputs to improve the quality of this chapter.

References
[1] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk.
Learning acoustic frame labeling for speech recognition with recurrent neural net-
works. In Proc. ICASSP, pp. 4280–4284, (2015).
[2] Y. Miao, M. Gowayyed, and F. Metze. EESEN: End-to-end speech recognition using
deep RNN models and WFST-based decoding. In Proc. ASRU, pp. 167–174. IEEE,
(2015).
[3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition. In Proc. ICASSP,
pp. 4960–4964. IEEE, (2016).
[4] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly. A com-
parison of sequence-to-sequence models for speech recognition. In Proc. Interspeech,
pp. 939–943, (2017).
[5] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sri-
ram, and Z. Zhu. Exploring neural transducers for end-to-end speech recognition. In
Proc. ASRU, pp. 206–213. IEEE, (2017).
[6] H. Sak, M. Shannon, K. Rao, and F. Beaufays. Recurrent neural aligner: An encoder-
decoder neural network model for sequence to sequence mapping. In Proc. Inter-
speech, (2017).
[7] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur. Towards discriminatively-
trained HMM-based end-to-end models for automatic speech recognition. In Proc.
ICASSP, (2018).
[8] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan,
R. J. Weiss, K. Rao, K. Gonina, et al. State-of-the-art speech recognition with
sequence-to-sequence models. In Proc. ICASSP, (2018).
[9] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and
Z. Chen. Improving the performance of online neural transducer models. In Proc.
ICASSP, pp. 5864–5868, (2018).
[10] J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong. Advancing acoustic-to-word CTC
model. In Proc. ICASSP, (2018).
[11] J. Li, R. Zhao, J.-T. Huang, and Y. Gong. Learning small-size DNN with output-
distribution-based criteria. In Proc. Interspeech, pp. 1910–1914, (2014).
[12] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network,
arXiv preprint arXiv:1503.02531. (2015).
[13] J. Li, R. Zhao, Z. Chen, et al. Developing far-field speaker system via teacher-student
learning. In Proc. ICASSP, (2018).
[14] L. Mošner, M. Wu, A. Raju, S. H. K. Parthasarathi, K. Kumatani, S. Sundaram,
R. Maas, and B. Hoffmeister. Improving noise robustness of automatic speech recog-
nition via parallel data and teacher-student learning. In Proc. ICASSP, pp. 6475–
6479, (2019).
[15] S. H. K. Parthasarathi and N. Strom. Lessons from building acoustic models with a
million hours of speech. In Proc. ICASSP, pp. 6670–6674, (2019).
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pp. 2672–2680, (2014).
[17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks.
In Proceedings of the 23rd international conference on Machine learning, pp. 369–
376. ACM, (2006).
[18] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent
neural networks. In PMLR, pp. 1764–1772, (2014).
[19] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk,
and Y. Bengio, Learning phrase representations using RNN encoder-decoder for
statistical machine translation, arXiv preprint arXiv:1406.1078. (2014).
[20] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning
to align and translate, arXiv preprint arXiv:1409.0473. (2014).
[21] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end
attention-based large vocabulary speech recognition. In Proc. ICASSP, pp. 4945–
4949. IEEE, (2016).
[22] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based
models for speech recognition. In NIPS, pp. 577–585, (2015).
[23] A. Graves, Sequence transduction with recurrent neural networks, CoRR.
abs/1211.3711, (2012).
[24] H. Soltau, H. Liao, and H. Sak, Neural speech recognizer: Acoustic-to-word LSTM
model for large vocabulary speech recognition, arXiv preprint arXiv:1610.09975.
(2016).
[25] K. Rao, H. Sak, and R. Prabhavalkar. Exploring architectures, data and units for
streaming end-to-end speech recognition with RNN-transducer. In Proc. ASRU,
(2017).
[26] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach,
A. Kannan, Y. Wu, R. Pang, et al. Streaming end-to-end speech recognition for
mobile devices. In Proc. ICASSP, pp. 6381–6385, (2019).
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
and I. Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems, pp. 6000–6010, (2017).
[28] L. Dong, S. Xu, and B. Xu. Speech-transformer: a no-recurrence sequence-to-
sequence model for speech recognition. In Proc. ICASSP, pp. 5884–5888, (2018).
[29] S. Zhou, L. Dong, S. Xu, and B. Xu. Syllable-based sequence-to-sequence speech
recognition with the transformer in Mandarin Chinese. In Proc. Interspeech, (2018).
[30] Y. Zhao, J. Li, X. Wang, and Y. Li. The speechtransformer for large-scale mandarin
chinese speech recognition. In Proc. ICASSP, pp. 7095–7099, (2019).
[31] S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani.
Improving transformer based end-to-end speech recognition with connectionist tem-
poral classification and language model integration. In Proc. Interspeech, (2019).
[32] H. Sak, A. Senior, K. Rao, and F. Beaufays. Fast and accurate recurrent neural
network acoustic models for speech recognition. In Proc. Interspeech, (2015).
[33] A. Senior, H. Sak, F. de Chaumont Quitry, T. Sainath, and K. Rao. Acoustic
modelling with CD-CTC-SMBR LSTM RNNs. In Proc. ASRU, pp. 604–609. IEEE,
(2015).
[34] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen,
M. Chrzanowski, A. Coates, G. Diamos, et al., Deep speech 2: End-to-end speech
recognition in English and Mandarin, arXiv preprint arXiv:1512.02595. (2015).
[35] D. Povey, V. Peddinti, D. Galvez, P. Ghahrmani, V. Manohar, X. Na, Y. Wang, and
S. Khudanpur. Purely sequence-trained neural networks for asr based on lattice-free
MMI. In Proc. Interspeech, (2016).
[36] A. Das, J. Li, R. Zhao, and Y. Gong. Advancing connectionist temporal classification
with attention modeling. In Proc. ICASSP, (2018).
[37] A. Das, J. Li, G. Ye, R. Zhao, and Y. Gong, Advancing acoustic-to-word CTC
model with attention and mixed-units, IEEE/ACM Transactions on Audio, Speech,
and Language Processing. 27(12), 1880–1892, (2019).
[38] T. Bagby, K. Rao, and K. C. Sim. Efficient implementation of recurrent neural
network transducer in tensorflow. In Proc. SLT, pp. 506–512, (2018).
[39] J. Li, R. Zhao, H. Hu, and Y. Gong. Improving RNN transducer modeling for end-
to-end speech recognition. In Proc. ASRU, (2019).
[40] J. Li, C. Liu, and Y. Gong. Layer trajectory LSTM. In Proc. Interspeech, (2018).
[41] J. Li, L. Lu, C. Liu, and Y. Gong. Exploring layer trajectory LSTM with depth
processing units and attention. In Proc. SLT, (2018).
[42] J. Li, L. Lu, C. Liu, and Y. Gong. Improving layer trajectory LSTM with future
context frames. In Proc. ICASSP, pp. 6550–6554, (2019).
[43] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Ad-
vances in neural information processing systems, pp. 2204–2212, (2014).
[44] M. Schuster and K. Nakajima. Japanese and Korean voice search. In Proc. ICASSP,
pp. 5149–5152. IEEE, (2012).
[45] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence
prediction with recurrent neural networks. In Advances in Neural Information Pro-
cessing Systems, pp. 1171–1179, (2015).
[46] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the in-
ception architecture for computer vision. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 2818–2826, (2016).
[47] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar. An
analysis of incorporating an external language model into a sequence-to-sequence
model. In Proc. ICASSP, (2018).
[48] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and
A. Kannan. Minimum word error rate training for attention-based sequence-to-
sequence models. In Proc. ICASSP, (2018).
[49] S. Kim, T. Hori, and S. Watanabe. Joint CTC-attention based end-to-end speech
recognition using multi-task learning. In Proc. ICASSP, (2017).
[50] T. Hori, S. Watanabe, and J. Hershey. Joint CTC/attention decoding for end-to-end
speech recognition. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 518–529, (2017).
[51] C.-C. Chiu and C. Raffel, Monotonic chunkwise attention, arXiv preprint
arXiv:1712.05382. (2017).
[52] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, An online attention-based model for
speech recognition, arXiv preprint arXiv:1811.05247. (2018).
[53] N. Moritz, T. Hori, and J. Le Roux. Triggered attention for end-to-end speech
recognition. In Proc. ICASSP, pp. 5666–5670, (2019).
[54] L. Dong and B. Xu, CIF: Continuous integrate-and-fire for end-to-end speech recog-
nition, arXiv preprint arXiv:1905.11235. (2019).
[55] L. Lu, X. Zhang, and S. Renais. On training the recurrent neural network encoder-
decoder for large vocabulary end-to-end speech recognition. In Proc. ICASSP, pp.
5060–5064. IEEE, (2016).
[56] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, Di-
rect acoustics-to-word models for English conversational speech recognition, arXiv
preprint arXiv:1703.07754. (2017).
[57] J. Li, G. Ye, R. Zhao, J. Droppo, and Y. Gong. Acoustic-to-word model without
OOV. In Proc. ASRU, (2017).
[58] K. Audhkhasi, B. Kingsbury, B. Ramabhadran, G. Saon, and M. Picheny. Build-
ing competitive direct acoustics-to-word models for English conversational speech
recognition. In Proc. ICASSP, (2018).
[59] Y. Gaur, J. Li, Z. Meng, and Y. Gong. Acoustic-to-phrase models for speech recog-
nition. In Proc. Interspeech, (2019).
[60] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words
with subword units. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin,
Germany, (2016).
[61] H. Xu, S. Ding, and S. Watanabe. Improving end-to-end speech recognition with
pronunciation-assisted sub-word modeling. In Proc. ICASSP, pp. 7110–7114, (2019).
[62] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares,
H. Schwenk, and Y. Bengio, On using monolingual corpora in neural machine trans-
lation, arXiv preprint arXiv:1503.03535. (2015).
[63] A. Sriram, H. Jun, S. Satheesh, and A. Coates. Cold fusion: Training seq2seq models
together with language models. In Proc. Interspeech, (2018).
[64] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu. A
comparison of techniques for language model integration in encoder-decoder speech
recognition. In Proc. SLT, pp. 369–375, (2018).
[65] A. Tjandra, S. Sakti, and S. Nakamura. Listening while speaking: Speech chain by
deep learning. In Proc. ASRU, pp. 301–308, (2017).
[66] J. Guo, T. N. Sainath, and R. J. Weiss. A spelling correction model for end-to-end
speech recognition. In Proc. ICASSP, pp. 5651–5655, (2019).
[67] S. Zhang, M. Lei, and Z. Yan. Investigation of transformer based spelling correction
model for CTC-based end-to-end Mandarin speech recognition. In Proc. Interspeech,
(2016).
[68] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao. Deep context:
end-to-end contextual speech recognition. In Proc. SLT, pp. 418–425, (2018).
[69] A. Bruguier, R. Prabhavalkar, G. Pundak, and T. N. Sainath. Phoebe:
Pronunciation-aware contextualization for end-to-end speech recognition. In Proc.
ICASSP, pp. 6171–6175, (2019).
[70] D. Zhao, T. N. Sainath, D. Rybach, D. Bhatia, B. Li, and R. Pang. Shallow-fusion
March 12, 2020 11:35 ws-rv961x669 HBPRCV-6th Edn.–11573 ”Speech Regn” page 178

178 J. Li and D. Yu

end-to-end contextual biasing. In Proc. Interspeech, (2019).


[71] S. Kim and F. Metze. Dialog-context aware end-to-end speech recognition. In Proc.
SLT, pp. 434–440, (2018).
[72] R. Masumura, T. Tanaka, T. Moriya, Y. Shinohara, T. Oba, and Y. Aono. Large
context end-to-end automatic speech recognition via extension of hierarchical recur-
rent encoder-decoder models. In Proc. ICASSP, pp. 5661–5665, (2019).
[73] S. Kim, S. Dalmia, and F. Metze. Cross-attention end-to-end ASR for two-party
conversations. In Proc. Interspeech, (2019).
[74] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, An overview of noise-robust auto-
matic speech recognition, IEEE/ACM Transactions on Audio, Speech and Language
Processing. 22(4), 745–777 (April, 2014).
[75] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recogni-
tion: A Bridge to Practical Applications. (Academic Press, 2015).
[76] D. Yu and J. Li, Recent Progresses in Deep Learning Based Acoustic Models,
IEEE/CAA J. of Autom. Sinica. 4(3), 399–412 (July, 2017).
[77] C. Bucilu, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and
data mining, pp. 535–541. ACM, (2006).
[78] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in neural
information processing systems, pp. 2654–2662, (2014).
[79] S. Watanabe, T. Hori, J. Le Roux, et al. Student-teacher network learning with
enhanced features. In Proc. ICASSP, (2017).
[80] Z. Meng, J. Li, Y. Zhao, and Y. Gong. Conditional teacher-student learning. In
Proc. ICASSP, pp. 6445–6449, (2019).
[81] J. H. Wong and M. J. Gales. Sequence student-teacher training of deep neural net-
works. In Proc. Interspeech, (2016).
[82] J. H. M. Wong, M. J. F. Gales, and Y. Wang, General sequence teacher-student
learning, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
(2019).
[83] N. Kanda, Y. Fujita, and K. Nagamatsu. Investigation of lattice-free maximum
mutual information-based acoustic models with sequence-level Kullback-Leibler di-
vergence. In Proc. ASRU, pp. 69–76. IEEE, (2017).
[84] V. Manohar, P. Ghahremani, D. Povey, and S. Khudanpur. A teacher-student learn-
ing approach for unsupervised domain adaptation of sequence-trained ASR models.
In Proc. SLT, pp. 250–257, (2018).
[85] A. Das, J. Li, C. Liu, and Y. Gong. Universal acoustic modeling using neural mixture
models. In Proc. ICASSP, pp. 5681–5685, (2019).
[86] Z. You, D. Su, and D. Yu. Teach an all-rounder with experts in different domains.
In Proc. ICASSP, pp. 6425–6429, (2019).
[87] J. Wong, M. Gales, and Y. Wang. Learning between different teacher and student
models in ASR. In Proc. ASRU, (2019).
[88] R. Pang, T. Sainath, R. Prabhavalkar, et al. Compression of end-to-end models. In
Proc. Interspeech, pp. 27–31, (2018).
[89] R. M. Munim, N. Inoue, and K. Shinoda. Sequence-level knowledge distillation for
model compression of attention-based sequence-to-sequence speech recognition. In
Proc. ICASSP, pp. 6151–6155, (2019).
[90] Z. Meng, J. Li, Y. Gaur, and Y. Gong. Domain adaptation via teacher-student
learning for end-to-end speech recognition. In Proc. ASRU, (2019).
[91] Y. Ganin and V. Lempitsky, Unsupervised domain adaptation by backpropagation,
arXiv preprint arXiv:1409.7495. (2014).
March 12, 2020 11:35 ws-rv961x669 HBPRCV-6th Edn.–11573 ”Speech Regn” page 179

Recent Progresses on Deep Learning for Speech Recognition 179

[92] S. Sun, B. Zhang, L. Xie, and Y. Zhang, An unsupervised deep domain adaptation
approach for robust speech recognition, Neurocomputing. (2017).
[93] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong. Unsupervised adaptation with
domain separation networks for robust speech recognition. In Proc. ASRU, (2017).
[94] P. Denisov, N. T. Vu, and M. F. Font. Unsupervised domain adaptation by ad-
versarial learning for robust speech recognition. In Speech Communication; 13th
ITG-Symposium, pp. 1–5. VDE, (2018).
[95] Z. Meng, J. Li, and Y. Gong. Adversarial speaker adaptation. In Proc. ICASSP, pp.
5721–5725, (2019).
[96] Y. Shinohara. Adversarial multi-task learning of deep neural networks for robust
speech recognition. In Proc. Interspeech, pp. 2369–2372, (2016).
[97] D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and
Y. Bengio, Invariant representations for noisy speech recognition, arXiv preprint
arXiv:1612.01928. (2016).
[98] Z. Meng, J. Li, Y. Gong, and B.-H. F. Juang. Adversarial teacher-student learning
for unsupervised domain adaptation. In Proc. ICASSP, (2018).
[99] Z. Meng, J. Li, and Y. Gong. Attentive adversarial learning for domain-invariant
training. In Proc. ICASSP, pp. 6740–6744. IEEE, (2019).
[100] G. Saon, G. Kurata, T. Sercu, et al. English conversational telephone speech recog-
nition by humans and machines. In Proc. Interspeech, (2017).
[101] Z. Meng, J. Li, Y. Gong, and B.-H. F. Juang. Speaker-invariant training via adver-
sarial learning. In Proc. ICASSP, (2018).
[102] L. Tóth and G. Gosztolya. Reducing the inter-speaker variance of CNN acoustic
models using unsupervised adversarial multi-task training. In International Confer-
ence on Speech and Computer, pp. 481–490. Springer, (2019).
[103] L. Wu, H. Chen, L. Wang, P. Zhang, and Y. Yan, Speaker-invariant feature-mapping
for distant speech recognition via adversarial teacher-student learning, Interspeech.
1, 1, (2019).
[104] J. Yi, J. Tao, Z. Wen, and Y. Bai. Adversarial multilingual training for low-resource
speech recognition. In Proc. ICASSP, (2018).
[105] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, Massively multilingual
adversarial speech recognition, HAACL-HLT. (2019).
[106] K. Hu, H. Sak, and H. Liao, Adversarial training for multilingual acoustic modeling,
arXiv preprint arXiv:1906.07093. (2019).
[107] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie. Domain adversarial
training for accented speech recognition. In Proc. ICASSP, (2018).
[108] Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square
error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech,
and Signal Processing. 33(2), 443–445, (1985).
[109] D. Wang and G. Brown, Computational Auditory Scene Analysis: Principles, Algo-
rithms, and Applications. (Wiley-IEEE Press, 2006).
[110] C. Févotte, E. Vincent, and A. Ozerov. Single-channel audio source separation with
NMF: Divergences, constraints and algorithms. In Audio Source Separation, pp.
1–24. Springer, (2018). doi: 10.1007/978-3-319-73031-8 1.
[111] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe. Deep clustering: Discrimi-
native embeddings for segmentation and separation. In the Proceedings of ICASSP,
pp. 31–35, (2016).
[112] Y. Isik, J. Roux, Z. Z. Chen, and et al. Single-channel multi-speaker separation using
deep clustering. In Interspeech, pp. 545–549, (2016).
[113] Z. Chen, Y. Luo, and N. Mesgarani. Deep attractor network for single-microphone
March 12, 2020 11:35 ws-rv961x669 HBPRCV-6th Edn.–11573 ”Speech Regn” page 180

180 J. Li and D. Yu

speaker separation. In the Proceedings of ICASSP, pp. 246–250, (2017).


[114] Y. Luo, Z. Chen, and N. Mesgarani, Speaker-independent speech separation with
deep attractor network, IEEE/ACM Transactions on Acoustics, Speech, and Signal
Processing. (2018).
[115] M. Kolbak, D. Yu, Z.-H. Tan, and J. Jensen, Multitalker speech separation with
utterance-level permutation invariant training of deep recurrent neural networks,
IEEE/ACM Transactions on Audio, Speech and Language Processing. 25(10), 1901–
1913, (2017).
[116] D. Yu, M. Kolbak, Z.-H. Tan, and J. Jensen. Permutation invariant training of deep
models for speaker-independent multi-talker speech separation. In the Proceedings
of ICASSP, (2017).
[117] Y. Luo and N. Mesgarani. Tasnet: time-domain audio separation network for real-
time, single-channel speech separation. In the Proceedings of ICASSP, (2018).
[118] Y. Luo and N. Mesgarani, Tasnet: Surpassing ideal time-frequency masking for
speech separation, arXiv preprint arXiv:1809.07454v2. (2018).
[119] N. Roman, D. Wang, and G. Brown, Speech segregation based on sound localization,
J. Acoust. Soc. Am. 114, 2236–2252, (2003).
[120] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, A consolidated perspec-
tive on multi-microphone speech enhancement and source separation, IEEE/ACM
Transactions on Audio, Speech, and Language Processing. 25, 692–730, (2017).
[121] T. Higuchi, K. Kinoshita, M. Delcroix, K. Zmolkova, and T. Nakatani. Deep
clustering-based beamforming for separation with unknown number of sources. In
Interspeech, (2017).
[122] L. Drude and R. Haeb-Umbach. Tight integration of spatial and spectral features
for bss with deep clustering embeddings. In Interspeech, (2017).
[123] Z.-Q. Wang, J. L. Roux, and J. Hershey. Multi-channel deep clustering: Discrimi-
native spectral and spatial embeddings for speaker-independent speech separation.
In the Proceedings of ICASSP, (2018).
[124] Z. Chen, J. Li, X. Xiao, T. Yoshioka, H. Wang, Z. Wang, and Y. Gong. Cracking the
cocktail party problem by multi-beam deep attractor network. In IEEE Workshop
on ASRU, (2017).
[125] Z. Chen, T. Yoshioka, X. Xiao, J. Li, M. L. Seltzer, and Y. Gong. Efficient integra-
tion of fixed beamformers and speech separation networks for multi-channel far-field
speech separation. In the Proceedings of ICASSP, (2018).
[126] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva. Multi-microphone neural
speech separation for far-field multi-talker speech recognition. In the Proceedings
of ICASSP, (2018).
[127] Z.-Q. Wang, X. Zhang, and D. Wang, Robust speaker localization guided by deep
learning based time-frequency masking, IEEE/ACM Transactions on Audio, Speech,
and Language Processing. 27, 178–188, (2019).
[128] Z.-Q. Wang and D.-L. Wang, Combining spectral and spatial features for deep learn-
ing based blind speaker separation, IEEE/ACM Transactions on Audio, Speech, and
Language Processing. 27, 457–468, (2019).
[129] D. Yu, X. Chang, and Y. Qian. Recognizing multi-talker speech with permutation
invariant training. In Proc. Interspeech, (2017).
[130] Z. Chen, J. Droppo, J. Li, and W. Xiong, Progressive joint modeling in unsupervised
single-channel overlapped speech recognition, IEEE/ACM Transactions on Audio,
Speech and Language Processing (TASLP). 26(1), 184–196, (2018).
[131] S. Settle, J. L. Roux, T. Hori, S. Watanabe, and J. R. Hershey. End-to-end multi-
speaker speech recognition. In Proc. ICASSP, (2018).
March 12, 2020 11:35 ws-rv961x669 HBPRCV-6th Edn.–11573 ”Speech Regn” page 181

Recent Progresses on Deep Learning for Speech Recognition 181

[132] X. Chang, Y. Qian, K. Yu, and S. Watanabe. End-to-end monaural multi-speaker


ASR system without pretraining. In Proc. ICASSP, pp. 6256–6260. IEEE, (2019).
[133] T. Sainath, R. Pang, and et. al. Two-pass end-to-end speech recognition. In Proc.
Interspeech, (2019).
[134] X. Wang, R. Li, S. H. Mallidi, T. Hori, S. Watanabe, and H. Hermansky. Stream
attention-based multi-array end-to-end speech recognition. In Proc. ICASSP, pp.
7105–7109, (2019).
[135] J. Cho, M. K. Baskar, R. Li, M. Wiesner, S. H. Mallidi, N. Yalta, M. Karafiat,
S. Watanabe, and T. Hori. Multilingual sequence-to-sequence speech recognition:
architecture, transfer learning, and language modeling. In Proc. SLT, pp. 521–527,
(2018).
[136] N. Luo, D. Jiang, S. Zhao, C. Gong, W. Zou, and X. Li, Towards end-to-end code-
switching speech recognition, arXiv preprint arXiv:1810.13091. (2018).
[137] K. Li, J. Li, G. Ye, R. Zhao, and Y. Gong. Towards code-switching ASR for end-to-
end CTC models. In Proc. ICASSP, pp. 6076–6080, (2019).
[138] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan. Bytes are all you need: End-to-
end multilingual speech recognition and synthesis with bytes. In Proc. ICASSP, pp.
5621–5625, (2019).
[139] L. E. Shafey, H. Soltau, and I. Shafran. Joint speech recognition and speaker di-
arization via sequence transduction. In Proc. Interspeech, (2019).
[140] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, J. Wu, Y. Luo, Z. Meng, X. Xiao, and J. Li,
Continuous speech separation: dataset and analysis, in Proc. ICASSP. (2020).
[141] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani.
Speaker-aware neural network based beamformer for speaker extraction in speech
mixtures. In Proc. Interspeech, (2017).
[142] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani. Single channel
target speaker extraction and recognition with speaker beam. In Proc. ICASSP, pp.
5554–5558. IEEE, (2018).
[143] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous,
R. J. Weiss, Y. Jia, and I. L. Moreno, Voicefilter: Targeted voice separation by
speaker-conditioned spectrogram masking, arXiv preprint arXiv:1810.04826. (2018).
[144] X. Xiao, Z. Chen, T. Yoshioka, H. Erdogan, C. Liu, D. Dimitriadis, J. Droppo, and
Y. Gong. Single-channel speech extraction using speaker inventory and attention
network. In Proc. ICASSP, pp. 86–90. IEEE, (2019).
[145] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman,
and M. Rubinstein, Looking to listen at the cocktail party: A speaker-independent
audio-visual model for speech separation, arXiv preprint arXiv:1804.03619. (2018).
[146] J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu, Time domain
audio visual speech separation, arXiv preprint arXiv:1904.03760. (2019).
[147] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, The fifth ’CHiME’ speech
separation and recognition challenge: Dataset, task and baselines, arXiv preprint
arXiv:1803.10609. (2018).
[148] L. Sun, J. Du, T. Gao, Y. Fang, F. Ma, J. Pan, and C.-H. Lee. A two-stage single-
channel speaker-dependent speech separation approach for Chime-5 challenge. In
Proc. ICASSP, pp. 6650–6654. IEEE, (2019).
[149] P. Wang, Z. Chen, X. Xiao, Z. Meng, T. Yoshioka, T. Zhou, L. Lu, and J. Li. Speech
separation using speaker inventory. In Proc. ASRU, (2019).
This page intentionally left blank
PART 2

APPLICATIONS
A BRIEF INTRODUCTION

I recall that as early as the sixties, Prof. K.S. Fu emphasized the applications of pattern
recognition techniques. He was involved in no fewer than two dozen pattern recognition
applications [1, 2]. Early efforts focused on speech recognition, character recognition,
medical diagnosis and remote sensing. In our daily lives we have benefited enormously
from the progress in speech and character/document processing and recognition.
Throughout this handbook series we have published important work on both speech and
character recognition. It is difficult to provide an extensive list of books in an area with
over 60 years of development; references 3 and 4 are among the many important books in
the field.
The progress in computer vision as well as pattern recognition has had an enormous
impact on rapid advances in person identification, especially face recognition (see e.g.
[5]), fingerprint recognition and biometric authentication [6], which now achieve great
accuracy thanks to advances in neural network techniques. Though medical diagnosis
using pattern recognition has a long history, reliable automated systems have been few;
despite enormous progress in medical imaging hardware, much success has yet to be seen
with the use of computer vision and pattern recognition (see e.g. [7]). The success in
remote sensing, for both hyperspectral/multispectral and synthetic aperture radar data, is
however much more evident, as documented in some important books (see e.g. [8, 9, 10,
11]). There are many hundreds of other applications of pattern recognition and computer
vision with varying degrees of success.
A recent report by Dr. Bob Fisher of the University of Edinburgh (rbf@inf.ed.ac.uk)
lists 300 areas of computer vision applications. This number may be on the low side
considering the many ongoing efforts to integrate computer vision into larger automation
systems. On the issue of software development, there has been limited success with
comprehensive pattern recognition or computer vision software packages. Specialized
speech analysis and recognition software has been better developed. There are also many
popular neural network software systems available, such as those offered by MATLAB. It
is also useful to mention the recent deep learning software described by R. Cresson [12].
Examples of some difficult applications from my experience are tele-seismic signal
recognition (see e.g. [13, 14]), signal and image recognition of underwater objects (see
e.g. [15, 16]), and automated sorting of fish (see e.g. [17]). For such applications, a
correct recognition rate of 90% is considered very good. The use of multiple sensors and
knowledge-based methods can provide significant improvement. It is noted that for
waveforms in seismic and underwater acoustics there is little contextual information
available, while images contain rich contextual and even structural information that can
be helpful for recognition. Elegant mathematical transform techniques that reduce the
feature dimension are still needed for effective waveform recognition.
Among other application areas, machine vision and inspection have enjoyed enormous
success in industry. Some progress in machine vision has been reported in the Handbook
of Pattern Recognition and Computer Vision, Vols. 1–5. As in many other applications,
deep learning brings a new approach to machine vision problems [18]. In deep learning
for remote sensing alone, there may well be over 100 publications in the last six years.
We believe that applications which benefit humankind, such as medical diagnosis and
remote sensing of the Earth's environment, are the most important and challenging.
Commercial success, though temporary, can be an indication of a good use of theoretical
advances. Indeed, there are unlimited application possibilities in the foreseeable future.
Though many applications share similar pattern recognition techniques and principles, it
is interesting to note that different applications have inspired the exploration of new
theory and techniques. It should also be noted that zero error is not expected in pattern
classification or image segmentation; our goal is to make the error as small as possible.
As such, fully automated recognition systems may not completely replace human experts.
Still, with rapidly improving sensor technology we are approaching near-zero errors in
many pattern recognition and image segmentation tasks. The future of applications in
pattern recognition and computer vision (in both 2-D and 3-D) is indeed very bright.

References
1. K.S. Fu, editor, “Applications of Pattern Recognition”, CRC Press 1982.
2. K.S. Fu, editor, “Syntactic Pattern Recognition, Applications”, Springer-Verlag 1982.
3. F. Jelinek, “Statistical Methods for Speech Recognition” (Language, Speech, and
Communication), MIT Press 1998, now in 4th printing, available through Amazon.com.
4. M. Cheriet, N. Kharma, C.L. Liu, C.Y. Suen, “Character Recognition Systems: A Guide for
Students and Practitioners”, Wiley 2007.
5. S. Li and A.K. Jain, editors, “Handbook of Face Recognition”, Springer 2011.
6. S.Y. Kung, M.W. Mak and S.H. Lin, “Biometric Authentication”, Prentice-Hall, 2006.
7. C.H. Chen, editor, “Computer Vision in Medical Imaging”, World Scientific Publishing 2014.
8. L. Alparone, B. Aiazzi, S. Baronti and A. Garzelli, “Remote Sensing Image Fusion”, CRC Press,
2015.
9. C.H. Chen, editor, “Signal and Image Processing for Remote Sensing”, 2nd edition, CRC Press
2012.
10. Q. Zhang and R. Skjetne, “Sea Ice Image Processing with MATLAB”, CRC Press 2018.
11. C.H. Chen, editor, “Signal and Image Processing for Remote Sensing”, CRC Press 2006 (first
edition), 2012 (second edition).
12. R. Cresson, “Deep Learning on Remote Sensing Images with Open Source Software”, CRC
Press, June 2020.
13. C.H. Chen, “Seismic signal recognition”, Geoexploration, vol. 6, no. 1, pp. 133–146, 1978.
14. H.H. Liu and K.S. Fu, “A syntactic approach to seismic pattern recognition”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 4, pp. 136–140, 1982.
15. C.H. Chen, “Recognition of underwater transient patterns”, Pattern Recognition, vol. 18, no. 9,
pp. 485–490, 1985.
16. C.H. Chen, “Neural networks for active sonar classification”. IEEE OCEANS 1990.
17. K. Stokesbury, Lecture on status of the marine fishery research program at UMass Dartmouth,
May 15, 2019.
18. A. Wilson, “Deep learning brings a new dimension to machine vision”, Laser Focus World,
May 2019, pp. 43–47.
CHAPTER 2.1

MACHINE LEARNING IN REMOTE SENSING

Ronny Hänsch∗
Department SAR Technology,
German Aerospace Center (DLR), 82234 Weßling, Germany

Remote Sensing plays an essential role in Earth Observation and thus in understanding
the complex relationship between bio/geo-physical processes and human welfare. The
growing quality and quantity of remotely sensed data make manual interpretation
infeasible and require accurate and efficient methods to automatically analyse the
acquired data. This chapter discusses two Machine Learning approaches that address
these tasks, using the example application of semantic segmentation of polarimetric
Synthetic Aperture Radar images. It shows the importance of proper classifier design
and training as well as the benefits of automatically learned features.

1. Introduction

Several human welfare concerns (such as storms, wildfires, floods, epidemics,


poverty) are directly linked to land use, environmental vulnerability, and the living
conditions of human populations. Thus, human welfare, i.e. how to maintain and
improve it, does directly depend on a deep understanding of those highly complex
relations.
The next decades will be characterised by large and fast changes of human
populations, climate, economic demands, and consequently land use. Observing,
monitoring, and understanding these changes as well as linking environmental con-
ditions to human welfare are two of the most important challenges humanity faces
today.
While not being the only approach to address these challenges, Remote Sensing
(RS) does play a crucial role as it provides data about developments over space and
time that cannot be acquired by a ground-based sensor.
In general, Remote Sensing refers to acquiring data of an object or phenomenon
without establishing a tactile contact between sensor and object. In a modern
∗ R. Hänsch is with the Department SAR Technology of the German Aerospace Center (DLR),

82234 Weßling, Germany, and was during the writing of this chapter with the Computer Vi-
sion & Remote Sensing group, Technische Universität Berlin, 10587 Berlin, Germany. e-mail:
ronny.haensch@dlr.de. webpage: www.rhaensch.de.

context, however, it usually means acquiring data about the Earth (or other plane-
tary objects) by either air- or space-borne sensors often including (semi-)automatic
processing and analysis.
The corresponding sensors can be divided into two groups depending on whether
the sensor emits radiation on its own (i.e. being an active sensor) or relies on
external radiation sources such as the sun (i.e. being a passive sensor). Examples
of active sensors are Synthetic Aperture Radar (SAR, emitting microwaves) and
Light Detection and Ranging (LiDAR, using laser light), while examples of passive
sensors are CCD cameras, infrared sensors, and imaging spectrometers.
These sensors provide information about terrestrial, marine, and atmospheric
variables, such as changes in land cover, sea surfaces, and temperatures. Modern
Remote Sensing data products (combined with data from other sources) are used
with decades of scientific development and operational experience to address tasks
such as natural hazard monitoring, global climate change, and urban planning.
To this aim, a large number of air- and space-borne sensors deployed by research
institutes and industry of many countries provide multi-source (LiDAR, SAR, op-
tical, etc.), multi-temporal, and multi-resolution Remote Sensing data of increasing
quantity and quality (e.g. with higher spatial and spectral resolution).
However, acquiring, processing, and interpreting these data comes with several
challenges that are very different from the challenges of close-range Computer Vision
and which often hinder a successful direct transfer and application of correspond-
ing tools and methods. Apart from special applications such as medical Computer
Vision and autonomous driving, close-range Computer Vision is dominated by op-
tical digital cameras. This technology is sufficiently mature and industrialised to be
affordable for most individual persons and easily manageable even by laymen. In
contrast, Remote Sensing consists of a large variety of sensors with very different
properties. While commercial solutions for some sensors exist, sensor development
and launch are often still active fields of research. The data acquisition can only
seldom be performed by individuals as significant monetary, infrastructural, and
knowledge resources are needed to deploy and manage for example a satellite or
perform air-borne measurement campaigns. Also the data processing often requires
expert knowledge as it involves for example atmospheric correction, sensor cali-
bration, and geo-referencing. This is why Remote Sensing data is often owned by
specific research institutes, space agencies, or companies and not directly available
to the public. Nowadays, more and more data is freely available (e.g. via ESA’s
Copernicus programme [1]) or can be obtained via scientific proposals. However,
national and international law as well as paywalls still hinder free access to many
Remote Sensing data products. On the other hand, the interpretation of Remote
Sensing data often requires domain specific expertise as well. While most people
nowadays are able to understand overhead optical imagery, the visual interpretation
of a SAR image poses difficulties even for trained experts. This is only worsened if
not a semantic interpretation but a bio/geo-physical understanding is aimed for, e.g.
estimating soil moisture, forest height, ice thickness, vegetation health, or biomass.
The difficulties regarding data access and data interpretability pose a significant
obstacle when applying Machine Learning for the automatic analysis of Remote
Sensing data. Most Machine Learning methods that aim to estimate a mapping
from the input data to the target variable require training data, i.e. samples for
which input and desired output are known. Close-range Computer Vision often
deals with every-day objects that can be labelled by laymen via e.g. crowd-sourcing.
The same is only seldom possible for Remote Sensing data as its interpretation
requires expert knowledge and some target variables can only be determined by
in-situ measurements.
Nevertheless, there has been a large success in the development and application
of Machine Learning methods in Remote Sensing. A complete overview about cor-
responding methods and applications would fill a whole book series and is therefore
not possible within a single book chapter. Instead, this chapter focuses on data of
a single sensor type, i.e. SAR, and on a single Machine Learning aspect, i.e. the
learning of optimal features for semantic segmentation, i.e. the estimation of a class
label for each pixel within the images. Most of the methods introduced in the next
sections, however, can be easily transferred to other sensors (e.g. hyperspectral
cameras) or other tasks (e.g. regression).

2. The Traditional Processing Chain of PolSAR Image Analysis

Synthetic Aperture Radar (SAR) is an active air- or space-borne sensor that emits
microwaves and records the backscattered echo. It is independent of daylight, only
insignificantly weather-dependent and able to penetrate clouds, dust and, to a cer-
tain extent, vegetation depending on the wavelength used. Polarimetric SAR (Pol-
SAR) transmits and receives in different polarizations and thus records multichannel
images. The change in orientation and degree of polarization depends on various
surface properties, including moisture, roughness, and object geometry. Conse-
quently, the recorded data contains valuable information about physical processes
as well as semantic object classes on the illuminated ground.
Since modern sensors record increasing volumes of these data which makes man-
ual interpretation infeasible, there is a large need for methods that automatically
analyse PolSAR images. One typical task is the creation of semantic maps, i.e. the
assignment of a semantic label to each pixel in the images. This task is accom-
plished by supervised Machine Learning methods that change the internal param-
eters of a generic model so that the system provides (on average) the correct label
when a training sample is given, i.e. a sample for which the true class is known.
One way to approach this problem is to model the relationship between data and
semantic label with probabilistic distributions (or mixtures thereof) (e.g. [2; 3;
4]). On the other hand, there are discriminative approaches, which are usually easier to
train and more robust than such generative models. These methods extract
task-specific image features and apply classifiers such as Support Vector Machines
(SVMs, e.g. in [5]), Multi-Layer Perceptrons (MLPs, e.g. in [6]) or Random Forests
(RFs, e.g. in [7]). The feature extraction step often involves manually designing
and selecting operators that are specific to the given classification task and thus
requires expert knowledge.
Modern approaches avoid the extraction of predefined features by adapting the
involved classifier to work directly on the complex-valued PolSAR data, e.g. by
using complex-valued MLPs [8] or SVMs with kernels defined over the complex-
domain [9]. Other methods use quasi-exhaustive feature sets that at least poten-
tially contain all the information necessary to solve any given classification problem.
The high dimensionality of the corresponding feature space is problematic for many
modern classifiers which is why it is often reduced by techniques such as principal
component analysis [10], independent component analysis [11], or linear discrimi-
nant analysis [12]. Another solution to this problem is to apply classifiers that are
not prone to this curse of dimensionality. One example are Random Forests (RFs)
due to their inbuilt feature selection as for example in [13] which computes hundreds
of features from a given PolSAR image as input for a RF.
These kinds of methods are less biased towards specific classification tasks. However,
the large number of features consumes a huge amount of memory and computation time.
The following sections present a RF variant that can be applied directly to PolSAR data
without the need to pre-compute any features and thus drastically decreases the required
amount of memory and processing time.

3. Holistic Feature Extraction and Model Training

The basic idea of feature learning is to avoid the precomputation of features by


including the feature extraction into the optimization problem of the classifier. Well-known
examples are Convolutional Networks (ConvNets), which - in the case of
PolSAR data - are either applied to simple real-valued representations (e.g. [14]) or
adapted to the complex domain [15; 16]. An example of this approach is discussed
in Section 3.2.
While Deep Learning is probably the best-known example of feature learning,
it is also possible with shallow learners. One example is Random Forests that
are tailored towards structured data such as images, i.e. that are applied to image
patches [17; 18; 13]. The next section discusses how to adapt these RFs to work
directly on the complex-valued data of PolSAR images [19].

3.1. Random Forests


A Random Forest (RF) is a set of multiple (usually) binary decision trees [20; 21]
and can be applied to regression and classification tasks. They leverage the benefits
of single decision trees (e.g. ability to handle different data types, interpretability,
simplicity) and avoid their limitations (e.g. high variance, prone to overfitting). RFs
aim to create multiple, equally accurate but still slightly different decision trees by
allowing a certain randomness during tree creation. Many of these trees will agree
on the correct label for most samples, while the remaining trees give wrong but
inconsistent answers. Consequently, a simple majority vote of all trees leads to the
correct answer. An in-depth discussion of RFs is beyond the scope of this chapter
but can be found e.g. in [17]. The following sections provide a brief introduction
to RFs but focus on how to enable them to learn from PolSAR images directly.

3.1.1. RF-based Feature Learning


Each tree within a RF has a single root node (i.e. no ingoing connections from
other nodes), multiple internal or split nodes (i.e. one in- and two outgoing
connections), and multiple terminal nodes or leaves (i.e. no outgoing connection). Tree
creation and training are based on a training set D = {(x, y)_i}_{i=1,...,N} of size N,
where x is a sample and y the corresponding target variable, e.g. a class label
(y ∈ ℕ). While standard RFs assume the samples to be real-valued feature
vectors, i.e. x ∈ R^n, it is also possible to model the samples as image patches of
size w [19]. In the case of k-channel PolSAR images this means that x ∈ C^{w×w×k×k}
(i.e. k = 2 for dual-, k = 3 for fully-polarized data). Each tree t samples the given
training data D creating its own training subset Dt ⊂ D (Bagging, [22]) which is
then propagated through the tree starting at the root node. Each non-terminal node
applies a binary test to each sample. Depending on the test outcome, the sample
is either shifted to the left or right child node. When certain stopping criteria are
met, this recursive splitting ends. Typical criteria are reaching the maximum tree
height, all samples in a node belong to one class, or a node received too few samples.
In this case a leaf node is created and the local class posterior is assigned to it.
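To make the patch-based sample representation concrete, the following Python/NumPy sketch shows one way a training set D of w × w patches of per-pixel covariance matrices and the per-tree bagging subsets D_t could be assembled. The array layout, the convention that label 0 marks unlabelled pixels, and all function names are illustrative assumptions of this sketch, not the reference implementation of [19].

```python
import numpy as np

def extract_patch_samples(cov_image, label_image, patch_size, n_samples, rng=None):
    """Assemble a training set D = {(x, y)_i}: x is a (patch_size x patch_size x k x k)
    block of local covariance matrices, y the class label of the centre pixel.
    Assumes an odd patch_size and that label 0 marks unlabelled pixels."""
    rng = np.random.default_rng() if rng is None else rng
    h, w_img = label_image.shape
    half = patch_size // 2
    samples, labels = [], []
    while len(samples) < n_samples:
        r = rng.integers(half, h - half)
        c = rng.integers(half, w_img - half)
        y = label_image[r, c]
        if y == 0:                      # skip unlabelled pixels
            continue
        x = cov_image[r - half:r + half + 1, c - half:c + half + 1]
        samples.append(x)
        labels.append(y)
    return np.stack(samples), np.array(labels)

def bootstrap_subset(samples, labels, rng=None):
    """Bagging: draw a per-tree subset D_t by sampling with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(labels), size=len(labels))
    return samples[idx], labels[idx]
```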
The node tests are most crucial for the performance of a RF. On the one hand,
they ensure a high diversity among the trees by sampling a suitable test from a set
of candidates. On the other hand, a single tree contains and applies thousands to
millions of such test functions which requires them to be memory and time efficient.
In the case of real-valued vectors, i.e. x ∈ R^n, such a test is usually defined
as “x_i < θ?”, where i is a randomly selected dimension of x. There are several
possible methods to define the split point θ (including random sampling); many of
them are reviewed and evaluated in [23]. Node tests of this form create piece-wise
linear and axis-aligned decision boundaries within the feature space. For images,
more sophisticated node tests have been proposed that analyse the local spatial
image structure [24]. These ideas can be extended to the characteristics of PolSAR
images [19] by defining an operator φ : C^{w̃×w̃×k×k} → C^{k×k} that is applied to one,
two, or four regions R_r ⊂ x (r = 1, ..., 4) of size w̃_r × w̃_r inside a patch x (where
w̃_r < w). Possible operators are the centre or average value of the region, or
the region element with minimal/maximal span within the region:




C_R = \phi(R) =
\begin{cases}
R(\tilde{w}/2,\, \tilde{w}/2) \\[4pt]
\dfrac{1}{\tilde{w}^2} \displaystyle\sum_{i=1}^{\tilde{w}} \sum_{j=1}^{\tilde{w}} R(i,j) \\[4pt]
R(i^*, j^*) \ \text{with} \ \operatorname{span} R(i^*, j^*) \le \displaystyle\min_{0<i,j<\tilde{w}} \operatorname{span} R(i,j) \\[4pt]
R(i^*, j^*) \ \text{with} \ \operatorname{span} R(i^*, j^*) \ge \displaystyle\max_{0<i,j<\tilde{w}} \operatorname{span} R(i,j)
\end{cases}
\qquad (1)

Operator, region size and position are randomly selected. The results of applying
the operator at each region are compared to each other by Equations 2-4, where C̃
is a reference covariance matrix randomly selected from the whole image:
1-point projection:  d(C_{R_1}, C̃) < θ        (2)
2-point projection:  d(C_{R_1}, C_{R_2}) < θ        (3)
4-point projection:  d(C_{R_1}, C_{R_2}) − d(C_{R_3}, C_{R_4}) < θ        (4)
These projections (illustrated in Figure 1) are able to analyse the local spectral
and textural content. They make use of a proper distance measure d(A, B) defined
over the corresponding data space, i.e. Hermitian matrices in the case of PolSAR
images, as for example the log-Euclidean distance d(A, B) = ||log(A) − log(B)||_F
(where || · ||_F is the Frobenius norm).
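As an illustration of these node-test ingredients, the following Python/NumPy sketch implements the span, the centre/average/min-span/max-span operators of Equation 1, the log-Euclidean distance, and the projections of Equations 2–4. It is a simplified reading of the approach, under the stated assumptions; all names and the small eigenvalue regularisation are choices of this sketch rather than details taken from [19].

```python
import numpy as np

def matrix_log(C):
    """Matrix logarithm of a Hermitian positive (semi-)definite matrix."""
    eigval, eigvec = np.linalg.eigh(C)
    eigval = np.maximum(eigval, 1e-12)          # guard against zero eigenvalues
    return (eigvec * np.log(eigval)) @ eigvec.conj().T

def log_euclidean(A, B):
    """d(A, B) = ||log(A) - log(B)||_F for Hermitian matrices."""
    return np.linalg.norm(matrix_log(A) - matrix_log(B))

def span(C):
    """Total backscattered power of a covariance matrix (sum of the diagonal)."""
    return np.real(np.trace(C))

def apply_operator(region, op):
    """phi: region of shape (h, w, k, k) -> single k x k covariance matrix."""
    if op == "centre":
        return region[region.shape[0] // 2, region.shape[1] // 2]
    flat = region.reshape(-1, *region.shape[2:])
    if op == "mean":
        return flat.mean(axis=0)
    spans = np.array([span(C) for C in flat])
    idx = spans.argmin() if op == "min_span" else spans.argmax()
    return flat[idx]

# Projections of Eqs. (2)-(4); the node test then checks "projection < theta?".
def one_point(C1, C_ref):        return log_euclidean(C1, C_ref)
def two_point(C1, C2):           return log_euclidean(C1, C2)
def four_point(C1, C2, C3, C4):  return log_euclidean(C1, C2) - log_euclidean(C3, C4)
```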
Every internal node creates multiple test candidates and selects the best test
according to a quality criterion which is usually based on the drop of impurity ΔI:
\Delta I = I(P(y \mid D_n)) - P_L \, I(P(y \mid D_{n_L})) - P_R \, I(P(y \mid D_{n_R})) \qquad (5)

I(P(y)) = 1 - \sum_{i=1}^{C} P(y_i)^2 \qquad (6)

where n_L, n_R are the left and right child nodes of node n with the respective data
subsets D_{n_L}, D_{n_R} (with D_{n_L} ∪ D_{n_R} = D_n and D_{n_L} ∩ D_{n_R} = ∅) and the
corresponding prior probabilities P_{L/R} = |D_{n_L/R}| / |D_n|. The Gini impurity (Equation 6)
of the corresponding local class posteriors P(y|D_n) of node n is a typical choice to
measure the node impurity and is estimated based on the local subset D_n of the
training set.
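A minimal sketch of how the Gini impurity (Equation 6) and the drop of impurity (Equation 5) could be evaluated for a candidate split is given below; the interface (integer label arrays and a boolean left/right mask) is an illustrative assumption of this sketch.

```python
import numpy as np

def gini(labels, classes):
    """Gini impurity I(P(y)) = 1 - sum_i P(y_i)^2 of the empirical class posterior."""
    if len(labels) == 0:
        return 0.0
    p = np.array([(labels == c).mean() for c in classes])
    return 1.0 - np.sum(p ** 2)

def drop_of_impurity(labels, go_left, classes):
    """Delta I of Eq. (5) for a candidate split described by the boolean mask go_left."""
    n = len(labels)
    left, right = labels[go_left], labels[~go_left]
    p_l, p_r = len(left) / n, len(right) / n
    return gini(labels, classes) - p_l * gini(left, classes) - p_r * gini(right, classes)
```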
After the RF is created and trained, it can be used for prediction, during which a
query sample is propagated through all trees. It will reach exactly one leaf n_t(x) in
every tree t. The estimate stored in these leaves, i.e. the class posterior P(y|n_t(x)),
is averaged to obtain the final class posterior P(y|x):

P(y \mid x) = \frac{1}{T} \sum_{t=1}^{T} P(y \mid n_t(x)) \qquad (7)
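Prediction according to Equation 7 then amounts to averaging the leaf posteriors of all trees, as in the following short sketch; the leaf_posterior method is a hypothetical interface standing in for the leaf lookup of a trained tree.

```python
import numpy as np

def forest_posterior(trees, x):
    """Eq. (7): average the class posteriors of the leaves reached by patch x.
    Each tree is assumed to expose leaf_posterior(x), returning P(y | n_t(x))."""
    posteriors = np.stack([t.leaf_posterior(x) for t in trees])
    return posteriors.mean(axis=0)

def predict_label(trees, x):
    return int(np.argmax(forest_posterior(trees, x)))
```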

The result of applying this method (using T = 30 trees of maximum height
H = 50 and the log-Euclidean distance) to the fully polarimetric image shown in
Figure 2a (acquired by the ESAR sensor (DLR) in L-band over Oberpfaffenhofen,
Germany) is shown in Figure 2c (the used reference data is shown in Figure 2b)
[19]. Table 1 shows the corresponding confusion matrix.

Fig. 1.: Different spatial projections within a node test function [19]: (a) 1-point projection; (b) 2-point projection; (c) 4-point projection.
If compared to the results obtained by extracting a large set of real-valued
features as input to the RF [13], a very similar accuracy is achieved, i.e. a slight
drop from 89.4% to 87.5%.

Fig. 2.: Oberpfaffenhofen data set [19]: (a) image data acquired by the ESAR sensor (DLR); (b) reference data; (c) result (Log-Euclidean).

BA = 87.5%   City    Forest   Field   Shrubl.   Road
City         0.87    0.06     0.00    0.06      0.01
Forest       0.02    0.96     0.00    0.02      0.00
Field        0.00    0.00     0.93    0.04      0.03
Shrubl.      0.00    0.02     0.08    0.90      0.00
Road         0.11    0.01     0.13    0.02      0.73

Table 1.: Confusion matrix (Log-Euclidean) [19].
3.1.2. Batch Processing for Random Forests


Machine learning approaches often represent the neighbourhood of a pixel by a set
of features. This transforms low-dimensional image matrices into high-dimensional
data cubes which often exceed the memory capacity of common computers. One
solution to this problem is to create a sufficiently small subset by sampling the
available data. This allows training any Machine Learning framework “offline”,
i.e. with access to all samples in this subset. If the used features are sufficiently
descriptive, this approach can be successful. However, more fine-grained modern
classification problems and methods with many internal parameters will not lead to
acceptable results if the training set size is too small. A second approach is therefore
to use methods capable of batch processing, i.e. methods that adjust their internal
parameters incrementally based on small data subsets.
While standard Random Forests assume access to all samples during training to
optimise the test selection, it is possible to change the training procedure to process
data in batches by decoupling tree creation and training [25]: The application of
RFs to a given problem is roughly divided into three phases: a) Tree creation: The
definition of the tree topology; b) Tree training: The definition of leaf predictors;
c) Tree prediction: Using a created and trained tree to estimate the target variable
for query samples.
Prediction is already an online process, since all query samples are handled
independently by the different trees in a RF. Tree training is an online process as
well if leaf predictors are based on simple statistics (such as class histograms) that
can be computed incrementally. This has the advantage that all available samples
can be used to train the leaf predictors.
The only phase that requires special attention is tree creation during which
statistics are computed over the whole sample set. While the projection of a sam-
ple into a scalar value as discussed above is usually independent of other samples,
the split point computation cannot be based on a single sample alone but is based
on statistics over the whole sample set. However, many of the required statistics
(e.g. mean value, standard deviation, minimum/maximum value) can be computed
incrementally. For approaches that are based on statistics that can not be incre-
mentally updated (e.g. the median value) it is sufficient to keep only the projected
values instead of the original samples. Every internal node accumulates the neces-
sary statistics until a certain number τ of samples has been observed after which
the split point is computed. The next step is the split candidate evaluation, i.e. the
computation of a quality criterion such as the drop of impurity. For classification
tasks it is based on the class posteriors of the current node and its child nodes
which can be computed incrementally. Again, the split node collects the required
statistics until the sample count reaches a given threshold and then selects the best
test. After that the node is completely defined and can pass all subsequent samples
to its child nodes.
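The following sketch illustrates this two-stage accumulation for a single split candidate. Real implementations evaluate many candidates per node and may use other split-point statistics, so the class below (projection function, threshold τ, class histograms) is only a simplified, assumed interface and not the implementation of [25].

```python
import numpy as np

class OnlineSplitNode:
    """Sketch of a split node defined incrementally from batches.

    Phase 1: accumulate a running mean of the projected values until tau samples
             have been seen, then fix the split point theta.
    Phase 2: accumulate left/right class histograms for the candidate test, from
             which a drop of impurity can later be computed.
    """
    def __init__(self, projection, n_classes, tau=500):
        self.projection = projection          # maps a patch x to a scalar value
        self.tau = tau
        self.count, self.mean = 0, 0.0
        self.theta = None
        self.hist = np.zeros((2, n_classes))  # class counts of left/right child

    def observe(self, x, y):
        v = self.projection(x)
        if self.theta is None:                # phase 1: estimate split point
            self.count += 1
            self.mean += (v - self.mean) / self.count
            if self.count >= self.tau:
                self.theta = self.mean
            return None
        side = 0 if v < self.theta else 1     # phase 2: evaluate the candidate
        self.hist[side, int(y)] += 1
        return side
```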
This allows loading only as many samples as the memory size permits while the
learning process still exploits all available data. Only the current batch of samples
and the local split statistics have to be kept in memory. The threshold τ should
be selected sufficiently large to allow an accurate estimation of the split statistics.
However, RFs do not rely on optimal node splits and not only tolerate but
even require a considerable amount of uncertainty, which allows keeping τ relatively
small. Furthermore, a too large τ would lead to well optimised splits but slow down
the tree growth.
This batch processing (with a RF of T = 50 trees of maximum height H = 50,
10 × 1 tiles and a batch size of B = 10,000) was applied to the fully-polarimetric
example image of TerraSAR-X (X-band) provided by DLR shown in Figure 3a (with
the corresponding reference data shown in Figure 3b). It was acquired over Plattling,
Germany, a large rural area with multiple small settlements, roads, agricultural
fields, forests, and water, and contains 10,310 × 11,698 pixels, which corresponds
to roughly 2.1 GB of memory.

Fig. 3.: Plattling data set [25]: (a) PolSAR image, TerraSAR-X, DLR; (b) reference data: Urban (red), Road (magenta), Forest (green), Field (yellow), Water (blue); (c) classification result (same color code as for reference data).

Figure 3c shows the estimated label map which has a balanced accuracy (i.e.
average class detection rate) of 75.1% (please see [25] for more details).
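The balanced accuracy reported here is simply the mean of the per-class detection rates, which can be computed from a confusion matrix as in the following short sketch (function name and input layout are illustrative):

```python
import numpy as np

def balanced_accuracy(confusion):
    """Average class detection rate: mean of the per-class recalls, i.e. the mean
    of the diagonal of the row-normalized confusion matrix (rows = true classes)."""
    confusion = np.asarray(confusion, dtype=float)
    recalls = np.diag(confusion) / confusion.sum(axis=1)
    return recalls.mean()
```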

3.1.3. Stacking of Random Forests


This section extends the ideas presented in the previous sections by applying an
additional Ensemble Learning method, namely Stacking (sometimes also called
blending, stacked generalisation [26], stacked regression [27], or super learning [28]).
Stacking consists of two phases: In a first step, multiple base learners (the so-called
Tier-1 models) are trained. This is similar to the training of the individual decision
trees within a RF, but here their individual output is not fused by simple averaging.
Instead, their output serves as input to another classifier (the so-called Tier-2
model) within the second stage. The Tier-2 model uses a more sophisticated fusion
rule by learning when to ignore which of the Tier-1 models and how to combine
their answers. This means that even errors of the Tier-1 models can be exploited
since consistent errors might provide descriptive information about the true class.

Fig. 4.: The stacking framework discussed here uses at level 0 a RF that is trained on the image data and the reference data only. Subsequent RFs use the estimated class posterior as an additional feature. This enables refining the class decision and leads to more accurate semantic maps [29].
The idea presented in the following slightly differs from the original formulation
of stacking in two major points [29]: 1) Only a single RF is trained as Tier-1 model,
i.e. the RF variant discussed above which can directly be applied to PolSAR data.
The estimated pixel-wise class posterior of this RF already contains a high level of
semantic information and is then used by a second RF as Tier-2 model together
with the original image data. To this aim, the original RF framework is extended
by including internal node tests that can be applied to class posteriors. 2) This
procedure is repeated multiple times: A RF at the i + 1-th level uses the original
image data and the posterior estimate of the RF at level i as input. This allows
improving the class posterior estimate by learning which decisions are consistent with
the reference labels and how to correct errors.
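A compact sketch of this stacking loop is given below; train_rf and predict_posterior are hypothetical helpers standing in for the patch-based RF of Section 3.1.1 extended with posterior inputs, so the code only illustrates the control flow, not the implementation of [29].

```python
def train_stacked_rfs(train_rf, image, reference, n_levels):
    """Stacking sketch: the RF at level 0 sees only the image, every further level
    additionally sees the posterior map estimated by its predecessor.

    train_rf(image, reference, posterior=None) is assumed to return a model with a
    predict_posterior(image, posterior=None) method producing a per-pixel class
    posterior map of shape (H, W, C); both names are illustrative.
    """
    models, posterior = [], None
    for level in range(n_levels):
        rf = train_rf(image, reference, posterior=posterior)
        posterior = rf.predict_posterior(image, posterior=posterior)
        models.append(rf)
    return models, posterior
```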
Figure 4 illustrates the basic principle of stacking as applied here. The first level
(i.e. level 0) consists of a RF that is trained as described in Section 3.1.1. It only
uses the image data, i.e. the polarimetric sample covariance matrices, as well as the
reference data. This RF is then used to predict the class posterior of each sample
of the training data which completes the first level.
A RF at level l (with 0 < l ≤ L) uses image and reference data, but also the
class posterior estimated at level l − 1. This allows refining the class estimates and
correcting errors made by the previous RF. One possible example is pixels showing
double bounce backscattering, which frequently happens within urban areas due to
the geometric structure of buildings. It also happens frequently within a forest
where it is caused by tree stems. A RF in the early stages will interpret double
March 13, 2020 9:24 ws-rv961x669 HBPRCV-6th Edn.–11573 chapter-rhaensch page 197

Machine Learning in Remote Sensing 197

bounce as an indication for urban areas and thus mislabel a forest pixel showing
this type of backscattering. A RF at a higher stage will learn that isolated double
bounce pixels labelled as urban area but surrounded by forest actually belong to
the forest class.
RFs that are supposed to analyse local class posteriors require node tests that
are designed for patches of probability distributions. In such a sample patch x each
pixel contains a probability distribution P(c) ∈ [0, 1]^{|C|} defining to which degree
this pixel belongs to a class c ∈ C. As described above in Section 3.1.1 every node
test randomly samples several regions within a patch and selects one of the pixels
based on an operator. Possible operators for probability regions are for example
selecting the centre value or the region element with minimal/maximal entropy.
These probability distributions are then compared by a proper distance measure d,
as for example the histogram intersection d_HI(P, Q) = Σ_{c∈C} min(P(c), Q(c)) [29].
While the node tests in Section 3.1.1 analyse local spectral and textural proper-
ties within the image space, the node tests discussed here analyse the local structure
of the label space. This integrates the final classification decision of the previous RF
with its certainty as well as their spatial distributions. This enables the framework
to analyse spectral, spatial, as well as semantic information.
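The following sketch illustrates node-test ingredients for such probability patches: entropy-based selection of a pixel within a region and the histogram intersection of two class posteriors. The region layout (h × w × C arrays) and the operator names are assumptions of this sketch.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete class posterior p."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def select_from_probability_region(region, op):
    """Operators on a region of per-pixel class posteriors with shape (h, w, C)."""
    if op == "centre":
        return region[region.shape[0] // 2, region.shape[1] // 2]
    flat = region.reshape(-1, region.shape[-1])
    ent = np.array([entropy(p) for p in flat])
    idx = ent.argmin() if op == "min_entropy" else ent.argmax()
    return flat[idx]

def histogram_intersection(p, q):
    """d_HI(P, Q) = sum_c min(P(c), Q(c)); large for similar distributions."""
    return np.sum(np.minimum(p, q))
```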
The following experiments are carried out on the Oberpfaffenhofen data set of
the previous section.
The RF at the 0-level has only access to the image (and reference) data and
achieved a balanced accuracy of 86.8% (the corresponding semantic map is shown
in Figure 5c). Despite the already quite high accuracy, there are several remaining
problems, for example fluctuating labels (e.g. within the central forest area) and
areas that are consistently misclassified. Figure 6 shows one of the problematic
areas in greater detail. The RF associates image edges with either city or road and
thus assigns wrong class labels to boundaries between fields, forest, and shrubland
(see first image in the first row of Figure 6).
The entropy and margin of the estimated class posterior (shown in the second
and third row of Figure 6) measure the degree of certainty of the classifier in its
decision. They range from completely uncertain (margin equals zero, entropy equals
one, both shown in blue) to completely certain (margin equals one, entropy equals
zero, both shown in dark red). While most parts of the forest and field class show
a high degree of certainty, in particular the wrongly labelled regions show a high
degree of uncertainty. The remaining rows of Figure 6 show the class posterior. The
columns of Figure 6 illustrate the learning through the individual stacking levels and
show that the largest changes occur within the first few levels. Every RF corrects
some of the remaining mistakes and gains certainty in decisions that are already
correct by using the semantic information provided by its predecessors as well as
the original image data. RFs at higher levels learn that edges within the image
only correspond to city or road if other criteria (e.g. certain context properties) are
fulfilled. The large field area at the top of the image which is largely confused as
field at the 0-level is now correctly labelled as field. Not all errors are corrected, e.g.
the large field area at the bottom of the image stays falsely classified as shrubland.
Nevertheless, the overall accuracy of the estimate of the last RF and thus the final
output improved significantly, as shown in Figure 5d.

Fig. 5.: Input data and results of the Oberpfaffenhofen data set [29]: (a) image data (E-SAR, DLR, L-band); (b) reference data: City (red), Road (blue), Forest (dark green), Shrubland (light green), Field (yellow), unlabelled pixels in white; (c) classification map obtained by the RF at Level 0; (d) classification map obtained by the RF at Level 9.
Figure 7 shows the development of margin (Figure 7a), entropy (Figure 7b), and
classification accuracy (Figure 7c) over different stacking levels. The classification
accuracy does monotonically increase over all stacking levels starting at 86.8% for
level 0 and ending at 90.7% at level 9. There is a significant change in accuracy at
the first levels which saturates quickly after roughly four levels. Interestingly, the
results differ for different classes: All classes (besides the street class which loses 1%
accuracy) benefit from stacking but not to the same extent. While, for example,
the accuracy of the field class appears to saturate already after level 2, the forest
class continues to improve slightly even at the last iteration.
Fig. 6.: Detail of the Oberpfaffenhofen data set. The columns illustrate level 0, 1, 2, 5,
and 9 of stacking. From top to bottom: Label map (same color code as in Figure 5b);
Entropy; Margin; Class posterior of City, Street, Forest, Shrubland, Field [29].

While the accuracy quickly saturates, the certainty of the RFs continues to
increase (shown in Figures 7a and 7b). While the changes are large at the early
stacking levels, higher levels only achieve marginal improvements.
Fig. 7.: Results over the different stacking levels on the Oberpfaffenhofen data set [29]: (a) Margin; (b) Entropy; (c) Accuracy.

3.2. Deep Convolutional Networks

Deep Learning methods that extract learned features are said to outperform tradi-
tional Machine Learning approaches with hand-crafted features. They have changed
paradigms in many areas, shifting research efforts from feature development to de-
signing deep architectures and creating databases. The latter is especially important
because deep learning works very well if (and often only if) large amounts of data
are available.
Recently, there has been a trend to provide satellite data for research for free
(e.g., [1]), which opens up new possibilities for methods requiring much data. In-
deed, deep learning methods are increasingly being used in Remote Sensing applica-
tions [30]. Traditional methods based on hand-crafted features, however, sometimes
still outperform deep learning approaches. An example is a recent classification chal-
lenge [31] where the four winning approaches (e.g. [32]) used ensemble techniques.
The problem of deep learning is usually not the lack of data, but the lack of labelled
data. More specifically, the lack of data of a particular sensor in conjunction with
the particular target variable to be learned.
As discussed at the beginning of this chapter, RS sensors cover a lot of very dif-
ferent modalities, which indicates that very large data sets are required for different
combinations of sensor type and target variable. More and larger labelled data sets
will surely emerge over time. However, the sheer amount of possible combinations
is clearly prohibitive.
In the absence of large amounts of labelled data, unsupervised or semi-supervised
methods may be beneficial: Instead of solving the primary problem directly, another
task is being learned for which more data is available. This proxy task is intended
to force the model to partially learn the primary task as well.
This paradigm can be successfully applied to the classification of PolSAR images
by using the proxy task of transcoding PolSAR images into optical multispectral
images for which large amounts of data is freely available. On the one hand, such
a transcoding provides a more intuitive visualization of SAR data. On the other
hand, the corresponding network must learn to recognize semantic entities in order
to synthesize the corresponding optical textures. This recognition is learned on the
basis of large amounts of data and therefore generalizes quite well. After learning
the proxy task, the layers of the network generating the multispectral output are
replaced with a new classifier that has only a few parameters and is trained on a
small data set. This new classifier is extremely powerful compared to methods that
have been trained from scratch, especially when the amount of training data is very
low.

Fig. 8.: Transcoding examples using either adversarial loss (left) or L2 regression loss (right). With the adversarial loss, structures not present in the SAR data are hallucinated, leading to the synthesis of texture [33].
The transcoding is based on pairs of Sentinel-1 and Sentinel-2 products with
large spatial overlaps, similar acquisition dates, and small cloud cover in the opti-
cal image. The polarimetric covariance matrices are represented as 5-dimensional,
real-valued vectors containing the logarithm of the diagonal elements and the off-
diagonal magnitude, as well as the normalized real and imaginary parts of the
off-diagonal argument [33] and are — after centering and scaling to unit variance
— used as input to the transcoder and classifier.
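A minimal sketch of this real-valued representation for a dual-polarimetric (2 × 2) covariance matrix is given below; the function name, the small constant added for numerical stability, and the example matrix are assumptions, and the subsequent per-feature centering and scaling to unit variance are not shown.

```python
import numpy as np

def covariance_to_real_vector(C, eps=1e-12):
    """Map a 2x2 complex polarimetric covariance matrix to a 5-dimensional
    real-valued vector: log of the two diagonal elements, log of the
    off-diagonal magnitude, and the normalized real and imaginary parts
    of the off-diagonal element."""
    c11 = np.real(C[0, 0])
    c22 = np.real(C[1, 1])
    c12 = C[0, 1]
    mag = np.abs(c12)
    return np.array([
        np.log(c11 + eps),
        np.log(c22 + eps),
        np.log(mag + eps),
        np.real(c12) / (mag + eps),   # cosine of the off-diagonal phase
        np.imag(c12) / (mag + eps),   # sine of the off-diagonal phase
    ])

# Example: the covariance matrix of a single pixel
C = np.array([[0.8 + 0.0j, 0.1 + 0.05j],
              [0.1 - 0.05j, 0.3 + 0.0j]])
v = covariance_to_real_vector(C)   # 5-dimensional real input feature
```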
The transcoding network is based on the U-shape of [34], but with fewer down-
sampling steps and more convolutions (the exact architecture can be found in [33]).
Transcoding from PolSAR data to multispectral images is an ill-posed problem
since the multispectral data contain information that is simply not present in a
SAR image. Figure 8 shows the results of a least squares regression (using an L2
loss). Structures are transcoded into their respective average colors if they can
be differentiated in the PolSAR data, but are lost if there is no direct translation,
causing many different types of land use to be mapped to similar colors.
If the network is supposed to differentiate between classes, it needs to repro-
duce not only the average colors but also the corresponding, class-specific textures.
This can be achieved by training the network as the generator of a conditional Gen-
erative Adversarial Network (conditional GAN), similar to [34], which does not aim
to reproduce the exact optical image from the SAR image but tries to produce a
plausible optical image. In this context, plausible means that it is not possible to
distinguish the transcoded optical image from a real image given the SAR data.
A second convolutional network, the discriminator, computes this adversarial
loss, i.e. whether the optical output appears real given the SAR data. The training
of generator and discriminator is interleaved (more details on the exact training
procedure can be found in [33]).
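The interleaved training described above can be sketched roughly as follows in PyTorch, with tiny stand-in networks instead of the U-shaped architectures of [33] and [34], and with only the adversarial term of the generator loss shown; every module, tensor shape, and hyperparameter here is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a transcoding generator (5-channel SAR -> 4-channel optical)
# and a conditional discriminator that sees SAR and optical together.
G = nn.Sequential(nn.Conv2d(5, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 4, 3, padding=1))
D = nn.Sequential(nn.Conv2d(5 + 4, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 1, 3, padding=1))

bce = nn.BCEWithLogitsLoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(sar, optical):
    """One interleaved update: first the discriminator, then the generator."""
    fake = G(sar)

    # Discriminator: real (SAR, optical) pairs -> 1, transcoded pairs -> 0
    d_real = D(torch.cat([sar, optical], dim=1))
    d_fake = D(torch.cat([sar, fake.detach()], dim=1))
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator: try to make the transcoded image look real (adversarial loss)
    d_fake = D(torch.cat([sar, fake], dim=1))
    loss_G = bce(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()

# Dummy batch: 5-channel SAR representation, 4-channel optical target
sar = torch.randn(2, 5, 64, 64)
optical = torch.randn(2, 4, 64, 64)
train_step(sar, optical)
```

In practice, conditional image-to-image GANs usually combine this adversarial term with a pixel-wise regression term; the sketch omits that for brevity.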
The described framework is applied to two scenes around Wroclaw and Poznan,
Poland, where the first scene is used for training and the second for evaluation.
Figure 9 shows crops from the training set from the optical image, the PolSAR
image, and the transcoding result. While the transcoding is not always accurate
in terms of colors (in particular for fields), the result does match the semantic
meaning of a region, i.e. the optical texture of a class is synthesized if this class
is truly present in the image. This shows that the transcoding network learned
useful features to detect and differentiate between different semantic classes. These
features can then be used in a subsequent classification network to achieve high
performance with only very few labelled training samples.
In the following, the prefix “FS” denotes methods trained from scratch for
classification, i.e. a simple U-shape ConvNet (FS U-net), a ConvNet with a similar ar-
chitecture as the generator (FS generator), and an RF as described in Section 3.1.1
(FS RF). The prefix “PT” refers to methods that are pre-trained on the transcoding
task, i.e. only fine-tuning the last three convolution layers of the transcoding net-
work (PT last layers) or retraining a smaller upsampling branch of the transcoding
network (PT upsampling).
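A hedged sketch of the “PT” idea, freezing a pre-trained feature extractor and training only a small head, is shown below; the stand-in modules, the number of classes, and the hyperparameters are assumptions and do not reproduce the exact layers fine-tuned in [33].

```python
import torch
import torch.nn as nn

# Stand-in for a transcoding backbone pre-trained on the proxy task.
backbone = nn.Sequential(nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

# Freeze the pre-trained feature extractor.
for p in backbone.parameters():
    p.requires_grad = False

# Small classification head with few parameters (5 land-cover classes assumed).
head = nn.Conv2d(32, 5, 1)
model = nn.Sequential(backbone, head)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(2, 5, 64, 64)            # 5-channel PolSAR representation
y = torch.randint(0, 5, (2, 64, 64))     # per-pixel class labels

logits = model(x)                        # shape (2, 5, 64, 64)
loss = criterion(logits, y)
loss.backward()                          # gradients only reach the head
optimizer.step()
```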
Similar to [35], samples of each class are spatially clustered into 16 clusters to
reduce the amount of training data in a more natural way than simple random
sampling. This allows choosing 1, 2, 4, 8, or all 16 clusters, which corresponds
roughly to 1/16, 1/8, 1/4, 1/2, or all of the training samples.
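The cluster-based subsampling can be sketched with scikit-learn by clustering the pixel coordinates of each class into 16 spatial clusters and keeping the samples of a chosen subset of clusters; the function and variable names are illustrative and the exact sampling protocol of [35] may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def spatially_clustered_subset(coords, labels, n_clusters=16, keep=4, seed=0):
    """coords: (N, 2) pixel coordinates, labels: (N,) class ids.
    Returns a boolean mask that keeps the samples of `keep` spatial clusters
    per class (roughly keep/n_clusters of the training data)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        assign = km.fit_predict(coords[idx])          # spatial cluster per sample
        chosen = rng.choice(n_clusters, size=keep, replace=False)
        mask[idx[np.isin(assign, chosen)]] = True
    return mask

# Example with random coordinates and labels
coords = np.random.rand(1000, 2) * 100
labels = np.random.randint(0, 5, 1000)
subset = spatially_clustered_subset(coords, labels, keep=4)   # ~1/4 of the data
```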
Qualitative and quantitative results on the test data set, i.e. a completely sepa-
rate image pair that was neither used for training nor for manual tuning, are shown
in Figures 10 and 11. All methods perform reasonably well if a sufficient number
of training samples is available. The pre-trained networks using features from the
transcoding task outperform the methods trained from scratch. In particular, when
the amount of training data is small the methods trained from scratch show a large
performance decrease, while the pre-trained methods remain surprisingly stable at
about 75% accuracy. Interestingly, the RF (as a shallow feature learner) is on par
with the deep ConvNets if they are not pre-trained, and even slightly superior if
only a few training samples are available.

Fig. 9.: Left: PolSAR image from Sentinel-1. Middle: Transcoded optical image using
only the SAR image as input. Right: Actual optical image from Sentinel-2. Not all
bands/channels are shown for the sake of brevity [33].

[Fig. 10 columns: SAR (classifier input), Optical (for reference), [FS] generator 1/16, [PT] last layers 1/16, [FS] generator Full, [PT] last layers Full. Class legend: Urban, Forest, Field, Water, Highway.]

Fig. 10.: Qualitative classification results for several parts of the test data. Only “FS
generator” and “PT last layers” are shown. The two right columns show the effect of
different amounts of training data (1/16th and full amount) [33].

Fig. 11.: Classification accuracies achieved with different amounts of training data [33].

4. Conclusion

This chapter discussed the crucial role Remote Sensing plays for Earth Observation
and consequently for human welfare. City planning, forest monitoring, and nat-
ural hazard management are only a few possible applications which benefit from
high quality Remote Sensing data and a corresponding automatic analysis. Automatic
procedures to interpret (remotely sensed) images are often based on Machine
Learning, which, particularly in recent years, has shown tremendous success, achieving
unprecedented accuracy and robustness.
One of the most challenging and interesting Remote Sensing data sources is
Synthetic Aperture Radar (SAR), which served as an example to discuss modern
Machine Learning approaches in this chapter. In particular, Random Forests (RFs)
were presented as one of the few shallow learning methods that are able to learn a
direct mapping from the data to the desired output variable. It was shown how
the general concept of RFs can be extended to process large amounts of data, as
well as how to integrate it into even more elaborate frameworks. As a second
contemporary Machine Learning example, Generative Adversarial Networks (GANs)
were discussed in the context of feature learning based on the proxy task of image-
to-image transcoding. The resulting optical-like SAR image representation might
have an intrinsic value, e.g. for visualization purposes, but, more importantly, the
learned features can be used to ease a subsequent classification task.
The presented results show that automatic feature learning leads to superior
performance and can relax the requirement of ConvNets for a large training set.
It was also shown that, in particular for small training set sizes, a shallow learner
such as an RF can still compete with Deep Learning approaches if trained and applied
properly.
The future of Machine Learning in Remote Sensing will make even stronger
use of larger data sets of freely available multi-modal, multi-temporal data. It will
connect new learning strategies, such as the integration of physical models,
with new target variables, leading to many high-level data products that will further
strengthen our understanding of geological, biological, and physical processes in our
environment.

References

1. European Space Agency (ESA). Copernicus Open Access Hub. scihub.copernicus.eu, (2014–2018).


2. C. Tison, J. M. Nicolas, F. Tupin, and H. Maitre, A new statistical model for marko-
vian classification of urban areas in high-resolution sar images, IEEE Transactions on
Geoscience and Remote Sensing. 42(10), 2046–2057, (2004).
3. V. A. Krylov, G. Moser, S. B. Serpico, and J. Zerubia, Supervised high-resolution dual-
polarization SAR image classification by finite mixtures and copulas, IEEE Journal
of Selected Topics in Signal Processing. 5(3), 554–566, (2011).
4. J. M. Nicolas and F. Tupin. Statistical models for SAR amplitude data: A unified
vision through Mellin transform and Meijer functions. In 2016 24th European Signal
Processing Conference (EUSIPCO), pp. 518–522, Budapest, Hungary, (2016).
5. P. Mantero, G. Moser, and S. B. Serpico, Partially supervised classification of remote
sensing images through svm-based probability density estimation, IEEE Transactions
on Geoscience and Remote Sensing. 43(3), 559–570, (2005).
6. L. Bruzzone, M. Marconcini, U. Wegmuller, and A. Wiesmann, An advanced system
for the automatic classification of multitemporal sar images, IEEE Transactions on
Geoscience and Remote Sensing. 42(6), 1321–1334, (2004).
7. R. Hänsch and O. Hellwich. Random forests for building detection in polarimetric sar
data. In 2010 IEEE International Geoscience and Remote Sensing Symposium, pp.
460–463, (2010).
8. R. Hänsch, Complex-valued multi-layer perceptrons - an application to polarimetric
sar data, Photogrammetric Engineering & Remote Sensing. 9, 1081–1088, (2010).
9. G. Moser and S. B. Serpico. Kernel-based classification in complex-valued feature
spaces for polarimetric sar data. In 2014 IEEE Geoscience and Remote Sensing Sym-
posium, pp. 1257–1260, (2014).
10. G. Licciardi, R. G. Avezzano, F. D. Frate, G. Schiavon, and J. Chanussot, A novel ap-
proach to polarimetric SAR data processing based on nonlinear PCA, Pattern Recog-
nition. 47(5), 1953 – 1967, (2014).
11. M. Tao, F. Zhou, Y. Liu, and Z. Zhang, Tensorial independent component analysis-
based feature extraction for polarimetric SAR data classification, IEEE Transactions
on Geoscience and Remote Sensing. 53(5), 2481–2495, (2015).
12. C. He, T. Zhuo, D. Ou, M. Liu, and M. Liao, Nonlinear compressed sensing-based
LDA topic model for polarimetric SAR image classification, IEEE Journal of Selected
Topics in Applied Earth Observations and Remote Sensing. 7(3), 972–982, (2014).
13. R. Hänsch. Generic object categorization in PolSAR images - and beyond. PhD thesis,
TU Berlin, (2014).
14. Y. Zhou, H. Wang, F. Xu, and Y. Q. Jin, Polarimetric SAR image classification using
deep convolutional neural networks, IEEE Geoscience and Remote Sensing Letters.
13(12), 1935–1939, (2016).
15. R. Hänsch and O. Hellwich. Complex-valued convolutional neural networks for object
detection in polsar data. In 8th European Conference on Synthetic Aperture Radar,
pp. 1–4, Aachen, Germany, (2010).
16. Z. Zhang, H. Wang, F. Xu, and Y. Q. Jin, Complex-valued convolutional neural net-
work and its application in polarimetric SAR image classification, IEEE Transactions
on Geoscience and Remote Sensing. PP(99), 1–12, (2017).
17. A. Criminisi and J. Shotton, Decision Forests for Computer Vision and Medical Image
Analysis. (Springer Publishing Company, Incorporated, 2013).
18. B. Fröhlich, E. Rodner, and J. Denzler. Semantic segmentation with millions of fea-
tures: Integrating multiple cues in a combined Random Forest approach. In 11th Asian
Conference on Computer Vision, pp. 218–231, Daejeon, Korea, (2012).
19. R. Hänsch and O. Hellwich, Skipping the real world: Classification of polsar images
without explicit feature extraction, ISPRS Journal of Photogrammetry and Remote
Sensing. 140, 122–132, (2017). ISSN 0924-2716.
20. T. K. Ho, The random subspace method for constructing decision forests, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence. 20(8), 832–844, (1998).
21. L. Breiman, Random forests, Machine Learning. 45(1), 5–32, (2001).
22. L. Breiman, Bagging predictors, Machine Learning. 24(2), 123–140, (1996).
23. R. Hänsch and O. Hellwich. Evaluation of tree creation methods within random forests
for classification of polsar images. In 2015 IEEE International Geoscience and Remote
Sensing Symposium (IGARSS), pp. 361–364, (2015).


24. V. Lepetit and P. Fua, Keypoint recognition using randomized trees, IEEE Trans.
Pattern Anal. Mach. Intell. 28(9), 1465–1479 (Sept., 2006).
25. R. Hänsch and O. Hellwich. Online random forests for large-scale land-use clas-
sification from polarimetric SAR images. In IGARSS 2019 - 2019 IEEE Interna-
tional Geoscience and Remote Sensing Symposium, pp. 5808–5811 (July, 2019). doi:
10.1109/IGARSS.2019.8898021.
26. D. H. Wolpert, Stacked generalization, Neural Networks. 5, 241–259, (1992).
27. L. Breiman, Stacked regressions, Machine Learning. 24(1), 49–64, (1996).
28. M. J. van der Laan, E. C. Polley, and A. E. Hubbard, Super learner, Statistical
Applications in Genetics and Molecular Biology. 6(1), (2007).
29. R. Hänsch and O. Hellwich, Classification of polsar images by stacked random forests,
ISPRS International Journal of Geo-Information. 7(2), (2018). ISSN 2220-9964. doi:
10.3390/ijgi7020074. URL https://www.mdpi.com/2220-9964/7/2/74.
30. Y. Zhou, H. Wang, F. Xu, and Y. Q. Jin, Polarimetric SAR image classification using
deep convolutional neural networks, IEEE Geoscience and Remote Sensing Letters.
13(12), 1935–1939, (2016).
31. D. Tuia, G. Moser, B. Le Saux, B. Bechtel, and L. See, 2017 IEEE GRSS Data Fusion
Contest: open data for global multimodal land use classification, IEEE Geoscience
and Remote Sensing Magazine. 5(1), 70–73, (2017).
32. N. Yokoya, P. Ghamisi, and J. Xia, Multimodal, multitemporal, and multisource
global data fusion for local climate zones classification based on ensemble learning,
IEEE International Geoscience and Remote Sensing Symposium (IGARSS). (2017).
33. A. Ley, O. Dhondt, S. Valade, R. Hänsch, and O. Hellwich. Exploiting gan-based sar
to optical image transcoding for improved classification via deep learning. In EUSAR
2018; 12th European Conference on Synthetic Aperture Radar, pp. 396–401. VDE (06,
2018). ISBN 978-3-8007-4636-1.
34. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with condi-
tional adversarial networks, arxiv. (2016).
35. R. Hänsch, A. Ley, and O. Hellwich. Correct and still wrong: The relationship be-
tween sampling strategies and the estimation of the generalization error. In 2017 IEEE
International Geoscience and Remote Sensing Symposium (IGARSS), pp. 3672–3675
(July, 2017). doi: 10.1109/IGARSS.2017.8127795.
CHAPTER 2.2

HYPERSPECTRAL AND SPATIALLY ADAPTIVE


UNMIXING FOR AN ANALYTICAL
RECONSTRUCTION OF FRACTION SURFACES FROM
DATA WITH CORRUPTED PIXELS

Fadi Kizel¹ and Jon Atli Benediktsson²

¹ Technion-Israel Institute of Technology, Department of Mapping and Geoinformation Engineering, Haifa 3200, Israel
fadikizel@technion.ac.il

² University of Iceland, Faculty of Electrical and Computer Engineering, 107 Reykjavik, Iceland
benedikt@hi.is

Spectral unmixing is a key tool for a reliable quantitative analysis of remotely


sensed data. The process is used to extract subpixel information by estimating the
fractional abundances that correspond to pure signatures, known as endmembers
(EMs). In standard techniques, the unmixing problem is solved for each pixel
individually, relying only on spectral information. Recent studies show that
incorporating the image's spatial information enhances the accuracy of the
unmixing results. In this chapter, we present a new methodology for the
reconstruction of the fraction abundances from spectral images with a high
percentage of corrupted pixels. This is achieved based on a modification of the
spectral unmixing method called Gaussian-based spatially adaptive unmixing
(GBSAU). In addition, we present a brief review of the existing spatially
adaptive methods.

1. Introduction

Analysis of the spectral mixture is important for a reliable interpretation of spectral


image data. The abundant amount of information provided by spectral images
allows for distinguishing between different land cover types. However, due to the
typical low spatial resolution in remotely sensed data, many pixels in the image
represent a mixture of several materials within the area of the pixels. Therefore,
subpixel information is needed in different applications. The extraction of subpixel


information requires a process of spectral unmixing in which a vector of abundance


fractions that corresponds to a set of EMs is estimated for each pixel in the image.
Numerous unmixing algorithms have been developed for this purpose [1], [2].
According to the traditional approach, the abundance fractions are estimated for
each pixel individually, disregarding the fact that EM fractions obey a certain spatial
logic that corresponds to the spatial distribution of the image objects. An
estimation of the EM fractions is usually achieved through an inversion process by
optimizing a cost function based on a specific linear [3], [4] or nonlinear [5]–[8]
mixture model. The fractions obtained must fulfill the abundance non-negativity
constraint (ANC) and the abundance sum-to-one constraint (ASC) [9]. The
basic formulation of a mixture model with these two constraints represents the
relations between the measured reflectance for a given pixel, the EMs, and their
corresponding fractions. For example, given a hyperspectral image with $M$ bands
and the matrix of $L$ EMs, $\mathbf{E} \in \mathbb{R}^{M \times L}$, then according to the linear mixture model
(LMM), a mixed pixel signature, $\mathbf{m} = [m_1, \ldots, m_M]^T$, is obtained as a linear
combination of all existing EMs as given by

$$\mathbf{m} = \mathbf{E}\mathbf{f} + \mathbf{n} \qquad (1)$$

where $\mathbf{f} \in \mathbb{R}^{L \times 1}$ is a vector containing the actual fractions of the EMs, and
$\mathbf{n} \in \mathbb{R}^{M \times 1}$ is assumed to be a zero-mean Gaussian vector representing the system
noise. The constrained form of the LMM also comprises the ANC, $f_i \geq 0$ for
$i = 1, \ldots, L$, and the ASC (sum-to-one), $\mathbf{f}^T\mathbf{1} = 1$, where $\mathbf{1} \in \mathbb{R}^{L \times 1}$
is a vector of ones. The basic
formulation of the mixture model is modified in some cases by adding a term that
represents the sparsity of the fractions vector. The consideration of the obtained
solution sparsity improves the accuracy of the estimated fraction, especially in
cases where many EMs are used to unmix a given spectral image. The selection of
relevant EM sets is a crucial issue in achieving successful unmixing, and the
overall process usually combines two main parts: (a) EMs selection and (b)
fraction estimation. Among several other parameters, different unmixing
approaches vary according to the way and stage the EMs are selected. A general
categorization of the existing methods classifies the different approaches into three
main types:
1) Supervised unmixing, where the EM set is selected in a separate preprocessing
step, and different techniques [4], [10], both manual [11] and automatic [12]–
[24], have been applied to address this problem.

2) Semi-supervised, i.e., sparse unmixing [25]–[27]. According to this


approach, a large library of potential EMs is used, and the optimal fractions
are estimated based on sparse regression (SR) [28].

3) Unsupervised methods that are mainly based on blind source separation


(BSS) [12], [23]–[28], and [35]. In this type of approach, both the EMs and
the corresponding fractions are estimated simultaneously in the same
optimization problem, for instance, non-negative matrix factorization
(NMF) [36].
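To make the constrained model of Eq. (1) concrete, the following Python sketch estimates the fraction vector of a single pixel by minimizing the reconstruction error under the ANC and ASC with a general-purpose solver; it is only an illustration of the constrained LMM, not an implementation of any of the specific algorithms cited above, and all names and values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def unmix_pixel(m, E):
    """Fully constrained least-squares unmixing of one pixel.
    m: (M,) measured spectrum, E: (M, L) endmember matrix.
    Returns the fraction vector f with f_i >= 0 and sum(f) = 1."""
    L = E.shape[1]
    f0 = np.full(L, 1.0 / L)                            # start from uniform fractions
    objective = lambda f: np.sum((m - E @ f) ** 2)      # squared reconstruction error
    constraints = [{"type": "eq", "fun": lambda f: np.sum(f) - 1.0}]  # ASC
    bounds = [(0.0, 1.0)] * L                                          # ANC
    res = minimize(objective, f0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x

# Toy example: 3 endmembers, 10 bands, pixel mixed 50/30/20 plus noise
rng = np.random.default_rng(0)
E = rng.random((10, 3))
f_true = np.array([0.5, 0.3, 0.2])
m = E @ f_true + 0.01 * rng.standard_normal(10)
f_hat = unmix_pixel(m, E)
```

Dedicated unmixing solvers, such as the sparse-regression methods cited above, address the same per-pixel problem far more efficiently when entire images must be processed.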

Despite a large number of existing methods, most of the techniques were


developed using the basic LMM or the sparse-LMM models and based only on the
spectral information. Several recent works have shown that the incorporation of
spatial information within the unmixing process significantly improves the
accuracy of the obtained results of both EM selection and fraction estimation. A
variety of strategies was adopted to exploit the image’s spatial information.
Usually, the unmixing formulation is modified to combine a spectral term and a
spatial regularization term that represents the spatial relations between adjacent
pixels. A commonly used term for this purpose is total variation (TV), which
assists in incorporating local spatial information into the unmixing problem.
In some cases, non-local means are used to introduce more spatial information
to the unmixing [37], [38]. In both cases, an improvement of the unmixing results
is achieved, but in practice, only a limited part of the inherent spatial information
is exploited to promote a piecewise smooth transition of the estimated fractions. A
general unmixing model that includes both spatial and sparse regularization terms
has also been applied in some cases. While sparse regularization yields a better
solution with a small number of EMs, the spatial regularization leads to a spatially
smooth solution by constraining the similarity between the fraction values in
neighboring pixels. Moreover, spatial regularization improves the convergence of
the process to an optimal solution by shrinking the feasible fractions space.
According to the taxonomy presented in [39], spatially adaptive unmixing
methods can be classified into five main groups. First, the different methods are
sorted according to three main categories: 1) methods for EM extraction, 2)
methods for fraction estimation, and 3) methods for simultaneous estimation of
both the EMs and their fractions, usually through a BSS process. Then, each
method is classified into the relevant group according to three main parameters:
a) the type of output (EMs and/or fractions), b) the type of EMs used/extracted, i.e.,
from an image or a library, and c) whether the proposed model includes a term for sparse
regularization. Following these definitions, a general overview of the existing
spatially adaptive unmixing approaches yields the following five main groups of
methods:

• Group A. Methods within this group were designed for automatic


extraction of EMs while incorporating the image’s spatial information into
the extraction process. Three main strategies were adopted within this
group: (a) Determination of pixel purity using mathematical morphology
([40], [41]), (b) applying a spatial-spectral preprocessing to modify the
original data before the EM extraction process [42]–[44] and (c) deriving
a local subset of EMs for each spatial tile taking into account the potential
spatial-spectral variability, e.g., [45].

• Group B. Supervised and spatially adaptive inversion methods in this


group were developed to enhance the estimation of the fractions using a
predefined set of EMs. For example, in [46], different adaptive subsets of
EMs are used for different regions. Nevertheless, in other methods, the
spatial adaptation is usually achieved by adding a spatial regularization
term to the overall objective function to be optimized. A variety of spatial
metrics were used to formulate the spatial regularization term, e.g., local
smoothness [47], spatial correlation [48], [49], adaptive Markov random
field (MRF) [50], and total variation (TV) [51].

• Group C. Semi-supervised spatially adaptive methods are applied for


enhanced fraction estimation when a library with a large number of EMs
is used. In addition to the commonly used spectral and sparsity terms,
methods in this fashion combine within the mixture model, a spatial
regularization term, e.g., TV [52] and non-local Euclidean medians [38],
[53].

• Group D. Spatially adaptive BSS methods were developed to improve the


simultaneous estimation of both the EMs and their corresponding
fractions. Two methods of this kind are, for example, spatial piece-wise
convex multiple model endmember detection (Spatial P-COMMEND) [54]
and spatial complexity BSS (SCBSS) [55].

• Group E. Like methods in the previous group, methods in this group are
intended to estimate both the EMs and their corresponding fractions
simultaneously. However, the sparsity of the fractions is also considered
and represented in the objective function [56]–[59].

For more information about the taxonomy of the existing spatially adaptive
methods, please see an overview in [39].

In summary, a variety of spatial regularization strategies has been adopted recently


for spectral unmixing methods. In most of the cases, only local or a small amount
of the image’s spatial information is used. Despite the enhancement of the obtained
results due to the use of spatial regularization, two main drawbacks still exist
within the majority of the new spatially adaptive unmixing methods:

• The entire set of EMs is used to unmix all the pixels in the image without
any explicit decision regarding the probable presence of only a subset of
the EMs in a pixel. Non-existing EMs in a specific pixel are considered in
the optimization process and result in erroneous and overestimated
positive fraction values.

• A spatial relations model between the fractions is rationally assumed;


however, there is no posterior estimated model that describes the
continuous spatial fraction distribution.

Different from other spatially adaptive methods, the previous two drawbacks are
addressed in the GBSAU method for further enhancement of the obtained fraction
accuracy. Instead of the commonly used (local) spatial regularization, the entire
spatial information of the image is incorporated within a supervised unmixing
process, and the spatial distribution of each EM fraction is represented as a 2D
analytical surface. In the rest of this chapter, we present the basic concepts of the
GBSAU method and show how we can use these concepts for the reconstruction
of fraction surfaces for images with low SNR and a high percentage of corrupted
pixels in the image.

2. Spatially adaptive hyperspectral unmixing based on analytical


2D surfaces

2.1. Sum of anisotropic 2-D Gaussians for analytical reconstruction of


fraction surfaces

The process in GBSAU is based on the realistic assumption that the fractions of an
EM are spatially distributed around a finite number of cores (Fig. 2), and their
values are assumed to decrease (or remain at most constant in some cases) as the
pixel's distance from the closest core increases. In particular, the spatial decay of
fraction values from a spectral core outward is assumed to be Gaussian. Using
these assumptions, the overall process in GBSAU combines two main steps as
follows:

1) Extraction of potential spectral cores through a process for the detection


of regional maxima points on the EM spectral similarity surfaces.

2) Reconstruction of an analytical surface, for each EM, as a sum of


anisotropic 2D spatial Gaussians that represents the fraction values of the
EM over the image.

The first step starts with generating a spectral similarity surface for each EM by
calculating the spectral angle mapper (SAM) [60], [61] between the EM spectra and
the spectral signature at each pixel in the image. Given the spectral signature at an
arbitrary location in the image, i.e. $\mathbf{m}(x, y)$, the corresponding value of the spectral
similarity surface of the $i$th EM at the $(x, y)$ pixel is given by

$$D(x, y, i) = 1 - \mathrm{SAM}\big(\mathbf{m}(x, y), \mathbf{E}_i\big). \qquad (2)$$
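As a brief illustration of Eq. (2), a similarity surface can be computed for one EM over an entire image with a few NumPy operations; the array layout (rows × columns × bands), the function name, and the use of the spectral angle in radians are assumptions for this sketch.

```python
import numpy as np

def sam_similarity_surface(image, endmember, eps=1e-12):
    """image: (rows, cols, M) hyperspectral cube, endmember: (M,) spectrum.
    Returns D = 1 - SAM(m(x, y), E_i) for every pixel, where SAM is the
    spectral angle between the pixel spectrum and the endmember."""
    dot = image @ endmember                                   # (rows, cols)
    norms = np.linalg.norm(image, axis=2) * np.linalg.norm(endmember)
    cos_angle = np.clip(dot / (norms + eps), -1.0, 1.0)
    sam = np.arccos(cos_angle)                                # spectral angle
    return 1.0 - sam

# Example on a random 50x50 image with 40 bands
image = np.random.rand(50, 50, 40)
E_i = np.random.rand(40)
D_i = sam_similarity_surface(image, E_i)
```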

Once we have created the spectral similarity surfaces for all the EMs, a multi-layer
surface is created by stacking all the generated similarity surfaces along the z axis.
Then, the two processes for the extraction of single-layer and multi-layer regional
maxima are applied. Single-layer regional maxima have a maximal value relative
to their spatial 2D surrounding neighborhood within the spectral similarity surface
of a given EM. In contrast, multi-layer regional maxima have a maximal value
relative to their 3D surrounding neighborhood, which includes the corresponding
2D neighborhoods in all other layers in the multi-layer surface. In addition to many
real spectral cores, the detection of points that do not represent a real spectral core
is also probable. The elimination of these unreal cores is considered during the
process in the next step. Please see the work in [39] for full details regarding these
processes.
In the second step, an optimization process is applied to reconstruct the
fraction surface of each EM by fitting an analytical surface represented as a sum
of anisotropic 2D Gaussians. Accordingly, given that $h_i$ Gaussians represent the
fractions' surface of the $i$th EM, the fraction value $f_i$ at a given location $(x, y)$ can
be written as:

$$f_i(x, y) = \sum_{j=1}^{h_i} G_i^j(x, y), \qquad (3)$$

where $G_i^j$ is the $j$th out of the $h_i$ Gaussians. A single anisotropic 2D Gaussian
with an offset of $a_0$, a magnitude of $a_1$, and axial standard deviations $\sigma_x, \sigma_y$,
centered at $(x_0, y_0)$ and rotated by an angle $\theta$ can be formulated as

$$G(x, y) = a_0 + a_1 e^{-\frac{u}{2}}, \qquad (4)$$

where

$$u = \left(\frac{x'}{\sigma_x}\right)^2 + \left(\frac{y'}{\sigma_y}\right)^2,$$

$$x' = (x - x_0)\sin\theta + (y - y_0)\cos\theta$$

and

$$y' = (x - x_0)\cos\theta - (y - y_0)\sin\theta.$$
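The Gaussians of Eqs. (3) and (4) can be evaluated as in the following sketch, where each Gaussian is described by the seven parameters $(a_0, a_1, \sigma_x, \sigma_y, \theta, x_0, y_0)$; the grid construction and the parameter values are illustrative only.

```python
import numpy as np

def anisotropic_gaussian(X, Y, a0, a1, sx, sy, theta, x0, y0):
    """Evaluate a single rotated anisotropic 2D Gaussian, Eq. (4)."""
    xr = (X - x0) * np.sin(theta) + (Y - y0) * np.cos(theta)
    yr = (X - x0) * np.cos(theta) - (Y - y0) * np.sin(theta)
    u = (xr / sx) ** 2 + (yr / sy) ** 2
    return a0 + a1 * np.exp(-u / 2.0)

def fraction_surface(X, Y, params):
    """Sum of Gaussians, Eq. (3); params is a list of 7-tuples."""
    return sum(anisotropic_gaussian(X, Y, *p) for p in params)

# Two Gaussians on a 50x50 grid
Y, X = np.mgrid[0:50, 0:50]
params = [(0.0, 0.8, 5.0, 2.0, np.pi / 6, 20.0, 15.0),
          (0.0, 0.5, 3.0, 3.0, 0.0, 35.0, 40.0)]
f_i = fraction_surface(X, Y, params)
```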

The beneficial use of the sum of spatial Gaussians for the approximation of a
spatial surface has been shown in [62]–[67]. The objective in the second step is to
adjust the parameters of the Gaussians while reconstructing the fraction surfaces
of all the EMs. Each Gaussian has seven parameters, $a_0, a_1, \sigma_x, \sigma_y, \theta, x_0, y_0$, and
all the Gaussians are adjusted simultaneously. The process is initialized by locating
a narrow 2D Gaussian at each spectral core extracted in the previous step. Then,
a gradient descent (GD) optimization process is applied to adjust the parameters of


the Gaussians while maximizing an objective function that represents an overall
spectral similarity term as follows:
$$\Psi = \sum_{x=1}^{c} \sum_{y=1}^{r} \delta(x, y) \qquad (5)$$

where $c$ and $r$ are the number of columns and rows of the image, respectively,
and $\delta(x, y)$ [60] represents the local spectral similarity between the source spectral
signature at the point $(x, y)$ and the reconstructed signature (by the endmember set
$\mathbf{E}$ and the vector of estimated fractions $\hat{\mathbf{f}}(x, y)$). Then, $\delta$ is defined as follows:

$$\delta(x, y) = \frac{\mathbf{m}(x, y)^T \mathbf{E}\hat{\mathbf{f}}(x, y)}{\left\|\mathbf{m}(x, y)\right\| \cdot \left\|\mathbf{E}\hat{\mathbf{f}}(x, y)\right\|} \qquad (6)$$

where $\hat{\mathbf{f}}(x, y) = [f_1(x, y), f_2(x, y), \ldots, f_L(x, y)]^T$. The unknowns to be estimated by
the optimization process are the parameters of the Gaussians for all EMs. Let
$\hat{\mathbf{P}}_i^j = \left[a_{0i}^j, a_{1i}^j, \sigma_{xi}^j, \sigma_{yi}^j, \theta_i^j, x_{0i}^j, y_{0i}^j\right]^T$ denote the vector of estimated parameters of
the $j$th Gaussian of the $i$th EM; the vector of all the estimated unknowns is then
given by

$$\hat{\mathbf{P}} = \left[\hat{\mathbf{P}}_1^1, \hat{\mathbf{P}}_1^2, \ldots, \hat{\mathbf{P}}_1^{h_1}, \hat{\mathbf{P}}_2^1, \hat{\mathbf{P}}_2^2, \ldots, \hat{\mathbf{P}}_2^{h_2}, \ldots, \hat{\mathbf{P}}_L^1, \hat{\mathbf{P}}_L^2, \ldots, \hat{\mathbf{P}}_L^{h_L}\right]^T. \qquad (7)$$

The estimation of $\hat{\mathbf{P}}$ is achieved through an iterative GD optimization process. The
progress from the current iteration is accordingly given by

$$\hat{\mathbf{P}}^{k+1} = \hat{\mathbf{P}}^k + \eta \, \nabla_{\hat{\mathbf{P}}^k} \qquad (8)$$

where $k$ is the iteration index, $\eta$ is the step size, and $\nabla_{\hat{\mathbf{P}}} \equiv \frac{\partial \Psi}{\partial \hat{\mathbf{P}}}$ is the objective
function gradient at $\hat{\mathbf{P}}$. Given that the number of EMs is $L$ and that the number
of Gaussians that represent the fractions' surface of the $i$th EM is $h_i$, the overall
gradient is given by

$$\nabla_{\hat{\mathbf{P}}} = \left[\nabla_{\hat{\mathbf{P}}_1^1}, \nabla_{\hat{\mathbf{P}}_1^2}, \ldots, \nabla_{\hat{\mathbf{P}}_1^{h_1}}, \nabla_{\hat{\mathbf{P}}_2^1}, \nabla_{\hat{\mathbf{P}}_2^2}, \ldots, \nabla_{\hat{\mathbf{P}}_2^{h_2}}, \ldots, \nabla_{\hat{\mathbf{P}}_L^1}, \nabla_{\hat{\mathbf{P}}_L^2}, \ldots, \nabla_{\hat{\mathbf{P}}_L^{h_L}}\right]^T, \qquad (9)$$

where $\nabla_{\hat{\mathbf{P}}_i^j}$ is the vector of derivatives of the objective function $\Psi$ with respect to
the parameters of the $j$th Gaussian of the $i$th EM and is given by

$$\nabla_{\hat{\mathbf{P}}_i^j} = \frac{\partial \Psi}{\partial \hat{\mathbf{P}}_i^j} = \sum_{x=1}^{c} \sum_{y=1}^{r} \frac{\partial \delta(x, y)}{\partial f_i(x, y)} \cdot \frac{\partial G_i^j(x, y)}{\partial \hat{\mathbf{P}}_i^j}. \qquad (10)$$

A full description of the derivatives is given in [39]. All the derivatives of the
objective function $\Psi$ with respect to the Gaussian parameters can be derived
analytically. Thus, the overall optimization process is relatively computationally
light and allows for adjusting many Gaussians for each EM. Moreover, during the
optimization process, Gaussians that are located at a point that does not represent
a real fraction core will vanish and their magnitude will converge to zero,
whereas other Gaussians will change their parameters and take approximately the
shape of the real fraction distribution around the core. This property significantly
reduces the sensitivity of the GBSAU method to the false detection of unreal cores
during the first step.
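The following sketch mimics the optimization of Eqs. (5)–(10) in a strongly simplified form: it evaluates the accumulated cosine similarity $\Psi$ for a flat parameter vector and performs plain gradient-ascent steps using finite-difference gradients instead of the analytical derivatives of [39]. All sizes, names, and the step size are assumptions made for illustration.

```python
import numpy as np

def gaussian(X, Y, p):
    """p = (a0, a1, sx, sy, theta, x0, y0), as in Eq. (4)."""
    a0, a1, sx, sy, th, x0, y0 = p
    xr = (X - x0) * np.sin(th) + (Y - y0) * np.cos(th)
    yr = (X - x0) * np.cos(th) - (Y - y0) * np.sin(th)
    return a0 + a1 * np.exp(-((xr / sx) ** 2 + (yr / sy) ** 2) / 2.0)

def objective(P, image, E, X, Y, n_gauss):
    """Psi of Eq. (5): accumulated cosine similarity between measured and
    reconstructed spectra; P holds 7 parameters per Gaussian per EM."""
    L = E.shape[1]
    P = P.reshape(L, n_gauss, 7)
    F = np.stack([sum(gaussian(X, Y, p) for p in P[i]) for i in range(L)], axis=-1)
    recon = F @ E.T                                  # (rows, cols, M)
    num = np.sum(image * recon, axis=2)
    den = np.linalg.norm(image, axis=2) * np.linalg.norm(recon, axis=2) + 1e-12
    return np.sum(num / den)

def gradient_ascent(P, args, eta=1e-3, steps=50, h=1e-4):
    """Plain finite-difference gradient ascent on Psi (Eq. (8))."""
    for _ in range(steps):
        grad = np.zeros_like(P)
        base = objective(P, *args)
        for k in range(P.size):
            Pk = P.copy()
            Pk[k] += h
            grad[k] = (objective(Pk, *args) - base) / h
        P = P + eta * grad
    return P

# Toy usage: 2 EMs, 1 Gaussian each, on a small synthetic image
rng = np.random.default_rng(0)
Y, X = np.mgrid[0:20, 0:20]
E = rng.random((8, 2))                               # 8 bands, 2 EMs
P0 = np.array([0.0, 0.5, 4.0, 4.0, 0.0, 8.0, 8.0,    # EM 1, Gaussian 1
               0.0, 0.5, 4.0, 4.0, 0.0, 14.0, 14.0]) # EM 2, Gaussian 1
image = rng.random((20, 20, 8))
P_hat = gradient_ascent(P0, (image, E, X, Y, 1), steps=10)
```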

2.2. Reconstruction of fraction surfaces from noisy data with a high


percentage of corrupted pixels

In addition to the noticeable enhancement of the fractions’ accuracy, the output of


the GBSAU provides continuous surfaces that describe the spatial distribution of
the fractions [39]. The analytical representation of these surfaces makes the use of
GBSAU advantageous for further tasks that can be applied based on the information
derived from the process itself. Moreover, although the obtained surfaces are
smooth and continuous, neither of the two steps of the overall process requires
continuous input. The process for core extraction is based on a derivative-free max
filter that is applied through a moving-window strategy to find the maximal value
in local regions, i.e., it does not need any process that requires continuity or
smoothness of the data. Fig. 2 presents the extracted spectral cores for a given
image with and without the presence of corrupted pixels. It is important to mention
that a smaller search window was used for the data with corrupted pixels; thus,
more single-layer cores were detected. The influence of the increased number of
detected cores is minor with regard to the accuracy of the obtained results, but it
does increase the processing time, since the parameters of more Gaussians
need to be adjusted. In practice, the single-layer regional maxima finder is not
very precise, and it may detect points that do not represent real spectral cores, while
the multi-layer maxima finder is so precise that many real fraction cores may not be
detected. Thus, let $C_1$ and $C_2$ be the sets of single-layer and multi-layer cores,
respectively; the union of both sets, $C = C_1 \cup C_2$, ensures that
any fraction core will be represented by at least one regional maximum point.
Moreover, the optimization process in the second step maximizes an overall
objective function that accumulates the spectral similarity over the entire image.
Thus, it is robust, to a great extent, to missing data in part of the image pixels
(e.g., corrupted pixels) and accordingly does not require continuity of the data.
Here, we use the advantages of the GBSAU method and modify it to
reconstruct the fraction surfaces from spectral images with a high percentage of
corrupted pixels over the image. Given a spectral image $\mathbf{H}$ and a set of EMs $\mathbf{E}$,
the process in GBSAU estimates the parameters of the anisotropic 2D Gaussians
by solving the following optimization problem

$$\hat{\mathbf{P}} = \arg\max_{\mathbf{P}} \Psi\left\{\mathbf{E}, \mathbf{H}, C, \mathbf{P}\right\} \qquad (11)$$

where $C$ is the set of spectral cores extracted during the first step of GBSAU and
$\hat{\mathbf{P}}$ is the vector of estimated parameters of all the Gaussians in the problem. The
problem in (11) can be modified to estimate the parameters of the Gaussians based
on data with corrupted pixels as follows

$$\hat{\mathbf{P}}_c = \arg\max_{\mathbf{P}_c} \Psi\left\{\mathbf{E}, \mathbf{H}_c, C_c, \mathbf{P}_c\right\}, \qquad (12)$$

where $\mathbf{H}_c$ is a spectral image with corrupted pixels and $C_c$ is a set of spectral cores
derived using the data in $\mathbf{H}_c$ and the set of EMs $\mathbf{E}$. Let $e_p = \hat{\mathbf{P}}_c - \hat{\mathbf{P}}$ denote the
error of the estimated parameters $\hat{\mathbf{P}}_c$ with respect to the parameters $\hat{\mathbf{P}}$ estimated
from data without corrupted pixels, and let $c_p = n_{ocp} / n_{op}$ denote the
percentage of corrupted pixels in the image $\mathbf{H}_c$, where $n_{ocp}$ and $n_{op}$ are the
number of corrupted pixels and the overall number of pixels in the image, respectively. We
assume that the error $e_p$ is relatively robust to the factor $c_p$, i.e., we can estimate
the parameters of the Gaussians with sufficient accuracy using data with many
corrupted pixels.
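One straightforward way to realize the modified problem (12) is to accumulate the similarity only over non-corrupted pixels, e.g., through a boolean validity mask as sketched below; this masking strategy is an assumption consistent with the description above rather than the exact GBSAU implementation, and all names and shapes are illustrative.

```python
import numpy as np

def masked_similarity_objective(image, recon, valid, eps=1e-12):
    """Accumulate the cosine similarity of Eq. (6) only over pixels that are
    not corrupted. image, recon: (rows, cols, M); valid: boolean (rows, cols)."""
    num = np.sum(image * recon, axis=2)
    den = np.linalg.norm(image, axis=2) * np.linalg.norm(recon, axis=2) + eps
    return np.sum((num / den)[valid])

# Example: mark 40% of the pixels as corrupted and ignore them
rng = np.random.default_rng(1)
image = rng.random((20, 20, 8))
recon = rng.random((20, 20, 8))
valid = rng.random((20, 20)) > 0.4      # True where the pixel is usable
psi_c = masked_similarity_objective(image, recon, valid)
```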

3. Evaluation and results

To test the performance of the modified GBSAU for the reconstruction of fraction
surfaces, we applied a comparative evaluation using a patch of 50 × 50 pixels
from a real AisaDUAL image (Fig. 1). Six EMs of asphalt, vegetation, red roof,
concrete, and two types of soil were selected from the image and used for the
evaluation. To create data with a synthetic ground-truth of the fractions, we created
a semi-synthetic spectral image with an affinity to the real image as described in
[39]. The semi-synthetic image contains mixed pixels with a variety of
combinations of the used EMs. To simulate real conditions in each experimental
scenario, we added Gaussian noise to each pixel in the image. To test the
performance of the examined methods under the presence of corrupted pixels, we
created scenes with different percentages of corrupted pixels over the image. We
created nine scenarios with different combinations of signal to noise ratio (SNR)
and percentage of corrupted pixels, as presented in Table 1. An RGB composite of
the generated image in each scenario is presented in Fig. 3. The performance of
the modified GBSAU method in each case was compared to the performance of
the two ordinary (non-spatial) methods SUnSAL [26] and VPGDU [60] and the
spatially adaptive method SUnSAL-TV [52]. Except for the GBSAU, the other
three examined methods cannot be directly applied to data with corrupted
pixels. Thus, an interpolated image in each scenario was used as an input for these
methods. A linear interpolation was applied only to retrieve the data in the
corrupted pixels, while pixels with original data were not modified. In contrast, the
modified GBSAU was always applied to the original image with corrupted pixels,
without the use of any interpolated data.

Table 1. Experimental scenarios with different combinations of SNR and percentage of corrupted
pixels.

Scenario SNR (dB) Corrupted pixels (%)

1 30 0
2 30 40
3 30 80
4 10 0
5 10 40
6 10 80
7 5 0
8 5 40
9 5 80

Fig. 1. RGB composite and reflectance spectra of the six EMs, selected from the AisaDUAL image.

Fig. 2. Scatter of spectral core points for EM of Asphalt; (a)—(b) single-layer and multi-layer
regional maxima points, respectively, using an image without corrupted pixels; (c)—(d) single-layer
and multi-layer regional maxima points, respectively, using an image with 40% corrupted pixels.

Fig. 3. RGB composite of the generated semi-synthetic spectral images (a)–(i) according to the
presented parameters in scenarios 1–9, respectively.

For a quantitative assessment of the obtained results, we compute the average


Mean Absolute Error (MAE) of the estimated fractions with respect to the
synthetic ground-truth fractions as follows:
$$\mathrm{MAE} = \frac{1}{L}\sum_{i=1}^{L} \mathrm{MAE}_i \qquad (13)$$

where

$$\mathrm{MAE}_i = \frac{1}{r \cdot c}\sum_{x=1}^{c}\sum_{y=1}^{r}\left|\hat{f}_i(x, y) - f_i(x, y)\right|,$$

$f_i(x, y)$ and $\hat{f}_i(x, y)$ are, respectively, the synthetic-true fractions and the estimated
fractions of the $i$th EM at $(x, y)$, and $c$ and $r$ are the numbers of columns and rows
in the image, respectively. The results are summarized in Table 2.
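For completeness, the error measure of Eq. (13) can be computed directly from the true and estimated fraction maps; the array shapes and names below are assumptions for this sketch.

```python
import numpy as np

def mean_absolute_error(f_true, f_hat):
    """f_true, f_hat: (rows, cols, L) fraction maps.
    Returns the MAE averaged over all L endmembers, Eq. (13)."""
    per_em = np.mean(np.abs(f_hat - f_true), axis=(0, 1))   # MAE_i for each EM
    return np.mean(per_em)

# Toy example with 6 endmembers on a 50x50 grid
rng = np.random.default_rng(0)
f_true = rng.random((50, 50, 6))
f_hat = f_true + 0.05 * rng.standard_normal((50, 50, 6))
mae = mean_absolute_error(f_true, f_hat)
```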

Table 2. Quantitative MAE measures for evaluating the accuracy of the estimated fractions, in each
experimental scenario, using the GBSAU, SUnSAL-TV, SUnSAL, and VPGDU methods relative
to the true synthetic fractions. The results of the GBSAU are based only on the original images with
corrupted pixels, while the other methods were applied to interpolated images.

MAE

0% corrupted pixels
SNR SUnSAL VPGDU SUnSAL-TV GBSAU
30 dB 0.0178 0.0200 0.0132 0.0143
10 dB 0.0846 0.1200 0.0842 0.0628
5 dB 0.1101 0.1417 0.1065 0.0816

40% corrupted pixels


SUnSAL VPGDU SUnSAL-TV GBSAU
30 dB 0.0370 0.0463 0.0330 0.0331
10 dB 0.0910 0.1248 0.0916 0.0719
5 dB 0.1202 0.1719 0.1255 0.1031

80% corrupted pixels


SUnSAL VPGDU SUnSAL-TV GBSAU
30 dB 0.0702 0.0764 0.0678 0.0628
10 dB 0.1119 0.1388 0.1145 0.1060
5 dB 0.1300 0.1778 0.1384 0.1103

The results clearly show the advantage of the GBSAU over the other methods,
especially as the noise level and the percentage of corrupted pixels increase. In general, the
spatially adaptive methods perform better than the ordinary ones in the case of
SNR = 30 dB. The SUnSAL-TV method loses its advantage over the non-spatial
SUnSAL method in some of the cases with corrupted pixels and low SNR; this
probably indicates the negative influence of the interpolated data on the spatial
regularization. In all other cases, the advantage of the modified GBSAU over all the other
examined methods is clear. Recall that its results are obtained without
the need for any interpolation. The modified GBSAU provides a beneficial tool for
the retrieval of valuable information from spectral data with a high level of noise
and a high percentage of corrupted pixels. To illustrate the results summarized in Table
2, Fig. 4 presents the surface of the MAE values for the obtained fractions by
GBSAU and SUnSAL-TV. First, the MAE value in each scenario, i.e., for a
particular combination of SNR and percentage of corrupted pixels, is assigned to
a corresponding pixel to create a 3 × 3 surface. Then, for better visualization, the
surface is resized into a 15 × 15 surface using bicubic interpolation.

Fig. 4. MAE surfaces, (a) and (b), for the obtained results by GBSAU and SUnSAL-TV, respectively.

The illustration in Fig. 4 sheds light on the obtained MAE values and emphasizes
the advantage of GBSAU over SUnSAL-TV with regard to the accuracy of the
estimated fractions. The increase of MAE values, as the SNR decreases or the
percentage of corrupted pixels increases, is evident in both methods. However, the
trend of this increase is more moderate for the results of GBSAU. For further visual
evaluation of the results, we present the obtained fractions’ surface for the red roof
EM in each scenario. Fig. 5 and Fig. 6 present the obtained surfaces by SUnSAL-
TV and GBSAU, respectively.

Fig. 5. Fraction surfaces of the red roof EM, (a)-(i) as obtained by SUnSAL-TV for the images in
scenarios 1-9, respectively. The results for cases with corrupted pixels are obtained using an
interpolated image.

Fig. 6. Fraction surfaces of the red roof EM, (a)-(i) as obtained by GBSAU for the images in scenarios
1-9, respectively. The results for cases with corrupted pixels are obtained using only non-corrupted
pixels.

It is noticeable that both methods can retrieve reliable information even from very noisy
data. However, the advantage of the GBSAU over SUnSAL-TV in all the scenarios is clear.
The ability of the GBSAU to reconstruct the fractions under conditions of low SNR and a
high percentage of corrupted pixels is noteworthy. In addition to the spatial distribution,
the GBSAU also preserves the sparsity of the fractions; we can observe this from the
presented surfaces. While SUnSAL-TV overestimates fractions with zero value, a zero
value is obtained for these pixels by GBSAU in most of the cases.

4. Conclusions

We presented a new strategy for retrieving the fractional abundances from very
noisy hyperspectral images with a high percentage of corrupted pixels. The new
strategy is based on a modification of the GBSAU method. An experimental
evaluation of the proposed strategy with respect to state-of-the-art spatially adaptive
and non-spatial unmixing methods was conducted. The outcomes in the evaluation
section emphasize the advantage of using the unmixing process for the extraction
of information from hyperspectral data. All the methods succeed in retrieving part
of the information regarding the spatial distribution of the abundance
fractions with a certain degree of accuracy. This accuracy decreases as the SNR
decreases or the percentage of corrupted pixels increases. The GBSAU method outperforms all the
other methods with regard to the accuracy of the obtained results. This higher accuracy is
mainly due to the use of spatial information from the entire image and not only
from local regions. The problem of spatially adaptive unmixing is similar to the
fitting of a particular function using grid data. While the solution in SUnSAL-TV
(and all the other methods) aims to achieve this objective through a piecewise
regularization, the solution in GBSAU fits a single continuous and smooth
function for each EM across the entire image. Thus, the influence of local
anomalies on the surfaces obtained by GBSAU is much lower. For similar reasons,
all the other unmixing methods cannot be applied to data without continuity, and
an interpolation method must be applied as a preprocessing step before the spectral unmixing.
In contrast, the framework in GBSAU provides a novel solution for unmixing images
with both low SNR and non-continuity due to the presence of corrupted pixels.

References

[1] N. Keshava and J. F. Mustard, “Spectral unmixing,” IEEE Signal Process. Mag., vol. 19,
no. 1, pp. 44–57, 2002.

[2] A. Plaza, Q. Du, J. M. Bioucas-Dias, X. Jia, and F. A. Kruse, “Foreword to the special issue
on spectral unmixing of remotely sensed data,” IEEE Trans. Geosci. Remote Sens., vol. 49,
no. 11 PART 1, pp. 4103–4105, 2011.
[3] M. Brown, H. G. Lewis, and S. R. Gunn, “Linear spectral mixture models and support vector
machines for remote sensing,” IEEE Trans. Geosci. Remote Sens., vol. 38, no. 5, pp. 2346–
2360, 2000.
[4] J. Bioucas-Dias et al., “Hyperspectral unmixing overview: Geometrical, statistical, and
sparse regression-based approaches,” IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, vol. 5, no. 2. pp. 354–379, 2012.
[5] B. Hapke, “Bidirectional reflectance spectroscopy,” Icarus, vol. 195, no. 2. pp. 918–926, 2008.
[6] J. M. P. Nascimento and J. M. Bioucas-Dias, “Nonlinear mixture model for hyperspectral
unmixing,” in Proceedings of SPIE conference on Image and Signal Processing for Remote
Sensing XV, 2009, vol. 7477, pp. 74770I-1-74770I–8.
[7] Y. Altmann, N. Dobigeon, J. Y. Tourneret, and S. McLaughlin, “Nonlinear unmixing of
hyperspectral images using radial basis functions and orthogonal least squares,” Geoscience
and Remote Sensing Symposium (IGARSS), 2011 IEEE International. pp. 1151–1154, 2011.
[8] R. Heylen, M. Parente, and P. Gader, “A review of nonlinear hyperspectral unmixing
methods,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing, vol. 7, no. 6. pp. 1844–1868, 2014.
[9] C. I. Chang, “Constrained subpixel target detection for remotely sensed imagery,” IEEE
Trans. Geosci. Remote Sens., vol. 38, no. 3, pp. 1144–1159, 2000.
[10] N. Keshava, “A Survey of Spectral Unmixing Algorithms,” Lincoln Lab. J., vol. 14, no. 1,
pp. 55–78, 2003.
[11] A. Bateson and B. Curtiss, “A method for manual endmember selection and spectral
unmixing,” Remote Sens. Environ., vol. 55, no. 3, pp. 229–243, 1996.
[12] M. Parente and A. Plaza, “Survey of geometric and statistical unmixing algorithms for
hyperspectral images,” in 2nd Workshop on Hyperspectral Image and Signal Processing:
Evolution in Remote Sensing, WHISPERS 2010 - Workshop Program, 2010.
[13] J. W. Boardman, F. a. Kruse, and R. O. Green, “Mapping target signatures via partial
unmixing of AVIRIS data,” Summ. JPL Airborne Earth Sci. Work., pp. 3–6, 1995.
[14] C. Gonzalez, D. Mozos, J. Resano, and A. Plaza, “FPGA implementation of the N-FINDR
algorithm for remotely sensed hyperspectral image analysis,” IEEE Trans. Geosci. Remote
Sens., vol. 50, no. 2, pp. 374–388, 2012.
[15] C. I. Chang, C. C. Wu, C. S. Lo, and M. L. Chang, “Real-Time Simplex Growing Algorithms
for Hyperspectral Endmember Extraction,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 48, no. 4. pp. 1834–1850, 2010.
[16] X. Geng, Z. Xiao, L. Ji, Y. Zhao, and F. Wang, “A Gaussian elimination based fast
endmember extraction algorithm for hyperspectral imagery,” ISPRS J. Photogramm.
Remote Sens., vol. 79, pp. 211–218, May 2013.
[17] M. E. Winter, “N-FINDR: an algorithm for fast autonomous spectral end-member
determination in hyperspectral data,” SPIE’s Int. Symp. Opt. Sci. Eng. Instrum., vol. 3753,
no. July, pp. 266–275, 1999.
[18] C. I. Chang, C. C. Wu, W. M. Liu, and Y. C. Ouyang, “A new growing method for simplex-
based endmember extraction algorithm,” IEEE Trans. Geosci. Remote Sens., vol. 44, no. 10,
pp. 2804–2819, 2006.

[19] M. D. Craig, “Minimum-volume transforms for remotely sensed data,” IEEE Trans. Geosci.
Remote Sens., vol. 32, no. 3, pp. 542–552, 1994.
[20] J. M. P. Nascimento and J. M. Bioucas-Dias, “Hyperspectral Unmixing Based on Mixtures
of Dirichlet Components,” IEEE Transactions on Geoscience and Remote Sensing, 2011.
[21] E. M. T. Hendrix, I. Garcia, J. Plaza, G. Martin, and A. Plaza, “A New Minimum-Volume
Enclosing Algorithm for Endmember Identification and Abundance Estimation in
Hyperspectral Data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no.
7. pp. 2744–2757, 2012.
[22] J. M. P. Nascimento and J. M. B. Dias, “Vertex component analysis: A fast algorithm to
unmix hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 4, pp. 898–910,
2005.
[23] A. Plaza, P. Martinez, R. Perez, and J. Plaza, “A quantitative and comparative analysis of
endmember extraction algorithms from hyperspectral data,” IEEE Trans. Geosci. Remote
Sens., vol. 42, no. 3, pp. 650–663, 2004.
[24] S. Sánchez, G. Martín, and A. Plaza, “Parallel implementation of the N-FINDR endmember
extraction algorithm on commodity graphics processing units,” in International Geoscience
and Remote Sensing Symposium (IGARSS), 2010, pp. 955–958.
[25] Z. Shi, W. Tang, Z. Duren, and Z. Jiang, “Subspace matching pursuit for sparse unmixing
of hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 6, pp. 3256–3274,
2014.
[26] J. M. Bioucas-Dias and M. A. T. Figueiredo, “Alternating direction algorithms for
constrained sparse regression: Application to hyperspectral unmixing,” in 2nd Workshop on
Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, WHISPERS
2010 - Workshop Program, 2010.
[27] M. D. Iordache, J. M. Bioucas-Dias, and A. Plaza, “Collaborative Sparse Regression for
Hyperspectral Unmixing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52,
no. 1. pp. 341–354, 2014.
[28] M. D. Iordache, J. M. Bioucas-Dias, and A. Plaza, “Sparse Unmixing of Hyperspectral
Data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6. pp. 2014–
2039, 2011.
[29] W. Ouerghemmi, C. Gomez, S. Naceur, and P. Lagacherie, “Applying blind source
separation on hyperspectral data for clay content estimation over partially vegetated
surfaces,” Geoderma, vol. 163, no. 3–4, pp. 227–237, 2011.
[30] I. Meganem, Y. Deville, S. Hosseini, P. Déliot, and X. Briottet, “Linear-quadratic blind
source separation using NMF to unmix urban hyperspectral images,” IEEE Trans. Signal
Process., vol. 62, no. 7, pp. 1822–1833, 2014.
[31] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning
Algorithms and Applications. 2003.
[32] Y. Zhong, X. Wang, L. Zhao, R. Feng, L. Zhang, and Y. Xu, “Blind spectral unmixing based
on sparse component analysis for hyperspectral remote sensing imagery,” ISPRS J.
Photogramm. Remote Sens., vol. 119, pp. 49–63, Sep. 2016.
[33] C.-H. Lin, C.-Y. Chi, Y.-H. Wang, and T.-H. Chan, “A Fast Hyperplane-Based Minimum-
Volume Enclosing Simplex Algorithm for Blind Hyperspectral Unmixing,” IEEE Trans.
Signal Process., vol. 64, no. 8, pp. 1946–1961, Apr. 2016.

[34] S. Zhang, A. Agathos, and J. Li, “Robust Minimum Volume Simplex Analysis for
Hyperspectral Unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 11, pp. 6431–
6439, Nov. 2017.
[35] Y. Zhong, X. Wang, L. Zhao, R. Feng, L. Zhang, and Y. Xu, “Blind spectral unmixing based
on sparse component analysis for hyperspectral remote sensing imagery,” ISPRS J.
Photogramm. Remote Sens., vol. 119, pp. 49–63, Sep. 2016.
[36] S. Jia and Y. Qian, “Constrained nonnegative matrix factorization for hyperspectral
unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 1, pp. 161–173, 2009.
[37] A. Buades, B. Coll, and J.-M. Morel, “A Non-Local Algorithm for Image Denoising,” in
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 2, pp. 60–65.
[38] Y. Zhong, R. Feng, and L. Zhang, “Non-local sparse unmixing for hyperspectral remote
sensing imagery,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 7, no. 6, pp. 1889–
1909, 2014.
[39] F. Kizel and M. Shoshany, “Spatially adaptive hyperspectral unmixing through endmembers
analytical localization based on sums of anisotropic 2D Gaussians,” ISPRS J. Photogramm.
Remote Sens., vol. 141, pp. 185–207, Jul. 2018.
[40] A. Plaza, P. Martinez, R. Perez, and J. Plaza, “Spatial/spectral endmember extraction by
multidimensional morphological operations,” IEEE Trans. Geosci. Remote Sens., vol. 40,
no. 9, pp. 2025–2041, 2002.
[41] D. M. Rogge, B. Rivard, J. Zhang, A. Sanchez, J. Harris, and J. Feng, “Integration of spatial-
spectral information for the improved extraction of endmembers,” Remote Sens. Environ.,
vol. 110, no. 3, pp. 287–303, 2007.
[42] M. Zortea and A. Plaza, “Spatial preprocessing for endmember extraction,” IEEE Trans.
Geosci. Remote Sens., vol. 47, no. 8, pp. 2679–2693, 2009.
[43] G. Marten and A. Plaza, “Spatial-spectral preprocessing prior to endmember identification
and unmixing of remotely sensed hyperspectral data,” IEEE J. Sel. Top. Appl. Earth Obs.
Remote Sens., vol. 5, no. 2, pp. 380–395, 2012.
[44] G. Marten and A. Plaza, “Region-based spatial preprocessing for endmember extraction and
spectral unmixing,” IEEE Geosci. Remote Sens. Lett., vol. 8, no. 4, pp. 745–749, 2011.
[45] B. Somers, M. Zortea, A. Plaza, and G. P. Asner, “Automated extraction of image-based
endmember bundles for improved spectral unmixing,” IEEE J. Sel. Top. Appl. Earth Obs.
Remote Sens., vol. 5, no. 2, pp. 396–408, 2012.
[46] M. Shoshany and T. Svoray, “Multidate adaptive unmixing and its application to analysis
of ecosystem transitions along a climatic gradient,” Remote Sens. Environ., 2002.
[47] A. Zare, “Spatial-spectral unmixing using fuzzy local information,” in International
Geoscience and Remote Sensing Symposium (IGARSS), 2011, pp. 1139–1142.
[48] X. Song, X. Jiang, and X. Rui, “Spectral unmixing using linear unmixing under spatial
autocorrelation constraints,” in International Geoscience and Remote Sensing Symposium
(IGARSS), 2010, pp. 975–978.
[49] O. Eches, N. Dobigeon, and J. Y. Tourneret, “Enhancing hyperspectral image unmixing
with spatial correlations,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11 PART 1, pp.
4239–4247, 2011.

[50] O. Eches, J. A. Benediktsson, N. Dobigeon, and J. Y. Tourneret, “Adaptive Markov random


fields for joint unmixing and segmentation of hyperspectral images,” IEEE Trans. Image
Process., vol. 22, no. 1, pp. 5–16, 2013.
[51] S. Bauer, J. Stefan, M. Michelsburg, T. Laengle, and F. P. León, “Robustness improvement
of hyperspectral image unmixing by spatial second-order regularization,” IEEE Trans.
Image Process., vol. 23, no. 12, pp. 5209–5221, 2014.
[52] M. D. Iordache, J. M. Bioucas-Dias, and A. Plaza, “Total variation spatial regularization for
sparse hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 11 PART1,
pp. 4484–4502, 2012.
[53] R. Feng, Y. Zhong, and L. Zhang, “Adaptive non-local Euclidean medians sparse unmixing
for hyperspectral imagery,” ISPRS J. Photogramm. Remote Sens., vol. 97, pp. 9–24, Nov.
2014.
[54] A. Zare, O. Bchir, H. Frigui, and P. Gader, “Spatially-smooth piece-wise convex
endmember detection,” 2010 2nd Workshop on Hyperspectral Image and Signal
Processing: Evolution in Remote Sensing. pp. 1–4, 2010.
[55] S. Jia and Y. Qian, “Spectral and spatial complexity-based hyperspectral unmixing,” in
IEEE Transactions on Geoscience and Remote Sensing, 2007, vol. 45, no. 12, pp. 3867–3879.
[56] S. Mei, Q. Du, and M. He, “Equivalent-Sparse Unmixing Through Spatial and Spectral
Constrained Endmember Selection From an Image-Derived Spectral Library,” IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2015.
[57] F. Zhu, Y. Wang, S. Xiang, B. Fan, and C. Pan, “Structured Sparse Method for
Hyperspectral Unmixing,” ISPRS J. Photogramm. Remote Sens., vol. 88, pp. 101–118, Feb.
2014.
[58] J. Sigurdsson, M. O. Ulfarsson, J. R. Sveinsson, and J. A. Benediktsson, “Smooth spectral
unmixing using total variation regularization and a first order roughness penalty,” in
International Geoscience and Remote Sensing Symposium (IGARSS), 2013, pp. 2160–2163.
[59] J. Sigurosson, “Hyperspectral Unmixing Using Total Variation and Sparse Methods,”
University of Iceland, 2015.
[60] F. Kizel, M. Shoshany, N. S. Netanyahu, G. Even-Tzur, and J. A. Benediktsson, “A
Stepwise Analytical Projected Gradient Descent Search for Hyperspectral Unmixing and Its
Code Vectorization,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 4925–4943, Sep.
2017.
[61] M. Shoshany, F. Kizel, N. S. Netanyahu, N. Goldshlager, T. Jarmer, and G. Even-Tzur, “An
iterative search in end-member fraction space for spectral unmixing,” IEEE Geosci. Remote
Sens. Lett., 2011.
[62] A. Goshtasby, “Gaussian decomposition of two-dimensional shapes: A unified
representation for CAD and vision applications,” Pattern Recognit., vol. 25, no. 5, pp. 463–
472, 1992.
[63] A. Goshtasby and W. D. O’Neill, “Surface fitting to scattered data by a sum of Gaussians,”
Comput. Aided Geom. Des., vol. 10, no. 2, pp. 143–156, 1993.
[64] C. Stoll, N. Hasler, J. Gall, H. P. Seidel, and C. Theobalt, “Fast articulated motion tracking
using a sums of Gaussians body model,” in Proceedings of the IEEE International
Conference on Computer Vision, 2011, pp. 951–958.
[65] F. Bellocchio, N. A. Borghese, S. Ferrari, and V. Piuri, 3D Surface Reconstruction. New
York, NY: Springer New York, 2013.
230 F. Kizel and J.A. Benediktsson

[66] I. Zelman, M. Titon, Y. Yekutieli, S. Hanassy, B. Hochner, and T. Flash, “Kinematic


decomposition and classification of octopus arm movements,” Front. Comput. Neurosci.,
vol. 7, p. 60, May 2013.
[67] J. Liang, F. Park, and H. Zhao, “Robust and Efficient Implicit Surface Reconstruction for
Point Clouds Based on Convexified Image Segmentation,” J. Sci. Comput., vol. 54, no. 2–
3, pp. 577–602, Feb. 2013.
March 12, 2020 15:13 ws-rv961x669 HBPRCV-6th Edn.–11573 ch12˙Zhang-Qin page 231

CHAPTER 2.3

IMAGE PROCESSING FOR SEA ICE PARAMETER IDENTIFICATION FROM VISUAL IMAGES

Qin Zhang
Department of Marine Technology, Norwegian University of Science and
Technology, 7052, Trondheim, Norway
chin.qz.chang@gmail.com

Sea ice statistics and ice properties are important for the analysis of ice–structure
interaction. The use of cameras as sensors on mobile sensor platforms will aid the development of sea ice observation, for instance to support the estimation of ice forces that are critical to marine operations in Arctic waters. The time-
and geo-referenced sea ice images captured from cameras will provide valuable
information to observe the type and state of sea ice and the corresponding physical
phenomena taking place. However, there has been a lack of methods to effectively
extract engineering-scale parameters from sea ice images, leaving scientists and
engineers to do their analysis manually. This chapter introduces novel sea ice
image processing algorithms to automatically extract useful ice information, such
as ice concentration, ice types and ice floe size distribution, which are important
in various fields of ice engineering.

1. Introduction

Various types of remotely sensed data and imaging technology have been developed for sea ice observation. Image data from various sources, such as visible cameras, infrared cameras, radar, and satellites, are rich in information about the environment, from which many of the sea ice parameters can be extracted. Identifying ice parameters for wide-scale regions from satellite data has been widely studied.1–6 Recently, ice cover data on a global scale have become available on a daily basis due to the development of microwave satellite sensors, making it possible to monitor the global variability of sea ice extent over time-scales from days to seasons. However, satellite observing systems are unable to monitor the local variability of sea ice parameters (e.g., the sea ice in contact with a marine/offshore structure or coastal infrastructure), and this remains an issue at the engineering scale (e.g., for predicting sea ice behavior and loads with numerical models) due to the lack of sub-grid-scale information on the ice parameters.7 This has motivated attention to the boundary detection of individual floes and the estimation of floe size distributions.8,9

Focusing on a relatively small scale, camera imagery has become one of the most information-rich remote sensing tools and has been used on mobile sensor platforms (e.g., aircraft, shipboard, or unmanned vehicles) to characterize ice conditions for engineering purposes.10–12 Cameras as field observation sensors have the potential for continuous, high-precision measurements that capture a wide range of the sea ice field, from a few meters to hundreds of meters. The visible image data obtained from cameras have high resolution, which is particularly important for providing detailed, localized information on sea ice to collect observational data on an operational basis.13 The information about objects and the environment provided by such visible images is close to human visual perception in both tonal structure and resolution, meaning that determining sea ice characteristics from visible sea ice images is similar to manual visual observation. Thus cameras can be used, in combination with other sea ice remote sensing instruments, as a supplementary means of obtaining necessary and important information about the actual ice conditions for the validation of theories and the estimation of parameters.
Despite the advantages of using visible cameras for sea ice observation, an important requirement is clear weather and visibility when capturing sea ice image data. Moreover, one of the major problems of surveying sea ice via cameras has been the difficulty of image processing for the numerical extraction of sea ice information, which is vital for estimating sea ice properties and understanding the behavior of sea ice, especially on a relatively small scale. Since the sea ice condition is a complicated multi-domain process, it is not easy to analyze sea ice images quantitatively. The lack of an effective method for image processing has hampered the understanding of the dynamic properties of sea ice on a small scale. This chapter introduces novel
sea ice image processing algorithms to automatically extract useful ice informa-
tion, such as ice concentration, ice types and ice floe size distribution, which are
important in various fields of ice engineering.

2. Ice Pixel Detection

Ice Concentration (IC) from a digital visual image is, for simplicity, defined as the
area of sea surface covered by visible ice observable in the 2D visual image taken
vertically from above, as a fraction of the whole sea surface domain area.14 It can
be calculated as the ratio of the number of pixels of visible ice to the total number
of pixels within the image domain, where the domain area is an effective area within
the image excluding land or other non-relevant areas. This means that ice concentration is given by a binary decision for each pixel, determining whether it belongs to the class “ice” or to the class “water”; distinguishing ice pixels from water pixels is thus crucial to calculating the ice concentration value.
Pixels in the same region normally have similar intensity. Because ice is whiter than water, ice pixels usually have higher intensity values than water pixels in a uniformly illuminated ice image. The thresholding method, which extracts objects from the background based on the pixels’ grayscale val-
ues and converts the grayscale image to a binary one, is a natural choice to separate
an ice image into an “ice region” and a “water region”. The automatic selection
of an appropriate grayscale threshold value is crucial when using the thresholding method for determining ice pixels. Otsu thresholding, an exhaustive search for the globally optimal threshold, is one of the most common automatic threshold segmentation methods.15 This method assumes that the
grayscale histogram of an image is bimodal and the illumination of the image is uni-
form. It then divides the histogram into two classes (i.e., the pixels are identified as
either foreground or background) and finds the threshold value that minimizes the
within-class variance. Otsu’s method is a bi-level image thresholding technique and can be further extended to multi-level thresholding for image segmentation. Multi-level thresholding using Otsu’s method is computationally simple when dividing the image into two or three classes, but as the number of classes increases, the minimization procedure becomes more complex, which makes the multi-level Otsu thresholding method time consuming.
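As a concrete illustration (a minimal sketch, not the implementation used in this chapter), bi-level Otsu ice-pixel detection and the resulting ice concentration can be written in a few lines of Python; the image variable, the function name, and the absence of a land mask are assumptions of the example.

import numpy as np
from skimage.filters import threshold_otsu

def ice_concentration_otsu(ice_gray):
    # ice_gray: 2D grayscale image as a NumPy array; assumes no land/irrelevant areas.
    t = threshold_otsu(ice_gray)               # exhaustive search for the global threshold
    ice_mask = ice_gray > t                    # pixels brighter than the threshold -> ice
    ic = 100.0 * ice_mask.sum() / ice_mask.size
    return ice_mask, ic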
K-means clustering, which minimizes the within-cluster sum of distances to partition a set of data into groups, is another ice pixel detection method.16 This algorithm iterates two steps: assignment and update. In the assignment step, each point of the data set is assigned to its nearest centroid; in the update step, the position of each centroid is adjusted to match the sample mean of the data points it is responsible for. The iteration stops when the positions of the centroids no longer change. K-means clustering effectively minimizes a within-class variance as well, but it does not require computing any variance explicitly. Therefore, this algorithm is computationally fast and is a good way to obtain a quick overview of the data, especially if the objects are classified into many clusters.17
Using the Otsu thresholding or k-means clustering method for determining ice pixels, the ice image is divided into two or more classes. The class with the lowest average intensity value is considered to be water, while the other classes are considered to be ice.14,18 The ice pixel detection results using the Otsu thresholding method are similar to the results using the k-means method when the intensity values of all the ice pixels are significantly higher than those of water pixels.19 However, the bi-level Otsu thresholding method can only find “light ice” pixels. The “dark ice” (e.g., brash ice, slush, and ice that is submerged in water), whose pixel intensity values lie between the threshold and the water intensities, may be lost. According to the definition of ice concentration, both “light ice” and “dark ice” visible in the ice images should be included when calculating the ice concentration. Since the multi-level Otsu thresholding method requires more computational time to detect ice pixels, the k-means clustering method, which achieves a better detection by dividing the image into three or more clusters, is preferred, as seen in Fig. 1.
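For comparison, a hedged sketch of the k-means variant with three clusters is given below; scikit-learn's KMeans is used purely for illustration, and the cluster count and variable names are assumptions rather than the implementation behind Fig. 1.

import numpy as np
from sklearn.cluster import KMeans

def ice_mask_kmeans(ice_gray, n_clusters=3):
    # Cluster pixel intensities; the darkest cluster is taken as water, the rest as ice.
    pixels = ice_gray.reshape(-1, 1).astype(float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pixels)
    water_label = int(np.argmin(km.cluster_centers_.ravel()))
    ice_mask = (km.labels_ != water_label).reshape(ice_gray.shape)
    ic = 100.0 * ice_mask.sum() / ice_mask.size
    return ice_mask, ic

With three or more clusters, the pixels assigned to the darker non-water clusters correspond to the “dark ice” discussed later in Sec. 4.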
Fig. 1. Ice pixel detection and ice concentration: (a) sea ice image; (b) bi-level Otsu thresholding method, IC = 72.63%; (c) multi-level Otsu thresholding method with 2 thresholds, IC = 96.50%; (d) k-means method with 2 clusters, IC = 96.50%; (e) k-means method with 3 clusters, IC = 97.11%.

3. Ice Floe Identification

3.1. Ice Boundary based Segmentation


Ice floe boundary detection is crucial to extracting information on ice floes and the floe size distribution from ice images. In an actual ice-covered environment, especially
in the marginal ice zone (MIZ), ice floes are typically very close or connected to each
other. The boundaries between apparently connected floes have a similar brightness
to the floes themselves in the images, and these boundaries become too weak to be
detected. This issue challenges automatic identification of individual ice floes and
significantly affects the ice floe statistical result. A remedy to this issue is to use
the GVF (gradient vector flow) snake algorithm20 to identify floe boundaries and
separate seemingly connected floes into individual ones.
The GVF snake algorithm is an extension of the traditional snake (also called
deformable contour or active contour) algorithm.21 In the traditional snake algo-
rithm, a snake is a given closed curve that will move and evolve its shape under
the influence of internal forces from the curve itself and external forces computed
from the image data until the internal and external forces reach equilibrium. The
internal and external forces are defined such that the snake will conform to an
object boundary or other desired features within an image. The traditional snake
algorithm has a good detection capability of weak boundaries. However, there are
two key limitations in the traditional snake algorithm: a) the capture range of the
external force fields is limited; and b) it is difficult for the snake to progress into
boundary concavities. The traditional snake algorithm is therefore sensitive to the
initial contour, which is a starting set of snake points for the evolution and should
be placed close to the true boundary. Otherwise, the snake will likely converge to
an incorrect result. To overcome these limitations, the gradient vector flow, which
is derived from the image by minimizing a certain energy functional in a variational
framework, was introduced into the traditional snake algorithm.20 The GVF field
is computed as a spatial diffusion of the gradient vectors of an edge map to expand
the capture range of external force fields from boundary regions to homogeneous
regions and to enable the external forces to point into deep concavities of object
boundaries. The GVF snake is thus computationally faster and less restricted by
the initial contour.
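To make the GVF construction concrete, the following is a minimal 2D sketch of the diffusion iteration defined by Xu and Prince;20 the regularization weight mu, time step, iteration count, and function name are illustrative assumptions, not values used in this chapter.

import numpy as np
from scipy.ndimage import laplace

def gvf_field(edge_map, mu=0.2, dt=0.5, n_iter=200):
    # Normalize the edge map and take its gradients as the initial force field.
    f = (edge_map - edge_map.min()) / (np.ptp(edge_map) + 1e-12)
    fy, fx = np.gradient(f)
    u, v = fx.copy(), fy.copy()
    mag2 = fx**2 + fy**2
    for _ in range(n_iter):
        # Diffusion spreads the field into homogeneous regions and concavities;
        # the data term keeps it close to the edge-map gradient near boundaries.
        u += dt * (mu * laplace(u) - mag2 * (u - fx))
        v += dt * (mu * laplace(v) - mag2 * (v - fy))
    return u, v

The snake evolution then moves each contour point under this external field together with the usual internal elasticity and rigidity forces.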
The GVF snake algorithm operates on the grayscale image in which the real
boundary information, particularly weak boundaries, has been preserved. This al-
gorithm is able to detect the weak edges between ice floes, and to ensure that the
detected boundary is closed. As an example, shown in Fig. 2(b), given an initial
contour (red curve), the snake finds the floe boundary (green curve) after a few
iterations (yellow curves). The GVF snake algorithm relaxes the requirements on the initial contour. However, because the snake deforms itself to conform to the nearest salient contour, a proper initial contour for an object is still necessary. Especially when identifying a mass of ice floes in an ice image, many initial contours are required for the GVF snake algorithm to detect all the individual ice floe boundaries, and these initial contours should have proper locations, sizes and shapes.
Fig. 2 gives an example showing how the floe boundary detection results are affected by initializing the contour at different locations. In Fig. 2(a), the initial contour is located in the water, close to the ice boundaries. The snake rapidly detects boundaries; however, they are not the ice boundaries but those of the water region. When initializing the contour at the center of an ice floe, as shown in Fig. 2(b), the snake accurately finds the boundary after a few iterations even though the initial contour is some distance away from the floe boundary. A weak connection will also be detected if the initial contour is located on it, as shown in Fig. 2(c). However, when the initial contour is located near the floe boundary inside the floe, as shown in Fig. 2(d), the snake may only find a part of the floe boundary near the initial contour (it should be noted that the curve is always closed regardless of how it deforms, even in the cases of Fig. 2(c) and Fig. 2(d), which appear to be non-closed curves; this is because the area bounded by the closed curve tends toward zero). This example indicates that, although the snake will find a boundary regardless of where the initial contour is located, the result is most effective when the initial contour is located inside the floe and close to the floe center.
Fig. 2. Initial contours located at different positions and their corresponding curve evolutions: (a) initial contour 1, located in the water; the boundary of the water region is found. (b) Initial contour 2, located at the center of an ice floe; the whole floe boundary is found. (c) Initial contour 3, located at a weak connection; the weak connection is found. (d) Initial contour 4, located near the floe boundary inside the floe; only a part of the floe boundary is found. The red curves are the initial contours, the yellow curves are iterative runs of the GVF snake algorithm, and the green curves are the final detected boundaries.

In addition to location, the size of the initial contour also affects the results of ice floe boundary detection. The initial contour in the GVF snake algorithm does not need to be as close to the true boundary as in the traditional snake algorithm. However, if the initial contour, located at the floe center and inside the floe, is too small, it will be slightly “far away” from the floe boundary and more
iterations will be needed for the snake to find the boundary. The snake may also converge to an incorrect result if the initial contour is even farther from the floe boundary, especially when the grayscale of the floe is uneven. Fig. 3 serves as an example. Fig. 3(a) contains some light reflection in the middle of a model ice floe (in an ice tank), where the pixels belonging to the reflection are lighter than the other pixels of the floe, and Fig. 3(d) contains speckle inside a sea ice floe, where the pixels of the speckle are darker. These phenomena will affect the boundary detection when the initial contour (the red curves in Fig. 3(b) and Fig. 3(e)) is too small and not close to the actual boundary. The snake takes many iterations (the yellow curves in Fig. 3(b) and Fig. 3(e)) and finds only a part of the floe boundary (the green curve in Fig. 3(b)), or does not find the complete boundary because it is blocked by the speckle (the green curve in Fig. 3(e)). If we enlarge the initial contour, as shown in Fig. 3(c) and Fig. 3(f), the snake determines the entire floe boundary faster. Therefore, the initial contour should still be set as close as possible to the actual floe boundary.

Fig. 3. Initial circles with different radii and their curve evolutions: (a) model ice floe image with light reflection; (b) a small contour initialized at the model ice floe center, giving convergence of the snake to an incomplete boundary; (c) a large contour initialized at the model ice floe center, giving convergence of the snake to the correct boundary; (d) sea ice floe image with speckle; (e) a small contour initialized at the sea ice floe center, giving erroneous evolution of the snake; (f) a large contour initialized at the sea ice floe center, giving convergence of the snake to the correct boundary. The red curves are the initial contours, the yellow curves are iterative runs of the GVF snake algorithm, and the green curves are the final detected boundaries.
In conclusion, to increase the efficiency of the ice floe boundary method based on the GVF snake algorithm, the initial contours should be adapted to the floe size, located inside the floe, and centered as close as possible to the floe center.18 In image analysis, the ice floes can be separated from water by converting the ice image into a binary one using a thresholding or k-means clustering method. These methods make it easy to locate the initial contours inside the ice floes. Thus, the binarized ice image and its distance transform are used to automatically initialize contours for evolving the GVF snake efficiently in ice floe boundary detection. The steps of this automatic contour initialization algorithm are described below:
Step 1: Convert the ice image into a binary image after separating the ice regions from the water regions, where pixels with value ‘1’ indicate ice and pixels with value ‘0’ indicate water; see Fig. 4(a) and Fig. 5(b).
Step 2: Perform the distance transform on the binarized ice image. Find the regional maxima, shown as the green numerals in Fig. 4(b) and as green ‘+’ in Fig. 5(d).
Step 3: Merge those regional maxima that lie within a short distance (given by a threshold Tseed) of each other. Find the “seeds”, which are the centers of the regional maxima and merged regions, shown as red ‘+’ in Fig. 4(b) and Fig. 5(e).
Step 4: Initialize the contours as circles located at the seeds. The radius of each circle is chosen according to the pixel value at the seed in the distance map; see the blue circles in Fig. 4(b) and Fig. 5(e).
At Step 2 in this algorithm, a regional maximum in the distance map of the binarized ice image ideally corresponds to the center of an ice floe, but more than one regional maximum is detected in many cases. Thus, regional maxima that are within a short distance of each other are merged (e.g., by a dilation operator) into a single one at Step 3. The circular shape is chosen for the initial contour at Step 4 because it deforms to the floe boundary more uniformly than other shapes, without any knowledge of the floe’s irregular shape and orientation. Moreover, using the seed’s pixel value in the distance map as the basis for selecting the radius of the circle ensures that the initial contour (circle) is contained strictly inside the floe and adapted to the floe size. Therefore, this contour initialization algorithm fulfills the requirements on the initial contour for the GVF snake without manual interaction.

(a) Binary image matrix:
0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0
0 1 1 1 1 1 1 0
0 1 1 1 1 1 1 0
0 0 1 1 1 1 1 0
0 0 1 1 1 1 0 0
0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0

(b) Distance map of Fig. 4(a), with regional maximum, seed, and initial contour (indicated graphically in the original figure):
0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0
0 1 2 2 2 2 1 0
0 1 2 3 3 2 1 0
0 0 1 2 3 2 1 0
0 0 1 1 2 1 0 0
0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0

Fig. 4. Contour initialization algorithm based on the distance transform.
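A hedged Python sketch of Steps 1–4 above, built from SciPy/scikit-image primitives, is shown below; using peak_local_max with a minimum distance as the merge rule, the radius offset, and the function name are illustrative choices rather than the original implementation.

import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.feature import peak_local_max

def init_circles(ice_mask, t_seed=5):
    # Step 1 is assumed done: ice_mask is a binary image with 1 = ice, 0 = water.
    dist = distance_transform_edt(ice_mask)                   # Step 2: distance map
    # Steps 2-3: regional maxima; maxima closer than t_seed are effectively merged.
    seeds = peak_local_max(dist, min_distance=t_seed, exclude_border=False)
    circles = []
    for r, c in seeds:
        radius = max(dist[r, c] - 1.0, 1.0)                   # Step 4: radius from the distance map
        circles.append((r, c, radius))                        # circle lies strictly inside the floe
    return circles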
After initializing the contours, the GVF snake algorithm is run on each contour
to identify the floe boundary. Superimposing all the boundaries over the binarized
ice image, i.e., setting the value of all identified boundary pixels to ‘0’ (note that the boundary pixels can be specifically labeled for special handling in subsequent use), results in the separation of the connected ice floes. This GVF snake-based ice floe
segmentation procedure carried out on a sea ice floe image is shown in Fig. 5.
It should be noted that regional maxima whose mutual distance is larger than the given threshold Tseed will not be merged into one seed at Step 3 of the introduced contour initialization algorithm. This means that some floes may have more than one seed. However, two or more seeds for one ice floe will not affect its boundary detection, although they may increase the computational time.

Fig. 5. The procedure of GVF snake-based ice floe segmentation: (a) sea ice floe image; (b) binarized image of Fig. 5(a), in which the ice floes are connected; (c) distance map of Fig. 5(b); (d) binary ice floe image with regional maxima (green ‘+’); (e) binary ice floe image with seeds (red ‘+’) and initial contours (blue circles); (f) segmentation result, in which the connected ice floes are separated.

3.2. Ice Shape Enhancement


After boundary detection, some segmented floes may contain holes or smaller ice pieces inside because of noise and speckle, as shown in Fig. 6(a). This means that the ice floe cannot be completely identified, and the shape of the segmented ice floe is rough, as seen in Fig. 6(b). To smooth the shape of the ice floe, morphological cleaning22 is used after ice floe segmentation.
Morphological cleaning is a combination of first morphological closing and then
morphological opening on a binary image. Both binary closing and opening oper-
ations can smooth the contours of objects, yielding results that are similar to the
original shapes of the objects but with different level of details. The closing oper-
ation is able to close narrow cracks, fill long thin channels, and eliminate the holes
that are smaller than the structuring element. The opening operation, in turn, is able to break thin connections between objects, remove small protrusions, and eliminate
complete regions of an object that cannot contain the structuring element.


For the morphological cleaning, all the segmented ice pieces are first sorted from small to large. Then, morphological cleaning with a proper structuring element (e.g., a disk-shaped structuring element whose radius can be automatically adapted to the size of each ice piece according to a certain rule9) is performed on the sorted ice pieces in sequence, and their holes are filled. This process ensures the completeness of the ice floes. An example of the ice floe shape enhancement result can be found in Fig. 6.

Fig. 6. Ice floe shape enhancement: (a) ice floe image with speckle; (b) segmentation result of Fig. 6(a); (c) shape enhancement result of Fig. 6(b).


It should be noted that the arrangement of the ice pieces in order of increasing size is required for the morphological cleaning. Otherwise, a smaller ice piece contained in a larger ice floe may not be removed.
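A minimal sketch of this size-ordered cleaning, assuming a labeled image of segmented ice pieces and an illustrative (not the chapter's) rule for the adaptive disk radius, with hypothetical function names:

import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.morphology import binary_closing, binary_opening, disk

def clean_pieces(labels):
    # labels: integer-labeled segmentation, 0 = background/water.
    cleaned = np.zeros_like(labels)
    ids = [i for i in np.unique(labels) if i != 0]
    ids.sort(key=lambda i: (labels == i).sum())              # process pieces from small to large
    for i in ids:
        piece = labels == i
        r = max(1, int(round(0.05 * np.sqrt(piece.sum()))))  # illustrative adaptive radius
        se = disk(r)
        piece = binary_opening(binary_closing(piece, se), se)  # closing, then opening
        piece = binary_fill_holes(piece)
        cleaned[piece] = i                                   # larger pieces, cleaned later, overwrite
    return cleaned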

4. A Case Study and Its Application

4.1. MIZ Image Processing

4.1.1. Marginal Ice Pixel Extraction

Sea ice typically exhibits a wide variability of ice floes, together with brash ice and slush and, possibly, a snow cover. Usually, some of the ice pixels have low intensity values close to those of water pixels, as seen in the MIZ image in Fig. 7(a), and they may not be identified by the bi-level Otsu thresholding method. The k-means clustering method with three or more clusters is then applied to determine more ice pixels. By taking the difference between the bi-level Otsu thresholding detection result and the k-means clustering detection result, we obtain the “dark ice” pixels, as shown in Fig. 7(d). We will see later that creating individual “light ice” and “dark ice” image layers is advantageous for the computation of the initial contours for the GVF snake algorithm.

Fig. 7. Ice pixel detection of a MIZ image: (a) a MIZ image; (b) “light ice”, detected by the bi-level Otsu thresholding method; (c) ice detection using the k-means method with 3 clusters; (d) “dark ice”, the difference between Fig. 7(b) and Fig. 7(c).


Note that the results of ice pixel detection using the same method but with different numbers of levels may be similar to each other;9 see, for example, Fig. 1(d) and Fig. 1(e), which show ice pixel detection results using k-means clustering with two and three clusters, respectively. The difference between these two resulting images is too small to be able to determine the “dark ice” pixels. Therefore, it is necessary to use two different methods for the extraction of the sea ice pixels in order to widen the gap between the results. Also note that k-means clustering is computationally faster than the multi-level Otsu thresholding method. Thus, the bi-level Otsu thresholding method is used to detect the “light ice” pixels, and the k-means clustering method with three or more clusters is used to determine the “dark ice” pixels.
The bi-level Otsu method detects fewer ice pixels; however, this under-detection results in more “holes” in the binarized image, as seen in the “light ice” image layer in Fig. 7(b), and this is essential for the subsequent initialization of the contours for the GVF snake algorithm, in order to separate the connected ice floes in areas where the sea ice is crowded. The under-detected ice pixels can then be compensated by the “dark ice” pixels detected with the additional k-means clustering method using three or more clusters. On the contrary, if we used the k-means clustering method to detect the “light ice” or more ice pixels, there might be few “holes” among a massive amount of ice floes connected to each other, as seen in Fig. 1(d) and Fig. 7(c), and it would then become difficult to initialize the contours for the GVF snake algorithm. Therefore, both the “light ice” and “dark ice” image layers are needed for a more accurate result, especially for individual ice floe identification.
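In code, the two layers reduce to a simple mask difference; a minimal sketch, assuming binary masks from the bi-level Otsu and k-means detections (argument and function names are hypothetical):

import numpy as np

def ice_layers(otsu_mask, kmeans_mask):
    # "Light ice" = pixels found by bi-level Otsu; "dark ice" = extra pixels found by k-means.
    light_ice = otsu_mask.astype(bool)
    dark_ice = kmeans_mask.astype(bool) & ~light_ice
    return light_ice, dark_ice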

4.1.2. Marginal Ice Boundary Detection

To start the GVF snake algorithm, the “light ice” and “dark ice” layers are used individually to compute the initial contours. Then the GVF snake algorithm is run to derive separately the “light ice” segmentation, shown as the white ice pieces in Fig. 8, and the “dark ice” segmentation, shown as the gray ice pieces in Fig. 8. Collecting all the ice pieces in the segmented “light ice” and “dark ice” image layers results in the final segmented image, as exemplified in Fig. 8.
It should be noted that the “light ice” and the “dark ice” should be labeled differently in the final segmented image. Otherwise, it may be impossible to separate some “light ice” and “dark ice” pieces if they are connected.

Fig. 8. Sea ice segmentation image. The white ice is the segmentation result for the “light ice” in Fig. 7(b), and the gray ice is the segmentation result for the “dark ice” in Fig. 7(d).

4.1.3. Marginal Ice Shape Enhancement and Final Image Processing Result

In many cases, the grayscale of an ice floe is uneven, as seen in Fig. 9(a). The
lighter part of the floe is considered as “light ice” (the white pixels in Fig. 9(b)
and Fig. 9(c)), while the darker part is considered as “dark ice” (the gray pixels in
Fig. 9(b) and Fig. 9(c)). This means the ice floe, as shown in Fig. 9(b), cannot be
completely identified when it has both “light ice” pixels and “dark ice” pixels. If we perform the ice shape enhancement on the “light ice” segmentation and “dark ice” segmentation independently, there will be overlap between the resulting individual light ice piece identification and individual dark ice piece identification. This means that some ice pixels may be identified as belonging to different ice floes, with the risk that large ice floes remain incomplete. Therefore, all the detected ice pieces from both the segmented “light ice” and “dark ice” layers should be labeled as one input to the ice shape enhancement step, to ensure the completeness of each ice floe and the removal of smaller ice pieces contained in a larger ice floe.

Fig. 9. Sea ice shape enhancement: (a) ice floe image with uneven grayscale; (b) segmentation result of Fig. 9(a); (c) shape enhancement result of Fig. 9(b). The white pixels are the “light ice” pixels, and the gray pixels are the “dark ice” pixels.

The ice shape enhancement results in the identification of individual ice pieces.
Furthermore, to distinguish brash ice from ice floes, we define a brash ice threshold
parameter (pixel number, area, or characteristic length) that can be tuned for each
application. The ice pieces with sizes larger than the threshold are considered to
be ice floes, while smaller pieces are considered to be brash ice. The remaining ice
pixels, e.g., single ice pixels or the ice pieces that are too small to be treated as
brash ice, are labeled as slush. This results in four layers of a sea ice image (using Fig. 7(a) as an example): ice floe (Fig. 10(a)), brash ice (Fig. 10(b)), slush (Fig. 10(c)), and water (Fig. 10(d)) (note that the incorrect identification in the foggy bottom-left corner of the figure can be improved by, e.g., processing this blurred region locally;18 however, we keep this error here as a special case and will discuss it in later sections). Moreover, the residue ice, which consists of the detected edge pixels between the connected ice pieces, is in this example considered as slush (since there is often an edge layer of slush between two ice floes) and included in Fig. 10(c). However, the residue ice, as shown in Fig. 10(e), can also simply be identified as “residue ice” and defined specifically by the user and the application of the data. Based on the four layers, a total of 2888 ice floes and 3452 brash ice pieces are identified from Fig. 7(a). The coverage percentages are 58.00% ice floe, 4.85% brash ice, 21.21% slush, and 15.94% water. The total ice concentration is 84.06%, and the histogram of the ice floe size distribution grouped by mean caliper diameter (MCD) is shown in Fig. 11.

Fig. 10. Sea ice image processing result of Fig. 7(a): (a) layer showing the “ice floes” (marked with white dots at the floe centers), where the color bar shows the MCD of the ice floes in meters; (b) layer showing the “brash ice”; (c) layer showing the “slush”; (d) layer showing the “water”; (e) residue ice (edge pixels).

Fig. 11. Floe size distribution histogram of Fig. 10(a) (MCD in meters).
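A minimal sketch of the four-layer construction, assuming a labeled image of the enhanced ice pieces and an overall ice mask; the pixel-count thresholds and function name are illustrative and are not the values behind Fig. 10.

import numpy as np

def classify_layers(labels, ice_mask, floe_min_pixels=500, brash_min_pixels=20):
    floe = np.zeros(labels.shape, dtype=bool)
    brash = np.zeros(labels.shape, dtype=bool)
    for i in np.unique(labels):
        if i == 0:
            continue
        piece = labels == i
        n = int(piece.sum())
        if n >= floe_min_pixels:
            floe |= piece                      # pieces above the threshold -> ice floes
        elif n >= brash_min_pixels:
            brash |= piece                     # mid-sized pieces -> brash ice
    ice = ice_mask.astype(bool)
    slush = ice & ~(floe | brash)              # remaining ice pixels (incl. residue) -> slush
    water = ~ice
    return floe, brash, slush, water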

4.2. Digital Ice Field Generation


Ice floes identified by the GVF snake-based algorithm (see, e.g., Fig. 10(a)) are not necessarily convex. In a MIZ, however, ice floes generally exhibit a rounded shape.5 To better approximate the ice floes’ geometry, and also for numerical simplification, both the ice floes’ and the brash ice’s geometries are further modified, i.e., each ice floe is represented by a minimum bounding polygon, and each piece of brash ice is reshaped into a circular disk of equivalent area. This numerical representation of sea ice is thereafter utilized to generate a corresponding ice field, bridging the gap between a natural ice field and its numerical applications, e.g., simulations involving ice–structure interactions.
The numerical representation of sea ice will result in overlaps between the sim-
plified ice pieces, as seen in Fig. 12. Identifying the floe-floe, floe-brash and brash-
brash overlaps is important when using the identified ice floes/brash ice as a start-
ing condition for the initialization of an ice field in the numerical simulation of
ice–structure interactions.23 In order to prepare a physically sound ice field, these overlaps should be resolved in advance.

Fig. 12. A close-up view of floe ice, brash ice and their corresponding numerical representation.
Currently, most of the relevant simulation tools are based on the discrete element
method (DEM) considering broken ice floes’ discrete nature. This includes the
traditional DEM’s application for calculating ice loads on structures,24,25 and an
emerging method, i.e., the non-smooth DEM.26,27 Given a field composed of discrete
objects, the application of both the traditional and non-smooth DEM generally
involve two numerical procedures, i.e., collision detection and collision response
calculations.28 The major difference between the traditional and non-smooth DEM
lies in the calculation of collision responses. The non-smooth DEM is formulated
on the level of velocities and impulses, whereas the traditional one is formulated
upon acceleration and forces.28,29 Comparatively, the non-smooth DEM is rather
efficient in resolving a large number of overlaps among bodies.29 Therefore, it is adopted in the present application of ice field generation.
The initialization stage in the ice field generation requires little consideration of the ice material’s behavior at contact, as long as the overlaps are resolved efficiently. Ice floes are treated as discrete bodies after importing the ice field’s numerical representation into the non-smooth DEM based simulator,30 and floe pairs involving
overlap are labeled with red color in Fig. 13(a). Afterwards, for each calculation iteration, the collision detection algorithm identifies the existing overlaps, and the collision responses are calculated and applied to eliminate them. Fig. 13(b) shows one snapshot of the ice field domain, within which overlaps are gradually resolved. Notably, to save computational resources, not all the ice floes are involved in the calculation at each iteration. Ice floes that have no overlap and are far away from the overlapped ice floe clusters are kept in “sleeping mode” in the adopted algorithm (see Fig. 13(b)).
Fig. 13(b) shows that the ice floes in the ice field’s bottom-left corner have more overlaps. Nevertheless, by applying the non-smooth DEM calculation procedures, all the overlaps are eventually resolved in Fig. 13(c), and the finally generated ice field is shown in Fig. 13(d). After resolving the overlaps, the exact locations of the ice floes in Fig. 13(d) and Fig. 13(a) are not the same, but the differences are minor. On the other hand, each ice floe’s shape and size, and the overall ice mass, are conserved.
Fig. 13. Ice floe field generation: (a) initial phase of the ice floe field with overlaps (legend: ice floes without overlap, with overlap); (b) calculation phase of the ice floe field with overlaps (legend: sleeping ice floes, active ice floes with overlap, active ice floes without overlap); (c) all overlaps are resolved; (d) finally generated ice floe field.

Similarly, brash ice can be imported into the same non-smooth DEM based simulator and treated as discrete bodies. From a non-smooth DEM calculation’s point of view, the simplification of each piece of brash ice as a circular disk of equivalent area makes the collision detection and the consequent collision response calculation much more efficient compared to arbitrary polygons. Given the amount of brash ice and its relatively small mass, this simplification is reasonable and has been adopted in previous studies.31–33
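As a toy illustration of why equivalent-area disks keep this step cheap (this is not the simulator used by the authors, and the function is hypothetical), the overlap test and a positional correction for two disks reduce to a single distance check:

import math

def resolve_disk_overlap(c1, r1, c2, r2):
    # c1, c2: (x, y) disk centers; r1, r2: radii. Push overlapping disks apart symmetrically.
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(dx, dy)
    overlap = r1 + r2 - d
    if overlap <= 0 or d == 0:
        return c1, c2                           # no overlap (or coincident centers: leave as is)
    nx, ny = dx / d, dy / d                     # unit vector from disk 1 to disk 2
    s = 0.5 * overlap
    return (c1[0] - nx * s, c1[1] - ny * s), (c2[0] + nx * s, c2[1] + ny * s)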
For the current demonstration, the identified brash ice and its numerical repre-
sentation are additionally imported to the ice field in Fig. 13(a). This is illustrated
in Fig. 14(a), which shows relatively more overlaps. An enlarged view within the
field center is also presented, where the circular disk-shaped bodies are the brash
ice representations.
For the current ice field’s composition, i.e., 58.00% ice floe and 4.85% brash ice, the calculation time to resolve all overlaps shows no significant difference between the cases with and without brash ice. In both cases, the bottleneck for the calculation time is the overlap resolution among the large ice floes in the bottom-left corner. However, it is expected that as the amount of brash ice increases, the calculation time would also increase and eventually become the decisive bottleneck for the calculation.
Fig. 14. Ice field generation with both floe ice and brash ice: (a) initial ice condition with overlaps; (b) final ice condition without overlaps.

5. Discussions and Further Work

5.1. Ice Pixel Detection

To calculate ice concentration, both the Otsu thresholding and the k-means clus-
tering methods separate classes of ice pixels from water pixels by dividing an image
into two or more classes unconditionally. This implicitly assumes that there must be some water and some ice in the image, and it will fail in the boundary cases when the ice concentration is 0% or 100%, which have to be dealt with as particular cases. How to choose the number of classes automatically for arbitrary image data is critical, and there is no explicit mathematical criterion that can be evaluated to find it. Neither method is adaptive to varying light conditions, varying shading, melt ponds, surface water on the ice, etc. Instead, they must be tuned to be as robust as possible to such variations for a given place and ambient conditions. Moreover, neither method includes detailed ice physics beyond the grayscale values of the image. Therefore, learning-based object detection could
be a future research direction, as long as there is a sufficiently rich dataset of images as the basis for algorithm development.

5.2. Ice Boundary Detection

To determine ice floe statistics and properties, the GVF snake algorithm is adopted
to identify individual floes due to its superior detection capability of weak bound-
aries. The GVF snake uses a diffusion of the gradient vectors of an edge map as the
source of its external forces, resulting in a smooth attraction field that is defined
in the whole image and spreads the influence of boundary concavities. However,
due to the inherent competition of the diffusion process, the capture range of the
strong edges may dominate the external force field. The external forces near the
weak edges, which are close to the stronger ones, will be too weak to pull the
snake toward the desired weak boundaries. As a result, the snake is likely to pass
over the weak edge and terminate at the corresponding strong edge. Hence, the under-detection of the blurred ice edges by the GVF snake results in the incorrect identification in the foggy bottom-left corner of Fig. 7(a). A solution to this would be to process this region as a particular case.18
Furthermore, the GVF snake-based method separates the connected ice floes and identifies individual floes one by one in the image, and may take from minutes to hours to complete depending on the image size and the number of ice floes in the image. It typically requires more computational time for larger images with more ice floes, which challenges real-time applications. Thus, an adaptive, faster, and parallelized algorithm for identifying individual ice floes needs to be developed.

5.3. Ice Field Generation

With the identified sea ice field parameters, involving the geometries and locations of ice floes and brash ice, a non-smooth DEM based method is adopted to assign basic physics to each ice floe and piece of brash ice. The digitalized ice field usually involves overlaps among different bodies, mainly because of the geometrical simplifications made for the ice floes and brash ice. The primary intention is hence to resolve these overlaps. As a demonstration, the non-smooth DEM successfully resolved all these overlaps among ice floes and brash ice. Notably, given the simplified circular disk-shaped numerical representation of brash ice, reaching the final ice field without overlaps demands minimal additional computational time. However, as the amount of brash ice keeps increasing, further simplification might be desired, e.g., modeling the brash ice as a viscous flow governed by conservation laws as a material collection.
References

1. C. Gignac, Y. Gauthier, J. S. Bédard, M. Bernier, and D. A. Clausi. High resolution RADARSAT-2 SAR data for sea-ice classification in the neighborhood of
Nunavik’s marine infrastructures. In Proc. Int. Conf. on Port Ocean Eng. Arct. Cond.
(POAC’11), Montréal, Canada (July, 2011).
2. A. V. Bogdanov, S. Sandven, O. M. Johannessen, V. Y. Alexandrov, and L. P. Bobylev,
Multisensor approach to automated classification of sea ice image data, IEEE Trans.
Geosci. Remote Sens. 43(7), 1648–1664, (2005). ISSN 1558-0644.
3. L.-K. Soh, C. Tsatsoulis, D. Gineris, and C. Bertoia, ARKTOS: An intelligent system
for SAR sea ice image classification, IEEE Trans. Geosci. Remote Sens. 42(1), 229–
248, (2004). ISSN 1558-0644.
4. D. Haverkamp and C. Tsatsoulis, Information fusion for estimation of summer MIZ
ice concentration from SAR imagery, IEEE Trans. Geosci. Remote Sens. 37(3), 1278–
1291, (1999). ISSN 1558-0644.
5. D. A. Rothrock and A. S. Thorndike, Measuring the sea ice floe size distribution, J. Geophys. Res. Oceans. 89(C4), 6477–6486, (1984). ISSN 2169-9291.
6. T. Toyota and H. Enomoto. Analysis of sea ice floes in the sea of Okhotsk using
ADEOS/AVNIR images. In Proc. Int. Symp. on Ice (IAHR’02), pp. 211–217, Dunedin,
New Zealand (Dec., 2002).
7. W. Lu, Q. Zhang, R. Lubbad, S. Løset, R. Skjetne, et al. A shipborne measurement system to acquire sea ice thickness and concentration at engineering scale. In Arctic Technology Conference, St. John's, Newfoundland, Canada (Oct., 2016).
8. Q. Zhang. Image Processing for Ice Parameter Identification in Ice Management. PhD
thesis, Norwegian University of Science and Technology, Trondheim, Norway (Dec.,
2015).
9. Q. Zhang and R. Skjetne, Sea Ice Image Processing with MATLAB® (CRC Press, Taylor & Francis, USA, 2018).
10. S. Ji, H. Li, A. Wang, and Q. Yue. Digital image techniques of sea ice field observation
in the bohai sea. In Proc. Int. Conf. on Port Ocean Eng. Arct. Cond. (POAC’11),
Montréal, Canada (July, 2011).
11. J. Millan and J. Wang. Ice force modeling for DP control systems. In Proc. of the Dynamic Positioning Conference, Houston, Texas, USA (Oct., 2011).
12. R. Hall, N. Hughes, and P. Wadhams, A systematic method of obtaining ice concen-
tration measurements from ship-based observations, Cold Reg. Sci. Technol. 34(2),
97–102, (2002). ISSN 0165-232X.
13. J. Haugen, L. Imsland, S. Løset, and R. Skjetne. Ice observer system for ice manage-
ment operations. In Proc. Int. Conf. on Ocean and Polar Eng. (ISOPE’11), Maui,
Hawaii, USA (June, 2011).
14. Q. Zhang, R. Skjetne, S. Løset, and A. Marchenko. Digital image processing for sea
ice observations in support to Arctic DP operations. In Proc. ASME Int. Conf. on
Ocean, Offshore and Arctic Engineering (OMAE’12), pp. 555–561, Rio de Janeiro,
Brasil (July, 2012).
15. N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst., Man, Cybern. 9(1), 62–66, (1979).
16. J. MacQueen. Some methods for classification and analysis of multivariate observa-
tions. In Proc. Fifth Berkeley Symp. on Math. Statist. and Prob., pp. 281–297, Berke-
ley, USA (June, 1967).
17. S. C. Basak, V. R. Magnuson, G. J. Niemi, and R. R. Regal, Determining structural
similarity of chemicals using graph-theoretic indices, Discrete Applied Mathematics.
19(1), 17–44, (1988). ISSN 0166-218X.


18. Q. Zhang and R. Skjetne, Image processing for identification of sea-ice floes and the
floe size distributions, IEEE Trans. Geosci. Remote Sens. 53(5), 2913–2924, (2015).
ISSN 1558-0644.
19. Q. Zhang, S. van der Werff, I. Metrikin, S. Løset, and R. Skjetne. Image processing
for the analysis of an evolving broken-ice field in model testing. In Proc. ASME Int.
Conf. on Ocean, Offshore and Arctic Engineering (OMAE’12), pp. 597–605, Rio de
Janeiro, Brasil (July, 2012).
20. C. Xu and J. L. Prince, Snakes, shapes, and gradient vector flow, IEEE Trans. Image
Process. 7(3), 359–369, (1998). ISSN 1057-7149.
21. M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models, Int. J.
Comput. Vis. 1(4), 321–331, (1988). ISSN 0920-5691.
22. L.-K. Soh, C. Tsatsoulis, and B. Holt. Identifying ice floes and computing ice floe
distributions in SAR images. In eds. C. Tsatsoulis and R. Kwok, Analysis of SAR
Data of the Polar Oceans, pp. 9–34. Springer, Berlin, (1998).
23. R. Lubbad and S. Løset. Time domain analysis of floe ice interactions with floating
structures. In Arctic Technology Conference, Copenhagen, Denmark (Mar., 2015).
24. M. Lau, K. P. Lawrence, and L. Rothenburg, Discrete element analysis of ice loads on
ships and structures, Ships and Offshore Structures. 6(3), 211–221, (2011).
25. M. Richard and R. McKenna. Factors influencing managed sea ice loads. In Proc. Int.
Conf. on Port Ocean Eng. Arct. Cond. (POAC’13), Espoo, Finland (June, 2013).
26. R. Lubbad and S. Løset, A numerical model for real-time simulation of ship–ice in-
teraction, Cold Reg. Sci. Technol. 65(2), 111–127, (2011). ISSN 0165-232X.
27. I. Metrikin and S. Løset. Nonsmooth 3D discrete element simulation of a drillship in
discontinuous ice. In Proc. Int. Conf. on Port Ocean Eng. Arct. Cond. (POAC’13),
Espoo, Finland (June, 2013).
28. M. G. Coutinho, Guide to Dynamic Simulations of Rigid Bodies and Particle Systems.
(Springer Science & Business Media, 2012).
29. M. Servin, D. Wang, C. Lacoursière, and K. Bodin, Examining the smooth and nons-
mooth discrete element approaches to granular matter, Int. J. Numer. Meth. Eng. 97
(12), 878–902, (2014). ISSN 1097-0207.
30. R. Yulmetov, S. Løset, and R. Lubbad. An effective numerical method for generation
of broken ice fields, consisting of a large number of polygon-shaped distinct floes. In
Proc. Int. Symp. on Ice (IAHR’2014), pp. 829–836, Singapore (Aug., 2014).
31. A. Konno. Resistance evaluation of ship navigation in brash ice channels with physi-
cally based modeling. In Proc. Int. Conf. on Port Ocean Eng. Arct. Cond. (POAC’09),
Lulea, Sweden (June, 2009).
32. A. Konno, A. Nakane, and S. Kanamori. Validation of numerical estimation of brash
ice channel resistance with model test. In Proc. Int. Conf. on Port Ocean Eng. Arct.
Cond. (POAC’13), Espoo, Finland (June, 2013).
33. C. Gignac, Y. Gauthier, J. S. Bédard, M. Bernier, and D. A. Clausi. Numerical inves-
tigation of effect of channel condition against ships resistance in brash ice channels.
In Proc. Int. Conf. on Port Ocean Eng. Arct. Cond. (POAC’11), Montréal, Canada
(July, 2011).

CHAPTER 2.4

APPLICATIONS OF DEEP LEARNING TO BRAIN SEGMENTATION AND LABELING OF MRI BRAIN STRUCTURES

Evan Fletcher∗ and Alexander Knaack†


IDeA Laboratory, Department of Neurology,
University of California, Davis, California, USA
Email: ∗ emfletcher@ucdavis.edu, † abknaack@ucdavis.edu

Deep learning implementations using convolutional neural nets (CNNs) have re-
cently demonstrated promise in many areas of medical imaging. This chapter
presents two aspects of CNN use with magnetic resonance (MRI) brain images.
First, we describe production-level output of brain segmentation from whole head
images, a crucial processing task that is resource-intensive under standard CPU
methods and human quality control. With an extremely large archive of MRIs for
training and testing, our segmentation performs robustly across multiple imag-
ing cohorts, with greatly increased throughput. Second, we present robust brain
structure edge labeling, which enables studies of greater statistical power than those based on Canny edge detection or hand-crafted probabilistic algorithms.

1. Introduction

Deep learning refers to a variety of techniques for extracting features from data,
usually by neural networks involving a hierarchy of many layers.1,2 This chapter
reports on our use of deep learning via convolutional neural nets (CNNs)3 to au-
tomate and improve the robustness and statistical power of two necessary tasks
for processing structural brain magnetic resonance images (MRIs). CNNs have re-
cently been used in a variety of applications for medical image processing.4–6 CNN
applications have the potential to equal or exceed expert medical image evaluation
and to greatly speed up computationally intensive aspects of image processing.
In this chapter we describe two projects using CNNs for rapid and robust identi-
fication of brain locations within structural MRIs. First, segmentation of the brain
from the whole head (i.e. skull-stripping) is an indispensable task in any standard
pipeline of MRI processing. However, it can be resource intensive, thus creating a
processing bottleneck in the analysis of large data sets. For example, using a state-
of-the-art technique of atlas-matching,7 in which at least 10 carefully segmented
atlas images are nonlinearly matched to a target, skull-stripping a single whole
head MRI requires around 27 CPU hours and must be followed by human qual-
ity control (QC) taking typically an hour or more. In answer to this, we present


a method that greatly reduces processing time while robustly performing over a
variety of imaging cohorts. Next, brain structural edge labeling is vital for many
analyses. Here we will describe the use of CNN-derived edge labels to enhance lon-
gitudinal registration of same-subject scans, an important task in the burgeoning
field of longitudinal analysis. This approach leads to improved robustness of edge
labeling and increased statistical power for computing longitudinal atrophy rates.
Both of the tasks that we focus on in this chapter fall under the heading of
segmentation, or the labeling of 3D voxel locations in a structural brain MRI.2 The
gold standard for this process has typically been manual human labeling from draw-
ing on images slice-by-slice, but this is slow, subject to human error and effectively
limits the practical size of useful data sets of labeled images. The promise of deep
learning derives from its potential to rival or exceed human expert recognition of
brain structures, while performing automatically or with reduced need for human
quality control, in much shorter times than are needed for manual labeling. Chal-
lenges to the success of deep learning in medical image processing stem from the
relative dearth of high-quality ground truth labeling (i.e. the very problem that
deep learning aims to address) which is necessary for training to achieve robust
performance given varied image quality and scanner characteristics.2

2. Methods

This section outlines our deep learning hardware and software environment, followed
by specific methods for each of the two tasks we addressed.

2.1. Hardware and software


2.1.1. Hardware
Training, testing, preprocessing, postprocessing, and metric calculations were per-
formed on an NVIDIA DGX Station using a 20-core Intel Xeon CPU and a single
Tesla V100 GPU with 16GB of GPU memory (of the four on the station).

2.1.2. Software
The TensorFlow platform (https://www.tensorflow.org/) was used to implement
and train the neural network, as well as to calculate the similarity and volume
metrics. The relevant steps for the analyses presented below were performed by in-
house code developed as part of our image processing suite. These were: multi-atlas
brain mask extraction for ground truth training, testing and validation (Sec. 2.2.2
and Sec. 2.2.3), structural edge labeling using our implementation of a 3D Canny
labeling algorithm for edge ground truth labels (Sec. 2.3.2), and longitudinal same-
subject sequential scan registration to compute patterns of atrophy (Sec. 2.3.4),
followed by statistical analysis of the results (Sec. 2.3.6 and Sec. 2.3.7). All addi-
tional processing was done in Python using built-in as well as the following external
modules: NiBabel, NumPy, SciPy, scikit-image, Pydicom, pytoml, and tqdm. Re-
sult analysis was performed using Pandas and plotted using Bokeh via HoloViews
in JupyterLab.
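As an example of the kind of similarity metric that can be computed on this platform (whether it matches the exact metrics used here is an assumption), a Dice overlap between a predicted and a ground-truth binary mask can be written directly in TensorFlow; the function name is hypothetical.

import tensorflow as tf

def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    # pred_mask, true_mask: binary (0/1) 3D masks as tensors or NumPy arrays.
    pred = tf.cast(tf.reshape(pred_mask, [-1]), tf.float32)
    true = tf.cast(tf.reshape(true_mask, [-1]), tf.float32)
    inter = tf.reduce_sum(pred * true)
    return (2.0 * inter + eps) / (tf.reduce_sum(pred) + tf.reduce_sum(true) + eps)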

2.2. Brain segmentation

In brain segmentation or “skull-stripping” of structural MRI images, the 3D brain image is extracted from the whole head prior to further processing steps. A binary
mask of brain locations is computed and used to cut the image, leaving the brain,
as illustrated in Fig. 1. Until recently, the most common approaches for brain
extraction have used variations on the techniques of growing a volume outwards to
the brain edge or fitting deformable mesh models to the brain surface.8–10 These
applications often require the user to specify input parameters governing the quality
of the skull-strip. Outcomes can vary by scanner and may contain systematic errors
that need manual cleanup. Another type of approach is based on multi-atlas7
matching, which registers an array of carefully labeled atlas brain images to the
target image, then uses a voting scheme with refinements to estimate the brain
mask in the target. This technique avoids the need for preset parameters, although
human quality control (QC) is still required. We have used this approach in our
laboratory with good results, but it is very computationally expensive.
In contrast, CNN architectures are trained to recognize likely brain voxels, pro-
ducing brain membership probability masks as output. There have been at least two
recent articles devoted specifically to CNN applications for skull stripping,11,12 but
these articles were focused on proof-of-concept; with small training sets, they did
not have the breadth of data to address the diversity of MRI encountered in large
scale production. In this section we outline our methods for achieving production
scale brain segmentation.

(a) Whole head MRI. (b) Brain location mask. (c) Extracted brain.

Fig. 1. Illustration of brain segmentation.


2.2.1. Neural network architecture

Our architecture is an end-to-end volumetric CNN adapted from a network designed
for vascular boundary segmentation in 3D computed tomography scans.13 It takes
as input whole-head 3D structural MRI volumes and produces probability maps
estimating the likely brain membership of each voxel location in the MRI. Binary
brain segmentation masks are derived by thresholding the probability maps. The
encoder consists of 13 convolutional/ReLU layers divided into 5 stages of decreasing
resolution. Stages are connected, and scale is decreased by a factor of 2 at each
stage, via 4 max-pooling layers. The decoder consists of 6 convolutional layers.
Resolution is restored during decoding using trilinear interpolation in the form of 4
convolutional transpose layers. A diagram of the CNN architecture is seen in Fig. 2
and a summary of its characteristics in Table 1.

Table 1. CNN architectural detail.

Stage Layer Count Filter Size Filter Count

1 2 3×3×3 32
2 2 3×3×3 128
3 3 3×3×3 256
4 3 3×3×3 512
5 3 3×3×3 1024
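
As a rough illustration of the stage layout in Fig. 2 and Table 1, the following is a minimal, hypothetical TensorFlow/Keras sketch of a deeply supervised 3D encoder with per-stage side outputs and a fused prediction. The function and variable names (build_sketch, conv_stage) and the exact wiring are our assumptions, not the authors' released code.

import tensorflow as tf
from tensorflow.keras import layers

def conv_stage(x, filters, n_convs):
    # Repeated 3x3x3 convolution + ReLU blocks within one resolution stage.
    for _ in range(n_convs):
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_sketch(input_shape=(None, None, None, 1)):
    # Stage conv counts and filter counts follow Table 1; side outputs follow Fig. 2.
    stage_filters, stage_convs = [32, 128, 256, 512, 1024], [2, 2, 3, 3, 3]
    inputs = layers.Input(shape=input_shape)
    x, heads = inputs, []
    for i, (f, n) in enumerate(zip(stage_filters, stage_convs)):
        x = conv_stage(x, f, n)
        head = layers.Conv3D(1, 1, padding="same")(x)            # 1x1x1 reduction
        if i > 0:                                                # restore input resolution
            head = layers.Conv3DTranspose(1, 2**i, strides=2**i, padding="same")(head)
        heads.append(layers.Activation("sigmoid")(head))         # per-stage probability map
        if i < len(stage_filters) - 1:
            x = layers.MaxPool3D(pool_size=2)(x)                 # halve resolution between stages
    fused = layers.Conv3D(1, 1, padding="same", activation="sigmoid",
                          name="fused")(layers.Concatenate()(heads))
    return tf.keras.Model(inputs, [fused] + heads)

model = build_sketch()  # expects whole-head volumes padded to multiples of 2**5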

2.2.2. Ground truth data for training and testing

For training and testing, we used 11,663 structural T1-weighted MRI brain scans
selected from our archive of almost 26,000 scan sessions, representing data from
multiple national imaging studies. The composition of this set is detailed in Table 2.
The diversity of imaging cohorts by subject demographics and MRI acquisitions
allowed us to train for robust segmentation in the face of image variability due to
multiple factors.
Each structural MRI in our laboratory has an accompanying brain mask created
by an automated multi-atlas segmentation procedure7 followed by human quality
control. Our brain segmentation protocol includes the entire intracranial cavity
(ICC) defined out to the pia mater. This deviates from many standard brain seg-
mentations, which stop at the brain boundary. By segmenting the larger space, we
obtain a more robust and invariant measure of head size over time.
We started with previously generated ICC masks via this method as our ground
truth (GT). Our full training set included about 90% as masks from atlas-based
segmentation, supplemented by about 10% CNN-generated masks from previous
iterations of training. A handful of volumes were excluded from the set due to large
slice thickness, excessive noise (e.g. ghosting), or severe pathologies such as large
tumor or stroke.
Table 2. Data set formulation by cohort and subset.

Cohort                     Train   Eval   Test   Total

90+                          124     32     32     188
ADC                          456     58     62     576
ADNI                        3127    391    391    3909
BIOCARD                      260     33     33     326
CHAP                         230     29     27     286
COINS                         87     30     30     147
Framingham Heart Study      2851    357    357    3565
Jackson Heart Study            -      -     50      50
K-STAR                         -      -     28      28
KHANDLE                       44     22     22      88
NACC                        1295    162    162    1619
SOL-INCA                     360     46     46     452
VCID                         343     43     43     429

Total                       9177   1203   1283   11663

2.2.3. Training the CNN

Network training was deeply supervised, with loss function penalties calculated
at each stage as well as the final fused prediction. Training example pairs were
sampled one at a time in round-robin fashion by cohort. Individual cohort sets
were continually cycled to maintain influence until a fixed number of training steps
had been completed. Training took about 32 hours using the following hyperparameters
(a minimal training-step sketch in code follows the list):

• Loss function: Summed cross-entropy of each stage and the fused prediction
compared to ground truth.13
• Optimization: Exponential moving average of a stochastic gradient descent
with Nesterov momentum.14
• Learning rate: 10−2
• Momentum: 0.9
• Moving average decay: 0.999
• Batch size: 1
• Steps: 25,000
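
The sketch below illustrates how these settings fit together in a single eager-mode training step; the handles model, image, and gt are placeholders, and this is an assumed arrangement of deep supervision with summed cross-entropy and a weight moving average, not the authors' production code.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()                  # per-head cross-entropy vs. ground truth
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9, nesterov=True)
ema = tf.train.ExponentialMovingAverage(decay=0.999)        # moving average of the weights

def train_step(model, image, gt):
    # image: one pre-processed volume (batch size 1); gt: its binary mask.
    with tf.GradientTape() as tape:
        predictions = model(image, training=True)            # [fused, stage 1..5 side outputs]
        loss = tf.add_n([bce(gt, p) for p in predictions])    # deeply supervised: sum over all heads
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    ema.apply(model.trainable_variables)                      # update exponential moving average
    return loss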

2.2.4. Image pre- and post-processing

Pre-processed images are auto-cropped and padded to achieve minimal background
while conforming to lattice dimensions divisible by 2⁵, corresponding to the number
of image resolution stages in the CNN (see Fig. 2). Intensities are normalized to
unit standard deviation with zero mean after an image is cropped, but before it is
zero padded. To preserve the original volumes and save space, pre-processing steps
are calculated non-destructively, and applied on the fly right before neural network
processing.

Fig. 2. Neural network diagram tracking the 3D volume through the 5-stage encoder and fuse
decoder: 3 × 3 × 3 convolution followed by a ReLU nonlinearity, 2 × 2 × 2 max pooling layer,
1 × 1 × 1 convolution reduction, upsampling convolutional transpose.

Prediction maps for voxel brain membership from the CNN are binarized using
a threshold of p > 0.34 to form the brain segmentation mask.
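
A NumPy sketch of this pre- and post-processing is given below; the cropping criterion (voxels above zero) and the function names are our assumptions, since the exact auto-crop rule is not specified.

import numpy as np

def preprocess(volume, pad_multiple=2**5):
    """Crop background, z-score the cropped region, then zero-pad each axis
    to a multiple of 2**5 (the number of resolution stages)."""
    nonzero = np.argwhere(volume > 0)                 # assumed background criterion
    lo, hi = nonzero.min(axis=0), nonzero.max(axis=0) + 1
    cropped = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]].astype(np.float32)
    cropped = (cropped - cropped.mean()) / (cropped.std() + 1e-8)   # normalize before padding
    pads = [(0, (-dim) % pad_multiple) for dim in cropped.shape]
    return np.pad(cropped, pads, mode="constant", constant_values=0.0)

def binarize(probability_map, threshold=0.34):
    """Threshold the CNN brain-membership probabilities to a binary mask."""
    return (probability_map > threshold).astype(np.uint8)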

2.2.5. Measures for evaluating the brain mask predictions

We evaluated the performance and quality of the CNN brain mask predictions using
three comparison measures: model generalization, model consistency and resource
efficiency. Model generalization refers to the ability of the trained neural net to
match ground truth masks of the test samples across a variety of imaging cohorts.
This is important because imaging cohorts vary by characteristics of scanner and
participants, and we want to achieve consistently good matches regardless of cohort.
We used the Dice similarity coefficient (DSC)11,12,15 for match quality between CNN
and ground truth masks. The DSC is defined as follows:

DSC = 2|A ∩ B| / (|A| + |B|),        (1)

where A is the set of predicted voxels in the CNN mask and B is the set of voxels
in the GT mask.
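
For reference, Eq. (1) reduces to a few lines of NumPy; the sketch below assumes binary masks of equal shape, and the function name is ours.

import numpy as np

def dice_coefficient(pred_mask, gt_mask):
    """Dice similarity coefficient of Eq. (1) for two binary masks A and B."""
    a, b = pred_mask.astype(bool), gt_mask.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
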
Model consistency is the ability to generate brain masks for longitudinal same-
subject repeated scans that are close in volume. This is crucial because of our
protocol segmenting the ICC volume, which unlike brain volume is constant over
time, meaning that estimated ICC volumes should ideally be unchanging over re-
peated scans. We used the maximum volumetric differences over all scans of a
subject to assess this consistency. Resource efficiency encompasses the two aspects
of computational and human resource time. Our current atlas-based brain mask
computations require about 27 CPU hours of computation followed by an average of
45-75 minutes of human QC. We compare the corresponding times for computation
and human QC of the CNN masks.

2.3. Structural edge detection

Our second analysis examined the ability of CNN brain structural edge predictions
to enhance the sensitivity of brain atrophy rate computations. In previous arti-
cles, we showed that supplementing structural MRI images with estimates of edge
presence increased our sensitivity and localization of atrophy maps, leading to aug-
mented statistical power for detecting differences in atrophy rates between impaired
and normal cohorts.16,17 In the current chapter, we investigate whether CNN edge
predictions further enhanced these characteristics.

2.3.1. Neural network architecture

The CNN architecture for edge recognition was modified from that for brain segmen-
tation described above (see Table 1 for reference to the segmentation architecture).
We reduced the number of stages from 5 to 3, with the intention of loosening the
context restrictions imposed by later stages in the mask segmentation architecture:
we wanted edge patterns learned in one part of the brain to be recognizable in
other regions. We doubled the number of filters in both layers of the first stage, to
increase the network capacity for recognizing variations in small detail, given that
structural edges have finer and more varied characteristics than whole brain masks.

2.3.2. Ground truth data for training and testing

Ground truth training data consisted of 10,910 edge-labeled structural MRI images,
previously skull-stripped, from the ADNI, ADC, Framingham and VCID cohorts.
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu)
was launched in 2003 as a public-private partnership. The primary goal of ADNI
has been to test whether serial MRI, positron emission tomography, other biologi-
cal markers, and neuropsychological assessment can be combined to measure the
progression of mild cognitive impairment and early AD. The principal investigator
of ADNI is Michael Weiner, MD, VA Medical Center and University of California,
San Francisco. For current information, see www.adni-info.org.
For testing, we used a further set of 1,070 subjects from the ADNI cohort which
had two serial longitudinal scans with at least one year interscan separation, since
the goal was not only to evaluate qualities of edge prediction, but their effectiveness
when used to augment longitudinal registration. Edge labeling for ground truth was
performed by an in-house implementation of 3D Canny edge detection,18 aimed at
delineating brain tissue boundaries between white matter (WM) and gray matter
(GM), and GM with cerebrospinal fluid (CSF).

2.3.3. Training the CNN

For ground truth, the Canny edge labels were scaled to the interval [0,1] and then
thresholded at 0.1 to produce a binary edge mask. The only modification to the
training protocol for brain masks (see Sec. 2.2.3) was to increase the number of
training steps to 30,000. Testing was performed both by visual inspection of the
CNN-predicted edge maps and by evaluating their ability to enhance longitudinal
image registration (see Sec. 2.3.5 and Sec. 2.3.7 below).

2.3.4. Edge-enhanced longitudinal same-subject scan registration

The study of brain change over time requires precise computations of local volume
change rates (tissue atrophy or CSF cavity expansions). These are accomplished
using non-linear registration between two structural MRI scans of the same subject
separated by a time interval. Local volume changes are computed by the loga-
rithm of the determinant of the jacobian 3 × 3 matrix of the deformation field
partial derivatives (log-jacobians). The jacobian determinant yields volume change
as multiplicative factor at each image voxel, and the log-transform converts it to
a distribution centered at 0, with negative values indicating contraction (atrophy)
and positive values expansions. For small magnitudes of the determinant, the log-
jacobian approximates the local percentage volume change. Deformation fields are
computed via Navier-Stokes equations driven by a force generated from image mis-
match; solutions are velocities which are integrated to yield spatial deformations
needed to register the images. A full explanation is provided in our previous arti-
cles.16,17
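
As an illustration of the log-jacobian computation described above (a finite-difference sketch only, not the authors' Navier-Stokes registration pipeline), assuming the deformation field is stored as the mapped coordinates phi(x):

import numpy as np

def log_jacobian(phi, spacing=(1.0, 1.0, 1.0)):
    """Log of the Jacobian determinant of a deformation phi (shape (X, Y, Z, 3)),
    where phi holds the mapped coordinates (identity plus displacement).
    Negative values indicate local contraction (atrophy), positive values expansion."""
    # grads[i][j] = d phi_i / d x_j, estimated by central differences
    grads = [np.gradient(phi[..., i], *spacing) for i in range(3)]
    J = np.stack([np.stack(g, axis=-1) for g in grads], axis=-2)   # (..., 3, 3)
    return np.log(np.linalg.det(J))
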
There we demonstrated that the accuracy and resulting statistical power of
the log-jacobian atrophy maps could be improved by incorporating estimates of
tissue or structural boundary likelihood at each point into the force field, since the
computed deformations rely heavily on mismatches of edges that have moved.16,17
In brief, the driving force F in the Navier-Stokes equation at each voxel is a weighted
sum of components derived from the gradient F1 of the image intensity mismatch
metric and the gradient F2 of a modulating penalty function. The weights use the
probability P (edge) of a structural boundary at a voxel to 1) allow strong effects of
the mismatch gradient while minimizing the penalty at highly likely edge locations,
and 2) dampen the driving force, which might be incorrectly high due to image
noise, while allowing full strength of the penalty gradient in areas more likely to be
inside homogeneous tissue:
F = P F1 + λ(1 − P )F2 , (2)
where λ is a penalty weighting factor.
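
Voxelwise, Eq. (2) is a simple weighted blend; a sketch (with F1 and F2 as precomputed gradient fields and p_edge as the edge probability map, all names ours) is:

import numpy as np

def driving_force(F1, F2, p_edge, lam=1.0):
    """Weighted driving force of Eq. (2). F1, F2 have shape (X, Y, Z, 3);
    p_edge has shape (X, Y, Z). Edge-likely voxels follow the mismatch gradient F1,
    while voxels likely inside homogeneous tissue follow the penalty gradient F2."""
    p = p_edge[..., np.newaxis]          # broadcast over the vector components
    return p * F1 + lam * (1.0 - p) * F2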

2.3.5. Tests of the CNN edge-predictions used in volume change


The previous articles demonstrated that increasingly sophisticated estimates of edge
probabilities16,17 led to increasing statistical power in population studies where local
brain atrophy rates were used to differentiate cognitively normal control subjects
and those diagnosed with Alzheimer’s disease. In the first article,16 the probability
P of edge presence was simply the cumulative distribution function of the intensity
gradient magnitude at each voxel; this was shown to improve the specificity and
power of longitudinal registration over no edge probability estimates. The second
project17 demonstrated that a more sophisticated algorithm combining intensity
gradients with boundary estimates based on tissue segmentation further improved
the registrations. We shall refer to this as our Grad-Enhanced method, for compar-
ison with the method using CNN estimates of edge labels in this chapter.
In summary, the current chapter reports corresponding results for longitudinal
atrophy calculations in which P is now an edge probability estimate generated by
our CNN architecture. We hypothesized that the CNN probabilistic edge predic-
tions, using generalization that could improve upon our earlier edge estimates as
well as the Canny edge ground truth data, would further sharpen the localization
of atrophy maps and their consequent statistical power.

2.3.6. Voxel-based analysis in template space


Analysis of the quality of CNN edge predictions used voxel-based statistics per-
formed in a common minimal deformation template space. We transformed all
native space voxel edge maps and longitudinal log-jacobian atrophy maps to an
age-appropriate synthetic structural brain image (minimal deformation template,
or MDT)19 using nonlinear B-spline transformations.20 In MDT, edge map pre-
dictions and log-jacobian maps could be analyzed at each voxel individually across
the entire population. Clusters of contiguous voxels having significant effects for
cognitive differences were computed using non-parametric corrections for multiple
comparisons.21

2.3.7. Statistical evaluations


We performed power analyses for our previous Grad-Enhanced and current CNN
methods, comparing the minimum sample sizes needed to detect differences in
brain atrophy of cognitively impaired subjects above levels in normal aging. We
used change computations for normal subjects (CN) and mild cognitive impairment
(MCI) or Alzheimer’s disease (AD) groups, over statistically defined regions of inter-
est (statROIs),22 defined as brain regions that best characterize atrophy difference
rates between two groups. We computed statROIs for the pairs AD vs. CN and
AD vs. MCI using non-parametric cluster size permutation testing21 with 1000 it-
erations, to find significant (p < 0.05, corrected) clusters of voxels whose atrophy
difference T-values were at least 5 for the pair of groups. The statistical power es-
timate was based on the minimum sample size n80 needed to detect a 25% increase
in atrophy in an impaired cohort beyond that of CN with 80% probability:22,23
n80 = 2σ²impaired (z0.975 + z0.80)² / (0.25(μimpaired − μCN))²,        (3)
where μ is the mean statROI atrophy and σ the standard deviation for a given
group (impaired or CN), and for b = 0.975 or 0.80, zb is the threshold defined by
P (Z < zb ) in the standard normal distribution.
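
A small SciPy sketch of Eq. (3) is shown below, with the 25% factor and 80% power as defaults; the inputs are assumed to be per-subject statROI atrophy values, and the function name is ours.

import numpy as np
from scipy.stats import norm

def n80(atrophy_impaired, atrophy_cn, reduction=0.25, power=0.80, alpha=0.05):
    """Minimum sample size of Eq. (3): subjects needed to detect a `reduction`
    fraction of the impaired-minus-CN mean atrophy difference with the given power."""
    z = norm.ppf(1.0 - alpha / 2.0) + norm.ppf(power)        # z_0.975 + z_0.80
    effect = reduction * (np.mean(atrophy_impaired) - np.mean(atrophy_cn))
    return 2.0 * np.var(atrophy_impaired, ddof=1) * z**2 / effect**2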

3. Results

This section presents results of our current applications of CNN learning to the areas
of brain segmentation and brain structural edge labeling for improving longitudinal
image registration.

3.1. Brain segmentation


In this section on brain segmentation, we focus on results in the measures of model
generalization, model consistency and resource efficiency outlined in Sec. 2.2.5.

3.1.1. Model generalization


To test model generalization across diverse cohorts, we used the DSC between CNN
ICC mask predictions and ground truth. Results are shown in Fig. 3. Across
test data in 13 cohorts (see Table 2 for numbers of test subjects in each cohort),
whisker plots of matching scores show means uniformly above the best recent report
for brain mask performance12 (lower, lightly dashed line) and approaching and
occasionally exceeding the estimated mean of human QC vs. human consensus on
a small demonstration set (upper, heavily dashed line).

3.1.2. Model consistency


To test CNN model consistency in same-subject repeat scans, we examined longi-
tudinal scans for 117 subjects in our ADC cohort, containing 259 scans in total.
We computed the variability of the maximum pairwise difference statistic between
two masks of a given subject. Results are displayed in Fig. 4. This shows smaller
mean and max differences (better volume consistency) for the CNN masks than for
human QC ground truth masks. Since head size (and hence ICC) does not change
over time, the mask volume predictions of sequential scans in one subject should
be the same. Their variability is a measure of the consistency of the segmentation
method.

Fig. 3. Whisker plot distributions of the DSC between CNN masks and ground truth grouped by
cohort. The upper heavily dashed line at DSC = 0.984 is our mean estimated level of human
inter-rater performance vs consensus. The lower (lightly dashed) line at 0.977 is the best previously
reported mean of prediction mask vs. ground truth using LPBA40 and OASIS data.12

Fig. 4. Repeat scan ICC volume ranges (max-min per subject) of 117 subjects across 259 scans,
for ground truth masks finished by human QC (left) and CNN masks (right).

3.1.3. Resource efficiency


Resource efficiency is the total time, including both computational time and human
QC, to produce a brain segmentation mask from a whole head structural MRI.
Results comparing times for our standard multi-atlas protocol with those of the
CNN masks are displayed in Table 3. These indicate that CNN prediction improves
computation time per subject by more than 3 orders of magnitude (from 15–27
hours to roughly 10 seconds) and the resulting smoothness and robustness of the
raw CNN masks also reduces human QC time (from 45–75 minutes to about 10).

Table 3. Mean resource efficiency (minutes).

Task Multi-atlas CNN

Preprocessing - 2
Mask generation 900-1600 0.15
Human QC 45-75 10

Totals 945-1675 12

3.2. Edge labeling


This section examines characteristics of CNN brain structural edge labels by com-
parison with Canny GT labels. It then presents results of incorporating CNN prob-
abilistic brain edge prediction labels into longitudinal registration of same-subject
repeated scans. We focus on sensitivity, localization and statistical power of the
longitudinal registrations as outlined in Sec. 2.3.5 and Sec. 2.3.7. We evaluated
CNN edge labeling and its performance for augmenting longitudinal registrations
using 1,070 ADNI subjects whose interscan intervals were at least one year. Results
presented here depend on voxel-based statistics in a common brain template space
as outlined in Sec. 2.3.6.

3.2.1. Characteristics of CNN edge labels


We examined the quality of CNN edge labels using images transformed into template
space via nonlinear B-spline registration and averaged. The average of our 1,070
CNN edge labeling of baseline images was visually almost indistinguishable from
that of averaged Canny ground truth edge masks. Since ground truth labels all
had the value 1 at edge voxels, while CNN predictions had values between 0 and
1, the average images are also in this range. We display the average CNN edge
maps overlaid on the template brain in Fig. 5(a). Then Fig. 5(b) displays strength
differences (significant, after testing for multiple comparisons) between CNN and
Canny GT edge labels.
Low average edge p-values (red in Fig. 5(a)) represent the combined effects
of regions where the CNN edge predictions were of lower certainty and inevitable
imperfections due to variability in B-spline matching of individual cortical configurations
to the MDT target.

Fig. 5. Mapping of average CNN edge labels. (a) Averaged CNN edge label predictions: thresholded
map (p > 0.30) on the template brain; red indicates p-values near 0.30, yellow near 0.85.
(b) Average CNN edge labels overlaid by significant differences with Canny labels. Warm colors:
CNN > Canny. Cool: Canny > CNN.

The average map of Canny GT mappings (image not
shown) was visually almost indistinguishable from that of CNN in Fig. 5(a). The
effects due to varying certainty in edge predictions are very real, as seen by the fact
that the CNN and Canny predictions differed in systematic ways. Fig. 5(b) shows
extensive areas where the average CNN edge predictions were actually stronger than
the Canny, even though the latter consisted of binary labels, suggesting that there
were regions where the CNN pattern generalization produced consistently more ro-
bust edge labeling. We specifically note in Fig. 5(b) portions of the edges for the
brain and ventricle boundaries.

3.2.2. Characteristics of longitudinal registration incorporating CNN labels


We compared longitudinal same-subject registrations using CNN edge labels and
our Grad-Enhanced method. Results are displayed in Fig. 6. While both methods
indicated two-year atrophy rates in similar areas (see Fig. 6(a) for atrophy com-
puted with CNN edges, Fig. 6(b) for atrophy from Grad-Enhanced; both are to
same color scale), the CNN estimates showed greater localization of atrophy rates
to the cortical brain surface, ventricular edges and subcortical nuclei (warm regions
in Fig. 6(c)) which may be more physiologically accurate. Areas of significant dif-
ferences between the methods were computed using non-parametric corrections for
multiple comparisons, described previously. The high atrophy rates shown for the
Grad-Enhanced method in white matter regions (blue-purple in Fig. 6(c)) are less
biologically plausible and may be a consequence of that method’s weaker localiza-
tion to gray matter structures.

(a) Average atrophy rates computed using CNN edge predictions. Blue indicates severe (3-4%
biannual loss) and green modest atrophy (around 1%). (b) Average atrophy computed using
Grad-Enhanced edge estimates. Same color scales and coding as in (a). (c) Significant atrophy
differences for CNN vs. Grad-Enhanced edges. Coding: CNN atrophy > Grad-Enhanced (warm)
and Grad-Enhanced > CNN (cool).

Fig. 6. Mapping of average longitudinal atrophy patterns.

3.2.3. Statistical power computations

We computed the n80 minimum sample size measure (Sec. 2.3.7, Eq. (3)) using
statROIs capturing the regions of significant atrophy difference between AD vs.
CN, and AD vs. MCI cohorts. We did not compare CN and MCI cohorts because
we found little difference between these via either the CNN or Grad-Enhanced
methods.
Although the statROIs for each group difference covered broadly the same re-
gions in both methods, the CNN-derived statROIs showed both more precision to
small structures and more consistent coverage of regions whose atrophy is known to
be associated with cognitive decline. Resulting minimum sample sizes are displayed
in Table 4. For each comparison, the CNN method of atrophy calculation shows a
smaller sample size.

Table 4. Minimum sample size computations.

Method AD vs. CN AD vs. MCI

CNN 152 281
Grad-Enhanced 180 310
4. Discussion

We have outlined the methods and presented results for two applications of CNN-
based deep learning to structural recognition tasks in 3D MRI brain images. With
minor modifications, our CNN architectures were successfully adapted to segment-
ing the brain from the whole head, and to labeling structural boundary edges.
The products of each application demonstrated a robustness and consistency that
exceeded previous efforts, making them useful both for fast and consistent image
processing (in brain segmentation) and statistically powerful computations of lon-
gitudinal atrophy rates. These illustrate the flexibility and adaptability of deep
learning in disparate areas of brain imaging.

4.1. Brain segmentation

Brain segmentation is a necessary and routine step of any image processing pipeline.
Many solutions have been developed over a 20-year period (Ref. 9 provides a good
review). Prior to the application of deep learning to brain segmentation, most ap-
proaches required numerical parameter inputs in order to accommodate the charac-
teristics of particular scans. For example, Brain Extraction Tool (BET)8 has user
inputs for “fractional intensity threshold” that affects the size of the output seg-
mented mask and “threshold gradient” that influences the relative size of estimates
at the top and bottom of the brain. While default values can be used, they do not
work on every scan and systematic errors can result. Although computation times
for many algorithmic approaches are fast (for BET, about 2 seconds; see Table III of
Ref. 12 for an overview of CNN and non-CNN segmentation times), accuracies are
variable and the search continues for an algorithm that is robust across scanners.
An alternative to which we have turned in our laboratory is multi-atlas matching,7
which gives consistently acceptable results when paired with human QC. But it is
resource-intensive, thereby posing a limitation for processing large amounts of data.
In response, we aimed to develop a deep learning implementation with increases
of throughput over our current methods while maintaining robust generalizations
across cohorts and longitudinal consistency. The results of this chapter suggest
that these goals are feasible. Other recent efforts have demonstrated the utility of
CNN approaches to brain segmentation.11,12 However, these used relatively small
training sets which were of limited variety and segmentation quality. One of the
limitations for any deep learning application in medical imaging has been a dearth
of high quality ground truth. In this regard, our laboratory archives afford a rare
instance of large numbers (see Table 2) of high-quality GT generated from a com-
bination of multi-atlas matching and human QC, using our established protocol of
segmenting the ICC. This has given us a close-to-optimal setting for deep learn-
ing. Drawn from multiple imaging cohorts throughout the United States, our data
have provided sufficient numbers and variability to train our CNN segmentation
architecture across different scanners and subject populations. We documented its
capacity for generalization in Fig. 3. We further showed that the CNN segmenta-
tions improved upon the longitudinal consistency produced by our multi-atlas and
human QC method (Fig. 4). In sum, our CNN learning combines computation
times that are comparable to those of fast non-CNN approaches with improvements
over multi-atlas matching for the measures of consistency and human QC (Table 3).
This will allow us to process very large data sets that we anticipate coming online
in our laboratory over the next three years.

4.1.1. Limitations of our brain segmentation


Limitations of our efforts may stem from 1) residual inconsistencies in the available
ground truth, mainly due to human variability, and 2) inability to train our brain
segmentation on pathological cases like tumor or massive stroke.
For the first limitation, we note that our DSC scores showed some variability
across cohorts (Fig. 3), with some scores exceeding human-level estimates but others
dropping below it. A previous test using the non-local patch segmentation BEaST9
produced mean DSC scores all above 0.98 from leave-one-out training and validation
using 80 samples drawn from four cohorts. Computation time was less than 30
minutes per subject. As the paper acknowledges, this validation set was relatively
small (i.e. from 2 to 40 priors) and these may have been favorably biased, even
using leave-one-out, because testing examples were all drawn from the same pool
of priors as for training. In a second test using independent validation with 840
ADNI images, their mean DSC was in the lower ranges above 0.98 (Fig. 5A of Ref.
9), equal to or less than the scores we report for our 391 ADNI test images (Fig. 3)
that were generated with much faster computation times. Thus, to the extent that
the two sets of results are comparable, our CNN segmentation does as well as or
better than BEaST and with better resource efficiency.
The limitation due to residual human-introduced variability may have effectively
capped the level of generalization we could achieve. DSC scores comparing CNN
predictions to ground truth could be slightly degraded because CNN predictions
are smooth from one brain slice to the next, whereas human QC introduces slice
inconsistencies, in effect creating built-in differences between GT and predictions.
This in turn raises the issue that when the CNN predictions begin to surpass GT by
not reproducing random flaws in the latter, then analyzing their quality becomes
less straightforward. In that outcome, DSC scores do not tell the full story. We
hypothesize that we are starting to encounter instances of this in our current data.
The second limitation (inability to segment the brain in the presence of severe
pathology) may be unavoidable at present because of the extreme scarcity of patho-
logical examples available for training.
4.1.2. Future work

In future work, we aim to address the first limitation using iterations of training
in which our previous atlas-matching masks are replaced by CNN-generated masks
that show more consistency due to less human variability in the QC phase. To
be clear, the CNN masks still require human QC, but this is far less than for the
atlas-matching masks (Table 3), thus reducing the risks of QC analysts introducing
variability at the brain edges. Other work in the near future will include adapting
our volumetric techniques to segmentation of important brain sub-structures such
as the hippocampus24 (for which we now also use multi-atlas segmentation) and
subcortical nuclei25 whose boundaries tend to be indistinct in structural MRI.

4.2. Brain edge labeling in longitudinal registration

Our application of deep learning to structural edge labeling was aimed at improv-
ing localization, biological accuracy and statistical power of voxel-based registration.
Nonlinear registration is a powerful technique for studying local brain variability.
In the cross-sectional setting, where many images are registered to a template, it
is necessarily imperfect since brains have individual cortical signatures that cannot
be completely matched. This is not the case in longitudinal registration of two
scans for the same subject; however, random noise in the form of image artifacts
typically still causes false indications of change over time. Common measures to
address this problem have involved penalty functions and levels of smoothing,26,27
but these risk degrading the level of local detail that otherwise might be possible.
Our previous work in this area16,17 aimed to support localization by incorporating
auxiliary estimates of edge likelihoods, in the form of hand-crafted algorithms us-
ing one or a few pre-selected features, to reinforce change gradients in likely edge
locations but inhibit indications of “change” in areas not associated with edges.
The pattern recognition capabilities of deep learning, going beyond the use of a
handful of features, suggested the possibility to do better. Results in this chapter
have indicated that CNN-generated edge predictions can in fact extend the power of
our previous methods. Registrations generated with CNN edge predictions showed
increased detail regarding structural boundaries (compare Fig. 6(a) and Fig. 6(b)),
with regions of computed atrophy rates that were both larger in some areas and
smaller in others compared to our previous algorithm, in biologically plausible ways
(Fig. 6(c)). These led to improved statistical power (Table 4) in the form of reduced
sample sizes needed to detect change.
Applications of deep-learning methods to brain structural edge recognition are
of course not limited to the one presented in this chapter. Tasks such as tissue
classification28 and the segmentation of brain structures,25 including subcortical
nuclei whose boundaries are difficult to localize with certainty, can benefit from the
edge prediction capability we demonstrated here. This will be the subject of future
work.
5. Conclusion

We have demonstrated deep learning CNN applications in two areas of brain struc-
tural image processing. One application focused on improving production and ro-
bustness in brain segmentation, a routine, essential image processing task that
suffers from variable reliability among available non-CNN approaches, and a dearth
of training GT data for previous deep learning efforts. The other aimed at improv-
ing edge prediction, leading to greater biological accuracy and statistical power for
analysis of large data sets. Our results suggest that we have attained these goals.
This demonstrates the flexibility and broad applicability of CNN learning in medical
image processing and analysis.

References

1. Dinggang Shen, G. Wu, and H.-I. Suk, Deep Learning in Medical Image Analysis,
Annual Review of Biomedical Engineering. 19(1), 221–248, (2017). ISSN 1523-9829.
doi: 10.1146/annurev-bioeng-071516-044442. URL http://www.annualreviews.org/
doi/10.1146/annurev-bioeng-071516-044442.
2. Z. Akkus, A. Galimzianova, A. Hoogi, D. L. Rubin, and B. J. Erickson, Deep Learning
for Brain MRI Segmentation: State of the Art and Future Directions, Journal of Digi-
tal Imaging. 30(4), 449–459, (2017). ISSN 1618727X. doi: 10.1007/s10278-017-9983-4.
3. Y. Bengio, Learning Deep Architectures for AI. vol. 2, 2009. ISBN 2200000006. doi:
10.1561/2200000006.
4. Z. Zhang, F. Xing, H. Su, X. Shi, and L. Yang, Recent Advances in the Applications of
Convolutional Neural Networks to Medical Image Contour Detection, arXiv preprint
arXiv:1708.07281. (2017). URL http://arxiv.org/abs/1708.07281.
5. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A.
W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, A Survey on Deep Learning
in Medical Image Analysis, Medical Image Analysis. 42, 60–88, (2017). ISSN 1361-
8423. doi: 10.1016/j.media.2017.07.005. URL http://dx.doi.org/10.1016/j.media.2017.07.005.
6. A. S. Lundervold and A. Lundervold, An overview of deep learning in medical imag-
ing focusing on MRI, Zeitschrift fur Medizinische Physik. 29(2), 102–127, (2019).
ISSN 18764436. doi: 10.1016/j.zemedi.2018.11.002. URL https://doi.org/10.1016/
j.zemedi.2018.11.002.
7. P. Aljabar, R. Heckemann, Hammers, J. Hajnal, and D. Rueckert, Multi-atlas based
segmentation of brain images: atlas selection and its effect on accuracy., NeuroImage.
46(3), 726–38 (jul, 2009). ISSN 1095-9572. doi: 10.1016/j.neuroimage.2009.02.018.
URL http://www.ncbi.nlm.nih.gov/pubmed/19245840.
8. S. M. Smith, Fast robust automated brain extraction, Human Brain Mapping.
17(3), 143–155, (2002). ISSN 1097-0193. doi: 10.1002/hbm.10062. URL https:
//onlinelibrary.wiley.com/doi/abs/10.1002/hbm.10062.
9. S. F. Eskildsen, P. Coupé, V. Fonov, J. V. Manjón, K. K. Leung, N. Guizard, S. N.
Wassef, L. R. Østergaard, and D. L. Collins, BEaST: Brain extraction based on nonlo-
cal segmentation technique, NeuroImage. 59(3), 2362–2373 (Feb., 2012). ISSN 1053-
8119. doi: 10.1016/j.neuroimage.2011.09.012. URL http://www.sciencedirect.com/
science/article/pii/S1053811911010573.
10. J. E. Iglesias, C. Liu, P. M. Thompson, and Z. Tu, Robust Brain Extraction Across
Datasets and Comparison With Publicly Available Methods, IEEE Transactions on
Medical Imaging. 30(9), 1617–1634 (Sept., 2011). ISSN 0278-0062. doi: 10.1109/TMI.
2011.2138152.
11. J. Kleesiek, G. Urban, A. Hubert, D. Schwarz, K. Maier-Hein, M. Bendszus, and
A. Biller, Deep MRI brain extraction: A 3D convolutional neural network for
skull stripping, NeuroImage. 129, 460–469, (2016). ISSN 10959572. doi: 10.1016/
j.neuroimage.2016.01.024. URL http://dx.doi.org/10.1016/j.neuroimage.2016.
01.024.
12. S. S. M. Salehi, D. Erdogmus, and A. Gholipour, Auto-context Convolutional Neural
Network for Geometry-Independent Brain Extraction in Magnetic Resonance Imaging,
IEEE Transactions on Medical Imaging. 36(11), 2319–2330, (2017). ISSN 0278-0062.
doi: 10.1109/TMI.2017.2721362. URL http://arxiv.org/abs/1703.02083.
13. J. Merkow, D. Kriegman, A. Marsden, and Z. Tu. Dense Volume-to-Volume Vascular
Boundary Detection. In eds. S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and
W. Wells, MICCAI 2016, vol. 9902, pp. 1–8. Springer, (2016).
14. I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization
and momentum in deep learning, 30th International Conference on Machine Learning,
ICML 2013. (PART 3), 2176–2184, (2013).
15. L. R. Dice, Measures of the Amount of Ecologic Association Between Species, Ecology.
26(3), 297–302, (1945).
16. E. Fletcher, A. Knaack, B. Singh, E. Lloyd, E. Wu, O. Carmichael, and C. De-
Carli, Combining boundary-based methods with tensor-based morphometry in the
measurement of longitudinal brain change., IEEE transactions on medical imag-
ing. 32(2), 223–36 (feb, 2013). ISSN 1558-254X. doi: 10.1109/TMI.2012.2220153.
URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3775845&tool=pmcentrez&rendertype=abstract.
17. E. Fletcher, Using Prior Information To Enhance Sensitivity of Longitudinal Brain
Change Computation, In ed. C. H. Chen, Frontiers of Medical Imaging, chapter 4,
pp. 63–81. World Scientific, (2014). doi: 10.1142/9789814611107_0004. URL http://www.worldscientific.com/doi/abs/10.1142/9789814611107_0004.
18. J. Canny, A computational approach to edge detection., IEEE transactions on pattern
analysis and machine intelligence. 8(6), 679–698, (1986). ISSN 0162-8828. doi: 10.
1109/TPAMI.1986.4767851.
19. P. Kochunov, J. L. Lancaster, P. Thompson, R. Woods, J. Mazziotta, J. Hardies,
and P. Fox, Regional Spatial Normalization: Toward and Optimal Target, Journal of
Computer Assisted Tomography. 25(5), 805–816, (2001).
20. D. Rueckert, P. Aljabar, R. A. Heckemann, J. V. Hajnal, A. Hammers, R. Larsen,
M. Nielsen, and J. Sporring. Diffeomorphic registration using b-splines. In MICCAI
2006, vol. LNCS 4191, pp. 702–709. Springer-Verlag, (2006).
21. T. Nichols and A. P. Holmes, Nonparametric permutation tests for functional neu-
roimaging: a primer with examples, Human Brain Mapping. 15(1), 1–25, (2001).
22. X. Hua, B. Gutman, C. P. Boyle, P. Rajagopalan, A. D. Leow, I. Yanovsky,
A. R. Kumar, A. W. Toga, C. R. Jack Jr, N. Schuff, G. E. Alexander,
K. Chen, E. M. Reiman, M. W. Weiner, P. M. Thompson, the Alzheimer’s Disease
Neuroimaging Initiative, and C. R. Jack, Accurate measurement of brain
changes in longitudinal MRI scans using tensor-based morphometry, NeuroImage.
57(1), 5–14 (jul, 2011). ISSN 1095-9572. doi: 10.1016/j.neuroimage.2011.01.079.
URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3394184&tool=pmcentrez&rendertype=abstract.
23. D. Holland, L. K. McEvoy, and A. M. Dale, Unbiased comparison of sample size
estimates from longitudinal structural measures in ADNI., Human brain mapping.
000(May), 2586–2602 (aug, 2011). ISSN 1097-0193. doi: 10.1002/hbm.21386. URL
http://www.ncbi.nlm.nih.gov/pubmed/21830259.
24. B. Thyreau, K. Sato, H. Fukuda, and Y. Taki, Segmentation of the hippocampus by
transferring algorithmic knowledge for large cohort processing, Medical Image Anal-
ysis. 43, 214–228, (2018). ISSN 13618423. doi: 10.1016/j.media.2017.11.004. URL
https://doi.org/10.1016/j.media.2017.11.004.
25. Z. Tu, K. L. Narr, P. Dollár, I. Dinov, P. M. Thompson, and A. W. Toga, Brain
Anatomical Structure Segmentation by Hybrid Discriminative / Generative Models,
IEEE Transactions on Medical Imaging. 27(4), 495–508, (2008).
26. X. Hua, A. D. Leow, N. Parikshak, S. Lee, M.-C. Chiang, A. W. Toga, C. R. Jack,
M. W. Weiner, and P. M. Thompson, Tensor-based morphometry as a neuroimaging
biomarker for Alzheimer’s disease: an MRI study of 676 AD, MCI, and normal sub-
jects., NeuroImage.
43(3), 458–69 (nov, 2008). ISSN 1095-9572. doi: 10.1016/j.neuroimage.2008.07.013.
URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3197851&tool=pmcentrez&rendertype=abstract.
27. I. Yanovsky, A. D. Leow, S. Lee, S. J. Osher, and P. M. Thompson, Comparing regis-
tration methods for mapping brain change using tensor-based morphometry, Medical
Image Anal-
ysis. 13(5), 679–700 (oct, 2009). ISSN 13618415. doi: 10.1016/j.media.2009.06.002.
URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2773147&tool=pmcentrez&rendertype=abstract.
28. P. Moeskops, J. de Bresser, H. J. Kuijf, A. M. Mendrik, G. J. Biessels, J. P. Pluim, and
I. Išgum, Evaluation of a deep learning approach for the segmentation of brain tissues
and white matter hyperintensities of presumed vascular origin in MRI, NeuroImage:
Clinical. 17(October 2017), 251–262, (2018). ISSN 22131582. doi: 10.1016/j.nicl.2017.
10.007.


CHAPTER 2.5

AUTOMATIC SEGMENTATION OF INTRAVASCULAR ULTRASOUND
IMAGES BASED ON TEMPORAL TEXTURE ANALYSIS

Adithya G. Gangidi and Chi Hau Chen*
University of Massachusetts Dartmouth
North Dartmouth, MA 02747 USA
*cchen@umassd.edu (corresponding author)

Intravascular ultrasound (IVUS) continues to be an important technique
for imaging of coronary arteries and the detection of atherosclerotic disease.
Since its inception, much of the effort towards IVUS image analysis has
been done using spatial information of a single IVUS frame at a time.
Accuracy of such an approach is limited by the absence of a clear boundary
between lumen and arterial wall structure due to noise induced by catheter
artifacts and speckle echo at high frequency. In our study we developed a
novel automatic algorithm for the analysis and delineation of lumen and
external elastic membrane (EEM) boundaries using both temporal and
spatial variation of IVUS data. The pre-processing steps involve the
construction of gradient image from neighboring images, and the use of
discrete wavelet frame decompositions. The observation that the lumen is
characterized by fine texture and that the EEM is characterized by coarser
texture is used to initialize the contours. A smooth lumen and EEM contour
is predicted by applying radial basis functions to the contour initialization. This
algorithm is evaluated on large multi-patient (15) datasets of IVUS images
(~200 each) and compared against contours manually segmented by medical
experts. It is observed that this algorithm reliably performs contour
prediction within clinically acceptable limits, with average prediction errors
of 0.1254 mm and 0.0762 mm for the lumen and EEM respectively. In
an effort to provide direction for further improvements, a custom lumen
detection algorithm for stented images is proposed and tested with reported
average prediction error of 0.048 mm.

1. Introduction
Intravascular Ultrasound (or IVUS) allows us to see the coronary artery from the
inside out. This unique picture, generated in real time, yields information that is
not possible with routine imaging methods or even non-invasive multi slice CT
scans. A growing number of cardiologists think that new information yielded by

IVUS can make a significant difference in how a patient is treated and can provide
more accurate information that will reduce complications and incidence of heart
diseases.
Intravascular ultrasound (IVUS) is a catheter-based technique that provides
high-resolution images allowing precise tomographic assessment of lumen area.
IVUS uses high-frequency sound waves (ultrasound) that can provide a
moving picture of the heart. These pictures come from inside the heart rather than
through the chest wall.
In a typical IVUS image (Fig. 1a), the lumen is typically a dark echo-free area
adjacent to the imaging catheter and the coronary artery vessel wall mainly appears
in three layers: Intima, Media, and Adventitia. Fig. 1b defines the layers. As the
two inner layers are of principal concern in clinical research, segmentation of
IVUS images is necessary to isolate the intima-media and lumen which provides
important information about the degree of vessel obstruction as well as the shape
and size of plaques. Such segmentation can be performed manually by a human
expert, but it is very time consuming and costly. Computer-based analysis, and
in fact fully automatic image segmentation, is much needed.

Fig. 1a: A typical IVUS image. Fig. 1b: Definition of the vessel wall layers.

There are several factors (artifacts) that significantly reduce the accuracy of
segmentation and ultimately cause difficulty in interpretation:
1. The ever-present speckle noise in ultrasonic images, particularly of
human tissues.
2. Guide wire with reverberation
3. Reflection from sheath surrounding transducer
4. Barely identifiable lumen intima boundary
5. EEM-like features beyond EEM
6. Bright echo from vessel wall being close to transducer

There have been a large number of efforts in recent years, including the use of
automated contour models, machine learning and other methods for IVUS
segmentation (see e.g. [1-12]). Though deep learning may provide
better performance with a very large data set, it is not used in this work as the
available data set size, in our view, is not large enough.
In this chapter we present an automated segmentation method making
use of both temporal and spatial information, with discrete wavelet frame
decomposition to extract the texture information and to initialize the contours.
Radial basis functions are used to construct the final contours in a few iterative
steps. The proposed method is tested on a data base provided by the Brigham and
Women's Hospital in Boston with very encouraging results.

2. The Data Base


The original 2D cross-sectional images obtained from the IVUS sensor, as provided
by the Brigham and Women's Hospital for our academic research, are in envelope file
format. They are converted into PC-Matlab format with 256x256 pixels in polar format.
The data available are listed in Table 1.

Table 1

Name of pullback sequence   Number of gated frames with manual segmentation   Approx. number of total frames
101-001_LAD                 205                                               5369
101-011_RCA                 92                                                2613
102-006_RCA                 151                                               3464
103-007_LAD                 247                                               4662
103-007_LCX                 256                                               5229
106-001_LAD                 62                                                3665
106-001_LCX                 131                                               3848
110-001_RCA                 167                                               3961
111-003_LAD                 143                                               3477
111-003_LCX                 131                                               3362
11-003_RCA                  166                                               4611
114-001_LCX                 143                                               3098
116-001_LAD                 191                                               4606
116-001_LCX                 108                                               2664
116-001_RCA                 100                                               2469
Total                       2293                                              57098

There are 15 pullback sequences from 9
patients. There are a total of 2293 gated image frames which have been manually
segmented and are useful for training and validation purposes. A total of 57098
image frames provides a large data set for algorithm testing. Although many
studies on IVUS image segmentation have been conducted with varying but
limited success, none has employed such a large data base. We believe
IVUS image segmentation is a problem in pattern recognition and computer vision.
Considering the successes of many pattern recognition and computer vision
problems in the last 55 years and their great impact on modern society, we are
confident that an effective automated segmentation process can be developed, as
demonstrated in this chapter.

3. The Proposed Method


It is proposed to start with temporal IVUS image correction. Noise correction is
performed by obtaining a temporal Laplacian image gradient on the basis of a
four-frame neighborhood. The idea is that the motion of cells around the arterial
wall is faster over time than the change in noise artifacts between successive
IVUS frames. In the next stage we track the lumen wall. The lumen wall in IVUS
images shows significant high-frequency variation, i.e. fine texture around the
lumen. In certain images, due to the catheter-induced artifacts and the guide wire
shadow, there is significant intensity variation around the lumen too. For the
media-adventitia, the texture variation around the wall is smooth or coarse. There
is significant intensity variation around this wall as well, due to its properties and
catheter-induced artifacts. Thus we can use this texture and intensity information
combined to trace the two contours.
The proposed method makes use of a composite operator that depends on both
texture and the temporal variation of intensity. The lumen contour can be traced
based on the finest texture and intensity specifics. Once we have the lumen border
contour, we obtain the media-adventitia border by finding the coarsest texture
located outside the lumen border. Once these two contour initializations are
obtained, we can use low-pass filtering or 2D radial basis functions to obtain
smooth 2D contours.
The detailed procedures are as follows.

Step 1: Pre-Processing:
The IVUS images, represented in polar coordinates by I, include not only tissue
and blood regions, but also the outer boundary of the catheter itself. The latter
defines a dead zone of radius equal to that of the catheter where no useful
information is contained. Knowing the diameter D of the catheter, these catheter-
induced artifacts are easily removed by setting:

I(r,θ) = 0   for r ≤ D/2 + e,        (1)
where D is the diameter of the catheter and e is a constant term.

Fig. 2 illustrates the effect of the pre-processing.

Fig. 2: A. Original IVUS cross-sectional image in rectangular coordinates; B. Catheter-induced
artifacts highlighted (red) in the corresponding polar form of the IVUS image; C. IVUS image in
polar form after pre-processing, wherein the artifacts are set to zero.

Step 2: Noise Correction by Temporal Analysis:

In this stage, we use temporal information from the IVUS images to enhance their
quality. Noise correction is performed by computing a temporal Laplacian image
gradient on the basis of a four-frame neighborhood, as described by the equation
below.

Ig(m,n)(t) = 5·I(m,n)(t) − [I(m,n)(t+1) + I(m,n)(t+2) + I(m,n)(t−1) + I(m,n)(t−2)]    (2)

In the above equation, I(m,n)(t+n) is the (t+n)-th frame and Ig(m,n)(t) is the gradient image
obtained corresponding to frame number “t”.
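
A NumPy sketch of Eq. (2), assuming the frames of a pullback are stacked as a (T, H, W) array and that the function name is ours, is:

import numpy as np

def temporal_gradient(frames, t):
    """Temporal Laplacian of Eq. (2): emphasizes fast-moving blood speckle relative
    to catheter artifacts that change little between neighboring frames.
    Valid for 2 <= t <= len(frames) - 3."""
    return (5.0 * frames[t].astype(np.float32)
            - frames[t + 1] - frames[t + 2]
            - frames[t - 1] - frames[t - 2])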

Step 3: Getting the Intensity and Texture Information:


a. Intensity: IVUS images are acquired successively frame by frame and
represented in the default form of polar coordinates I(r,θ) for better visualization
and manipulation. Representation of the images in polar coordinates is important
for facilitating the description of local image regions in terms of their radial and
tangential characteristics. It also facilitates a number of other detection steps, such
as contour initialization and the smoothing of the obtained contour. For this
purpose, each of the original IVUS images is transformed to a polar coordinate
image where columns and rows correspond to angle and distance from the center
of the catheter, respectively, and this image alone, denoted I(r,θ), is used
throughout the analysis process.

b. Texture: Discrete Wavelet Frames (DWF) decomposition is used for detecting
and characterizing texture properties in the neighborhood of each pixel. This is a
method similar to the Discrete Wavelet Transform (DWT) that uses a filter bank
to decompose the grayscale image to a set of sub bands. The main difference
between DWT and DWF is that in the latter the output of the filter bank is not
subsampled. The DWF approach has been shown to decrease the variability of the
estimated texture features, thus improving pixel classification for the purpose of
image segmentation.
Thus the intensity images in polar form are processed with the discrete wavelet
frames decomposition, which is equivalent to the DWT without the sub-sampling
step. Two filters, the low-pass Haar filter H and the corresponding high-pass filter
G, are used in the discrete wavelet frames transform:
H(z) = ½ (1 + z⁻¹),   where H(z) is the Haar (low-pass) filter,
G(z) = z H(−z),       where G(z) is the corresponding high-pass filter.        (3)

Four successive passes of the discrete wavelet frames decomposition result in twelve
images, denoted by IK, where K = 1 to 12. The image resulting from the series of
low-pass operations contains the coarsest texture and is denoted by ILL. This image
is critical for tracking the media-adventitia border later.
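
As one possible implementation of this decomposition, the stationary (undecimated) wavelet transform in PyWavelets yields sub-bands of the original size; the sketch below uses it as a stand-in for the discrete wavelet frames filter bank, and does not reproduce the chapter's exact sub-band ordering or normalization.

import numpy as np
import pywt  # PyWavelets

def dwf_subbands(polar_image, levels=4):
    """Undecimated Haar decomposition: 4 levels give 12 detail images (cf. IK, K = 1..12)
    plus the repeatedly low-pass-filtered approximation (cf. ILL).
    Image dimensions must be divisible by 2**levels (e.g. 256 x 256)."""
    coeffs = pywt.swt2(polar_image.astype(np.float64), "haar", level=levels)
    details = [band for _, (ch, cv, cd) in coeffs for band in (ch, cv, cd)]
    i_ll = coeffs[0][0]   # deepest-level approximation (coarsest texture)
    return details, i_ll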

Fig. 3: Image showing the difference between DWT and DWFT; due to the absence of subsampling
in DWFT, all the resulting images are of the same size.

Fig. 4: Image showing the different textures that can be obtained from DWF.


Fig. 5: DWF applied to IVUS images stage by stage, to serve as a basis for texture segmentation.

Step 4: Lumen Border Contour Initialization:


The lumen border is initialized based on the sum of the images corresponding to the
fine texture (obtained from the DWFT) and the intensity image. A significant edge
toward r = 0 is detected by applying the following thresholding operator, for each
theta, to this sum image representing the internal energy:

I′int(r,θ) = Σ k∈{7,8,10,11} Ik(r,θ) + I(r,θ)        (texture terms + intensity),
Iint(r,θ) = 255 · I′int(r,θ) / max(r,θ){I′int(r,θ)},
cint,t = { pint,t[ρ,θ] : Iint,t(ρ,θ) > T and Iint,t(r,θ) < T for all r < ρ },        (4)
defining a lumen contour function Cint,t(θ) = ρ.

Only significant edges are saved as a set of contour initialization points; these are later used to obtain the actual contour approximation by applying radial basis functions. The choice of the images I_k in the above formulae was based on visual evaluation of all K generated images and is in line with the aforementioned observations regarding the intensity and texture properties of the lumen and wall areas, in combination with the characteristics of the filter bank used to generate the images I_k. T is the threshold defined for the initialization; it was found to perform best at T = 42 for images in the range [0, 255].
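The thresholding operator of Eq. (4) can be sketched as follows (Python/NumPy). The function name, the value −1 returned for angles without a significant edge, and the assumption that rows index the radius and columns the angle are illustrative:

import numpy as np

def init_lumen_contour(I_polar, texture_subbands, T=42):
    # Sum the selected texture sub-bands and the intensity image, rescale to
    # [0, 255], and for every angle keep the first radius at which the sum
    # exceeds the threshold T (cf. Eq. (4)).
    acc = np.asarray(I_polar, dtype=np.float64) + sum(texture_subbands)   # I'_int
    acc = 255.0 * acc / acc.max()                                         # I_int
    n_r, n_theta = acc.shape              # rows: radius r, columns: angle theta
    contour = np.full(n_theta, -1, dtype=int)
    for theta in range(n_theta):
        above = np.nonzero(acc[:, theta] > T)[0]
        if above.size:
            contour[theta] = above[0]     # smallest rho with I_int(rho, theta) > T
    return contour                        # C_int(theta) = rho, or -1 if none found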

Step 5: Tracking Media-Adventitia Border Initialization:


The motivation behind the choice of image data used for the initialization of the media-adventitia boundary lies in the observation that in many cases the adventitia is represented in IVUS images by a thick bright ring (a thick bright zone in polar coordinates) that is dominant in the image, as opposed to the media region or any other region of an IVUS image. Consequently, for the localization of the adventitia region, low-pass filtering can be used to suppress undesirable details of the image while preserving the adventitia well.
Once the lumen contour is initialized, we initialize the media-adventitia border on the outer side of the lumen (as the media-adventitia wall is present outside the lumen area). This saves the unnecessary computing time of looking
for the wall inside the lumen area. For the media-adventitia we take the sum of the coarsest texture image obtained from the discrete wavelet frames and the intensity image. In this sum image, for each angle θ, we look for the maximum beyond the lumen border.
This process can be represented by the following operator.

c_ext = { p_ext[μ, θ] } :  I_LL(μ, θ) = max_{r > ρ} { I_LL(r, θ) },                    (5)

where [ρ, θ] are the points of the lumen contour.

Selecting, according to equation (5), the pixels at which the intensity of the low-pass filtered image is maximal serves the purpose of identifying the most dominant low-frequency detail in the image, in case low-pass filtering has failed to suppress all other higher-frequency information. The selected pixels correspond to those on the boundary between the adventitia and the media regions.
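A corresponding sketch for the media-adventitia initialization of Eq. (5), again assuming that rows index the radius and columns the angle, is:

import numpy as np

def init_media_adventitia(sum_image, lumen_contour):
    # For each angle, select the radius beyond the lumen contour at which the
    # coarse-texture plus intensity sum image is maximal (cf. Eq. (5)).
    n_r, n_theta = sum_image.shape
    contour = np.zeros(n_theta, dtype=int)
    for theta in range(n_theta):
        rho = max(int(lumen_contour[theta]), 0)   # lumen radius at this angle
        outside = sum_image[rho + 1:, theta]      # restrict the search to r > rho
        if outside.size == 0:                     # lumen already at the image border
            contour[theta] = n_r - 1
        else:
            contour[theta] = rho + 1 + int(np.argmax(outside))
    return contour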

Step 6: Obtaining Smooth Contours with Radial Basis Functions:


After both contours are initialized, there are multiple ways to obtain the final smooth contour. Conventionally, low-pass-filter-based operators were used for such operations. However, such methods have limitations: even a few erroneous values can change the whole distribution of the contour, because these methods include every point in the path when obtaining the final contour. Outliers caused by catheter-based artifacts thus affect the final contour path.
Radial basis functions address this drawback. Values that deviate from the local and global continuity by more than a threshold do not distort the contour. As the lumen and media-adventitia are continuous closed layers, this property of radial basis functions is very useful.
We define and optimize the radial basis function f as follows:

Define

r_max = max_θ { C(θ) } + 1
r_min = min_θ { C(θ) } − 1                    (6)

For this range of values of r in I(r, θ), we define:

f(θ, C(θ)) = 0
f(θ, r ≠ C(θ)) = r − C(θ)                    (7)

where C(θ) denotes either C_int(θ) or C_ext(θ).

Following the definition of f, the FastRBF library (FarField, 2012) [13] is used to
generate the smooth contour approximation in the following three steps.

Step A: Duplicate points where f has been defined are removed (i.e. points in the 2D space that are located within a specific minimum distance from other input points are removed); the remaining points serve as the centers of the RBF.

Step B: The fitting of an RBF to these data is performed using the spline smoothing technique, chosen because it does not require a prior estimate of the noise measure related to each input data point, as opposed to other fitting options such as error-bar fitting.

Step C: The fitted RBF is evaluated in order to find the points which correspond
to zero value; the latter define the contour approximation C’.
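Since FastRBF is a commercial library, the following hedged Python sketch only reproduces the spirit of Steps A-C with SciPy's Rbf interpolator, whose smoothing term stands in for the spline-smoothing fit; the subsampling factors, the smoothing weight and the zero-crossing search are illustrative assumptions rather than the chapter's actual implementation:

import numpy as np
from scipy.interpolate import Rbf

def smooth_contour_rbf(contour, r_pad=1, smooth=5.0):
    # Approximate the initialized contour C(theta) by the zero level set of a
    # smoothed function f(theta, r) = r - C(theta), cf. Eqs. (6)-(7).
    contour = np.asarray(contour, dtype=float)
    n_theta = len(contour)
    thetas = np.arange(n_theta, dtype=float)
    radii = np.arange(contour.min() - r_pad, contour.max() + r_pad + 1, dtype=float)

    # Sample f on a sparse (theta, r) grid: zero on the contour, r - C(theta) elsewhere.
    tt, rr = np.meshgrid(thetas[::8], radii[::2])
    ff = rr - contour[::8][np.newaxis, :]
    rbf = Rbf(tt.ravel(), rr.ravel(), ff.ravel(), function='thin_plate', smooth=smooth)

    # Evaluate the fitted f and take, per angle, the radius closest to its zero crossing.
    smoothed = np.empty(n_theta)
    for i, th in enumerate(thetas):
        vals = rbf(np.full_like(radii, th), radii)
        smoothed[i] = radii[np.argmin(np.abs(vals))]
    return smoothed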

Fig. 6: Initialized Contour before Applying Radial Basis Function.

Fig. 7: Contour after Applying Radial Basis Function.

Fig. 8: Lumen and Media Adventitia Contour before and after applying Radial Basis Function.

4. Implementation and Results


The proposed methodology is implemented in MATLAB on 15 instances of patient arterial data. The block diagram below (Fig. 9) summarizes the MATLAB code interface, which accesses the IVUS images in .env format and returns a set of four contours: manual lumen and EEM versus predicted lumen and EEM. The algorithm takes 0.07 seconds per image to produce the contours. The manually segmented data are stored in the Manual lumen.er and Manual EEM.er files for each patient, specifically for gated frames only.

Fig. 9

The objective of this research is to segment the lumen and EEM. These results are benchmarked against the manually segmented lumen and EEM. To compare them for each frame, the manual lumen and EEM are plotted against the predicted lumen and EEM on the IVUS image using the following colour code:

Blue: manually segmented lumen.   Yellow: predicted lumen.
Red: manually segmented EEM.   Green: predicted EEM.

A similar analysis is run on each of the 2293 images from the 15 patients.


For each patient, the average error is calculated by averaging the error in pixels between corresponding points of the manual and predicted contours for each gated frame, and these values are plotted together for the respective patient. Similarly, the maximum error for each frame can also be plotted.
Working with the entire database described earlier, and using 5 consecutive images, the lumen could be predicted with an error of 6.9566 ± 2.2144 pixels, corresponding to 0.1254 ± 0.04121 mm. The EEM could be predicted with an error of 4.1915 ± 2.3017 pixels, corresponding to 0.0762 ± 0.04514 mm. This is a remarkable improvement over the use of a single image only, which has an estimated error of 0.25 mm. For stented images our method has a reported average prediction error of 0.048 mm.

5. Concluding Remarks
A novel algorithm has been proposed to automatically detect the lumen and EEM, and it is observed that this algorithm reliably performs contour prediction within clinically acceptable limits, with an average prediction error under 0.13 mm. The proposed approach does not require manual initialization of the contours, which is a common requirement of several other prior approaches to IVUS image segmentation.
The experiments conducted with the combination of temporal analysis,
contour initialization and contour refinement methods proposed in this work
demonstrated the usefulness of the employed texture features for IVUS image
analysis as well as the contribution of the approximation technique based on Radial
Basis Functions to the overall analysis outcome. The comparative evaluation of
the different alternate approaches revealed that use of the temporal texture based
initialization and the 2D RBF-based approximation results in a reliable and quick
IVUS segmentation, comparable to the manual segmentation and other alternate
segmentation algorithms.
Our automated segmentation algorithm has several clinical applications. It
could facilitate plaque morphometric analysis i.e. planimetric, volumetric and wall
thickness calculations, contributing to rapid, and potentially on-site, decision-
making. Similarly, our method could be utilized for the evaluation of plaque
progression or regression in serial studies investigating the effect of drugs in
atherosclerosis.

Based on the results at branched and stented regions, and the increase in accuracy obtained with a modified texture-based dilation method targeted at stented IVUS images, a promising direction for future work is recommended: to increase the prediction accuracy, first apply a detector that classifies each IVUS image as stented, branched or normal, and then use a customized texture-based algorithm to detect the contours in each of these cases.

References

1. G. D. Giannoglou, Y. S. Chatzizisis, V. Koutkias, I. Kompatsiaris, M. Papadogiorgaki,


V. Mezaris, E. Parissi, P. Diamantopoulos, M. G. Strintzis and N. Maglaveras, "A novel
active contour model for fully automated segmentation of intravascular ultrasound images:
In vivo validation in human coronary arteries," Comput. Biol. Med., vol. 37, pp. 1292-1302,
2007.
2. M. Papadogiorgaki, V. Mezaris, Y. S. Chatzizisis, G.D. Giannoglou and I. Kompatsiaris,
“Image analysis techniques for automated IVUS contour detection”, Ultrasound in
Medicine and Biology Journal, vol. 34, no. 9, pp. 1482-1498, Sept. 2008.
3. P. Manandhar, C.H. Chen, A.U. Coskun and U. Qidwai, “An automated robust
segmentation method for intravascular ultrasound images”, Chapter 19 of Frontiers of
Medical Imaging, edited by C.H. Chen, World Scientific Publishing, pp. 407-426, 2015.
4. E.G. Mendizabal-Ruiz and I.A. Kakadiaris, “Computational methods for the analysis of
intravascular ultrasound data”, Chapter 20 of Frontiers of Medical Imaging, edited by C.H.
Chen, World Scientific Publishing, pp. 427-444, 2015.
5. S. Balocco, C. Gatta, F. Ciompi, O. Pujol, X. Carrillo, J. Mauri and P. Radeva, "Combining growcut and temporal correlation for IVUS lumen segmentation", Pattern Recognition and Image Analysis, Springer LNCS, vol. 6669, pp. 556-563, 2011.
6. M.H. Cardinal, G. Soulez, J. Tardif, J. Meunier, and G. Cloutier., “Fast- marching
segmentation of three-dimensional intravascular ultrasound images: A pre- and post-
intervention study”, International Journal on Medical Physics, Volume 37, 2010.
7. Z. Luo, Y. Wang and W. Wang. “Estimating coronary artery lumen area with optimization-
based contour detection”, IEEE Trans. on Medical Imaging April 2003;22(4):564–566.
8. G.D. Giannoglou, Y.S. Chatzizisis, and G. Sianos, “In-vivo validation of spatially correct
three- dimensional reconstruction of human coronary arteries by integrating intravascular
ultrasound and biplane angiography” Coron Artery Dis. 2009; 17: 533–543.
9. D. Gil, P. Radeva, J. Saludes and J. Mauri, "Automatic segmentation of artery wall in
coronary IVUS images: a probabilistic approach," Computers in Cardiology 2000, pp. 687-
690, 2000.
10. J. Marone, S. Balocco, M. Bolanos, J. Massa and P. Radeva, “Learning the lumen border
using a Convolutional neural networks classifier”, Conference paper, CVIT-STENT
Workshop, MICCAI held at Athens 2016.
11. S.J. Al’Aref, et al., “Clinical applications of machine learning in cardiovascular disease
and its relevance to cardiac imaging”, European Heart Journal, July 2018.

12. S. Balocco, M.A. Zalaaga, G. Zahad, S.L. Lee and S. Demirci, editors, “Computing and
Visualization for Intravascular Computer-Assisted Stenting”, Elsevier 2017.
13. R. Krasny and L. Wang, "Fast evaluation of multiquadric RBF sums by a Cartesian treecode", SIAM J. Scientific Computing, vol. 33, no. 5, pp. 2341-2355, 2011.

CHAPTER 2.6

DEEP LEARNING FOR HISTORICAL DOCUMENT ANALYSIS

Foteini Simistira Liwicki∗ , Marcus Liwicki†


Luleå University of Technology, Sweden
Email: ∗ foteini.liwicki@ltu.se, † marcus.liwicki@ltu.se

This chapter gives an overview of the state of the art and recent methods in the area of historical document analysis. Historical documents differ from ordinary documents due to the presence of various artifacts. Issues such as the poor condition of the documents, texture, noise and degradation, large variability of page layout, page skew, random alignment, variety of fonts, presence of embellishments, variations in spacing between characters, words, lines, paragraphs and margins, overlapping object boundaries, superimposition of information layers, etc., make their analysis complex. Most current methods rely on deep learning, including Convolutional Neural Networks and Long Short-Term Memory networks. In addition to the overview of the state of the art, this chapter describes a recently introduced idea for the detection of graphical elements in historical documents and an ongoing effort towards the creation of a large database.

1. State of the Art

1.1. Automated Document Image Analysis


The first step towards document image analysis (DIA) is document image binarization, which refers to the inference of the visual archetype for elements we assume had only two tones; it can be extended to multiple information generation processes which have been sequentially applied‡. While it is currently discussed whether binarization is needed,3 more than
90 % of the recent DIA methods published at ICDAR and ICPR apply binarization
at some point during the preprocessing, e.g., for improving text extraction. For
the final recognition the original input image can be taken again to make sure that
no information is lost. For the realization of binarization, many heuristic methods
(global, local,4 and hybrid5 ) have been proposed. Apart from heuristic methods for
binarization, there exist machine learning based methods. They are applied in two
different ways: (i) automatically learning the parameters of a given binarization
method6 or (ii) dividing the image into different regions and learning to select the
appropriate method for every region. Most of the state of the art approaches report
‡ For a detailed discussion of binarization and a proper definition and analysis refer to.2


(a) British Library, Harley MS 2970, f.6v. (b) British Library, Add MS 5153A, f.1r. (c) Inst. du Clergé Patriarcal de Bzommar, BzAr 39, 38. (d) Monastery of Mor Gabriel, MGMT 298, 5r. (e) New York Public Library, Spencer Collection. (f) FamilySearch, sample from ICDAR HDRC-2019.1

Fig. 1. Samples of historical documents in five different languages. (a) Greek: Readings for
Easter and the Bright Week. (b) Latin: lectionary of the 11th century, (the ‘Odalricus Peccator
Gospel Lectionary’). (c) Armenian: New Testament Lectionary of the 18th or 19th century.
(d) Syriac: Gospel Lectionary written in 1833. (e) Syriac: Lectionary for Holy Week, in Coptic
and Arabic. Egypt, 1948. (f) Chinese: Family record from the 19th century.

their results on Document Image Binarization Competition (DIBCO) datasets.7


However, the number of images in these datasets is very small and even in the
newest competition§ , lack of data is the bottleneck. A survey of document lay-
out analysis is given in.8 Often, document layout analysis is performed in several
stages, referred to as physical and logical layout analysis respectively. In physical
layout analysis, the document content is divided into several categories, e.g., text,
§ https://vc.ee.duth.gr/dibco2019/

graphics, and background.9 A good page division results in better performance of


text recognition which is a later step in document image processing pipelines and is
described in the next section. In the succeeding logical layout analysis, the image
regions are then assigned a specific label.10
The approaches to the document content recognition could be divided into
two broad categories, i.e., segmentation-based and segmentation-free approaches.
Segmentation-based approaches first extract segments, e.g., connected compo-
nents, and then perform the recognition on the individual connected components.11
Segmentation-free approaches work directly on the text-line level and determine the
output of complete text line.12
The biggest breakthrough in handwriting recognition, a typical content recognition task, has been the introduction of Deep Learning,13 a bit more than a decade ago. The approach of Graves and Liwicki is based on Long Short-Term Memory networks (LSTM)14 and the introduction of an output layer known as Connectionist Temporal Classification (CTC). In this work, the Word Error Rate (WER) dropped from 35 % to 18 %. The current state of the art with improved architectures reaches as low as 7 %15 in the handwriting recognition task.
Along with the document layout analysis and text transcription, there are new
high-level research challenges associated with document image analysis. These chal-
lenges can be categorized as word spotting, object detection, localization and
recognition, document summarizing, and captioning. There is a big need for a larger database with Ground Truth (GT) in all respects, to enable new algorithms
and architectures in this domain.

1.2. Deep Learning


Deep learning is the branch of machine learning where the algorithms and models learn the features (representations) automatically. The underlying neural networks mimic processes of the human brain. Unlike traditional machine learning techniques, where the features were designed manually by humans, millions of features are automatically learned by deep neural networks. Deep learning models are currently the state of the art and have outperformed previous methods in the fields of image classification, scene segmentation, handwritten/printed document analysis, time series, and sequence learning.
One of the major breakthroughs in the area of deep learning was the image classification on the ImageNet LSVRC-2010 dataset.16 The dataset comprises more than 1 million object images with 1000 different class labels. The deep Convolutional Neural Network (CNN) AlexNet proposed by Krizhevsky et al.17 achieved a top-5 error rate of 17.0 % in 2012, which was later topped by GoogleNet18 with 6.67 % for the image classification task. A major novelty of18 was the use of synthetic data generation by rigorous data augmentation and degradation.
A second branch of deep learning uses LSTM for sequence modelling instead
of CNN for images. Graves et al.19 have proposed a deep LSTM network for
speech processing and achieved remarkable results. These models have also been
successfully applied in handwriting recognition,13 where the error was reduced from
35 % to 18 %.
The deep learning paradigm is being explored for high-level image analysis like semantic segmentation, object detection and localization,20 document image captioning and summarization, word spotting,20 and visual question answering.21

1.3. Synthetic Image Generation


The three major components leading to the major breakthroughs in deep learning22
are better computation infrastructure, improved machine learning algorithms, and
the availability of huge training datasets. Especially the latter, i.e., large datasets,
is a crucial problem in historical DIA, as the amount of labeled digitized images is
rather low. Efforts for generating more training data can either be realized by the
collection and manual annotation of huge image collections16,20 or by the generation
of synthetic images.23
More relevant to documents are tools which have been specifically designed for
generating or deteriorating given input documents in specific tasks. DocCreator
toolkit24 is the one with the biggest efforts in creating modern synthesised docu-
ments. One of the most known challenges in the area of detection and recognition
of textual information in scene images is the Robust Reading Challenge (RRC)25
competition. RRC attracts more and more researchers from different domains. In
the RRC, standard data augmentation methods are used26 for the generation of
scene texts. It is important to note that, especially in the RRC series, the major breakthrough came because of the existence of more data.
For the specific task of Optical Character Recognition (OCR), the open source
framework OCRopus provides various methods for synthetic text line generation,
which finally led to a word error rate of less than 0.5 %.27 Nayef et al.28 proposed a
framework for generating synthetic modern prints and then physically printing them
to generate camera captured documents in a controlled environment. While this
approach is faster than pure human labeling, it is still more time consuming than
a fully automated approach. The alternative approaches for bootstrapping, e.g.,
semi-supervised, active, or lifelong learning29 can always be combined with data
generation methods to achieve even better performance. While there is a strong
focus on such approaches in the literature, a systematic framework for synthetic
document image creation is missing.

1.4. Digital Humanities


Instead of automatically synthesizing documents, there are various projects in the
area of Digital Humanities (DH) aiming at the creation of large human-labeled or
semi-automatically labeled documents. The Himanis project∗ is currently working
∗ https://himanis.hypotheses.org/

on first aligning transcripts of texts of the royal French chancery and the royal Dutch
archives with the glyphs and then training an LSTM to analyze non transcribed
texts of the same periods with Latin scripts. Online annotation tools like SALSAH30
and Transcribe Bentham31 exist; however, they have only limited automatic DIA support. An online tool that includes computerized tools for script analysis and OCR is Monk.32† However, the Monk tools are not publicly available and the methods cannot be easily integrated into other workflows or tools. A publicly accessible
Web interface is provided in the course of the Genizah project33 for the analysis of
fragments. Users are able to semi-automatically investigate all the available data,
however, they cannot directly add additional information such as transcriptions.

2. Cross-Depicted Motif Categorization

Historical documents comprise various kinds of graphical elements. With CNNs, most graphical elements can be detected in a straightforward manner, i.e., by adapting a pre-trained object detection network (cf. Sec. 1.1) to the task of interest. A particular challenge, however, is the problem of categorizing and identifying cross-depicted historical motifs. By cross-depiction, we understand the problem that the same object can be represented (depicted) in various ways.
The cross-depiction challenge is posed by watermarks (crucial for dating
manuscripts), which are created during the process of handmade paper-making
from tissue rags, as was done in Europe from the Middle Ages (13th century) till
the mid-19th century.34 Cross-depiction in watermarks arises from two reasons: (i)
there are many similar representations of the same motif, and (ii) there are several
ways of capturing the watermarks, i.e., as the watermarks are not visible on a scan
or photograph, the watermarks are typically retrieved via hand tracing, rubbing,
or special photographic techniques. This leads to different representations of the
same (or similar) objects, making it hard for pattern recognition methods to re-
trieve them. While this is a simple problem for human experts, computer vision
techniques have problems generalizing from the various depiction possibilities.
In order to identify similar or identical watermarks, many printed collections of
watermarks have been assembled and during the last two decades, several online
databases have been created.34–38 One of the most popular databases, especially
for medieval paper from Middle and Western Europe, is the Wasserzeichen-Informationssystem (WZIS)‡.39 This section uses the WZIS dataset, which contains in total around 105 000 watermark reproductions stored as RGB images of size approximately 1500×720 pixels.§¶ The different image characteristics between
tracings (pen strokes, black and white) and the other reproduction methods (less
distinct shapes, grayscale) make the task of watermark classification and recogni-
† http://www.ai.rug.nl/~lambert/Monk-collections-english.html
‡ https://www.wasserzeichen-online.de/wzis/struktur.php
§ Not all images have the same size. The numbers reported are the average over the whole dataset.
¶ Furthermore, note that this section builds on the extended results of40

(a) Query; (b) R1; (c) R2; (d) R3; (e) R4

Fig. 2. Example of a query (a) with the expert annotated results in order (b, c, d, e) where (b) is
the most relevant. Our system retrieves these results among the first six ranks in the order (e, d, b,
c). Note that the system is not affected by the different reproduction techniques (a: radiography,
b/d: rubbing, c/e: tracing).

tion more difficult. Therefore, rubbings and radiography reproductions were also
included in this data set.

2.1. Classification System

In the watermark research, there exist very complex classification systems for the
motifs depicted by the watermarks. The classification system plays a major role for
both the retrieval and the label-assignment of a watermark. The user must be able
to determine uniquely the correct class for a given watermark. This is not always
easy, especially with rare watermarks. Intuitively, the labelling scheme directly
controls the rarity of a class, i.e., the more precision achievable with the classifi-
cation system, the fewer the samples for each class. This section uses the WZIS
classification system as it is a widespread standard in the considered domain.39
It is partially based on the Bernstein classification system35 and it is built in
a mono-hierarchical∗ structure.41 It contains 12 super-classes (Anthropomorphic
figures, Fauna, Fabulous Creature, Flora, Mountains/Luminaries, Artefacts, Sym-
bols/Insignia, Geometrical figures, Coat of Arms, Marks, Letters/Digits, Undefined
marks) with around 5 to 20 sub-classes each.39 The super-classes are purposely ab-
stract and are only useful as entry point for classifying an instance of a watermark.
∗ In practice, this means that regardless of the level of depth for class specification, there can be
only one unique parent class.

For example, for the watermark represented in Figure 2(c), the following hierarchy
applies:
Fauna
  Bull's head
    detached, with sign above
      with eyes
      ...
The actual definition is complete only at the final level. This kind of terminology
is not trivial to deal with. Moreover, the user needs special knowledge not
only about the single terms but also about their usage in the different scenarios.
Especially the order of the descriptions can become a problem if different users (or
libraries) prefer different orders. To overcome this problem, an automatic motif
comparison would be beneficial.

2.2. Problem Definition


This section presents the results obtained on two disjoint test sets: “Testset 50” and
“Testset 1000”, where the number in their name denotes their size. The Testset
1000 is composed of image samples from 10 sub-classes: bull's head, letter P, crown, unicorn, triple mount, horn, stag, tower, circle, and grape. Testset 50 is composed of image samples from only 5 sub-classes: bull's head, letter P, crown, unicorn, and grape. The choice of these classes is motivated either by their frequency
(e.g. bull’s head, letter P), or by their complexity (e.g. grape, triple mount).
The reproduction techniques in these test sets are mixed (hand tracing, rubbing,
radiography). Obtaining the ground truth for image retrieval queries is very time
consuming and hence expensive. On top of that, it requires a very high level of
expertise with both the watermarks and the classification system to produce such
queries. For Testset 50 there are 50 expert-annotated queries — one for each image
in the set. For the Testset 1000 there are 10 expert-annotated queries — one for
each motif class in the set.
In a practical use case scenario, automatic classification of watermarks with
regard to their motif classes can be a first step to facilitate the entry into the complex
classification system. However, the desideratum would be not only to automatically
assign the correct class, but also to retrieve the best results on a more precise level.
In watermark research, it is usual to assign different relevance/similarity levels to
the retrieved samples with regard to the query watermark, such as identical (which
means originating from the same watermark mould at about the same time), variant
(which means originating from the same mould but with slight differences due to the
mechanical deterioration of the mould); type (which means the motifs are the same,
but other features like divergent size prove that the watermarks do not originate
from the same mould); motif group (which means that the watermarks have the
same motifs but their shape and/or size are considerably different), and class (which
designates motif similarity on a more abstract level).

Table 1. Accuracy on the watermark super-class classification task.

              Training set   Validation set   Test set
ResNet18      99.40 %        96.93 %          96.82 %
DenseNet121   99.48 %        97.76 %          97.63 %

Table 2. Jaccard Index (IoU) on the multi-label tagging task.

              Training set   Validation set   Test set
ResNet18      91.14 %        74.54 %          74.86 %
DenseNet121   93.30 %        79.43 %          79.57 %


The relevance level is crucial for the precision of the dating. Identical and variant samples can usually be dated with a precision of around 5 years, whereas for type and motif group the dating range must be extended to 10-20 years or more, depending on the retrieved samples. In order to assess the precision of the similarity ranking, this section creates Ground Truth on 5 relevance levels (with 4 the highest and 0 the lowest) for Testset 50: identical/variant = 4, type = 3, motif group = 2, class = 1, and irrelevant = 0.

2.3. Experiments

The goal of this section is to develop a system that can help researchers in the
humanities perform content-based image retrieval in the context of historical wa-
termarks. Therefore, this section performs experiments in three different ways,
classification, tagging, and similarity matching. All experiments use the DeepDIVA
experimental framework.42
As architecture, this section builds on a recent extension of CNNs which includes residual connections in order to avoid the vanishing gradient problem,43 in particular dense connections (DenseNet), which connect each layer to every other layer in a feed-forward fashion.44 These two architectural paradigms – namely skip-connections for ResNet and dense connections for DenseNet – are the state of the art for computer vision tasks. In this section the experiments were run using two different networks, one for each paradigm: specifically, the vanilla implementations of ResNet-18 and DenseNet-121, as they are freely available in the PyTorch vision package†.
The first approach is to treat the problem as an object classification task. The rest of the hierarchical structure is discarded and only the 12 super-classes are used as labels for the watermarks. The network is trained to minimize the cross-entropy loss function shown below:
 
L(x, y) = −log ( e^(x_y) / Σ_{i=0}^{|x|} e^(x_i) )                    (1)
† https://github.com/pytorch/vision/

where x is a vector of size 12 representing the output of the network and y ∈ {0, ..., 11} is a scalar representing the label for a given watermark. Instead of using randomly-initialized versions of these models, variants that have been trained for the classification task on the ImageNet dataset from the ImageNet Large Scale Visual Recognition Challenge45 (ILSVRC) were used. Afzal et al.46 and Singh et al.47 have previously demonstrated that ImageNet pre-training can be beneficial for several other domains including document analysis tasks. All input images are pre-processed by removing the whitespace around the image, and then resized to a standard input size of 224 × 224. The dataset is then split into subsets of size 63 626 for training, 7 082 for validation and 34 825 for testing. The performance of the systems was evaluated using the accuracy metric, which is standard for single-label classification tasks. From Table 1, one can see that the DenseNet outperforms the ResNet by a small but significant margin on all subsets of the dataset.
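A minimal PyTorch sketch of this classification setup is given below. The optimizer, learning rate and batch handling are not specified in the chapter and are therefore assumptions; the torchvision call uses the classic pretrained=True interface for ImageNet initialization.

import torch
import torch.nn as nn
from torchvision import models, transforms

# Input pipeline: removal of the surrounding whitespace is assumed to have
# happened already; here the images are only resized and converted to tensors.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

model = models.resnet18(pretrained=True)          # ImageNet (ILSVRC) initialization
model.fc = nn.Linear(model.fc.in_features, 12)    # 12 watermark super-classes

criterion = nn.CrossEntropyLoss()                 # the loss of Eq. (1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed settings

def train_step(images, labels):
    # One optimization step on a batch of (image tensor, super-class index) pairs.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()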
The second approach is to treat the problem as a multi-label tagging task. In this
case each image can be assigned one or more corresponding labels. In this section,
this approach is motivated by the hierarchical structure of the labels, which, although very noisy, could provide additional entropy, thus increasing the performance of the network. To avoid the ordering problem (see Sec. 2.1), each level of the hierarchy is treated as a "tag", thus addressing the problem of having one label appearing at different levels in the hierarchy. The network is trained to minimize the binary cross-entropy loss function shown below:

L(x, y) = − Σ_{i=0}^{n} [ y_i · log(σ(x_i)) + (1 − y_i) · log(1 − σ(x_i)) ]                    (2)
where n is the number of different labels being used, x is a vector of size n repre-
senting the output of the network and y is a multi-hot vector of size n, i.e., it is a
vector of values that are 1 when the corresponding tag is present in the image and 0
when it is not. The setup of training is similar to the setup of the classification task,
just with multiple lables. The performance of the systems is evaluated using the
Jaccard Index48 which is also referred to as the Mean Intersection over Union (IoU).
This metric is used to compare the similarity of sets which makes it suitable for
multi-label classification problems. In Table 2, both networks achieve a high degree
of performance on the tagging task, with DenseNet performing significantly better
on all subsets of the dataset. This result is quite significant as the IoU accounts
for imbalance in the labels. That is, for a dataset with n classes, even if a class
is significantly over-represented in the dataset, it contributes only 1/n towards the
final score. Note that n equals 622 in this case.
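One plausible reading of this per-tag IoU, in which every tag contributes equally to the final score, is sketched below; the thresholding of the sigmoid outputs into multi-hot predictions is assumed to happen beforehand.

import numpy as np

def mean_iou(y_true, y_pred):
    # Per-tag Jaccard index (IoU), averaged over all n tags, so that every tag
    # contributes 1/n to the final score regardless of how frequent it is.
    y_true = np.asarray(y_true, dtype=bool)   # shape (num_images, n_tags), multi-hot
    y_pred = np.asarray(y_pred, dtype=bool)   # thresholded network outputs
    inter = np.logical_and(y_true, y_pred).sum(axis=0).astype(float)
    union = np.logical_or(y_true, y_pred).sum(axis=0).astype(float)
    iou = np.where(union > 0, inter / np.maximum(union, 1.0), 1.0)
    return float(iou.mean())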
The third approach treats the problem as an image similarity task, i.e., given an image, produce an embedding in a high-dimensional space where similar images are embedded close together and dissimilar images are embedded far apart. This is intuitively the closest formulation to the end goal of image retrieval. There are different approaches to formulate this task in the literature. The triplet loss approach49,50 is chosen, which has been shown to outperform two-channel networks51
and advanced applications of the Siamese approach such as MatchNet52 as well.

Table 3. Mean Average Precision (mAP) for the similarity matching task.

                              ResNet18    DenseNet121
Baseline                      0.885       0.928
Classification Pre-trained    0.929       0.951
Tagging Pre-trained           0.923       0.952

Fig. 3. The first step is to create modern electronic printed documents from a Latex specification document. The second step involves using a deep neural network to learn a mapping function to transform the modern printed document to a historical handwritten document.

The
triplet loss operates on a tuple of three watermarks {a, p, n} where a is the anchor
(reference watermark), p is the positive sample (a watermark from the same class) and n is the negative sample (a watermark from another class). The neural network
is then trained to minimize the loss function defined as:

L(δ+ , δ− ) = max(δ+ − δ− + μ, 0) (3)

where δ+ and δ− are the Euclidean distances between the anchor-positive and anchor-negative pairs in the feature space and μ is the margin used. Models pre-trained on the classification task and on the multi-label tagging task were used. The results are reported in terms of mean Average Precision (mAP), which is a well-established metric in the context of similarity matching. Table 3 clearly demonstrates that the classification and tagging pre-trained networks outperform the ImageNet baseline networks by a significant margin on all subsets of the dataset. Similarly to the other two tasks, one can see here that the DenseNet outperforms the ResNet by a significant margin.
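The loss of Eq. (3) can be written in a few lines of PyTorch; the margin value below is an assumption, and PyTorch's built-in nn.TripletMarginLoss implements essentially the same function.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Triplet loss of Eq. (3) on batches of embeddings: delta_plus is the
    # anchor-positive distance, delta_minus the anchor-negative distance.
    d_pos = F.pairwise_distance(anchor, positive)   # delta_plus
    d_neg = F.pairwise_distance(anchor, negative)   # delta_minus
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()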

3. Towards Large Historical Document Datasets with Historical Image Synthesis

Unfortunately, in the field of Historical DIA, labelled image datasets are a scarce
resource. This lack of labelled training data makes it challenging to take advan-
tage of several deep learning breakthroughs. This section shows how it is possible
to go one step further.‡ The approach takes advantage of recent advancements in
the design of Generative Adversarial Networks (GANs) and Neural Style Transfer
Algorithms (NST). It learns a mapping function that goes from the Source Domain
S (modern printed electronic document) to the Target Domain T (historical
handwritten document). The primary goal is to generate synthetic historical
‡ This section is an updated version of our original work published in.53

handwritten document images, which looks like other historical handwritten doc-
uments from T . The secondary goal is to demonstrate a new promising and more
straightforward approach to create a large amount of complex synthetic handwrit-
ten historical documents based on different ground-truthed electronic documents.
The proposed cycleGAN framework reaches the goals and performs the generation
task in an integrated manner.

3.1. Task
As shown in Figure 3, we tackle the problem of historical document image synthesis
in two steps. The first step of generating source domain documents is achieved with
a Latex framework as described in previous work.53 In the second step, we train
a neural network to learn a mapping function between the source domain (modern
document) and the target domain (historical handwritten document). The second
step can be further divided into three tasks: a reconstruction task and a classification task, which are used to pre-train the networks, and the final generation task.

• The reconstruction task is used to pre-train the two encoder components


of the generators of the cycleGAN model.
• The classification task is used to pre-train the Neural Style Transfer model
and the discriminator component of the cycleGAN model.
• The generation task is the main Image-to-Image translation task that uses
unpaired data. This task is performed with two different approaches, the
cycleGAN model and the Neural Style Transfer model.

3.2. Data Pre-processing


The historical documents in this work come from the HBA 1.0 dataset§ . The
dataset is composed of eleven books, five manuscripts and six printed books written
in different languages and typographies, and published between the twelfth and
nineteenth centuries. It contains 4436 pages: 2,435 handwritten pages and 2,001
printed pages. This section uses 2553 images from six books, of which four books
have handwritten text and two books have printed text. The approximate average
resolution of the documents is 2600×4000 pixels. Additionally, all blank and binding
pages are removed to have only pages containing text.
The high resolution of the images makes it difficult to train the cycleGAN and
the VGG-19 models on the native resolution image due to computational memory
constraints. Therefore, two different data pre-processing methods are applied before
training:

• Complete Document: The entire document image serves as input to the


network after resizing it to 256 × 256 pixels. This reduces the fine detail,
but preserves the global structure and look of the document.
§ http://hba.litislab.eu/index.php/dataset/

• Random Crop: A random crop of size 256 × 256 pixels is taken from the central portion (to avoid blank border regions) of the document. In this scenario, the fine detail of the pages is preserved, and crops have almost the same number of lines and words in both the historical and modern domains (a minimal sketch of this cropping step follows this list).
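A minimal sketch of the cropping step, assuming PIL images and an illustrative border fraction that is not specified in the chapter, could look as follows:

import random
from PIL import Image

def central_random_crop(page, crop=256, border_frac=0.15):
    # Random crop of size crop x crop from the central portion of a document
    # page, avoiding the (mostly blank) border regions.
    w, h = page.size
    x0, y0 = int(w * border_frac), int(h * border_frac)
    x1 = max(x0, int(w * (1 - border_frac)) - crop)
    y1 = max(y0, int(h * (1 - border_frac)) - crop)
    x, y = random.randint(x0, x1), random.randint(y0, y1)
    return page.crop((x, y, x + crop, y + crop))

# Illustrative usage (hypothetical file name):
# patch = central_random_crop(Image.open('hba_page.jpg'))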

3.3. Model Architecture and Training

The historical dataset used in this work does not contain paired images between
the source and target domains. Thus, this work uses a variant of GANs, called
the cycleGAN,54 that uses the cycle consistency loss. The cycleGAN architecture
performs a transformation of the images from the source to the target domains
and vice-versa. The cycle consistency loss with the bi-directional mapping function
coupled with the L1 distance loss increases the learning stability of the adversarial
framework in an unpaired image setting.55 When training the cycleGAN from
scratch, it runs for 50 epochs with a batch size of 1. The learning rate is 0.0002
with a linear decay starting from epoch 25. It is useful to use a history buffer that
stores the 50 most recently generated images. This history buffer is used to update
the discriminator and reduce model oscillations during training. When training the
cycleGAN in the pre-trained scenario, the generator and discriminator are initialized
with weights obtained from the reconstruction and classification tasks respectively.
The weights of the encoder part of the generators are initialized with the weights
of the encoder component of an auto-encoder that is trained for reconstruction on
the datasets.
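The history buffer can be sketched as a small image pool, following the common cycleGAN practice of returning either the freshly generated image or a randomly chosen older one; only the buffer size of 50 comes from the text, while the 0.5 replacement probability is an assumption.

import random

class ImageHistoryBuffer:
    # Pool of the most recently generated images, used to update the discriminator
    # with a mix of current and earlier fakes and thus reduce oscillations.
    def __init__(self, size=50):
        self.size = size
        self.images = []

    def query(self, image):
        # While the buffer is not full, store the new image and return it directly.
        if len(self.images) < self.size:
            self.images.append(image)
            return image
        # Otherwise, with probability 0.5, return an old image and replace it.
        if random.random() < 0.5:
            idx = random.randrange(self.size)
            old, self.images[idx] = self.images[idx], image
            return old
        return image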
For the Neural Style Transfer the VGG-19 Convolutional Neural Network im-
plementation56 is used, where the goal is to minimize the content loss and the style
functions conjointly. This work uses the VGG-19 based NST model in two differ-
ent settings – using ImageNet weights and using weights from a pre-trained model.
When using the model with the ImageNet weights, only the last layers of the net-
work are reinitialized, and then the NST procedure is applied to the images. For
pre-training, first the VGG-19 is trained on the dataset for a classification task. The
network is trained for 25 epochs with a batch size of 4, learning rate of 0.001 and
momentum of 0.9. The weights of this model are then used for the NST procedure.

3.4. Results

The quality of the results produced by generative models is typically evaluated


quantitatively or qualitatively. Quantitative methods are divided into two subcate-
gories, the quantitative perceptual studies that are human-based and the quantita-
tive metrics that are machine-based and task-dependent. In this section, qualitative
evaluation is used, based on the subjective perceptual appreciation of the results.
For more results, refer to.53
Figure 4 shows sample results generated by the cycleGAN and the NST on
the complete document images.

(a) Target Domain samples; (b) Source Domain samples; (c) cycleGAN synthetic samples; (d) NST synthetic samples.
Fig. 4. Examples of images generated by the cycleGAN and the NST after training on the Complete Document images. The first and second columns contain samples from the Target Domain and Source Domain respectively. The third column contains samples generated by the cycleGAN trained from scratch. The samples generated by the NST model (pre-trained on the PDD dataset) can be seen in the fourth column. Every sample contains a zoomed-in view to see the quality of the generated pages.

The synthetic images generated by the cycleGAN
appear significantly better than those generated with NST. Regarding the semantic
content (font shape, readability of words and letters, marginal annotations), we
can notice many similarities between the target domain samples and the synthetic
samples. The overall style content of the target domain (background colour, texture,
paper degradation, initials style) is well expressed. However, from a structural content point of view (column mode, number and presence of initials, textual artifacts), the initials are not well detected and expressed, and the two-column mode is not expressed at all. When considering the synthetic documents produced with the NST, the structural content is better preserved. However, the style is mixed and standardized over the entire synthetic document, leading to the presence of many coloured artifacts.

3.5. Future Work

The effectiveness of the generated synthetic images could be enhanced by pre-


training neural networks on the synthetic data and comparing the performance
of these pre-trained networks against randomly initialised networks for other DIA
tasks. Additionally, the generated synthetic documents could be improved by fusing the global and local generation procedures to produce documents that have the
correct global structure as well as fine-grained details.
These techniques, combined with the techniques mentioned in Sec. 1.3 can be
used to generate a massive database of historical structured documents, including
epic texts, religious texts, demographic reports, and economic reports. Existing
databases can serve as an input for the GAN-based document generation framework.
In our future work, we plan to create a database containing several millions of
document images together with GT for logical layout analysis, text extraction, and
OCR. With this growing dataset we plan to host a series of novel challenging public
competitions, where world-wide researchers can participate with their methods.¶

References

1. F. S. Liwicki, R. Saini, D. Dobson, J. Morrey, and M. Liwicki, Icdar 2019 historical


document reading challenge on large structured chinese family records, arXiv preprint
arXiv:1903.03341. (2019).
2. A. Nicolaou and M. Liwicki. Redefining binarization and the visual archetype. In
Proceedings of the 12th IAPR International Workshop on Document Analysis Systems,
p. to appear, (2016).
3. A. Garz, A. Fischer, R. Sablatnig, and H. Bunke. Binarization-Free Text Line Seg-
mentation for Historical Documents Based on Interest Point Clustering. In Int.
W. Document Analysis Systems, pp. 95–99, (2012). ISBN 978-0-7695-4661-2. doi:
10.1109/DAS.2012.23.
4. J. J. Sauvola and M. Pietikäinen, Adaptive Document Image Binarization, Pattern Recognition. 33(2), 225–236, (2000). doi: 10.1016/S0031-3203(99)00055-2.
5. K. Ntirogiannis, B. Gatos, and I. Pratikakis, A combined approach for the binarization
of handwritten document images, Pattern Recognition Letters. 35, 3–15, (2014).
6. T. Sari, A. Kefali, and H. Bahi, An MLP for binarizing images of old manuscripts,
Frontiers in Handwriting Recognition. pp. 247–251, (2012). doi: 10.1109/ICFHR.2012.
176.
7. I. Pratikakis, B. Gatos, and K. Ntirogiannis. Icdar 2013 document image binarization
contest (dibco 2013). In ICDAR, pp. 1471–1476, (2013).
8. S. Mao, A. Rosenfeld, and T. Kanungo. Document structure analysis algorithms: a
literature survey. In DRR, pp. 197–207, (2003).
9. F. Shafait, D. Keysers, and T. Breuel, Performance evaluation and benchmarking of six page segmentation algorithms, IEEE Trans. Pattern Analysis and Machine Intel-
ligence. 30(6), 941–954, (2008). ISSN 0162-8828. doi: 10.1109/TPAMI.2007.70837.
10. J. Kim, D. X. Le, and G. R. Thoma. Automated labeling in document images. In
Proc. SPIE: Document Recognition and Retrieval VIII, pp. 111–122, (2001).
11. T. M. Breuel. The OCRopus open source OCR system. In SPIE Document Recognition
and Retrieval XV, vol. 6815, p. 68150F (Jan., 2008).
12. Z. Lu, R. M. Schwartz, P. Natarajan, I. Bazzi, and J. Makhoul. Advances in the BBN
BYBLOS OCR System. In ICDAR, pp. 337–340. IEEE Computer Society, (1999).
13. A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber,
A Novel Connectionist System for Unconstrained Handwriting Recognition, IEEE
¶ The first competition of the series, namely the ICDAR-2019-HDRC has already been organized

recently.1 The dataset is publicly available: http://tc11.cvc.uab.es/datasets/ICDAR2019HDRC_1



Trans. on Pattern Analysis and Machine Intelligence. 31(5), 855–868, (2009). doi:
10.1109/TPAMI.2008.137.
14. S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation.
9(8), 1735–1780, (1997).
15. P. Voigtlaender, P. Doetsch, and H. Ney. Handwriting recognition with large multi-
dimensional long short-term memory recurrent neural networks. In 2016 15th Inter-
national Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 228–233.
IEEE, (2016).
16. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ImageNet Large Scale Visual
Recognition Challenge, International Journal of Computer Vision (IJCV). 115(3),
211–252, (2015). doi: 10.1007/s11263-015-0816-y.
17. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In eds. F. Pereira, C. J. C. Burges, L. Bottou, and
K. Q. Weinberger, Advances in Neural Information Processing Systems 25, pp. 1097–
1105. Curran Associates, Inc., (2012).
18. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-
houcke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the
IEEE CVPR, pp. 1–9, (2015).
19. A. Graves, N. Jaitly, and A.-r. Mohamed. Hybrid speech recognition with deep bidi-
rectional lstm. In 2013 IEEE workshop on automatic speech recognition and under-
standing, pp. 273–278. IEEE, (2013).
20. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and
C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on
computer vision, pp. 740–755. Springer, (2014).
21. P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Bal-
ancing and answering binary visual questions. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 5014–5022, (2016).
22. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. (MIT Press, 2016).
23. A. Rozantsev, V. Lepetit, and P. Fua, On rendering synthetic images for training an
object detector, Computer Vision and Image Understanding. 137, 24 – 37, (2015).
ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2014.12.006.
24. N. Journet, M. Visani, B. Mansencal, K. Van-Cuong, and A. Billy, Doccreator: A new
software for creating synthetic ground-truthed document images, Journal of imaging.
3(4), 62, (2017).
25. D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura,
J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. Icdar 2015 competition
on robust reading. In 2015 13th International Conference on Document Analysis and
Recognition (ICDAR), pp. 1156–1160. IEEE, (2015).
26. T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun. An end-to-end textspotter
with explicit alignment and attention. In Proceedings of the IEEE CVPR, pp. 5020–
5029, (2018).
27. T. M. Breuel. High performance text recognition using a hybrid convolutional-lstm
implementation. In 14th Int. Conf. on Document Analysis and Recognition (ICDAR),
vol. 01, pp. 11–16, (2017).
28. J.-C. Burie, J. Chazalon, M. Coustaty, S. Eskenazi, M. M. Luqman, M. Mehri,
N. Nayef, J.-M. Ogier, S. Prum, and M. Rusiñol. Icdar2015 competition on smartphone
document capture and ocr (smartdoc). In 2015 13th Int. Conf. Document Analysis and
Recognition (ICDAR), pp. 1161–1165. IEEE, (2015).

29. P. Krishnan and C. Jawahar, Generating synthetic data for text recognition, arXiv
preprint arXiv:1608.04224. (2016).
30. T. Schweizer and L. Rosenthaler. SALSAH: eine virtuelle Forschungsumgebung für die Geisteswissenschaften. In EVA, pp. 147–153, (2011).
31. T. Causer and V. Wallace, Building a volunteer community: results and findings from
transcribe bentham, Digital Humanities Quarterly. 6, (2012).
32. S. He, P. Sammara, J. Burgers, and L. Schomaker. Towards style-based dating of
historical documents. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th
int. conf. on, pp. 265–270 (Sept, 2014). doi: 10.1109/ICFHR.2014.52.
33. L. Wolf, R. Littman, N. Mayer, T. German, N. Dershowitz, R. Shweka, and
Y. Choueka, Identifying join candidates in the cairo genizah, int. Journal of Com-
puter Vision. 94(1), 118–135, (2011).
34. P. Rückert, Ochsenkopf und Meerjungfrau. Wasserzeichen des Mittelalters. (Haupt-
staatsarchiv, Stuttgart, 2006).
35. E. Wenger. Metasuche in wasserzeichendatenbanken (bernstein-projekt): Heraus-
forderungen für die zusammenführung heterogener wasserzeichen-metadaten. In eds.
W. Eckhardt, J. Neumann, T. Schwinger, and A. Staub, Wasserzeichen - Schreiber -
Provenienzen: neue Methoden der Erforschung und Erschließung von Kulturgut im dig-
italen Zeitalter: zwischen wissenschaftlicher Spezialdisziplin und Catalog enrichment,
pp. 289–297. Vittorio Klostermann, Frankfurt am Main, (2016).
36. E. Wenger and M. Ferrando Cusi, How to make and organize a watermark database
and how to make it accessible from the bernstein portal: a practical example: Ivc+r,
Paper history. 17, 16–21, (2013).
37. S. Limbeck, Digitalisierung von Wasserzeichen als Querschnittsaufgabe. Überlegungen
zu einer gemeinsamen Wasserzeichendatenbank der Handschriftenzentren, Das Mitte-
lalter Perspektiven mediävistischer Forschung. 14(2), 146–155, (2009).
38. N. F. Palmer. Verbalizing watermarks : the question of a multilingual database. In eds.
P. Rückert and G. Maier, Piccard-Online. Digitale Präsentationen von Wasserzeichen
und ihre Nutzung, pp. 73–90. Kohlhammer, Stuttgart, (2007).
39. E. Frauenknecht. Papiermühlen in Württemberg. Forschungsansätze am Beispiel
der Papiermühlen in Urach und Söflingen. In eds. C. Meyer, S. Schultz, and
B. Schneidmüller, Papier im mittelalterlichen Europa. Herstellung und Gebrauch, pp.
93–114. De Gruyter, Berlin, Boston, (2015).
40. V. Pondenkandath, M. Alberti, R. Ingold, and M. Liwicki. Identifying Cross-Depicted
Historical Motifs. In 2018 16th International Conference on Frontiers in Handwriting
Recognition (ICFHR), Niagara Falls, USA (aug, 2018).
41. E. Frauenknecht. Von Wappen und Ochsenköpfen: zum Umgang mit großen Motivgruppen im Wasserzeichen-Informationssystem (WZIS). In eds. W. Eckhardt, J. Neu-
mann, T. Schwinger, and A. Staub, Wasserzeichen - Schreiber - Provenienzen : neue
Methoden der Erforschung und Erschließung von Kulturgut im digitalen Zeitalter: zwis-
chen wissenschaftlicher Spezialdisziplin und Catalog enrichment, pp. 271–287. Vittorio
Klostermann, Frankfurt am Main, (2016).
42. M. Alberti, V. Pondenkandath, M. Würsch, R. Ingold, and M. Liwicki. DeepDIVA:
A Highly-Functional Python Framework for Reproducible Experiments. In 2018 16th
International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara
Falls, USA (aug, 2018).
43. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, (2016).

44. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected
convolutional networks. In CVPR, pp. 4700–4708, (2017).
45. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ImageNet Large Scale Visual
Recognition Challenge, International Journal of Computer Vision (IJCV). 115(3),
211–252, (2015). doi: 10.1007/s11263-015-0816-y.
46. M. Z. Afzal, S. Capobianco, M. I. Malik, S. Marinai, T. M. Breuel, A. Dengel, and
M. Liwicki. DeepDocClassifier : Document Classification with Deep Convolutional
Neural Network. In 13th International Conference on Document Analysis and Recog-
nition, pp. 1111–1115. IEEE, (2015). ISBN 9781479918058.
47. M. S. Singh, V. Pondenkandath, B. Zhou, P. Lukowicz, and M. Liwicki. Transforming sensor data to the image domain for deep learning: an application to footstep detection.
In Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 2665–2672.
IEEE, (2017).
48. M. Levandowsky and D. Winter, Distance between sets, Nature. 234(5323), 34, (1971).
49. E. Hoffer and N. Ailon. Deep metric learning using triplet network. In Lecture Notes
in Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), vol. 9370, pp. 84–92, (2015). ISBN 9783319242606.
doi: 10.1007/978-3-319-24261-3 7.
50. V. Balntas, Learning local feature descriptors with triplets and shallow convolutional
neural networks, Bmvc. 33(1), 119.1–119.11, (2016). doi: 10.5244/C.30.119.
51. S. Zagoruyko and N. Komodakis, Learning to compare image patches via convolutional
neural networks, Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. 07-12-June(i), 4353–4361, (2015). ISSN 10636919.
doi: 10.1109/CVPR.2015.7299064.
52. X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature
and metric learning for patch-based matching. In Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, vol. 07-12-June, pp.
3279–3286, (2015). ISBN 9781467369640. doi: 10.1109/CVPR.2015.7298948.
53. V. Pondenkandath, M. Alberti, M. Diatta, R. Ingold, and M. Liwicki. Historical docu-
ment synthesis with generative adversarial networks. In 2019 International Conference
on Document Analysis and Recognition Workshops (ICDARW), vol. 5, pp. 146–151.
IEEE, (2019).
54. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Transla-
tion using Cycle-Consistent Adversarial Networks. In Computer Vision (ICCV), 2017
IEEE International Conference on, (2017).
55. H. Huang, P. S. Yu, and C. Wang, An Introduction to Image Synthesis with Generative
Adversarial Nets, arXiv preprint arXiv:1803.04469. (2018).
56. L. A. Gatys, A. S. Ecker, and M. Bethge. Image Style Transfer Using Convolutional
Neural Networks. In The IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR) (June, 2016).
This page intentionally left blank
March 12, 2020 16:8 ws-rv961x669 HBPRCV-6th Edn.–11573 riesen˙signature page 305

CHAPTER 2.7

SIGNATURE VERIFICATION VIA GRAPH-BASED METHODS

Paul Maergner1,∗, Kaspar Riesen2, Rolf Ingold1, Andreas Fischer1,3

1 University of Fribourg, Department of Informatics, DIVA Group, Fribourg, Switzerland
paul.maergner@unifr.ch
2 University of Applied Sciences and Arts Northwestern Switzerland, Institute for Information Systems, Olten, Switzerland
3 University of Applied Sciences and Arts Western Switzerland, Institute of Complex Systems, Fribourg, Switzerland

Signature verification systems aim at distinguishing between genuine and forged
signatures with respect to a given reference. In the last decades, a large number of
different signature verification frameworks have been proposed. In this chapter,
we review a recent line of research concerned with graph-based approaches to
offline signature verification. Actually, graphs offer a representation formalism
that is both powerful and flexible. That is, graphs allow us to simultaneously
model the local features and the global structure of a handwritten signature in
a natural and comprehensive way. In this chapter, we describe how signatures
can be represented by means of graphs, and moreover, thoroughly review two
standard graph matching algorithms that can be readily integrated into an end-
to-end signature verification framework.
1. Introduction
The first use of handwritten signatures can be traced back to the fourth century,
when signatures were used to protect the Talmud (i.e. the central text in Rabbinic
Judaism) from possible changes. Since then and to this day, handwritten signatures
have been used as biometric authentication and verification measure in a wide range
of business and legal transactions worldwide.
With the widespread use of signatures, the interest and necessity to verify the
authenticity of signatures has grown. Signature verification is often synonymous
with the process of comparing a questioned signature with a set of reference signa-
tures in order to distinguish between genuine and forged signatures.1 Traditionally,
this task is performed by human experts within the framework of graphology, i.e. the
study of handwriting. However, signature verification turns out to be a demanding
task as the decision has to be made on the basis of only a few original samples.
This motivated the research and development of automatic signature verification

systems – actually an active field of research to date.2,3
Automatic signature verification can be divided into two main approaches,
viz. online and offline signature verification. In the former case, signatures are
acquired by means of an electronic input device (e.g. digital pen or tablet, or via
input on a touch screen device). This allows the recording of dynamic temporal
information during the signing process (e.g. acceleration, speed, pressure, or pen
angle). In the latter case, signatures are captured offline and eventually digitized
by means of scanning. Consequently, the verification task relies solely on the (x, y)-
positions of the handwriting (i.e. strokes), and thus offline signature verification
is generally considered the more challenging task. This chapter focuses on offline
signature verification.
Signature verification systems can also be distinguished with respect to the
formalism that is actually used for the representation of the underlying signatures,
viz. statistical vs. structural representation. In the former approach feature vectors
– or sequences of feature vectors – are extracted from signature images, while the
latter approach makes use of more powerful representations such as graphs, for
instance.
The vast majority of offline signature verification systems rely on statistical
representations. In this scenario, feature sets are either manually engineered, or
–more recently– automatically learned from signature images. For instance, some
features are based on global handwriting characteristics like contour,4,5 outline,6
projection profiles,7 or slant direction.8,9 Local feature descriptors have also been
proposed, such as, for instance, Histogram of Oriented Gradients (HoG) 10 and Local
Binary Patterns (LBP).10,11 In recent years, features are increasingly extracted
from images with the help of end-to-end learning schemes, namely Deep Learning
approaches such as Convolutional Neural Networks (CNNs).12,13,23
Regardless of whether the features are manually engineered or learned on the im-
ages, the fixed-size feature vectors are eventually used in conjunction with statis-
tical classifiers or matching algorithms that operate on numerical streams. Widely
used approaches are, for instance, Support Vector Machines (SVMs),6,10,11 Dynamic
Time Warping (DTW),4,7 Hidden Markov Models (HMMs),6,9 or neural networks.14
In contrast with feature vectors, a graph-based representation makes it possible to represent
the inherent topological characteristics of a handwritten signature in a very natural
and comprehensive way.15 When used to represent signatures, the nodes of a graph
usually represent elementary strokes of the handwriting or keypoints in the signature
images, while the edges model the binary relationships that might exist between
these parts in the global structure. Previous works include the early proposal by
Sabourin et al. to represent signatures based on stroke primitives,16 the proposal
by Bansal et al. to use a modular graph matching approach,17 and the proposal by
Fotak et al. to use basic concepts of graph theory.18
The power and flexibility of graphs are, however, often sacrificed due to a massive
increase in the computational complexity of many mathematical operations. The
computation of a similarity or dissimilarity of graphs, for instance, is of much higher
complexity than computing a vector (dis)similarity. In order to address this issue,
a number of fast matching algorithms have been proposed in the last decade that
allow comparing larger graphs and/or larger amounts of graphs within a reasonable
time (e.g. Refs. 15,19,20).
In the present chapter, we review the groundwork of a recent line of research on
graph-based signature verification proposed by Maergner et al.21–24 This particular
framework is based on the graph edit distance between labeled graphs representing
individual signatures. The graph edit distance approach is rendered more efficient
by using two well-known approximations of graph edit distance, viz. the quadratic-
time Hausdorff edit distance19 and the cubic-time bipartite approximation.20
The aim of the present chapter is threefold. First, in Section 2 we describe
the basic steps that have to be carried out in a graph-based signature verification
framework (i.e. we briefly discuss the processing steps from images to verification).
Second, in Section 3 we thoroughly review the basic graph matching algorithms
that build the core of our verification framework and discuss possible adaptations
for the task of signature verification. Third, in Section 4 we present and discuss
the results of the basic experiments achieved with this framework on four widely
used signature benchmark datasets. Actually, the discussed framework and the
corresponding results build the starting point for several interesting extensions that
have been presented during the last years. These extensions are finally outlined in
the conclusions of this chapter.

2. From Signature Images to Graphs to Verifications

Offline signature verification systems are typically based on the following three
processing steps.

(1) First, digitized handwritten signatures are commonly preprocessed to reduce noise and variations.
(2) Second, statistical (i.e. vectorial) or graph-based representations are extracted from the preprocessed signature images.
(3) Finally, a questioned signature q of claimed user u is classified as genuine or forged.

The preprocessing of signature images in our work is based on binarization and skeletonization. In particular, we carry out the following three steps.

• A local edge enhancement is performed using a difference of Gaussians (DoG) filter on grayscale signature images.
• The enhanced image is binarized using a global threshold.
• The binary image is finally thinned to a single-pixel width using the algorithm proposed in Ref. 25 (an illustrative code sketch of these three steps is given below).
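For illustration only, the following minimal Python sketch mirrors these three preprocessing steps with common open-source imaging tools; it is not the implementation used by the authors. The DoG is realized by subtracting two Gaussian-filtered images, Otsu's method stands in for the (unspecified) global threshold, and skimage.morphology.skeletonize stands in for the thinning algorithm of Ref. 25; all parameter values are hypothetical.

```python
import numpy as np
from skimage import filters, morphology

def preprocess_signature(gray, sigma_small=1.0, sigma_large=2.0):
    """DoG edge enhancement -> global binarization -> skeletonization.

    gray: 2D array in [0, 1] with dark ink on a light background.
    Returns a boolean skeleton image (True marks ink pixels).
    """
    # Local edge enhancement with a difference of Gaussians (DoG) filter.
    dog = filters.gaussian(gray, sigma_small) - filters.gaussian(gray, sigma_large)

    # Global threshold on the enhanced image (Otsu chosen here as an example).
    thresh = filters.threshold_otsu(dog)
    binary = dog < thresh  # ink is darker than the background after enhancement

    # Thin the binary image to single-pixel width (stand-in for Ref. 25).
    return morphology.skeletonize(binary)
```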
Fig. 1. Image preprocessing illustrated on the first signature image of user 3941 from the GPDSsynthetic dataset.26 (Panels, left to right: Original, Difference of Gaussians, Binary, Skeleton.)

The three preprocessing steps are visualized in Fig. 1.
In our project, we are using so-called keypoint graphs as our representation
formalism. In particular, the following definition for graphs is used in the context
of the present work.

Definition 1 (Graph). Let LV and LE be finite or infinite label sets for nodes
and edges, respectively. A graph g is a four-tuple g = (V, E, μ, ν), where V is the
finite set of nodes, E ⊆ V × V is the set of edges, μ : V → LV is the node labeling
function, and ν : E → LE is the edge labeling function.

In keypoint graphs, the nodes represent keypoints on the handwriting and the node labels are the coordinates of these points, i.e. LV = ℝ × ℝ. The edges are unlabeled and undirected, i.e. LE = ∅ and (u, v) ∈ E ⇐⇒ (v, u) ∈ E, and connect two nodes if their corresponding points are directly connected on the handwriting.
The nodes and edges are extracted from the skeleton image of the signature. The keypoints are selected iteratively. First, junction points and end points are added to the set of keypoints. Second, the leftmost outer pixel of each circular structure is added if the structure does not yet contain a keypoint. Then, additional points are added by sampling the skeleton: starting at already selected keypoints, the skeleton is traced, and a new keypoint is added whenever the traveled distance without meeting a keypoint is greater than or equal to a user-defined threshold D. In order to formalize the relationship between nodes, we use undirected edges to connect neighboring keypoints on the skeleton.
The node labels are finally normalized to make the graph representation invari-
ant to translation by subtracting the average node label of this particular graph
from each node label in the graph. Thus, the nodes of a graph are always centered
around the origin (0, 0) in a two-dimensional plane.
An example of a keypoint graph is shown in Fig. 2. In this chapter, a graph is
termed gR if it is based on a signature image R.
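As a minimal illustration of this representation, the following sketch (hypothetical names, not the authors' code) stores a keypoint graph as a list of (x, y) node labels and a set of undirected edges, and centers the node labels at the origin as described above; the skeleton tracing with threshold D is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class KeypointGraph:
    """Keypoint graph: nodes labeled with (x, y) coordinates,
    unlabeled undirected edges given as index pairs."""
    nodes: list = field(default_factory=list)   # [(x, y), ...]
    edges: set = field(default_factory=set)     # {frozenset({i, j}), ...}

    def add_node(self, x, y):
        self.nodes.append((float(x), float(y)))
        return len(self.nodes) - 1

    def add_edge(self, i, j):
        # frozenset makes (u, v) and (v, u) the same edge (undirected).
        self.edges.add(frozenset((i, j)))

    def center(self):
        """Subtract the average node label so the graph is centered at (0, 0),
        making the representation invariant to translation."""
        n = len(self.nodes)
        cx = sum(x for x, _ in self.nodes) / n
        cy = sum(y for _, y in self.nodes) / n
        self.nodes = [(x - cx, y - cy) for x, y in self.nodes]
```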
In our signature verification system, the decision of whether an unseen signature
is a genuine signature of the claimed user is based on a set R of known genuine
signature graphs from that user, termed references. An unseen signature T (rep-
resented by gT ) is compared with all reference signature graphs gR ∈ R, and a
signature verification score is calculated. Formally, we match the corresponding
graphs by a certain graph matching procedure and compute several graph dissimilarities d(gT , gR ). On the basis of these dissimilarities, we then derive a signature verification score.

Fig. 2. Example keypoint graph generated from the first signature of user 3941 from the GPDSsynthetic dataset.26
It can be expected that each user shows a different degree of variability in his/her
signatures. Based on the reference signatures of a given user, we aim at predicting
how much variability we expect for this user and normalize the signature verification
score accordingly.
To this end, we apply a normalization with respect to the average dissimilarity
among the reference signatures.27 In particular, each reference signature is matched
with the other reference signatures of the same user and the average of the minima
is calculated. Formally,

$$\delta(R) = \operatorname*{avg}_{g_R \in R}\ \min_{g_S \in R \setminus g_R} d(g_R, g_S) \qquad (1)$$

This score is used to normalize the dissimilarity scores of each user. Formally, we define $\hat{d}(g_R, g_T, R)$ as the reference normalized score:
$$\hat{d}(g_R, g_T, R) = \frac{d(g_R, g_T)}{\delta(R)}, \qquad (2)$$
where $g_T$ is the graph-based representation of a questioned signature image $T$, and $g_R \in R$ is a graph-based representation of a reference signature.
The signature verification score d(R, gT ) is then calculated given a set of refer-
ence signature graphs R and the questioned signature graph gT . Formally,

$$d(R, g_T) = \min_{g_R \in R} \hat{d}(g_R, g_T, R) = \frac{\min_{g_R \in R} d(g_R, g_T)}{\delta(R)} \qquad (3)$$

A signature T will be accepted if the minimum dissimilarity d(R, gT ) between the references R and the graph representation gT of T is below a certain threshold.
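The computation of δ(R) and of the normalized verification score of Eqs. (1)–(3) can be sketched in a few lines of Python. The function d passed in is any graph dissimilarity (e.g. one of the approximations reviewed in Section 3); the function names and the decision threshold are illustrative, not part of the original framework.

```python
def user_variability(references, d):
    """delta(R) of Eq. (1): average over each reference of its minimum
    dissimilarity to the other references of the same user."""
    minima = []
    for i, g_r in enumerate(references):
        others = references[:i] + references[i + 1:]
        minima.append(min(d(g_r, g_s) for g_s in others))
    return sum(minima) / len(minima)

def verification_score(references, g_t, d):
    """d(R, g_T) of Eq. (3): minimum reference dissimilarity divided by delta(R)."""
    return min(d(g_r, g_t) for g_r in references) / user_variability(references, d)

def accept(references, g_t, d, threshold):
    """A questioned signature is accepted as genuine if its score lies below
    a (global) decision threshold."""
    return verification_score(references, g_t, d) < threshold
```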
Obviously, the dissimilarity computation d(·, ·) between two graphs builds the
fundamental building block in the complete verification procedure. In the following
section, the process of calculating a graph dissimilarity is described in more detail.
In particular, we review two well-known graph matching algorithms – the bipartite
approximation of graph edit distance20 as well as the Hausdorff edit distance.19

3. Graph Edit Distance and its Approximations

3.1. Graph Edit Distance
A large repository of graph matching procedures is available (see Refs. 28,29 for
exhaustive reviews). Graph edit distance (GED) is widely accepted as one of the
most flexible graph-matching approaches available since it can be used to match
any kind of labeled graphs given an appropriate cost function.30,31
Given two graphs, the source graph g1 and the target graph g2 , the basic idea
of GED is to transform g1 into g2 using some edit operations. The intuition behind
this idea is that for transforming g1 into g2 , only a few and weak edit operations
are needed if g1 and g2 are similar in terms of structure and labeling. Likewise, for
dissimilar graphs more and in particular strong edit operations are needed for this
transformation.
A standard set of edit operations is given by insertions, deletions, and substitu-
tions of both nodes and edges. We denote the substitution of two nodes u and v by
(u → v), the deletion of node u by (u → ε), and the insertion of node v by (ε → v).
For edges we use a similar notation. A sequence (e1 , . . . , ek ) of k edit operations
that transform g1 completely into g2 is called an edit path λ(g1 , g2 ) between g1 and
g2 .
Let Υ(g1 , g2 ) denote the set of all possible edit paths between two graphs g1 and
g2 . To find the most suitable edit path out of Υ(g1 , g2 ), one introduces a cost c(ei )
for every edit operation ei , measuring the strength of the corresponding operation.
In our scenario, we set the node substitution cost to the Euclidean distance
between the node labels of u and v. Formally,

$$c(u \to v) = \sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}, \qquad (4)$$
where $\mu_R(u) = (x_u, y_u)$ and $\mu_T(v) = (x_v, y_v)$ are the node labels of nodes $u \in V_R$ and $v \in V_T$, respectively.
For both deletions and insertions of nodes, we use a constant cost cnode . For-
mally,

$$c(u \to \varepsilon) = c(\varepsilon \to v) = c_{\mathrm{node}} \qquad (5)$$

The edge substitution cost is set to zero

$$c(e_R \to e_T) = 0, \qquad (6)$$
where eR ∈ ER and eT ∈ ET , while the cost of both edge deletion and insertion is
set to a constant value cedge .

$$c(e_R \to \varepsilon) = c(\varepsilon \to e_T) = c_{\mathrm{edge}} \qquad (7)$$
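Under this cost model, the edit operation costs reduce to a Euclidean label distance and two user-chosen constants. The small sketch below simply fixes Eqs. (4)–(7) in code for illustration; the function names are hypothetical.

```python
import math

def node_substitution_cost(label_u, label_v):
    """Eq. (4): Euclidean distance between the (x, y) node labels."""
    (xu, yu), (xv, yv) = label_u, label_v
    return math.hypot(xu - xv, yu - yv)

def node_deletion_insertion_cost(c_node):
    """Eq. (5): constant cost for deleting or inserting a node."""
    return c_node

def edge_substitution_cost():
    """Eq. (6): edge substitutions are free."""
    return 0.0

def edge_deletion_insertion_cost(c_edge):
    """Eq. (7): constant cost for deleting or inserting an edge."""
    return c_edge
```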

The edit distance of two graphs can now be defined by the minimum cost edit
path between two graphs.

Definition 2 (Graph Edit Distance). Let $g_1 = (V_1, E_1, \mu_1, \nu_1)$ be the source and $g_2 = (V_2, E_2, \mu_2, \nu_2)$ the target graph. The graph edit distance $d_{\mathrm{GED}}(g_1, g_2)$, or $d_{\mathrm{GED}}$ for short, between $g_1$ and $g_2$ is defined by
$$d_{\mathrm{GED}}(g_1, g_2) = \min_{\lambda \in \Upsilon(g_1, g_2)} \sum_{e_i \in \lambda} c(e_i), \qquad (8)$$

where Υ(g1 , g2 ) denotes the set of all edit paths transforming g1 into g2 and c
denotes the cost function measuring the strength c(ei ) of edit operation ei .

The minimal cost edit path found in Υ(g1 , g2 ) corresponding to dGED is termed
λmin from now on.
Optimal algorithms for computing the edit distance of graphs are typically based
on combinatorial search procedures that explore the space of all possible mappings of
the nodes and edges of g1 to the nodes and edges of g2 . Such an exploration is often
conducted by means of A* based search techniques32 using some heuristics.33,34

3.2. Bipartite Graph Edit Distance

A major drawback of A* based search techniques for graph edit distance computa-
tion is their computational complexity. In fact, the problem of minimizing the graph
edit distance can be reformulated as an instance of a Quadratic Assignment Problem
(QAP ).35 QAPs belong to the most difficult combinatorial optimization problems
for which only exponential run time algorithms are known to date. The graph edit
distance approximation framework introduced in Ref. 36 reduces the QAP of graph
edit distance computation to an instance of a Linear Sum Assignment Problem
(LSAP ). For solving LSAPs, a large number of quite efficient algorithms exist.37
LSAPs are concerned with the problem of finding the best bijective assignment between the independent entities of two sets $S_1 = \{s_1^{(1)}, \ldots, s_n^{(1)}\}$ and $S_2 = \{s_1^{(2)}, \ldots, s_n^{(2)}\}$ of equal size. In order to assess the quality of an assignment of two entities, a cost $c_{ij}$ is commonly defined that measures the suitability of assigning the $i$-th element $s_i^{(1)} \in S_1$ to the $j$-th element $s_j^{(2)} \in S_2$ (resulting in $n \times n$ cost values $c_{ij}$, $i, j = 1, \ldots, n$).
Definition 3 (Linear Sum Assignment Problem (LSAP)). Given two disjoint sets $S_1 = \{s_1^{(1)}, \ldots, s_n^{(1)}\}$ and $S_2 = \{s_1^{(2)}, \ldots, s_n^{(2)}\}$ and a cost $c_{ij}$ for every pair of entities $(s_i^{(1)}, s_j^{(2)}) \in S_1 \times S_2$, the Linear Sum Assignment Problem (LSAP) is given by finding
$$\min_{(\varphi_1, \ldots, \varphi_n) \in \mathcal{S}_n} \sum_{i=1}^{n} c_{i\varphi_i},$$
where $\mathcal{S}_n$ refers to the set of all $n!$ possible permutations of $n$ integers.

The reformulation of the graph edit distance problem to an instance of an LSAP
starts with the following definition of a square cost matrix whereon the LSAP is
eventually solved.38

Definition 4 (Cost matrix C). Based on the node sets V1 = {u1 , . . . , un } and
V2 = {v1 , . . . , vm } of g1 and g2 , respectively, a (n + m) × (n + m) cost matrix C is
established as follows.
$$C = \begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1m} & c_{1\varepsilon} & \infty & \cdots & \infty \\
c_{21} & c_{22} & \cdots & c_{2m} & \infty & c_{2\varepsilon} & \ddots & \vdots \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \infty \\
c_{n1} & c_{n2} & \cdots & c_{nm} & \infty & \cdots & \infty & c_{n\varepsilon} \\
c_{\varepsilon 1} & \infty & \cdots & \infty & 0 & 0 & \cdots & 0 \\
\infty & c_{\varepsilon 2} & \ddots & \vdots & 0 & 0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \infty & \vdots & \ddots & \ddots & 0 \\
\infty & \cdots & \infty & c_{\varepsilon m} & 0 & \cdots & 0 & 0
\end{bmatrix} \qquad (9)$$

Entry cij thereby denotes the cost of a node substitution (ui → vj ), ciε denotes
the cost of a node deletion (ui → ε), and cεj denotes the cost of a node insertion
(ε → vj ).

Obviously, the left upper corner of the cost matrix C = (cij ) represents the costs
of all possible node substitutions, the diagonals of the right upper and left bottom
corners the costs of all possible node deletions and node insertions, respectively. As
every node can be deleted or inserted at most once, any non-diagonal element of
the right-upper and left-lower part is set to ∞. Substitutions of the form (ε → ε)
should not cause any cost (thus the bottom right corner of C is set to zero).
Given the cost matrix $C = (c_{ij})$, the LSAP optimization consists in finding a permutation $(\varphi_1, \ldots, \varphi_{n+m})$ of the integers $(1, 2, \ldots, (n + m))$ that minimizes the overall assignment cost $\sum_{i=1}^{n+m} c_{i\varphi_i}$. In order to solve the LSAP on our specific cost matrix, the Hungarian algorithm39 with cubic time complexity is applied.
The optimal permutation corresponds to the assignment

$$\psi = \big((u_1 \to v_{\varphi_1}), (u_2 \to v_{\varphi_2}), \ldots, (u_{m+n} \to v_{\varphi_{m+n}})\big)$$

of the nodes of g1 to the nodes of g2 . Note that assignment ψ includes node assign-
ments of the form (ui → vj ), (ui → ε), (ε → vj ), and (ε → ε) (the latter can be
dismissed, of course).
In fact, so far, the cost matrix C = (cij ) considers the nodes of both graphs only,
and thus mapping ψ does not take any structural constraints into account. In order
to integrate knowledge about the graph structure, to each entry cij , i.e. to each cost
of a node edit operation (ui → vj ), the minimum sum of edge edit operation costs,
implied by the corresponding node operation, is added. This particular encoding of
the minimum matching cost arising from the local edge structure enables the LSAP
to consider information about the local, yet not global, edge structure of a graph.
The LSAP optimization finds an assignment ψ in which every node of g1 is either
assigned to a unique node of g2 or deleted. Likewise, every node of g2 is either
assigned to a unique node of g1 or inserted. Note, moreover, that edit operations
on edges are always defined by the edit operations on their adjacent nodes. That
is, whether an edge (u, v) is substituted, deleted, or inserted, depends on the edit
operations actually performed on both adjacent nodes u and v.
Hence, given the node assignment ψ, the edge edit operations can be completely
(and globally consistently) inferred from ψ resulting in an admissible edit path be-
tween the graphs under consideration. The corresponding cost of this edit path can
be interpreted as an approximate graph edit distance. We denote this suboptimal
edit distance with dBP (g1 , g2 ) (or dBP for short)a .
The solution ψ found by the Hungarian algorithm may not be optimal with
respect to graph edit distance and the corresponding edit path cost derived may be
higher than the cost of the optimal edit path λmin . Hence, the distance measure dBP
provides an upper bound for the exact graph edit distance, that is dGED (g1 , g2 ) ≤
dBP (g1 , g2 ).
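The following sketch illustrates this bipartite approximation for the keypoint graphs and the cost model used in this chapter. It builds the $(n+m) \times (n+m)$ cost matrix of Definition 4, adds a simplified local edge term to each entry (with unlabeled edges and zero edge substitution cost, the implied edge cost reduces to a degree difference), and solves the LSAP with scipy.optimize.linear_sum_assignment, a Hungarian-style solver. The returned assignment cost is taken as the approximate distance; this is an illustrative simplification, not the reference implementation of Ref. 20.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_ged(nodes1, deg1, nodes2, deg2, c_node, c_edge):
    """Approximate graph edit distance (in the spirit of d_BP) for keypoint graphs.

    nodes1, nodes2: arrays of shape (n, 2) and (m, 2) with (x, y) node labels.
    deg1, deg2:     node degrees of g1 and g2.
    """
    n, m = len(nodes1), len(nodes2)
    BIG = 1e9  # large finite value instead of infinity (solver-friendly)
    C = np.full((n + m, n + m), BIG)

    # Upper-left block: node substitutions plus implied local edge edits
    # (halved because every edge is shared by two nodes).
    for i in range(n):
        for j in range(m):
            C[i, j] = (np.linalg.norm(np.asarray(nodes1[i]) - np.asarray(nodes2[j]))
                       + abs(deg1[i] - deg2[j]) * c_edge / 2.0)

    # Diagonal of the upper-right block: node deletions (plus adjacent edges).
    for i in range(n):
        C[i, m + i] = c_node + deg1[i] * c_edge / 2.0
    # Diagonal of the lower-left block: node insertions (plus adjacent edges).
    for j in range(m):
        C[n + j, j] = c_node + deg2[j] * c_edge / 2.0

    # Lower-right block: (eps -> eps) assignments are free.
    C[n:, m:] = 0.0

    rows, cols = linear_sum_assignment(C)  # Hungarian-style LSAP solver
    return float(C[rows, cols].sum())
```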

3.3. Hausdorff Edit Distance
The approximation framework described above has been extended with a quadratic-
time matching procedure, namely the Hausdorff edit distance (HED).19,40 Rather
than reducing the graph edit distance problem to an instance of an LSAP, the pro-
posed method reduces the assignment problem to a Hausdorff matching problem41
between two sets of local substructures.
The computation of dBP described above still has a considerable cubic time
complexity of O((n + m)3 ) in the number of nodes n = |V1 | and m = |V2 | of the
involved graphs. This is due to the fact that an optimal solution to the assignment
problem for nodes and their local structure is calculated globally, taking all node
a BP stands for Bipartite – the assignment problem can also be formulated as finding a matching
in a complete bipartite graph and is therefore also referred to as bipartite graph matching problem.
Fig. 3. HED assignment graph.

matchings into account conjointly. In this extension, each node of the first graph
is compared with each node of the second graph similar to comparing subsets of a
metric space using the Hausdorff distance.41 Accordingly, the proposed Hausdorff
edit distance (HED) can be calculated in quadratic time, that is O(nm).
We start our formalization with the definition of the Hausdorff distance of two
subsets A, B of a metric space

$$H(A, B) = \max\Big(\sup_{a \in A} \inf_{b \in B} d(a, b),\ \sup_{b \in B} \inf_{a \in A} d(a, b)\Big), \qquad (10)$$

with respect to the metric d(a, b). In case of finite sets the Hausdorff distance is

$$H(A, B) = \max\Big(\max_{a \in A} \min_{b \in B} d(a, b),\ \max_{b \in B} \min_{a \in A} d(a, b)\Big), \qquad (11)$$

i.e. the maximum among all nearest neighbor distances between A and B. As it is
prone to outliers, the maximum operator can be replaced with the sum

$$H'(A, B) = \sum_{a \in A} \min_{b \in B} d(a, b) + \sum_{b \in B} \min_{a \in A} d(a, b), \qquad (12)$$

taking all distances into account. Finally, the Hausdorff edit distance (HED) dHED
between two graphs g1 = (V1 , E1 , μ1 , ν1 ) and g2 = (V2 , E2 , μ2 , ν2 ) is

$$d_{\mathrm{HED}}(g_1, g_2) = \sum_{u \in V_1} \min_{v \in V_2 \cup \{\varepsilon\}} N(u, v) + \sum_{v \in V_2} \min_{u \in V_1 \cup \{\varepsilon\}} N(u, v), \qquad (13)$$

with respect to the node matching cost N (u, v) defined below.
Figure 3 illustrates a possible HED node assignment for V1 = {v1 , v2 , v3 } and V2 = {u1 , u2 } with the assignments {(v1 → u1 ), (v2 → ε), (v3 → ε), (u1 → v1 ), (u2 → v1 )}. In contrast to the LSAP node assignment, multiple assignments to the same node are allowed.
The node matching cost N (u, v) for HED is defined as

$$N(u, v) = \begin{cases}
c_{u\varepsilon} + \sum_{p \in P} \dfrac{c_{p\varepsilon}}{2} & \text{for node deletion } (u \to \varepsilon) \\[6pt]
c_{\varepsilon v} + \sum_{q \in Q} \dfrac{c_{\varepsilon q}}{2} & \text{for node insertion } (\varepsilon \to v) \\[6pt]
\dfrac{c_{uv} + d_{\mathrm{HED}}(P, Q)/2}{2} & \text{for node substitution } (u \to v)
\end{cases} \qquad (14)$$

where P and Q are the edges adjacent to u and v, respectively, and cij the cost
for deletion, insertion, and substitution. Only half of the substitution costs are
considered because HED does not enforce bidirectional assignments. Only half of
the edge costs are considered because they connect two nodes.
The edge matching cost dHED (P, Q) is defined differently for HED when com-
pared with the bipartite framework. Instead of solving an assignment problem for
the adjacent edges of two nodes, Hausdorff matching is performed on the edges in
the same way as for the nodes (see Ref. 19 for details).
In summary, HED considers the best case for matching each node and edge in-
dividually and hence underestimates the true edit distance. That is, dHED (g1 , g2 ) ≤
dGED (g1 , g2 ). In order to constrain the underestimation, a lower bound can be used
for dHED (g1 , g2 ), which asserts a minimum amount of deletion and insertion costs
if the two matched graphs differ in size (see Ref. 19 for more details). Moreover, in
Ref. 42 an improvement of this approximation has been proposed which involves a
larger context of individual nodes.
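A corresponding quadratic-time sketch of the Hausdorff edit distance under the same cost model is given below. As in the previous sketch, the adjacent-edge term d_HED(P, Q) is simplified to a degree difference times c_edge, which is what the Hausdorff matching of unlabeled edges amounts to when edge substitutions are free; it is an illustration, not the implementation of Ref. 19.

```python
import numpy as np

def hausdorff_edit_distance(nodes1, deg1, nodes2, deg2, c_node, c_edge):
    """Quadratic-time HED sketch (Eqs. 13-14) for keypoint graphs with
    unlabeled edges; d_HED(P, Q) is simplified to |deg(u) - deg(v)| * c_edge."""
    nodes1 = np.asarray(nodes1, dtype=float)
    nodes2 = np.asarray(nodes2, dtype=float)
    n, m = len(nodes1), len(nodes2)

    # Node deletion / insertion: node cost plus half of each adjacent edge cost.
    del_cost = np.array([c_node + deg1[i] * c_edge / 2.0 for i in range(n)])
    ins_cost = np.array([c_node + deg2[j] * c_edge / 2.0 for j in range(m)])

    if n == 0 or m == 0:
        # One graph is empty: only deletions or insertions remain.
        return float(del_cost.sum() + ins_cost.sum())

    # Node substitution: half of (label distance + half of the edge term).
    sub = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            label_dist = np.linalg.norm(nodes1[i] - nodes2[j])
            edge_term = abs(deg1[i] - deg2[j]) * c_edge  # simplified d_HED(P, Q)
            sub[i, j] = (label_dist + edge_term / 2.0) / 2.0

    # Eq. (13): match every node to its cheapest counterpart or to epsilon;
    # assignments need not be bijective.
    cost1 = np.minimum(sub.min(axis=1), del_cost).sum()
    cost2 = np.minimum(sub.min(axis=0), ins_cost).sum()
    return float(cost1 + cost2)
```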

3.4. Normalization of Graph Edit Distance

In the following, we use d as a placeholder for both graph edit distance approxima-
tions, i.e. dBP or dHED .
Previous publications have shown that it is crucial to apply a normalization
when using a dissimilarity measure for signature verification. We normalize our
graph edit distance with what we refer to as maximal graph edit distance, i.e. the
cost of completely deleting the first graph and then inserting the complete second
graph.
Formally, given two graphs gR = (VR , ER , μR , νR ) and gT = (VT , ET , μT , νT )
and a cost function c, we define dmax as

$$d_{\max}(g_R, g_T) = \sum_{u \in V_R} c(u \to \varepsilon) + \sum_{e \in E_R} c(e \to \varepsilon) + \sum_{v \in V_T} c(\varepsilon \to v) + \sum_{e \in E_T} c(\varepsilon \to e) \qquad (15)$$
When using the cost function defined in Section 3.1, this equation can be simplified to
$$d_{\max}(g_R, g_T) = (|V_R| + |V_T|) \cdot c_{\mathrm{node}} + (|E_R| + |E_T|) \cdot c_{\mathrm{edge}} \qquad (16)$$

We now define the normalized graph edit distance of two signature images $R$ and $T$ as
$$d_{\mathrm{norm}}(g_R, g_T) = \frac{d(g_R, g_T)}{d_{\max}(g_R, g_T)}, \qquad (17)$$
where gR and gT are the keypoint graphs for the signature images R and T respec-
tively, and d(gR , gT ) is either dBP or dHED .
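In code, this normalization is a one-liner once the graph sizes are known; the small sketch below (hypothetical names) implements Eqs. (16)–(17).

```python
def normalized_distance(d_value, n_nodes_r, n_edges_r, n_nodes_t, n_edges_t,
                        c_node, c_edge):
    """Eqs. (16)-(17): divide a dissimilarity (d_BP or d_HED) by the maximum
    graph edit distance, i.e. deleting g_R completely and inserting g_T."""
    d_max = (n_nodes_r + n_nodes_t) * c_node + (n_edges_r + n_edges_t) * c_edge
    return d_value / d_max
```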

4. Experimental Evaluation

4.1. Experimental Setup
4.1.1. Datasets
We use four publicly available datasets to evaluate the performance of our frame-
work. In Table 1 a summary of some key characteristics of the datasets is given.

GPDSsynthetic-Offline is a large dataset26 of synthetic Western signatures.
It is the replacement for the popular GPDS-960 dataset and its earlier variants,
which are no longer available.43 The new dataset consists of 4,000 synthetic users
with 24 genuine signatures and 30 simulated forgeries for each user. All signatures
have been generated with differently modeled pens at a simulated resolution of 600
dpi. We are using two subsets of this dataset:

• GPDS-last100: Containing the last 100 users of the dataset (users 3901 to
4000).
• GPDS-75: Containing the first 75 users of the dataset (users 1 to 75).

The GPDS-last100 dataset is used as the training set for both structural meth-
ods. That is, we tune the parameters of our graph-based methods on this subset
exclusively.
UTSig is a Persian signature dataset.44 It consists of 115 users with 27 genuine
signatures, 3 opposite-hand signaturesb , and 42 forgeries for each user. The users
have been instructed to sign within six differently sized bounding boxes to simulate
different conditions. The resulting signatures have been scanned with 600 dpi.
MCYT-75 is an offline signature dataset within the MCYT baseline corpus.9,45
It consists of 75 users with 15 genuine signatures and 15 forgeries for each user. The
users signed in a 127mm × 97mm box and each signature has been scanned at 600
dpi.
b The opposite-hand signatures are treated as forgeries as suggested by the authors of the dataset.
Table 1. Number of users, genuine and forged signatures, as well as dpi during scanning for all datasets.

Name             Users   Genuine   Forgeries   dpi   used for tuning   used for testing
GPDS-last10026   100     24        30          600   x
GPDS-7526        75      24        30          600                     x
MCYT-759         75      15        15          600                     x
UTSig44          115     27        45          600                     x
CEDAR46          55      24        24          300                     x

CEDAR consists of 55 users46 with 24 genuine signatures and 24 forgeries for
each user. The users signed in a 50mm × 50mm box and each signature has been
scanned at 300 dpi.

4.1.2. Evaluation Metric
We evaluate the performance of our systems to distinguish between genuine signa-
tures and forgeries. We are using two types of forgeries, which are common in the
signature verification community.

• Skilled forgeries (SF): The target’s genuine signature is known to the forger,
and usually, the forger has time to practice it. This often leads to forgeries that
have high resemblance with their genuine counterparts.
• Random forgeries (RF): Genuine signatures of other users are used in a brute
force attack on the verification system. Another reasoning is that the forgers use
their own signatures since they have no knowledge about the target’s signature.
In our experiments, we are using one genuine signature from each other user as
random forgeries.

On all datasets but the UTSig data, we use the first 10 genuine signatures
for each user as reference (on UTSig, the first 12 signatures are employed). The
remaining genuine signatures are used as positive samples for the evaluation.
We evaluate the performance of our graph-based verification systems using the
equal error rate (EER). The EER is the error rate when the false rejection rate
(FRR) is equal to the false acceptance rate (FAR). The FRR refers to the percentage
of genuine signatures that are rejected by the system, and the FAR refers to the
percentage of forgeries accepted by the system. In order to determine FRR and
FAR directly, we have to decide on a (global) decision threshold (applied for all
users).
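The EER can be estimated by sweeping a global threshold over all observed scores and taking the operating point where FRR and FAR coincide (or are closest). The sketch below assumes that lower scores indicate genuine signatures, as in the verification score of Section 2; it is a simple illustration, not the evaluation code used for the reported results.

```python
import numpy as np

def equal_error_rate(genuine_scores, forgery_scores):
    """Equal error rate for a verification system where low scores mean genuine."""
    genuine = np.asarray(genuine_scores, dtype=float)
    forgery = np.asarray(forgery_scores, dtype=float)

    best_gap, eer = np.inf, 1.0
    for t in np.unique(np.concatenate([genuine, forgery])):
        frr = np.mean(genuine >= t)   # genuine signatures rejected at threshold t
        far = np.mean(forgery < t)    # forgeries accepted at threshold t
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer
```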

4.2. Empirical Results
For our first experiment, we create several sets of graphs for the first 10 genuine signatures per user of the GPDS-last100 dataset using different values for the threshold D during the graph extraction process.
Table 2. Number of nodes in keypoint graphs with respect to threshold D for the first 10 genuine graphs per user of the GPDS-last100 dataset.

D     minimum   median   average   maximum
25    23        125      130       355
50    15        70       73        194
100   9         42       45        120

Table 3. Average runtimes for computing dHED and dBP on GPDS-last100.

D     Approx.   Runtime
25    dBP       1029 ms
      dHED      113 ms
50    dBP       175 ms
      dHED      38 ms
100   dBP       51 ms
      dHED      17 ms

Table 2 shows the minimum, median, average, and maximum number of nodes in the graphs for a given D (we show the results for D ∈ {25, 50, 100}). The number of nodes increases as expected when lowering the threshold D.
In Table 3, we report the average runtimec on GPDS-last100 using both dBP
and dHED . The expected speed-up of dHED when compared with dBP is particularly
significant when using graph representations with more nodes.
The next question we want to answer is whether the distance measure dBP or
dHED performs better for the task of signature verification. In Table 4 we report the
EER on all datasets in both scenarios (RF and SF) achieved with dBP and dHED .
Regarding the random forgery scenario, we observe that the bipartite approximation
performs better than the Hausdorff approximation on all datasets. However, in
the scenario with skilled forgeries, dHED performs on two datasets slightly better
than the bipartite counterpart (UTSig and MCYT-75). On GPDS-75 the results
achieved with dBP are substantially better than the results obtained with dHED ,
while on CEDAR both dissimilarities result in the same EER score.
In summary, we can conclude that the bipartite distance dBP leads in general to a
(slightly) better EER in this particular experiment. On the other hand, we observe
substantially better runtimes when using dHED as basic dissimilarity measure rather
than dBP .

c Runtime is with respect to a Java implementation and AMD Opteron 2354 nodes with 2.2 GHz
CPU.
Table 4. Equal error rates on four data sets in a random and skilled forgery scenario using dBP and dHED.

              RF               SF
Dataset       dBP     dHED     dBP      dHED
GPDS-7526     3.80    3.89     6.67     9.33
UTSig44       4.90    4.00     18.96    17.33
MCYT-759      2.67    3.87     13.24    12.71
CEDAR46       5.05    5.93     17.50    17.50

5. Conclusions and Recent Work

The present chapter reviews a recent line of research concerned with graph-based
signature verification. We review the core processes actually needed in a signature
verification framework, viz. preprocessing of signature images and graph extraction
as well as the graph dissimilarity computation by means of two approximation
algorithms (bipartite graph edit distance and Hausdorff edit distance).
In a non-exhaustive experimental evaluation, we compare the two baseline meth-
ods with each other on four benchmark datasets. While the Hausdorff approach leads to substantially lower matching times (due to the lower complexity of the algorithm), the bipartite approximation achieves slightly better performance with respect to the verification accuracy (measured via equal error rates).
The system presented in this chapter actually builds the core for various subse-
quent verification systems. In Refs. 21 and 23, for instance, this system is extended
to an ensemble method that allows combining metric learning by means of a deep
CNN47 with the triplet loss function48 with the fast graph edit distance approxi-
mations described in this chapter.
Combining the present structural approach and statistical models has signif-
icantly improved the signature verification performance on the MCYT-75 and
GPDS-75 benchmark datasets.21 The structural model based on approximate graph
edit distance achieved better results on skilled forgeries, while the statistical model
based on metric learning with deep triplet networks achieved better results against
a brute-force attack with random forgeries. The proposed system was able to com-
bine these complementary strengths and has proven to generalize well to unseen
users, which have not been used for model training and parameter optimization.
In Refs. 22 and 24, the basic framework presented in this chapter is combined
with a tree-based inkball model. Inkball models are another recent structural ap-
proach for handwriting analysis proposed by Howe in Ref. 49. Originally, this
approach has been introduced as a technique for segmentation-free word spotting
that requires few training data. In addition to keyword spotting, inkball models
have been used for handwriting recognition as a complex feature in conjunction with
HMM.50 Inkball models are visually similar to keypoint graphs since they are using
very similar points on the handwriting as nodes. However, inkballs are connected to
a rooted tree that is directly matched with a skeleton image using an efficient algo-
rithm. The complementary aspects of the two dissimilarity measures are exploited
to achieve better verification results using a linear combination of the two dissim-
ilarity scores. The systems are evaluated individually as well as combined, and it
can be empirically shown that graph-based signature verification is able to reach
and, in some cases, surpass the current state of the art in signature verification,
motivating further research on structural approaches to signature verification.

References
1. D. Impedovo and G. Pirlo, Automatic signature verification: The state of the art,
IEEE Trans. on Systems, Man and Cybernetics Part C: Applications and Reviews. 38
(5), 609–635, (2008).
2. L. G. Hafemann, R. Sabourin, and L. S. Oliveira. Offline handwritten signature ver-
ification - literature review. In Proc of Int. Conf. on Image Processing Theory, Tools
and Applications (IPTA), pp. 1–8 (Nov, 2017).
3. M. Diaz, M. A. Ferrer, D. Impedovo, M. I. Malik, G. Pirlo, and R. Plamondon, A
perspective analysis of handwritten signature technology, ACM Comput. Surv. 51
(6), 117:1–117:39 (Jan., 2019). ISSN 0360-0300. doi: 10.1145/3274658. URL http:
//doi.acm.org/10.1145/3274658.
4. P. S. Deng, H.-Y. M. Liao, C. W. Ho, and H.-R. Tyan, Wavelet-Based Off-Line Hand-
written Signature Verification, Computer Vision and Image Understanding. 76(3),
173–190, (1999).
5. A. Gilperez, F. Alonso-Fernandez, S. Pecharroman, J. Fierrez, and J. Ortega-Garcia.
Off-line signature verification using contour features. In International Conference on
Frontiers in Handwriting Recognition. Concordia University, (2008).
6. M. A. Ferrer, J. Alonso, and C. Travieso, Offline geometric parameters for auto-
matic signature verification using fixed-point arithmetic, IEEE Transactions on Pat-
tern Analysis and Machine Intelligence. 27(6), 993–997, (2005).
7. A. Piyush Shanker and A. Rajagopalan, Off-line signature verification using DTW,
Pattern Recognition Letters. 28(12), 1407–1414 (9, 2007).
8. F. Alonso-Fernandez, M. Fairhurst, J. Fierrez, and J. Ortega-Garcia. Automatic Mea-
sures for Predicting Performance in Off-Line Signature. In IEEE International Con-
ference on Image Processing, pp. I–369–I–372. IEEE, (2007).
9. J. Fierrez-Aguilar, N. Alonso-Hermira, G. Moreno-Marquez, and J. Ortega-Garcia. An
off-line signature verification system based on fusion of local and global information.
In Biometric Authentication, pp. 295–306. Springer, (2004).
10. M. B. Yilmaz, B. Yanikoglu, C. Tirkaz, and A. Kholmatov. Offline signature verifi-
cation using classifier combination of HOG and LBP features. In International Joint
Conference on Biometrics, pp. 1–7. IEEE, (2011).
11. M. A. Ferrer, J. F. Vargas, A. Morales, and A. Ordonez, Robustness of Offline Signa-
ture Verification Based on Gray Level Features, IEEE Transactions on Information
Forensics and Security. 7(3), 966–977 (jun, 2012). ISSN 1556-6013.
12. S. Dey, A. Dutta, J. I. Toledo, S. K. Ghosh, J. Llados, and U. Pal. SigNet: Convolu-
tional Siamese Network for Writer Independent Offline Signature Verification. (2017).
13. L. G. Hafemann, R. Sabourin, and L. S. Oliveira, Learning features for offline hand-
written signature verification using deep convolutional neural networks, Pattern Recog-
nition. 70, 163–176, (2017).
14. A. Soleimani, B. N. Araabi, and K. Fouladi, Deep multitask metric learning for offline
signature verification, Pattern Recognition Letters. 80, 84–90, (2016).
15. M. Stauffer, P. Maergner, A. Fischer, and K. Riesen, Polar Graph Embedding for
Handwriting Applications, Pattern Analysis and Applications. Submitted, (2019).
16. R. Sabourin, R. Plamondon, and L. Beaumier, Structural interpretation of handwrit-
ten signature images, Int. Journal of Pattern Recognition and Artificial Intelligence.
8(3), 709–748, (1994).
17. A. Bansal, B. Gupta, G. Khandelwal, and S. Chakraverty, Offline signature verification
using critical region matching, Int. Journal of Signal Processing, Image Processing and
Pattern. 2(1), 57–70, (2009).
18. T. Fotak, M. Baca, and P. Koruga, Handwritten signature identification using basic
concepts of graph theory, WSEAS Transactions on Signal Processing. 7(4), 145–157,
(2011).
19. A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke, Approximation of graph
edit distance based on Hausdorff matching, Pattern Recognition. 48(2), 331–343 (2,
2015).
20. K. Riesen and H. Bunke, Approximate graph edit distance computation by means of
bipartite graph matching, Image and Vision Computing. 27(7), 950–959 (6, 2009).
21. P. Maergner, V. Pondenkandath, M. Alberti, M. Liwicki, K. Riesen, R. Ingold, and
A. Fischer. Offline Signature Verification by Combining Graph Edit Distance and
Triplet Networks. In International Workshop on Structural, Syntactic, and Statistical
Pattern Recognition, pp. 470–480. Springer, (2018).
22. P. Maergner, N. Howe, K. Riesen, R. Ingold, and A. Fischer. Offline Signature Ver-
ification Via Structural Methods: Graph Edit Distance and Inkball Models. In In-
ternational Conference on Frontiers in Handwriting Recognition, pp. 163–168. IEEE,
(2018).
23. P. Maergner, V. Pondenkandath, M. Alberti, M. Liwicki, K. Riesen, R. Ingold, and
A. Fischer. Combining graph edit distance and triplet networks for offline signature
verification. In Pattern Recognition Letters 125, pp. 527–533. (2019).
24. P. Maergner, N. Howe, K. Riesen, R. Ingold, and A. Fischer. Graph-Based Offline
Signature Verification. In arXiv:1906.10401, (2019).
25. T. Y. Zhang and C. Y. Suen, A fast parallel algorithm for thinning digital patterns,
Communications of the ACM. 27(3), 236–239, (1984).
26. M. A. Ferrer, M. Diaz-Cabrera, and A. Morales, Static Signature Synthesis: A Neu-
romotor Inspired Approach for Biometrics, IEEE Transactions on Pattern Analysis
and Machine Intelligence. 37(3), 667–680 (mar, 2015). ISSN 0162-8828.
27. A. Fischer, M. Diaz, R. Plamondon, and M. A. Ferrer. Robust score normalization
for DTW-based on-line signature verification. In Proc. of International Conference on
Document Analysis and Recognition (ICDAR), pp. 241–245. IEEE (8, 2015).
28. D. Conte, P. Foggia, C. Sansone, and M. Vento, Thirty years of graph matching in
pattern recognition, Int. Journal of Pattern Recognition and Artificial Intelligence. 18
(3), 265–298, (2004).
29. P. Foggia, G. Percannella, and M. Vento, Graph Matching and Learning in Pattern
Recognition in the last 10 Years, International Journal of Pattern Recognition and
Artificial Intelligence. 28(01), 1450001, (2014).
30. H. Bunke and G. Allermann, Inexact graph matching for structural pattern recogni-
tion, Pattern Recognition Letters. 1(4), 245–253 (5, 1983).
31. K. Riesen, Structural Pattern Recognition with Graph Edit Distance. Advances in
Computer Vision and Pattern Recognition, (Springer International Publishing, 2015).
32. P. Hart, N. Nilsson, and B. Raphael, A formal basis for the heuristic determination
of minimum cost paths, IEEE Transactions of Systems, Science, and Cybernetics. 4
(2), 100–107, (1968).
33. L. Gregory and J. Kittler. Using graph search techniques for contextual colour re-
trieval. In eds. T. Caelli, A. Amin, R. Duin, M. Kamel, and D. de Ridder, Proc. of the
Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern
Recognition, LNCS 2396, pp. 186–194, (2002).
34. S. Berretti, A. Del Bimbo, and E. Vicario, Efficient matching and indexing of graph
models in content-based retrieval, IEEE Trans. on Pattern Analysis and Machine
Intelligence. 23(10), 1089–1105, (2001).
35. X. Cortés, F. Serratosa, and A. Solé. Active graph matching based on pairwise prob-
abilities between nodes. In eds. G. Gimel’farb, E. Hancock, A. Imiya, A. Kuijper,
M. Kudo, O. S., T. Windeatt, and K. Yamad, Proc. 14th Int. Workshop on Structural
and Syntactic Pattern Recognition, LNCS 7626, pp. 98–106, (2012).
36. K. Riesen and H. Bunke, Approximate graph edit distance computation by means of
bipartite graph matching, Image and Vision Computing. 27(4), 950–959, (2009).
37. R. Burkard, M. Dell’Amico, and S. Martello, Assignment Problems. (Society for In-
dustrial and Applied Mathematics, Philadelphia, PA, USA, 2009). ISBN 0898716632,
9780898716634.
38. K. Riesen, Structural Pattern Recognition with Graph Edit Distance. (Springer, 2016).
39. J. Munkres, Algorithms for the Assignment and Transportation Problems, Journal of
the Society for Industrial and Applied Mathematics. 5(1), 32–38, (1957).
40. A. Fischer, C. Suen, V. Frinken, K. Riesen, and H. Bunke. A fast matching algorithm
for graph-based handwriting recognition. In eds. W. Kropatsch, N. Artner, Y. Hax-
himusa, and X. Jiang, Proc. 8th Int. Workshop on Graph Based Representations in
Pattern Recognition, LNCS 7877, pp. 194–203, (2013).
41. D. P. Huttenlocher, G. A. Klanderman, G. A. Kl, and W. J. Rucklidge, Comparing
images using the Hausdorff distance, IEEE Trans. PAMI. 15, 850–863, (1993).
42. A. Fischer, S. Uchida, V. Frinken, K. Riesen, and H. Bunke. Improving hausdorff
edit distance using structural node context. In eds. C. Liu, B. Luo, W. Kropatsch,
and J. Cheng, Proc. 10th Int. Workshop on Graph Based Representations in Pattern
Recognition, LNCS 9069, pp. 148–157, (2015).
43. M. A. Ferrer. GPDSsyntheticSignature database website, (2016). URL http://www.
gpds.ulpgc.es/downloadnew/download.htm. accessed on Jan 28, 2019.
44. A. Soleimani, K. Fouladi, and B. N. Araabi, Utsig: A persian offline signature dataset,
IET Biometrics. 6(1), 1–8, (2016).
45. J. Ortega-Garcia, J. Fierrez-Aguilar, D. Simon, J. Gonzalez, M. Faundez-Zanuy, V. Es-
pinosa, A. Satue, I. Hernaez, J.-J. Igarza, C. Vivaracho, D. Escudero, and Q.-I. Moro,
MCYT baseline corpus: a bimodal biometric database, IEEE Proceedings-Vision, Im-
age and Signal Processing. 150(6), 395–401, (2003).
46. M. K. Kalera, S. Srihari, and A. Xu, Offline signature verification and identification
using distance statistics, International Journal of Pattern Recognition and Artificial
Intelligence. 18(07), 1339–1360, (2004).
47. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proc of Conf. on Computer Vision and Pattern Recognition, pp. 770–778, (2016).
48. E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International
Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, (2015).
49. N. Howe. Part-structured inkball models for one-shot handwritten word spotting. In
Proc. of International Conference on Document Analysis and Recognition (ICDAR),
(2013).
50. N. Howe, A. Fischer, and B. Wicht. Inkball models as features for handwriting recog-
nition. In Proc. of International Conference on Frontiers in Handwriting Recognition
(ICFHR), (2016).
CHAPTER 2.8

CELLULAR NEURAL NETWORK FOR SEISMIC PATTERN RECOGNITION

Kou-Yuan Huang* and Wen-Hsuan Hsieh

Department of Computer Science
National Chiao Tung University, Hsinchu, 30010 Taiwan
*E-mail: kyhuang@cs.nctu.edu.tw

A cellular neural network (CNN) is adopted for seismic pattern recognition. We design the CNN to behave as an associative memory according to the stored patterns and complete the training process of the network. We then use this associative memory to recognize seismic testing patterns. In the experiments, the analyzed seismic patterns are the bright spot pattern and the right and left pinch-out patterns, which have the structure of gas and oil sand zones. From the recognition results, the noisy seismic patterns can be recovered. In the comparison of experimental results, the CNN has better recovery capacity than the Hopfield model. We also carry out experiments on seismic images: through window moving, the bright spot pattern and the horizon pattern can be detected. The results of seismic pattern recognition using the CNN are good, and they can help the analysis and interpretation of seismic data.

1. Introduction

In 1988, Chua and Yang proposed the theory and applications of the cellular neural network (CNN) [1]–[3]. After that, there were studies on the discrete-time CNN (DT-CNN) [4]–[6], and several papers discussed stability analysis and attractivity analysis [7]–[10]. The CNN has been used in many applications, such as the detection of geological lineaments on Radarsat images and seismic horizon picking [11], [12].
Here the DT-CNN is used as the associative memory [5], [6]. Each memory
pattern corresponds to a unique globally asymptotically stable equilibrium point
of the network. We use the motion equation of a cellular neural network to
behave as an associative memory, and then use the associative memory to
recognize patterns.
The seismic pattern recognition system using CNN is shown in Fig. 1. The
process of seismic pattern recognition is composed of two parts. In the training
part, the training seismic patterns can be used to construct the auto-associative


memory using DT-CNN. In the recognition part, the input testing seismic pattern
can be recognized by the auto-associative memory.
Fig. 1. Seismic pattern recognition system using cellular neural network. Training: training seismic patterns → cellular neural network → auto-associative memory. Recognition: testing seismic pattern → auto-associative memory → recovered seismic pattern.

2. Cellular Neural Network

2.1. Structure of Cellular Neural Network

The primary element of a CNN is the cell, shown in Fig. 2. Each cell has an input, a threshold, and an output. In a CNN, the cells are usually arranged in a two-dimensional array, as shown in Fig. 3. Every cell is influenced directly only by its neighboring cells, not by all other cells; that is, the input of one cell comes only from the inputs and outputs of its nearby neighboring cells.
Fig. 2. The element of a cellular neural network, cell Cij: input uij, state xij, output yij, and threshold Iij.
Fig. 3. A 6×10 cell array. Radius r = 1, the range of cell Cij and its neighboring cells (3×3 sphere of influence).

Consider an M × N two-dimensional cell array, arranged in M rows and N columns. The cell in the ith row and jth column is labeled Cij. The range of neighboring cells is called the neighborhood. The number of neighboring cells is the same for every cell; it is determined by the neighborhood radius, which differs from the usual circular radius. The neighborhood of a cell is defined as

$$N_{ij}(r) = \{\, C_{kl} \mid \max(|k - i|, |l - j|) \le r,\ 1 \le k \le M;\ 1 \le l \le N \,\}$$

N_ij(r) represents the set of neighboring cells Ckl of cell Cij, where the radius r is a positive integer. N_ij(r) is a (2r + 1) × (2r + 1) cell array. For simplicity, we omit r and write N_ij(r) as N_ij. For example, for r = 1, the range of cell Cij and its neighboring cells is of size 3 × 3, as shown in Fig. 3 [1]. The cell set contained by the grey square in Fig. 3 is N_ij(1), a 3 × 3 cell array.
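For illustration, the neighborhood N_ij(r) can be enumerated directly from this definition; the small sketch below uses the chapter's 1-based cell indexing and is purely illustrative.

```python
def neighborhood(i, j, r, M, N):
    """Cells C_kl with max(|k - i|, |l - j|) <= r inside an M x N array
    (1-based indices, as in the definition of N_ij(r))."""
    return {(k, l)
            for k in range(max(1, i - r), min(M, i + r) + 1)
            for l in range(max(1, j - r), min(N, j + r) + 1)}
```

For r = 1, for instance, neighborhood(3, 4, 1, 6, 10) returns the 3 × 3 set of cells around C34 in the 6 × 10 array of Fig. 3.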
In order to illustrate the propagation of the inputs and outputs of the neighboring cells, the cell array of the CNN is shown in Fig. 4. The left and right networks are the same as the middle network; they are drawn separately for ease of interpretation. In a CNN, the next state value of a cell is influenced by the inputs and outputs of its neighboring cells, which all feed back to this cell and are regarded as its inputs. The cell together with its neighboring connection weights moves over the array and is called a template.

Fig. 4. Model of cellular neural network. A standard CNN is specified by (A, B, I): the outputs Y of the neighboring cells are multiplied by template A, their inputs U are multiplied by template B, and the center cell's state X is influenced by both.

2.2. Discrete-time Cellular Neural Network

Each cell has its basic circuit structure [1]. From the continuous-time motion equation, Harrer and Nossek derived the discrete-time CNN (DT-CNN) [4]. Grassi also designed DT-CNNs for associative memories [5], [6].
Here we use Grassi’s method in the analysis. Consider a DT-CNN with a two-
dimensional M × N cell array [6]. For each cell (i, j), the equation of motion in a
DT-CNN is as follows.

$$x_{ij}(t+1) = \sum_{C_{kl} \in S_{ij}} A(i, j; k, l)\, y_{kl}(t) + \sum_{C_{kl} \in S_{ij}} B(i, j; k, l)\, u_{kl}(t) + I_{ij} \qquad (1)$$
$$y_{ij}(t+1) = f(x_{ij}(t+1)) = \begin{cases} 1 & \text{for } x_{ij}(t+1) \ge 0 \\ -1 & \text{for } x_{ij}(t+1) < 0 \end{cases} \qquad (2)$$
$$1 \le i \le M;\ 1 \le j \le N$$

where A(i, j; k, l) is the output weighting from cell (k, l) to cell (i, j); B(i, j; k, l) is
the input weighting from cell (k, l) to cell (i, j); I ij is the external input to cell (i,
j); ykl(t) is the output of the cell (k, l) at time t; ukl(t) is the input of neighboring
cell (k, l); Ckl is the cell (k, l); and Sij is the set of neighboring cells of cell (i, j).
The function f(·) is the hard-limiter activation function.
Equations (1) and (2) can be written in vector form [6]:
$$\mathbf{x}(t+1) = A\,\mathbf{y}(t) + B\,\mathbf{u} + \mathbf{e} \qquad (3)$$
$$\mathbf{y}(t) = f(\mathbf{x}(t)) \qquad (4)$$
In (3) and (4), the meanings of the symbols are as follows.
$\mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T \in \mathbb{R}^{n \times 1}$, the vector of the state values of every cell;
$\mathbf{y} = [y_1\ y_2\ \cdots\ y_n]^T \in \mathbb{R}^{n \times 1}$, the vector of the outputs of every cell;
$\mathbf{u} = [u_1\ u_2\ \cdots\ u_n]^T \in \mathbb{R}^{n \times 1}$, the vector of the inputs of every cell;
$\mathbf{e} = [I_1\ I_2\ \cdots\ I_n]^T \in \mathbb{R}^{n \times 1}$, the vector of the extra (external) inputs of every cell;
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \in \mathbb{R}^{n \times n}, \text{ the matrix of all } A(i, j; k, l);$$
$$B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{bmatrix} \in \mathbb{R}^{n \times n}, \text{ the matrix of all } B(i, j; k, l);$$
$\mathbf{f} = [f(x_1)\ f(x_2)\ \cdots\ f(x_n)]^T \in \mathbb{R}^{n \times 1}$, the vector of output functions.
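The recall phase of such a DT-CNN associative memory can be sketched by iterating Eqs. (3) and (4) until the output vector reaches an equilibrium point. The matrices A and B and the threshold vector e are assumed to have been obtained from the associative-memory design; initializing the output from the input pattern is an assumption made purely for this illustration.

```python
import numpy as np

def dtcnn_recall(A, B, e, u, max_iter=100):
    """Iterate x(t+1) = A y(t) + B u + e, y(t) = f(x(t)) with the hard-limiter f.

    A, B: (n, n) feedback and input weight matrices.
    e:    length-n threshold vector.   u: length-n input pattern (+1/-1 values).
    Returns the stabilized +1/-1 output pattern."""
    u = np.asarray(u, dtype=float)
    y = np.where(u >= 0, 1.0, -1.0)          # assumed initialization from the input
    for _ in range(max_iter):
        x = A @ y + B @ u + e                # Eq. (3)
        y_new = np.where(x >= 0, 1.0, -1.0)  # hard-limiter, Eqs. (2)/(4)
        if np.array_equal(y_new, y):         # reached an equilibrium point
            break
        y = y_new
    return y
```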

2.3. Design of Feedback Template with Linear Neighboring


A one-dimensional space-invariant template is used as the feedback template [6]. For example, a 4 × 4 cell array connected through linear neighboring with n = 16 cells and neighborhood radius r = 1 is shown in Fig. 5.

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Fig. 5. The connection relation of cells for designing associative memories with r = 1 and n = 16.

If there are n cells, then the feedback coefficients of the cells can be expressed
with the following matrix A.

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \qquad (5)$$

a_{11} is the self-feedback coefficient of the 1st cell; a_{12} is the feedback coefficient from the next cell (the 2nd cell) of the 1st cell in clockwise order; a_{21} is the feedback coefficient from the next cell (the 1st cell) of the 2nd cell in counterclockwise order. Since the feedback template is a one-dimensional space-invariant template, a_{11} = a_{22} = ... = a_{nn} = α_1 is the self-feedback coefficient of each cell; a_{12} = a_{23} = ... = a_{(n-1)n} = a_{n1} = α_2 is the feedback coefficient of the next cell in clockwise order; a_{1n} = a_{21} = a_{32} = ... = a_{n(n-1)} = α_n is the feedback coefficient of the next cell in counterclockwise order; a_{13} = a_{24} = ... = a_{(n-2)n} = a_{(n-1)1} = a_{n2} = α_3 is the feedback coefficient of the second-next cell in clockwise order, and so on. So matrix A can be expressed as

$$A = \begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_n\\ \alpha_n & \alpha_1 & \cdots & \alpha_{n-1}\\ \vdots & & \ddots & \vdots\\ \alpha_2 & \alpha_3 & \cdots & \alpha_1 \end{bmatrix}$$
Therefore A is a circulant matrix. The following one-dimensional space-invariant
template is considered:

[ a(-r) … a(-1) a( 0 ) a(1) … a( r ) ]

r is the neighborhood radius, a(0) is the self-feedback, a(1) is the feedback of the next cell in clockwise order, a(-1) is the feedback of the next cell in counterclockwise order, and so on. According to Eq. (5), we rearrange the template elements into the following row vector:
[ a( 0 ) a( 1 ) … a( r ) 0 …0 a( -r )… a( -1 ) ] (6)

Eq. (6) is the first row of matrix A. The second row is obtained by moving the last element of the first row to the first position and shifting the remaining elements of the first row one position to the right (a cyclic right shift). Similarly, each subsequent row is the cyclic right shift of the previous row; in this way matrix A is defined.
$$A = \begin{bmatrix}
a(0) & a(1) & \cdots & a(r) & 0 & \cdots & 0 & a(-r) & \cdots & a(-1)\\
a(-1) & a(0) & a(1) & \cdots & a(r) & 0 & \cdots & 0 & a(-r) & \cdots\\
\vdots & & & & \ddots & & & & & \vdots\\
a(1) & a(2) & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & a(-1) & a(0)
\end{bmatrix} \qquad (7)$$

When matrix A is designed as a circulant matrix, only its first row needs to be designed; each subsequent row is the previous row cyclically shifted right once. The number of 0s in Eq. (6) is determined by the radius r and by n. Since A ∈ ℝ^{n×n}, each row of A has n elements and A has n rows. If n = 9, there are nine elements in each row. When r = 1, the one-dimensional template is arranged according to Eq. (6) as shown in Eq. (8):

[ a(0)  a(1)  0  0  0  0  0  0  a(-1) ]    (8)

In Eq. (8) there are six 0s in the middle of the row; together with the other three elements, the row has nine elements.
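As an aside (our own sketch, with a hypothetical function name), the circulant matrix A of Eq. (7) can be built from the one-dimensional template [a(-r), ..., a(0), ..., a(r)] as follows; for n = 9 and r = 1 the first row reproduces Eq. (8).

    import numpy as np

    def circulant_feedback(template, n):
        """template: 2r+1 values in the order [a(-r), ..., a(-1), a(0), a(1), ..., a(r)]."""
        r = (len(template) - 1) // 2
        first_row = np.zeros(n)
        first_row[0] = template[r]              # a(0), self-feedback
        for k in range(1, r + 1):
            first_row[k] = template[r + k]      # a(k), clockwise neighbors
            first_row[n - k] = template[r - k]  # a(-k), counterclockwise neighbors
        # each row is the previous row cyclically shifted right once
        return np.stack([np.roll(first_row, i) for i in range(n)])

    print(circulant_feedback([0.01, 0.01, 0.01], 9)[0])  # matches Eq. (8) with a(.) = 0.01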

2.4. Stability

If a dynamic system has a unique equilibrium point which attracts every


trajectory in state space, then it is called globally asymptotically stable. A
criterion for the global asymptotic stability of the equilibrium point of DT-CNNs
with circulant matrices has been introduced [13]. The criterion is described in the
following.
DT-CNNs described by (3) and (4), with matrix A given by (7), are globally
asymptotically stable, if and only if

$$|F(2\pi q/n)| < 1, \qquad q = 0, 1, 2, \ldots, n-1 \qquad (9)$$

where F is the discrete Fourier transform of a(t):

$$F(2\pi q/n) = \sum_{t=-r}^{r} a(t)\, e^{-j 2\pi t q/n} \qquad (10)$$

The stability criterion (9) can be easily satisfied by choosing small values for
the elements of the one-dimensional space-invariant template. In particular, the
larger the network dimension n is, the smaller the values of the elements will be
by (10). On the other hand, the feedback values cannot be zero, since the stability
properties considered herein require that (3) be a dynamical system. These can
help the designer in setting the values of the feedback parameters. Namely, the
lower bound is zero, whereas the upper bound is related to the network
dimension.
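For example, the criterion (9)-(10) can be checked numerically with a short NumPy sketch like the one below (our own helper, not part of the chapter); the small template used later in the experiments easily satisfies it.

    import numpy as np

    def is_globally_stable(template, n):
        """Check |F(2*pi*q/n)| < 1 for q = 0, ..., n-1, with F given by Eq. (10)."""
        r = (len(template) - 1) // 2
        a = {t: template[r + t] for t in range(-r, r + 1)}
        for q in range(n):
            F = sum(a[t] * np.exp(-1j * 2 * np.pi * t * q / n) for t in range(-r, r + 1))
            if abs(F) >= 1:
                return False
        return True

    print(is_globally_stable([0.01, 0.01, 0.01], 16))  # prints True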

2.5. Design of DT-CNN for Associative Memories

The motion equation of a CNN is designed to behave as an associative memory. Given m bipolar (+1 or -1 valued) training patterns as input vectors u^i, i = 1, 2, ..., m, for each u^i there is only one equilibrium point x^i satisfying the motion equation (3):

$$\begin{cases} \mathbf{x}^1 = A\mathbf{y}^1 + B\mathbf{u}^1 + \mathbf{e}\\ \mathbf{x}^2 = A\mathbf{y}^2 + B\mathbf{u}^2 + \mathbf{e}\\ \quad\vdots\\ \mathbf{x}^m = A\mathbf{y}^m + B\mathbf{u}^m + \mathbf{e} \end{cases} \qquad (11)$$

We design the CNN to behave as an associative memory: A is mainly set up by the designer, and B and e are calculated from the training patterns.
In order to express (11) in matrix form, we first define the following matrices:

$$X = [\mathbf{x}^1\ \mathbf{x}^2\ \cdots\ \mathbf{x}^m] = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^m\\ x_2^1 & x_2^2 & \cdots & x_2^m\\ \vdots & & \ddots & \vdots\\ x_n^1 & x_n^2 & \cdots & x_n^m \end{bmatrix} \in \mathbb{R}^{n\times m}$$

$$Y = [\mathbf{y}^1\ \mathbf{y}^2\ \cdots\ \mathbf{y}^m] = \begin{bmatrix} y_1^1 & y_1^2 & \cdots & y_1^m\\ y_2^1 & y_2^2 & \cdots & y_2^m\\ \vdots & & \ddots & \vdots\\ y_n^1 & y_n^2 & \cdots & y_n^m \end{bmatrix} \in \mathbb{R}^{n\times m}$$

$$A_y = AY = [A\mathbf{y}^1\ A\mathbf{y}^2\ \cdots\ A\mathbf{y}^m] = [\mathbf{d}^1\ \mathbf{d}^2\ \cdots\ \mathbf{d}^m] \in \mathbb{R}^{n\times m}$$

$$\mathbf{d}^i = [d_1^i\ d_2^i\ \cdots\ d_n^i]^T \in \mathbb{R}^{n\times 1}, \quad i = 1, \ldots, m$$

$$U = [\mathbf{u}^1\ \mathbf{u}^2\ \cdots\ \mathbf{u}^m] = \begin{bmatrix} u_1^1 & u_1^2 & \cdots & u_1^m\\ u_2^1 & u_2^2 & \cdots & u_2^m\\ \vdots & & \ddots & \vdots\\ u_n^1 & u_n^2 & \cdots & u_n^m \end{bmatrix} \in \mathbb{R}^{n\times m}$$

$$J = [\mathbf{e}\ \mathbf{e}\ \cdots\ \mathbf{e}] = \begin{bmatrix} I_1 & I_1 & \cdots & I_1\\ I_2 & I_2 & \cdots & I_2\\ \vdots & & \ddots & \vdots\\ I_n & I_n & \cdots & I_n \end{bmatrix} \in \mathbb{R}^{n\times m}$$

Eq. (11) can be expressed in matrix form:

X = AY + BU + J    (12)
BU + J = X − AY = X − A_y    (13)

U consists of the input training patterns and is already known. Because Y is the desired output, initially Y = U is also known. Under the global asymptotic stability condition, we choose a sequence {a(-r), ..., a(-1), a(0), a(1), ..., a(r)} that satisfies criterion (9) and design A as a circulant matrix, so A is known too. From the output function we know that if y is +1 then x > 1, and if y is -1 then x < -1. U is a bipolar matrix, so Y is a bipolar matrix too, i.e., all elements of Y are +1 or -1; hence the elements of the state matrix X corresponding to Y are all greater than +1 or less than -1, and we can set X = αY = αU with α > 1. Thus U, Y, A, and X are all known, and we want to calculate B and J.
We define the following matrices:

$$R = [\,U^T\ \ \mathbf{h}\,] = \begin{bmatrix} u_1^1 & u_2^1 & \cdots & u_n^1 & 1\\ u_1^2 & u_2^2 & \cdots & u_n^2 & 1\\ \vdots & & \ddots & & \vdots\\ u_1^m & u_2^m & \cdots & u_n^m & 1 \end{bmatrix} \in \mathbb{R}^{m\times(n+1)}$$

$$\mathbf{h} = [1\ 1\ \cdots\ 1]^T \in \mathbb{R}^{m\times 1}$$

$$X_j = [x_j^1\ x_j^2\ \cdots\ x_j^m] \in \mathbb{R}^{1\times m}, \text{ the } j\text{th row of matrix } X$$

$$A_{y,j} = [d_j^1\ d_j^2\ \cdots\ d_j^m] \in \mathbb{R}^{1\times m}$$

$$[\,B\ \ \mathbf{e}\,] = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} & I_1\\ b_{21} & b_{22} & \cdots & b_{2n} & I_2\\ \vdots & & \ddots & \vdots & \vdots\\ b_{n1} & b_{n2} & \cdots & b_{nn} & I_n \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1\\ \mathbf{w}_2\\ \vdots\\ \mathbf{w}_n \end{bmatrix}$$

$$\mathbf{w}_j = [b_{j1}\ b_{j2}\ \cdots\ b_{jn}\ I_j] \in \mathbb{R}^{1\times(n+1)}, \quad j = 1, 2, \ldots, n$$

From (13), BU + J = X − A_y:

$$\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n}\\ b_{21} & b_{22} & \cdots & b_{2n}\\ \vdots & & \ddots & \vdots\\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{bmatrix}\begin{bmatrix} u_1^1 & u_1^2 & \cdots & u_1^m\\ u_2^1 & u_2^2 & \cdots & u_2^m\\ \vdots & & \ddots & \vdots\\ u_n^1 & u_n^2 & \cdots & u_n^m \end{bmatrix} + \begin{bmatrix} I_1 & I_1 & \cdots & I_1\\ I_2 & I_2 & \cdots & I_2\\ \vdots & & \ddots & \vdots\\ I_n & I_n & \cdots & I_n \end{bmatrix} = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^m\\ x_2^1 & x_2^2 & \cdots & x_2^m\\ \vdots & & \ddots & \vdots\\ x_n^1 & x_n^2 & \cdots & x_n^m \end{bmatrix} - \begin{bmatrix} d_1^1 & d_1^2 & \cdots & d_1^m\\ d_2^1 & d_2^2 & \cdots & d_2^m\\ \vdots & & \ddots & \vdots\\ d_n^1 & d_n^2 & \cdots & d_n^m \end{bmatrix}$$

In view of the jth row,

$$[\,b_{j1}\ b_{j2}\ \cdots\ b_{jn}\,]\begin{bmatrix} u_1^1 & u_1^2 & \cdots & u_1^m\\ u_2^1 & u_2^2 & \cdots & u_2^m\\ \vdots & & \ddots & \vdots\\ u_n^1 & u_n^2 & \cdots & u_n^m \end{bmatrix} + [\,I_j\ I_j\ \cdots\ I_j\,] = [\,x_j^1\ x_j^2\ \cdots\ x_j^m\,] - [\,d_j^1\ d_j^2\ \cdots\ d_j^m\,]$$

$$R\,\mathbf{w}_j^T = X_j^T - A_{y,j}^T \qquad (14)$$

$$\begin{bmatrix} u_1^1 & u_2^1 & \cdots & u_n^1 & 1\\ u_1^2 & u_2^2 & \cdots & u_n^2 & 1\\ \vdots & & \ddots & & \vdots\\ u_1^m & u_2^m & \cdots & u_n^m & 1 \end{bmatrix}\begin{bmatrix} b_{j1}\\ b_{j2}\\ \vdots\\ b_{jn}\\ I_j \end{bmatrix} = \begin{bmatrix} x_j^1\\ x_j^2\\ \vdots\\ x_j^m \end{bmatrix} - \begin{bmatrix} d_j^1\\ d_j^2\\ \vdots\\ d_j^m \end{bmatrix}, \qquad j = 1, 2, \ldots, n$$
Eq. (14) is the transpose of the jth row of (13), so we can rewrite (13) as (14). Because each cell is only influenced by its neighboring cells, matrix B is sparse and the elements of w_j are mostly 0. Removing the 0 elements of w_j gives w̃_j, and removing the corresponding columns of R gives R̃_j, with the property R̃_j w̃_j^T = R w_j^T. Then (14) becomes (15):

$$\tilde{R}_j \tilde{\mathbf{w}}_j^T = X_j^T - A_{y,j}^T \qquad (15)$$

$$\tilde{\mathbf{w}}_j^T = \tilde{R}_j^{+}\,(X_j^T - A_{y,j}^T), \qquad j = 1, 2, \ldots, n \qquad (16)$$

R̃_j is obtained from R according to the connection relation between the input of the jth cell and the inputs of the other cells. We express the connection relations of the cells' inputs by a matrix S, so R̃_j is obtained by taking out part of the column vectors of R according to the jth row of S. R̃_j^+ is the pseudoinverse of R̃_j, with R̃_j ∈ ℝ^{m×h_j}, w̃_j ∈ ℝ^{1×h_j}, and h_j = (Σ_{i=1}^{n} S_{ji}) + 1. S ∈ ℝ^{n×n} is the matrix that represents the connection relations of the cells' inputs:

$$s_{ij} = \begin{cases} 1, & \text{if the } i\text{th cell's input and the } j\text{th cell's input are connected}\\ 0, & \text{if the } i\text{th cell's input and the } j\text{th cell's input are not connected} \end{cases}$$
For example, for the 4 × 4 cell array with radius r = 1 in Fig. 5, S is the following:

$$S = \begin{bmatrix}
1&1&0&0&1&1&0&0&0&0&0&0&0&0&0&0\\
1&1&1&0&1&1&1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&1&1&1&0&0&0&0&0&0&0&0\\
0&0&1&1&0&0&1&1&0&0&0&0&0&0&0&0\\
1&1&0&0&1&1&0&0&1&1&0&0&0&0&0&0\\
1&1&1&0&1&1&1&0&1&1&1&0&0&0&0&0\\
0&1&1&1&0&1&1&1&0&1&1&1&0&0&0&0\\
0&0&1&1&0&0&1&1&0&0&1&1&0&0&0&0\\
0&0&0&0&1&1&0&0&1&1&0&0&1&1&0&0\\
0&0&0&0&1&1&1&0&1&1&1&0&1&1&1&0\\
0&0&0&0&0&1&1&1&0&1&1&1&0&1&1&1\\
0&0&0&0&0&0&1&1&0&0&1&1&0&0&1&1\\
0&0&0&0&0&0&0&0&1&1&0&0&1&1&0&0\\
0&0&0&0&0&0&0&0&1&1&1&0&1&1&1&0\\
0&0&0&0&0&0&0&0&0&1&1&1&0&1&1&1\\
0&0&0&0&0&0&0&0&0&0&1&1&0&0&1&1
\end{bmatrix}$$
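A minimal NumPy sketch (our own helper, assuming row-wise cell numbering as in Fig. 5) that builds S for an M × N cell array with radius r is:

    import numpy as np

    def connection_matrix(M, N, r):
        """s_ij = 1 iff cells i and j of the M x N array are within Chebyshev distance r."""
        n = M * N
        S = np.zeros((n, n), dtype=int)
        for i in range(n):
            ri, ci = divmod(i, N)
            for j in range(n):
                rj, cj = divmod(j, N)
                if max(abs(ri - rj), abs(ci - cj)) <= r:
                    S[i, j] = 1
        return S

    # The 16 x 16 matrix above is connection_matrix(4, 4, 1); e.g. its first row:
    assert connection_matrix(4, 4, 1)[0].tolist() == [1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0]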

From (14), we could also obtain w_j^T directly by

$$\mathbf{w}_j^T = R^{+}\,(X_j^T - A_{y,j}^T).$$

However, R^+ may not be unique, so w_j^T, and hence B, may not be unique, and B may then fail to conform to the interconnecting structure of the network inputs. Therefore we must use the matrix S to represent the interconnecting structure of the network inputs and use the above derivation to calculate matrix B. The steps of designing a CNN associative memory are summarized in the following.

Algorithm 1: Design a DT-CNN to behave as an associative memory in the training part
Input: m bipolar patterns u^i, i = 1, ..., m
Output: w_j = [b_{j1} b_{j2} ... b_{jn} I_j], j = 1, ..., n, i.e., B and e
Method:
(1) Set up matrix U from the training patterns u^i: U = [u^1 u^2 ... u^m].
(2) Establish Y = U.
(3) Set up S: s_ij = 1 if the ith cell's input and the jth cell's input are connected, and s_ij = 0 otherwise.
(4) Design matrix A as a circulant matrix that satisfies the globally asymptotically stable condition.
(5) Set the value of α (α > 1) and calculate X = αY.
(6) Calculate A_y = AY.
(7) for (j = 1 to n) do:
    Extract X_j from X.
    Extract A_{y,j} from A_y.
    Calculate R = [U^T h].
    Establish matrix R̃_j from matrix S and matrix R.
    Calculate the pseudoinverse R̃_j^+ of R̃_j.
    Calculate w̃_j^T = R̃_j^+ (X_j^T − A_{y,j}^T).
    Recover w_j from w̃_j^T.
End
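A compact NumPy sketch of Algorithm 1 is given below for illustration; the function name, argument shapes (U is n × m with bipolar entries, S is the n × n connection matrix), and the default α = 3 are our own assumptions.

    import numpy as np

    def train_dtcnn(U, A, S, alpha=3.0):
        n, m = U.shape
        Y = U.copy()                                 # desired outputs equal the stored patterns
        X = alpha * Y                                # X = alpha * Y
        Ay = A @ Y                                   # A_y = A Y
        R = np.hstack([U.T, np.ones((m, 1))])        # R = [U^T h]
        B = np.zeros((n, n))
        e = np.zeros(n)
        for j in range(n):
            cols = np.flatnonzero(np.append(S[j], 1))      # kept B columns plus the bias column
            Rj = R[:, cols]                                # reduced matrix R~_j
            wj = np.linalg.pinv(Rj) @ (X[j] - Ay[j])       # Eq. (16)
            full = np.zeros(n + 1)
            full[cols] = wj                                # recover w_j, zeros elsewhere
            B[j] = full[:n]
            e[j] = full[n]
        return B, e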

3. Pattern Recognition Using DT-CNN Associative Memory

After training, we can perform the recognition process. We have A, B, e, and the initial y(t). We input a testing pattern u into the equation of motion (3). After obtaining the state value x(t+1) at the next time step, we use the output function in (4) to calculate the output y(t+1). We keep calculating the state value and the output until the output values no longer change; the final output is then the classification of the testing pattern. The following algorithm describes the recognition process.

Algorithm 2: Use DT-CNN associative memory to recognize the testing pattern


Input: A, B, e, and testing pattern u in the equation of motion
Output: Classification of the testing pattern u

Method:
(1) Set up the initial output vector y; its element values are all in the interval [-1, 1].
(2) Input the testing pattern u together with A, B, e, and y into the equation of motion to get x(t+1):
    x(t + 1) = A y(t) + B u + e
(3) Input x(t+1) into the activation function to get the new output y(t+1). The activation function is

$$y = \begin{cases} 1, & \text{if } x > 1\\ x, & \text{if } -1 \le x \le 1\\ -1, & \text{if } x < -1 \end{cases}$$

(4) Compare the new output y(t+1) with y(t). If they are the same, stop; otherwise feed the new output y(t+1) into the equation of motion again and repeat Steps (2) to (4) until the output y no longer changes.
End
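For illustration, a minimal NumPy sketch of Algorithm 2 is given below (helper names are our own; the iteration stops when the output vector no longer changes).

    import numpy as np

    def recall(A, B, e, u, y0=None, max_iter=1000):
        n = len(u)
        y = np.zeros(n) if y0 is None else y0.astype(float)
        for _ in range(max_iter):
            x = A @ y + B @ u + e             # equation of motion, Eq. (3)
            y_new = np.clip(x, -1.0, 1.0)     # piecewise-linear activation of step (3)
            if np.array_equal(y_new, y):      # stop when the output is unchanged
                break
            y = y_new
        return y                              # converged output = recovered pattern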

4. Experiments

We have two kinds of experiments. The first one is on the simulated seismic
pattern recognition. The analyzed seismic patterns are bright spot pattern, right
and left pinch-out patterns that have the structure of gas and oil sand zones [17].
The second one is on the simulated seismic images. The analyzed seismic
patterns are bright spot pattern and horizon pattern. We use window moving to
detect the patterns.

4.1. Preprocessing on Seismic Data

We perform the experiments on seismograms. A seismogram can be preprocessed into an image. The preprocessing steps of the seismogram are shown in Fig. 6; they contain enveloping, thresholding, peaking, and compression in the time direction [14]. Fig. 7 shows a simulated seismogram. It consists of 64 seismic traces, and each trace contains many peaks (wavelets). We can extract the peak data from the seismogram through preprocessing and then transform the peak data into bipolar image data. Fig. 8 shows the result of preprocessing Fig. 7. The pixel symbol "1" marks a peak point and "0" the background.

Fig. 6. Preprocessing steps of the seismogram: input seismogram, enveloping, thresholding, peaks of the seismogram, and compression of the data in the time direction.

000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
011111111111111111111111111111111111111111111111111111111110
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000111111111111000000000000000000000000
000000000000000000001111000000000000111100000000000000000000
000000000000000001110000000000000000000011100000000000000000
000000000000000110000000000000000000000000011000000000000000
000000000000111000000000000011100000000000000111000000000000
000000000011000000000011111100011111100000000000110000000000
000000011100000000001100000000000000001110000000001110000000
000111100000000111110000000000000000010001110000000001111100
001000000000001000001111111111111111100000001110000000000011
000000000001110000000000000000000000000000000001100000000000
000000001110001111111000000000000000000111111110011100000000
000000110000000000000111111111111111111000000000000011100000
001111000000000000000000000000000000000000000000000000011100
110000000000000000000000000000000000000000000000000000000011
000000000000000000000000000000000000000000000000000000000000

Fig. 7. Simulated seismogram. Fig. 8. Preprocessing of Fig. 7.

4.2. Experiment 1: Experiment on Simulated Seismic Patterns


4.2.1. Experiment on Simulated Seismic Patterns
In this experiment, we store three simulated seismic training patterns and recognize three noisy input testing patterns. The three simulated peak-data patterns, the bright spot pattern and the right and left pinch-out patterns, are shown in Fig. 9(a), (b), and (c). The size is 12 × 48. We use these three training patterns to train the CNN.
The noisy testing bright spot pattern, right and left pinch-out patterns are shown
in Fig. 10(a), (b), and (c). We use the Hamming distance (HD), the number of differing symbols between the training pattern and the noisy pattern, to measure the noise ratio. Fig. 10(a) has a Hamming distance of 107, i.e., 19% noise (107 of the 12 × 48 = 576 pixels). Fig. 10(b) has a Hamming distance of 118 (21% noise), and Fig. 10(c) also has a Hamming distance of 118 (21% noise). We apply the DT-CNN associative memory with connection matrix S to this experiment. S is the matrix that represents the connection relations of the cells' inputs: if the ith cell's input and the jth cell's input are connected, then s_ij = 1; otherwise s_ij = 0. We set α = 3 and neighborhood radius r = 2, 3, and 4. For r = 2, the 1-D feedback template is [a(-2), a(-1), a(0), a(1), a(2)] = [0.01, 0.01, 0.01, 0.01, 0.01], and similarly for r = 3 and r = 4.


Fig. 9. Training seismic patterns: (a) bright spot, (b) right pinch-out, (c) left pinch-out.


Fig. 10. Noisy testing seismic patterns: (a) bright spot, (b) right pinch-out, (c) left pinch-out.

For r = 2, the recovered patterns are shown in Fig. 11(a), (b), and (c); they are not correct output patterns. Fig. 11(d), (e), and (f) show the energy versus iteration. We therefore set the neighborhood radius to r = 3 and test again. For r = 3, the recovered patterns are shown in Fig. 12(a), (b), and (c); Fig. 12(b) is not a correct output pattern. Fig. 12(d), (e), and (f) show the energy versus iteration. We then set the neighborhood radius to r = 4 and test again. For r = 4, the recovered pattern shown in Fig. 13(a) is the correct output pattern, and Fig. 13(b) shows the energy versus iteration.


Fig. 11. For r = 2, (a) output of Fig. 10(a), (b) output of Fig. 10(b), (c) output of Fig. 10(c),
(d) energy curve of Fig. 10(a), (e) energy curve of Fig. 10(b), (f) energy curve of Fig. 10(c).


Fig. 12. For r = 3, (a) output of Fig. 10(a), (b) output of Fig. 10(b), (c) output of Fig. 10(c),
(d) energy curve of Fig. 10(a), (e) energy curve of Fig. 10(b), (f) energy curve of Fig. 10(c).


Fig. 13. For r=4, (a) output of Fig. 10(b), (b) energy curve of Fig. 10(b).

Next, we apply the DT-CNN associative memory without matrix S to Fig. 10(a), (b), and (c). We set α = 3 and neighborhood radius r = 1. The output recovered patterns are the same as Fig. 9(a), (b), and (c), respectively.

4.2.2. Comparison with Hopfield Associative Memory

Hopfield associative memory was proposed by Hopfield [15], [16]. In the Hopfield model, the input of one cell comes from the outputs of all other cells.

We apply the Hopfield associative memory to Fig. 10(a), (b), and (c). The output recovered patterns are shown in Fig. 14(a), (b), and (c); only Fig. 14(b) is correct, so the recognition result is a failure. The results of the four DT-CNNs and the Hopfield model in this experiment are shown in Table 1.


Fig. 14. Results of Hopfield associative memory: (a) output of Fig. 10(a), (b) output of Fig. 10(b),
(c) output of Fig. 10(c).

Table 1. Results of four DT-CNNs and the Hopfield model.

                  DT-CNN with S                   DT-CNN without S   Hopfield model
                  r = 2      r = 3      r = 4     r = 1
Recognition       Failure    Failure    Success   Success            Failure

4.3. Experiment 2: Experiment on Simulated Seismic Images

We apply the DT-CNN associative memory with matrix S to recognize the simulated seismic images. In this experiment, we store two training seismic patterns and recognize the patterns in three seismic images. The two training patterns are the bright spot pattern and the horizon pattern, shown in Fig. 15(a) and (b); the size is 16 × 50. Most seismic data have horizons related to geologic layer boundaries. We set the neighborhood radius r to 1 and 3, and α to 3.

Fig. 15. Two training seismic patterns: (a) bright spot pattern, (b) horizon pattern.

We have three testing seismic images, shown in Fig. 16(a), 17(a), and 18(a). Their size is 64 × 64, larger than the 16 × 50 size of the training patterns. We use a window to extract the testing pattern from the seismic image; the size of this window is equal to the size of the training pattern. The window is shifted from left to right and top to bottom over the testing seismic image. If the output pattern of the network is equal to one of the training patterns, we record the coordinate of the upper-left corner of the window. After the window has been shifted to the last position on the testing seismic image and all testing patterns have been recognized, we calculate the center of all recorded coordinates belonging to the same kind of training pattern, and then use this center coordinate to recover the detected training pattern.
We set the neighborhood radius r = 1 to process Fig. 16(a) and Fig. 17(a), and r = 3 to process Fig. 18(a). For the first image in Fig. 16(a), the horizon is short, and the detected pattern in Fig. 16(c) is only the bright spot. For the second image in Fig. 17(a), the horizon is long, and the detected patterns in Fig. 17(c) are the horizon and the bright spot. For the third image in Fig. 18(a), the horizon and bright spot patterns have discontinuities, but both kinds of patterns can still be detected in Fig. 18(c).



Fig. 16. (a) First testing seismic image, (b) coordinates of successful detection, (c) output of (a).



Fig. 17.(a) Second testing seismic image, (b) coordinates of successful detection, (c) output of (a).



Fig. 18. (a) Third testing seismic image, (b) coordinates of successful detection, (c) output of (a).

5. Conclusions

The CNN is adopted for seismic pattern recognition. We design the CNN to behave as an associative memory according to the stored training seismic patterns, completing the training process of the network, and then use this associative memory to recognize seismic testing patterns. In the experiments, the analyzed seismic patterns are the bright spot pattern and the right and left pinch-out patterns, which have the structure of gas and oil sand zones. The recognition results show that noisy seismic patterns can be recovered.
In the comparison of experimental results, the CNN has better recovery capacity than the Hopfield model. The two models differ: the cells of a CNN are locally connected, whereas the cells of the Hopfield model are globally connected; moreover, the input of a cell in the CNN comes from the inputs and outputs of its neighboring cells, while the input of a cell in the Hopfield model comes from all other cells.
We also carried out experiments on seismic images. Two kinds of seismic patterns, the bright spot pattern and the horizon pattern, are used in training. After training, we test on seismic images, and through window moving the patterns can be detected. The results of seismic pattern recognition using the CNN are good and can help the analysis and interpretation of seismic data.

Acknowledgements

This work was supported in part by the National Science Council, Taiwan, under
NSC92-2213-E-009-095 and NSC93-2213-E-009-067.

References
1. L. O. Chua and Lin Yang, “Cellular Neural Networks: Theory,” IEEE Trans. on CAS, vol. 35
no. 10, pp.1257-1272, 1988.
2. L. O. Chua and Lin Yang, “Cellular Neural Networks: Applications,” IEEE Trans. on CAS,
vol.35, no.10, pp. 1273-1290, 1988.
3. Leon O. Chua, CNN: A paradigm for complexity, World Scientific, 1998.
4. H. Harrer and J. A. Nossek, “Discrete-time Cellular Neural Networks,” International Journal
of Circuit Theory and Applications, vol. 20, pp. 453-468, 1992.
5. G. Grassi, “A new approach to design cellular neural networks for associative memories,”
IEEE Trans. Circuits Syst. I, vol. 44, pp. 835–838, Sept. 1997.
6. G. Grassi, “On discrete-time cellular neural networks for associative memories,” IEEE Trans.
Circuits Syst. I, vol. 48, pp. 107–111, Jan. 2001.
7. Liang Hu, Huijun Gao, and Wei Xing Zheng, “Novel stability of cellular neural networks with
interval time-varying delay,” Neural Networks, vol. 21, no. 10, pp. 1458-1463, Dec. 2008.

8. Lili Wang and Tianping Chen, “Complete stability of cellular neural networks with
unbounded time-varying delays,” Neural Networks, vol. 36, pp. 11-17, Dec. 2012.
9. Wu-Hua Chen, and Wei Xing Zheng, “A new method for complete stability analysis of
cellular neural networks with time delay,” IEEE Trans. on Neural Networks, vol. 21, no. 7, pp.
1126-1139, Jul. 2010.
10. Zhenyuan Guo, Jun Wang, and Zheng Yan, “Attractivity analysis of memristor-based cellular
neural networks with time-varying delays,” IEEE Trans. on Neural Networks, vol. 25, no. 4,
pp. 704-717, Apr. 2014.
11. R. Lepage, R. G. Rouhana, B. St-Onge, R. Noumeir, and R. Desjardins, “Cellular neural
network for automated detection of geological lineaments on radarsat images,” IEEE Trans. on
Geoscience and Remote Sensing, vol. 38, no. 3, pp. 1224-1233, May 2000.
12. Kou -Yuan Huang, Chin-Hua Chang, Wen-Shiang Hsieh, Shan-Chih Hsieh, Luke K.Wang,
and Fan-Ren Tsai, “Cellular neural network for seismic horizon picking,” The 9th IEEE
International Workshop on Cellular Neural Networks and Their Applications, CNNA 2005,
May 28~30, Hsinchu, Taiwan, 2005, pp. 219-222.
13. R. Perfetti, “Frequency domain stability criteria for cellular neural networks,” Int. J. Circuit
Theory Appl., vol. 25, no. 1, pp. 55–68, 1997.
14. K. Y. Huang, K. S. Fu, S. W. Cheng, and T. H. Sheen, "Image processing of seismogram: (A)
Hough transformation for the detection of seismic patterns (B) Thinning processing in the
seismogram," Pattern Recognition, vol.18, no.6, pp. 429-440, 1985.
15. J. J. Hopfield and D. W. Tank, “Neural” computation of decisions in optimization problems,”
Biolog. Cybern., 52, pp. 141-152, 1985.
16. J. J. Hopfield and D. W. Tank, “Computing with neural circuits: A model,” Science, 233, pp.
625-633, 1986.
17. M. B. Dobrin and C. H. Savit, Introduction to Geophysical Prospecting, New York: McGraw-
Hill Book Co., 1988.

CHAPTER 2.9

INCORPORATING FACIAL ATTRIBUTES IN CROSS-MODAL


FACE VERIFICATION AND SYNTHESIS

Hadi Kazemi, Seyed Mehdi Iranmanesh and Nasser M. Nasrabadi


Lane Department of Computer Science and Electrical Engineering,
West Virginia University, Morgantown, WV. USA

Face sketches are able to capture the spatial topology of a face while lacking some facial
attributes such as race, skin, or hair color. Existing sketch-photo recognition and synthesis
approaches have mostly ignored the importance of facial attributes. This chapter introduces
two deep learning frameworks to train a Deep Coupled Convolutional Neural Network
(DCCNN) for facial attribute guided sketch-to-photo matching and synthesis. Specifically,
for sketch-to-photo matching, an attribute-centered loss is proposed which learns several
distinct centers, in a shared embedding space, for photos and sketches with different com-
binations of attributes. Similarly, a conditional CycleGAN framework is introduced which
forces facial attributes, such as skin and hair color, on the synthesized photo and does not
need a set of aligned face-sketch pairs during its training.

1. Introduction

Automatic face sketch-to-photo identification has always been an important topic in com-
puter vision and machine learning due to its vital applications in law enforcement.1,2 In
criminal and intelligence investigations, in many cases, the facial photograph of a suspect
is not available, and a forensic hand-drawn or computer generated composite sketch follow-
ing the description provided by the testimony of an eyewitness is the only clue to identify
possible suspects. Based on the existence or absence of the suspect's photo in the law enforcement database, an automatic matching algorithm or a sketch-to-photo synthesis is needed.
Automatic Face Verification: An automatic matching algorithm is necessary for a
quick and accurate search of the law enforcement face databases or surveillance cameras
using a forensic sketch. The forensic or composite sketches, however, encode only limited
information of the suspects’ appearance such as the spatial topology of their faces while
the majority of the soft biometric traits, such as skin, race, or hair color, are left out.
Traditionally, sketch recognition algorithms fall into two categories, namely generative and discriminative approaches. Generative algorithms map one of the modalities into the
other and perform the matching in the second modality.3,4 On the contrary, discriminative
approaches learn to extract useful and discriminative common features to perform the veri-
fication, such as the Weber’s local descriptor (WLD),5 and scale-invariant feature transform


(SIFT).6 Nonetheless, these features are not always optimal for a cross-modal recognition
task.7 More recently, deep learning-based approaches have emerged as a general solution
to the problem of cross-domain face recognition. It is enabled by their ability in learn-
ing a common latent embedding between the two modalities.8,9 Despite all the success,
employing deep learning techniques for the sketch-to-photo recognition problem is still
challenging compared to the other single modality domains as it requires a large number
of data samples to avoid over-fitting on the training data or stopping at local minima. Fur-
thermore, the majority of the sketch-photo datasets include a few pairs of corresponding
sketches and photos.
Existing state-of-the-art methods primarily focus on making the semantic representa-
tion of the two domains into a single shared subspace, whilst the lack of soft-biometric
information in the sketch modality is completely ignored. Despite the impressive results of
recent sketch-photo recognition algorithms, conditioning the matching process on the soft
biometric traits has not been adequately investigated. Manipulating facial attributes in pho-
tos has been an active research topic for years.10 The application of soft biometric traits in
person reidentification has also been studied in the literature.11,12 A direct suspect identifi-
cation framework based solely on descriptive facial attributes is introduced in.13 However,
they have completely neglected the sketch images. In recent work, Mittal et al.14 employed
facial attributes (e.g. ethnicity, gender, and skin color) to reorder the list of ranked iden-
tities. They have fused multiple sketches of a single identity to boost the performance of
their algorithm.
In this chapter, we introduce a facial attribute-guided cross-modal face verification
scheme conditioned on relevant facial attributes. To this end, a new loss function, namely
attribute-centered loss, is proposed to help the network in capturing the similarity of identi-
ties that have the same facial attributes combination. This loss function is defined based on
assigning a distinct centroid (center point), in the embedding space, to each combination of
facial attributes. Then, a deep neural network can be trained using a pair of sketch-attribute.
The proposed loss function encourages the DCNN to map a photo and its corresponding
sketch-attribute pair into a shared latent sub-space in which they have similar representa-
tions. Simultaneously, the proposed loss forces the distance of all the photos and sketch-
attribute pairs to their corresponding centers to be less than a pre-specified margin. This
helps the network to filter out the subjects of similar facial structures to the query but a
limited number of common facial attributes. Finally, the learned centers are trained to keep
a distance related to their number of contradictory attributes. The justification behind the
latter is that it is more likely that a victim misclassifies a few facial attributes of the suspect
than most of them.
Sketch-to-Photo Synthesis: In law enforcement, the photo of the person of interest
is not always available in the police database. Here, an automatic face sketch-to-photo
synthesis comes handy enabling them to produce suspects’ photos from the drawn forensic
sketches. The majority of the current research works in the literature of sketch-based photo
synthesis have tackled the problem using pairs of sketches and photos that are captured un-
der highly controlled conditions, i.e., neutral expression and frontal pose. Different tech-

niques have been studied including transductive learning of a probabilistic sketch-photo


generation model,15 sparse representations,16 support vector regression,17 Bayesian ten-
sor inference,4 embedded hidden Markov model,18 and multiscale Markov random field
model.19 Despite all the success, slight variations in the conditions can dramatically de-
grade the performance of these photo synthesizing frameworks which are developed and
trained based on the assumption of having a highly controlled training pairs. In,20 a deep
convolutional neural network (DCNN) is proposed to solve the problem of face sketch-
photo synthesis in an uncontrolled condition. Another six-layer convolutional neural net-
work (CNN) is introduced in21 to translate photos into sketches. In,21 a novel optimiza-
tion objective is defined in the form of joint generative discriminative minimization which
forces the person’s identity to be preserved in the synthesis process.
More recently, generative adversarial networks (GANs)22 resulted in a significant im-
provement in image generation and manipulation tasks. The main idea was defining a new
loss function which can help the model to capture the high-frequency information and gen-
erate more sharp and realistic images. More specifically, the generator network is trained to
fool a discriminator network whose job is to distinguish synthetic and real images. Condi-
tional GANs (cGAN)23 are also proposed to condition the generative models and generate
images on an input which could be some attributes or another image. This ability makes
cGANs a good fit for many image transformation applications such as sketch-photo syn-
thesis,24 image manipulation,25 general-purpose image-to-image translation,23 and style
transfer.26 However, to train the network, the proposed GAN frameworks required a pair
of corresponding images from both the source and the target modalities. In order to bypass
this issue, an unpaired image-to-image translation framework was introduced in,27 namely
CycleGAN. The CycleGAN can learn image translation from a source domain to a target
domain without any paired examples. For the same reason, we follow the same approach as
CycleGAN to train a network for sketch-photo synthesis in the absence of paired samples.
Despite the profound achievements in the recent literature of face sketch-photo synthe-
sis, a key part, i.e., conditioning the face synthesis task on the soft biometric traits is mostly
neglected in these works. Especially in sketch-to-photo synthesis, independent of the qual-
ity of sketches, there are some facial attributes that are missing in the sketch modality, such
as skin, hair, eye colors, gender, and ethnicity. In addition, despite the existence of other
adhered facial characteristics, such as having eyeglasses or a hat, on the sketch side, con-
ditioning the image synthesis process on such information provides extra guidance about
the generation of the person of interest and can result in a more precise and higher quality
synthesized output. The application of soft biometric traits in person reidentification has
been studied in the literature.28 Face attributes help to construct face representations and
train domain classifiers for identity prediction. However, few researchers have addressed
this problem in sketch-photo synthesis,29 attribute-image synthesis,30 and face editing.31
Although the CycleGAN solved the problem of learning a GAN network in the absence
of paired training data, the original version does not force any conditions, e.g., facial at-
tributes, on the image synthesis process. In this chapter, we propose a new framework built
on the CycleGAN to generate face photos from sketches conditioned on relevant facial at-

tributes. To this end, we developed a conditional version of the CycleGAN which we refer
to as the cCycleGAN and trained it by an extra discriminator to force the desired facial
attributes on the synthesized images.

2. Attribute-Guided Face Verification

2.1. Center loss


Minimization of cross-entropy is a common objective to train a deep neural network for
classification or verification task. However, this loss function does not encourage the net-
work to extract discriminative features and only guarantees their separability.32 The intu-
ition behind the center loss is that the cross-entropy loss does not force the network to learn
the intra-class variations in a compact form. To bypass this problem, contrastive loss33 and
triplet loss34 have emerged in the literature to capture a more compact form of the intra-
class variations. Despite their recent diverse successes, their convergence rates are quite
slow. Consequently, a new loss function, namely center loss, has been proposed in32 to
push the neural network to distill a set of features with more discriminative power. The
center loss, Lc , is formulated as
1
m
Lc =  xi − cyi 22 , (1)
2 i=1

where m denotes the number of samples in a mini-batch, xi ∈ IRd denotes the ith sample
feature embedding, belonging to the class yi . The cyi ∈ IRd denotes the yi th class center
of the embedded features, and d is the feature dimension. To train a deep neural network, a
joint supervision of the proposed center loss and cross-entropy loss is adopted:
L = Ls + λLc , (2)
where Ls is the softmax loss (cross-entropy). The center loss, as defined in Eq. 1, is defi-
cient in that it only penalizes the compactness of intra-class variations without considering
the inter-class separation. Therefore, to address this issue, a contrastive-center loss has
been proposed in35 as

$$L_{ct\text{-}c} = \frac{1}{2}\sum_{i=1}^{m} \frac{\| x_i - c_{y_i} \|_2^2}{\left(\sum_{j=1,\, j\neq y_i}^{k} \| x_i - c_j \|_2^2\right) + \delta}, \qquad (3)$$

where δ is a constant preventing a zero denominator, and k is the number of classes. This
loss function not only penalizes the intra-class variations but also maximizes the distance
between each sample and all the centers belonging to the other classes.
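As a concrete illustration (ours, not from the chapter), the center loss of Eq. (1) and the contrastive-center loss of Eq. (3) can be computed for a mini-batch of embeddings as follows:

    import numpy as np

    def center_loss(x, labels, centers):
        """x: (m, d) embeddings; labels: (m,) class indices; centers: (k, d)."""
        diff = x - centers[labels]
        return 0.5 * np.sum(diff ** 2)                       # Eq. (1)

    def contrastive_center_loss(x, labels, centers, delta=1e-6):
        k = centers.shape[0]
        loss = 0.0
        for xi, yi in zip(x, labels):
            num = np.sum((xi - centers[yi]) ** 2)            # intra-class term
            den = sum(np.sum((xi - centers[j]) ** 2) for j in range(k) if j != yi) + delta
            loss += num / den                                # Eq. (3)
        return 0.5 * loss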

2.2. Proposed loss function


Inspired by the center loss, we propose a new loss function for facial attributes guided
sketch-photo recognition. Since in most of the available sketch datasets there is only a
single pair of sketch-photo images per identity, there is no benefit in assigning a separate

center to each identity as in32 and.35 However, here we assign centers to different combi-
nations of facial attributes. In other words, the number of centers is equal to the number of
possible facial attribute combinations. To define our attribute-centered loss, it is important
to briefly describe the overall structure of the recognition network.

2.2.1. Network structure

Due to the cross-modal nature of the sketch-photo recognition problem, we employed a


coupled DNN model to learn a deep shared latent subspace between the two modalities,
i.e., sketch and photo. Figure 1 shows the structure of the coupled deep neural network
which is deployed to learn the common latent subspace between the two modalities. The
first network, namely photo-DCNN, takes a color photo and embeds it into the shared
latent subspace, pi , while the second network, or sketch-attribute-DCNN, gets a sketch and
its assigned class center and finds their representation, si , in the shared latent subspace.
The two networks are supposed to be trained to find a shared latent subspace such that the
representation of each sketch with its associated facial attributes to be as close as possible
to its corresponding photo while still keeping the distance to other photos. To this end,
we proposed and employed the Attribute-Centered Loss for our attribute-guided shared
representation learning.

Fig. 1.: Coupled deep neural network structure. Photo-DCNN (upper network) and sketch-attribute-
DCNN (lower network) map the photos and sketch-attribute pairs into a common latent subspace.

2.2.2. Attribute-centered loss

In the problem of facial-attribute guided sketch-photo recognition, one can consider dif-
ferent combinations of facial attributes as distinct classes. With this intuition in mind, the
first task of the network is to learn a set of discriminative features for inter-class (between
different combinations of facial attributes) separability. However, the second goal of our
network differs from the other two previous works32,35 which were looking for a compact
representation of intra-class variations. On the contrary, here, intra-class variations rep-

resent faces with different geometrical properties, or more specifically, different identities.
Consequently, the coupled DCNN should be trained to keep the separability of the identities
as well. To this end, we define the attribute-centered loss function as
Lac = Lattr + Lid + Lcen , (4)
where Lattr is a loss to minimize the intra-class distances of photos or sketch-attribute
pairs which share similar combination of facial attributes, Lid denotes the identity loss for
intra-class separability, and Lcen forces the centers to keep distance from each other in
the embedding subspace for better inter-class discrimination. The attribute loss Lattr is
formulated as

$$L_{attr} = \frac{1}{2}\sum_{i=1}^{m} \max(\| p_i - c_{y_i} \|_2^2 - \epsilon_c,\, 0) + \max(\| s_i^{g} - c_{y_i} \|_2^2 - \epsilon_c,\, 0) + \max(\| s_i^{im} - c_{y_i} \|_2^2 - \epsilon_c,\, 0), \qquad (5)$$

where ε_c is a margin promoting convergence, and p_i is the feature embedding of the input photo produced by the photo-DCNN, with the attribute combination represented by y_i. Also, s_i^g and s_i^im (see Figure 1) are the feature embeddings of two sketches with the same combination of attributes as p_i but with the same (genuine pair) or different (impostor pair) identities, respectively. In contrast to the center loss (1), the attribute loss does not try to push the samples all the way to the center, but keeps them around the center within a margin of radius ε_c (see Figure 2). This gives the network the flexibility to learn a discriminative feature space inside the margin for intra-class separability. This intra-class discriminative representation is learned by the network through the identity loss L_id, which is defined as
$$L_{id} = \frac{1}{2}\sum_{i=1}^{m} \| p_i - s_i^{g} \|_2^2 + \max(\epsilon_d - \| p_i - s_i^{im} \|_2^2,\, 0), \qquad (6)$$

which is a contrastive loss33 with a margin of d to push the photos and sketches of a
same identity toward each other, within their center’s margin c , and takes the photos and
sketches of different identities apart. Obviously, the contrastive margin, d , should be less
than twice the attribute margin c , i.e. d < 2 × c (see Figure 2). However, from a
theoretical point of view, the minimization of identity loss, Lid , and attribute loss, Lattr ,
has a trivial solution if all the centers converge to a single point in the embedding space.
This solution can be prevented by pushing the centers to keep a minimum distance. For
this reason, we define another loss term formulated as
$$L_{cen} = \frac{1}{2}\sum_{j=1}^{n_c} \sum_{k=1,\, k\neq j}^{n_c} \max(\epsilon_{jk} - \| c_j - c_k \|_2^2,\, 0), \qquad (7)$$

where nc is the total number of centers, cj and ck denote the jth and kth centers, and jk is
the associated distance margin between cj and ck . In other words, this loss term enforces
a minimum distance jk , between each pair of centers, which is related to the number
of contradictory attributes between two centers cj and ck . Now, two centers which only
differ in few attributes are closer to each other than those with more number of dissimilar
attributes. The intuition behind the similarity-related margin is that the eyewitnesses may
March 16, 2020 11:51 ws-rv961x669 HBPRCV-6th Edn.–11573 chap18 page 349

Incorporating Facial Attributes in Cross-modal Face Verification and Synthesis 349

mis-judge one or two attributes, but it is less likely to mix up more than that. Therefore,
during the test, it is very probable that the top rank suspects have a few contradictory
attributes when compared with the attributes provided by the victims. Figure 2 visualizes
the overall concept of the attribute-centered loss.

Fig. 2.: Visualization of the shared latent space learn by the utilization of the attribute-centered loss.
Centers with less contradictory attributes are closer to each other in this space.
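For illustration, a minimal NumPy sketch (forward pass only, ours) of the attribute-centered loss of Eqs. (4)-(7) is given below; the values eps_c and eps_d and the matrix of pairwise center margins stand in for ε_c, ε_d, and ε_jk and are our own assumptions.

    import numpy as np

    def attribute_centered_loss(p, s_g, s_im, y, centers, margins, eps_c=1.0, eps_d=1.5):
        """p, s_g, s_im: (m, d) photo/genuine-sketch/impostor-sketch embeddings; y: (m,) center ids."""
        c = centers[y]                                        # assigned centers, (m, d)
        sq = lambda a: np.sum(a ** 2, axis=1)                 # squared L2 norm per row
        l_attr = 0.5 * np.sum(np.maximum(sq(p - c) - eps_c, 0)
                              + np.maximum(sq(s_g - c) - eps_c, 0)
                              + np.maximum(sq(s_im - c) - eps_c, 0))          # Eq. (5)
        l_id = 0.5 * np.sum(sq(p - s_g) + np.maximum(eps_d - sq(p - s_im), 0))  # Eq. (6)
        nc = centers.shape[0]
        l_cen = 0.0
        for j in range(nc):                                   # Eq. (7)
            for k in range(nc):
                if k != j:
                    l_cen += max(margins[j, k] - np.sum((centers[j] - centers[k]) ** 2), 0)
        l_cen *= 0.5
        return l_attr + l_id + l_cen                          # Eq. (4)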

2.2.3. A special case and connection to the data fusion


For better clarification, in this section, we discuss an special case in which the network
maps the attributes and geometrical information into two different subspaces. Figure 2
represents the visualization of this special case. The learned common embedding space
(Z) comprises of two orthogonal subspaces. Therefore, the basis for Z can be written as
Span{Z} = Span{X} + Span{Y }, (8)
where X ⊥ Y and dim(Z) = dim(X) + dim(Y ). In this scenario, the network learns to
put the centers in the embedding subspace X, and utilizes embedding subspace Y to model
the intra-class variations.
In other words, the learned embedding space is divided into two subspaces. The first
embedding subspace represents the attribute center, which provides the information regarding the subjects' facial attributes. The second subspace denotes the geometrical properties
of subjects or their identity information. Although this is a very unlikely scenario as some
of the facial attributes are highly correlated with the geometrical property of the face, this
scenario can be considered to describe the intuition behind our proposed framework.
It is important to note, the proposed attribute-centered loss guides the network to fuse
the geometrical and attribute information automatically during its shared latent represen-
tation learning. In the proposed framework, the sketch-attribute-DCNN learns to fuse an

input sketch and its corresponding attributes. This fusion process is an inevitable task for
the network to learn the mapping from each sketch-attribute pair to its center vicinity. As
shown in Figure 1, in this scheme the sketch and n binary attributes, ai=1,...,n , are passed
to the network as a (n + 1)-channel input. Each attribute-dedicated channel is constructed
by repeating the value that is assigned to that attribute. This fusion algorithm uses the in-
formation provided by the attributes to compensate the information that cannot be extracted
from the sketch (such as hair color) or it is lost while drawing the sketch.

2.3. Implementation details

2.3.1. Network structure

We deployed a deep coupled CNN to learn the attribute-guided shared representation be-
tween the forensic sketch and the photo modalities by employing the proposed attribute-
centered loss. The overall structure of the coupled network is illustrated in Figure 1.
The structures of both photo and sketch DCNNs are the same and are adopted from the
VGG16.36 However, for the sake of parameter reduction, we replaced the last three con-
volutional layers of VGG16, with two convolutional layers of depth 256 and one convo-
lutional layer of depth 64. We also replaced the last max pooling with a global average
pooling, which results in a feature vector of size 64. We also added batch-normalization
to all the layers of VGG16. The photo-DCNN takes an RGB photo as its input and the
sketch-attribute-DCNN gets a multi-channel input. The first input channel is a gray-scale
sketch and there is a specific channel for each binary attribute filled with 0 or 1 based on
the presence or absence of that attribute in the person of interest.

2.3.2. Data description

We make use of hand-drawn sketch and digital image pairs from CUHK Face Sketch
Dataset (CUFS)37 (containing 311 pairs), IIIT-D Sketch dataset38 (containing 238 viewed
pairs, 140 semi-forensic pairs, and 190 forensic pairs), unviewed Memory Gap Database
(MGDB)3 (containing 100 pairs), as well as composite sketch and digital image pairs from
PRIP Viewed Software-Generated Composite database (PRIP-VSGC)39 and extended-
PRIP Database (e-PRIP)14 for our experiments. We also utilized the CelebFaces Attributes
Dataset (CelebA),40 which is a large-scale face attributes dataset with more than 200K
celebrity images with 40 attribute annotations, to pre-train the network. To this end, we
generated a synthetic sketch by applying xDOG41 filter to every image in the celebA
dataset. We selected 12 facial attributes, namely black hair, brown hair, blond hair, gray
hair, bald, male, Asian, Indian, White, Black, eyeglasses, sunglasses, out of the available
40 attribute annotations in this dataset. We categorized the selected attributes into four at-
tribute categories of hair (5 states), race (4 states), glasses (2 states), and gender (2 states).
For each category, except the gender category, we also considered an extra state for any case
in which the provided attribute does not exist for that category. Employing this attribute
setup, we ended up with 180 centers (different combinations of the attributes). Since none

of the aforementioned sketch datasets includes facial attributes, we manually labeled all of
the datasets.

2.3.3. Network training

We pre-trained our deep coupled neural network using synthetic sketch-photo pairs from
the CelebA dataset. We followed the same approach as32 to update the centers based on
mini-batches. The network pre-training process terminated when the attribute-centered
loss stopped decreasing. The final weights are employed to initialize the network in all the
training scenarios.
Since deep neural networks with a huge number of trainable parameters are prone to
overfitting on a relatively small training dataset, we employed multiple augmentation tech-
niques (see Figure 3):

Fig. 3.: A sample of different augmentation techniques.

• Deformation: Since sketches are not geometrically matched with their photos, we
employed Thin Plate Spline Transformation (TPS)42 to help the network learning
more robust features and prevent overfitting on small training sets, simultaneously.
To this end, we deformed images, i.e. sketches and photos, by randomly translat-
ing 25 preselected points. Each point is translated with random magnitude and
direction. The same approach has been successfully applied for fingerprint distor-
tion rectification.43
• Scale and crop: Sketches and photos are upscaled to a random size, while do
not keep the original width-height ratio. Then, a 250×200 crop is sampled from
the center of each image. This results in a ratio deformation which is a common
mismatch between sketches and their ground truth photos.
• Flipping: Images are randomly flipped horizontally.

2.4. Evaluation

The proposed algorithm works with a probe image, preferred attributes and a gallery of
mugshots to perform identification. In this section, we compare our algorithm with multiple
attribute-guided techniques as well as those that do not utilize any extra information.

Table 1.: Experiment setup. The last three columns show the number of identities in the training set, the test gallery, and the test probe set.

Setup   Test                    Train                               Train #   Gallery #   Probe #
P1      e-PRIP                  e-PRIP                              48        75          75
P2      e-PRIP                  e-PRIP                              48        1500        75
P3      IIIT-D Semi-forensic    CUFS, IIIT-D Viewed,                1968      1500        135
        MGDB Unviewed           CUFSF, e-PRIP                                             100

2.4.1. Experiment setup

We conducted three different experiments to evaluate the effectiveness of the proposed


framework. For the sake of comparison, the first two experiment setups are adopted from.14
In the first setup, called P1, the e-PRIP dataset, with the total of 123 identities, is par-
titioned into training, 48 identities, and testing, 75 identities, sets. The original e-PRIP
dataset, which is used in,14 contains 4 different composite sketch sets of the same 123 iden-
tities. However, at the time of writing of this article, there are only two of them available
to the public. The accessible part of the dataset includes the composite sketches created
by an Asian artist using the Identi-Kit tool, and an Indian user adopting the FACES tool.
The second experiment, or P2 setup, is performed employing an extended gallery of 1500
subjects. The gallery size enlarged utilizing WVU Muti-Modal,44 IIIT-D Sketch, Multiple
Encounter Dataset (MEDS),45 and CUFS datasets. This experiment is conducted to eval-
uate the performance of the proposed framework in confronting real-word large gallery.
Finally, we assessed the robustness of the network to a new unseen dataset. This setup, P3,
reveals to what extent the network is biased to the sketch styles in the training datasets. To
this end, we trained the network on CUFS, IIIT-D Viewed, and e-PRIP datasets and then
tested it on IIIT-D Semi-forensic pairs, and MGDB Unviewed.
The performance is validated using ten fold random cross validation. The results of the
proposed method are compared with the state-of-the-art techniques.

2.4.2. Experimental results

The sets of sketches generated by the Indian (Faces) and Asian (IdentiKit) users14 have rank-10 accuracies of 58.4% and 53.1%, respectively. They utilized an algorithm called attribute feedback to incorporate facial attributes in their identification process. However, SGR-DA46 reported a better performance of 70% on the IdentiKit dataset without utilizing any facial attributes. In comparison, our proposed attribute-centered loss results in accuracies of 73.2% and 72.6% on Faces and IdentiKit, respectively. For the sake of evaluation, we also trained the same coupled deep neural network with the sole supervision of the contrastive loss. This attribute-unaware network has 65.3% and 64.2% accuracies on Faces and IdentiKit, respectively, which demonstrates the effectiveness of the attributes' contribution as part of our proposed algorithm.
Figure 4 visualizes the effect of the attribute-centered loss on the top five ranks of the P1 experiment's test results. The first row shows the results of our attribute-unaware network, while the

Table 2.: Rank-10 identification accuracy (%) on the e-PRIP composite sketch database.
Algorithm Faces (In) IdentiKit (As)
Mittal et al.47 53.3 ± 1.4 45.3 ± 1.5
Mittal et al.48 60.2 ± 2.9 52.0 ± 2.4
Mittal et al.14 58.4 ± 1.1 53.1 ± 1.0
SGR-DA46 - 70
Ours without attributes 68.6 ± 1.6 67.4 ± 1.9
Ours with attributes 73.2 ± 1.1 72.6 ± 0.9

second row shows the top ranks for the same sketch probe using our proposed network
trained by the attribute-centered loss. Considering the attributes removes many of the false
matches from the ranked list and the correct subject moves to a higher rank.
To evaluate the robustness of our algorithm in the presence of a relatively large gallery of mugshots, the same experiments are repeated on an extended gallery of 1500 subjects. Figure 5a shows the performance of our algorithm as well as the state-of-the-art algorithm on the Indian user (Faces) dataset. The proposed algorithm outperforms14 by almost 11% at rank 50 when exploiting facial attributes. Since the results for IdentiKit were not reported in,14 we compared our algorithm with SGR-DA46 (see Figure 5b). Even though SGR-DA outperformed our attribute-unaware network in the P1 experiment, its results were not as robust as those of our proposed attribute-aware deep coupled neural network.
Finally, Figure 6 demonstrates the results of the proposed algorithm in the P3 experiment. The network is trained on 1968 sketch-photo pairs and then tested on two completely unseen datasets, i.e., IIIT-D Semi-forensic and MGDB Unviewed. The gallery of this experiment was also extended to 1500 mugshots.

Fig. 4.: The effect of considering facial attributes in sketch-photo matching. The first line shows the
results for a network trained with attribute-centered loss, and the second line depicts the result of a
network trained using contrastive loss.
Fig. 5.: CMC curves of the proposed and existing algorithms for the extended gallery experiment:
(a) results on the Indian data subset compared to Mittal et al.14 and (b) results on the Identi-Kit data
subset compared to SGR-DA.46

Fig. 6.: CMC curves of the proposed algorithm for experiment P3. The results confirm the robustness
of the network to different sketch styles.

3. Attribute-guided sketch-to-photo synthesis

3.1. Conditional generative adversarial networks (cGANs)

GANs22 are a group of generative models which learn to map a random noise vector z to an output
image y, G(z) : z −→ y. They can be extended to a conditional GAN (cGAN) if the gen-
erator model, G, (and usually the discriminator) is conditioned on some extra information,
x, such as an image or class labels. In other words, cGAN learns a mapping from an input
x and a random noise z to the output image y: G(x, z) : {x, z} −→ y. The generator
model is trained to generate an image which is not distinguishable from “real” samples by
a discriminator network, D. The discriminator is trained adversarially to discriminate be-
tween the “fake” generated images by the generator and the real samples from the training
dataset. Both the generator and the discriminator are trained simultaneously following a
two-player min-max game.
The objective function of the cGAN is defined as
l_GAN(G, D) = E_{x,y∼p_data}[log D(x, y)] + E_{x,z∼p_z}[log(1 − D(x, G(x, z)))],   (9)
where G attempts to minimize it and D tries to maximize it. Previous works in the literature
have found it beneficial to add an extra L2 or L1 distance term to the objective function
which forces the network to generate images which are near the ground truth. Isola et al.23
found L1 to be a better candidate as it encourages less blurring in the generated output. In
summary, the generator model is trained as follows:
G∗ = arg min_G max_D  l_GAN(G, D) + λ l_L1(G),   (10)

where λ is a weighting factor and l_L1(G) is

l_L1(G) = || y − G(x, z) ||_1 .   (11)

3.1.1. Training procedure


In each training step, an input x is passed to the generator to produce the corresponding
output, G(x, z). The generated output and the input are concatenated and fed to the dis-
criminator. First, the discriminator’s weights are updated to distinguish between the
generated output and a real sample from the target domain. Then, the generator is trained
to fool the discriminator by generating more realistic images.
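To make this two-player scheme concrete, a minimal TensorFlow sketch of one cGAN update is given below; the model handles (generator, discriminator), the optimizers, and the weighting value LAMBDA are illustrative assumptions, not the exact implementation used in this chapter.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100.0  # weight of the L1 term of Eq. (10); the value is an assumption

def cgan_train_step(x, y, generator, discriminator, g_opt, d_opt):
    # One update: D is pushed to separate real pairs (x, y) from fake pairs
    # (x, G(x, z)); G is pushed to fool D while staying close to y in L1.
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        y_fake = generator(x, training=True)                  # G(x, z); noise z realized via dropout
        d_real = discriminator([x, y], training=True)         # D(x, y)
        d_fake = discriminator([x, y_fake], training=True)    # D(x, G(x, z))
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        g_loss = bce(tf.ones_like(d_fake), d_fake) + LAMBDA * tf.reduce_mean(tf.abs(y - y_fake))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return g_loss, d_loss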

3.2. CycleGAN
The main goal of CycleGAN27 is to train two generative models, Gx and Gy . These two
models learn the mapping functions between two domains x and y. The model, as illus-
trated in Figure 7, includes two generators; the first one maps x to y: Gy (x) : x −→ y and
the other does the inverse mapping y to x: Gx (y) : y −→ x. There are two adversarial dis-
criminators Dx and Dy , one for each generator. More precisely, Dx distinguishes between
“real” x samples and the generated “fake” samples Gx(y), and similarly, Dy discriminates
between “real” y and the “fake” Gy (x). Therefore, there is a distinct adversarial loss in
CycleGAN for each of the two (Gx , Dx ) and (Gy , Dy ) pairs. Notice that the adversarial
losses are defined as in Eq. 9.
When a high-capacity network is trained using only the adversarial loss, there is a
possibility of mapping the same set of inputs to a random permutation of images in the
target domain. In other words, the adversarial loss is not enough to guarantee that the
trained network generates the desired output. This is the reason behind having an extra
L1 distance term in the objective function of cGAN as shown in Eq. 10.

Fig. 7.: CycleGAN

As shown in Figure 7, in the case of CycleGAN, there are no paired images between the source and
target domains, which is the main feature of CycleGAN over cGAN. Consequently, the L1
distance loss cannot be applied to this problem. To tackle this issue, a cycle consistency
loss was proposed in,27 which forces the learned mapping functions to be cycle-consistent.
In particular, the following conditions should be satisfied:
x −→ Gy (x) −→ Gx (Gy (x)) ≈ x, y −→ Gx (y) −→ Gy (Gx (y)) ≈ y. (12)
To this end, a cycle consistency loss is defined as
l_cyc(Gx, Gy) = E_{x∼p_data}[ || x − Gx(Gy(x)) ||_1 ] + E_{y∼p_data}[ || y − Gy(Gx(y)) ||_1 ] .   (13)
Taken together, the full objective function is
l(Gx, Gy, Dx, Dy) = l_GAN(Gx, Dx) + l_GAN(Gy, Dy) + λ l_cyc(Gx, Gy),   (14)
where λ is a weighting factor to control the importance of the objectives and the whole
model is trained as follows:
G∗_x, G∗_y = arg min_{Gx,Gy} max_{Dx,Dy} l(Gx, Gy, Dx, Dy).   (15)
From now on, we use x for our source domain, which is the sketch domain, and y for the
target domain, i.e., the photo domain.
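As a small illustration, the cycle-consistency constraint of Eq. (13) can be written in a few lines; G_x (photo-to-sketch) and G_y (sketch-to-photo) are assumed to be tf.keras models, and the function below is a hedged sketch rather than the exact training code.

import tensorflow as tf

def cycle_consistency_loss(G_x, G_y, x, y):
    # x -> G_y(x) -> G_x(G_y(x)) should reconstruct x, and symmetrically for y
    x_rec = G_x(G_y(x, training=True), training=True)
    y_rec = G_y(G_x(y, training=True), training=True)
    return tf.reduce_mean(tf.abs(x - x_rec)) + tf.reduce_mean(tf.abs(y - y_rec))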

3.2.1. Architecture
The two generators, Gx and Gy , adopt the same architecture27 consisting of six convolu-
tional layers and nine residual blocks49 (see27 for details). The output of the discriminator
is of size 30×30. Each output pixel corresponds to a patch of the input image and classifies
whether that patch is real or fake. More details are reported in.27

3.3. Conditional CycleGAN (cCycleGAN)


The CycleGAN architecture solves the problem of having unpaired training data but still
has a major drawback: extra conditions, such as soft biometric traits, cannot be forced
on the target domain. To tackle this problem, we propose a CycleGAN architecture with
a soft-biometrics conditional setting, which we refer to as Conditional CycleGAN (cCycle-
GAN). Since, in the sketch-photo synthesis problem, attributes (e.g., skin color) are missing
on the sketch side and not on the photo side, the photo-sketch generator, Gx (y), is left
unchanged in the new setting. However, the sketch-photo generator, Gy (x), needs to be
modified by conditioning it on the facial attributes. The new sketch-photo generator maps
(x, a) to y, i.e., Gy (x, a) : (x, a) −→ y, where a stands for the desired facial attributes
to be present in the synthesized photo. The corresponding discriminator, Dy (x, a), is also
conditioned on both the sketch, x, and the desired facial attributes, a. The definition of the
loss function remains the same as in CycleGAN given by Eq. 14.
In contrast to the previous work in face editing,31 our preliminary results showed
that having only a single discriminator conditioned on the desired facial attributes was not
enough to force the attributes on the generator’s output of the CycleGAN. Consequently,
instead of increasing the complexity of the discriminator, we trained an additional auxiliary
discriminator, Da (y, a), to detect if the desired attributes are present in the synthesized
photo or not. In other words, the sketch-photo generator, Gy (x, a), tries to fool an extra
attribute discriminator, Da (y, a), which checks the presence of the desired facial attributes.
The objective function of the attribute discriminator is defined as follows:
l_Att(Gy, Da) = E_{a,y∼p_data}[log Da(a, y)] + E_{y∼p_data, ā≠a}[log(1 − Da(ā, y))]
                + E_{a,y∼p_data}[log(1 − Da(a, Gy(x, a)))],   (16)
where a denotes the corresponding attributes of the real image, y, and ā ≠ a is a set of random
arbitrary attributes. Therefore, the total loss of the cCycleGAN is
l(Gx, Gy, Dx, Dy) = l_GAN(Gx, Dx) + l_GAN(Gy, Dy) + λ1 l_cyc(Gx, Gy) + λ2 l_Att(Gy, Da),   (17)
where λ1 and λ2 are weighting factors to control the importance of the objectives.
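For clarity, the attribute-discriminator term of Eq. (16) can be sketched in its cross-entropy form as follows; the interfaces D_a([photo, attributes]) and G_y([sketch, attributes]) are assumptions made for illustration, not the exact implementation.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def attribute_discriminator_loss(D_a, G_y, x, y, a, a_bar):
    # D_a should accept a real photo with its true attributes and reject both a
    # real photo with mismatched attributes and a synthesized photo.
    real_match    = D_a([y, a], training=True)                           # real photo, correct attributes a
    real_mismatch = D_a([y, a_bar], training=True)                       # real photo, wrong attributes ā
    fake          = D_a([G_y([x, a], training=True), a], training=True)  # synthesized photo Gy(x, a)
    return (bce(tf.ones_like(real_match), real_match)
            + bce(tf.zeros_like(real_mismatch), real_mismatch)
            + bce(tf.zeros_like(fake), fake))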

3.4. Architecture
Our proposed cCycleGAN adopts the same architecture as CycleGAN. However, to
condition the generator and the discriminator on the facial attributes, we slightly modified
the architecture. The generator that transforms photos into sketches, Gx(y), and its
corresponding discriminator, Dx, are left unchanged, as there is no attribute to force in the
sketch generation phase. However, in the sketch-photo generator, Gy(x), we insert the
desired attributes before the fifth residual block of the bottleneck (Figure 8). To this end,
each attribute is repeated 4096 (64×64) times and then resized to a matrix of size 64×64.
Then all of these attribute feature maps and the output feature maps of the fourth residual
block are concatenated in depth and passed to the next block, as shown in Figure 9. The
same modification is applied to the corresponding attribute discriminator, Da. All the
attributes are repeated, resized, and concatenated in depth with the generated photo and are
passed to the discriminator.

Fig. 8.: cCycleGAN architecture, including Sketch-Photo cycle (top) and Photo-Sketch cycle (bot-
tom).

Fig. 9.: Sketch-Photo generator network, Gy (x, a), in cCycleGAN.
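A minimal sketch of this attribute-injection step (Figure 9) is given below; the 64×64 spatial size follows the text, while the tensor layout, data types and function name are assumptions.

import tensorflow as tf

def inject_attributes(features, attributes):
    # features:   (batch, 64, 64, C) float32, output of the 4th residual block
    # attributes: (batch, n_attr)    float32, e.g. skin/hair color flags in {-1, 1}
    maps = attributes[:, tf.newaxis, tf.newaxis, :]   # (batch, 1, 1, n_attr)
    maps = tf.tile(maps, [1, 64, 64, 1])              # repeat each attribute 64*64 = 4096 times
    return tf.concat([features, maps], axis=-1)       # input to the 5th residual block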

3.5. Training procedure

We follow the same training procedure as in Section 3.1.1 for the photo-sketch generator.
However, for the sketch-photo generator, we need a different training mechanism to force
the desired facial attributes to be present in the generated photo. Therefore, we define a
new type of negative sample for the attribute discriminator, Da: a real
photo from the target domain but with a wrong set of attributes, ā. The training mech-
anism forces the sketch-photo generator to produce faces with the desired attributes. At
each training step, this generator synthesizes a photo with the same attributes, a, as the real
photo. Both the corresponding sketch-photo discriminator, Dy , and attribute discriminator,
Da , are supposed to detect the synthesized photo as a fake sample. The attribute discrimi-
nator, Da , is also trained with two other pairs: a real photo with correct attributes as a real
sample, and a real photo with a wrong set of attributes as a fake sample. Simultaneously, the
sketch-photo generator attempts to fool both of the discriminators.

3.6. Experimental results


3.6.1. Datasets
FERET Sketch: The FERET database50 includes 1,194 sketch-photo pairs. Sketches
are hand-drawn by an artist while looking at the face photos. Both the face photos and
sketches are grayscale images of size 250 × 200 pixels. However, since we aim to produce color
photos, we did not use the grayscale face photos of this dataset to train the cCycleGAN. We
randomly selected 1000 sketches to train the network and the remaining 194 are used for
testing.
WVU Multi-modal: To synthesize color images from the FERET sketches, we use
the frontal-view face images from WVU Multi-modal.44 The dataset contains 3453 high-
resolution color frontal images of 1200 subjects. The images are aligned, cropped and
resized to the same size as FERET Sketch, i.e., 250 × 200 pixels. The dataset does not
contain any facial attributes. However, for each image, the average color of a 25 × 25-pixel
rectangular patch (placed on the forehead or cheek) is taken as the skin color. These average colors
are then clustered into three classes, namely white, brown and black, based on their intensities
(a minimal sketch of this labeling step is given after the dataset descriptions).
CelebFaces Attributes (CelebA): We use the aligned and cropped version of the
CelebA dataset51 and scale the images down to 128 × 128 pixels. We also randomly split
it into two partitions, 182K for training and 20K for testing. Of the original 40 attributes,
we selected only those attributes that have a clear visual impact on the synthesized faces
and are missing in the sketch modality, which leaves a total of six attributes, namely black
hair, brown hair, blond hair, gray hair, pale skin, and gender. Due to the huge differences
in face views and backgrounds in the FERET and CelebA databases, the preliminary results
did not show an acceptable performance on FERET-CelebA pair training. Consequently,
we generated a synthetic sketch dataset by applying the xDoG41 filter to the CelebA dataset.
However, to train the cCycleGAN, the synthetic sketch and photo images are used in an
unpaired fashion.
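The skin-color labeling used for the WVU photos can be sketched as follows; the patch location and the use of scikit-learn's KMeans are assumptions, not the authors' exact procedure.

import numpy as np
from sklearn.cluster import KMeans

def skin_color_labels(images, top=40, left=100):
    # images: (N, 250, 200, 3) uint8 frontal photos; a 25x25 patch on the forehead is averaged
    patches = images[:, top:top + 25, left:left + 25, :].astype(np.float32)
    intensity = patches.reshape(len(images), -1).mean(axis=1, keepdims=True)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(intensity)
    order = np.argsort(km.cluster_centers_.ravel())        # cluster ids ordered from dark to light
    remap = {old: new for new, old in enumerate(order)}
    return np.array([remap[c] for c in km.labels_])        # 0=black, 1=brown, 2=white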

3.6.2. Results on FERET and WVU multi-modal


Sketches from the FERET dataset are paired with frontal face images from the
WVU Multi-modal dataset to train the proposed cCycleGAN. Since there are no facial attributes
associated with the color images of the WVU Multi-modal dataset, we have classified them
based on their skin colors. Consequently, the skin color is the only attribute that we
can control during the sketch-photo synthesis. Therefore, the input to the sketch-photo
generator has two channels: a gray-scale sketch image, x, and a single attribute
channel, a, for the skin color. The sketch images are normalized to the [−1, 1] range.
Similarly, the skin color attribute takes the values -1, 0, and 1 for the black, brown and white skin
colors, respectively. Figure 10 shows the results of the cCycleGAN after 200 epochs on
the test data. The three skin color classes are not represented equally in the dataset, which
visibly biases the results towards the lighter skins.

Fig. 10.: Sketch-based photo synthesis of hand-drawn test sketches (FERET dataset). Our network
adapts the synthesis results to satisfy different skin colors (white, brown, black).
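A minimal sketch of the input encoding described above is shown below: the sketch is scaled to [-1, 1] and stacked with a constant skin-color plane; the function and variable names are illustrative assumptions.

import numpy as np

def make_generator_input(sketch_uint8, skin_color):
    # skin_color is -1, 0 or 1 for black, brown or white skin, respectively
    x = sketch_uint8.astype(np.float32) / 127.5 - 1.0      # (H, W) sketch scaled to [-1, 1]
    a = np.full_like(x, float(skin_color))                 # constant attribute plane
    return np.stack([x, a], axis=-1)                       # (H, W, 2) generator input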

3.6.3. Results on CelebA and synthesized sketches

Preliminary results reveal that the CycleGAN training can get unstable when there is a
significant difference, such as differences in scale and face poses, in the source and target
datasets. The easy task of the discriminator in differentiating between the synthesized and
real photos in these cases could account for this instability. Consequently, we generated a
synthetic sketch dataset as a replacement for the FERET dataset. Among the 40 attributes
provided in the CelebA dataset, we have selected the six most relevant ones in terms of
the visual impacts on the sketch-photo synthesis, including black hair, blond hair, brown
hair, gray hair, male, and pale skin. Therefore, the input to the sketch-photo generator has
seven channels including a gray-scale sketch image, x, and six attribute channels, a. The
attributes in CelebA dataset are binary, we have chosen -1 for a missing attribute and 1 for
an attribute which is supposed to be present in the synthesized photo. Figure 11 shows the
results of the cCycleGAN after 50 epochs on the test data. The trained network can follow
the desired attributes and force them on the synthesized photo.

3.6.4. Evaluation of synthesized photos with a face verifier

For the sake of evaluation, we utilized a VGG16-based face verifier pre-trained on the
CMU Multi-PIE dataset. To evaluate the proposed algorithm, we first selected the identi-
ties which had more than one photo in the testing set. Then, for each identity, one photo is
randomly added to the test gallery, and a synthetic sketch corresponding to another photo
of the same identity is added to the test probe set. Finally, every probe synthetic sketch is given to
our attribute-guided sketch-photo synthesizer and the resulting synthesized photos are used
for face verification against the entire test gallery. This evaluation process was repeated
10 times. Table 3 reports the face verification accuracies of the proposed attribute-guided
approach and the results of the original cycle-GAN on the CelebA dataset. The results of our
proposed network improve significantly on the original cycle-GAN, which uses no attribute
information.

Fig. 11.: Attribute-guided sketch-based photo synthesis of synthetic test sketches from the CelebA
dataset. Our network can adapt the synthesis results to satisfy the desired attributes.

Table 3.: Verification performance of the proposed cCycle-GAN vs. the cycle-GAN
Method           cycle-GAN        cCycle-GAN
Accuracy (%)     61.34 ± 1.05     65.53 ± 0.93

4. Discussion

In this chapter, two distinct frameworks are introduced to enable employing facial attributes
in cross-modal face verification and synthesis. The experimental results show the superi-
ority of the proposed attribute-guided frameworks compared to the state-of-the-art tech-
niques. To incorporate facial attributes in cross-modal face verification, we introduced an
attribute-centered loss to train a coupled deep neural network learning a shared embedding
space between the two modalities in which both geometrical and facial attribute infor-
mation cooperate in the similarity score calculation. To this end, a distinct center point is
constructed for every combination of the facial attributes; these centers are used in the sketch-
attribute-DCNN by leveraging the facial attributes of the suspect provided by the victims,
while the photo-DCNN learns to map its inputs close to their corresponding attribute cen-
ters. To incorporate facial attributes in the unpaired face sketch-photo synthesis problem,
an additional auxiliary attribute discriminator was proposed with an appropriate loss to
force the desired facial attributes on the output of the generator. The pair of a real face photo
from the training data and a wrong set of attributes defines a new fake input to the attribute
discriminator, in addition to the pair of the generator’s output and a set of random attributes.

References

1. X. Wang and X. Tang, Face photo-sketch synthesis and recognition, IEEE Transactions on Pat-
tern Analysis and Machine Intelligence. 31(11), 1955–1967, (2009).
2. Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma. A nonlinear approach for face sketch synthesis and
recognition. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer
Society Conference on, vol. 1, pp. 1005–1010. IEEE, (2005).
3. S. Ouyang, T. M. Hospedales, Y.-Z. Song, and X. Li. Forgetmenot: memory-aware forensic
facial sketch matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 5571–5579, (2016).
4. Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras, Face relighting from a
single image under arbitrary unknown lighting conditions, IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence. 31(11), 1968–1984, (2009).
5. H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa, Memetically optimized mcwld for matching
sketches with digital face images, IEEE Transactions on Information Forensics and Security. 7
(5), 1522–1535, (2012).
6. B. Klare and A. K. Jain, Sketch-to-photo matching: a feature-based approach, Proc. Society of
Photo-Optical Instrumentation Engineers Conf. Series. 7667, (2010).
7. W. Zhang, X. Wang, and X. Tang. Coupled information-theoretic encoding for face photo-sketch
recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on,
pp. 513–520. IEEE, (2011).
8. C. Galea and R. A. Farrugia, Forensic face photo-sketch recognition using a deep learning-based
architecture, IEEE Signal Processing Letters. 24(11), 1586–1590, (2017).
9. S. Nagpal, M. Singh, R. Singh, M. Vatsa, A. Noore, and A. Majumdar, Face sketch matching via
coupled deep transform learning, arXiv preprint arXiv:1710.02914. (2017).
10. Y. Zhong, J. Sullivan, and H. Li. Face attribute prediction using off-the-shelf CNN features. In
Biometrics (ICB), 2016 International Conference on, pp. 1–7. IEEE, (2016).
11. A. Dantcheva, P. Elia, and A. Ross, What else does your biometric data reveal? a survey on soft
biometrics, IEEE Transactions on Information Forensics and Security. 11(3), 441–467, (2016).
12. H. Kazemi, M. Iranmanesh, A. Dabouei, and N. M. Nasrabadi. Facial attributes guided deep
sketch-to-photo synthesis. In Applications of Computer Vision (WACV), 2018 IEEE Workshop
on. IEEE, (2018).
13. B. F. Klare, S. Klum, J. C. Klontz, E. Taborsky, T. Akgul, and A. K. Jain. Suspect identifica-
tion based on descriptive facial attributes. In Biometrics (IJCB), 2014 IEEE International Joint
Conference on, pp. 1–8. IEEE, (2014).
14. P. Mittal, A. Jain, G. Goswami, M. Vatsa, and R. Singh, Composite sketch recognition using
saliency and attribute feedback, Information Fusion. 33, 86–99, (2017).
15. W. Liu, X. Tang, and J. Liu. Bayesian tensor inference for sketch-based facial photo hallucina-
tion. pp. 2141–2146, (2007).
16. X. Gao, N. Wang, D. Tao, and X. Li, Face sketch–photo synthesis and retrieval using sparse
representation, IEEE Transactions on circuits and systems for video technology. 22(8), 1213–
1226, (2012).
17. J. Zhang, N. Wang, X. Gao, D. Tao, and X. Li. Face sketch-photo synthesis based on support
vector regression. In Image Processing (ICIP), 2011 18th IEEE International Conference on,
pp. 1125–1128. IEEE, (2011).
18. N. Wang, D. Tao, X. Gao, X. Li, and J. Li, Transductive face sketch-photo synthesis, IEEE
transactions on neural networks and learning systems. 24(9), 1364–1376, (2013).
19. B. Xiao, X. Gao, D. Tao, and X. Li, A new approach for face recognition by sketches in photos,
Signal Processing. 89(8), 1576–1588, (2009).
20. Y. Güçlütürk, U. Güçlü, R. van Lier, and M. A. van Gerven. Convolutional sketch inversion. In
European Conference on Computer Vision, pp. 810–824. Springer, (2016).
21. L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang. End-to-end photo-sketch generation via fully
convolutional representation learning. In Proceedings of the 5th ACM on International Confer-
ence on Multimedia Retrieval, pp. 627–634. ACM, (2015).
22. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems,
pp. 2672–2680, (2014).
23. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with conditional adver-
sarial networks, arXiv preprint arXiv:1611.07004. (2016).
24. P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays, Scribbler: Controlling deep image synthesis with
sketch and color, arXiv preprint arXiv:1612.00835. (2016).
25. J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the
natural image manifold. In European Conference on Computer Vision, pp. 597–613. Springer,
(2016).
26. D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward
synthesis of textures and stylized images. In ICML, pp. 1349–1357, (2016).
27. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired image-to-image translation using cycle-
consistent adversarial networks, arXiv preprint arXiv:1703.10593. (2017).
28. J. Zhu, S. Liao, D. Yi, Z. Lei, and S. Z. Li. Multi-label CNN based pedestrian attribute learning
for soft biometrics. In Biometrics (ICB), 2015 International Conference on, pp. 535–540. IEEE,
(2015).
29. Q. Guo, C. Zhu, Z. Xia, Z. Wang, and Y. Liu, Attribute-controlled face photo synthesis from
simple line drawing, arXiv preprint arXiv:1702.02805. (2017).
30. X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from
visual attributes. In European Conference on Computer Vision, pp. 776–791. Springer, (2016).
31. G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez, Invertible conditional GANs for
image editing, arXiv preprint arXiv:1611.06355. (2016).
32. Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face
recognition. In European Conference on Computer Vision, pp. 499–515. Springer, (2016).
33. R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant map-
ping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on,
vol. 2, pp. 1735–1742. IEEE, (2006).
34. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recogni-
tion and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 815–823, (2015).
35. C. Qi and F. Su, Contrastive-center loss for deep neural networks, arXiv preprint
arXiv:1707.07391. (2017).
36. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recog-
nition, arXiv preprint arXiv:1409.1556. (2014).
37. X. Tang and X. Wang. Face sketch synthesis and recognition. In Computer Vision, 2003. Pro-
ceedings. Ninth IEEE International Conference on, pp. 687–694. IEEE, (2003).
38. H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa. Memetic approach for matching sketches
with digital face images. Technical report, (2012).
39. H. Han, B. F. Klare, K. Bonnen, and A. K. Jain, Matching composite sketches to face photos:
A component-based approach, IEEE Transactions on Information Forensics and Security. 8(1),
191–204, (2013).
40. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings
of International Conference on Computer Vision (ICCV), (2015).
41. H. Winnemöller, J. E. Kyprianidis, and S. C. Olsen, XDoG: an extended difference-of-Gaussians
compendium including advanced image stylization, Computers & Graphics. 36(6), 740–753,
(2012).
42. F. L. Bookstein, Principal warps: Thin-plate splines and the decomposition of deformations,
IEEE Transactions on pattern analysis and machine intelligence. 11(6), 567–585, (1989).
43. A. Dabouei, H. Kazemi, M. Iranmanesh, and N. M. Nasrabadi. Fingerprint distortion rectifica-
tion using deep convolutional neural networks. In Biometrics (ICB), 2018 International Confer-
ence on. IEEE, (2018).
44. Biometrics and Identification Innovation Center, WVU multi-modal dataset. Available at
http://biic.wvu.edu/.
45. A. P. Founds, N. Orlans, W. Genevieve, and C. I. Watson, NIST Special Database 32 - Multiple
Encounter Dataset II (MEDS-II), NIST Interagency/Internal Report (NISTIR) 7807, (2011).
46. C. Peng, X. Gao, N. Wang, and J. Li, Sparse graphical representation based discriminant analysis
for heterogeneous face recognition, arXiv preprint arXiv:1607.00137. (2016).
47. P. Mittal, A. Jain, G. Goswami, R. Singh, and M. Vatsa. Recognizing composite sketches with
digital face images via ssd dictionary. In Biometrics (IJCB), 2014 IEEE International Joint Con-
ference on, pp. 1–6. IEEE, (2014).
48. P. Mittal, M. Vatsa, and R. Singh. Composite sketch recognition via deep network-a transfer
learning approach. In Biometrics (ICB), 2015 International Conference on, pp. 251–256. IEEE,
(2015).
49. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, (2016).
50. P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, The FERET evaluation methodology for face-
recognition algorithms, IEEE Transactions on pattern analysis and machine intelligence. 22(10),
1090–1104, (2000).
51. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings
of the IEEE International Conference on Computer Vision, pp. 3730–3738, (2015).

CHAPTER 2.10

CONNECTED AND AUTONOMOUS VEHICLES IN THE
DEEP LEARNING ERA: A CASE STUDY ON
COMPUTER-GUIDED STEERING

Rodolfo Valiente^a, Mahdi Zaman^a, Yaser P. Fallah^a, Sedat Ozer^b

^a Connected and Autonomous Vehicle Research Lab (CAVREL),
University of Central Florida, Orlando, FL, USA
^b Bilkent University, Ankara, Turkey
sedatist@gmail.com

Connected and Autonomous Vehicles (CAVs) are typically equipped with multiple
advanced on-board sensors generating a massive amount of data. Utilizing and
processing such data to improve the performance of CAVs is a current research
area. Machine learning techniques are effective ways of exploiting such data in
many applications with many demonstrated success stories. In this chapter, first,
we provide an overview of recent advances in applying machine learning in the
emerging area of CAVs including particular applications and highlight several
open issues in the area. Second, as a case study and a particular application, we
present a novel deep learning approach to control the steering angle for coopera-
tive self-driving cars capable of integrating both local and remote information. In
that application, we tackle the problem of utilizing multiple sets of images shared
between two autonomous vehicles to improve the accuracy of controlling the steer-
ing angle by considering the temporal dependencies between the image frames.
This problem has not been widely studied in the literature. We present and
study a new deep architecture to predict the steering angle automatically. Our
deep architecture is an end-to-end network that utilizes Convolutional-Neural-
Networks (CNN), Long-Short-Term-Memory (LSTM) and fully connected (FC)
layers; it processes both present and future images (shared by a vehicle ahead via
Vehicle-to-Vehicle (V2V) communication) as input to control the steering angle.
In our simulations, we demonstrate that using a combination of perception and
communication systems can improve robustness and safety of CAVs. Our model
demonstrates the lowest error when compared to the other existing approaches
in the literature.

1. Introduction

It is estimated that by the end of the next decade, most vehicles will be equipped with
powerful sensing capabilities and on-board units (OBUs) enabling multiple commu-
nication types including in-vehicle communications, vehicle-to-vehicle (V2V) com-
munications and vehicle-to-infrastructure (V2I) communications. As vehicles become
more aware of their environments and as they evolve towards full autonomy,
the concept of connected and autonomous vehicles (CAVs) becomes more crucial.
Recently, CAVs have gained substantial momentum to bring a new level of connec-
tivity to vehicles. Along with novel on-board computing and sensing technologies,
CAVs serve as a key enabler for Intelligent Transport Systems (ITS) and smart
cities.

Fig. 1. The overview of our proposed vehicle-assisted end-to-end system. Vehicle 2 (V2) sends
its information to Vehicle 1 (V1) over V2V communication. V1 combines that information along
with its own information to control the steering angle. The prediction is made through our
CNN+LSTM+FC network (see Fig. 2 for the details of our network).
CAVs are increasingly equipped with a wide variety of sensors, such as engine
control units, radar, light detection and ranging (LiDAR), and cameras to help a
vehicle perceive the surrounding environment and monitor its own operation status
in real-time. By utilizing high-performance computing and storage facilities, CAVs
can keep generating, collecting, sharing and storing large volumes of data. Such data
can be exploited to improve the robustness and safety of CAVs. Artificial intelligence (AI)
is an effective approach to exploit, analyze and use such data. However, how to
mine such data effectively remains a challenge in many ways and an open direction for
further research.
Among the existing issues, robust control of the steering angle is one of
the most difficult and important problems for autonomous vehicles [1–3]. Recent
computer vision-based approaches to control the steering angle in autonomous cars
mostly focus on improving the driving accuracy with the local data collected from
the sensors on the same vehicle and as such, they consider each car as an isolated
unit gathering and processing information locally. However, as the availability and
the utilization of V2V communication increases, real-time data sharing becomes
more feasible among vehicles [4–6]. As such, new algorithms and approaches are
needed that can utilize the potential of cooperative environments to improve the
accuracy for controlling the steering angle automatically [7].
One objective of this chapter is to bring more attention to this emerging field
since the research on applying AI in CAVs is still a growing area. We identify and
discuss major challenges and applications of AI in perception/sensing, in communi-
cations and in user experience for CAVs. In particular, we discuss in greater detail
and present a deep learning-based approach that differs from other approaches. It
utilizes two sets of images (data): one coming from the on-board sensors and one coming
from another vehicle ahead over V2V communication, to control the steering angle
in self-driving vehicles automatically (see Fig. 1). Our proposed deep architecture
contains a convolutional neural network (CNN) followed by a Long-Short-Term-
Memory (LSTM) and a fully connected (FC) network. Unlike the older approach
that manually decomposes the autonomous driving problem into different compo-
nents as in [8, 9], the end-to-end model can directly steer the vehicle from the cam-
era data and has been proven to operate effectively in previous works [1, 10]. We
compare our proposed deep architecture to multiple existing algorithms in the liter-
ature on the Udacity dataset. Our experimental results demonstrate that our proposed
CNN-LSTM-based model yields state-of-the-art results. Our main contributions
are: (1) we provide a survey of AI applications in the emerging area of CAVs and
highlight several open issues for further research; (2) we propose an end-to-end
vehicle-assisted steering angle control system for cooperative vehicles using a large
sequence of images; (3) we introduce a new deep architecture that yields the state-
of-the-art results on the Udacity dataset; (4) we demonstrate that integrating the
data obtained from other vehicles via V2V communication system improves the
accuracy of predicting the steering angle for CAVs.

2. Related Work: AI Applications in CAVs

As the automotive industry transforms, data remains at the heart of CAVs' evolu-
tion [11, 12]. To take advantage of the data, efficient methods are needed to interpret
and mine the massive amount of data and to improve the robustness of self-driving cars.
Most of the relevant attention is given to AI-based techniques, as many recent
deep learning-based techniques demonstrated promising performance in a wide va-
riety of applications in vision, speech recognition and natural language areas by
significantly improving the state-of-the-art performance [13, 14].
Similar to many other areas, deep learning-based techniques are also providing
promising and improved results in the area of CAVs [14, 15]. For instance, the
problem of navigating a self-driving car with the acquired sensory data has been
studied in the literature with and without using end-to-end approaches [16]. The
earlier works such as the ones from [17] and [18] use multiple components for rec-
ognizing objects of safe-driving concerns including lanes, vehicles, traffic signs, and
March 16, 2020 9:34 ws-rv961x669 HBPRCV-6th Edn.–11573 FinalVersion page 368

368 R. Valiente, M. Zaman, Y.P. Fallah and S. Ozer

pedestrians. The recognition results are then combined to give a reliable world rep-
resentation, which are used with an AI system to make decisions and control the
car. More recent approaches focus on using deep learning-based techniques. For ex-
ample, Ng et al. [19] utilized a CNN for vehicle and lane detection. Pomerleau [20]
used a NN to automatically train a vehicle to drive by observing the input from a
camera. Dehghan et al. [21] presents a vehicle color recognition (MMCR) system,
that relies on a deep CNN.
To automatically control the steering angle, recent works focus on using neural
networks-based end-to-end approaches [22]. The Autonomous Land Vehicle in a
Neural Network (ALVINN) system was one of the earlier systems, utilizing a multilayer
perceptron [23] in 1989. Recently, CNNs were commonly used as in the DAVE-2
Project [1]. In [3], the authors proposed an end-to-end trainable C-LSTM network
that uses a LSTM network at the end of the CNN network. A similar approach was
taken by the authors in [24], where the authors designed a 3D CNN model with
residual connections and LSTM layers. Other researchers also implemented different
variants of convolutional architectures for end-to-end models as in [25–27]. Another
widely used approach for controlling the vehicle steering angle in autonomous systems is
sensor fusion, where image data are combined with other sensor data, such as LiDAR,
RADAR and GPS, to improve the accuracy of autonomous operations [28, 29]. For
instance, in [26], the authors designed a fusion network using both image features
and LiDAR features based on VGGNet.
There has been significant progress in using AI on several cooperative and con-
nected vehicle related issues, including network congestion, intersection collision
warning, wrong-way driving warning, remote diagnostics of vehicles, etc. For in-
stance, a centrally controlled approach to manage network congestion at
intersections has been presented in [30] with the help of a specific unsupervised
learning algorithm, k-means clustering. The approach basically addresses the con-
gestion problem when vehicles stop at a red light in an intersection, where the road
side infrastructures observe the wireless channels to measure and control chan-
nel congestion. CoDrive [31] proposes an AI cooperative system for an open-car
ecosystem, where cars collaborate to improve the positioning of each other. Co-
Drive results in precise reconstruction of a traffic scene, preserving both its shape
and size. The work in [32] uses the received signal strength of the packets received
by the Road Side Units (RSUs) and sent by the vehicles on the roads to predict the
position of vehicles. To this end, they adopt a cooperative
machine-learning methodology and compare three widely recognized techniques:
K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest.
In CVS systems, drivers’ behaviors and online and real-time decision making on
the road directly affect the performance of the system. However, the behaviors of
human drivers are highly unpredictable compared to pre-programmed driving assis-
tance systems, which makes it hard for CAVs to make predictions based on human
behavior; this creates another important issue in the area. Many AI applications
March 16, 2020 9:34 ws-rv961x669 HBPRCV-6th Edn.–11573 FinalVersion page 369

Connected and Autonomous Vehicles in the Deep Learning Era 369

have been proposed to tackle that issue including [5, 6, 33]. For example, in [33],
a two-stage data-driven approach has been proposed: (I) classify driving patterns
of on-road surrounding vehicles using the Gaussian mixture models (GMM); and
(II) predict vehicles’ short-term lateral motions based on real-world vehicle mobility
data. Sekizawa et al. [34] developed a stochastic switched auto-regressive exogenous
model to predict the collision avoidance behavior of drivers using simulated driving
data in a virtual reality system. Chen et al., in 2018, designed a visibility-based col-
lision warning system that uses NNs to build four models to predict vehicle rear-end
collisions under a low-visibility environment [35]. With historical traffic data, Jiang
and Fei, in 2016, employed neural network models to predict average traffic speeds
of road segments and a forward-backward algorithm on Hidden Markov models to
predict speeds of an individual vehicle [36].
For the prediction of drivers’ maneuvers, Yao et al., in 2013, developed a para-
metric lane change trajectory prediction approach based on real human lane change
data. This method generated a similar parametric trajectory according to the k-
Nearest real lane change instances [37]. In [38], an online learning-based approach
is proposed to predict lane change intention, incorporating SVM and Bayesian
filtering. Liebner et al. developed a prediction approach for lateral motion at urban
intersections with and without the presence of preceding vehicles [39]. The study
focused on the parameter of the longitudinal velocity and the appearance of pre-
ceding vehicles. In [40], a multilayer perceptron approach is proposed to predict the
probability of lane changes by surrounding vehicles and their trajectories based on the
history of the vehicles’ positions and their current positions. Woo et al., in 2017,
constructed a lane change prediction method for surrounding vehicles. The method
employed SVM to classify driver intention classes based on a feature vector and
used the potential field method to predict trajectory [41].
With the purpose of improving the driver experience in CAVs, works have been
proposed on predictive maintenance, automotive insurance (to speed up the
process of filing claims when accidents occur), car manufacturing improved by AI,
driver behavior monitoring, identification, recognition and alert [42]. Other works
focus on eye gaze, eye openness, and distracted driving detection, alerting the driver
to keep their eyes on the road [43]. Some advanced AI facial recognition algorithms
are used to allow access to the vehicle and detect which driver is operating the
vehicle; the system can then automatically adjust the seat, mirrors, and temperature to
suit the individual. For example, [44] presents a deep face detection vehicle system
for driver identification that can be used in access control policies. These systems
have been devised to provide customers greater user experience and to ensure safety
on the roads.
Other works focus on traffic flow prediction, traffic congestion alleviation, fuel
consumption reduction, and various location-based services. For instance, in [45], a
probabilistic graphical model, Poisson regression trees (PRT), has been used for two
correlated tasks: the LTE communication connectivity prediction and the vehicular
traffic prediction. A novel deep-learning-based traffic flow prediction method based
on a stacked auto-encoder model has further been proposed in [46], where auto-
encoders are used as building blocks to represent traffic flow features for prediction,
achieving significant performance improvement.

3. Relevant Issues

The notion of a “self-driving vehicle” has been around for quite a long time now,
yet the fact that a fully automated vehicle is not available for sale has created some confu-
sion [47–49]. To put the concept into measurable terms, the United States’
Department of Transportation’s (USDOT) National Highway Traffic Safety Ad-
ministration (NHTSA) defines 6 levels of automation [47]. They released this clas-
sification for smooth standardization and to measure the safety ratings for AVs.
The levels span from 0 to 5, with level 0 referring to no automation at all, where
the human driver does all the control and maneuvering. In level 1 of automation, an
Advanced Driver Assistance System (ADAS) helps the human driver with either
control (i.e. accelerating, braking) or maneuvering (steering) in certain circumstances,
albeit not both simultaneously. Adaptive Cruise Control (ACC) falls under this
level of automation as it can vary the power to maintain the user-set speed but the
automated control is limited to maintaining the speed, not the lateral movement. At
the next level of automation (Level 2, Partial Automation), the ADAS is capable of
controlling and maneuvering simultaneously, but under certain circumstances. So
the human driver still has to monitor the vehicle’s surroundings and perform the
rest of the controls when required. At level 3 (Conditional Automation), the ADAS
does not require the human driver to monitor the environment all the time. In
certain circumstances, the ADAS is fully capable of performing all parts of the
driving task. The range of safe-automation scenarios is larger at this level than in
level 2. However, the human driver should still be ready to regain control when the
system asks for it in such circumstances. In all other scenarios, the control is up to
the human driver. Level 4 of automation is called “High Automation” as the ADAS
at level 4 can take control of the vehicle in most scenarios, and a human driver is not
necessarily required to take control from the system. However, in critical weather where
the sensor information might be noisier (e.g. in rain or snow), the system may
disable the automation for safety concerns requiring the human driver to perform
all of the driving tasks [47–49]. Currently, many private sector car companies and
investors are testing and analyzing their vehicles at the level 4 standard while putting a
safety driver behind the wheel, which effectively brings the safety testing down to
levels 2 and 3. All the automakers and investors are currently putting their research
and development efforts toward eventually reaching level 5, which refers to full automation
where the system is capable of performing all of the driving tasks without requiring
any human takeover under any circumstance [47, 48].
The applications presented in Section 2 show a promising future for data-driven
deep learning algorithms in CAVs. However, the jump from level 2 to levels 3, 4
and 5 is substantial from the AI perspective and naively applying existing deep
learning methods is currently insufficient for full automation due to the complex
and unpredictable nature of CAVs [48, 49]. For example, in level 3 (Conditional
Driving Automation) the vehicles need to have intelligent environmental detection
capabilities and must be able to make informed decisions for themselves, such as accelerating
past a slow-moving vehicle; in level 4 (High Driving Automation) the vehicles have
full automation in specific controlled areas and can even intervene if things go
wrong or there is a system failure; finally, in level 5 (Full Driving Automation)
the vehicles do not require human attention at all [42, 47]. Therefore, how to adapt
existing solutions to better handle such requirements remains a challenging task.
In this section, we identify some research topics for further investigation and in
particular we propose one new approach to the control of the steering angle for
CAVs.
There are various open research problems in the area [11, 42, 48, 49]. For
instance, further work can be done on the detection of the driver’s physical movements
and posture, such as eye gaze, eye openness, and head position, to detect and alert
a distracted driver with lower latency [11, 12, 50]. Upper-body detection can
determine the driver’s posture so that, in case of a crash, airbags can be deployed in a
manner that reduces injury based on how the driver is sitting [51]. Similarly,
detecting the driver’s emotion can also help with the decision making [52].
Connected vehicles can use an Autonomous Driving Cloud (ADC) platform
that allows need-based data availability [11, 14, 15]. The ADC can use
AI algorithms to make meaningful decisions. It can act as the control policy or the
brain of the autonomous vehicle. This intelligent agent can also be connected to a
database which acts as a memory where past driving experiences are stored [53].
This data along with the real-time input coming in through the autonomous vehi-
cle about the immediate surroundings will help the intelligent agent make accurate
driving decisions. In the vehicular network side, AI can exploit multiple sources of
data generated and stored in the network (e.g. vehicle information, driver behavior
patterns, etc.) to learn the dynamics in the environment [4–6] and then extract ap-
propriate features to use for the benefit of many tasks for communications purposes,
such as signal detection, resource management, and routing [11, 15]. However, it is a
non-trivial task to extract semantic information from the huge amount of accessible
data, which might have been contaminated by noise or redundancy, and thus infor-
mation extraction needs to be performed [11, 53]. In addition, in vehicular networks,
data are naturally generated and stored across different units in the network [15, 54]
(e.g., RVs, RSUs, etc). This brings challenges to the applicability of most existing
machine learning algorithms that have been developed under the assumption that
data are centrally controlled and easily accessible [11, 15]. As a result, distributed
learning methods are desired in CAVs that act on partially observed data and have
the ability to exploit information obtained from other entities in the network [7, 55].
Furthermore, additional overheads incurred by the coordination and sharing of in-
formation among various units in vehicular networks for distributed learning shall
be properly accounted for to make the system work effectively [11].
In particular, an open area for further research is how to integrate the informa-
tion from local sensors (perception) and remote information (cooperative) [7]. We
present a novel application in this chapter that steps into this domain. For instance,
for controlling the steering angle, all the above-listed works focus on utilizing data
obtained from the on-board sensors, and they do not consider the assisted data that
comes from another car. In the following section, we demonstrate that using addi-
tional data that comes from the vehicle ahead helps us obtain better accuracy in
controlling the steering angle. In our approach, we utilize the information that is
available to a vehicle ahead of our car to control the steering angle.

4. A Case Study: Our Proposed Approach

We consider the control of the steering angle as a regression problem where the
input is a stack of images and the output is the steering angle. Our approach can
also process each image individually. Considering multiple frames in a sequence
can benefit us in situations where the present image alone is affected by noise or
contains less useful information, such as when the current image is largely saturated by
direct sunlight. In such situations, the correlation between the current frame and
the past frames can be useful to decide the next steering value. We use LSTM to
utilize multiple images as a sequence. LSTM has a recursive structure acting as a
memory, through which the network can keep some past information to predict the
output based on the dependency of the consecutive frames [56, 57].
Our proposed idea in this chapter relies on the fact that the condition of the
road ahead has already been seen by another vehicle recently and we can utilize
that information to control the steering angle of our car as discussed above. Fig.
1 illustrates our approach. In the figure, Vehicle 1 receives a set of images from
Vehicle 2 over V2V communication and keeps the data at the on-board buffer. It
combines the received data with the data obtained from the on-board camera and
processes those two sets of images on-board to control the steering angle via an
end-to-end deep architecture. This method enables the vehicle to look ahead of its
current position at any given time.
Our deep architecture is presented in Fig. 2. The network takes the set of
images coming from both vehicles as input and it predicts the steering angle as
the regression output. The details of our deep architecture are given in Table 1.
Since we construct this problem as a regression problem with a single unit at the
end, we use the Mean Squared Error (MSE) loss function in our network during the
training.

Fig. 2. CNN + LSTM + FC Image sharing model. Our model uses 5 convolutional layers,
followed by 3 LSTM layers, followed by 4 FC layers. See Table 1 for further details of our proposed
architecture.

Table 1. Details Of Proposed Architecture


Layer Type Size Stride Activation
0 Input 640*480*3*2X - -
1 Conv2D 5*5, 24 Filters (5,4) ReLU
2 Conv2D 5*5, 32 Filters (3,2) ReLU
3 Conv2D 5*5, 48 Filters (5,4) ReLU
4 Conv2D 5*5, 64 Filters (1,1) ReLU
5 Conv2D 5*5, 128 Filters (1,2) ReLU
6 LSTM 64 Units - Tanh
7 LSTM 64 Units - Tanh
8 LSTM 64 Units - Tanh
9 FC 100 - ReLU
10 FC 50 - ReLU
11 FC 10 - ReLU
12 FC 1 - Linear
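A minimal Keras sketch of the architecture in Table 1 is shown below, assuming the 2x frames (own and received) are stacked along a time axis and that "same" padding is used in the convolutions; the padding choice, input pipeline and any regularization are assumptions rather than the exact implementation.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(seq_len, height=480, width=640, channels=3):
    conv_specs = [(24, (5, 4)), (32, (3, 2)), (48, (5, 4)), (64, (1, 1)), (128, (1, 2))]
    frames = layers.Input(shape=(seq_len, height, width, channels))
    x = frames
    for filters, strides in conv_specs:                     # 5 time-distributed Conv2D layers
        x = layers.TimeDistributed(
            layers.Conv2D(filters, (5, 5), strides=strides,
                          padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.LSTM(64, return_sequences=True)(x)           # 3 LSTM layers with 64 units each
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.LSTM(64)(x)
    for units in (100, 50, 10):                             # FC layers 100 -> 50 -> 10
        x = layers.Dense(units, activation="relu")(x)
    angle = layers.Dense(1, activation="linear")(x)         # steering angle (regression output)
    model = models.Model(frames, angle)
    model.compile(optimizer="adam", loss="mse")             # MSE loss, as stated in the text
    return model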

5. Experiment Setup

In this section we will elaborate further on the dataset as well as data pre-processing
and evaluation metrics. We conclude the section with details of our implementation.

5.1. Dataset

In order to compare our results to existing work in the literature, we used the self-
driving car dataset by Udacity. The dataset has a wide variation of 100K images
from simultaneous Center, Left and Right cameras on a vehicle, collected in sunny
and overcast weather; 33K images belong to the center camera. The dataset contains
the data of 5 different trips with a total drive time of 1694 seconds. Test vehicle
has 3 cameras mounted as in [1], collecting images at a rate of nearly 20 Hz. Steering
wheel angle, acceleration, brake, and GPS data were also recorded. The distribution of
the steering wheel angles over the entire dataset is shown in Fig. 3; as shown there, the
dataset covers a wide range of steering angles. The image size is 480×640×3 pixels and the
total dataset is 3.63 GB. Since there is no dataset available with V2V communication
images currently, we simulate the environment by creating a virtual vehicle that is moving
ahead of the autonomous vehicle and sharing camera images by using the Udacity dataset.

Fig. 3. The angle distribution within the entire Udacity dataset (angle in radians vs. total number
of frames); only angles between -1 and 1 radians are shown.
The Udacity dataset has been used widely in the recent relevant literature [24, 58],
and we also use it in this chapter to compare our results to the ex-
isting techniques in the literature. Along with the steering angle, the dataset contains
spatial (latitude, longitude, altitude) and dynamic (angle, torque, speed) informa-
tion labelled with each image. The data format for each image is: index, timestamp,
width, height, frame id, filename, angle, torque, speed, latitude, longitude, altitude.
For our purpose, we are only using the sequence of center-camera images.

5.2. Data Preprocessing

The images in the dataset are recorded at a rate of around 20 frames per second.
Therefore, there is usually a large overlap between consecutive frames. To avoid
overfitting, we used image augmentation to get more variance in our image dataset.
Our image augmentation technique randomly changes brightness and contrast to alter
pixel values. We also tested image cropping to exclude possible redundant infor-
mation that is not relevant in our application. However, in our tests the models
perform better without cropping.
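A hedged sketch of this augmentation step is given below; the brightness and contrast ranges are assumptions, not the settings used for the reported results.

import tensorflow as tf

def augment(image):
    # random brightness/contrast changes; assumes images are scaled to [0, 1]
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return tf.clip_by_value(image, 0.0, 1.0)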
For the sequential model implementation, we preprocessed the data in a different
way. Since we do want to keep the visual sequential relevance in the series of frames
while avoiding overfitting, we shuffle the dataset while keeping track of the sequential
information. We then train our model on 80% of the images, keeping each subset in its
original sequence, and validate on the remaining 20%.
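
One way to realize this split is sketched below, under the assumption that each training sample is referenced by the index of its first frame and keeps its internal frame order; the helper name is illustrative.

    import numpy as np

    def split_sequential(sample_start_indices, train_fraction=0.8, seed=0):
        """Shuffle sample indices (each sample keeps its own consecutive frames)
        and split them into training (80%) and validation (20%) subsets."""
        rng = np.random.default_rng(seed)
        idx = np.array(sample_start_indices)
        rng.shuffle(idx)
        n_train = int(train_fraction * len(idx))
        return idx[:n_train], idx[n_train:]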

5.3. Vehicle-assisted Image Sharing


Modern wireless technology allows us to share data between vehicles at high bitrates
of up to Gbits/s (e.g., in peer-to-peer and line-of-sight mmWave technologies [54,
59–61]). Such communication links can be utilized to share images between vehicles
for improved control. In our experiments, we simulate that situation between two vehicles
as follows: we assume that the two vehicles are Δt seconds apart. We take the x consecutive
frames (t, t-1, ..., t-x+1) from the self-driving vehicle (vehicle 1) at time step t and the
set of x future frames starting at (t + Δt) from the other vehicle. Thus, a single input
sample contains a set of 2x frames for the model.
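
The sketch below illustrates, under simplifying assumptions, how a single 2x-frame sample can be assembled from the ordered Udacity frame sequence; the helper name make_sample and the list-based representation are illustrative.

    def make_sample(frames, t, x=8, delta_t=30):
        """Build the 2x-frame input for time step t.

        frames  : time-ordered list of center-camera images
        t       : current time index of the self-driving vehicle (t >= x - 1)
        x       : number of consecutive frames taken from each vehicle
        delta_t : gap (in frames) between the two vehicles
        """
        own = [frames[t - i] for i in range(x)]               # t, t-1, ..., t-x+1
        shared = [frames[t + delta_t + i] for i in range(x)]  # x future frames from the lead vehicle
        return own + shared                                   # 2x frames in total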

5.4. Evaluation Metrics


The steering angle is a continuous variable predicted at each time step over the sequential
data, and mean absolute error (MAE) and root mean squared error (RMSE) are two of the
most commonly used metrics in the literature to measure the effectiveness of such control
systems. For example, RMSE is used in [24, 58] and MAE in [62]. Both MAE and RMSE
express the average model prediction error, can range from 0 to ∞, and are indifferent to
the sign of the error. Lower values are better for both metrics.
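
For reference, both metrics are straightforward to compute; a small sketch over arrays of ground-truth and predicted steering angles is given below.

    import numpy as np

    def mae(y_true, y_pred):
        """Mean absolute error between true and predicted steering angles."""
        return float(np.mean(np.abs(y_true - y_pred)))

    def rmse(y_true, y_pred):
        """Root mean squared error between true and predicted steering angles."""
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))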

5.5. Baseline Networks

Fig. 4. An overview of the baseline models (A-E) used in this chapter. The details of each model
can be found in its respective source paper.

As baselines, we include multiple deep architectures that have been proposed in the
literature against which to compare our proposed algorithm. Those models, from [1, 24]
and [58], are, to the best of our knowledge, the best reported camera-only approaches in
the literature. In total, we chose 5 baseline end-to-end algorithms to compare our results
against, and we refer to them as models A, B, C, D and E in the rest of this chapter.
Model A is our implementation of the model presented in [1]. Models B and C follow the
proposals of [24]. Models D and E are reproduced as in [58]. An overview of these models
is given in Fig. 4. Model A uses a CNN-based network, while Model B combines an LSTM
with a 3D-CNN and uses 25 time-steps as input. Model C is based on the ResNet model
[63], and Model D uses the difference image of two given time-steps as input to a
CNN-based network. Finally, Model E uses the concatenation of two images from
different time-steps as input to a CNN-based network.

5.6. Implementation and Hyperparameter Tuning


We use Keras with the TensorFlow backend in our implementations. Final training is done
on two NVIDIA Tesla V100 16GB GPUs. On our final system, training took 4 hours for
the model in [1] and between 9 and 12 hours for the deeper networks used in [24] and [58]
and for our proposed network.
We used the Adam optimizer in all our experiments with the following final parameters:
learning rate = 10^-2, β1 = 0.900, β2 = 0.999, ε = 10^-8. For the learning rate, we tested
values from 10^-1 to 10^-6 and found the best-performing learning rate to be 10^-3. We
also studied the effect of the minibatch size on our application: minibatch sizes of 128, 64
and 32 were tested, and 64 yielded the best results; therefore, we used a minibatch size of
64 in the experiments reported in this chapter.
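
A hedged sketch of this training configuration is shown below, reusing the build_model sketch given after Table 1. The data variables are placeholders, and the 10^-3 learning rate reflects the best-performing value reported above; this is an illustration of the setup, not our exact training script.

    from tensorflow.keras.optimizers import Adam

    model = build_model(x_frames=8)  # architecture sketch from Table 1
    model.compile(
        optimizer=Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8),
        loss='mse')                  # MSE loss, as plotted in Fig. 5
    history = model.fit(
        train_inputs, train_angles,                # placeholder training arrays
        validation_data=(val_inputs, val_angles),  # placeholder validation arrays
        batch_size=64,                             # minibatch size that performed best
        epochs=15)
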
Fig. 5 shows how the value of the loss function changes as the number of epochs increases
for both the training and validation data sets. The MSE loss decreases rapidly over the
first few epochs and then stabilizes, remaining almost constant from around the 14th
epoch onward.

6. Analysis and Results

Table 2 compares the RMSE values of multiple end-to-end models after training them on
the Udacity dataset. In addition to the five baseline models listed in Section 5.5, we also
include two variants of our model: Model F and Model G. Model F is our proposed
approach with x = 8 for each vehicle, while Model G sets x = 10 time-steps for each
vehicle instead of 8. Since the RMSE values on the Udacity dataset were not reported for
Model D and Model E in [58], we re-implemented those models, computed their RMSE
values on the Udacity dataset, and report the results from our implementation in Table 2.
Table 3 lists the MAE values computed for our implementations of the models
A, D, E, F, and G. Models A, B, C, D, and E do not report their individual MAE

Fig. 5. Training and validation MSE loss vs. number of epochs for our best model with x = 8.

Fig. 6. RMSE vs. x value for the training and validation sets. We trained our algorithm at various
x values and computed the respective RMSE value. As shown in the figure, the minimum value is
obtained at x = 8.

values in their respective sources. While we re-implemented each of those models in
Keras, our implementations of models B and C yielded higher RMSE values than the
reported ones even after hyperparameter tuning. Consequently, we did not include the
MAE results of our implementations of those two models in Table 3. The MAE values
for models A, D and E were obtained after hyperparameter tuning.
We then study the effect of changing the value of x on the performance of our
model in terms of RMSE. We train our model at separate x values where x is set

Table 2. Comparison to related work in terms of RMSE. (a) x = 8, (b) x = 10.

Model:       A       B       C       D       E       F (a)   G (b)
             [1]     [24]    [24]    [58]    [58]    Ours    Ours
Training     0.099   0.113   0.077   0.061   0.177   0.034   0.044
Validation   0.098   0.112   0.077   0.083   0.149   0.042   0.044

to 1, 2, 4, 6, 8, 10, 12, 14 and 20. We computed the RMSE value for both the training
and validation data at each x value. The results are plotted in Fig. 6. As shown in the
figure, we obtained the lowest RMSE value for both the training and validation data at
x = 8, where RMSE = 0.042 for the validation data. The figure also shows that choosing
an appropriate x value is important to obtain the best performance from the model: the
number of images used in the input clearly affects the performance. Next, we study how
changing the Δt value affects the performance of our end-to-end system in terms of RMSE
during testing, once the algorithm has been trained at a fixed Δt.
Changing Δt corresponds to varying the distance between the two vehicles. For that
purpose, we first set Δt = 30 frames (i.e., a 1.5-second gap between the vehicles) and
trained the algorithm accordingly (with x = 10). Once our model was trained and had
learned the relation between the given input image stacks and the corresponding output
value at Δt = 30, we studied the robustness of the trained system as the distance between
the two vehicles changes during testing. Fig. 7 shows how the RMSE value changes as
we vary the distance between the vehicles at test time. For that, we ran the trained model
over the entire validation data, where the input was formed from the validation data at Δt
values varying between 0 and 95 in increments of 5 frames, and we computed the RMSE
value at each of those Δt values.
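
A sketch of this evaluation loop is given below, assuming the make_sample and rmse helpers introduced earlier, a trained model, and a set of valid time indices t_indices; how the inputs are batched here is an assumption.

    import numpy as np

    def sweep_delta_t(model, frames, angles, t_indices, x=10):
        """Evaluate a model trained at a fixed delta_t over test-time delta_t values
        ranging from 0 to 95 frames in steps of 5, returning the RMSE per delta_t."""
        results = {}
        y_true = np.array([angles[t] for t in t_indices])
        for delta_t in range(0, 100, 5):
            batch = np.stack([make_sample(frames, t, x=x, delta_t=delta_t)
                              for t in t_indices])
            y_pred = model.predict(batch).ravel()
            results[delta_t] = rmse(y_true, y_pred)
        return results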

Fig. 7. RMSE value vs. the number of frames ahead (Δt) over the validation data. The model is
trained at Δt = 30 and x = 10. Between the Δt values 13 and 37 (the red area), the change in RMSE
remains small, and the algorithm yields almost the same minimum value at Δt = 20, which differs
from the training value.

Table 3. Comparison to related work in terms of MAE. (a) x = 8, (b) x = 10.

Model:       A       D       E       F (a)   G (b)
             [1]     [58]    [58]    Ours    Ours
Training     0.067   0.038   0.046   0.022   0.031
Validation   0.062   0.041   0.039   0.033   0.036

As shown in Fig. 7, we obtain the minimum RMSE value (0.0443) at Δt = 30, since the
model was also trained with Δt = 30. However, another (local) minimum (0.0444), almost
identical to the value obtained at the training Δt, is reached at Δt = 20. Because of these
two local minima, the change in error remains small inside the red area shown in the
figure. However, the error does not increase evenly on both sides of the training value
(Δt = 30), as most of the RMSE values within the red area lie on the left side of the
training value.
Next, we demonstrate the performance of multiple models over each frame of the entire
Udacity dataset in Fig. 9. In total, there are 33,808 images in the dataset. The ground-truth
for the figure is shown in Fig. 8, and the difference between the prediction and the
ground-truth is given in Fig. 9 for multiple algorithms. In each plot, the maximum and
minimum error values made by each algorithm are highlighted with red lines. In Fig. 9,
we only show the results obtained for Model A, Model D, Model E and Model F (ours),
because no implementation of Model B or Model C from [24] is publicly available, and
our implementations of those models (as described in the original paper) did not yield
results good enough to report here. Our algorithm (Model F) demonstrated the best
overall performance with the lowest RMSE value. Comparing the red lines across the
plots (i.e., comparing the maximum and minimum error values) shows that our algorithm
yields the smallest maximum error over the entire dataset.

Fig. 8. Steering angle (in radians) vs. the index of each image frame in the data sequence for the
Udacity dataset. This data forms the ground-truth for our experiments. The upper and lower red
lines highlight the maximum and minimum angle values, respectively.

7. Concluding Remarks

In this chapter, we provided an overview of AI applications that address the challenges
in the emerging area of CAVs. We briefly discussed recent advances in applying

Fig. 9. Individual error values (in radians) made at each time frame, plotted for four models:
Model A, Model D, Model E and Model F. The dataset is the Udacity dataset; the ground-truth is
shown in Fig. 8. The upper and lower red lines highlight the maximum and minimum errors made
by each algorithm. The error for each frame (the y axis) for Models A, D and F is plotted in the
range [-1.5, +1.2], and the error for Model E is plotted in the range [-4.3, +4.3].

machine learning to CAVs and highlighted several open issues for further research. We
presented a new approach in which cooperative self-driving vehicles share images to
improve the accuracy of steering-angle control. Our end-to-end approach uses a deep
model combining CNN, LSTM and FC layers, and it takes as input the on-board data
combined with the data (images) received from another vehicle. Our proposed model
using shared images yields the lowest RMSE value when compared to the other existing
models in the literature.
Unlike previous works that only use and focus on local information obtained from a single
vehicle, we proposed a system in which vehicles communicate with each other and share
data. In our experiments, we demonstrated that our proposed end-to-end model with data
sharing in cooperative environments yields better performance than previous approaches
that rely only on the data obtained and used on the same vehicle. Our end-to-end model
was able to learn and predict accurate steering angles without manual decomposition into
road or lane-marking detection.
One potentially strong argument against image sharing is that one could instead use the
geo-spatial information along with the steering angle from the leading vehicle and simply
apply the same angle value at that position. Here we argue that: (I) using GPS makes the
prediction dependent on location data which, like many other sensor types, can provide
faulty location values for various reasons, and that can force algorithms to use the wrong
image sequence as input.
More work and analysis are needed to improve the robustness of our proposed model.
While this chapter relies on simulated data (where the data sharing between the vehicles
is simulated from the Udacity dataset), we are in the process of collecting real data from
multiple cars communicating over V2V and will perform a more detailed analysis on that
new, real data.

Acknowledgement

This work was done as a part of CAP5415 Computer Vision class in Fall 2018
at UCF. We gratefully acknowledge the support of NVIDIA Corporation with the
donation of the GPU used for this research.

References

[1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D.
Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, End to
End Learning for Self-Driving Cars. (2016). ISSN 0006341X. doi: 10.2307/2529309.
URL https://images.nvidia.com/content/tegra/automotive/images/2016/
solutions/pdf/end-to-end-dl-using-px.pdf, http://arxiv.org/abs/1604.07316.
[2] Z. Chen and X. Huang. End-To-end learning for lane keeping of self-driving cars. In
IEEE Intelligent Vehicles Symposium, Proceedings, (2017). ISBN 9781509048045. doi:
10.1109/IVS.2017.7995975.
[3] H. M. Eraqi, M. N. Moustafa, and J. Honer, End-to-End Deep Learning for Steering
Autonomous Vehicles Considering Temporal Dependencies (oct. 2017). URL http:
//arxiv.org/abs/1710.03804.
[4] H. Nourkhiz Mahjoub, B. Toghi, S. M. Osman Gani, and Y. P. Fallah, V2X System
Architecture Utilizing Hybrid Gaussian Process-based Model Structures, arXiv e-
prints. art. arXiv:1903.01576 (Mar, 2019).
[5] H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A stochastic hybrid framework for driver
behavior modeling based on hierarchical dirichlet process. In 2018 IEEE 88th Vehic-
ular Technology Conference (VTC-Fall), pp. 1–5 (Aug, 2018). doi: 10.1109/VTCFall.
2018.8690570.
[6] H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A driver behavior modeling structure
based on non-parametric bayesian stochastic hybrid architecture. In 2018 IEEE 88th
Vehicular Technology Conference (VTC-Fall), pp. 1–5 (Aug, 2018). doi: 10.1109/
VTCFall.2018.8690965.
[7] R. Valiente, M. Zaman, S. Ozer, and Y. P. Fallah, Controlling steering angle for
cooperative self-driving vehicles utilizing cnn and lstm-based deep networks, arXiv
preprint arXiv:1904.04375. (2019).
[8] M. Aly. Real time detection of lane markers in urban streets. In IEEE Intelligent
Vehicles Symposium, Proceedings, (2008). ISBN 9781424425693. doi: 10.1109/IVS.
2008.4621152.
[9] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez. Road scene segmentation from
a single image. In Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), (2012). ISBN
9783642337857. doi: 10.1007/978-3-642-33786-4 28.
[10] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models
from large-scale video datasets. In Proceedings - 30th IEEE Conference on Com-
puter Vision and Pattern Recognition, CVPR 2017, (2017). ISBN 9781538604571.
doi: 10.1109/CVPR.2017.376.
[11] J. Li, H. Cheng, H. Guo, and S. Qiu, Survey on artificial intelligence for vehicles,
Automotive Innovation. 1(1), 2–14, (2018).
[12] S. R. Narla, The evolution of connected vehicle technology: From smart drivers to
smart cars to self-driving cars, Ite Journal. 83(7), 22–26, (2013).
[13] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, nature. 521(7553), 436, (2015).
[14] A. Luckow, M. Cook, N. Ashcraft, E. Weill, E. Djerekarov, and B. Vorster. Deep
learning in the automotive industry: Applications and tools. In 2016 IEEE Interna-
tional Conference on Big Data (Big Data), pp. 3759–3768. IEEE, (2016).
[15] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, Machine learning for vehic-
ular networks: Recent advances and application examples, ieee vehicular technology
magazine. 13(2), 94–101, (2018).
[16] W. Schwarting, J. Alonso-Mora, and D. Rus, Planning and decision-making for au-
tonomous vehicles, Annual Review of Control, Robotics, and Autonomous Systems.
1, 187–210, (2018).
[17] N. Agarwal, A. Sharma, and J. R. Chang. Real-time traffic light signal recognition
system for a self-driving car. In Advances in Intelligent Systems and Computing,
(2018). ISBN 9783319679334. doi: 10.1007/978-3-319-67934-1 24.
[18] B. S. Shin, X. Mou, W. Mou, and H. Wang, Vision-based navigation of an unmanned
surface vehicle with object detection and tracking abilities, Machine Vision and Ap-
plications. (2018). ISSN 14321769. doi: 10.1007/s00138-017-0878-7.
[19] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka,
P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, et al., An empirical evaluation of deep
learning on highway driving, arXiv preprint arXiv:1504.01716. (2015).
[20] D. Pomerleau. Rapidly adapting artificial neural networks for autonomous navigation.
In Advances in neural information processing systems, pp. 429–435, (1991).
[21] A. Dehghan, S. Z. Masood, G. Shu, E. Ortiz, et al., View independent vehicle
make, model and color recognition using convolutional neural network, arXiv preprint
arXiv:1702.01721. (2017).
[22] A. Amini, G. Rosman, S. Karaman, and D. Rus, Variational end-to-end navigation
and localization, arXiv preprint arXiv:1811.10119. (2018).
[23] D. a. Pomerleau, Alvinn: An Autonomous Land Vehicle in a Neural Network, Ad-
vances in Neural Information Processing Systems. (1989).
[24] S. Du, H. Guo, and A. Simpson. Self-Driving Car Steering Angle Prediction Based
on Image Recognition. Technical report, (2017). URL http://cs231n.stanford.edu/
reports/2017/pdfs/626.pdf.
[25] A. Gurghian, T. Koduri, S. V. Bailur, K. J. Carey, and V. N. Murali. DeepLanes: End-
To-End Lane Position Estimation Using Deep Neural Networks. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition Workshops, (2016).
ISBN 9781467388504. doi: 10.1109/CVPRW.2016.12.
[26] J. Dirdal. End-to-end learning and sensor fusion with deep convolutional networks
for steering an off-road unmanned ground vehicle. PhD thesis, (2018). URL https:
//brage.bibsys.no/xmlui/handle/11250/2558926.
[27] H. Yu, S. Yang, W. Gu, and S. Zhang. Baidu driving dataset and end-To-end reactive
control model. In IEEE Intelligent Vehicles Symposium, Proceedings, (2017). ISBN
9781509048045. doi: 10.1109/IVS.2017.7995742.
[28] H. Cho, Y. W. Seo, B. V. Kumar, and R. R. Rajkumar. A multi-sensor fusion system
for moving object detection and tracking in urban driving environments. In Proceed-
ings - IEEE International Conference on Robotics and Automation, (2014). ISBN
9781479936847. doi: 10.1109/ICRA.2014.6907100.
[29] D. Gohring, M. Wang, M. Schnurmacher, and T. Ganjineh. Radar/Lidar sensor fusion
for car-following on highways. In ICARA 2011 - Proceedings of the 5th International
Conference on Automation, Robotics and Applications, (2011). ISBN 9781457703287.
doi: 10.1109/ICARA.2011.6144918.
[30] N. Taherkhani and S. Pierre, Centralized and localized data congestion control strat-
egy for vehicular ad hoc networks using a machine learning clustering algorithm, IEEE
Transactions on Intelligent Transportation Systems. 17(11), 3275–3285, (2016).
[31] S. Demetriou, P. Jain, and K.-H. Kim. Codrive: Improving automobile positioning
via collaborative driving. In IEEE INFOCOM 2018-IEEE Conference on Computer
Communications, pp. 72–80. IEEE, (2018).
[32] M. Sangare, S. Banerjee, P. Muhlethaler, and S. Bouzefrane. Predicting vehicles’ po-
sitions using roadside units: A machine-learning approach. In 2018 IEEE Conference
on Standards for Communications and Networking (CSCN), pp. 1–6. IEEE, (2018).
[33] C. Wang, J. Delport, and Y. Wang, Lateral motion prediction of on-road preceding
vehicles: a data-driven approach, Sensors. 19(9), 2111, (2019).
[34] S. Sekizawa, S. Inagaki, T. Suzuki, S. Hayakawa, N. Tsuchida, T. Tsuda, and H. Fu-
jinami, Modeling and recognition of driving behavior based on stochastic switched
arx model, IEEE Transactions on Intelligent Transportation Systems. 8(4), 593–606,
(2007).
[35] K.-P. Chen and P.-A. Hsiung, Vehicle collision prediction under reduced visibility
conditions, Sensors. 18(9), 3026, (2018).
[36] B. Jiang and Y. Fei, Vehicle speed prediction by two-level data driven models in
vehicular networks, IEEE Transactions on Intelligent Transportation Systems. 18(7),
1793–1801, (2016).
[37] W. Yao, H. Zhao, P. Bonnifait, and H. Zha. Lane change trajectory prediction by
using recorded human driving data. In 2013 IEEE Intelligent Vehicles Symposium
(IV), pp. 430–436. IEEE, (2013).
[38] P. Kumar, M. Perrollaz, S. Lefevre, and C. Laugier. Learning-based approach for
online lane change intention prediction. In 2013 IEEE Intelligent Vehicles Symposium
(IV), pp. 797–802. IEEE, (2013).
[39] M. Liebner, F. Klanner, M. Baumann, C. Ruhhammer, and C. Stiller, Velocity-based
driver intent inference at urban intersections in the presence of preceding vehicles,
IEEE Intelligent Transportation Systems Magazine. 5(2), 10–21, (2013).
[40] S. Yoon and D. Kum. The multilayer perceptron approach to lateral motion prediction
of surrounding vehicles for autonomous vehicles. In 2016 IEEE Intelligent Vehicles
Symposium (IV), pp. 1307–1312. IEEE, (2016).
[41] H. Woo, Y. Ji, H. Kono, Y. Tamura, Y. Kuroda, T. Sugano, Y. Yamamoto, A. Ya-
mashita, and H. Asama, Lane-change detection based on vehicle-trajectory predic-
tion, IEEE Robotics and Automation Letters. 2(2), 1109–1116, (2017).
[42] C. Rödel, S. Stadler, A. Meschtscherjakov, and M. Tscheligi. Towards autonomous
cars: the effect of autonomy levels on acceptance and user experience. In Proceedings
of the 6th International Conference on Automotive User Interfaces and Interactive
Vehicular Applications, pp. 1–8. ACM, (2014).
[43] J. Palmer, M. Freitas, D. A. Deninger, D. Forney, S. Sljivar, A. Vaidya, and J. Gris-
wold. Autonomous vehicle operator performance tracking (May 30, 2017). US Patent
9,663,118.
[44] C. Qu, D. A. Ulybyshev, B. K. Bhargava, R. Ranchal, and L. T. Lilien. Secure dis-
semination of video data in vehicle-to-vehicle systems. In 2015 IEEE 34th Symposium
on Reliable Distributed Systems Workshop (SRDSW), pp. 47–51. IEEE, (2015).
[45] C. Ide, F. Hadiji, L. Habel, A. Molina, T. Zaksek, M. Schreckenberg, K. Kersting, and
C. Wietfeld. Lte connectivity and vehicular traffic prediction based on machine learn-
ing approaches. In 2015 IEEE 82nd Vehicular Technology Conference (VTC2015-
Fall), pp. 1–5. IEEE, (2015).
[46] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, Traffic flow prediction with big
data: a deep learning approach, IEEE Transactions on Intelligent Transportation
Systems. 16(2), 865–873, (2014).
[47] nhtsa. nhtsa automated vehicles for safety, (2019). URL https://www.nhtsa.gov/
technology-innovation/automated-vehicles-safety.
[48] J. M. Anderson, K. Nidhi, K. D. Stanley, P. Sorensen, C. Samaras, and O. A. Oluwa-
tola, Autonomous vehicle technology: A guide for policymakers. (Rand Corporation,
2014).
[49] W. J. Kohler and A. Colbert-Taylor, Current law and potential legal issues pertaining
to automated, autonomous and connected vehicles, Santa Clara Computer & High
Tech. LJ. 31, 99, (2014).
[50] Y. Liang, M. L. Reyes, and J. D. Lee, Real-time detection of driver cognitive distrac-
tion using support vector machines, IEEE transactions on intelligent transportation
systems. 8(2), 340–350, (2007).
[51] Y. Abouelnaga, H. M. Eraqi, and M. N. Moustafa, Real-time distracted driver posture
classification, arXiv preprint arXiv:1706.09498. (2017).
[52] M. Grimm, K. Kroschel, H. Harris, C. Nass, B. Schuller, G. Rigoll, and T. Moosmayr.
On the necessity and feasibility of detecting a drivers emotional state while driving.
In International Conference on Affective Computing and Intelligent Interaction, pp.
126–138. Springer, (2007).
[53] M. Gerla, E.-K. Lee, G. Pau, and U. Lee. Internet of vehicles: From intelligent grid
to autonomous cars and vehicular clouds. In 2014 IEEE world forum on internet of
things (WF-IoT), pp. 241–246. IEEE, (2014).
[54] B. Toghi, M. Saifuddin, H. N. Mahjoub, M. O. Mughal, Y. P. Fallah, J. Rao, and
S. Das. Multiple access in cellular v2x: Performance analysis in highly congested
vehicular networks. In 2018 IEEE Vehicular Networking Conference (VNC), pp. 1–8
(Dec, 2018). doi: 10.1109/VNC.2018.8628416.
[55] K. Passino, M. Polycarpou, D. Jacques, M. Pachter, Y. Liu, Y. Yang, M. Flint, and
M. Baum. Cooperative control for autonomous air vehicles. In Cooperative control
and optimization, pp. 233–271. Springer, (2002).
[56] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to forget: Continual pre-
diction with LSTM, Neural Computation. (2000). ISSN 08997667. doi: 10.1162/
089976600300015015.
[57] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber,
LSTM: A Search Space Odyssey, IEEE Transactions on Neural Networks and Learn-
ing Systems. (2017). ISSN 21622388. doi: 10.1109/TNNLS.2016.2582924.
[58] D. Choudhary and G. Bansal. Convolutional Architectures for Self-Driving Cars.
Technical report, (2017).
[59] B. Toghi, M. Saifuddin, Y. P. Fallah, and M. O. Mughal, Analysis of Distributed
Congestion Control in Cellular Vehicle-to-everything Networks, arXiv e-prints. art.
arXiv:1904.00071 (Mar, 2019).
[60] B. Toghi, M. Mughal, M. Saifuddin, and Y. P. Fallah, Spatio-temporal dynam-
ics of cellular v2x communication in dense vehicular networks, arXiv preprint
arXiv:1906.08634. (2019).
[61] G. Shah, R. Valiente, N. Gupta, S. Gani, B. Toghi, Y. P. Fallah, and S. D. Gupta,
Real-time hardware-in-the-loop emulation framework for dsrc-based connected vehicle
applications, arXiv preprint arXiv:1905.09267. (2019).
[62] M. Islam, M. Chowdhury, H. Li, and H. Hu, Vision-based Navigation of Au-
tonomous Vehicle in Roadway Environments with Unexpected Hazards, arXiv
preprint arXiv:1810.03967. (2018).
[63] S. Wu, S. Zhong, and Y. Liu, ResNet, CVPR. (2015). ISSN 15737721. doi: 10.1002/
9780470551592.ch2.
INDEX

3D-CNN, 61 brain atrophy rate, 257


adaptive cruise control (ACC), 370 brain magnetic resonance images
advanced driver assistance system (MRIs), 251
(ADAS), 370 brain structural edge labeling, 252
adversarial training, 169 bright spot pattern, 335
Alzheimer's disease, 259, 260
anisotropic 2-D gaussians, 214 canny labeling algorithm for edge, 252
application specific integrated circuit cascaded CNNs, 54
(ASIC), 107 catheter-based technique, 272
associative memory, 323 cellular neural network, 323
atlas images, 251 center loss, 346
atlas-matching, 251 ChangeDetection.Net (CDnet), 52
atrophy, 252 character recognition, 184
attention network, 164 classification probability, 18
attention-based encoder-decoder classifier, 8
(AED), 163 cognitive impairment, 259
attribute-centered loss, 347 cold fusion, 166
attribute-guided cross-modal, 344 colour bleeding, 140, 144-147
attributes, 343 combining views, 126
augmentation, 351 composite operator, 274
automated sorting of fishes, 184 computer-guided steering, 365
automatic speech recognition (ASR), Conditional CycleGAN, 357
159 conditional risk, 18
conjugate priors, 12
background subtraction, 52 connected and autonomous vehicles
back-propagation (BP) algorithm, 31 (CAVs), 365
back-propagation training algorithm, 3 connectionist temporal classification
Bayes classifier, 8 (CTC), 161, 289
Bayes decision rule (BDR), 4, 19 contextual information, 2
Bayes error, 8 contrastive loss, 348
Bayes’ rule, 10 control the steering angle, 365
Bayesian conditional risk estimator convolutional neural networks (CNNs),
(BCRE), 20 31, 51, 107, 251, 365
Bayesian GAN (BGAN) network, 62 coupled DNN, 347
Bayesian MMSE error estimator (BEE), cross-depiction, 291
9 cross-entropy, 346
Bayesian risk estimate (BRE), 20 cross-modal face verification, 343
binarization, 287, 307 curvelet-based texture features, 99
bipartite graph edit distance, 311 curvelets, 87, 89, 94


CycleGAN, 298, 345, 355 face verification, 346


data fusion, 349 fast fourier transform (FFT), 116
deep attractor network (DANet), 170 feature extraction, 2
deep clustering (DPCL), 170 feature learning, 191
deep fusion, 166 feature vector, 8
deep learning methods, 139, 140, feature-label distribution, 8
148-150, 152, 154 field-programmable gate array (FPGA),
deep learning software, 184 107
deep learning, 2, 200, 251, 252, 289, floe size distributions, 231
365 F-Measure score, 53
deep neural networks (DNNs), 107 forged signatures, 305
deep-learning based background fraction surfaces, 209
subtraction, 51 fully connected (FC) layers, 365
deformation field, 258 fully convolutional neural networks
DIA, 287 (FCNN), 56
dice similarity coefficient, 256 fully convolutional semantic network
digital humanities (DH), 290 (FCSN), 58
digital image processing, 3
discrete element method (DEM), 243 gaussian mixture models (GMM), 369
discrete wavelet frame decompositions, gaussian-based spatially adaptive
271 unmixing (GBSAU), 209
dissimilarity representations, 119 general covariance model, 15
document content recognition, 289 generative adversarial network
document image analysis, 287 (conditioned GAN), 202
document image binarization generative adversarial networks
competition (DIBCO), 288 (GANs), 345
dynamic combination, 129 genuine, 305
dynamic learning neural networks, 4 gleason grading system, 99, 103
dynamic view selection, 119 gleason scores, 100, 102
global handwriting characteristics, 306
earth observation, 187 GPU, 252
edit path, 310 gradient descent (GD), 216
effective class-conditional densities, 11, gradient vector flow, 234, 235
26 graph edit distance (GED), 310
effective density, 20 graph, 308
embedded graphics processing unit graph-based representation, 305, 306
(GPU), 107 graphology, 305
encoder network, 161 ground truth (GT), 252, 289
endmembers, 209 ground truth information, 53
end-to-end (E2E), 159 GVF snake algorithm, 236
end-to-end model, 380
entropy-orthogonality loss, 31 handwriting recognition, 289
equations display, 256, 259 handwritten signatures, 305
expected risk, 18 hausdorff edit distance, 313
external elastic membrane (EEM), 271 hopfield associative memory, 339

horizon pattern, 336 machine learning, 2, 187


hungarian algorithm, 312 machine vision and inspection, 185
hyperspectral image, 210 maximal knowledge-driven
information prior (MKDIP), 22
ice boundary detection, 248 maximum curvelet coefficient, 103, 105
ice field generation, 248 mean absolute error (MAE), 221
ice shape enhancement, 238 media-adventitia border, 276
image colourisation, 139-144, 147-149, medical diagnosis, 184, 185
151, 154 minimum sample size, 260
automatic methods, 140-143, 145, Min-Max loss, 31
147, 149 mixed-units, 166
comic colourisation, 152 monaural speech separation, 171
manga colourisation, 152 motion equation, 326
outline colourisation, 152 multi-atlas, 252
image texture analysis, 87 multiscale fully convolutional network
image texture pattern classification, 88 (MFCN), 57
independent covariance model, 14 multi-view learning, 119
Intravascular ultrasound (IVUS), 271 mutli-channel separation, 171
intrinsically Bayesian robust (IBR)
operator, 8 Navier-Stokes equation, 258
intrinsically Bayesian robust classifier nearest-neighbor decision rule, 4
(IBRC), 9 neural networks, 2, 251
inverse-Wishart distribution, 16 neural style transfer, 298
normalization of graph edit distance, 315
joint network, 162 normalization, 309
NVIDIA DGX, 252
keypoint graphs, 308
K-means clustering, 233 object recognition, 31
knowledge distillation, 168 online signature verification, 305
online, 306
labels, 8 optical character recognition (OCR),
language model, 159 290
LAS: listen, attend and spell, 163 optimal Bayesian classifier (OBC), 9
light detection and ranging (LiDAR), 366 optimal Bayesian operator, 8
likelihood function, 19 optimal Bayesian risk classifier
linear sum assignment problem (LSAP), (OBRC), 18, 21
311 optimal Bayesian transfer learning
local classifier accuracy (LCA) method, classifier (OBTLC), 27
131 otsu thresholding, 233
local feature descriptors, 306 out-of-vocabulary, 165
local volume changes, 258
localization and recognition, document parametric shape modeling, 81
summarizing, and captioning, 289 pattern classification, 101
log-jacobians, 258 pattern recognition, 323
long-short-term memory (LSTM), 161, permutation invariant training (PIT), 170
365 person identifications, 184
longitudinal registration, 252 pinch-out patterns, 335

poisson regression trees (PRT), 369 speech separation, 170


polarimetric SAR, 189 static combination, 127
prediction network, 162 statistical pattern recognition, 2
probabilities, 12 statistical power, 252, 260
prostate cancer tissue images, 100 statistical, 306
prostate cancer, 87, 92, 94, 100, 102 statistically defined regions of interest,
260
quadratic assignment problem (QAP), structural edge predictions, 257
311 structural MRIs, 251
structural representation, 306
radial basis functions, 280 structure-aware colourisation, 143, 146,
radial basis networks (RBNs), 75 147
random forest dissimilarity, 121 subspace learning models, 51
random forest, 119, 190 supervised unmixing, 210
random sample, 8 support vector machine (SVM), 3, 51
recognition of underwater objects, 184 syntactical (grammatical) pattern
recurrent neural network (RNN) recognition, 2
models, 113 synthesis, 343
recurrent neural networks (RNNs), 159 synthetic aperture radar, 187
reference-based methods, 139, 140, 154 synthetic image generation, 290
remote sensing, 184, 187
RNN transducer (RNN-T), 162 teacher-student learning, 160
road side units (RSUs), 368 TensorFlow, 252
root mean squared error (RMSE), 375 text-to-speech (TTS), 166
texture classification, 88
SAR, 188, 189 texture feature extraction, 87
scribble-based methods, 139, 140, 145, texture features, 103
147, 154 texture information, 273
sea ice characteristics, 232 texture pattern classification, 92
sea ice parameter, 231 transcoding, 200
segmentation, 251 transposed convolutional neural
seismic pattern recognition, 323 network (TCNN), 55
seismogram, 336
semantic segmentation, 189 udacity dataset, 374
semi-supervised, 211 uncertainty class, 8, 9
shallow fusion, 166 unsupervised, 211
signature verification, 305, 306 user guidance, 150, 153
similarity domains network, 75
similarity or dissimilarity of graphs, 307 vehicle-assisted image sharing, 375
skeleton extraction, 81 vehicle-to-vehicle (V2V)
skeletonization, 307 communication, 365
skull-stripping, 251
soft labels, 168 winograd minimal filter algorithm, 116
spatially adaptive unmixing, 209 word error rate (WER), 289
spectral unmixing, 209 word spotting object detection,
speech recognition, 184 word-piece, 166
