Smart Algorithms Multimedia Imaging
Michael N. Rychagov
Ekaterina V. Tolstaya
Mikhail Y. Sirotenko Editors
Smart Algorithms for Multimedia and Imaging
Signals and Communication Technology
Series Editors
Emre Celebi, Department of Computer Science, University of Central Arkansas,
Conway, AR, USA
Jingdong Chen, Northwestern Polytechnical University, Xi'an, China
E. S. Gopi, Department of Electronics and Communication Engineering, National
Institute of Technology, Tiruchirappalli, Tamil Nadu, India
Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA
H. Vincent Poor, Department of Electrical Engineering, Princeton University,
Princeton, NJ, USA
This series is devoted to fundamentals and applications of modern methods of
signal processing and cutting-edge communication technologies. The main topics
are information and signal theory, acoustical signal processing, image processing
and multimedia systems, mobile and wireless communications, and computer and
communication networks. Volumes in the series address researchers in academia and
industrial R&D departments. The series is application-oriented. The level of
presentation of each individual volume, however, depends on the subject and can
range from practical to scientific.
Indexing: All books in "Signals and Communication Technology" are indexed by Scopus and zbMATH.
For general information about this book series, comments or suggestions, please
contact Mary James at mary.james@springer.com or Ramesh Nath Premnath at
ramesh.premnath@springer.com.
Mikhail Y. Sirotenko
Google Research
New York, NY, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Over the past decades, people have produced vast amounts of multimedia content,
including text, audio, images, animations, and video. The substance of this content
belongs, in turn, to various areas, including entertainment, engineering, medicine,
business, scientific research, etc. This content should be readily processed, analysed,
and displayed by numerous devices like TVs, mobile devices, VR headsets, medical
devices, media players, etc., without losing its quality. This brings researchers and
engineers to the problem of the fast transformation and processing of
multidimensional signals, where they must deal with different sizes and resolutions,
processing speed, memory, and power consumption. In this book, we describe smart
algorithms applied both for multimedia processing in general and in imaging
technology in particular.
In the first book of this series, Adaptive Image Processing Algorithms for Printing
by I.V. Safonov, I.V. Kurilin, M.N. Rychagov, and E.V. Tolstaya, published by
Springer Nature Singapore in 2018, several algorithms were considered for the
image processing pipeline of photo-printer and photo-editing software tools that
we have worked on at different times for processing still images and photos.
The second book, Document Image Processing for Scanning and Printing by the
same authors, published by Springer Nature Switzerland in 2019, dealt with document image processing for scanning and printing. A copying technology must produce perfect copies from extremely varied originals; therefore, in practice, copying is not separable from image enhancement, and from a technical perspective it is best to consider the two jointly.
This book is devoted to multimedia algorithms and imaging, and it is divided into
four main interconnected parts:
• Image and Video Conversion
• TV and Display Applications
• Machine Learning and Artificial Intelligence
• Mobile Algorithms
Image and Video Conversion includes five chapters that cover solutions on super-
resolution using a multi-frame-based approach as well as machine learning-based
super-resolution. They also cover the processing of 3D signals, namely depth
estimation and control, and semi-automatic 2D to 3D video conversion. A compre-
hensive review of visual lossless colour compression technology concludes this part.
TV and Display Applications includes three chapters in which the following
algorithms are considered: video editing, real-time sports episode detection by
video content analysis, and the generation and reproduction of natural effects.
Machine Learning and Artificial Intelligence includes four chapters, where the
following topics are covered: image classification as a service, mobile user profiling,
and automatic view planning in magnetic resonance imaging, as well as dictionary-
based compressed sensing MRI (magnetic resonance imaging).
Finally, Mobile Algorithms consists of four chapters where the following algo-
rithms and solutions implemented for mobile devices are described: a depth camera
based on a colour-coded aperture, the animated graphical abstract of an image, a
motion photo, and approaches and methods for iris recognition for mobile devices.
The solutions presented in the first two books and in the current one have been
included in dozens of patents worldwide, presented at international conferences, and
realized in the firmware of devices and software. The material is based on the
experience of both editors and the authors of particular chapters in industrial research
and technology commercialization. The authors have worked on the development of
algorithms for different divisions of Samsung Electronics Co., Ltd, including the
Printing Business, Visual Display Business, Health and Medical Equipment Divi-
sion, and Mobile Communication Business for more than 15 years.
We should especially note that this book in no way pretends to present an
in-depth review of the achievements accumulated to date in the field of image and
video conversion, TV and display applications, or mobile algorithms. Instead, in this
book, the main results of the studies that we have authored are summarized. We hope
that the main approaches, optimization procedures, and heuristic findings are still
relevant and can be used as a basis for new intelligent solutions in multimedia, TV,
and mobile applications.
How can algorithms capable of being adaptive to image content be developed? In
many cases, inductive or deductive inference can help. Many of the algorithms
include lightweight classifiers or other machine-learning-based techniques, which
have low computational complexity and model size. This makes them deployable on
embedded platforms.
As we have mentioned, the majority of the described algorithms were
implemented as systems-on-chip firmware or as software products. This was a
challenge because, for each industrial task, there are always strict specification
requirements, and, as a result, there are limitations on computational complexity,
memory consumption, and power efficiency. In this book, typically, no device-
dependent optimization tricks are described, though the ideas for effective methods
from an algorithmic point of view are provided.
This book is intended for all those who are interested in advanced multimedia
processing approaches, including applications of machine learning techniques for
the development of effective adaptive algorithms. We hope that this book will serve
as a useful guide for students, researchers, and practitioners.
It is the intention of the editors that each chapter be used as an independent text. In
this regard, at the beginning of a large fragment, the main provisions considered in
the preceding text are briefly repeated with reference to the appropriate chapter or
section. References to the works of other authors and discussions of their results are
given in the course of the presentation of the material.
We would like to thank our colleagues who worked with us both in Korea and at
the Samsung R&D Institute Rus, Moscow, on the development and implementation
of the technologies mentioned in the book, including all of the authors of the
chapters: Sang-cheon Choi, Yang Lim Choi, Dr. Praven Gulaka, Dr. Seung-Hoon
Hahn, Jaebong Yoo, Heejun Lee, Kwanghyun Lee, San-Su Lee, B’jungtae O,
Daekyu Shin, Minsuk Song, Gnana S. Surneni, Juwoan Yoo, Valery
V. Anisimovskiy, Roman V. Arzumanyan, Andrey A. Bout, Dr. Victor V. Bucha,
Dr. Vitaly V. Chernov, Dr. Alexey S. Chernyavskiy, Dr. Aleksey B. Danilevich,
Andrey N. Drogolyub, Yuri S. Efimov, Marta A. Egorova, Dr. Vladimir A. Eremeev,
Dr. Alexey M. Fartukov, Dr. Kirill A. Gavrilyuk, Ivan V. Glazistov, Vitaly
S. Gnatyuk, Aleksei M. Gruzdev, Artem K. Ignatov, Ivan O. Karacharov, Aleksey
Y. Kazantsev, Dr. Konstantin V. Kolchin, Anton S. Kornilov, Dmitry
A. Korobchenko, Mikhail V. Korobkin, Dr. Oxana V. Korzh (Dzhosan), Dr. Igor
M. Kovliga, Konstantin A. Kryzhanovsky, Dr. Mikhail S. Kudinov, Artem
I. Kuharenko, Dr. Ilya V. Kurilin, Vladimir G. Kurmanov, Dr. Gennady
G. Kuznetsov, Dr. Vitaly S. Lavrukhin, Kirill V. Lebedev, Vladislav A. Makeev,
Vadim A. Markovtsev, Dr. Mstislav V. Maslennikov, Dr. Artem S. Migukin, Gleb
S. Milyukov, Dr. Michael N. Mishourovsky, Andrey K. Moiseenko, Alexander
A. Molchanov, Dr. Oleg F. Muratov, Dr. Aleksei Y. Nevidomskii, Dr. Gleb
A. Odinokikh, Irina I. Piontkovskaya, Ivan A. Panchenko, Vladimir
P. Paramonov, Dr. Xenia Y. Petrova, Dr. Sergey Y. Podlesnyy, Petr Pohl,
Dr. Dmitry V. Polubotko, Andrey A. Popovkin, Iryna A. Reimers, Alexander
A. Romanenko, Oleg S. Rybakov, Associate Prof., Dr. Ilia V. Safonov, Sergey
M. Sedunov, Andrey Y. Shcherbinin, Yury V. Slynko, Ivan A. Solomatin, Liubov
V. Stepanova (Podoynitsyna), Zoya V. Pushchina, Prof., Dr.Sc. Mikhail
K. Tchobanou, Dr. Alexander A. Uldin, Anna A. Varfolomeeva, Kira
I. Vinogradova, Dr. Sergey S. Zavalishin, Alexey M. Vil’kin, Sergey
Y. Yakovlev, Dr. Sergey N. Zagoruyko, Dr. Mikhail V. Zheludev, and numerous
volunteers who took part in the collection of test databases and the evaluation of the
quality of our algorithms.
Contributions from our partners at academic and institutional organizations with
whom we are associated through joint publications, patents, and collaborative work,
i.e., Prof. Dr.Sc. Anatoly G. Yagola, Prof. Dr.Sc. Andrey S. Krylov, Dr. Andrey
V. Nasonov, and Dr. Elena A. Pavelyeva from Moscow State University; Academi-
cian RAS, Prof., M.D. Sergey K. Ternovoy, Prof., M.D. Merab A. Sharia, and
M.D. Dmitry V. Ustuzhanin from the Tomography Department of the Cardiology
Research Center (Moscow); Prof., Dr.Sc. Rustam K. Latypov, Dr. Ayrat
F. Khasyanov, Dr. Maksim O. Talanov, and Irina A. Maksimova from Kazan
State University; Academician RAS, Prof., Dr.Sc. Evgeniy E. Tyrtyshnikov from the
Marchuk Institute of Numerical Mathematics RAS; Academician RAS, Prof., Dr.Sc.
Sergei V. Kislyakov, Corresponding Member of RAS, Dr.Sc. Maxim A. Vsemirnov,
and Dr. Sergei I. Nikolenko from the St. Petersburg Department of Steklov Math-
ematical Institute of RAS; Corresponding Member of RAS, Prof., Dr.Sc. Rafael
M. Yusupov, Prof., and Prof., Dr.Sc. Vladimir I. Gorodetski from the St. Petersburg
Institute for Informatics and Automation RAS; Prof., Dr.Sc. Igor S. Gruzman from
Novosibirsk State Technical University; and Prof., Dr.Sc. Vadim R. Lutsiv from
ITMO University (St. Petersburg), are also deeply appreciated.
During all these years and throughout the development of these technologies, we
received comprehensive assistance and active technical support from SRR General
Directors Dr. Youngmin Lee, Dr. Sang-Yoon Oh, Dr. Kim Hyo Gyu, and Jong-Sam
Woo; the members of the planning R&D team: Kee-Hang Lee, Sang-Bae Lee,
Jungsik Kim, Seungmin (Simon) Kim, and Byoung Kyu Min; the SRR IP Depart-
ment, Mikhail Y. Silin, Yulia G. Yukovich, and Sergey V. Navasardyan from
General Administration. All of their actions were always directed toward finding the optimal forms of R&D work for both managers and engineers, generating new approaches to create promising algorithms and SW, and ultimately creating
solutions of high quality. At any time, we relied on their participation and assistance
in resolving issues.
Proofreading of all pages of the manuscript was performed by PRS agency (http://
www.proof-reading-service.com).
About the Editors
Michael N. Rychagov received his MS in acoustical imaging and PhD from Moscow State University (MSU) in 1986 and 1989, respectively. In 2000, he received a Dr.Sc. (Habilitation) from the same university. Since 1991, he has been involved in teaching and research at the National Research University of Electronic Technology (MIET) as an associate professor in the Department of Theoretical and Experimental Physics (1998), professor in the Department of Biomedical Systems (2008), and professor in the Department of Informatics and SW for Computer Systems (2014). In 2004, he joined the Samsung R&D Institute in Moscow, Russia (SRR), working on imaging algorithms for printing, scanning, and copying; TV and display technologies; multimedia; and tomographic areas for almost 14 years, including the last 8 years as Director of Division at SRR. Currently, he is Senior Manager of SW Development at Align Technology, Inc. (USA) in the Moscow branch (Russia). His technical and scientific interests are image and video signal processing, biomedical modelling, engineering applications of machine learning, and artificial intelligence. He is a Member of the Society for Imaging Science and Technology and a Senior Member of IEEE.
Xenia Y. Petrova
1.1.1 Introduction
Super-resolution (SR) is the name given to techniques that allow a single high-
resolution (HR) image to be constructed out of one or several observed low-
resolution (LR) images (Fig. 1.1). Compared to single-frame interpolation, SR
reconstruction is able to restore the high frequency component of the HR image
by exploiting complementary information from multiple LR frames. The SR prob-
lem can be stated as described in Milanfar (2010).
In a traditional setting, most of the SR methods can be classified according to the
model of image formation, the model of the image prior, and the noise model.
Commonly, research has paid more attention to image formation and noise
models, while the image prior model has remained quite simple. The image forma-
tion model may include some linear operators like smoothing and down-sampling
and the motion model that should be considered in the case of multi-frame SR.
At the present time, as machine learning approaches are becoming more popular,
there is a certain paradigm shift towards a more elaborate prior model. Here the main
question becomes “What do we expect to see? Does it look natural?” instead of the
question “What could have happened to a perfect image so that it became the one we
can observe?”
Super-resolution is a mature technology covered by numerous research and survey
papers, so in the text below, we will focus on aspects related to image formation
models, the relation and difference between SR and interpolation, supplementary
X. Y. Petrova (*)
Samsung R&D Institute Russia (SRR), Moscow, Russia
e-mail: xen@excite.com
Fig. 1.2 Interpolation grid (a) Template with pixel insertion (b) Uniform interpolation template
When pondering on how to make a bigger image out of a smaller image, there are
two basic approaches. The first one, which looks more obvious, assumes that there
are some pixels that we know for sure and are going to remain intact in the resulting
image and some other pixels that are to be “inserted” (Fig. 1.2a). In image interpo-
lation applications, this kind of idea was developed in a wide range of edge-directed
algorithms, starting with the famous NEDI algorithm, described in Li and Orchard
(2001) and more recent developments, like those by Zhang and Wu (2006), Giachetti
and Asuni (2011), Zhou et al. (2012), and Nasonov et al. (2016). However, simply
observing Fig. 1.2a, it can be seen that keeping some pixels intact in the resulting
image makes it impossible to deal with noisy signals. We can also expect the
interpolation quality to become non-uniform, which may be unappealing visually,
so the formation model from Fig. 1.2b should be more appropriate. In the interpo-
lation problem, researchers rarely consider the relation between large-resolution and
small-resolution images, but in SR formulation the main focus is on the image
formation model (light blue arrow in Fig. 1.2b). So, interpreting SR as an inverse problem to image formation has become a fruitful idea. From this point of view, the image formation model in Fig. 1.2a is a mere down-sampling operator, which is in
1 Super-Resolution: 1. Multi-Frame-Based Approach
Fig. 1.3 Image formation model (a) comprised of blur, down-sampling and additive noise; (b)
using information from multiple LR frames with subpixel shift to recover single HR frame
weak relation with the physical processes taking place in the camera, including at
least the blur induced by the optical system, down-sampling, and camera noise
(Fig. 1.3a), as described in Heide et al. (2013). Although the example in the drawing
is quite exaggerated, it emphasizes three simple yet important ideas:
1. If we want to make a real reconstruction, and not a mere guess, it would be very
useful to get more than one observed image (Fig. 1.3b). The number of frames
used for reconstruction should grow as a square of the down-sampling factor.
2. The more blurred the image, the higher the noise, and the larger the down-sampling factor, the harder it is to reconstruct the original image. There exist both theoretical estimates, like those presented in Baker and Kanade (2002) and Lin et al. (2008), and practical observations of when the solution of the SR reconstruction problem really makes sense.
3. If we know the blur kernel and noise parameters (and the accurate noise model),
the chances of successful reconstruction will increase.
The first idea is an immediate step towards multi-frame SR, which is the main
topic of this chapter. In this case, a threefold model (blur, down-sampling, noise)
becomes insufficient, and we need to consider an additional warp operator, which
describes a spatial transform applied to a high-resolution image before applying blur,
down-sampling, and noise. This operator is related to camera eigenmotions and
object motion. In some applications, the warp operator can be naturally derived from
the problem itself, e.g. in astronomy, the global motion of the sky sphere is known.
In sensor-shift SR, which is now being implemented not only in professional
products like Hasselblad cameras but also in many consumer level devices like
Olympus, Pentax, and some others, a set of sensor shifts is implemented in the
hardware, and these shifts are known by design. But in cases targeting consumer
cameras, estimation of the warp operator (or motion estimation) becomes a separate
problem. Most of the papers, such as Heide et al. (2014), consider only translational
models, while others turn to more complex parametric models, like Rochefort et al.
(2006), who assume globally affine motion, or Fazli and Fathi (2015) as well as
Kanaev and Miller (2013), who consider a motion model in the form of optical flow.
Thus, the multi-frame SR problem can be formulated as the reconstruction of a high-resolution image X from several observed low-resolution images Y_i, where the image formation model is described by Y_i = W_i X + η_i, ∀ i = 1, . . ., k, where W_i is the ith image formation operator and η_i is additive noise. Operators W_i can be composed out of warp M_i, blur G_i, and decimation (down-sampling) D for a single-channel SR problem:

W_i = D G_i M_i.
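As a concrete illustration, the forward model above can be simulated to generate synthetic LR observations; the following is a minimal sketch (not the authors' implementation — the function name and parameter values are our assumptions), with warp as a cyclic subpixel translation, blur as a Gaussian, and decimation by striding:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def observe(hr, dx, dy, sigma_blur, s, noise_sigma, rng):
    """Simulate one LR observation: Y_i = D G M_i X + eta_i."""
    warped = shift(hr, (dy, dx), mode="wrap")       # M_i: (subpixel) translation
    blurred = gaussian_filter(warped, sigma_blur)   # G: optical blur
    lr = blurred[::s, ::s]                          # D: decimation by factor s
    return lr + rng.normal(0.0, noise_sigma, lr.shape)  # additive noise eta_i

rng = np.random.default_rng(0)
hr = rng.random((64, 64))
# several observations with distinct subpixel shifts, as multi-frame SR requires
frames = [observe(hr, dx, dy, 1.0, 2, 0.01, rng)
          for dx, dy in [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]]
```

Each call yields one Y_i; a multi-frame method would then invert this stack jointly.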
In Bodduna and Weickert (2017), along with this popular physically motivated warp-blur-down-sample model, it was proposed to use a less popular yet more effective for practical purposes blur-warp-down-sample model, i.e. W_i = D M_i G_i.
In cases like those described in Park et al. (2008) or Gutierrez and Callico (2016),
when different systems are used to obtain different shots, the blur operators Gi can be
different for each observed frame, but using the same camera system is more
common, so it is enough to use a single blur matrix G. In any case, for spatially invariant warp and blur operators, blur and warp commute, so
there is no difference between these two approaches. The pre-blur (before warping)
model compared to the post-blur model also allows us to concentrate on finding GX
instead of X.
In Farsiu et al. (2004), a more detailed model containing two separate blur operators, responsible for camera blur G_cam and atmospheric blur G_atm, is considered:

W_i = D G_cam M_i G_atm.
Parameters a and b of the noise model (a signal-dependent variance of the form σ²(x) = a·x + b) depend on the camera model and shooting conditions, such as gain or ISO (which is usually provided in metadata) and the exposure time. Although there is abundant literature covering quite sophisticated
procedures of noise model parameter estimation, for researchers focused on the SR
problem per se, it is more efficient to refer to data provided by camera manufacturers,
like the NoiseProfile field described in the DNG specification (2012).
Camera 2 API of Android devices is also expected to provide this kind of
information. There also exist quite simple computational methods to estimate the
parameters of the noise model using multiple shots in a controlled environment.
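One such simple computational method can be sketched as follows (our own illustration, under the assumption of signal-dependent Gaussian noise with variance a·x + b; the helper name is hypothetical): given many shots of a static scene, fit a line to per-pixel mean/variance statistics.

```python
import numpy as np

def fit_noise_profile(shots):
    """Fit var ≈ a*mean + b across pixels of repeated shots of a static scene."""
    stack = np.stack(shots)                  # shape: (num_shots, H, W)
    mean = stack.mean(axis=0).ravel()
    var = stack.var(axis=0, ddof=1).ravel()
    a, b = np.polyfit(mean, var, 1)          # least-squares line fit
    return a, b

# synthetic sanity check with known parameters a = 0.01, b = 1e-4
rng = np.random.default_rng(1)
clean = rng.uniform(0.1, 0.9, (64, 64))
shots = [clean + rng.normal(0.0, np.sqrt(0.01 * clean + 1e-4))
         for _ in range(200)]
a, b = fit_noise_profile(shots)
```

With enough shots, the regression over all pixels recovers the generating parameters closely.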
Within the machine learning approach, the image formation model described
above is almost sufficient to generate artificial observations out of the “perfect” input
data to be used as the training dataset. Still, to obtain an even more realistic
formation model, some researchers like Brooks et al. (2019) consider also the colour
and brightness transformation taking place in the image signal processing (ISP) unit:
white balance, colour correction matrix, gamma compression, and tone mapping.
Some researchers like Segall et al. (2002), Ma et al. (2019), and Zhang and Sze
(2017) go even further and consider the reconstruction problem for compressed
video, but this approach is more related to the image and video compression field
rather than image reconstruction for consumer cameras.
P = argmin_P ∑_{u=1}^{t} F_data(S(P, Y_{u1}, ⋯, Y_{uk}) − X_u)

P = argmin_P ∑_{u=1}^{t} F_data(S(P, Y_u) − X_u).
Even in modern research like Zhang et al. (2018), the data term Fdata often
remains very simple, being just an L2 norm, PSNR (peak signal to noise ratio) or
SSIM (Structural Similarity Index). Although more sophisticated losses, like the
perceptual loss proposed in Johnson et al. (2016), are widely used in research, in
major super-resolution competitions like NTIRE, the outcomes of which were
discussed by Cai et al. (2019), the winners are still determined according to the
PSNR and SSIM, which are still considered to be the most objective quantitative
measures.
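For reference, PSNR, the most common of these quantitative measures, is straightforward to compute; a minimal sketch:

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio in dB between images with values in [0, peak]."""
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

# a uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB
ref = np.zeros((8, 8))
approx = np.full((8, 8), 0.1)
# psnr(ref, approx) → 20.0
```

SSIM is more involved (local means, variances, and covariances); library implementations are normally used.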
Note that representatives of the data-driven approach include not only computationally expensive CNN-based approaches but also advanced edge-driven methods like those covered in Sect. 1.1.2 and well-known single-frame SR algorithms like A+, described by Timofte et al. (2014), and RAISR, covered in Romano et al. (2017).
The data-agnostic approach makes minimum assumptions about output images
and solves an optimization problem for each particular input. Considering a three-
fold (blur, down-sampling, noise), fourfold (warp, blur, down-sampling, noise), or
extended fourfold (warp, blur, down-sampling, Bayer down-sampling, noise) image
formation model automatically suggests the SR problem formulation
X = argmin_X ∑_{i=1}^{k} F_data(W_i X − Y_i) + F_reg(X) − ∑_{i=1}^{k} log L_noise(W_i X − Y_i),
where Fdata is a data fidelity term, Freg is the regularization term responsible for data
smoothness, and Lnoise is the likelihood term for the noise model. When developing
an algorithm targeting a specific input device, it is reasonable to assume that we
know something about the noise model, possibly more than about the input data.
Also, when the noise model is Gaussian, the problem is sufficiently described by the
square data term alone, but a more complex noise model would require a separate
noise term. In MF SR, the regularization term is indispensable, because the input
data is often insufficient for unique reconstruction. Moreover, the problem conditioning depends on a “lucky” or “unhappy” combination of warp operators, and if these warps are close to identical, the condition number will be very large even for numerous available observations (Fig. 1.5).
The data fidelity term is usually chosen as the L1 or L2 norm, but most of the
research in SR is focused on the problem with a quadratic data fidelity term and total
variation (TV) regularization term. Different types of norms, regularization terms,
and corresponding solvers are described in detail by Heide et al. (2014). In the case
of a non-linear and particularly non-convex form of the regularization term, the only
way to find a solution is an iterative approach, which may be prohibitive for real-time
implementation.
8 X. Y. Petrova
Fig. 1.5 “Lucky” (observed points cover different locations) and “unhappy” (observed points are
all the same) data for super-resolution
For the quadratic (L2–L2) case, the minimizer satisfies the normal equations

Â X = W* Y,

where Â = W* W + λ² H* H, W = [W_1, . . ., W_k], Y = [Y_1, . . ., Y_k].
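On a toy 1D problem, this linear system can be assembled and solved directly; the following sketch uses illustrative sizes, kernels, and integer warps that are our own assumptions, not values from the chapter:

```python
import numpy as np

def circ(col):
    """Circulant matrix with the given first column."""
    n = len(col)
    return np.array([np.roll(col, k) for k in range(n)]).T

n, s, lam = 8, 2, 1e-3
blur = np.zeros(n); blur[[0, 1, -1]] = [0.6, 0.2, 0.2]
G = circ(blur)                                   # blur (circulant)
D = np.kron(np.eye(n // s), np.eye(s)[:1])       # decimation by factor s
H = circ(np.r_[[-2.0, 1.0], np.zeros(n - 3), [1.0]])  # Laplacian regularizer
P = np.roll(np.eye(n), 1, axis=1)                # cyclic shift (warp)

x_true = np.sin(2 * np.pi * np.arange(n) / n)
Ws = [D @ G @ np.linalg.matrix_power(P, t) for t in (0, 1)]  # W_i = D G M_i
Ys = [W @ x_true for W in Ws]                    # noise-free observations

A_hat = sum(W.T @ W for W in Ws) + lam**2 * (H.T @ H)
rhs = sum(W.T @ Y for W, Y in zip(Ws, Ys))
x_rec = np.linalg.solve(A_hat, rhs)              # solves A_hat x = W^T Y
```

Here the two decimated, shifted observations together cover all sample positions, so the regularized solution recovers x_true almost exactly; for realistic sizes, direct solves are replaced by the structured/FFT techniques discussed later in the chapter.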
It is important to mention that this type of problem can be treated by the very fast
shift and add approach covered in Farsiu et al. (2003) and Ismaeil et al. (2013),
which allows the reconstruction of the blurred high-resolution image using averag-
ing of shifted pixels from low-resolution frames. Unfortunately, this approach leaves
the filling of the remaining holes to the subsequent deblur sub-algorithm on an
irregular grid. This means that the main computational burden is being transferred to
the next pipeline stage and remains quite demanding.
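The shift-and-add fusion step itself can be sketched as follows (a simplified illustration with integer HR-grid shifts; the function name is ours, and unobserved positions are left as holes, here marked NaN):

```python
import numpy as np

def shift_and_add(frames, shifts, s):
    """Average registered LR pixels onto the HR grid; holes stay NaN."""
    h, w = frames[0].shape
    acc = np.zeros((h * s, w * s))
    cnt = np.zeros((h * s, w * s))
    for frame, (dx, dy) in zip(frames, shifts):
        acc[dy::s, dx::s] += frame       # place LR pixels at their HR positions
        cnt[dy::s, dx::s] += 1
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), np.nan)

# two 4x4 frames with shifts (0,0) and (1,1) fill half of an 8x8 HR grid
hr_est = shift_and_add([np.ones((4, 4)), 2 * np.ones((4, 4))],
                       [(0, 0), (1, 1)], 2)
```

The remaining NaN positions are exactly the holes that the subsequent deblur/inpainting stage must fill.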
As an image formation model, we are going to use W_i = D G M_i and W_i = B D G M_i.
It is possible to make this problem even narrower and assume each warp Mi and blur
G as being space-invariant. This limitation is quite reasonable when processing a
small image patch, as has been stated in Robinson et al. (2009). In this case, Â is
known to be reducible to the block diagonal form. This fact is intensively exploited
in papers on fast super-resolution by Robinson et al. (2009, 2010), Sroubek et al.
(2011), and Zhao et al. (2016). Similar results for the image formation model with
Bayer down-sampling were presented by Petrova et al. (2017) and Glazistov and
Petrova (2018).
The warp operators M_i are assumed to be already estimated with sufficient accuracy. Besides, we assume that the motion is a subpixel circular translation, which is a reasonable assumption within a small image block.
We consider a simple Gaussian noise model with the same sigma value for all the observed frames, η_i = η. Within the L2–L2 formulation, the Gaussian noise model means that minimising the data fidelity term also minimizes the noise term, so we can consider a significantly simplified problem, i.e.

X = argmin_X ∑_{i=1}^{k} F_data(W_i X − Y_i) + λ² (HX)* (HX).
The apparatus of structured matrices fits well the linear optimization problems
arising in image processing (and also as a linear inner loop inside non-linear
algorithms). The mathematical foundations of this approach were described in
P^{−1} = P^T,  Q^{−1} = Q^T.
F_n* F_n = F_n F_n* = n I_n.
∀ A ∈ ℂ_n ⇒ A = (P^u)^T A P^u.
The class of circulant matrices of size n × n is denoted by ℂ_n; so, we can write A ∈ ℂ_n. A circulant matrix is defined by a single row (or column) a = [a_1, a_2, . . ., a_n]. It can be transformed to diagonal form by the Fourier transform:

∀ A ∈ ℂ_n ⇒ A = (1/n) F_n* Λ_n F_n.
All circulant matrices of the same size commute. Many matrices used below are
circulant ones, i.e. matrices corresponding to one-dimensional convolution with
cyclic boundary conditions.
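This diagonalization is easy to verify numerically; a small sketch (the `circulant` helper is ours, and the eigenvalues come out as the DFT of the defining vector):

```python
import numpy as np

def circulant(col):
    """Circulant matrix whose first column is `col`."""
    n = len(col)
    return np.array([np.roll(col, k) for k in range(n)]).T

n = 8
a = np.arange(1.0, n + 1)
A = circulant(a)
F = np.fft.fft(np.eye(n), axis=0)        # unnormalized DFT matrix
D = F @ A @ np.linalg.inv(F)             # should be diagonal
eigs = np.diag(D)                        # equals np.fft.fft(a)
```

This is the fact that makes circulant systems solvable in O(n log n) with the FFT instead of O(n³).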
Since we are going to deal with two or more dimensions, the Kronecker product ⊗ becomes an important tool. Properties of the Kronecker product that may be useful for further derivations are summarized in Zhang and Ding (2013) and several other educational mathematical papers.

An operator that down-samples a vector of length n by the factor s can be written as D_s = I_{n/s} ⊗ e_{1,s}^T, where e_{1,s}^T is the first row of the identity matrix I_s. Suppose a two-dimensional n × n array is given:
         ⎡ x_11  x_12  …  x_1n ⎤
X_matr = ⎢ x_21  x_22  …  x_2n ⎥.
         ⎢  ⋮                  ⎥
         ⎣ x_n1  x_n2  …  x_nn ⎦
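The structure of D_s, and of its 2D counterpart D_s ⊗ D_s acting on the vectorized array, can be checked directly; a small sketch (row-major vectorization assumed, for which (A ⊗ B) vec(X) = vec(A X Bᵀ)):

```python
import numpy as np

n, s = 8, 2
e1 = np.eye(s)[:1]                     # e_{1,s}^T, the first row of I_s
Ds = np.kron(np.eye(n // s), e1)       # D_s = I_{n/s} ⊗ e_{1,s}^T

x = np.arange(n, dtype=float)
X = np.arange(n * n, dtype=float).reshape(n, n)
# 1D: Ds picks every s-th sample; 2D: (Ds ⊗ Ds) vec(X) = vec(X[::s, ::s])
y1 = Ds @ x
Y2 = (np.kron(Ds, Ds) @ X.ravel()).reshape(n // s, n // s)
```

In practice one never materializes these matrices; striding (`x[::s]`, `X[::s, ::s]`) realizes the same operators implicitly.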
(A ⊗ B)(C ⊗ D) = (A C) ⊗ (B D).
∀ A ∈ ℂ_n ⊗ ℂ_m ⇒ ∃ N_i ∈ ℂ_n, M_i ∈ ℂ_m, i = 1, . . ., r : A = ∑_{i=1}^{r} N_i ⊗ M_i.
∀ A ∈ ℂ_n ⊗ ℂ_m ⇒ A = (1/(mn)) (F_n* ⊗ F_m*) Λ (F_n ⊗ F_m),

where Λ = ∑_{i=1}^{r} Λ_i^N ⊗ Λ_i^M and Λ_i^N, Λ_i^M are diagonal matrices of eigenvalues of the matrices N_i and M_i.
Although BCCB matrices and their properties are extensively covered in the
literature, matrices arising from the SR problem (especially the Bayer case) are more
complicated, and this paper will borrow a more general concept of the matrix class
from Voevodin and Tyrtyshnikov (1987) to deal with them in a simple and unified
manner.
Definition 2 A matrix class is a linear subspace of square matrices. A matrix A with elements a_{i,j} : i, j = 1, . . ., n belongs to the matrix class described by numbers a_{ij}^{(q)}, q ∈ Q if it satisfies

∑_{i,j} a_{ij}^{(q)} a_{ij} = 0.
This definition is narrower than in the original work, which allows a non-zero
constant on the right-hand side and considers also rectangular matrices, but this
modification makes a definition more relevant to the problem under consideration.
We are interested in the ℂ_n (circulant), 𝒢_n (general, Q = ∅), and 𝒟_n (diagonal) classes of square matrices of size n × n.

The Kronecker product produces bi-level matrices of class 𝒜_1 ⊗ 𝒜_2 from matrices from classes 𝒜_1 and 𝒜_2: ∀ M_1 ∈ 𝒜_1, ∀ M_2 ∈ 𝒜_2 ⇒ M_1 ⊗ M_2 ∈ 𝒜_1 ⊗ 𝒜_2. Here, 𝒜_1 is called an outer class and 𝒜_2 an inner class. Saying A ∈ 𝒜_1 ⊗ 𝒜_2 simply means that A lies in the linear span of such Kronecker products. In the 1D SR problem, the matrix Â satisfies

Â = F_n* Λ_A F_n,

where Λ_A ∈ 𝒢_s ⊗ 𝒟_{n/s}.
In the 2D case, the warp, blur, and regularization operators become

M_i, G, H ∈ ℂ_n ⊗ ℂ_n,  D = D_{s,s} = D_s ⊗ D_s.

Such a matrix Â will satisfy Â = (F_n* ⊗ F_n*) Λ_A (F_n ⊗ F_n), where Λ_A ∈ 𝒢_s ⊗ 𝒟_{n/s} ⊗ 𝒢_s ⊗ 𝒟_{n/s}.
In the Bayer case, matrix Â can be expanded as

Â = ∑_{i=1}^{k} M̃_i* G̃* D̃* B* B D̃ G̃ M̃_i + λ² H̃* H̃,

where
D̃ = I_3 ⊗ D_{s,s},  G̃ = I_3 ⊗ G,  M̃_i = I_3 ⊗ M_i,

    ⎡ D_{2,2}          0                0               ⎤
B = ⎢ D_{2,2} P_{1,1}  0                0               ⎥,
    ⎢ 0                D_{2,2} P_{1,0}  0               ⎥
    ⎣ 0                0                D_{2,2} P_{0,1} ⎦

    ⎡ H_g     0       0       ⎤
    ⎢ 0       H_b     0       ⎥
H̃ = ⎢ 0       0       H_r     ⎥,
    ⎢ H_{c1}  H_{c1}  0       ⎥
    ⎢ H_{c2}  0       H_{c2}  ⎥
    ⎣ 0       H_{c3}  H_{c3}  ⎦

and P_{u,v} is a 2D cyclic shift by u columns and v rows. Submatrices from the expression above satisfy

M_i, G, H_r, H_g, H_b, H_{c1}, H_{c2}, H_{c3} ∈ ℂ_n ⊗ ℂ_n.
Bayer down-sampling operator B extracts and stacks channels G1, G2, R, and
B from the pattern in Fig. 1.6.
As proven in Glazistov and Petrova (2018), the matrix Â constructed as described above satisfies

Â = (I_3 ⊗ F_n* ⊗ F_n*) Λ_A (I_3 ⊗ F_n ⊗ F_n),

where Λ_A ∈ 𝒢_3 ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)} ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)}. After characterizing the matrix in terms of matrix class, it becomes possible to prescind from the original problem setting and focus on matrix class transformations.
In the papers relying on block diagonalization of BCCB matrices, like Sroubek et al. (2011), it is usually only noted that certain matrices can be transformed to block diagonal form, and no explicit transforms are provided, probably because it is hard to express the formula; but thanks to the apparatus of structured matrices, it becomes easy to obtain closed-form permutation matrices transforming from the classes 𝒢_s ⊗ 𝒟_{n/s}, 𝒢_s ⊗ 𝒟_{n/s} ⊗ 𝒢_s ⊗ 𝒟_{n/s}, and 𝒢_3 ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)} ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)} to the block diagonal forms 𝒟_{n/s} ⊗ 𝒢_s, 𝒟_{n²/s²} ⊗ 𝒢_{s²}, and 𝒟_{n²/(4s²)} ⊗ 𝒢_{12s²}, respectively.
∀ B ∈ 𝒜_1 ⊗ 𝒜_2 ⇒ (I_n ⊗ P_m^T) B (I_n ⊗ Q_m) ∈ 𝒜_1 ⊗ 𝒜_3.

The property of inner class preservation can be postulated for any matrix classes and for matrices P_n, Q_n providing ∀ A ∈ 𝒜_1 ⇒ P_n^T A Q_n ∈ 𝒜_3:

∀ B ∈ 𝒜_1 ⊗ 𝒜_2 : (P_n^T ⊗ I_m) B (Q_n ⊗ I_m) ∈ 𝒜_3 ⊗ 𝒜_2.

This means that each outer block is transformed from class 𝒜_1 to class 𝒜_3, while the inner class 𝒜_2 remains the same.
In the 2D SR problem, we can apply the class swapping operation twice and convert $\hat A$ to block diagonal form: for each $\hat A = (F_n \otimes F_n)^{*} \Lambda_A (F_n \otimes F_n)$, where $\Lambda_A \in (\mathcal{B}_s \otimes \mathcal{D}_{n/s}) \otimes (\mathcal{B}_s \otimes \mathcal{D}_{n/s})$, it holds that

$\bigl(I_{n/s} \otimes \Pi_{s,n}^{T}\bigr)\bigl(\Pi_{n/s,n}^{T} \otimes I_{s}\bigr)\,(F_n \otimes F_n)\, \hat A\, (F_n \otimes F_n)^{*}\, \bigl(\Pi_{n/s,n} \otimes I_{s}\bigr)\bigl(I_{n/s} \otimes \Pi_{s,n}\bigr) \in \mathcal{D}_{n^2/s^2} \otimes \mathcal{B}_{s^2}.$

Matrix $\hat A$ arising from the Bayer SR problem, satisfying $\hat A = (I_3 \otimes F_n \otimes F_n)^{*} \Lambda_A (I_3 \otimes F_n \otimes F_n)$, where $\Lambda_A \in \mathcal{B}_3 \otimes (\mathcal{B}_{2s} \otimes \mathcal{D}_{n/(2s)}) \otimes (\mathcal{B}_{2s} \otimes \mathcal{D}_{n/(2s)})$, can be transformed to the block diagonal form $\mathcal{D}_{n^2/(4s^2)} \otimes \mathcal{B}_{12s^2}$ in the same manner.
Table 1.1 summarizes the computational complexity of finding the matrix inverse (marked “MI”) for the 1D, 2D, and Bayer SR problems. Block diagonalization made it possible to reduce the complexity of the 2D and Bayer SR problems from $O(n^6)$ to $O(n^2 s^4) + O(n^2 \log n)$, where the $n^2 \log n$ term corresponds to the complexity of the block diagonalization process itself. Typically, n is much larger than s (e.g. n = 16, …, 32 and s = 2, …, 4), which provides significant economy.
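The source of the saving is that a block diagonal system can be inverted block by block. A minimal numerical sketch (ours, not from the book): inverting m blocks of size p costs $O(m p^3)$ instead of $O((mp)^3)$ for the assembled matrix, while giving the identical result:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 8, 4                          # m diagonal blocks of size p x p

# diagonally dominant blocks, so every block is invertible
blocks = [rng.standard_normal((p, p)) + p * np.eye(p) for _ in range(m)]

A = np.zeros((m * p, m * p))
for i, Bk in enumerate(blocks):
    A[i * p:(i + 1) * p, i * p:(i + 1) * p] = Bk

# exploiting the structure: invert each p x p block independently
A_inv = np.zeros_like(A)
for i, Bk in enumerate(blocks):
    A_inv[i * p:(i + 1) * p, i * p:(i + 1) * p] = np.linalg.inv(Bk)

print(np.allclose(A_inv, np.linalg.inv(A)))   # True
```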
Fig. 1.10 Dependency of the proportion of energy of filter coefficients inside ε-vicinity of the
central element on the vicinity size
the central element, averaged for all filters computed for three input frames with quarter-pixel motion quantization.
The filters are extracted as shown in Fig. 1.9 during the off-line stage (Fig. 1.11)
and applied in the online stage (Fig. 1.12). These images seem self-explanatory, but
an additional description can be found in Petrova et al. (2017).
1 Super-Resolution: 1. Multi-Frame-Based Approach 19
We will show that by taking into account the symmetries intrinsic to this problem
and implementing a smart filter selection scheme, this number can be dramatically
reduced. Strict proofs were provided in Glazistov and Petrova (2018), while here
only the main results will be listed.
Let’s introduce the following transforms:

$\phi_3(B) = (I_3 \otimes U_n \otimes I_n)\, B\, (I_3 \otimes U_n \otimes I_n),$
$\phi_4(B) = (I_3 \otimes I_n \otimes U_n)\, B\, (I_3 \otimes I_n \otimes U_n),$
$\phi_5(B) = (I_3 \otimes \Pi_{n,n}^{T})\, B\, (I_3 \otimes \Pi_{n,n}),$

where $U_n = \begin{pmatrix} 0 & \cdots & 0 & 1 \\ 0 & \cdots & 1 & 0 \\ \vdots & & & \vdots \\ 1 & \cdots & 0 & 0 \end{pmatrix}$ (a permutation matrix of size $n \times n$ that flips the input vector), $J = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$, and $P_{x,y}$ is a 2D circular shift operator, where x is the horizontal shift and y is the vertical shift. Then the number of stored filters can be reduced using the following properties:
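As a 1D analogue of the flip used in $\phi_3$ and $\phi_4$ (our own illustration, not code from the book), conjugating a circulant convolution matrix by $U_n$ is equivalent to index-reversing its kernel, which is why negated motions can reuse flipped filters:

```python
import numpy as np

n = 5
U = np.eye(n)[::-1]                    # anti-diagonal flip permutation U_n
x = np.arange(n, dtype=float)
print(np.array_equal(U @ x, x[::-1]))  # True: U_n reverses a vector

# conjugating a circulant (convolution) matrix by U_n flips its kernel
c = np.array([1., 2., 3., 0., 0.])
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)
flipped = (U @ C @ U)[:, 0]            # kernel of U C U
print(np.array_equal(flipped, np.roll(c[::-1], 1)))   # True
```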
$\hat A(-u_1, -v_1, \cdots, -u_k, -v_k) = \phi_1\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(u_1 + x, v_1 + y, \cdots, u_k + x, v_k + y) = \phi_2\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(-u_1 - 1, v_1, \cdots, -u_k - 1, v_k) = \phi_3\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(u_1, -v_1 - 1, \cdots, u_k, -v_k - 1) = \phi_4\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(v_1 + s, u_1 + s, \cdots, v_k + s, u_k + s) = \phi_5\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr).$
We can also use the same filters for different permutations of input frames: if σ(i) is any permutation of the indices i = 1, …, k, then

$\hat A(u_1, v_1, \cdots, u_k, v_k) = \hat A\bigl(u_{\sigma(1)}, v_{\sigma(1)}, \cdots, u_{\sigma(k)}, v_{\sigma(k)}\bigr).$
Adding 2s to one of the $u_i$’s or $v_i$’s also does not change the problem:

$\hat A(u_1, v_1, \cdots, u_k, v_k) = \hat A(u_1, v_1, \cdots, u_{i-1}, v_{i-1}, u_i + 2s, v_i, u_{i+1}, v_{i+1}, \cdots, u_k, v_k),$
$\hat A(u_1, v_1, \cdots, u_k, v_k) = \hat A(u_1, v_1, \cdots, u_{i-1}, v_{i-1}, u_i, v_i + 2s, u_{i+1}, v_{i+1}, \cdots, u_k, v_k).$
For some motions $u_1, v_1, \cdots, u_k, v_k$, certain non-trivial compositions of the transforms $\phi_1, \ldots, \phi_5$ keep the system invariant:

$\hat A = \phi_{i_1}\bigl(\phi_{i_2}\bigl(\ldots \phi_{i_m}(\hat A)\ldots\bigr)\bigr),$
which makes it possible to express some rows of $\hat A$ through elements of other rows. This is an additional resource for filter-bank compression. Applying exhaustive search and using the rules listed above, for filter size 16 × 16 and k = 3, we have
of stored values and the number of problems to be solved during the off-line stage
were both reduced. The compression approach increased the complexity of the
online stage to a certain extent, but the proposed compression scheme allows a
straightforward software implementation based on the index table, which stores
appropriate base filters and a list of transforms, encoded in 5 bits, for each possible
set of quantized displacements.
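Such an index table can be sketched as follows. Everything here is hypothetical (the transform codes, filter contents, and table entries are made up for illustration); it only shows the mechanism of storing one base filter plus a transform code per quantized displacement:

```python
import numpy as np

# toy transforms acting on a 2D filter kernel; a real code fits in 5 bits
TRANSFORMS = {
    0: lambda f: f,           # identity
    1: lambda f: f[::-1, :],  # vertical flip   (phi_3-like)
    2: lambda f: f[:, ::-1],  # horizontal flip (phi_4-like)
    3: lambda f: f.T,         # transpose       (phi_5-like)
}

# index table: quantized displacement tuple -> (base filter id, transform code)
base_filters = {0: np.arange(16.0).reshape(4, 4)}
index_table = {
    (0, 1):  (0, 0),
    (0, -1): (0, 1),  # hypothetical: negated motion reuses the flipped filter
}

def fetch_filter(displacement):
    fid, code = index_table[displacement]
    return TRANSFORMS[code](base_filters[fid])

print(np.array_equal(fetch_filter((0, -1)), base_filters[0][::-1, :]))  # True
```

Only the base filters are stored; all other filters are materialized on the fly from the table.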
The apparatus of multilevel matrices can be similarly applied to deblurring, multi-
frame deblurring, demosaicing, multi-frame demosaicing, or de-interlacing prob-
lems in order to obtain fast FFT-based algorithms similar to those described in Sect.
1.2.2 and to analyse problem symmetries, as was shown in this section for the Bayer
SR problem.
We have developed a high-quality multi-frame joint demosaicing and SR (Bayer SR)
solution which does not use iterations and has linear complexity. A visual
Fig. 1.13 Sample quality on real images: top row: demosaicing from Hirakawa and Parks (2005) with subsequent bicubic interpolation; bottom row: Bayer SR
Fig. 1.14 Comparison of RGB and Bayer SR: left side: demosaicing from Malvar et al. (2004) with post-processing using RGB SR; right side: Bayer SR
comparison with the traditional approach is shown in Fig. 1.13. It can also be seen
that direct reconstruction from the Bayer domain is visually more pleasing compared
to subsequent demosaicing and single-channel SR, as shown in Fig. 1.14. This is the
only case when we used for benchmarking a demosaicing algorithm from Malvar
et al. (2004), because its design purpose was to minimize colour artefacts, which
would be a desirable property for the considered example. In all other cases, we
prefer the approach suggested by Hirakawa and Parks (2005), which provides a
higher PSNR and more natural-looking results.
In Fig. 1.15, we perform a visual comparison with an implementation of an
algorithm from Heide et al. (2014), which shows that careful choice of the linear
cross-channel regularization term can result in a more visually pleasing image than a
non-linear term.
We performed a numeric evaluation of the SR algorithm quality on synthetic
images in order to concentrate on the core algorithm performance without consid-
ering issues of accuracy of motion estimation. We used a test image shown in
Fig. 1.16, which contains several challenging areas (from the point of view of
demosaicing algorithms). Since the reconstruction quality depends on the displace-
ment between low-resolution frames (worst corner case: all the images with the same
displacement), we conducted a statistical experiment with randomly generated
motions. Numeric measurements for several experiment conditions are charted in
Fig. 1.17. Measurements are made separately for each channel. Experiments with
reconstruction from two, three, and four frames were made. Four different
Fig. 1.15 Sample results of joint demosaicing and SR on rendered sample with known translational
motion: (a) ground truth; (b) demosaicing from Hirakawa and Parks (2005) + bicubic interpolation;
(c) demosaicing from Hirakawa and Parks (2005) + RGB SR; (d) Bayer SR, smaller regularization
term; (e) Bayer SR, bigger regularization term; (f) our implementation of Bayer SR from Heide et al.
(2014) with cross-channel regularization term from Heide et al. (2013)
Table 1.3 Impact of MV rounding and RGB/Bayer SR for 4× magnification, red channel (PSNR, dB). Columns 2, 3, and 4 give the number of frames used for reconstruction.

MV rounding | Domain | Demosaicing method        | Configuration  | 2    | 3    | 4
No          | RGB    | Malvar et al. (2004)      | 4× SR          | 23.5 | 23.7 | 23.9
Yes         | RGB    | Malvar et al. (2004)      | 4× SR          | 23.4 | 23.6 | 23.7
No          | RGB    | Hirakawa and Parks (2005) | 4× SR          | 24.3 | 24.5 | 24.6
Yes         | RGB    | Hirakawa and Parks (2005) | 4× SR          | 24.0 | 24.3 | 24.3
No          | Bayer  | N/A                       | 2× SR + 2× ↑   | 24.6 | 25.2 | 25.6
Yes         | Bayer  | N/A                       | 2× SR + 2× ↑   | 23.3 | 23.4 | 23.5
No          | Bayer  | N/A                       | 4× SR          | 24.9 | 25.7 | 26.3
Yes         | Bayer  | N/A                       | 4× SR          | 24.5 | 25.2 | 25.6
No          | RGB    | Malvar et al. (2004)      | 2× SR + 2× ↑   | 23.5 | 23.7 | 23.9
Yes         | RGB    | Malvar et al. (2004)      | 2× SR + 2× ↑   | 22.8 | 22.9 | 23.0
No          | RGB    | Hirakawa and Parks (2005) | 2× SR + 2× ↑   | 24.1 | 24.2 | 24.5
Yes         | RGB    | Hirakawa and Parks (2005) | 2× SR + 2× ↑   | 23.4 | 23.5 | 23.5
N/A         | RGB    | Hirakawa and Parks (2005) | 4× ↑ (single-frame baseline) | 23.0
(2004). Increasing the number of frames from 2 to 4 caused a 0.5 dB increase in the
RGB SR set-up and a 1.1–1.4 dB increase in the Bayer SR set-up. As expected, the
Bayer SR showed a superior performance, with 26.3 dB on four frames without
rounding of the motion vectors and 25.6 dB with rounded motion vectors. MV
rounding caused a quality drop by 0.1–0.3 dB for RGB SR and a quality drop by
0.4–0.7 dB for Bayer SR. The configuration with subsequent SR and up-sampling
behaves well enough without MV rounding but in the case of rounding can be even
inferior to the baseline.
Although there is clear evidence of weight decay, we had to evaluate the real impact on the quality of the algorithm output caused by filter truncation. Also, since the results of 4× SR with subsequent 2× downscaling were visually more pleasing than plain 2× SR, we evaluated these configurations numerically. The results are shown in Table 1.4. The bottom line shows the baseline with Hirakawa and Parks (2005) demosaicing followed by 2× bicubic up-sampling, providing 27.95 dB reconstruction quality. We can see that even for the simplest setting for 2× magnification, the difference in PSNR from the baseline is 1.38 dB. Increasing the number of frames from two to four allows us to increase the quality by about 0.9 dB (for 2× SR) to 1.1 dB (4× SR + down-sampling) compared to the corresponding two-frame configuration. We can also see that for each number of observed frames, the 4× SR + down-sampling is about 1.6–1.8 dB better than the corresponding plain 2× SR. The influence of the reduced kernel size (from 16 to 12) is almost negligible and never exceeds 0.15 dB. Finally, in the four-frame set-up, we can see a PSNR increase of 4.17 dB compared to the baseline.
Table 1.4 Impact of the kernel size and comparison of the 4× SR + 2× ↓ and 2× SR configurations (PSNR, dB). Columns 2, 3, and 4 give the number of frames used for reconstruction.

Kernel size | Configuration | 2     | 3     | 4
16 × 16     | 4× SR + 2× ↓  | 30.93 | 31.75 | 32.12
16 × 16     | 2× SR         | 29.34 | 30.19 | 30.25
14 × 14     | 4× SR + 2× ↓  | 30.93 | 31.75 | 32.12
14 × 14     | 2× SR         | 29.34 | 30.04 | 30.26
12 × 12     | 4× SR + 2× ↓  | 30.93 | 31.72 | 32.12
12 × 12     | 2× SR         | 29.33 | 30.05 | 30.25
N/A         | Hirakawa and Parks (2005) + 2× ↑ (single-frame baseline) | 27.95
$L = \frac{1}{M_{rgb}\sqrt{3}}(R + G + B), \qquad S = \frac{1}{M_{rgb}\sqrt{2}}(R - B), \qquad T = \frac{1}{M_{rgb}\sqrt{6}}(R - 2G + B),$

where $M_{rgb}$ is the maximum of the colour channels. These formulae assume normalized input (i.e. values between 0 and 1). Let us denote the reference frame in LST space as $f_{ref}$ and a compensated frame in LST space as $f_k$. Then two difference sub-metrics were computed: $d_1 = 1 - \bigl((f_{ref} - f_k) \ast G\bigr)^{\gamma}$, where $\ast$ is a convolution operator and G is a Gaussian filter, and $d_2 = \max(0, \mathrm{SSIM}(f_{ref}, f_k))$. The final reliability map was computed as a threshold transform over $\frac{d_1 d_2}{d_1 + d_2}$ with subsequent nearest neighbour
up-sampling to full size. In pixels where motion was detected to be unreliable, a
reduced number of frames were used for reconstruction. In Fig. 1.18, the special
filter-bank for processing areas which use a reduced number of frames (from 1 to k − 1) is denoted as “fallback”. In order to obtain the final image out of pixels
obtained using anisotropic, directional, and partial (fallback) reconstruction filters, a
special blending sub-algorithm was implemented.
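The LST transform above is straightforward to implement. The sketch below is our own illustration of the formulas; note that the book only states that $M_{rgb}$ is “the maximum of colour channels”, so taking it as the global maximum is an assumption (a per-pixel maximum would also fit that description):

```python
import numpy as np

def lst(rgb):
    """LST transform of a normalized RGB image (H x W x 3).

    M_rgb is taken here as the global maximum over the colour channels;
    the exact normalization intended by the text may differ (assumption).
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    m = rgb.max()
    L = (R + G + B) / (m * np.sqrt(3.0))
    S = (R - B) / (m * np.sqrt(2.0))
    T = (R - 2.0 * G + B) / (m * np.sqrt(6.0))
    return L, S, T

grey = np.full((2, 2, 3), 0.5)   # achromatic input: both chroma axes vanish
L, S, T = lst(grey)
print(np.allclose(S, 0.0), np.allclose(T, 0.0), np.allclose(L, np.sqrt(3.0)))
```

For achromatic input the chroma components S and T are exactly zero, which is the decorrelation property the metric relies on.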
Motion estimation in the Bayer domain is an interesting problem deserving
further description, which will be provided in Sect. 1.3.3.
The main goal of post-processing is to reduce the colour artefacts. Unfortunately,
a cross-channel regularizer that is strong enough to suppress colour artefacts pro-
duces undesirable blur as a side effect. In order to apply a lighter cross-channel
regularization, an additional colour artefact suppression block was implemented. It
converts an output image to YUV space, computes the map of local standard
deviations in the Y channel, smooths it using Gaussian filtering, and uses it as the
reference channel for cross-bilateral filtering of channels U and V. Then the image
with the original values in the Y channel and filtered values in the U and V channels
is transferred back to RGB.
Since the reconstruction model described above considers only Gaussian noise, it
makes sense to implement a separate and more elaborate noise reduction block using
accurate information from metadata and the camera noise profile.
In the case of higher noise levels, it is possible to use reconstruction filters
computed for a higher degree of regularization, but in this case the effect of SR
processing shifts from revealing new details to noise reduction, which is a simpler
problem that can be solved without the filter-bank approach.
In order to achieve a good balance between noise reduction and detail recon-
struction, a salience map was applied to control the local strength of the noise
reduction. A detailed description of salience map computation along with a descrip-
tion of a Bayer structure tensor is provided in Sect. 1.3.2. A visual comparison of the
results obtained with and without salience-based local control of noise reduction is
shown in Fig. 1.20. It can be seen that such adaptation provides a better detail level in
textured areas and higher noise suppression in flat regions compared to the baseline
version.
The structure tensor is a traditional instrument for the estimation of local directionality. It is a matrix composed of local gradients of pixel values,

$T = \begin{pmatrix} \sum \nabla_x^2 & \sum \nabla_x \nabla_y \\ \sum \nabla_x \nabla_y & \sum \nabla_y^2 \end{pmatrix},$

and the presence of a local directional structure is detected by a threshold transform of the coherence

$c = \left( \frac{\lambda_+ - \lambda_-}{\lambda_+ + \lambda_-} \right)^2,$

which is computed from the larger and smaller eigenvalues of the structure tensor, $\lambda_+$ and $\lambda_-$, respectively. If the coherence is small, the pixel belongs to a low-textured area or to a highly textured area without a single preferred direction. If the coherence is above the threshold, the local direction is collinear to the eigenvector corresponding to the larger eigenvalue. For RGB images, the gradients are computed simply as $\nabla_x = I_{y,x+1} - I_{y,x-1}$ and $\nabla_y = I_{y+1,x} - I_{y-1,x}$, while Bayer input requires some modifications. The gradients were computed as $\nabla_x = \max(\nabla_x R, \nabla_x G, \nabla_x B)$ and $\nabla_y = \max(\nabla_y R, \nabla_y G, \nabla_y B)$, where the gradients in each channel were computed as shown in Table 1.5.
In order to apply texture direction estimation in the filter-bank structure, the angle of the texture direction was quantized into 16 levels (we checked configurations with 8 levels, which provided visibly inferior quality, and with 32 levels, which provided a minor improvement over 16 levels but was more demanding in terms of required memory). An example of a direction map is shown in Fig. 1.21.
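The coherence computation can be written directly from the definitions above. This is a minimal sketch of our own (single-channel, whole-patch tensor instead of a windowed one): a strongly directional pattern yields coherence close to 1, while isotropic noise yields a small value:

```python
import numpy as np

def structure_tensor_coherence(I):
    """Coherence ((l+ - l-)/(l+ + l-))**2 of the structure tensor of patch I."""
    gx = I[1:-1, 2:] - I[1:-1, :-2]       # central differences, inner pixels
    gy = I[2:, 1:-1] - I[:-2, 1:-1]
    T = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    lm, lp = np.linalg.eigvalsh(T)        # ascending: smaller, larger
    return ((lp - lm) / (lp + lm)) ** 2 if lp + lm > 0 else 0.0

# vertical stripes: strongly directional -> coherence is exactly 1
x = np.arange(32)
stripes = np.tile(np.sin(0.7 * x), (32, 1))
print(structure_tensor_coherence(stripes) > 0.99)   # True

# white noise: no preferred direction -> low coherence
noise = np.random.default_rng(0).standard_normal((32, 32))
print(structure_tensor_coherence(noise) < 0.5)      # True
```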
The smaller eigenvalue $\lambda_-$ of the structure tensor was also used to compute the salience map. In each pixel location, the value of $\lambda_-$ was computed in some local window, and then a threshold transform and normalization were applied:

$r = \frac{\max(\min(\lambda_-, t_1), t_2) - t_2}{t_1 - t_2}.$

After that, the obtained map was smoothed by a Gaussian filter
Table 1.5 Computation of gradients for Bayer pattern (from Fig. 1.6)

Gradient | At R position | At B position
$\nabla_x R$ | $(I_{y,x+2} - I_{y,x-2})/2$ | $(I_{y+1,x+1} + I_{y-1,x+1} - I_{y+1,x-1} - I_{y-1,x-1})/2$
$\nabla_x G$ | $I_{y,x+1} - I_{y,x-1}$ (same at R and B) |
$\nabla_x B$ | $(I_{y+1,x+1} + I_{y-1,x+1} - I_{y+1,x-1} - I_{y-1,x-1})/2$ | $(I_{y,x+2} - I_{y,x-2})/2$
$\nabla_y R$ | $(I_{y+2,x} - I_{y-2,x})/2$ | $(I_{y+1,x+1} + I_{y+1,x-1} - I_{y-1,x+1} - I_{y-1,x-1})/2$
$\nabla_y G$ | $I_{y+1,x} - I_{y-1,x}$ (same at R and B) |
$\nabla_y B$ | $(I_{y+1,x+1} + I_{y+1,x-1} - I_{y-1,x+1} - I_{y-1,x-1})/2$ | $(I_{y+2,x} - I_{y-2,x})/2$

Gradient | At G1 position | At G2 position
$\nabla_x R$ | $(I_{y-1,x+2} + I_{y+1,x+2} - I_{y-1,x-2} - I_{y+1,x-2})/4$ | $I_{y,x+1} - I_{y,x-1}$
$\nabla_x G$ | $(I_{y,x+2} - I_{y,x-2} + I_{y+1,x+1} + I_{y-1,x+1} - I_{y+1,x-1} - I_{y-1,x-1})/4$ (same at G1 and G2) |
$\nabla_x B$ | $I_{y,x+1} - I_{y,x-1}$ | $(I_{y-1,x+2} + I_{y+1,x+2} - I_{y-1,x-2} - I_{y+1,x-2})/4$
$\nabla_y R$ | $I_{y+1,x} - I_{y-1,x}$ | $(I_{y-2,x+1} + I_{y-2,x-1} - I_{y+2,x+1} - I_{y+2,x-1})/4$
$\nabla_y G$ | $(I_{y+2,x} - I_{y-2,x} + I_{y+1,x+1} + I_{y+1,x-1} - I_{y-1,x+1} - I_{y-1,x-1})/4$ (same at G1 and G2) |
$\nabla_y B$ | $(I_{y+2,x+1} + I_{y+2,x-1} - I_{y-2,x+1} - I_{y-2,x-1})/4$ | $I_{y+1,x} - I_{y-1,x}$
subpixel displacements on the other hand. At the same time, it should have modest computational complexity. To fulfill these requirements, a multiscale architecture combining 3-Dimensional Recursive Search (3DRS) and Lucas–Kanade (LK) optical flow was implemented (Fig. 1.22). The use of a 3DRS algorithm for frame-rate conversion is also demonstrated by Pohl et al. (2018) and in Chap. 15.
Here, the LK algorithm was implemented with improvements described in Baker
and Matthews (2004). To estimate the largest scale displacement, a simplified 3DRS
implementation from Pohl et al. (2018) was applied to the ¼ scale of the Y channel.
Further motion was refined by conventional LK on ¼ and ½ resolution, and finally
one pass of a specially developed Bayer LK was applied. The single-channel Lucas–Kanade method relies on the local solution of the system $T^{T} T\,[u\ v]^{T} = T^{T} b$, where $T$ is computed similarly to the way it was done in Sect. 1.3.2, except that the Gaussian averaging window is applied to the gradient values. However, for this application the gradient values were obtained just from bilinear demosaicing of the original Bayer image. The chart of the algorithm is shown in Fig. 1.22.
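One local LK step can be sketched as follows (our own single-channel toy, not code from the book; no Gaussian window, plain whole-patch sums). It solves $T^T T\,[u\ v]^T = T^T b$ for a synthetic pair of frames with a known subpixel shift:

```python
import numpy as np

def lk_step(I0, I1):
    """One Lucas-Kanade step: least-squares solution of T^T T [u v]^T = T^T b."""
    gx = 0.5 * (I0[1:-1, 2:] - I0[1:-1, :-2])   # spatial gradients of I0
    gy = 0.5 * (I0[2:, 1:-1] - I0[:-2, 1:-1])
    b = (I0 - I1)[1:-1, 1:-1]                   # temporal difference
    T = np.stack([gx.ravel(), gy.ravel()], axis=1)
    return np.linalg.solve(T.T @ T, T.T @ b.ravel())

# synthetic pair: I1 is I0 shifted right by a subpixel amount
y, x = np.mgrid[0:32, 0:32].astype(float)
I0 = np.sin(0.3 * x) + np.cos(0.4 * y)
shift = 0.2
I1 = np.sin(0.3 * (x - shift)) + np.cos(0.4 * y)
u, v = lk_step(I0, I1)
print(abs(u - shift) < 0.05 and abs(v) < 0.05)  # True: recovers ~0.2 px
```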
References
Azzari, L., Foi, A.: Gaussian-Cauchy mixture modeling for robust signal-dependent noise estimation. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5357–5361 (2014)
Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal.
Mach. Intell. 24(9), 1167–1183 (2002)
Baker, S., Matthews, I.: Lucas-Kanade 20 years on: a unifying framework. Int. J. Comput. Vis.
56(3), 221–255 (2004)
Baker, S., Scharstein, D., Lewis, J., Roth S., Black, M.J., Szeliski, R.: A database and evaluation
methodology for optical flow. In: Proceedings of IEEE International Conference on Computer
Vision, pp. 1–8 (2007). https://doi.org/10.1007/s11263-010-0390-2
Benzi, M., Bini, D., Kressner, D., Munthe-Kaas, H., Van Loan, C.: Exploiting hidden structure in
matrix computations: algorithms and applications. In: Benzi, M., Simoncini, V. (eds.) Lecture
Notes in Mathematics, vol. 2173. Springer International Publishing, Cham (2016)
Bodduna, K., Weickert, J.: Evaluating data terms for variational multi-frame super-resolution. In:
Lauze, F., Dong, Y., Dahl, A.B. (eds.) Lecture Notes in Computer Science, vol. 10302,
pp. 590–601. Springer Nature Switzerland AG, Cham (2017)
Brooks, T., Mildenhall, B., Xue, T., Chen, J., Sharlet, D., Barron, J.-T.: Unprocessing images for
learned raw denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 11036–11045 (2019)
Cai, J., Gu, S., Timofte, R., Zhang, L., Liu, X., Ding, Y. et al.: NTIRE 2019 challenge on real image
super-resolution: methods and result. In: IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops, pp. 2211–2223 (2019)
Chen, C., Ren, Y., Kuo, C.-C.: Big Visual Data Analysis. Scene Classification and Geometric
Labeling, Springer Singapore, Singapore (2016)
Digital negative (DNG) specification, v.1.4.0 (2012)
Farsiu, S., Robinson, D., Elad, M., Milanfar, P.: Robust shift and add approach to superresolution.
In: Proceedings of IS&T International Symposium on Electronic Imaging, Applications of
Digital Image Processing XXVI, vol. 5203 (2003). https://doi.org/10.1117/12.507194.
Accessed on 02 Oct 2020
Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and robust multi-frame super resolution.
IEEE Trans. Image Process. 13(10), 1327–1344 (2004)
Fazli, S., Fathi, H.: Video image sequence super resolution using optical flow motion estimation.
Int. J. Adv. Stud. Comput. Sci. Eng. 4(8), 22–26 (2015)
Foi, A., Alenius, S., Katkovnik, V., Egiazarian, K.: Noise measurement for raw data of digital
imaging sensors by automatic segmentation of non-uniform targets. IEEE Sensors J. 7(10),
1456–1461 (2007)
Foi, A., Trimeche, M., Katkovnik, V., Egiazarian, K.: Practical Poissonian-Gaussian noise model-
ing and fitting for single-image raw-data. IEEE Trans. Image Process. 17(10), 1737–1754
(2008)
Giachetti, A., Asuni, N.: Real time artifact-free image interpolation. IEEE Trans. Image Process.
20(10), 2760–2768 (2011)
Glazistov, I., Petrova, X.: Structured matrices in super-resolution problems. In: Proceedings of the Sixth China-Russia Conference on Numerical Algebra with Applications. Session Report (2017)
Glazistov, I., Petrova, X.: Superfast joint demosaicing and super-resolution. In: Proceedings of
IS&T International Symposium on Electronic Imaging, Computational Imaging XVI,
pp. 2721–2728 (2018)
Gutierrez, E.Q., Callico, G.M.: Approach to super-resolution through the concept of multi-camera
imaging. In: Radhakrishnan, S. (ed.) Recent Advances in Image and Video Coding (2016).
https://www.intechopen.com/books/recent-advances-in-image-and-video-coding/approach-to-
super-resolution-through-the-concept-of-multicamera-imaging
Hansen, P.C., Nagy, J.G., O’Leary, D.P.: Deblurring Images: Matrices, Spectra, and Filtering.
Fundamentals of Algorithms, vol. 3. SIAM, Philadelphia (2006)
Heide, F., Rouf, M., Hullin, M.-B., Labitzke, B., Heidrich, W., Kolb, A.: High-quality computa-
tional imaging through simple lenses. ACM Trans. Graph. 32(5), Article No. 149 (2013)
Heide, F., Steinberger, M., Tsai, Y.-T., Rouf, M., Pajak, D., Reddy, D., Gallo, O., Liu, J., Heidrich,
W., Egiazarian, K., Kautz, J., Pulli, K.: FlexISP: a flexible camera image processing framework.
ACM Trans. Graph. 33(6), 1–13 (2014)
Hirakawa, K., Parks, W.-T.: Adaptive homogeneity-directed demosaicing algorithm. IEEE Trans.
Image Process. 14(3), 360–369 (2005)
Ismaeil, K.A., Aouada, D., Ottersten B., Mirbach, B.: Multi-frame super-resolution by enhanced
shift & add. In: Proceedings of 8th International Symposium on Image and Signal Processing
and Analysis, pp. 171–176 (2013)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution.
In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) European Conference on Computer Vision,
pp. 694–711. Springer International Publishing, Cham (2016)
Kanaev, A.A., Miller, C.W.: Multi-frame super-resolution algorithm for complex motion patterns.
Opt. Express. 21(17), 19850–19866 (2013)
Kuan, D.T., Sawchuk, A.A., Strand, T.C., Chavel, P.: Adaptive noise smoothing filter for images
with signal-dependent noise. IEEE Trans. Pattern Anal. Mach. Intell. 7(2), 165–177 (1985)
Li, X., Orchard, M.: New edge-directed interpolation. IEEE Trans. Image Process. 10(10),
1521–1527 (2001)
Lin, Z., He, J., Tang, X., Tang, C.-K.: Limits of learning-based superresolution algorithms.
Int. J. Comput. Vis. 80, 406–420 (2008)
Liu, X., Tanaka, M., Okutomi, M.: Estimation of signal dependent noise parameters from a single
image. In: Proceedings of the IEEE International Conference on Image Processing, pp. 79–82
(2013)
Liu, X., Tanaka, M., Okutomi, M.: Practical signal-dependent noise parameter estimation from a
single noisy image. IEEE Trans. Image Process. 23(10), 4361–4371 (2014)
Liu, J., Wu C.-H., Wang, Y., Xu Q., Zhou, Y., Huang, H., Wang, C., Cai, S., Ding, Y., Fan, H.,
Wang, J.: Learning raw image de-noising with Bayer pattern unification and Bayer preserving
augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, pp. 4321–4329 (2019)
Ma, D., Afonso, F., Zhang, M., Bull, A.-D.: Perceptually inspired super-resolution of compressed
videos. In: Proceedings of SPIE 11137. Applications of Digital Image Processing XLII, Paper
1113717 (2019)
Malvar, R., He, L.-W., Cutler, R.: High-quality linear interpolation for demosaicing of Bayer-patterned color images. In: International Conference on Acoustics, Speech and Signal Processing, vol. 34(11), pp. 2274–2282 (2004)
Mastronardi, N., Ng, M., Tyrtyshnikov, E.E.: Decay in functions of multiband matrices. SIAM
J. Matrix Anal. Appl. 31(5), 2721–2737 (2010)
Milanfar, P. (ed.): Super-Resolution Imaging. CRC Press (Taylor & Francis Group), Boca Raton
(2010)
Nasonov, A. Krylov, A., Petrova, X., Rychagov M.: Edge-directional interpolation algorithm using
structure tensor. In: Proceedings of IS&T International Symposium on Electronic Imaging.
Image Processing: Algorithms and Systems XIV, pp. 1–4 (2016)
Park, J.-H., Oh, H.-M., Moon, G.-K.: Multi-camera imaging system using super-resolution. In:
Proceedings of 23rd International Technical Conference on Circuits/Systems, Computers and
Communications, pp. 465–468 (2008)
Petrova, X., Glazistov, I., Zavalishin, S., Kurmanov, V., Lebedev, K., Molchanov, A., Shcherbinin,
A., Milyukov, G., Kurilin, I.: Non-iterative joint demosaicing and super-resolution framework.
In: Proceedings of IS&T International Symposium on Electronic Imaging, Computational
Imaging XV, pp. 156–162 (2017)
Pohl, P., Anisimovsky, V., Kovliga, I., Gruzdev, A., Arzumanyan, R.: Real-time 3DRS motion
estimation for frame-rate conversion. In: Proceedings of IS&T International Symposium on
Electronic Imaging, Applications of Digital Image Processing XXVI, pp. 3281–3285 (2018)
Pyatykh, S., Hesser, J.: Image sensor noise parameter estimation by variance stabilization and
normality assessment. IEEE Trans. Image Process. 23(9), 3990–3998 (2014)
Rakhshanfar, M., Amer, M.A.: Estimation of Gaussian, Poissonian-Gaussian, and processed visual
noise and its level function. IEEE Trans. Image Process. 25(9), 4172–4185 (2016)
Robinson, M.D., Farsiu, S., Milanfar, P.: Optimal registration of aliased images using variable
projection with applications to super-resolution. Comput. J. 52(1), 31–42 (2009)
Robinson, M.D., Toth, C.A., Lo, J.Y., Farsiu, S.: Efficient Fourier-wavelet super-resolution. IEEE
Trans. Image Process. 19(10), 2669–2681 (2010)
Rochefort, G., Champagnat, F., Le Besnerais, G., Giovannelli, G.-F.: An improved observation
model for super-resolution under affine motion. IEEE Trans. Image Process. 15(11), 3325–3337
(2006)
Romano, Y., Isidoro, J., Milanfar, P.: RAISR: rapid and accurate image super resolution. IEEE
Trans. Comput. Imaging. 3(1), 110–125 (2017)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algo-
rithms for Printing. Springer Nature Singapore AG, Singapore (2018)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Segall, C.A., Katsaggelos, A.K., Molina, R., Mateos, J.: Super-resolution from compressed video.
In: Chaudhuri, S. (ed.) Super-Resolution Imaging. The International Series in Engineering and
Computer Science Book Series, Springer, vol. 632, pp. 211–242 (2002)
Sroubek, F., Kamenick, J., Milanfar, P.: Superfast super-resolution. In: Proceedings of 18th IEEE
International Conference on Image Processing, pp. 1153–1156 (2011)
Sutour, C., Deledalle, C.-A., Aujol, J.-F.: Estimation of the noise level function based on a
non-parametric detection of homogeneous image regions. SIAM J. Imaging Sci. 8(4),
2622–2661 (2015)
Sutour, C., Aujol, J.-F., Deledalle, C.-A.: Automatic estimation of the noise level function for
adaptive blind denoising. In: Proceedings of 24th European Signal Processing Conference,
pp. 76–80 (2016)
Timofte, R., De Smet, V., Van Gool, L.: A+: adjusted anchored neighbourhood regression for fast
super-resolution. In: Asian Conference on Computer Vision, pp. 111–126 (2014)
Trench, W.: Properties of multilevel block α-circulants. Linear Algebra Appl. 431(10), 1833–1847
(2009)
Voevodin, V.V., Tyrtyshnikov, E.E.: Computational Processes with Toeplitz Matrices. Nauka, Moscow (1987) (in Russian). https://books.google.ru/books?id=pf3uAAAAMAAJ. Accessed on 02 Oct 2020
Zhang, H., Ding, F.: On the Kronecker products and their applications. J. Appl. Math. 2013, 296185
(2013)
Zhang, Z., Sze, V.: FAST: a framework to accelerate super-resolution processing on compressed
videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1015–1024 (2017)
Zhang, L., Wu, X.L.: An edge-guide image interpolation via directional filtering and data fusion.
IEEE Trans. Image Process. 15(8), 2226–2235 (2006)
Zhang, Y., Wang, G., Xu, J.: Parameter estimation of signal-dependent random noise in CMOS/
CCD image sensor based on numerical characteristic of mixed Poisson noise samples. Sensors.
18(7), 2276–2293 (2018)
Zhao, N., Wei, Q., Basarab, A., Dobigeon, N., Kouame, D., Tourneret, J.-Y.: Fast single image
super-resolution using a new analytical solution for ℓ2-ℓ2 problems. IEEE Trans. Image Process.
25(8), 3683–3697 (2016)
Zhou, D., Shen, X., Dong, W.: Image zooming using directional cubic convolution interpolation.
IET Image Process. 6(6), 627–634 (2012)
Chapter 2
Super-Resolution: 2. Machine
Learning-Based Approach
Alexey S. Chernyavskiy
A. S. Chernyavskiy (*)
Philips AI Research, Moscow, Russia
e-mail: alexey.chernyavskiy@philips.com

2.1 Introduction
The remainder of this chapter will focus on architectural designs of SISR CNNs and on various aspects that make SISR challenging.
The blocks that make up the CNNs for super-resolution do not differ much from
neural networks used in other image-related tasks, such as object or face recognition.
They usually consist of convolution blocks, with kernel sizes of 3 × 3, interleaved
with simple activation functions such as the ReLU (rectified linear unit). Modern
super-resolution CNNs incorporate the blocks that have been successfully used in
other vision tasks, e.g. various attention mechanisms, residual connections, dilated
convolutions, etc. In contrast with CNNs that are designed for image classification,
there are usually no pooling operations involved. The input low-resolution image is
processed by sets of filters which are specified by kernels with learnable weights.
These operations produce arrays of intermediate outputs called feature maps.
Non-linear activation functions are applied to the feature maps, after adding
learnable biases, in order to zero out some values and to accentuate others. After
passing through several stages of convolutions and non-linear activations, the image
is finally transformed into a high-resolution version by means of the deconvolution
operation, also called transposed convolution (Shi et al. 2016a). Another up-scaling
option is the sub-pixel convolution layer (Shi et al. 2016b), which is faster than the
deconvolution, but is known to generate checkerboard artefacts. During training,
each LR image patch is forward propagated through the neural network, a zoomed
image is generated by the sequence of convolution blocks and activation functions,
and this image is compared to the true HR patch. A loss function is computed, and
the gradients of the loss function with respect to the neural network parameters
(weights and biases) are back-propagated; therefore the network parameters are
updated. In most cases, the loss function is the L2 or L1 distance, but the choice of
a suitable measure for comparing the generated image and its ground truth counter-
part is a subject of active research.
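The sub-pixel convolution layer mentioned above ends with a deterministic rearrangement (often called pixel shuffle). A minimal sketch of that rearrangement in plain numpy (our own illustration of the layout used in Shi et al. 2016b):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel up-scaling: rearrange C*r^2 feature maps of size H x W
    into C maps of size rH x rW (depth-to-space)."""
    C, H, W = x.shape
    c = C // (r * r)
    x = x.reshape(c, r, r, H, W)      # split channel dim into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, H, r, W, r)
    return x.reshape(c, H * r, W * r)

x = np.arange(16.0).reshape(4, 2, 2)  # four 2x2 maps, r = 2 -> one 4x4 map
y = pixel_shuffle(x, 2)
print(y.shape)                        # (1, 4, 4)
```

Each output pixel at position (rh + i, rw + j) is taken from channel i·r + j at (h, w), so the convolution preceding this layer does all the learning and the up-scaling itself is just a memory reshuffle.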
For training a super-resolution model, one should create or obtain a training
dataset which consists of pairs of low-resolution images and their high-resolution
versions. Almost all the SISR neural networks are designed for one zoom factor
only, although, e.g. a 4× up-scaled image can in principle be obtained by passing
the output of a 2× zoom CNN through itself once again. In this way, the size ratio of
the HR and LR images used for training the neural network should correspond to the
desired zoom factor. Training is performed on square image patches cropped from
random locations in input images. The patch size should match the receptive field of
the CNN, which is the size of the neighbourhood that is involved in the computation
of a single pixel of the output. The receptive field of a CNN is usually related to the
typical kernel size and the depth (number of convolution layers).
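For a plain stack of stride-1 convolutions, the relation between depth, kernel size, and receptive field is a simple sum. A small helper (ours, stating the standard formula, not code from the book):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer grows the field by k - 1 pixels."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3] * 10))   # 21: ten 3x3 layers see a 21x21 patch
print(receptive_field([9, 1, 5]))  # 13: an SRCNN-like 9-1-5 stack
```

This is why training patches are chosen at least as large as the receptive field: smaller patches would never expose the network to the full neighbourhood it can use.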
In terms of architectural complexity, the progress of CNNs for the super-resolu-
tion task has closely followed the successes of image classification CNNs. However,
some characteristic features are inherent to the CNNs used in SISR and to the process
of training such neural networks. We will examine these peculiarities later. There are
several straightforward design choices that come into play when one decides to
create and train a basic SISR CNN.
First, there is the issue of early vs. late upsampling. In early upsampling, the
low-resolution image is up-scaled using a simple interpolation method (e.g. bicubic
or Lanczos), and this crude version of the HR image serves as input to the CNN,
which basically performs the deblurring. This approach has an obvious drawback:
the large number of operations and the large HR-sized intermediate feature
maps that need to be stored in memory. Hence, in most modern SISR CNNs, starting
from Dong et al. (2016b), the upsampling is delayed to the last stages. In this way,
most of the convolutions are performed over feature maps that have the same size as
the low-resolution image.
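With late upsampling, the final resolution increase is typically performed by a sub-pixel (pixel-shuffle) layer that rearranges an LR-sized feature map into an HR image. A NumPy sketch of the rearrangement itself:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into a (C, H*r, W*r) output."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

feat = np.random.rand(3 * 9, 16, 16)  # LR-sized features for a 3x zoom
print(pixel_shuffle(feat, 3).shape)   # (3, 48, 48)
```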
While in many early CNN designs the network learned to directly generate an HR
image and the loss function that was being minimized was computed as the norm of
the difference between the generated image and the ground truth HR, Zhang et al.
(2017) proposed to learn the residual image, i.e. the difference between the HR
image and the LR image up-scaled by a simple interpolation. This strategy proved
beneficial for SISR, denoising and JPEG image deblocking. An example of CNN
architecture with residual learning is shown in Fig. 2.1; a typical model output is
shown in Fig. 2.2 for a zoom factor equal to 3. Compared to standard interpolation,
such as bicubic or Lanczos, the output of a trained SISR CNN contains much sharper
details and shows practically no aliasing.
Fig. 2.1 A schematic illustration of a CNN for super-resolution with residual learning
Fig. 2.2 Results of image up-scaling by a factor of 3: left: result of applying bicubic interpolation;
right: HR image generated by a CNN
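Residual learning changes only the training target, not the architecture. A sketch of target construction, with nearest-neighbour upscaling standing in for the bicubic interpolation used in practice:

```python
import numpy as np

def upscale_nn(img, r):
    # Nearest-neighbour upscaling, a stand-in for bicubic interpolation.
    return np.repeat(np.repeat(img, r, axis=0), r, axis=1)

def residual_target(lr, hr, r):
    # What the network learns to predict: HR minus the interpolated LR.
    return hr - upscale_nn(lr, r)

lr = np.array([[1.0, 2.0], [3.0, 4.0]])
hr = np.random.rand(4, 4)
res = residual_target(lr, hr, 2)
# At inference time, the HR estimate is interpolation plus predicted residual.
restored = upscale_nn(lr, 2) + res
print(np.allclose(restored, hr))  # True
```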
38 A. S. Chernyavskiy
The success of deep convolutional neural networks is largely due to the availability
of training data. For the task of super-resolution, the neural networks are typically
trained on high-quality natural images, from which the low-resolution images are
generated by applying a specific predefined downscaling algorithm (bicubic,
Lanczos, etc.). The images belonging to the training set should ideally come from
the same domain as the one that will be encountered after training. Man-made
architectural structures possess very specific textural characteristics; if one wants to
up-scale satellite imagery or street views, the training set should contain many
images of buildings, as in the popular dataset Urban100 (Huang et al. 2015).
Representative images from Urban100 are shown in Fig. 2.3.
On the other hand, a more general training set would allow for greater flexibility
and better average image quality. A widely used training dataset, DIV2K (Agustsson
and Timofte 2017), contains 1000 images, each with at least 2000 pixels along one
of its axes, and features very diverse content, as illustrated in Fig. 2.4.
Fig. 2.5 Left: LR image; top-right: HR ground truth; bottom-right: image reconstructed by a CNN
from an LR image that was obtained by simple decimation without smoothing or interpolation.
(Reproduced with permission from Shocher et al. 2018)
Real-world LR images, e.g. those downloaded from the Web or taken by a
smartphone camera, as well as old historic images, contain many artefacts
coming from sensor noise, non-ideal PSF, aliasing, image compression, on-device
denoising, etc. It is obvious that in a real-life scenario, a low-resolution image is
produced by an optical system and, generally, is not created by applying any kind of
subsampling and interpolation. In this regard, the whole idea of training CNNs on
carefully engineered images obtained by using a known down-sampling function
may seem questionable.
It seems natural then to create a training dataset that would simulate the real
artefacts introduced into an image by a real imaging system. Another property of real
imaging systems is the intrinsic trade-off between resolution (R) and field of view
(FoV). When zooming out the optical lens in a DSLR camera, the obtained image
has a larger FoV but loses details on subjects; when zooming in the lens, the details
of subjects show up at the cost of a reduced FoV. This trade-off also applies to
cameras with fixed focal lenses (e.g. smartphones), when the shooting distance
changes. The loss of resolution that is due to enlarged FoV can be thought of as a
degradation model that can be reversed by training a CNN (Chen et al. 2019; Cai
et al. 2019a). In a training dataset created for this task, the HR image could come, for
example, from a high-quality DSLR camera, while the LR image could be obtained
by another camera that would have inferior optical characteristics, e.g. a cheap
digital camera with a lower image resolution, different focal distance, distortion
parameters, etc. Both cameras should be mounted on tripods in order to ensure the
closest similarity of the captured scenes. Still, due to the different focus and depth of
field, it would be impossible to align whole images (Fig. 2.6). The patches suitable
for training the CNN would have to be cropped from the central parts of the image
pair. Also, since getting good image quality is not a problem when up-scaling
low-frequency regions like the sky, care should be taken to only select informative
patches that contain high-frequency information, such as edges, corners, and spots.
These parts can be selected using classical computer vision feature detectors, like
SIFT, SURF, or FAST. A subsequent distortion compensation and correlation-based
alignment (registration) must be performed in order to obtain the most accurate
LR-HR pairs. Also, the colour map of the LR image, which can differ from that of
the HR image due to white balance, exposure time, etc., should be adjusted via
histogram matching using the HR image as reference (see Migukin et al. 2020).
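The histogram-matching step can be sketched as follows for a single channel, using empirical CDFs (a real pipeline would apply it per colour channel):

```python
import numpy as np

def match_histogram(src, ref):
    """Map the intensity distribution of `src` onto that of `ref`."""
    s_vals, s_idx, s_cnt = np.unique(src.ravel(), return_inverse=True,
                                     return_counts=True)
    r_vals, r_cnt = np.unique(ref.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_cnt) / src.size   # empirical CDF of the LR image
    r_cdf = np.cumsum(r_cnt) / ref.size   # empirical CDF of the HR reference
    mapped = np.interp(s_cdf, r_cdf, r_vals)
    return mapped[s_idx].reshape(src.shape)

src = np.array([[0.0, 0.0], [1.0, 1.0]])
ref = np.array([[10.0, 10.0], [20.0, 20.0]])
print(match_histogram(src, ref))
```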
During CNN training, the dataset can be augmented by adding randomly rotated
and mirrored versions of the training images. Many other useful types of image
manipulation are implemented in the Albumentations package (Buslaev et al. 2020).
Tensor reshuffling operations, such as those required for sub-pixel convolutions, can
be greatly facilitated by using the Einops package by Rogozhnikov (2018). Both packages are
available for Tensorflow and Pytorch, the two most popular deep learning
frameworks.
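The basic rotation and mirroring augmentation can be sketched in plain NumPy (Albumentations implements these and many more):

```python
import numpy as np

def dihedral_augment(patch):
    """Return all eight rotated/mirrored variants of a square training patch."""
    variants = []
    for k in range(4):
        rot = np.rot90(patch, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))
    return variants

patch = np.random.rand(8, 8)
print(len(dihedral_augment(patch)))  # 8
```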
Fig. 2.7 Example of images that were presented to human judgement during the development of
LPIPS metric. (Reproduced with permission from Zhang et al. 2018a)
A popular choice is the perceptual loss, computed as a distance between
features obtained from intermediate layers of VGG16, a CNN that was for
some time responsible for the highest accuracy in image classification on the
ImageNet dataset. SISR CNNs trained with this loss (Ledig et al. 2017, among
many others) have shown more visually pleasant results, without the blurring that is
characteristic of MSE.
Recently, there has been a surge in attempts to leverage the availability of big data
for capturing the visual preferences of users and simulating them using engineered metrics
relying on deep learning. Large-scale surveys have been conducted, in the course of
which users were given triplets of images representing the same scene but containing
a variety of conventional and CNN-generated degradations and asked whether the
first or the second image was “closer”, in whatever sense they could imagine, to the
third one (Fig. 2.7). Then, a CNN was trained to predict this perceptual judgement.
The Learned Perceptual Image Patch Similarity (LPIPS) metric by Zhang et al. (2018a)
and PieAPP by Prashnani et al. (2018) are two notable examples. These metrics generalize well
even for distortions that were not present during training. Overall, the development
of image quality metrics that would better correlate with human perception is a
subject of active research (Ding et al. 2020).
Table 2.1 FSRCNN architecture modified by adding depthwise separable convolutions and
residual connections

Layer name  | Comment                                    | Type                            | Filter size   | Output channels
Data        | Input data                                 | Y channel                       |               | 1
Upsample    | Upsample the input (bicubic interpolation) | Deconvolution                   |               | 1
Conv1       |                                            | Convolution, PReLU              | 5×5           | 32
Conv2       |                                            | Convolution, PReLU              | 1×1           | 32
BasicBlock1 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
BasicBlock2 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
BasicBlock3 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
BasicBlock4 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
Conv3       |                                            | Convolution, PReLU              | 1×1           | 32
Conv4       |                                            | Convolution                     | 1×1           | 32
Deconv      | Obtain the residual                        | Deconvolution                   | 9×9, stride 3 | 1
Result      | Deconv + Upsample                          | Summation                       |               | 1
Table 2.2 Structure of the basic block used in the modified FSRCNN architecture of Table 2.1

Layer name | Type                                                | Filter size | Output channels
Conv3×3    | Depthwise separable convolution                     | 3×3         | 32
Conv1×1    | Convolution, ReLU                                   | 1×1         | 32
Sum        | Summation of the block input and the Conv1×1 output |             | 32
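The motivation for the depthwise separable convolutions in these blocks is the reduction in parameter count, which can be estimated with a quick calculation (biases ignored):

```python
def conv_params(c_in, c_out, k):
    # Weights in a standard k x k convolution.
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    # Depthwise k x k convolution (one filter per input channel)
    # followed by a 1 x 1 pointwise convolution.
    return c_in * k * k + c_in * c_out

# 32 -> 32 channels with 3x3 kernels, as in the blocks of Table 2.2.
print(conv_params(32, 32, 3))          # 9216
print(dw_separable_params(32, 32, 3))  # 1312
```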
The network is first trained for 3× zooming. Then, we keep all the layers' parameters
frozen (by setting the learning rate to zero for these layers) and replace the final
deconvolution layer with a layer that performs a 4× zoom. In this way, after an image
is processed by the main body of the CNN, the output of the last layer before the
deconvolution is fed into the layer specifically responsible for the desired zoom
factor. The latency of the first zoom operation is therefore high, but zooming the
same image by a different factor is much faster than the first time.
Our final CNN achieves 33.25 dB on the Set5 dataset with 10K parameters,
compared to FSRCNN with 33.06 dB and 12K parameters. To make full use of the
GPU available on recent mobile phones, we ported our super-resolution CNN to a
Samsung Galaxy S9 mobile device using the Qualcomm Snapdragon Neural Processing
Engine (SNPE) SDK. We achieved 1.6 FPS for 4× up-scaling of 1024×1024 images.
This figure does not include the CPU-to-GPU data transfers and the RGB-to-YCbCr
transformations, which can take about 1 second in total. The results of super-
resolution using this CNN are shown in Table 2.2 and Fig. 2.8.
Fig. 2.8 Super-resolution on a mobile device: left column: bicubic interpolation; right column:
modified FSRCNN optimized for Samsung Galaxy S9
Several challenges on single image super-resolution have been organized since 2017.
They aim to bridge the gap between academic research and real-life applications
of single image super-resolution. The first NTIRE (New Trends in Image Restoration
and Enhancement) challenge featured two tracks. In Track 1 bicubic interpolation
was used for creating the LR images. In Track 2, all that was known was that the LR
images were produced by convolving the HR image with some unknown kernel. In
both tracks, the HR images were downscaled by factors of 2, 3, and 4, and only blur
and decimation were used for this, without adding any noise. The DIV2K image
dataset was proposed for training and validation of algorithms (Agustsson and
Timofte 2017).
The competition attracted many teams from academia and industry, and many
new ideas were demonstrated. Generally, although the PSNR figures for all the
algorithms were worse on images from Track 2 than on those coming from Track
1, there was a strong positive correlation between the success of the method in both
tracks. The NTIRE competition became a yearly event, and the tasks to solve became
more and more challenging. It now features more tracks, many of them related to
image denoising, dehazing, etc. With regard to SR, NTIRE 2018 already featured
four tracks, the first one being the same as Track 1 from 2017, while the remaining
three added unknown image artefacts that emulated the various degradation factors
Fig. 2.9 Perception-distortion plane used for SR algorithm assessment in the PIRM challenge
present in the real image acquisition process from a digital camera. In 2019, RealSR,
a new dataset captured by a high-end DSLR camera, was introduced by Cai
et al. (2019b). For this dataset, HR and LR images of the same scenes were acquired
by the same camera by changing its focal length. In 2020, the “extreme” 16× track
was added. Along with PSNR and SSIM values, the contestants were ranked based
on the mean opinion score (MOS) computed in a user study.
The PIRM (Perceptual Image Restoration and Manipulation) challenge that was
first held in 2018 was the first to really focus on perceptual quality. The organizers
used an evaluation scheme based on the perception-distortion plane. The perception-
distortion plane was divided into three regions by setting thresholds on the RMSE
values (Fig. 2.9). In each region, the goal was to obtain the best mean perceptual
quality. For each participant, the perception index (PI) was computed as a combi-
nation of the no-reference image quality measures of Ma et al. (2017) and NIQE
(Mittal et al. 2013), a lower PI indicating better perceptual quality. The PI demon-
strated a correlation of 0.83 with the mean opinion score.
Another similar challenge, AIM (Advances in Image Manipulation), was first
held in 2019. It focuses on the efficiency of SR. In the constrained SR challenge, the
participants were asked to develop neural network designs or solutions with either
the lowest number of parameters, or the lowest inference time on a common GPU, or
the best PSNR, while being constrained to maintain or improve over a variant of
SRResNet (Ledig et al. 2017) in terms of the other two criteria.
In 2020, both NTIRE and AIM introduced the Real-World Super-Resolution
(RWSR) sub-challenges, in which no LR-HR pairs are ever provided. In the Same
Domain RWSR track, the aim is to learn a model capable of super-resolving images
in the source set, while preserving low-level image characteristics of the input source
domain. Only the source (input) images are provided for training, without any HR
ground truth. In the Target Domain RWSR track, the aim is to learn a model capable
of super-resolving images in the source set, generating clean high-quality images
similar to those in the target set. The source input images in both tracks are
constructed using artificial, but realistic, image degradations. The difference with
all the previous challenges is that this time the images in the source and target set are
unpaired, so the 4× super-resolved LR images should possess the same properties as
the HR images of different scenes.
Final reports have been published for all of the above challenges. The reports are
a great illustrated source of information about the winning solutions, neural net
architectures, training strategies, and trends in SR, in general. Relevant references
are Timofte et al. (2017), Timofte et al. (2018), Cai et al. (2019a), and Lugmayr
et al. (2019).
Over the years, although many researchers in super-resolution have used the same
neural networks that produced state-of-the-art results in image classification, a lot of
SR-specific enhancements have been proposed. The architectural decisions that we
will describe next were instrumental in reaching the top positions in SISR challenges
and influenced the research in this field.
One of the major early advances in single image super-resolution was the
introduction of generative adversarial networks (GANs) to produce more realistic
high-resolution images in (Ledig et al. 2017). The proposed SRGAN network
consisted of a generator, a ResNet-like CNN with many residual blocks, and a
discriminator. The two networks were trained concurrently, with the generator
trying to produce high-quality HR images and the discriminator aiming to correctly
classify whether its input is a real HR image or one generated by an SR algorithm.
The discriminator's classification performance defined the adversarial loss.
The rationale was that this competition between the two networks
would push the generated images closer to the manifold of natural images. In that
work, the similarity between intermediate features generated by passing the two
images through a well-trained image classification model was also used as percep-
tual loss – so that, in total, three different losses (along with MSE) were combined
into one for training. This clearly demonstrated that GANs can not only synthesize
fantasy images from a random input but can also be instrumental in image
processing. Since then, GANs have become a method of choice for deblurring,
super-resolution, denoising, etc.
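The adversarial objective can be sketched with the standard binary cross-entropy formulation; this is a simplification, since SRGAN combines it with content and perceptual losses:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator should output 1 on real HR images and 0 on fakes.
    return float(-np.mean(np.log(d_real) + np.log(1.0 - d_fake)))

def generator_adv_loss(d_fake):
    # The generator pushes the discriminator's output on fakes towards 1.
    return float(-np.mean(np.log(d_fake)))

d_real = np.array([0.9, 0.8])  # discriminator outputs on real images
d_fake = np.array([0.1, 0.2])  # discriminator outputs on generated images
print(discriminator_loss(d_real, d_fake))
print(generator_adv_loss(d_fake))
```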
In the LapSRN model (Lai et al. 2017), shown in Fig. 2.10, the upsampling
follows the principle of Laplacian pyramids, i.e. each level of the CNN learns to
predict a residual that should explain the difference between a simple up-scale of the
previous level and the desired result. The predicted high-frequency residuals at each
level are used to efficiently reconstruct the HR image through upsampling and
addition operations.
The model has two branches: feature extraction and image reconstruction. The
first one uses stacks of convolutional layers to produce and, later, up-scale the
Fig. 2.10 Laplacian pyramid network for 2×, 4× and 8× up-scaling. (Reproduced with permission
from Lai et al. 2017)
residual images. The second one sums the residuals coming from the feature
extraction branch with images upsampled by bilinear interpolation. The entire
network is a cascade of CNNs with a
similar structure at each level. Each level has its loss function which is computed
with respect to the corresponding ground truth HR image at the specific scale.
LapSRN generates multiscale predictions, with zoom factors equal to powers
of 2. This design facilitates resource-aware applications, such as those running on
mobile devices. For example, if there are insufficient resources for 8× zooming, the
trained LapSRN model can still perform super-resolution with factors 2× and 4×.
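The per-level reconstruction used by LapSRN can be sketched as follows, with nearest-neighbour upscaling standing in for bilinear interpolation and random arrays standing in for the predicted residuals:

```python
import numpy as np

def upscale2x(img):
    # Nearest-neighbour 2x upscaling, a stand-in for bilinear interpolation.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_reconstruct(lr, residuals):
    """Each level upscales the previous result by 2x and adds the
    high-frequency residual predicted at that level."""
    out = lr
    for res in residuals:
        out = upscale2x(out) + res
    return out

lr = np.random.rand(8, 8)
residuals = [np.random.rand(16, 16), np.random.rand(32, 32)]  # 2x and 4x levels
print(laplacian_reconstruct(lr, residuals).shape)  # (32, 32)
```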
Like LapSRN, ProSR proposed by Wang et al. (2018a) aimed at the power-of-
two up-scale task and was built on the same hierarchical pyramid idea (Fig. 2.11).
However, the elementary building blocks for each level of the pyramid became more
sophisticated. Instead of sequences of convolutions, the dense compression units
(DCUs) were adapted from DenseNet. In a DCU, each convolutional layer obtains
“collective knowledge” as additional inputs from all preceding layers and passes on
its own feature maps to all subsequent layers through concatenation. This results in
better gradient flow during training.
In order to reduce the memory consumption and increase the receptive field with
respect to the original LR image, the authors used an asymmetric pyramidal structure
with more layers in the lower levels. Each level of the pyramid consists of a cascade
of DCUs followed by a sub-pixel convolution layer. A GAN variant of the ProSR
was also proposed, where the discriminator matched the progressive nature of the
generator network by operating on the residual outputs of each scale.
Compared to LapSRN, which also used a hierarchical scheme for power-of-two
upsampling, in ProSR the intermediate subnet outputs are neither supervised nor
used as the base image in the subsequent level. This design simplifies the backward
pass and reduces the optimization difficulty.
Fig. 2.11 Progressive super-resolution network. (Reproduced with permission from Wang et al.
2018a)
Fig. 2.12 DBPN and its up- and down-projection units. (Reproduced with permission from Haris
et al. 2018)
The Deep Back-Projection Network DBPN (Haris et al. 2018) exploits iterative
up- and down-sampling layers, providing an error feedback mechanism for projec-
tion errors at each stage. Inspired by iterative back-projection, an algorithm used
since the 1990s for multi-frame super-resolution, the authors proposed using mutu-
ally connected up- and down-sampling stages each of which represents different
types of image degradation and high-resolution components. As in ProSR, dense
connections between upsampling and down-sampling layers were added to encour-
age feature reuse.
Initial feature maps are constructed from the LR image, and they are fed to a
sequence of back-projection modules (Fig. 2.12). Each such module performs a
change of resolution up or down, with a set of learnable kernels, with a subsequent
return to the initial resolution using another set of kernels. A residual between the
input feature map and the one that was subjected to the up-down or down-up
operation is computed and passed to the next up- or downscaling. Finally, the
Fig. 2.13 Channel attention module used to reweight feature maps. (Reproduced with permission
from Zhang et al. 2018b)
Fig. 2.14 The building and plant patches from two LR images look very similar. Without a correct
prior, GAN-based methods can add details that are not faithful to the underlying class. (Reproduced
with permission from Wang et al. 2018b)
Fig. 2.15 Modulation of SR feature maps using affine parameters derived from probabilistic
segmentation maps. (Reproduced with permission from Wang et al. 2018b)
Hu et al. (2019) proposed to use a special Meta-Upscale Module. This module can
replace the standard deconvolution modules that are placed at the very end of CNNs
and are responsible for the up-scaling. For an arbitrary scale factor, this module
takes the zoom factor as input, together with the feature maps created by any SISR
CNN, and dynamically predicts the weights of the up-scale filters. The CNN then
uses these weights to generate an HR image of arbitrary size. Besides the elegance
of a meta-learning approach and the obvious flexibility with regard to zoom factors,
an important advantage is that parameters need to be stored for only one small
trained subnetwork.
The degradation factor that produces an LR image from an HR one is often
unknown. It can be associated with a nonsymmetric blur kernel, it can contain noise
from sensors or compression, and it can even be spatially dependent. One prominent
approach to simultaneously deal with whole families of blur kernels and many
possible noise levels has been proposed by Zhang et al. (2018c). By assuming that
the degradation can be modelled as an anisotropic Gaussian, with the addition of
white Gaussian noise with standard deviation σ, a multitude of LR images are
created for every HR ground truth image present in the training dataset. These LR
images are augmented with degradation maps which are computed by projecting the
degradation kernels to a low-dimensional subspace using PCA. The degradation
maps can be spatially dependent. The super-resolution multiple-degradations
(SRMD) network performs simultaneous zooming and deblurring for several zoom
factors and a wide range of blur kernels. It is assumed that the exact shape of the blur
kernel can be reliably estimated during inference. Figure 2.16b demonstrates the
result of applying SRMD to an LR image that was obtained from the HR image by
applying a Gaussian smoothing with an isotropic kernel that had a different width for
different positions in the ground truth HR image; also spatially dependent white
Gaussian noise was added. The degradation model shown in Fig. 2.16a, b is quite
complex, but the results of simultaneous zooming and deblurring (Fig. 2.16c) still
demonstrate sharp edges and good visual quality. This work was further extended to
non-Gaussian degradation kernels by Zhang et al. (2019).
Fig. 2.16 Examples of SRMD on dealing with spatially variant degradation: (a) noise level and
Gaussian blur kernel width maps; (b) zoomed LR image with noise added according to (a); (c)
results of SRMD with scale factor 2
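The construction of such low-dimensional degradation codes can be sketched as follows; the 15×15 kernel size and the 3-dimensional PCA subspace are illustrative choices, not the values used in SRMD:

```python
import numpy as np

def anisotropic_gaussian(size, cov):
    """Anisotropic Gaussian blur kernel sampled on a size x size grid."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    pts = np.stack([xx.ravel(), yy.ravel()])
    quad = np.sum(pts * (np.linalg.inv(cov) @ pts), axis=0)
    k = np.exp(-0.5 * quad).reshape(size, size)
    return k / k.sum()

# A family of kernels with varying widths, projected by PCA.
kernels = np.stack([anisotropic_gaussian(15, np.diag([s, 2.0 * s])).ravel()
                    for s in np.linspace(0.5, 4.0, 50)])
mean = kernels.mean(axis=0)
_, _, vt = np.linalg.svd(kernels - mean, full_matrices=False)
codes = (kernels - mean) @ vt[:3].T  # 3-dim degradation code per kernel
print(codes.shape)  # (50, 3)
```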
Deep learning algorithms can be applied to multiple frames for video super-
resolution, a subject that we did not touch on in this chapter. In order to ensure proper
spatiotemporal smoothness of the generated videos, DL methods usually leverage
optical flow and other cues from traditional computer vision, although many
of these cues can also be generated and updated in a DL context. It is worth mentioning
that the Russian Internet company Yandex (2018) has successfully used deep learning
to restore and up-scale various historical movies and cartoons (see Fig. 2.17 for an
example), which the company streamed under the name DeepHD.
Super-resolution of depth maps is also a topic of intense research. Depth maps are
obtained by depth cameras; they can also be computed from stereo pairs. Since this
computation is time-consuming, it is advantageous to use classical algorithms to
obtain the LR depth map and then rescale it by a large factor of 4× to 16×. The loss of
edge sharpness during super-resolution is much more prominent in depth maps than
in regular images. It has been shown by Hui et al. (2016) that CNNs can be trained to
accurately up-scale LR depth maps given the HR intensity images as an additional
input. Song et al. (2019) proposed an improved multiscale CNN for depth map
super-resolution that does not require the corresponding images. A related applica-
tion of SR is up-scaling of stereo images. Super-resolution of stereo pairs is
challenging because of large disparities between similar-looking patches of images.
Wang et al. (2019) have proposed a special parallax-attention mechanism with a
large receptive field along the epipolar lines to handle large disparity variations.
Super-resolution was recently applied by Chen et al. (2018) for magnetic reso-
nance imaging (MRI) in medicine. The special 3D CNN processes image volumes
and allows the MRI acquisition time to be shortened at the expense of minor image
quality degradation. It is worth noting that, while in the majority of use cases we
expect SISR algorithms to generate visually appealing images, in applications where
an important decision is made by analysing the image (e.g. in security, biometrics,
and especially in medical imaging), it is far more important to keep the informative
features unchanged during up-scaling, without introducing details that look realistic
for the specific domain as a whole but are irrelevant or misleading for the particular
case. High values of similarity metrics, including perceptual ones, may give an
inaccurate impression of
good performance of an algorithm. The ultimate verdict should come from visual
inspection by a panel of experts. Hopefully, this expertise can also be learned and
simulated by a machine learning algorithm to some degree.
Deep learning-based super-resolution has come a long way since the first attempts
at zooming synthetic images obtained by naïve bicubic down-sampling. Nowadays,
super-resolution is an integral part of the general image processing pipeline which
includes denoising and image enhancement. Future super-resolution algorithms
should be tunable by the user and provide reasonable trade-offs between the zoom
factor, denoising level and the loss or hallucination of details. Suitable image quality
metrics should be developed for assessing users’ preferences.
In order to make the algorithms generic and independent of the hardware, camera-
induced artefacts should be disentangled from the image content. This should be
done without requiring much training data from the same camera. Ideally, a single
image should suffice to derive the prior knowledge necessary for up-scaling and
denoising, without the need for pairs of LR and HR images. This direction is called
zero-shot super-resolution (Shocher et al. 2018; Ulyanov et al. 2020; Bell-Kligler
et al. 2019). Unpaired super-resolution is a subject of intense research; this task is
formulated in all the recent super-resolution competitions. The capturing of image
priors is often performed using generative adversarial networks and includes not
only low-level statistics but semantics (colour, resolution) as well. It is possible to
learn the “style” of the target (high-quality) image domain and transfer it to the
super-resolved LR image (Pan et al. 2020).
Modern super-resolution algorithms are computationally demanding, and there is
no indication that increasing the number of convolutional layers in a CNN beyond
some threshold inevitably leads to higher image quality. High quality is obtained by
other methods – residual or dense connections, attention mechanisms, multiscale
processing, etc. The number of operations per pixel will most likely decrease in
future SR algorithms. The CNNs will become more suitable for real-time processing
even for large images and high zoom factors. This might be achieved by training in
fixed-point arithmetic, network pruning and compression, and automatic adaptation
of architectures to target hardware using neural architecture search (Elsken et al.
2019). Finally, the application of deep learning methods to video SR, including the
time dimension (frame rate up-conversion; see Chap. 15), will set new standards in
multimedia content generation.
References
Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: dataset and
study. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pp. 1122–1131 (2017)
Athar, S., Wang, Z.: A comprehensive performance evaluation of image quality assessment
algorithms. IEEE Access. 7, 140030–140070 (2019)
Bell-Kligler, S., Shocher, A., Irani, M.: Blind super-resolution kernel estimation using an internal-
GAN. Adv. Neural Inf. Proces. Syst. 32 (2019). http://www.wisdom.weizmann.ac.il/~vision/
kernelgan/index.html. Accessed on 20 Sept 2020
Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.:
Albumentations: fast and flexible image augmentations. Information. 11(2), 125 (2020)
Cai, J., et al.: NTIRE 2019 challenge on real image super-resolution: methods and results. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 2211–2223 (2019a). https://ieeexplore.ieee.org/document/9025504. Accessed
on 20 Sept 2020
Cai, J., Zeng, H., Yong, H., Cao, Z., Zhang, L.: Toward real-world single image super-resolution: A
new benchmark and a new model. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision, pp. 3086–3095 (2019b)
Chen, Y., Shi, F., Christodoulou, A.G., Xie, Y., Zhou, Z., Li, D.: Efficient and accurate MRI
super-resolution using a generative adversarial network and 3D multi-level densely connected
network. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.)
Medical Image Computing and Computer Assisted Intervention. Lecture Notes in Computer
Science, vol. 11070. Springer Publishing Switzerland, Cham (2018)
Chen, C., Xiong, Z., Tian, X., Zha, Z., Wu, F.: Camera lens super-resolution. In: Proceedings of the
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1652–1660
(2019)
Dai, T., Cai, J., Zhang, Y., Xia, S., Zhang, L.: Second-order attention network for single image
super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 11057–11066 (2019)
Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and
texture similarity. arXiv, 2004.07728 (2020)
Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks.
IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016a)
Dong, C., Loy, C.C., He, K., Tang, X.: Accelerating the super-resolution convolutional neural
network. In: Proceedings of the European Conference on Computer Vision, pp. 391–407
(2016b)
Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20,
1–21 (2019)
Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for super-resolution. In:
Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 1664–1673 (2018)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
Hu, X., Mu, H., Zhang, X., Wang, Z., Tan, T., Sun, J.: Meta-SR: a magnification-arbitrary network
for super-resolution. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 1575–1584 (2019)
Huang, J., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 5197–5206 (2015)
Hui, T.-W., Loy, C.C., Tang, X.: Depth map super-resolution by deep multi-scale guidance. In:
Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV 2016.
Lecture Notes in Computer Science, vol. 9907. Springer Publishing Switzerland, Cham (2016)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution.
In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV
2016. Lecture Notes in Computer Science, vol. 9906. Springer Publishing Switzerland, Cham
(2016)
2 Super-Resolution: 2. Machine Learning-Based Approach 57
Kastryulin, S., Parunin, P., Zakirov, D., Prokopenko, D.: PyTorch image quality. https://github.
com/photosynthesis-team/piq (2020). Accessed on 20 Sept 2020
Lai, W., Huang, J., Ahuja, N., Yang, M.: Deep Laplacian pyramid networks for fast and accurate
super-resolution. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 5835–5843 (2017)
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial
network. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 105–114 (2017)
Lugmayr, A., et al.: AIM 2019 Challenge on real-world image super-resolution: methods and
results. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision
Workshop, pp. 3575–3583 (2019)
Ma, C., Yang, C.-Y., Yang, M.-H.: Learning a no-reference quality metric for single-image super-
resolution. Comput. Vis. Image Underst. 158, 1–16 (2017)
Migukin, A., Varfolomeeva, A., Chernyavskiy, A., Chernov, V.: Method for image super-
resolution imitating optical zoom implemented on a resource-constrained mobile device, and
a mobile device implementing the same. US patent application 20200211159 (2020)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer.
IEEE Signal Process. Lett. 20(3), 209–212 (2013)
Pan, X., Zhan, X., Dai, B., Lin, D., Change Loy, C., Luo, P.: Exploiting deep generative prior for
versatile image restoration and manipulation. arXiv, 2003.13659 (2020)
Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: PieAPP: perceptual image-error assessment through
pairwise preference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 1808–1817 (2018)
Rogozhnikov, A.: Einops – a new style of deep learning code. https://github.com/arogozhnikov/
einops/ (2018). Accessed on 20 Sept 2020
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., Wang, Z.: Is the deconvolution
layer the same as a convolutional layer? arXiv, 1609.07009 (2016a)
Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., Wang, Z.: Real-time single
image and video super-resolution using an efficient sub-pixel convolutional neural network. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1874–1883 (2016b)
Shocher, A., Cohen, N., Irani, M.: “Zero-shot” super-resolution using deep internal learning. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 3118–3126 (2018)
Song, X., Dai, Y., Qin, X.: Deeply supervised depth map super-resolution as novel view synthesis.
IEEE Trans. Circuits Syst. Video Technol. 29(8), 2323–2336 (2019)
Timofte, R., et al.: NTIRE 2017 challenge on single image super-resolution: methods and results.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Work-
shops, pp. 1110–1121 (2017)
Timofte, R., et al.: NTIRE 2018 challenge on single image super-resolution: methods and results.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 965–96511 (2018)
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. Int. J. Comput. Vis. 128, 1867–1888
(2020)
Wang, Y., Perazzi, F., McWilliams, B., Sorkine-Hornung, A., Sorkine-Hornung, O., Schroers, C.:
A fully progressive approach to single-image super-resolution. In: Proceedings of the IEEE/
CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 977–97709
(2018a)
Wang, X., Yu, K., Dong, C., Change Loy, C.: Recovering realistic texture in image super-resolution
by deep spatial feature transform. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 606–615 (2018b)
58 A. S. Chernyavskiy
Wang, L., et al.: Learning parallax attention for stereo image super-resolution. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12242–12251
(2019)
Yandex: DeepHD: Yandex’s AI-powered technology for enhancing images and videos. https://
yandex.com/promo/deephd/ (2018). Accessed on 20 Sept 2020
Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning
of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep
features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 586–595 (2018a)
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep
residual channel attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss,
Y. (eds.) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol. 11211.
Springer Publishing Switzerland (2018b)
Zhang, K., Zuo, W., Zhang, L.: Learning a single convolutional super-resolution network for
multiple degradations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 3262–3271 (2018c)
Zhang, K., Zuo, W., Zhang, L.: Deep plug-and-play super-resolution for arbitrary blur kernels. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 1671–1681 (2019)
Chapter 3
Depth Estimation and Control
3.1 Introduction
With the release of James Cameron's hugely popular Hollywood movie "Avatar" in 2009, the era of 3D TV technology got a second wind. By 2010, interest was very high, and all major TV makers included a "3D-ready" feature in smart TVs. In early 2010, the world's biggest manufacturers, such as Samsung, LG, Sony, Toshiba, and Panasonic, launched the first home 3D TVs on the market. 3D TV technology was at the peak of expectations. Figure 3.1 shows some of the prospective technologies as they were seen in 2010.
Over the next several years, 3D TV was a hot topic at major consumer electronics shows. However, because of the absence of technical innovations, the lack of content, and the increasingly clear disadvantages of the technology, consumer interest in such devices started to subside (Fig. 3.2).
By 2016, almost all TV makers had announced the termination of the 3D feature in flat-panel TVs and turned their attention to high-resolution and high-dynamic-range features, though 3D cinema is still popular.
From the marketing and technology points of view, the possible causes of this shift in interest are the following.
E. V. Tolstaya (*)
Aramco Innovations LLC, Moscow, Russia
e-mail: ktolstaya@yandex.ru

V. V. Bucha
BIQUANTS, Minsk, Belarus
e-mail: vbucha@biquants.com

Fig. 3.1 Some prospective technologies, as they were seen in 2010: 3D flat-panel TVs and displays, cloud computing, cloud/web platforms, augmented reality, Internet TV, 3D printing, gesture recognition, and mobile robots. (Gartner Hype Cycle for Emerging Technologies 2010, www.gartner.com)

Fig. 3.2 Popularity of the term "3D TV" measured by Google Trends, given as a percentage of its maximum; the peak follows the premiere of "Avatar" in London on December 10, 2009

1. Inappropriate moment. The recent transition from analogue to digital TV had forced many consumers to buy new digital TVs, and by 2010 many of them were not ready to invest in new TVs once again.
2. Extra cost. To take full advantage of using this new technology, consumers also
had to buy a 3D-enabled Blu-ray player or get a 3D-enabled satellite box.
3. Uncomfortable glasses. The 3D images work on the principle that each of our
eyes sees a different picture. By perceiving a slightly different picture from each
eye, the brain automatically constructs the third dimension. 3D-ready TVs came
with stereo glasses: with so-called active or passive glasses technology. Glasses
of different manufacturers could be incompatible with each other. Moreover, a
family of three or more people would need additional pairs, since usually only
one or two pairs were supplied with a TV. Viewers wearing prescription glasses had to wear the 3D glasses over their own pair. And finally, the glasses needed charging, so to be able to watch TV, you had to keep several pairs fully charged.
4. Live TV. Supporting 3D TV is a difficult task for broadcast networks: a separate channel is required for broadcasting 3D content, in addition to the conventional 2D channel for viewers with no 3D feature.
5. Picture quality. The picture in the 3D mode is dimmer than in the conventional
2D mode, because each eye sees only half of the pixels (or half of the light)
intended for the picture. In addition, viewing a 3D movie on a smaller screen with
a narrow field of view did not give a great experience, because the perceived
depth is much smaller than on big screen cinema. In this case, even increasing
parallax length does not boost depth but adds to eye fatigue and headache.
6. Multiple user scenario. When several people watch a 3D movie, not all of them
can sit at the point in front of the TV that gives the best viewing experience. This
leads to additional eye fatigue and picture defects.
It is clear that some of the technological problems mentioned still exist and need to be resolved by future engineers. However, even now we can see that stereo reproduction technology may yet meet its next wave of popularity. Virtual reality headsets, which recently appeared on the market, provide an excellent 3D movie viewing experience.
In passive systems with polarised glasses, the two pictures are projected with
different light polarisation, and corresponding polarised glasses allow separating
the images. The system requires the use of a quite expensive screen that saves the
initial polarisation of reflected light. Usually, such a system is used in 3D movie
theatres. The main disadvantage is loss of brightness, since only half of the light
reaches each eye.
The most common system in home 3D TVs is based on active shutter glasses: the TV alternates the left and right views on the screen, and synchronised glasses block each eye in turn. The stereo effect and its strength depend on the parallax (i.e. the difference) between the two views of a stereopair. The main disadvantage is loss of frame rate, because only half of the video frames reach each eye.
Let us consider in more detail the cause of eye fatigue while viewing 3D video on TV. Showing a slightly different image to each eye creates the 3D effect in the viewer's brain. The bigger the parallax, the more pronounced the 3D effect. The types of parallax are illustrated in Fig. 3.3.
Fig. 3.3 Types of parallax relative to the stereo plane: zero parallax, positive parallax, and negative parallax
1. Zero parallax. The image difference between left and right views is zero, and the
eye focuses right at the plane of focus. This is a convenient situation in real life,
and generally it does not cause viewing discomfort.
2. Positive (uncrossed) parallax. The position of the convergence point is located
behind the projection screen. The most comfortable viewing is accomplished
when the parallax is almost equal to the interocular distance.
3. Negative (crossed) parallax. The focusing point is in front of the projection
screen. The parallax depends on the convergence angle and the distance of the
observer from the display, and therefore it can be more than the interocular
distance.
4. Positive diverged parallax. The optical axes must diverge to perceive stereo with a parallax exceeding the interocular distance. This case can cause serious visual discomfort for the observers.
Parallax is usually measured as a percentage of the shift relative to the frame width. In real 3D movies, the parallax can be as large as 16% (e.g. "Journey to the Center of the Earth"), 9% ("Dark Country"), or 8% ("Dolphins and Whales 3D: Tribes of the Ocean") (Vatolin 2015). This means that the parallax can reach about 1 metre when viewing a 6-metre-wide cinema screen, which is significantly larger than the average interocular distance. Such a situation causes unnatural behaviour of the eyes, which are forced to diverge.
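The arithmetic above is easy to check. The helpers below are an illustrative sketch (the function names and the 6.5 cm average interocular distance are our assumptions, not from the chapter): they convert a parallax percentage into metres for a given screen width and flag when the eyes would be forced to diverge.

```python
def physical_parallax(parallax_pct, screen_width_m):
    # Parallax is measured as a percentage of the frame width;
    # convert it to metres for a physical screen.
    return parallax_pct / 100.0 * screen_width_m

def forces_divergence(parallax_m, interocular_m=0.065):
    # Positive parallax larger than the interocular distance
    # forces the optical axes of the eyes to diverge.
    return parallax_m > interocular_m
```

For a 6-metre-wide screen, a 16% parallax gives 0.96 m, far beyond the interocular distance, so the eyes must diverge; the same 16% on a 0.5-metre TV screen gives only 8 cm.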
Smaller screens (like TV on a smartphone) have smaller parallax, and the eyes
can better adapt to 3D. However, the drawback is that on smaller screens, the 3D
effect is smaller, and objects look flat.
Usually, human eyes use the mechanism of accommodation to see objects at
different distances in focus (see Fig. 3.4). The muscles that control the lens in the eye
shorten to focus on close objects. The limited depth of field means that objects that
are not at the focal length are typically out of focus. This enables viewers to ignore
certain objects in a scene.
The most common causes of eye fatigue (Mikšícek 2006) are enumerated below.
1. Breakdown of the accommodation and convergence relationship. When a viewer
observes an object in the real world, the eyes focus on a specific point belonging
to this object. However, when the viewer watches 3D content, the eyes try to
focus on "popping-out" 3D objects, while for a sharp picture they have to focus on the screen plane. This can misguide our brains and add a feeling of sickness.

Fig. 3.4 Accommodation for a near target (near objects in focus, far objects blurred) and for a far target (far objects in focus, near objects blurred)
2. High values of the parallax. High values of parallax can lead to divergence of the
eyes, and this is the most uncomfortable situation.
3. Crosstalk (ghosts). Occurs when the picture intended for the left eye is partly visible to the right eye and vice versa. It is quite common for technologies based on colour separation and light polarisation when filtering is insufficient, or in cases of bad synchronisation between the TV display and the shutter glasses.
4. Conflict between interposition and parallax. A special type of conflict between
depth cues appears when a portion of the object on one of the views is clipped by
the screen (or image window) surround. The interposition depth cue indicates that
the image surround is in front of the object, which is in direct opposition to the
disparity depth cue. This conflict causes depth ambiguity and confusion (Fig. 3.5).

Fig. 3.5 A scene point P is perceived at a distorted position P′ when the viewer is not at the correct position in front of the stereo display
5. Vertical disparities. Vertical disparities are caused by wrong placement of the
cameras or faulty calibration of the 3D presentation apparatus (e.g. different focal
lengths of the camera lenses). Figure 3.6 illustrates vertical disparity.
6. Common cue collision. Any logical collision of binocular cues and monocular
cues, such as light and shade, relative size, interposition, textural gradient, aerial
perspective, motion parallax, perspective, and depth cueing (Fig. 3.7).
7. Viewing conditions. Viewing conditions include viewing distance, screen size,
lighting of the room, viewing angle, etc. Also, personal features of the viewer:
age, anatomical size, and eye adaptability. Generally, the older the person, the
greater the eye fatigue and sickness, because of lower adaptability of the brain.
For children, who have smaller interocular distance, the 3D effect will be more
expressed, but the younger brain and greater adaptability will decrease the
negative experience. For people with strabismus, it is impossible to perceive stereo content at all. Figure 3.8 illustrates the situation when one of the viewers is not at the optimal position.
8. Content quality. Geometrical distortions, differences in colour, sharpness, brightness/contrast, and depth of field of the production optical system between the left and right views, flipped stereo, and time shift: all these factors lower content quality and cause more eye fatigue.
The majority of the cited causes of eye fatigue relate to stereo content quality. However, even high-quality content can have inappropriate parameters, such as a high value of parallax. To compensate for this, a fast real-time depth control technology has been proposed, aimed at reducing the perceived depth of 3D content by diminishing the stereo effect (Fig. 3.9).
The depth control feature can be implemented in a 3D TV, and it can be
controlled on the TV remote (Ignatov and Joesan 2009), as shown in Fig. 3.10.
The proposed scheme of stereo effect modification is shown in Fig. 3.11. First, the depth map between the input stereo views is estimated; the depth is then post-processed to remove artefacts and mapped for the modified stereo effect; after that, the left and right views are generated.

Fig. 3.9 To reduce perceived depth and associated eye fatigue, it is necessary to diminish the stereo effect

Fig. 3.11 Scheme of stereo effect modification, including depth control parameter estimation

Fig. 3.12 Depth tone mapping: (a) initial depth; (b) tone-mapped depth
The depth control method employs two techniques.
1. Control of the depth of pop-up objects (reduction of excessive negative parallax). This can be thought of as a reduction of the stereo baseline, where the modified stereo views are moved towards each other. In this case, the 3D perception of close objects is reduced first of all. This technique can be realised via view interpolation, where the virtual view for the modified stereopair is interpolated from the initial stereopair according to a portion of the depth/disparity vector. Areas of the image with small disparity vectors remain almost the same, while areas with large disparity vectors (pop-up objects) produce less perception of depth.
2. Control of the depth of the image plane. This can be thought of as moving the image plane along the z-direction, so that the perceived 3D scene moves further away in the z-direction. This technique can be realised via depth tone mapping with subsequent view interpolation. Depth tone mapping decreases depth perception equally for every region of the image. Figure 3.12 illustrates how all objects of the scene are made more distant to an observer. Depth tone mapping can be realised through pixel-wise contrast, brightness, and gamma operations.
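As a minimal sketch of the second technique, the function below applies pixel-wise contrast, brightness, and gamma operations to a normalised depth map (the parameter names and default values are illustrative assumptions; with contrast below 1 and gamma above 1, every depth value is reduced, pushing the scene away from the observer):

```python
import numpy as np

def tone_map_depth(depth, contrast=0.7, brightness=0.0, gamma=1.5):
    # depth: array normalised to [0, 1], larger values = closer to the viewer.
    d = np.clip(depth, 0.0, 1.0)
    d = np.clip(contrast * d + brightness, 0.0, 1.0)  # linear contrast/brightness
    return d ** gamma                                 # gamma curve
```

The mapping is monotonic, so the relative ordering of depths (and hence the scene structure) is preserved while the overall depth range shrinks.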
$$D(x, y) = \arg\min_{d} \big| I_r(x, y) - I_t(x + d, y) \big| = \arg\min_{d} \mathrm{Cost}(x, y, d),$$

where $I_r$ is the pixel intensity in the reference image, $I_t$ is the pixel intensity in the target image, $d \in [d_{min}, d_{max}]$ denotes the disparity range, and $D(x, y)$ is the disparity map.
The photometric constraint applied to a single pixel pair does not provide a unique solution. Instead of comparing individual pixels, several neighbouring pixels are grouped in a support window, and their intensities are compared with those of the pixels in another window. The simplest matching measure is the sum of absolute differences (SAD). The disparity that minimises the SAD cost for each pixel is chosen. The optimisation equation can be rewritten as follows:
$$D(x, y) = \arg\min_{d} \sum_{i} \sum_{j} \big| I_r(x_i, y_j) - I_t(x_i + d, y_j) \big| = \arg\min_{d} \sum_{i} \sum_{j} \mathrm{Cost}(x_i, y_j, d),$$

where $i \in [-n, n]$ and $j \in [-m, m]$ define the support window size (Fig. 3.14).

Fig. 3.13 Typical stereo pipeline: matching cost computation, cost (support) aggregation, disparity computation/optimisation, and disparity refinement
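A brute-force implementation of the windowed SAD matcher above might look as follows (a sketch for small images, intended only to make the equation concrete; practical systems use the aggregation strategies discussed next):

```python
import numpy as np

def sad_disparity(left, right, d_max, win=3):
    # Left-to-right disparity by minimising the window-aggregated SAD cost.
    h, w = left.shape
    half = win // 2
    L, R = left.astype(np.float64), right.astype(np.float64)
    best = np.full((h, w), np.inf)
    disp = np.zeros((h, w), dtype=int)
    for d in range(d_max + 1):
        # Per-pixel absolute difference at disparity d: |L(x) - R(x - d)|.
        ad = np.full((h, w), np.inf)
        ad[:, d:] = np.abs(L[:, d:] - R[:, :w - d])
        # Aggregate the cost over the (2*half + 1)^2 support window.
        p = np.pad(ad, half, mode='edge')
        cost = np.zeros((h, w))
        for dy in range(win):
            for dx in range(win):
                cost += p[dy:dy + h, dx:dx + w]
        # Winner-takes-all: keep the disparity with the lowest cost so far.
        better = cost < best
        best[better] = cost[better]
        disp[better] = d
    return disp
```

On a synthetic pair where the left image is the right image shifted by a few pixels, the matcher recovers that shift exactly in the image interior.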
Other matching measures include normalised cross correlation (NCC), modified
normalised correlation (MNCC), rank transform, etc. However, there is a problem
with correlation and SAD matching since the window size should be large enough to
include enough intensity variation for matching but small enough to avoid effects of
projective distortions.
For this reason, approaches that adaptively select the window size depending on local variations of intensity have been proposed. Kanade and Okutomi (1994) attempt to find the ideal window size and shape for each pixel in an image.
Prazdny (1987) proposed a new function to assign support weights to
neighbouring pixels iteratively. In this method, it is assumed that neighbouring
disparities, if corresponding to the same object in a scene, are similar and that two
neighbouring pixels with similar disparities support each other.
In general, the prior-art aggregation step uses rectangular windows for grouping
the neighbouring pixels and comparing their intensities with those of the pixels in
another window. The pixels can be weighted using linear or nonlinear filters for
better results. The most popular nonlinear filter for disparity estimation with a variable support strategy is the cross-bilateral filter. However, the computational complexity of this type of filter is extremely high, especially for real-time applications.
In this work, we adapted separable recursive bilateral-like filtering for matching cost aggregation. It has constant-time complexity, independent of the filter window size, and runs much faster than the traditional filter while producing a similar cost aggregation result (Tolstaya and Bucha 2012). We used a recursive implementation of the cost aggregation function, similar to (Deriche 1990). The separable implementation of the bilateral filter allows a significant speed-up of computations while giving a result similar to the full-kernel implementation (Pham and van Vliet 2005). Right-to-left and left-to-right disparities are computed using similar considerations. First, a difference image D between the left and right images is computed and aggregated as follows:
$$F(x, y, \delta) = \frac{1}{w} \sum_{x', y' \in \Gamma} D(x', y', \delta)\, S(|x - x'|)\, h\big(\Delta(I(x, y), I(x', y'))\big),$$

where $w(x, y)$ is the weight that normalises the filter output, computed according to the following formula:

$$w(x, y) = \sum_{x', y' \in \Gamma} S(|x - x'|)\, h\big(\Delta(I(x, y), I(x', y'))\big),$$
and Γ is the support window. This helps to adapt the filtering window, according to
colour similarity of image regions.
In our work, we used the following range and spatial filter kernels h(r) and S(x), respectively:

$$h(r) = \exp\left(-\frac{|r|}{\sigma_r}\right) \quad \text{and} \quad S(x) = \exp\left(-\frac{|x|}{\sigma_s}\right).$$
Symmetric kernels allow separable accumulation over rows and columns. The kernels h(r) and S(x) are not the commonly used Gaussian kernels, but with them it is possible to construct a recursive accumulation function and significantly increase processing speed while preserving quality.
Let us consider the one-dimensional case of smoothing with S(x). For a fixed δ, we have

$$F(x) = \sum_{k=0}^{N-1} D(k)\, S(k - x).$$
Unlike (Deriche 1990), the second pass is based on the result of the first pass, with the normalising coefficient $\alpha = e^{-1/\sigma_s}$ ensuring that the range of the output signal is the same as the range of the input signal. In the case of cross-bilateral filtering, the weight α is a function of x. The backward pass is modified similarly, and F(x) denotes the resulting filtered signal.
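The exact recursive formulas of the cited work are not reproduced here, but the principle can be illustrated with a generic sketch: a forward and a backward recursive pass that together realise smoothing with the kernel S(x) = exp(−|x|/σ_s) in constant time per sample, independent of the effective kernel width. The decomposition F = F_forward + F_backward − D used below is a standard identity and our assumption, not necessarily the authors' exact scheme.

```python
import numpy as np

def recursive_exp_smooth(d, sigma):
    # Smooth a 1-D signal with the kernel S(x) = exp(-|x|/sigma) using two
    # recursive passes; the cost per sample is constant, unlike direct
    # convolution whose cost grows with the kernel width.
    alpha = np.exp(-1.0 / sigma)
    n = len(d)
    fwd = np.empty(n)
    bwd = np.empty(n)
    fwd[0] = d[0]
    for x in range(1, n):              # forward (causal) pass
        fwd[x] = d[x] + alpha * fwd[x - 1]
    bwd[-1] = d[-1]
    for x in range(n - 2, -1, -1):     # backward (anti-causal) pass
        bwd[x] = d[x] + alpha * bwd[x + 1]
    return fwd + bwd - d               # equals sum_k alpha**|x - k| * d[k]
```

The result matches the direct O(N²) evaluation of the exponential kernel exactly, which is why such recursive schemes are attractive for real-time cost aggregation.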
To compute the aggregated cost in the 2D case, four passes of the recursive equations are performed: left to right, right to left, top to bottom, and bottom to top. These formulas give only an approximate solution for the cross-bilateral filter, but for the purpose of cost function aggregation, they give adequate results.
After the matching cost function has been aggregated for every δ, a pass along δ for every pixel gives the disparity values.
Disparity in occlusion areas is additionally filtered using symmetry considerations between DL, the disparity map from the left image to the right one, and DR, the disparity map from the right image to the left one.
Fig. 3.15 Left image of stereopair (a) and computed disparity map (b)
This rule is very efficient for correcting disparity in occlusion areas in stereo matching, because it is usually known that the minimal (or maximal, depending on the stereopair format) disparity corresponds to the farthest (covered) objects, and occlusions occur near boundaries and cover farther objects.
Figure 3.15 shows the results of the proposed algorithm.
The proposed method relies on the idea of convergence from a rough estimate towards a consistent depth map through subsequent iterations of the depth filter (Ignatov et al. 2009). On each iteration, the current depth estimate is refined by filtering in accordance with the images of the stereopair. The reference image is the colour image of the stereopair for which the depth is estimated; the matching image is the other colour image of the stereopair.
The first step of the method for depth smoothing is analysis and cutting of the
reference depth histogram (Fig. 3.16). The cutting of the histogram suppresses noise
present in depth data. The raw depth estimates could have a lot of outliers. The noise
might appear due to false stereo matching in occlusion areas and in textureless areas.
The proposed method uses two thresholds: a bottom of the histogram range B and a
top of the histogram range T. These thresholds are computed automatically from the
given percentage of outliers.
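A sketch of such automatic threshold selection is given below; the percentile-based rule and parameter name are our assumptions about one reasonable realisation, not the chapter's exact procedure.

```python
import numpy as np

def cut_depth_histogram(depth, outlier_pct=4.0):
    # Choose the bottom (B) and top (T) of the histogram range so that
    # half of the given outlier percentage is cut at each end, then clip.
    B = np.percentile(depth, outlier_pct / 2.0)
    T = np.percentile(depth, 100.0 - outlier_pct / 2.0)
    return np.clip(depth, B, T), B, T
```

Clipping to [B, T] discards extreme depth values produced by false stereo matches without touching the bulk of the histogram.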
The next step of the method for depth smoothing is left, right depth cross-check.
The procedure operates as follows:
• Compute the left disparity vector (LDV) from the left depth value.
• Fetch the right depth value mapped by the LDV.
• Compute the right disparity vector (RDV) from the right depth value.
• Compute the disparity difference (DD) of absolute values of LDV and RDV.
• If DD is higher than the threshold, the left depth pixel is marked as an outlier.
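The cross-check procedure above can be sketched as follows, working directly on disparity maps and assuming the convention that a left pixel (x, y) with disparity d maps to right pixel (x − d, y); both choices are our illustrative assumptions.

```python
import numpy as np

def cross_check(disp_left, disp_right, threshold=2):
    # Mark left-view pixels whose left and right disparities disagree.
    h, w = disp_left.shape
    ys, xs = np.indices((h, w))
    x_right = np.clip(xs - disp_left.astype(int), 0, w - 1)  # fetch via the LDV
    dd = np.abs(disp_left - disp_right[ys, x_right])         # disparity difference
    out = disp_left.copy()
    out[dd > threshold] = 0   # noisy pixels are marked by 0
    return out
```

Consistent pixel pairs pass unchanged; a pixel whose counterpart in the other map disagrees by more than the threshold is zeroed out as noise.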
3 Depth Estimation and Control 73
Depth Image
Depth cross- segmentation Depth
histogram
check onto textured smoothing
cutting
Fig. 3.17 Example of left, right depth cross-check: (a) left image; (b) right image; (c) left depth;
(d) right depth; (e) left depth with noisy pixels (marked black); (f) smoothing result for left depth
without depth cross-checking; (g) smoothing result for left depth with depth cross-checking
In our implementation, the threshold for the disparity cross-check is set to 2, and noisy pixels are marked by 0. Since 0 < 64, noisy pixels are automatically treated as outliers in further processing. An example of a depth map with noisy pixels marked according to the depth cross-check is shown in Fig. 3.17. It shows that the depth cross-check successfully removes outliers from occlusion areas (shown by red circles).
Fig. 3.18 Example of image segmentation into textured and non-textured regions: (a) colour
image; (b) raw depth; (c) binary segmentation mask (black, textured regions; white, non-textured
regions); (d) smoothing result without using image segmentation; (e) smoothing result using image
segmentation
The next step of the method for depth smoothing is binary segmentation of the left
colour image into textured and non-textured regions. For this purpose, the gradients
in four directions, i.e. horizontal, vertical, and two diagonal, are computed. If all
gradients are lower than the predefined threshold, the pixel is considered to be
non-textured; otherwise it is treated as textured. This could be formulated as follows:
$$BS(x, y) = \begin{cases} 255, & \text{if } \mathrm{gradients}(x, y) < \mathrm{Threshold} \\ 0, & \text{otherwise,} \end{cases}$$
where BS is the binary segmentation mask for the pixel with coordinates (x, y); a value of 255 corresponds to a non-textured image pixel, while 0 corresponds to a textured one. Figure 3.18 presents an example of image segmentation into textured and non-textured regions, along with an example of depth map smoothing with and without the segmentation mask.
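The segmentation rule above can be sketched directly (the threshold value is an illustrative assumption):

```python
import numpy as np

def texture_mask(gray, threshold=8.0):
    # Binary segmentation: 255 = non-textured pixel (all four directional
    # gradients below the threshold), 0 = textured pixel.
    g = gray.astype(np.float64)
    p = np.pad(g, 1, mode='edge')
    grads = np.stack([
        np.abs(p[1:-1, 2:] - p[1:-1, :-2]),  # horizontal
        np.abs(p[2:, 1:-1] - p[:-2, 1:-1]),  # vertical
        np.abs(p[2:, 2:] - p[:-2, :-2]),     # first diagonal
        np.abs(p[2:, :-2] - p[:-2, 2:]),     # second diagonal
    ])
    non_textured = np.all(grads < threshold, axis=0)
    return np.where(non_textured, 255, 0).astype(np.uint8)
```

A flat image yields an all-255 mask, while pixels straddling an intensity edge are classified as textured.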
Fig. 3.19 Examples of processed depth: (a) colour images; (b) initial raw depth maps; (c) depth
maps smoothed by the proposed method
The problem of depth-based virtual view synthesis (or depth image-based rendering, DIBR) is the reconstruction of the view from a virtual camera CV, given the views from other cameras C1 and C2 (or different views captured by a moving camera) and the available scene geometry (point correspondences, depth, or a precise polygon model) (see Fig. 3.20).
The following problems need to be addressed in particular during view
generation:
• Disocclusion
• Temporal consistency
• Symmetric vs asymmetric view generation
• Toed-in camera configuration
Fig. 3.20 Virtual view synthesis: cameras C1 and C2, the virtual camera CV, the reconstructed 3D scene, and the resulting disocclusion areas
3.7.1 Disocclusion
As we intend to use one depth map for virtual view synthesis, we should be prepared for the appearance of disocclusion areas. A disocclusion area is a part of the virtual image that becomes visible from the novel viewpoint but was not visible in the initial view. Examples of disocclusion areas are marked in black in Fig. 3.21. A common way to eliminate disocclusions is to fill these areas with the colours of neighbouring pixels.
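A minimal sketch of such filling, using the nearest valid pixel along each row, is shown below. This is a simplification of what production systems do; real implementations typically prefer background pixels chosen with the help of the depth map.

```python
import numpy as np

def fill_disocclusions(image, holes):
    # Fill hole pixels (holes == True) with the nearest valid value to the
    # left in the same row; holes at a row start are filled from the right.
    out = image.astype(np.float64).copy()
    h, w = out.shape
    unfilled = holes.copy()
    for y in range(h):
        for x in range(1, w):               # forward pass
            if unfilled[y, x] and not unfilled[y, x - 1]:
                out[y, x] = out[y, x - 1]
                unfilled[y, x] = False
        for x in range(w - 2, -1, -1):      # backward pass for leading holes
            if unfilled[y, x] and not unfilled[y, x + 1]:
                out[y, x] = out[y, x + 1]
                unfilled[y, x] = False
    return out
```

Propagating the last valid colour across a hole produces a visible smear for large disocclusions, which is one reason depth decrease (rather than increase) is preferred later in this chapter.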
3.7.2 Temporal Consistency

Most stereo disparity estimation methods consider still images as input, but TV
stereo systems require real-time depth control/view generation algorithms intended
for video. When considering all frames independently, some flickering can occur,
especially near objects’ boundaries. Usually, the depth estimation algorithm is
modified to output temporally consistent depth maps. Since it is not very practical
to use some complicated algorithms like bundle adjustment, more computationally
effective methods are applied, like averaging inside a small temporal window.
3.7.3 Symmetric vs Asymmetric View Generation

The task of stereo intermediate view generation is a particular case of arbitrary view
rendering, where the positions of virtual views are constrained to lie on the line
connecting the centres of source cameras. To generate the new stereopair with a
reduced stereo effect, we applied symmetric view rendering (see Fig. 3.22), where
the middle point of baseline stays fixed and both left and right views are generated.
Other configurations will render only one view, leaving the other intact. But in this
case, the disocclusion area will be located on one side of the popping-out objects and
can be more susceptible to artefacts.
3.7.4 Toed-in Camera Configuration

There are two possible camera configurations: parallel and toed-in (see Fig. 3.23). In
the case of parallel configuration, depth has positive value, and all objects appear in
front of the screen. When the stereo effect is large, this can cause eye discomfort.
The toed-in configuration is closer to the natural human visual system. However, it generates keystone distortion in the images, including vertical
disparity. Due to non-parallel disparity lines, the algorithm of depth estimation
will give erroneous results, and such content will require rectification. To eliminate
Cl Vl V C
r r
78 E. V. Tolstaya and V. V. Bucha
L R L R
Fig. 3.23 Parallel (a) and toed-in (b) camera configuration; illustration of keystone distortion (c)
Fig. 3.24 Virtual view synthesis: initial stereopair (top); generated stereopair with 30% depth
decrease (bottom)
eye discomfort from stereo while preserving the stereo effect, a method of zero-plane setting can be applied: the virtual image plane is shifted and the depth is reduced by some amount so that it takes negative values in some image areas.
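A sketch of the depth reduction and zero-plane shift applied to a disparity map; the function name, the default strength, and the mid-depth choice of the zero plane are assumptions for illustration, not values from the chapter:

```python
import numpy as np

def adjust_disparity(disparity, strength=0.7, zero_plane=None):
    """Reduce the stereo effect and move the zero-parallax plane.

    disparity: 2D array of pixel disparities (positive = in front of screen).
    strength: scaling factor (< 1 reduces depth; a 30% decrease -> 0.7).
    zero_plane: disparity value mapped onto the screen plane; after the
    shift, areas with smaller disparity get negative values (behind screen).
    """
    d = strength * disparity
    if zero_plane is None:
        zero_plane = 0.5 * (d.min() + d.max())  # put mid-depth on screen
    d = d - zero_plane
    # Symmetric rendering: each view moves by half the adjusted disparity,
    # so the middle point of the baseline remains fixed.
    left_shift = +0.5 * d
    right_shift = -0.5 * d
    return d, left_shift, right_shift
```

Splitting the shift equally between the two generated views mirrors the symmetric rendering described above, placing the disocclusion areas on both sides of popping-out objects.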
Figures 3.24 and 3.25 illustrate resulting stereopairs with 30% depth decrease.
In the proposed application, we consider depth decrease as more applicable for
the “depth control” feature, since the automatic algorithm in this case will not face
the problem of disocclusion and, hence, hopefully produce fewer artefacts. In the
following Chap. 4 (Semi-automatic 2D to 3D Video Conversion), we will further address the topic of depth-image-based rendering (DIBR) for situations with a depth increase and emerging disocclusion areas that must be treated in a specific way.
Finally, we would like to add that, unfortunately, very few models of 3D TVs were equipped with the “3D depth control” feature for a customisable strength of the stereo effect, for example, the LG Electronics 47GA7900. Automatic 2D → 3D video conversion systems (available in production in some models of 3D TVs by LG Electronics, Samsung, etc. and also in the commercially available TriDef 3D software by DDD Group) can be considered part of such a feature, since during conventional monoscopic to stereoscopic video conversion, the user can preset the desired amount of stereo effect.
3 Depth Estimation and Control 79
Fig. 3.25 Virtual view synthesis: initial stereopair (top); generated stereopair with 30% depth
decrease (bottom)
References
Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sampling. IEEE
Trans. Pattern Anal. Mach. Intell. 20(4), 401–406 (1998)
Deriche, R.: Fast algorithms for low-level vision. IEEE Trans. Pattern Anal. Mach. Intell. 12(1),
78–87 (1990)
Ignatov, A., Joesan, O.: Method and system to transform stereo content. European Patent EP2293586 (2009)
Ignatov, A., Bucha, V., Rychagov, M.: Disparity estimation in real-time 3D acquisition and
reproduction system. In: Proceedings of International Conference on Computer Graphics
“Graphicon”, pp. 61–68 (2009)
Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: theory and
experiment. IEEE Trans. Pattern Anal. Mach. Intell. 16(9), 920–932 (1994)
Mikšícek, F.: Causes of visual fatigue and its improvements in stereoscopy. University of West
Bohemia in Pilsen, Pilsen, Technical Report DCSE/TR-2006-04 (2006)
Pham, T., van Vliet, L.: Separable bilateral filtering for fast video preprocessing. In: Proceedings of
IEEE International Conference on Multimedia and Expo, pp. 1–4 (2005)
Prazdny, K.: Detection of binocular disparities. In: Fischler, M.A., Firschein, O. (eds.) Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pp. 73–79. Morgan Kaufmann, Los Altos (1987)
Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence
algorithms. Int. J. Comput. Vis. 47(1), 7–42 (2002)
Tolstaya, E.V., Bucha, V.V.: Silhouette extraction using color and depth information. In: Proceedings of 2012 IS&T/SPIE Electronic Imaging. Three-Dimensional Image Processing (3DIP) and Applications II, 82900B (2012). https://doi.org/10.1117/12.907690. Accessed 04 October 2020
Vatolin, D.: Why does 3D lead to the headache? Part 4: Parallax (in Russian) (2015). https://habr.com/en/post/378387/
Chapter 4
Semi-Automatic 2D to 3D Video Conversion
P. Pohl (*)
Samsung R&D Institute Rus (SRR), Moscow, Russia
e-mail: p.pohl@samsung.com
E. V. Tolstaya
Aramco Innovations LLC, Moscow, Russia
e-mail: ktolstaya@yandex.ru
The extraction of key frames is a very important step for semi-automatic video
conversion. Key frames for stereo conversion are completely different from the key frames selected for video summarization, as described in Chap. 6. The stereo
conversion key frames are selected for an operator, who will manually draw depth
maps for key frames, and then these depth map frames will be further interpolated
(propagated) through the whole video clip, followed by depth-based stereo view
rendering. The more frames that are selected, the more manual labour will be
required for video conversion, but in the case of an insufficient number of key
frames, a lot of intermediate frames may have inappropriate depths or contain
conversion artefacts. The video clip should be thoroughly analysed prior to the
start of manual work. For example, a slow-motion scene with simple, close-to-linear motion will require fewer key frames, while dramatic, fast-moving objects, especially if they are closer to the camera and have larger occlusion areas, must have more key frames to ensure better conversion quality.
84 P. Pohl and E. V. Tolstaya
In Wang et al. (2012), the key frame selection algorithm relies on the size of cumulative occlusion areas, and shot segmentation is performed using a block-based histogram comparison.
Sun et al. (2012) select key frame candidates using the ratio of SURF feature
points to the correspondence number, and a key frame is selected from among
candidates such that it has the smallest reprojection error. Experimental results
show that the propagated depth maps using the proposed method have fewer errors,
which is beneficial for generating high-quality stereoscopic video.
However, for semi-automatic 2D-3D conversion, the key frame selection algorithm should properly handle the various situations that are difficult for depth propagation, so as to diminish possible depth map quality issues while at the same time limiting the overall number of key frames.
Additionally, the algorithm should analyse all parts of video clips and group
similar scene parts. For example, very often during character conversation, the
camera switches from one object to another, while objects’ backgrounds almost do
not change. Such smaller parts of a bigger “dialogue” scene can be grouped together
and be considered as a single scene for every character.
For this purpose, video should be analysed and segmented into smaller parts
(cuts), which should be sorted into similar groups with different characteristics:
1. Scene change (shot segmentation). Many algorithms have already been proposed
in the literature. They are based on either abrupt colour change or motion vector
change. The moving averages of histograms are analysed and compared to some
threshold, meaning that the scene changes when the histogram difference exceeds
the threshold. Smooth scene transitions pose serious problems for such algorithms. Needless to say, the depth map propagation and stereo conversion of such scenes are also difficult tasks.
2. Motion type detection: still, panning, zooming, complex/chaotic motion. Slow-
motion scenes are easy to propagate, and in this case, a few key frames can save a
lot of manual work. A serious problem for depth propagation is caused by zoom
motion (objects approaching or moving away); in this case, a depth increase or
decrease should be smoothly interpolated (this is illustrated in Fig. 4.2 bottom
row).
3. Visually similar scenes (shots) grouping. Visually similar scenes very often occur
when shooting several talking people, when the camera switches from one person
to another. In this case, we can group scenes with one person and consider this
group to be a longer continuous sequence.
4. Object tracking in the scene. To produce consistent results during 2D-3D con-
version, it is necessary to analyse objects’ motion. When the main object (key
object) appears in the scene, its presence is considered to select the best key frame
when the object is fully observable, and its depth map can be well propagated to
the other frames of its appearance. Figures 4.1 and 4.2 (two top rows) illustrate this kind of situation.
5. Motion segmentation for object tracking. For better conversion results, motion is
analysed within the scenes. The simplest type of motion is panning or linear
4 Semi-Automatic 2D to 3D Video Conversion 85
Fig. 4.1 Example of the key object of the scene and its corresponding key frame
Fig. 4.2 Different situations for key frame selection: object should be fully observable in the scene,
objects appears in the scene, and object is zoomed
7. Occlusion analysis. The key frame should be selected when the cumulative area
of occlusion goes beyond the threshold. This is similar to Wang et al. (2012).
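The histogram-threshold scene change test from item 1 can be sketched as follows; for brevity the sketch compares each frame only with its predecessor rather than with a moving average, and the threshold and bin count are illustrative assumptions:

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=0.4, bins=32):
    """Flag frames whose colour histogram differs sharply from the previous one.

    frames: list of HxWx3 uint8 arrays.
    Returns the indices of frames that start a new shot.
    """
    boundaries = []
    prev_hist = None
    for t, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()  # normalize so the L1 distance is in [0, 2]
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(t)
        prev_hist = hist
    return boundaries
```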
The key frame selection algorithm (Tolstaya and Hahn 2012), based on a function reflecting the transition complexity between every frame pair, finds the optimal distribution of key frame indices by graph optimization. The frames of a video shot are represented by the vertices of a graph, where the source is the first frame and the sink is the last frame. When two frames are too far apart, their transition complexity is set to some large value. The optimal path can then be found using the well-known Dijkstra's algorithm.
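The graph formulation above can be sketched as follows; `transition_cost` is a hypothetical stand-in for the transition complexity function of Tolstaya and Hahn (2012), and `max_span` models the cutoff beyond which the complexity would be set to a large value:

```python
import heapq

def select_key_frames(n_frames, transition_cost, max_span=10):
    """Pick key frames as the shortest source-to-sink path over frames.

    transition_cost(i, j): complexity of propagating depth from frame i
    to frame j. Frames farther apart than max_span are not connected.
    """
    INF = float("inf")
    dist = [INF] * n_frames
    prev = [-1] * n_frames
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale heap entry
        for j in range(i + 1, min(i + max_span, n_frames - 1) + 1):
            nd = d + transition_cost(i, j)
            if nd < dist[j]:
                dist[j] = nd
                prev[j] = i
                heapq.heappush(heap, (nd, j))
    # Walk back from the last frame (sink) to the first (source).
    path, node = [], n_frames - 1
    while node != -1:
        path.append(node)
        node = prev[node]
    return path[::-1]
```

A cost that grows superlinearly with frame distance yields densely spaced key frames; a near-constant cost collapses the path to the two endpoints.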
Ideas on the creation of an automatic video analysis algorithm based on machine
learning techniques can be further explored in Chap. 6, where we describe
approaches for video footage analysis and editing and style simulations for creating
dynamic and professional-looking video clips.
4.3.1 Introduction
One of the most challenging problems that arise in semi-automatic conversion is the
temporal propagation of depth data. The bottleneck of the whole 2D to 3D conver-
sion pipeline is the quality of the propagated depth: if the quality is not high enough,
a lot of visually disturbing artefacts appear on the final stereo frames. The quality
strongly depends on the frequency of manually assigned key frames, but drawing a
lot of frames requires more manual work and makes production slower and more
expensive. That is why the crucial problem is error-free temporal propagation of
depth data through as many frames as possible. The optimal key frame distance for
the desired quality of outputs is highly dependent on the properties of the video
sequence and can change significantly within one sequence.
The temporal propagation of key frame depth data is a complex task. Video data are temporally undersampled; they contain noise, motion blur, and optical effects such as reflections, flares, and transparent objects. Moreover, objects in the scene can disappear, become occluded, or significantly change shape or visibility.
The traditional method of depth interpolation uses a motion estimation result, using
either depth or video images. Varekamp and Barenbrug (2007) propose creating the first estimate of depth by bilateral filtering of the previous depth image and then correcting it by estimating the motion between depth frames. A similar approach is described by Muelle et al. (2010). Harman et al. (2002) use a machine
learning approach for the depth assignment of key frames. They suggest that these
should be selected manually or that techniques similar to those for shot-boundary
detection should be applied. After a few points are assigned, a classifier (separate for
each key frame) is trained using a small number of samples, and then it restores the
depth in the key frame. After that, a procedure called “depth tweening” restores
intermediate depth frames. For this purpose, both classifiers of neighbouring key
frames are fed with an image value to produce a depth value. For the final depth, both
intermediate depths are weighted by the distance to the key frames. Weights could
linearly depend on the time distance, but the authors propose the use of a non-linear
time-weight dependence. A problem with such an approach could arise when
intermediate video frames have areas that are completely different than those that
can be found on key frames (for example, in occlusion areas). However, this
situation is difficult for the majority of depth propagation algorithms. Feng et al.
(2012) describe a propagation method based on the generation of superpixels,
matching them and generating depth using matching results and key frame depths.
Superpixels are generated by SLIC (Simple Linear Iterative Clustering). Superpixels
are matched using mean colour (three channels) and the coordinates of the centre
position. Greedy search finds superpixels in a non-key frame (within some fixed
window) with minimal colour difference and L1 spatial distance, multiplied by a
regularization parameter. Cao (2011) proposes the use of motion estimation and
bilateral filtering to get a first depth estimate and refines it by applying depth motion
compensation in frames between key frames with assigned depths.
As a base method for comparison, we will use an approach similar to Cao (2011).
The motion information is then used for the warping of the depth data from previous
and subsequent key frames. These two depth fields are then mixed with the weights
of motion confidence. These weights can be acquired by an analysis of the motion
projection error from the current image to one of the key frames or the error of
projection of a small patch, which has slightly more stable behaviour. As a motion
estimation algorithm, we use the optical flow described by Pohl et al. (2014), with the addition of the third of the three YCbCr channels.
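A minimal sketch of warping key frame depth to the current frame with a precomputed flow field, as in the baseline just described; nearest-neighbour sampling is an illustrative simplification of the real resampling:

```python
import numpy as np

def warp_depth(depth_key, flow):
    """Backward-warp a key frame depth map to the current frame.

    depth_key: 2D depth map of the key frame.
    flow: HxWx2 array of motion vectors (dx, dy) from the current frame
    to the key frame, e.g. produced by an optical flow estimator.
    """
    h, w = depth_key.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each current-frame pixel, look up where it lands in the key frame.
    src_x = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return depth_key[src_y, src_x]
```

In the baseline, a warp of this kind is done from both the previous and the subsequent key frame, and the two results are mixed by motion confidence.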
Similar but harder problems appear in the area of greyscale video colourization,
as in Irony (2005). In this case, only greyscale images can be used for matching and
finding similar objects. Pixel values and local statistics (features) are used in this
case; spatial consistency is also taken into account.
Most motion estimation algorithms cannot handle larger displacements and cannot be used for interpolation over more than a few frames. The integration of
motion information over time leads to increasing motion errors, especially near
object edges and in occlusion areas. Bilateral filtering of either motion or depth
can cover only small displacements or errors. Moreover, it can lead to disturbing
artefacts in the case of similar foreground and background colours.
Our dense depth propagation algorithm interpolates the depth for every frame
independently. It utilizes the nearest preceding and nearest following frames with
known depth maps (key frame depth). The propagation of depth maps from two
sides is essential, as it allows us to interpolate most occlusion problems correctly.
The general idea is to find correspondence between image patches, and assuming
that patches with similar appearances have similar depths, we can synthesize an
unknown depth map based on this similarity (Fig. 4.3).
The process of finding similar patches is based on work by Korman and Avidan
(2015). First, a bank of Walsh-Hadamard (W-H) filters is applied to both images. As
a result, we have a vector of filtering results for every pixel (Fig. 4.4). After that, a
hash code is generated for each pixel using this vector of filtering results.
Fig. 4.3 Illustration of depth propagation process from two neighbouring frames, the preceding
and following frames
Hash tables are built using hash codes and the corresponding pixel coordinates for
fast search for patches with equal hashes. We assume that patches with equal hashes
have a similar appearance. For this purpose, the authors applied a coherency sensitive hashing (CSH) algorithm. The hash code is a 16-bit short integer; given the hashes for each patch, a greedy matching algorithm selects patches with the same (or closest) hash and computes the patch difference as the difference between the vectors of W-H filter results. The matching error used in the search for the best correspondence is a combination of the filter output difference and the spatial distance, with a dead zone (no penalty for small distances) and a limit on the maximum allowed distance. The spatial distance between matching patches was introduced to avoid unreasonable correspondences between similar image structures from different parts of the image; such matches are improbable for relatively close frames of video input.
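A much-reduced sketch of the hashing idea, not the actual CSH implementation of Korman and Avidan (2015): each patch is projected onto a few Walsh-Hadamard kernels and one bit per kernel (the sign of the response) is packed into a short integer hash. The kernel count, patch size, and sign-based bit packing are illustrative assumptions:

```python
import numpy as np

def hash_patches(image, patch=4):
    """Hash every patch of a greyscale image by W-H projection signs.

    Returns a dict mapping hash code -> list of (y, x) patch positions,
    so patches with equal hashes can be looked up in O(1).
    """
    # Three 1D Walsh-Hadamard rows, used as separable 2D kernels.
    rows = np.array([[1, 1, 1, 1],
                     [1, 1, -1, -1],
                     [1, -1, -1, 1]], dtype=float)
    h, w = image.shape
    table = {}
    for y in range(h - patch + 1):
        for x in range(w - patch + 1):
            p = image[y:y + patch, x:x + patch].astype(float)
            p = p - p.mean()  # remove DC so the first bit is informative
            code = 0
            for bit, r in enumerate(rows):
                resp = r @ p @ r  # separable 2D response: rows then columns
                if resp > 0:
                    code |= 1 << bit
            table.setdefault(code, []).append((y, x))
    return table
```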
In the first step, RGB video frames are converted to the YCbCr colour space. This
allows us to treat luminance and chrominance channels differently. To accommodate
fast motion and decrease the sensitivity to noise, this process is done using image
pyramids. Three pyramids for video frames and two pyramids for key frame depth
maps are created. The finest level (level = 0) has full-frame resolution, and we decrease the resolution by a factor of 0.5, so that the coarsest level (level = 2) has 1/4 of
the full resolution. We use an area-based interpolation method. The iterative scheme
starts on the coarsest level of the pyramid by matching two key frames to the current
video frame. The initial depth is generated by voting over patch correspondences
using weights dependent on colour patch similarity and temporal distances from
reference frames. In the next iterations, matching between images combined from
colour and the depth map is performed. For performance reasons, only one of the chrominance channels (whether Cr or Cb makes little difference) is replaced with the given depth in the reference frames, and thus a depth estimate is obtained for the current frame. On each level, we perform several CSH matching iterations and
iteratively update the depth image by voting. A Gaussian kernel smoothing with
decreasing kernel size or another low-pass image filter is used to blur the depth
estimate. This smooths the small amount of noise coming from logically incorrect
matches. The low-resolution depth result for the current frame is upscaled, and
the process is repeated for every pyramid level and ends at the finest resolution,
which is the original resolution of the frames and depth. The process is described by
Algorithm 4.1.
Within Algorithm 4.1, the patch matching maps from the current frame to its neighbouring key frames are denoted, for each pyramid level, as

Map^{−t}_{level}: I^{t}_{level} → I^{t−1}_{level} and Map^{+t}_{level}: I^{t}_{level} → I^{t+1}_{level},

i.e. Map^{−t} matches patches of the current frame I^{t} to the preceding key frame I^{t−1}, and Map^{+t} matches them to the following key frame I^{t+1}.
The patch-voting procedure uses an estimate of match error for forward and
backward frames and is described in Algorithm 4.2. As an estimate of error, we
use the sum of absolute differences over patch correspondence (on the coarsest level)
or the sum of absolute differences of 16 Walsh-Hadamard kernels that are available
after CSH estimation. In our experiment, the best results were achieved with errors
normalized by the largest error over the whole image. The use of Walsh-Hadamard kernels as a similarity measure is justified because it approximates the true difference between patches while decreasing sensitivity to noise, since only the coarser filtering results are used.
W_prev = exp(−Err_{t−1} / (2σ_V(level))),
W_next = exp(−Err_{t+1} / (2σ_V(level)))
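A sketch of the two-sided patch voting with exponential weights W = exp(−Err/(2σ_V)) applied to the match errors of the previous and next key frames, with errors first normalized by the image-wide maximum as the text reports gave the best results; the default sigma is illustrative:

```python
import numpy as np

def vote_depth(d_prev, d_next, err_prev, err_next, sigma_v=0.05):
    """Blend depths warped from the two key frames by match-error voting.

    d_prev, d_next: depth maps warped from the preceding/following key frame.
    err_prev, err_next: per-pixel match errors against those key frames.
    """
    # Normalize errors by the largest error over the whole image.
    e_prev = err_prev / max(err_prev.max(), 1e-12)
    e_next = err_next / max(err_next.max(), 1e-12)
    w_prev = np.exp(-e_prev / (2.0 * sigma_v))
    w_next = np.exp(-e_next / (2.0 * sigma_v))
    return (w_prev * d_prev + w_next * d_next) / (w_prev + w_next)
```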
Most parts of the algorithm are well parallelizable and can make use of multicore
CPUs or GPGPU architectures. The implementation of CSH matching from Korman
and Avidan (2015) can be parallelized by the introduction of propagation tiles. This
decreases the area over which the found match is propagated and usually leads to an
increased number of necessary iterations. When we investigated the differences on
an MPI-Sintel testing dataset Butler et al. (2012), the differences were small even on
sequences with relatively large motions. To speed up CSH matching, we use only
16 Walsh-Hadamard kernels (as we mentioned earlier), scaled to short integer. This
allows the implementation of the computation of absolute differences using SSE and
intrinsic functions. The testing of candidates can be further parallelized in GPGPU
implementation. Our solution has parts running on CPU and parts running on
GPGPU. We tested our implementation on a PC with a Core i7 960 (3.2 GHz)
CPU and an Nvidia GTX 480 graphics card. We achieved running times of ~2 s/frame on a video with 960×540 resolution. The most time-consuming part of the computation is the CSH matching. For most experiments, we used three pyramid levels with half resolution between the levels, two iterations per level, σ(level,1) = 2, σ(level,2) = 1, and σ_V(level) = 0.05·3^(level−1). A slow Python implementation of the algorithm is given by Tolstaya (2020).
4.3.4 Results
In general, some situations remain difficult for propagation, such as low-contrast videos, noise, and small parts of moving objects, since in the latter case the background pixels inside a patch occupy the biggest part of the patch and contribute too much to the voting. However, when the background does not change substantially, small details can be tracked quite acceptably. An advantage of CSH matching is that it does not estimate true motion: objects in the query frame can be assembled from completely different patches, based only on their visual similarity to reference patches (Fig. 4.5).
The main motivation for the development of a depth propagation algorithm was
the elimination or suppression of the main artefacts of the previously used algorithm
based on optical flow. The main artefacts include depth leakage and depth loss.
Depth leakage can be caused either by the misalignment of key frame depth and
motion edge or an incorrect motion estimation result. The most perceptible artefacts
are noisy tracks of object depth that are left on the background after the foreground
object moves away. Depth loss is mostly caused by an error of motion estimation in
the case of fast motion or complex scene changes like occlusions, flares, reflections,
or semi-transparent objects in the foreground. Examples of such artefacts are shown in Fig. 4.7.
Fig. 4.6 Comparison of optical flow-based interpolation (solid line) with our new method (dashed line) for four different key frame distances (key frame distance is on the x-axis); PSNR comparison with the original depth. Top: ballet sequence; bottom: breakdance sequence
Fig. 4.7 Optical flow-based interpolation: an example of depth leakage and a small depth loss (right part of the head) in the case of fast motion of an object and flares
Figure 4.8 shows the output of the proposed algorithm. We compared the
performance of our method with optical flow-based interpolation on the MSR 3D
Video Dataset from Microsoft research (Zitnick et al. 2004). The comparison of the
interpolation error (as the PSNR from the original) is shown in Fig. 4.6. Figure 4.9
compares the depth maps of the proposed algorithm and the depth map computed
with motion vectors, with the depth overlain onto the source video frames. We can
see that in the case of motion vectors, small details can be lost. Other tests were done
on proprietary datasets with the ground truth depth from stereo (computed by the
method of Ignatov et al. 2009) or manually annotated.
Fig. 4.8 Our depth interpolation algorithm—an example of solved depth leakage and no depth loss
artefact
Fig. 4.9 Comparison of depth interpolation results—optical flow-based interpolation (top) and our
method (bottom)—an example of solved depth leakage and thin object depth loss artefacts. Left,
depth + video frame overlay; right, interpolated depth. The key frame distance used is equal to eight
From our experiments, we see that the proposed depth interpolation method has
on average better performance than interpolation based on optical flow. Usually,
finer details of depth are preserved, and the artefacts coming from the imperfect
alignment of the depth edge and the true edge of objects are less perceptible or
removed altogether. Our method is also capable of capturing faster motion. On the
other hand, optical flow results are more stable in the case of consistent and not too
fast motion, especially in the presence of a high level of camera noise, video
flickering, or a complex depth structure. The proposed method has a lot of param-
eters, and many of them were set up by intelligent guesswork. One of the future steps
might be a tuning of parameters on a representative set of sequences. Another way
forward could be a hybrid approach that merges the advantages of our method and
optical flow-based interpolation. Unfortunately, we were not able to find a public
dataset for the evaluation of depth propagation that is large enough and includes a
4.4.1 Introduction
Motion vectors provide helpful insight into video content. They are used for occlusion analysis and, during the background restoration process, to fill in occlusions produced by stereo rendering.
Motion vectors are the apparent motion of brightness patterns between two images, defined by the vector field u(x). Optical flow is one of the important but not yet fully solved problems in computer vision, and it is under constant development. Recent methods using ML techniques and precomputed cost volumes like
Teed and Deng (2020) or Zhao (2020) improve the performance in the case of fast
motion and large occlusions. At the time this material was prepared, the state-of-the-
art methods generally used variational approaches. The Teed and Deng (2020) paper
states that even the most modern methods are inspired by a traditional setup with a
balance between data and regularization terms. Teed and Deng (2020) even follow an iterative structure similar to first-order primal-dual methods from variational optical flow; however, they use learned updates implemented using convolutional layers. For computing optical flow, we decided to adapt the efficient primal-dual
optimization algorithm proposed by Chambolle and Pock (2011), which is suitable
for GPU implementation. The authors propose the use of total variation optical flow
with a robust L1 norm and extend the brightness constancy assumption by an
additional field to model brightness change. The main drawbacks of the base
algorithm are incorrect smoothing around motion edges and unpredictable behaviour
in occlusion areas. We extended the base algorithm to use colour information and
replaced TV-L1 regularization by a local neighbourhood weighting known as the
non-local smoothness term, proposed by Werlberger et al. (2010) and Sun et al.
(2010a). To fix the optical flow result in occlusion areas, we decided to use motion
inpainting, which uses nearby motion information and motion over-segmentation to
fill in unknown occlusion motion. Sun et al. (2010b) propose explicitly modelling
layers of motion to model the occlusion state. However, this leads to a non-convex
problem formulation that is difficult to optimize even for a small number of motion
layers.
As a trade-off between precision and computation time, we use two colour channels (Y′ and Cr of the Y′CbCr colour space) instead of three. A basic version of two-colour variational optical flow with an L1 norm smoothness term is

E(u, w) = ∫_Ω [ λ_L |I^L_2(x + u(x)) − I^L_1(x) + w(x)| + λ_C |I^C_2(x + u(x)) − I^C_1(x)| ] dx + E_S(u, w),
E_S(u, w) = ∫_Ω ( |∇u_1(x)| + |∇u_2(x)| + γ |∇w(x)| ) dx,
where Ω is the image domain; E is the minimized energy; u(x) = (u1(x), u2(x)) is the
motion field; w(x) is the field connected to the illumination change; λL and λC are
parameters that control the data term’s importance for luminance and colour chan-
nels, respectively; γ controls the regularization of the illumination change; ES is the
smoothness part of the energy that penalizes changes in u and w; I^L_1 and I^L_2 are the luminance components of the current and next frames; and I^C_1 and I^C_2 are the colour components of the current and next frames.
s(I_1, x, dx) = exp(−k_s |dx|) · exp(−|I_1(x) − I_1(x + dx)|² / (2 k_c²)),
s_n(I_1, x, dx) = s(I_1, x, dx) / Σ_{dx′∈Ψ} s(I_1, x, dx′),
where Ψ is the non-local neighbourhood (e.g. a 5×5 square with (0,0) in the
middle); s(I1, x, dx) and sn(I1, x, dx) are non-normalized and normalized non-local
weights, respectively; and ks and kc are parameters that control the non-local
weights’ response.
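A direct per-pixel transcription of the non-local weights s and s_n for a greyscale image; the parameter values and the dictionary interface are illustrative:

```python
import numpy as np

def nonlocal_weights(img, x, y, radius=2, ks=0.5, kc=10.0):
    """Normalized non-local smoothness weights around pixel (x, y).

    img: 2D greyscale image. Returns a dict mapping offset (dx, dy)
    to its normalized weight s_n, combining a spatial falloff
    exp(-ks*|dx|) and a colour similarity exp(-|dI|^2 / (2*kc^2)).
    """
    h, w = img.shape
    weights = {}
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            xx, yy = x + dx, y + dy
            if 0 <= xx < w and 0 <= yy < h:
                colour = np.exp(-(img[y, x] - img[yy, xx]) ** 2 / (2 * kc ** 2))
                spatial = np.exp(-ks * np.hypot(dx, dy))
                weights[(dx, dy)] = spatial * colour
    total = sum(weights.values())
    return {d: v / total for d, v in weights.items()}
```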
In the original formulation of the non-local smoothness term, the size of the local
window determines the number of weights and dual variables per pixel. Thus, for
example, if the window size is 5×5, then for each pixel we need to store 50 dual
variables and 25 weights in memory. Considering that all of these variables are used
in every iteration of optimization, a large number of computations and memory
transfers are required. To overcome this problem, we devised a computation simpli-
fication to decrease the number of non-local weights and dual variables. The idea is
to decrease the non-local neighbourhood for motion and use a larger part of the
image for weights. The way that we use a 3×3 non-local neighbourhood with 5×5 image information is described by the following formula:
4.4.6 Solver
Our solver is based on Algorithm 4.1 from Chambolle and Pock (2011). According
to this algorithm, we derived the iteration scheme for the non-local smoothness term
in optical flow. In order to get a convex formulation for optimization, we need to
linearize the data term ED for both channels, Y and Cr. The linearization uses the
current state of motion field u0; derivatives are approximated using the following
scheme:
where I1 is the image from which motion is computed, I2 is the image to which
motion is computed, IT is the image time derivative estimation, and Ix and Iy are
image spatial derivative estimations. It is also possible to derive a three colour
channel version, but the computational complexity increase is considerable. Full
implementation details can be found in Pohl et al. (2014).
where n is the pyramid level (zero n means the finest level and hence the highest
resolution), λ is one of the regularization parameters that is a function of n, and nramp
is the start of the linear ramp. The parameters λ_coarsest and λ_finest define the initial and
final λ value.
The variational optical flow formulation does not explicitly handle occlusion
areas. The estimated motion in occlusion areas is incorrect and usually follows a
match to the nearest patch of similar colour. However, if the information for visible
pixels is correct, it is possible to find occlusion areas using the motion from the
nearest frames to the current frame, as shown in Fig. 4.10. Object 401 on the moving
background creates occlusion 404. The precise computation of occlusion areas uses the inverse of bilinear interpolation and thresholding.
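A common forward-backward consistency check can stand in as a sketch of this occlusion detection; nearest-neighbour sampling replaces the (inverse) bilinear interpolation of the actual method, and the threshold is illustrative:

```python
import numpy as np

def occlusion_mask(flow_fwd, flow_bwd, threshold=1.0):
    """Mark pixels where forward and backward motion disagree.

    flow_fwd: HxWx2 motion from the current frame to a neighbour frame.
    flow_bwd: HxWx2 motion from that neighbour back to the current frame.
    A pixel is flagged as occluded when following the forward motion and
    then the backward motion does not return near the starting point.
    """
    h, w = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where does each pixel land in the neighbour frame?
    tx = np.clip(np.rint(xs + flow_fwd[..., 0]), 0, w - 1).astype(int)
    ty = np.clip(np.rint(ys + flow_fwd[..., 1]), 0, h - 1).astype(int)
    round_trip = flow_fwd + flow_bwd[ty, tx]
    return np.hypot(round_trip[..., 0], round_trip[..., 1]) > threshold
```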
where Ω is the image domain, u23(x) and u21(x) are motion fields from the central to
the next and central to the previous frame respectively, M(θ, x) is the motion given
by the model with parameters θ at the image point x, and k is the parameter that
weights the dispersion of the misfit. J is the evaluation function: higher values of this
function give better candidates. This evaluation still needs a parameter k that acts as
the preferred dispersion. However, it will still give quite reasonable results, even
when all clusters have high dispersions. After the evaluation stage, misfit histogram
analysis is done in order to find the first mode of the misfit. We search for the first local minimum in a histogram smoothed by convolution with a Gaussian, because the first local minimum in an unsmoothed histogram is too sensitive to noise. Pixels that have a
misfit below three times the standard deviation of the first mode are deemed pixels
belonging to the examined cluster.
After the joint model fit, single direction occlusion areas are added if the local
motion is in good agreement with the fitted motion model. Our experiments show
that the best over-clustering results were achieved using the similarity motion model:
(u_1, u_2, 1)^T = M(θ, x) = [ sR  t ; 0 0 1 ] (x_1, x_2, 1)^T,

where u_1 and u_2 are motion vectors, x_1 and x_2 are the coordinates of the original point, t = (t_1, t_2) is the translation, R is an orthonormal 2×2 rotation matrix, s is the scaling coefficient, and θ = (R, s, t) are the parameters of the model M.
In order to assign clusters to areas which are marked as occlusions in both
directions, we use a clustering inpainting algorithm. The algorithm searches for
every occluded pixel and assigns it a cluster number using a local colour similarity
assumption:
$$W(x, k) = \sum_{y \in \Omega} \begin{cases} \exp\!\left(-\dfrac{|I_1(x) - I_1(y)|^2}{2\sigma^2}\right), & C(y) = k \\ 0, & C(y) \neq k \end{cases}$$

$$C(x) = \arg\max_k W(x, k),$$
where Ω is the local neighbourhood domain, I1(x) is the current image pixel with
coordinates x, C(x) is the cluster index of the pixel with coordinates x, W(x, k) is the
weight of the cluster k for a pixel with coordinates x, and σ is the parameter
controlling the colour similarity measure of current and neighbourhood pixels.
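The voting rule above can be sketched as follows. This is an illustrative sketch; the window radius and the use of −1 to mark unlabelled (occluded) pixels are our assumptions, not details from the text.

```python
import numpy as np

def assign_occluded_pixel(I1, C, x, K, radius=5, sigma=10.0):
    """Assign a cluster index to occluded pixel x by colour similarity
    to already-labelled neighbours (the W(x, k) voting rule)."""
    h, w = C.shape
    y0, y1 = max(0, x[0] - radius), min(h, x[0] + radius + 1)
    x0, x1 = max(0, x[1] - radius), min(w, x[1] + radius + 1)
    weights = np.zeros(K)
    for i in range(y0, y1):
        for j in range(x0, x1):
            k = C[i, j]
            if k >= 0:  # -1 marks unlabelled (occluded) pixels
                d2 = float(np.sum((I1[x[0], x[1]] - I1[i, j]) ** 2))
                weights[k] += np.exp(-d2 / (2 * sigma**2))
    return int(np.argmax(weights))
```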
4 Semi-Automatic 2D to 3D Video Conversion 101
4.4.9 Results
The results of the clustering are the function C(x) and the list of models M_C(θ, x). This result is used to generate the unknown motion in occlusion areas using the motion model of the cluster to which the occlusion pixel was added. An example of a motion clustering result is shown in Fig. 4.11.
Most of the testing of our results was done on scene cuts from feature films to assess "real-life" performance. The main problem with this evaluation is that the ground-truth motion is unavailable, so manual inspection of the results is the only method we can use. For quantitative evaluation, as well as a comparison with the state of the art, we used the well-known Middlebury optical flow evaluation database of Baker et al. (2011) and adopted the colouring scheme used by the Middlebury benchmark.
Motion estimation results are stable over a wide range of values of the regularization parameter lambda, as can be seen from their errors against ground truth in Fig. 4.12. Results with lower values of lambda are oversmoothed, whereas too high a value of lambda causes considerable noise in the motion response, as the algorithm tries to fit the noise in the input images.
The non-local neighbourhood smoothness term improves the edge response in the constructed motion field. The simplified version can be seen as a relaxation of full non-local processing, and the quality of its results lies between TV-L1 regularization and the full non-local neighbourhood. An example of motion edge behaviour for these approaches is demonstrated in Fig. 4.13. The simplified non-local term decreases the smoothing around the motion edge but still creates a motion artefact not aligned with the edge of the object. This unwanted behaviour is usually caused by fast motion or by a lack of texture on one of the motion layers. On the testing set of the Middlebury benchmark, we found the difference between the simplified and full non-local neighbourhoods to be insignificant, presumably because the dataset contains only small or moderate motion and usually rather well-textured layers. A comparison is shown in Fig. 4.14, with a comparison of errors shown in Fig. 4.15.
The proposed method of motion estimation preserves sharp motion edges and attempts to correct erroneous motion in occlusion areas. We also presented a way to relax the underlying model that allows a considerable speedup of the computation. Our motion estimation method was ranked in the top 20 of all algorithms on the Middlebury
Fig. 4.11 Motion inpainting result on a cut of the Grove3 sequence from the Middlebury benchmark: clustering result overlaid on the grey image (left), non-inpainted motion (centre), motion after occlusion inpainting (right)
102 P. Pohl and E. V. Tolstaya
Fig. 4.12 Average endpoint and angular errors on the Middlebury testing sequence as the regularization parameter is varied
Fig. 4.13 Motion results on Dimetrodon frame 10 for lambda equal to 1, 5, 20, 100
Fig. 4.14 Comparison of TV-L1 (left), 3 × 3 neighbourhood with special weighting (middle), and full 5 × 5 non-local neighbourhood (right) optical flow results; the top row is the motion colourmap, and in the bottom row it is overlaid on the greyscale image to demonstrate the alignment of motion and object edges
Fig. 4.15 Comparison of our 3 × 3 non-local neighbourhood with special weights and the 5 × 5 neighbourhood optical flow result on the Middlebury testing sequence
optical flow dataset at the time, and only one other method reported a better processing time on the “Urban” sequence.
4.5.1 Introduction
The task of the background inpainting step is to recover occluded parts of video
frames to use later during stereo synthesis and to fill occlusion holes. We have an
input video sequence and a corresponding sequence of masks (for every frame of the
video) that denote areas to be inpainted (a space-time hole). The goal is to recover
background areas denoted (covered) by object masks so that the result is visually
plausible and temporally coherent. A lot of attention has been given to the area of
still image inpainting. Notable methods include diffusion-based approaches, such as
Telea (2004) and Bertalmio (2000), and texture synthesis or exemplar-based algo-
rithms, for example, Criminisi et al. (2004). Video inpainting imposes additional
temporal restrictions, which make it more computationally expensive and
challenging.
Generally, most video inpainting approaches fall into two broad groups: global methods, and local methods with subsequent filtering. Exemplar-based methods for still images can be naturally extended to video as a global optimization problem of filling the space-time hole.
Wexler et al. (2007) define a global method using patch similarity for video
completion. The article reports satisfactory results, but the price for global
optimization is an extreme computational cost: they report several hours of computation for a very short, low-resolution video (100 frames of 340 × 120 pixels). A similar approach was taken by Shiratori et al. (2006), who suggest a procedure called motion field transfer to estimate motion inside a space-time hole. Motion is filled in with a patch-based approach using a special similarity measure. The recovered motion allows the authors to inpaint a video sequence while maintaining temporal coherence. However, the performance is also quite low (~40 minutes for a 60-frame video of 352 × 240). Bugeau et al. (2010) propose inpainting frames
independently and filtering the results by Kalman filtering along point trajectories found by a dense optical flow algorithm. The slowest operation of this approach is the computation of the optical flow. Their method produces a visually consistent result, but the inpainted region is usually smoothed, with occasional temporal artefacts.
In our work (Pohl et al. 2016), we decided not to use a global optimization approach, in order to make the algorithm more computationally efficient and to avoid the loss of small details in the restored background video. The proposed algorithm consists of three well-defined steps:
1. Restoration of background motion in foreground regions, using motion vectors
2. Temporal propagation of background image data using the restored background
motion from step #1
3. Iterative spatial inpainting with temporal propagation of inpainted image data
Background motion is restored by computing and combining several motion
estimates based on optical flow motion vectors. Let black contour A be the edge
of a marked foreground region (see Fig. 4.16a). Outside A, we will be using motion
produced by the optical flow algorithm, M0(x, y). In area B, we obtain a local
background motion estimate M1(x, y). This local estimate can be produced, for
example, by running a diffusion-like inpainting algorithm, such as Telea (2004),
Fig. 4.16 Background motion estimation: (a) areas around object’s mask; (b) example of back-
ground motion estimation
on M0(x, y) inside region B. Next, we use M0(x, y) from area C to fit the parameters
a_0, ..., a_5 of an affine global motion model:

$$M_2(x, y) = \begin{pmatrix} a_0 + a_1 x + a_2 y \\ a_3 + a_4 x + a_5 y \end{pmatrix}.$$
W(x, y) is defined as an exponential decay (with a reasonable decay-rate parameter, in pixels) of the distance from the foreground region edge A. It equals 1 outside contour A, and the distance is computed by an efficient distance transform algorithm.
As a result, we obtain a full-frame per-pixel estimate of background motion that
generally has the properties of global motion inside previously missing regions but
does not suffer from discontinuity problems around the foreground region edges
(Fig. 4.16b).
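The combination step can be sketched as follows. The blending rule M = W·M1 + (1 − W)·M2 inside the mask is our reading of the text (the exact combination formula is not given), and the brute-force distance transform is a small stand-in for an efficient algorithm:

```python
import numpy as np

def distance_to_background(mask):
    """Brute-force Euclidean distance from each pixel to the nearest
    background (mask == False) pixel; fine for small illustrative masks."""
    bg = np.argwhere(~mask)
    pts = np.argwhere(np.ones_like(mask, dtype=bool))
    d = np.sqrt(((pts[:, None, :] - bg[None, :, :]) ** 2).sum(-1)).min(1)
    return d.reshape(mask.shape)

def restore_background_motion(M0, M1, M2, fg_mask, decay=10.0):
    """Full-frame background motion estimate: blend the local estimate M1
    and the global affine estimate M2 inside the foreground mask with the
    weight W = exp(-dist / decay); keep the optical flow M0 outside."""
    W = np.exp(-distance_to_background(fg_mask) / decay)
    M = W[..., None] * M1 + (1.0 - W[..., None]) * M2
    M[~fg_mask] = M0[~fg_mask]  # W == 1 outside contour A: keep M0 there
    return M
```

Near the edge A the local estimate M1 dominates; deep inside the region the result approaches the global affine model M2, which matches the described behaviour.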
The second step—temporal propagation—is the crucial part of our algorithm. The
forward and backward temporal passes are symmetrical, and they fill in the areas that
were visible on other frames of video. We do a forward and backward pass through
the video sequence using integrated (accumulated) motion in occluded regions. In
the forward pass, we integrate motion in the backward direction and decide which
pixels can be filled by data from the past. The same is done for the backward pass to
fill in pixels from future frames. After forward and backward temporal passes, we
can still have some areas that were not inpainted (unfilled areas). These areas were
not seen during the entire video clip. We use still image spatial inpainting to fill in the
missing data in a selected frame and propagate it using restored background motion
to achieve temporal consistency.
Let us introduce the following notation:
– I(n, x) is the nth input frame.
– M(m, n, x) is the restored background motion from frame m to frame n.
– F(n, x) is the input foreground mask for frame n (the area to be inpainted).
– Q_F(n, x) is the inpainted-area mask for frame n.
– I_F(n, x) is the inpainted frame n.
– T is the temporal window size (an algorithm parameter).
$$I_F(N_{curr}, x) = I(N_{curr}, x), \qquad Q_F(N_{curr}, x) = 0$$
Algorithm 4.3 is a greedy algorithm that inpaints the background with data from the temporally least distant frame. The backward temporal pass performs the same operations, only with a reversed order of iterations and reversed motion directions. Some areas can be filled from both sides; in this case, we need a single inpainting solution, so we use the temporally closer source. In the case of equal temporal distance, a blending procedure is applied based on the distance from the non-inpainted area.
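The forward pass can be sketched as follows. This is our reading of the greedy procedure, not the authors' Algorithm 4.3; the `motion(m, n, p)` helper, which maps point `p` from frame m to frame n using the integrated background motion, is an assumed interface.

```python
import numpy as np

def forward_temporal_pass(frames, fg_masks, motion, T=10):
    """Greedy forward pass: fill each masked pixel from the temporally
    nearest past frame where the integrated motion lands on a filled pixel."""
    N = len(frames)
    out = [f.copy() for f in frames]
    filled = [~m for m in fg_masks]          # visible pixels count as filled
    for n in range(N):
        for p in np.argwhere(fg_masks[n]):
            for dt in range(1, min(T, n) + 1):   # least distant frame first
                q = np.round(motion(n, n - dt, p)).astype(int)
                if (0 <= q[0] < frames[n].shape[0] and
                        0 <= q[1] < frames[n].shape[1] and
                        filled[n - dt][q[0], q[1]]):
                    out[n][p[0], p[1]] = out[n - dt][q[0], q[1]]
                    filled[n][p[0], p[1]] = True
                    break
    return out, filled
```

The backward pass would run the same loop with the frame order and motion directions reversed.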
The third step is the spatial pass. The goal of the spatial pass is to inpaint regions
that were not filled by temporal passes. To achieve temporal stability, we inpaint on a
selected frame and propagate inpainted data temporally. We found that a reasonable
strategy is to find the largest continuous unfilled area, use spatial inpainting to fill it
in, and then propagate through the whole sequence using a background motion
estimate. It is necessary to perform spatial inpainting with iterative propagation until
all unfilled areas are inpainted. Any spatial inpainting algorithm can be used; in our
work, we experimented with exemplar-based and diffusion-based methods. Our
experience shows that it is better to use a diffusion algorithm for filling small or thin areas, while an exemplar-based algorithm is better for larger unfilled parts.
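As an illustration of the diffusion option, a crude diffusion-style fill can be sketched as follows. This is a stand-in for a proper algorithm such as Telea (2004), not the authors' implementation:

```python
import numpy as np

def diffusion_inpaint(image, mask, iters=200):
    """Simple diffusion-style fill: iteratively replace masked pixels by
    the mean of their 4-neighbours (np.roll wraps at borders, so this
    sketch assumes the hole does not touch the image boundary)."""
    out = image.astype(float).copy()
    out[mask] = out[~mask].mean()               # rough initialisation
    for _ in range(iters):
        avg = (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
               np.roll(out, 1, 1) + np.roll(out, -1, 1)) / 4.0
        out[mask] = avg[mask]                   # update only masked pixels
    return out
```

In the scheme above, such a diffusion fill would handle small or thin components, with an exemplar-based routine (e.g. Criminisi et al. 2004) handling the larger ones.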
Let us denote QFB(n, x) and IFB(n, x) as the temporally inpainted mask and the
background image, respectively, after forward and backward passes are blended
together. QS(n, x) and IS(n, x) are the mask and the image after temporal and spatial
inpainting. |D| stands for the number of pixels in the image domain D.
$$Q_S^{new}(N_{dst}, x) = F(N_{dst}, x) \cap \overline{Q_S(N_{dst}, x)} \cap Q_S(N_{curr}, x + M(N_{dst}, N_{curr}, x))$$
4.5.4 Results
Fig. 4.18 Background restoration quality: (a) simple motion; (b) affine motion
1. Simple motion. The background and foreground are moved by two different
randomly generated motions, which include only rotation and shift.
2. Affine motion. The background and foreground are moved by two different
randomly generated affine motions.
Evaluation was done by running the proposed algorithm with default parameters
on synthetic test sequences (an example is shown in Fig. 4.17). We decided to use
the publicly available implementation of an exemplar-based inpainting algorithm by
Criminisi et al. (2004) to provide a baseline for our approach. Our goal was to test
both the quality of background inpainting for each frame and the temporal stability
of the resulting sequence. For the evaluation of the background inpainting quality,
we use the PSNR between the algorithm’s output and the ground truth background.
Results are shown in Fig. 4.18.
To measure temporal stability, we use the following procedure. Let I(n, x) and I(n + 1, x) be a pair of consecutive frames with inpainted backgrounds, and let M_GT(n, n + 1, x) be the ground-truth motion from frame n to frame n + 1. We compute the PSNR between I(n, x) and I(n + 1, x + M_GT(n, n + 1, x)) (sampling is done using bicubic interpolation). The results are shown in Fig. 4.19.
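The stability metric can be sketched as follows. For brevity this sketch uses nearest-neighbour sampling rather than the bicubic interpolation used in the text:

```python
import numpy as np

def warp_nearest(frame, motion):
    """Sample frame at x + motion(x); motion[..., 0] is the y-component
    and motion[..., 1] the x-component (nearest-neighbour sampling)."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    yy = np.clip(np.round(ys + motion[..., 0]).astype(int), 0, h - 1)
    xx = np.clip(np.round(xs + motion[..., 1]).astype(int), 0, w - 1)
    return frame[yy, xx]

def temporal_stability_psnr(f_n, f_n1, gt_motion, peak=255.0):
    """PSNR between frame n and frame n+1 warped back by the ground-truth
    motion; higher values indicate a more temporally stable result."""
    warped = warp_nearest(f_n1, gt_motion)
    mse = np.mean((f_n.astype(float) - warped.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```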
As we can see, when analysed statistically, our inpainting algorithm usually provides slightly better background restoration quality than exemplar-based inpainting. More importantly, it produces far more temporally stable output, which is crucial for video inpainting.
We applied our algorithm to a proprietary database of videos at two resolutions: 1920 × 1080 (FHD) and 960 × 540 (quarter HD, qHD). The method shows reasonable quality for scenes without changes in scene properties
Fig. 4.19 Temporal stability: (a) simple motion; (b) affine motion
(brightness change, different focus, changing fog or lights) and with rigid scene
structure. A few examples of our algorithm outputs are shown in Fig. 4.20.
Typical visible artefacts include misalignments at the edges of the temporal inpainting direction (presumably in cases when the motion integration is not precise enough) and mixing of the same part of a scene whose appearance changed over time. Examples of such artefacts are shown in Fig. 4.21.
The running time of the algorithm (not including optical flow estimation) is
around 1 s/frame for qHD and 3 s/frame for FHD sequences on a PC with a single
GPU. The results are quite acceptable for the restoration of areas occluded due to
Fig. 4.21 Typical artefacts: misalignment (top row) and scene change artefacts (bottom row)
stereoscopic parallax for a limited range of scenes, but for wider applicability it is necessary to decrease the level of artefacts. A more advanced analysis of propagation reliability could be applied for a better decision between spatial and temporal inpainting, and the choice of temporal inpainting direction could use an analysis of the level of scene change. In addition, it may be useful to improve the alignment of temporally inpainted parts from different time moments, or to apply the Poisson seamless stitching approach of Pérez et al. (2003) where there are overlapping parts.
Straight line detection is performed with the Hough transform and a voting procedure. Patches along the lines are propagated inside the hole, and the rest of the hole is filled with a conventional method, such as the one described by Criminisi et al. (2004). Figure 4.22 illustrates the steps of the algorithm.
A large area of research is devoted to stereo quality estimation. Poor stereo leads to a bad viewing experience, headaches, and eye fatigue. That is why it is important to review media content and, where possible, eliminate production or post-production mistakes. The main causes of eye fatigue are listed in Chap. 3, and content quality is among them.
The most common quality issues that can be present in stereo films (including
even well-known titles with million-dollar budgets) include the following:
• Mixed left and right views
• Stereo view rotation
• Different sizes of objects
• Vertical disparity
• Temporal shift between views
• Colour imbalance between views
• Sharpness difference
References
Appia, V., Batur, U.: Fully automatic 2D to 3D conversion with aid of high-level image features. In:
Stereoscopic Displays and Applications XXV, vol. 9011, p. 90110W (2014)
Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation
methodology for optical flow. Int. J. Comp. Vision. 92(1), 1–31 (2011)
Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proceedings of the 27th
Annual Conference on Computer graphics and Interactive Techniques, p. 417 (2000)
Bugeau, A., Piracés, P.G.I., d'Hondt, O., Hervieu, A., Papadakis, N., Caselles, V.: Coherent
background video inpainting through Kalman smoothing along trajectories. In: Proceedings of
2010–15th International Workshop on Vision, Modeling, and Visualization, p. 123 (2010)
Butler D.J., Wulff J., Stanley G.B., Black M.J.: A Naturalistic Open Source Movie for Optical Flow
Evaluation. In: Fitzgibbon A., Lazebnik S., Perona P., Sato Y., Schmid C. (eds) Computer
Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, vol 7577. Springer,
Berlin, Heidelberg (2012)
Cao, X., Li, Z., Dai, Q.: Semi-automatic 2D-to-3D conversion using disparity propagation. IEEE
Trans. Broadcast. 57, 491–499 (2011)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications
to imaging. J. Math. Imag. Vision. 40(1), 120–145 (2011)
Criminisi, A., Pérez, P., Toyama, K.: Region filling and object removal by exemplar-based image
inpainting. IEEE Trans. Image Process. 13(9), 1200–1212 (2004)
Feng, J., Ma, H., Hu, J., Cao, L., Zhang, H.: Superpixel based depth propagation for semi-automatic
2D-to-3D video conversion. In: Proceedings of IEEE Third International Conference on Net-
working and Distributed Computing, pp. 157–160 (2012)
Feng, Z., Chao, Z., Huamin, Y., Yuying, D.: Research on fully automatic 2D to 3D method based
on deep learning. In: Proceedings of the IEEE 2nd International Conference on Automation,
Electronics and Electrical Engineering, pp. 538–541 (2019)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with
applications to image analysis and automated cartography. Commun. ACM. 24(6), 381–395
(1981)
Harman, P.V., Flack, J., Fox, S., Dowley, M.: Rapid 2D-to-3D conversion. In: Stereoscopic
displays and virtual reality systems IX International Society for Optics and Photonics, vol.
4660, pp. 78–86 (2002)
Ignatov, A., Bucha, V., Rychagov, M.: Disparity estimation in real-time 3D acquisition and
reproduction system. In: Proceedings of the International Conference on Computer Graphics
«Graphicon 2009», pp. 61–68 (2009)
Irony, R., Cohen-Or, D., Lischinski, D.: Colorization by example. In: Proceedings of the Sixteenth
Eurographics conference on Rendering Techniques, pp. 201–210 (2005)
Korman, S., Avidan, S.: Coherency sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell. 38
(6), 1099–1112 (2015)
Mueller, M., Zilly, F., Kauff, P.: Adaptive cross-trilateral depth map filtering. In: Proceedings of the
IEEE 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video,
pp. 1–4 (2010)
Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. In: ACM SIGGRAPH Papers,
pp. 313–318 (2003)
Pohl, P., Molchanov, A., Shamsuarov, A., Bucha, V.: Spatio-temporal video background
inpainting. Electron. Imaging. 15, 1–5 (2016)
Pohl, P., Sirotenko, M., Tolstaya, E., Bucha, V.: Edge preserving motion estimation with occlusions
correction for assisted 2D to 3D conversion. In: Image Processing: Algorithms and Systems XII,
9019, pp. 901–906 (2014)
Shiratori, T., Matsushita, Y., Tang, X., Kang, S.: Video completion by motion field transfer. In:
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 1, p. 411 (2006)
Sun, J., Xie, J., Li, J., Liu, W.: A key-frame selection method for semi-automatic 2D-to-3D
conversion. In: Zhang, W., Yang, X., Xu, Z., An, P., Liu, Q., Lu, Y. (eds.) Advances on Digital
Television and Wireless Multimedia Communications. Communications in Computer and
Information Science, vol. 331. Springer, Berlin, Heidelberg (2012)
Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: Pro-
ceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recogni-
tion, pp. 2432–2439 (2010a)
Sun, D., Sudderth, E., Black, M.: Layered image motion with explicit occlusions, temporal
consistency, and depth ordering. In: Proceedings of the 24th Annual Conference on Neural
Information Processing Systems, pp. 2226–2234 (2010b)
Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: Proceedings of the
European Conference on Computer Vision, pp. 402–419 (2020)
Telea, A.: An image inpainting technique based on the fast marching method. J. Graph. Tools. 9
(1) (2004)
Tolstaya E.: Implementation of Coherency Sensitive Hashing algorithm. (2020). Accessed on
03 October 2020. https://github.com/ktolstaya/PyCSH
Tolstaya, E., Hahn, S.-H.: Method and system for selecting key frames from video sequences. RU
Patent 2,493,602 (in Russian) (2012)
Tolstaya, E., Pohl, P., Rychagov, M.: Depth propagation for semi-automatic 2d to 3d conversion.
In: Proceedings of SPIE Three-Dimensional Image Processing, Measurement, and Applications,
vol. 9393, p. 939303 (2015)
Varekamp, C., Barenbrug, B.: Improved depth propagation for 2D to 3D video conversion using
key-frames. In: Proceedings of the 4th European Conference on Visual Media Production
(2007)
Vatolin, D.: Why Does 3D Lead to the Headache? / Part 8: Defocus and Future of 3D (in Russian)
(2019). Accessed on 03 October 2020. https://habr.com/ru/post/472782/
Vatolin, D., Bokov, A., Erofeev, M., Napadovsky, V.: Trends in S3D-movie quality evaluated on
105 films using 10 metrics. Electron. Imaging. 2016(5), 1–10 (2016)
Vatolin, D.: Why Does 3D Lead to the Headache? / Part 2: Discomfort because of Video Quality (in
Russian) (2015a). Accessed on 03 October 2020. https://habr.com/en/post/377709/
Vatolin, D.: Why Does 3D Lead to the Headache? / Part 4: Parallax (in Russian) (2015b). Accessed
on 03 October 2020. https://habr.com/en/post/378387/
Voronov, A., Vatolin, D., Sumin, D., Napadovsky, V., Borisov, A.: Methodology for stereoscopic
motion-picture quality assessment. In: Proceedings of SPIE Stereoscopic Displays and Appli-
cations XXIV, vol. 8648, p. 864810 (2013)
Wang, D., Liu, J., Sun, J., Liu, W., Li, Y.: A novel key-frame extraction method for semi-automatic
2D-to-3D video conversion. In: Proceedings of the IEEE international Symposium on Broad-
band Multimedia Systems and Broadcasting, pp. 1–5 (2012)
Werlberger, M., Pock, T., Bischof, H.: Motion estimation with non-local total variation regulari-
zation. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pp. 2464–2471 (2010)
Wexler, Y., Shechtman, E., Irani, M.: Space-time completion of video. IEEE Trans. Pattern Anal.
Mach. Intell. 29(3), 463–476 (2007)
Xie, J., Girshick, R., Farhadi, A.: Deep3d: Fully automatic 2d-to-3d video conversion with deep
convolutional neural networks. In: Proceedings of the European Conference on Computer
Vision, pp. 842–857 (2016)
Yuan, H.: Robust semi-automatic 2D-to-3D image conversion via residual-driven optimization.
EURASIP J. Image Video Proc. 1, 66 (2018)
Zhao, S., Sheng, Y., Dong, Y., Chang, E., Xu, Y.: MaskFlownet: asymmetric feature matching with
learnable occlusion mask. In: Proceedings of the CVPR, vol. 1, pp. 6277–6286 (2020)
Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-quality video view
interpolation using a layered representation. ACM Transactions on Graphics. 23(3) (2004)
Chapter 5
Visually Lossless Colour Compression
Technology
Michael N. Mishourovsky
5.1.1 Introduction
Modern video processing systems, such as TV devices, mobile devices, video coders/decoders, and surveillance equipment, process large amounts of data: high-quality colour video sequences with resolutions up to UHD at high frame rates, with HDR support (high bit depth per colour component). Real-time multistage digital video processing usually requires high-speed data buses and large intermediate buffers, which lead to increased cost and power consumption. Although the latest achievements in RAM design make it possible to store a lot of data, reducing bandwidth requirements remains a desirable and challenging task; moreover, for some applications bandwidth can become a bottleneck because of the balance between cost and technological advances in a particular product.
One possible approach to this issue is to represent video streams in a compact (compressed) format that preserves visual quality, allowing a small level of losses as long as the visual quality does not suffer. This category of algorithms is called embedded or texture memory compression. It is usually implemented in hardware (HW) and should provide very high visual fidelity, a reasonable compression ratio, and low HW complexity. It should impose no strong limitations on data fetching, and it should be easy to integrate into synchronous parts of application-specific integrated circuits (ASICs) or systems on chip (SoCs).
In this chapter, we describe in detail the so-called visually lossless colour
compression technology (VLCCT), which satisfies the abovementioned technical
M. N. Mishourovsky (*)
Huawei Russian Research Institute, Moscow, Russia
e-mail: mmishourovsky@gmail.com
requirements. We also touch on the question of visual quality evaluation (and what
visual quality means) and present an overview of the latest achievements in the field
of embedded compression, video compression, and quality evaluations.
One of the key advantages of the described technology is that it uses only one row of memory for buffering, works with extremely small blocks, and provides excellent subjective and objective image quality at a fixed compression ratio, which is critical for random-access data fetching. It also introduces no inter-block dependency, which effectively prevents error propagation. According to estimates made during the technology transfer, the technology requires only a small number of equivalent physical gates and can be widely adopted in various HW video processing pipelines.
The approach underlying VLCCT is not new: it is applied in GPUs; OpenGL and Vulkan support several algorithms for effective texture compression; and operating systems such as Android and iOS both support several techniques for encoding and decoding textures.
During the initial stage of the development, several prototypes were identified
including the following: algorithms based on decorrelating transformations (DCT,
wavelet, differential-predictive algorithms) combined with entropy coding of differ-
ent kinds; the block truncation encoding family of algorithms; block palette
methods; and the vector quantization method. Mitsubishi developed a technology
called fixed block truncation coding by Matoba et al. (1998), Takahashi et al. (2000),
and Torikai et al. (2000) and even produced an ASIC implementing this algorithm
for the mass market. It provides fixed compression with relatively high quality. Fuji
created an original method called Xena (Sugita and Watanabe 2007), Sugita (2007),
which was rather effective and innovative, but provided only lossless compression.
In 1992, a paper was published (Wu and Coll 1992) suggesting an interesting
approach which is essentially targeted to create an optimal palette for a fixed block
to minimize the maximum error. The industry adopted several algorithms from S3 –
so-called S3 Texture Compression (DXT1 ~ DXT5), and ETC/ETC1/ETC2 were
developed by Ericsson. ETC1 is currently supported by Android OS (see the
reference) and is included in the specification of OpenGL – Methods for encoding
and decoding ETC1 textures (2019). PowerVR Texture Compression was designed
for graphic cores and then patented by Imagination Technologies. Adaptive Scalable
Texture Compression (ASTC) was jointly developed by ARM and AMD and
presented in 2012 (which is several years after VLCCT was invented).
In addition to the abovementioned well-known techniques, the following tech-
nologies were reviewed: Strom and Akenine-Moeller (2005), Mitchell and Delp
(1980), Jaspers and de With (2002), Odagiri et al. (2007), and Lee et al. (2008). In
5 Visually Lossless Colour Compression Technology 117
general, all these technologies provide compression ratios from ~1.5 to 6 times, with different quality of decompressed images, different complexity, and various optimization levels. However, most of them do not provide the required trade-off between algorithmic and HW complexity (most of these algorithms require several buffered rows, from 2 to 4 or even more), visual quality, and the other requirements behind VLCCT.
To conclude this part, the reader may consult the following sources, which provide a thorough overview of different texture compression methods: Vulkan SDK updated by Paris (2020), ASTC Texture Compression (2019), and Paltashev and Perminov (2014).
Based on the business needs, the limitations on HW complexity, and the acceptance criteria for visual quality, the following requirements were defined (Table 5.1):
Most compression technologies rely on some sort of redundancy elimination mechanism; these can essentially be classified as follows:
1. Visual redundancy (caused by human visual system perception).
2. Colour redundancy.
3. Intra-frame and inter-frame redundancy.
4. Statistical redundancy, attributed to the probabilistic nature of the elements in the stream (a stream is not random), which can be exploited by different methods, such as:
• Huffman encoding and other prefix codes
• Arithmetic encoding – initial publication by Rissanen and Langdon (1979)
• Dictionary-based methods
• Context adaptive binary arithmetic encoding and others
Usually, some sort of data transformations is used:
• Discrete cosine transform (DCT), an orthogonal transform with normalized basis functions. The DCT approximately decorrelates natural images (removing linear relations in the data, approximating PCA for natural images) and is usually applied at the block level.
• Different linear prediction schemes with quantization (adaptive differential pulse-code modulation, ADPCM), applied to effectively reduce the self-similarity of images.
However, a transform or prediction itself does not provide compression; its
coefficients must be further effectively encoded. To accomplish this goal, classical
compression schemes include entropy coding, reducing the statistical redundancy.
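As an illustration of prefix-code entropy coding (mentioned above as Huffman encoding; this is a generic sketch, not part of VLCCT itself), a minimal Huffman code builder can be written as:

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix (Huffman) code for the symbols in `data`:
    repeatedly merge the two least frequent subtrees, prepending
    '0'/'1' to the codewords of each side."""
    heap = [(n, i, {s: ''}) for i, (s, n) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + b for s, b in c1.items()}
        merged.update({s: '1' + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, i, merged))
        i += 1
    return heap[0][2]

codes = huffman_code("aaaabbc")
# frequent symbols get shorter codewords than rare ones
```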
As HW complexity was a critical limiting factor, it was decided to consider a rather simple pipeline, which prohibited the collection of a huge amount of statistics and the provision of long-term adaptation (as the CABAC engine provides in modern H.264 and H.265 codecs). Instead, a simple approach with a one-row buffer and a sliding window was adopted. The high-level diagram of the data processing pipeline for the encoding part is shown in Fig. 5.1.
As the buffering lines contribute a lot to the overall HW complexity, the only
possible scenario with one cache line meant that the vertical decorrelation is limited.
However, the proposed sliding window processing still allowed us to implement a
variety of methods; it was realized that a diversity of images (frames) potentially
being compressed by VLCCT inevitably requires different models for effective
representation of a signal.
Before going further, let us highlight several methods that were tried:
• JPEG2000, which is based on bit-plane arithmetic encoding applied right after a wavelet transform
• Classical DCT-based encoding (JPEG-like)
Fig. 5.1 High-level data processing pipeline for encoding part of VLCCT
Fig. 5.3 The final architecture of VLCCT including all main blocks and components
With pixels denoted as specified in Fig. 5.4a, the following features are calculated using a lifting scheme (Sweldens 1997) (Fig. 5.4b):
{A, B, C, D} → {s, h, v, d} are transformed according to the following:

Forward transform:
  h = A − B
  v = A − C
  d = A − D
  s = A − (h + v + d)/4

Inverse transform:
  A = s + (h + v + d)/4
  B = A − h
  C = A − v
  D = A − d
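As a sketch, the lifting pair above can be written in a few lines of Python; the function names are illustrative, and integer floor division is used in place of the exact /4 so that the forward/inverse pair round-trips losslessly (an assumption about the intended integer implementation):

```python
def nsw_forward(A, B, C, D):
    """Forward lifting transform for one 2x2 block: {A,B,C,D} -> {s,h,v,d}."""
    h, v, d = A - B, A - C, A - D          # simplest directional differences
    s = A - (h + v + d) // 4               # mean-like value (floor division)
    return s, h, v, d

def nsw_inverse(s, h, v, d):
    """Exact inverse: uses the same floor of (h+v+d)/4 as the forward pass."""
    A = s + (h + v + d) // 4
    return A, A - h, A - v, A - d
```

Because both directions compute the same floor of (h + v + d)/4, the reconstruction is exact for any integer inputs.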
Here, s is the mean value of the four pixel values; h, v, d are the simplest directional derivative values (differences). The s-value requires 8 bits to store; h, v, and d require 8 bits plus 1 sign bit. How can this be encoded effectively with a smaller number of bits?
Let us consider two 2 × 2 blocks and the initial limits on the bit budget. According to this, we may allow an average of 15 bits per block per colour channel. The mean value s can be represented with 6-bit precision (uniform quantization proved to be a good choice); quantization of h, v, and d is based on a fixed quantization very similar to the well-known Lloyd-Max quantization (which is akin to the k-means clustering method) (Kabal 1984; Patane and Russo 2001). Several quantizers can be constructed, each consisting of more than one value and optimized for different image parts. In detail, the quantization process includes the selection of an appropriate quantizer for the h, v, and d values for each colour; then each difference is approximated by the quantization value to which it is closest (Fig. 5.5a–c). To satisfy the bit limits, the following restrictions are applied: only eight quantizer sets are provided, each consisting of four positive/negative values; the trained quantizer values are shown in Table 5.2.
To estimate an approximation error, let us note that, once the quantization is
completed, it is possible to express an error and then estimate pixel errors as follows:
h′ = h + Δh,  v′ = v + Δv,  d′ = d + Δd

ΔA = (Δh + Δv + Δd)/4
ΔB = ΔA − Δh
ΔC = ΔA − Δv
ΔD = ΔA − Δd
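These propagation identities can be checked numerically; the helper below (illustrative, using the exact floating-point /4) perturbs h, v, and d by quantization errors and compares the resulting pixel changes against the predicted ΔA, ΔB, ΔC, ΔD:

```python
def inverse(s, h, v, d):
    """Exact inverse NSW transform with floating-point /4."""
    A = s + (h + v + d) / 4.0
    return A, A - h, A - v, A - d

s, h, v, d = 100.0, 6.0, -2.0, 4.0
dh, dv, dd = 1.0, -3.0, 2.0              # quantization errors on h, v, d

A,  B,  C,  D  = inverse(s, h, v, d)
A2, B2, C2, D2 = inverse(s, h + dh, v + dv, d + dd)

dA = (dh + dv + dd) / 4.0                # predicted error on pixel A
assert (A2 - A, B2 - B, C2 - C, D2 - D) == (dA, dA - dh, dA - dv, dA - dd)
```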
Fig. 5.5 Selection of appropriate quantizers: (a) distributions for h, v, d, and reconstructed levels;
(b) bit allocation for h, v, d components; (c) an example of quantization with quantizer set selection
and optimal reconstruction levels
If the mean value is encoded with an error, this error should be added to the error estimate of pixel A; then it is possible to aggregate the errors for all pixels using, for example, the sum of squared errors (SSE). Once the SSE is calculated for every quantizer set, the quantizer set providing the minimal error is selected; other criteria can be applied to simplify the quantization. The encoding process explained above is named Fixed Quantization via Table (FQvT) and is described by the following:
arg min_{QI = 0…QS} E({d_1, …, d_K}, {Id_{QI,1}, …, Id_{QI,K}}, QI),
124 M. N. Mishourovsky
where {d_1 … d_K} are the original differences; {Id_1 … Id_K} are the levels used to reconstruct the differences; and QI is a quantizer defined by a table. E stands for the error of the reconstructed differences relative to the original values.
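A minimal sketch of the FQvT selection loop, assuming illustrative quantizer sets (the trained values of Table 5.2 are not reproduced here):

```python
# Illustrative quantizer sets; VLCCT's trained values live in Table 5.2.
QUANTIZER_SETS = [
    (-24, -8, 8, 24),
    (-48, -16, 16, 48),
    (-96, -32, 32, 96),
]

def fqvt(differences):
    """Pick the quantizer set minimising the SSE of reconstructed differences.

    Returns (set index, per-difference level indices)."""
    best = None
    for qi, levels in enumerate(QUANTIZER_SETS):
        # Map each difference to its nearest reconstruction level.
        idxs = [min(range(len(levels)), key=lambda i: abs(d - levels[i]))
                for d in differences]
        sse = sum((d - levels[i]) ** 2 for d, i in zip(differences, idxs))
        if best is None or sse < best[0]:
            best = (sse, qi, idxs)
    return best[1], best[2]
```

In the real codec, the error term would also fold in the mean-value error and the pixel-domain propagation derived above.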
Another method adopted by VLCCT is the P-method. It helps to encode areas where a specific orientation of details exists, whether vertical, horizontal, or diagonal. In addition, this method helps to encode non-texture areas (where the mean value is a good approximation) and areas where one of the NSW components is close to zero. The best sub-mode is signalled by a 2-bit value according to Table 5.3.
According to Fig. 5.6, the h sub-mode means that each 2 × 2 block is encoded using only the two pixels A and C. In the same way, the v sub-mode uses A and B to approximate the remaining pixel values; the diagonal sub-mode encodes a block similarly to the h and v sub-modes but uses a diagonal pattern; in the uniform mode, the mean value is used to approximate the whole 2 × 2 block. If any one of the h, v, or d differences is close to zero, the corresponding sub-mode is invoked. In this case, the following is applied:
The actual bit precision of the mean value is based on an analysis of the “free bits”. Depending on this, the mean value can be encoded using 6, 7, or 8 bits. The structure of the bits comprising the encoding information for this mode is shown in Fig. 5.7. The algorithm to determine the number of free bits is shown in Fig. 5.8.
The next method is the S-method. It is intended for low-light images and is derived from the N-method by changing the quantization tables and by special encoding of the s-value. In particular, it was found that the least significant 6 bits are enough for encoding the s-value of low-light areas, and the least significant bit of these is excluded, so only 5 bits are used. The modified quantization table is shown in Table 5.5.
The C-method is also based on the NSW transform, applied to all colour channels simultaneously and followed by joint encoding of all three channels. The efficiency of this method rests on the fact that the NSW values are correlated across the three colour channels. By sharing the syntax information between colour channels and removing the independent selection of quantizers for each colour (all are encoded with the same quantizer set), an increase of the quantizer's dynamic range (to eight levels instead of four) is enabled – see Table 5.6.
The quantization process is like that of the P-method. The main change is that one quantizer subset is selected for all differences h, v, d of the R, G, B colours. Then each difference value is encoded using a 3-bit value; 9 difference values take 27 bits; the quantizer subset requires another 3 bits; thus, the differences take 30 bits. Every s-value (for R, G, and B) is encoded using 5 bits by truncating the 3 LSBs, which are reconstructed with the binary value 100b. Every 2 × 2 colour block is thus encoded using
Fig. 5.8 The algorithm to determine the number of free bits for the P-method
45 bits; to provide optimal visual quality, the quantization error should take into account the colour perception of the human visual system, which is translated into weights for every colour:
ΔA_C = (Δh_C + Δv_C + Δd_C)/4
ΔB_C = ΔA_C − Δh_C
ΔC_C = ΔA_C − Δv_C
ΔD_C = ΔA_C − Δd_C

E = Σ_{C=R,G,B} W_C · [(ΔA_C)² + (ΔB_C)² + (ΔC_C)² + (ΔD_C)²] / 4,
where W_C are weights reflecting the colour perception of an observer for colour channel C = R, G, B; Δh_C, Δv_C, and Δd_C are the errors of approximating the h, v, and d differences. A simpler yet still efficient way to calculate the weighted encoding error is

E′ = Σ_{C=R,G,B} MAX(|Δh_C|, |Δv_C|, |Δd_C|) · W_C
By subjective visual testing, it was confirmed that this method provides good visual results, and it was finally adopted into the solution.
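The simplified criterion can be sketched as follows; the weights here are placeholders, not the values adopted in VLCCT:

```python
# Placeholder perceptual weights; VLCCT's trained values are not reproduced here.
W = {"R": 0.299, "G": 0.587, "B": 0.114}

def weighted_max_error(errors):
    """Simplified weighted error E'.

    errors maps a channel name to its (dh, dv, dd) approximation errors."""
    return sum(max(abs(dh), abs(dv), abs(dd)) * W[c]
               for c, (dh, dv, dd) in errors.items())
```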
Let us review the E-method, which is intended for images with sharp colour edges and colour gradients, where other methods usually cause visible distortions. To deal with this problem, it was suggested to represent each 2 × 4 block as four small “stripes” – 1 × 2 sub-blocks. Every 1 × 2 sub-block consists of three colour channels (Fig. 5.9); only one colour channel is defined as the dominant colour channel for such a sub-block; its values are encoded using 6 bits, while the remaining colour channels are encoded using average values only.
By design, a dominant channel is determined, and the pixel values for that channel are encoded using 6 bits. The remaining channels are encoded via the average of their two values at every spatial position. Besides, quantization and clipping are applied. Firstly, the R, G, and B colour channels are analysed and, if certain conditions are not met, the YCbCr colour space is used instead; the luminance channel is considered dominant, while Cb and Cr are encoded as the remaining channels. The key point of this method is that the average value is calculated jointly for both remaining channels at every spatial position. Every 1 × 2 sub-block is extended with extra information indicating the dominant channel. The algorithm describing this method is shown in Fig. 5.10.
In addition to the methods described above, seven other methods are provided that encode 2 × 4 blocks without further splitting into smaller sub-blocks. In general, they are all intended for the cases explained above but, owing to their different mechanisms for representing data, may be more efficient in specific cases. The 2 × 4 D-method is targeted at diagonal-like image patches combined with natural parts (that is, transition regions between natural and structured image parts). The sub-block with a regular diagonal structure is encoded according to a predefined template, while the remaining 2 × 2 sub-block is encoded by truncating the two least significant bits of every pixel.
The sub-block with a regular pattern is determined by calculating errors over special template locations; the block with the smallest error is detected (Fig. 5.11) and considered to be the block with a regular structure; the remaining block is encoded in simple PCM mode with 2 LSBs truncated. The template values mentioned above are calculated according to the following equations:
C0(k) = [R(0, 0+2k) + R(1, 1+2k) + G(0, 1+2k) + G(1, 0+2k) + B(0, 0+2k) + B(1, 1+2k)] / 6
C1(k) = [R(0, 1+2k) + R(1, 0+2k) + G(0, 0+2k) + G(1, 1+2k) + B(0, 1+2k) + B(1, 0+2k)] / 6
where C0, C1 are the so-called template values and k is the index of a sub-block: k = 0 means the left sub-block, k = 1 the right sub-block. The approximation error for the k-th block is defined as follows:
Fig. 5.10 The algorithm of determining dominant channels and sub-block encoding (E-method)
BlE(k) = |R(0, 0+2k) − C0(k)| + |R(1, 1+2k) − C0(k)| +
         |G(0, 1+2k) − C0(k)| + |G(1, 0+2k) − C0(k)| +
         |B(0, 0+2k) − C0(k)| + |B(1, 1+2k) − C0(k)| +
         |R(0, 1+2k) − C1(k)| + |R(1, 0+2k) − C1(k)| +
         |G(0, 0+2k) − C1(k)| + |G(1, 1+2k) − C1(k)| +
         |B(0, 1+2k) − C1(k)| + |B(1, 0+2k) − C1(k)|
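The template values and BlE(k) can be sketched directly from the equations above; the [row][column] array layout and the function name are illustrative:

```python
def template_error(R, G, B, k):
    """C0/C1 template values and error BlE(k) for sub-block k of a 2x4 block.

    R, G, B are 2x4 arrays indexed [row][col]; k = 0 (left) or 1 (right)."""
    c = 2 * k
    C0 = (R[0][c] + R[1][c+1] + G[0][c+1] + G[1][c] + B[0][c] + B[1][c+1]) / 6.0
    C1 = (R[0][c+1] + R[1][c] + G[0][c] + G[1][c+1] + B[0][c+1] + B[1][c]) / 6.0
    err = (abs(R[0][c] - C0) + abs(R[1][c+1] - C0) +
           abs(G[0][c+1] - C0) + abs(G[1][c] - C0) +
           abs(B[0][c] - C0) + abs(B[1][c+1] - C0) +
           abs(R[0][c+1] - C1) + abs(R[1][c] - C1) +
           abs(G[0][c] - C1) + abs(G[1][c+1] - C1) +
           abs(B[0][c+1] - C1) + abs(B[1][c] - C1))
    return C0, C1, err
```

A sub-block whose pixels fall exactly into the two template groups yields err = 0, i.e. a perfectly regular diagonal structure.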
(Fig. 5.13a/b residue: given reference points A1–A4 and B1–B4, calculate D1 = A1 − A2, D2 = A2 − A3, D3 = A3 − A4, D4 = B1 − B2, D5 = B2 − B3, D6 = B3 − B4, and D_Average = (|D1| + |D2| + |D3| + |D4| + |D5| + |D6|)/6.)
Fig. 5.13 (a) Reference points and difference directions. (b) The algorithm for the F-method
C_R(i, j) = I_R(i, j)/16 if (i, j) ∈ {(0,0), (0,2), (0,3), (1,0), (1,1), (1,3)}; otherwise I_R(i, j)/32
C_G(i, j) = I_G(i, j)/16 for all (i, j)
C_B(i, j) = I_B(i, j)/16 if (i, j) ∈ {(0,1), (0,3), (1,2)}; otherwise I_B(i, j)/32
where I_R, I_G, I_B stand for the input pixel colour values and C_R, C_G, C_B are the output values after quantization. Reconstruction is performed according to the following equations:
D_R(i, j) = 16·C_R(i, j) + 8 if (i, j) ∈ {(0,0), (0,2), (0,3), (1,0), (1,1), (1,3)}; otherwise 32·C_R(i, j) + 16
D_G(i, j) = 16·C_G(i, j) + 8 for all (i, j)
D_B(i, j) = 16·C_B(i, j) + 8 if (i, j) ∈ {(0,1), (0,3), (1,2)}; otherwise 32·C_B(i, j) + 16
The second mode is based on averaging combined with LSB truncation for 2 × 2 sub-blocks:
C_R(i, j) = [I_R(i, j) + I_R(i, j+2)] / 4, i = 0, 1; j = 0, 1
C_G(i, j) = [I_G(i, j) + I_G(i, j+2)] / 2, i = 0, 1; j = 0, 1
C_B(i, j) = [I_B(i, j) + I_B(i, j+2)] / 4, i = 0, 1; j = 0, 1
The third mode of the M-method is based on partial reconstruction of one of the colour channels using the two remaining colours; those two colour channels are encoded using bit truncation according to the following:
C_R(i, j) = I_R(i, j)/8 for all (i, j) ∈ {(0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3)}
C_G(i, j) = I_G(i, j)/8 for all (i, j) ∈ {(0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3)}
The two boundary pixels of the blue channel are encoded similarly:
C_B(1, 0) = I_B(1, 0)/8,  C_B(0, 3) = I_B(0, 3)/16
The best encoding mode is determined according to the minimal reconstruction error, and its number is signalled explicitly in the bitstream.
In the case where one colour channel strongly dominates the others, the 2 × 4 method denoted the O-method is applied. Four cases are supported – luminance, red, green, and blue – signalled by a 2-bit index in the bitstream. All pixel values of the dominant colour channel are encoded using PCM without any distortion (i.e. 8 bits per pixel value), while each remaining colour is approximated by the mean value over all pixels of the 2 × 4 block, encoded as an 8-bit value. This is explained in Fig. 5.14. This method is like the E-method but gives a different balance between precision and locality, although both methods use the idea of dominant colours.
The 2 × 4 method denoted the L-method is applied to all colour channels, each processed independently. It is based on the construction of a colour palette. For every 2 × 2 sub-block a palette of two colours is defined; then three modes are provided to encode the palettes: differential encoding of the palettes, differential encoding of the colours within every palette, and explicit PCM for colours via quantization. The differential encoding of palettes enhances the accuracy of colour encoding in some cases; in addition, the palette colours might sometimes coincide as a result of the calculations, which is why extra palette processing is provided to increase the chance of differential encoding being used. If the so-called InterCondition is true, then the colours of the first palette are encoded without any changes (8 bits per colour), while the colours of the second palette are encoded as differences relative to the colours of the first palette:
These indexes are transformed into 4-bit values, called cumulative difference values (CDVal), which encode the joint distribution of TIdx_y and TIdx_x (Table 5.7):
This table helps to encode the joint distribution for better localization and to encode rare/common combinations of indexes. Another table, Table 5.8, is provided to decode the indexes according to CDVal.
Using decoded indexes, it is possible to reconstruct colour differences (and hence
colours encoded differentially):
C00 − C10 = TIdx_y − 2. The new range is [−2..2].
C01 − C11 = TIdx_x − 1. The new range is [−1..2].
Extra palette processing is provided to increase the chance of using the differential mode. It consists of the following:
• Detecting the situation when both colours of a palette are equal and setting the flag FP to 1 in that case; this is done for each palette individually.
• Checking whether FP1 + FP2 == 1.
• Suppose the colours of the left palette are equal. Then a new colour from the second palette should be inserted into the first palette. This colour must be as far from the colours of the first palette as possible (this extends the variability of the palette): the special condition is checked and a new colour from the second palette is inserted:
sub-block, and the mean value is calculated. For example, for the red colour channel, the SAD is calculated as follows:
Then, the colour channel which can be effectively encoded with the mean value is determined: every SAD is compared with Threshold1 in a well-defined order, and the colour channel with the minimal SAD (in accordance with the order shown in Fig. 5.17) is determined. In addition, the mean value is analysed to decide whether it is small enough (compared with Threshold2, which is set to 32); it might be signalled in
the bitstream that the mean (average) value is small: a 2-bit prefix is used to signal which colour channel is picked, and the value 3 is reserved for this case. A 2-bit extra colour index is then added to notify the decoder which colour channel is encoded with the mean value using 3 bits, truncating the remaining two (as the mean value takes 5 bits at most). Otherwise, if the mean value is greater than Threshold2, 6 bits are used to encode it, with the 2 LSBs again truncated. The remaining two colours (“active colours”) within the underlying 2 × 2 sub-block are encoded using NSW + FQvT: each active colour is transformed into a set of values {s1, h1, v1, d1}, {s2, h2, v2, d2}.
{h1, v1, d1} and {h2, v2, d2} are encoded according to the FQvT procedure,
using quantization Table 5.9.
Every set of differential values is encoded using its own quantizer set. The s1 and s2 values are encoded in one of two modes:
1. 6 bits per s-value. This is used if the mean value of the uniform channel is small and the bit budget is enough to reserve 12 bits for the s-values.
2. Differential mode, used if the 6-bit mode cannot be used. In this mode:
• The error caused by encoding s1 and s2 as 5 bits each (via LSB truncation) is evaluated:
• If ErrQuant < EffDiffQuant, s1 and s2 are encoded using 5 bits per s-value; otherwise, 6 bits are used to encode s1, and 4 bits encode a 3-bit signed difference.
In terms of bitstream structure, the diagram in Fig. 5.18 shows the bit distribution. According to this diagram, the U1 sub-method spends 43 bits per 2 × 2 sub-block. The second sub-method of the U-method is called U2. Every 2 × 4 colour block is processed as three independent 2 × 4 single-colour blocks. Every independent 2 × 4 block is further represented as two adjacent 2 × 2 sub-blocks. For each 2 × 2 sub-block, approximation by a mean value is estimated. Then, the sub-block which is approximated by a mean value with the smallest error is determined; the other sub-block is encoded with NSW and FQvT (see quantization Table 5.10).
(Fig. 5.18 residue – s-value bit allocations: 6 bits → S1 and 6 bits → S2; or 5 bits → S1 and 5 bits → S2; or 6 bits → S1, 1 bit → sign(S1 − S2), 3 bits → |S1 − S2|.)
Every difference value {h, v, d} is encoded using a 3-bit quantizer index and a 3-bit quantizer set number. The s-value is encoded as a 7-bit value by truncating the LSB. A 1-bit value is also used to signal which sub-block is encoded as uniform. This sub-method spends 28 bits per independent 2 × 4 block.
The last method is based on the Hadamard transform (Woo and Won 1999) and is called the H-method. In this method the underlying 2 × 4 block is considered as a set of two 2 × 2 blocks, and the H-transform is applied to every block as follows:
S  = (A + B + C + D)/4
dH = (A + C − B − D)/4
dV = (A + B − C − D)/4
dD = (A − B − C + D)/4

A = S + dD + dH + dV
B = S + dV − dH − dD
C = S + dH − dV − dD
D = S + dD − dH − dV
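The forward/inverse pair can be sketched with exact arithmetic (function names illustrative):

```python
def h_forward(A, B, C, D):
    """2x2 Hadamard-style transform used by the H-method."""
    S  = (A + B + C + D) / 4.0   # mean
    dH = (A + C - B - D) / 4.0   # horizontal component
    dV = (A + B - C - D) / 4.0   # vertical component
    dD = (A - B - C + D) / 4.0   # diagonal component (dropped by the codec)
    return S, dH, dV, dD

def h_inverse(S, dH, dV, dD):
    A = S + dD + dH + dV
    B = S + dV - dH - dD
    C = S + dH - dV - dD
    D = S + dD - dH - dV
    return A, B, C, D
```

Setting dD = 0 before calling `h_inverse` reproduces the codec's behaviour of discarding the perceptually least important component.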
The dD value is set to zero, as it is less important to the human visual system; the remaining components S, dH, and dV are encoded as follows:
• The S-value is encoded as a 6-bit value via truncation of the 2 LSBs; these 2 LSBs are reconstructed with the fixed binary value 10.
• dH and dV are encoded using FQvT; Table 5.11 describes the quantizer indexes for the eight quantizer sets adopted for the H-transform.
Every 2 × 2 sub-block within every colour channel is encoded using 6 bits for the S-value, one 3-bit quantizer set index shared between dV and dH, and a 3-bit value for each difference value. This approach is similar to methods described before, for example, the U-method; the difference is the shared quantizer set index, which saves bits and relies on the correlation between the dV and dH values.
The key idea is to pick the method that provides the minimum error; in general, the more closely the error measure matches how a user ranks methods by visual quality, the better the final quality. However, complexity is another limitation that bounds the final efficiency and restricts the approaches that can be applied in VLCCT. Based on a state-of-the-art review and experimental data, two approaches were adopted in VLCCT.
The first approach is the weighted mean square error (WMSE), defined as follows:

WMSE = Σ_{C=R,G,B} W_C · Σ_{i=1}^{2} [(A_i − A′_i)² + (B_i − B′_i)² + (C_i − C′_i)² + (D_i − D′_i)²],
where A, B, C, and D are pixel values (original and reconstructed) and W_C are weights dependent on the colour channel. According to the experiments conducted, the following weights were adopted:
MaxSq_C = MAX[(A_C − A′_C)², (B_C − B′_C)², (C_C − C′_C)², (D_C − D′_C)²] · W_C;

• Then, calculate the sum of the MaxSq values and the maximum over all colour channels:

SMax = Σ_{C=R,G,B} MaxSq_C
MMax = MAX[MaxSq_R, MaxSq_G, MaxSq_B];
To calculate the WSMMC for a 2 × 4 block, the WSMMC values for both 2 × 2 sub-blocks are added. This method is simpler than WMSE but was still shown to be effective. In general, to determine the best encoding method, all feasible combinations of methods are checked, errors are estimated, and the one with the smallest error is selected.
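The exhaustive selection can be sketched generically; the encode/decode callables and the weights below are placeholders rather than VLCCT's actual methods:

```python
def select_method(block, methods, error_fn):
    """Try every candidate (encode, decode) pair and keep the minimal-error one.

    `methods` maps a name to (encode, decode) callables - placeholders here."""
    best_name, best_err = None, float("inf")
    for name, (encode, decode) in methods.items():
        err = error_fn(block, decode(encode(block)))
        if err < best_err:
            best_name, best_err = name, err
    return best_name, best_err

def wmse(orig, recon, w=(0.3, 0.6, 0.1)):
    """Weighted SSE in the spirit of WMSE; `w` values are illustrative."""
    return sum(wc * sum((o - r) ** 2 for o, r in zip(oc, rc))
               for wc, oc, rc in zip(w, orig, recon))
```

A lossless candidate (identity encode/decode) naturally wins with zero error, so in practice each candidate also carries a bit cost constraint that this sketch omits.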
VLCCT provides a simple way to represent the two least significant bits of the 2 × 4 colour block. To keep a reasonable trade-off between bit cost, quality, and complexity, several methods to encode the LSBs are provided:
• 3-bit encoding – the mean value of the LSBs for every colour is calculated over the whole 2 × 4 block, followed by quantization:
mLSB_c = [Σ_{y=0}^{1} Σ_{x=0}^{3} (I_c(x, y) & 3)] / 16,
where I_c(x, y) is the input 10-bit pixel value in colour channel C at position (x, y). mLSB_c is encoded using 1 bit for each colour.
• 4-bit encoding – like the 3-bit encoding approach, but every 2 × 2 sub-block is considered independently for the green colour channel (reflecting the higher sensitivity of the human visual system to green):
mLSB_G_Left = [Σ_{y=0}^{1} Σ_{x=0}^{1} (I_G(x, y) & 3)] / 8,  mLSB_G_Right = [Σ_{y=0}^{1} Σ_{x=2}^{3} (I_G(x, y) & 3)] / 8;
• 5-bit encoding – like 4-bit encoding, but the red channel is also split into left/right sub-blocks; thus, the green and red colours have higher precision for LSB encoding.
• 6-bit encoding – every colour channel is encoded using 2 bits, in the same way as the green channel in the 4-bit encoding approach.
Reconstruction is done following the next equation (applicable for every channel
and every sub-block or block):
MiddleValue = 2 if mLSB == 1, and 0 if mLSB == 0.
This middle value is assigned to 2 LSB bits of the corresponding block/sub-block for
every processed colour.
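A sketch of the 3-bit LSB mode; the rounding threshold for the 1-bit decision is an assumption, as the text specifies only the averaging and the reconstruction middle value:

```python
def encode_mlsb(pixels):
    """1-bit mean of the two LSBs over a 2x4 block for one colour channel.

    `pixels` is a 2x4 array of 10-bit values. The >= 0.5 rounding rule is an
    assumption; the text defines only the mLSB average itself."""
    total = sum(p & 3 for row in pixels for p in row)
    return 1 if total / 16 >= 0.5 else 0

def reconstruct_mlsb(mlsb):
    """Middle value written back into the 2 LSBs of every pixel."""
    return 2 if mlsb == 1 else 0
```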
(Diagram residue: encoder method blocks N, P, S, C, E, F, O, D, L, U, followed by LSB encoding.)
• Each operation or memory cell takes a predefined number of physical gates in a chip.
• For effective use of the hardware and to keep latency constrained, pipelining and concurrent execution should be enabled.
• Algorithmic optimizations must be applied where possible.
The initial HW complexity estimate was based on counting elementary operations for almost every module (method). The elementary operations are the following:
1. Addition/subtraction of 8-bit signed/unsigned values. Denoted as (+)
2. Bit shift without carry bit. Denoted as (<<)
3. Comparison of 8-bit signed or unsigned values. Denoted as (?)
4. Look-up table operation extracting a value from some position within a 1D or 2D table. Denoted as (T.Opt)
5. Multiplication/division of 8-bit signed/unsigned values. Denoted as (*/)
6. Logical AND/OR operation. Denoted as (AND/OR)
7. Some complex (combined) sequence of operations that is very HW-dependent. Denoted as (ETC)
8. Transformation from conventional complementary code into sign-magnitude representation. Denoted as (CC→MM)
Fig. 5.20 Time diagram of the encoding process; error estimation is pipelined
Table 5.13 provides the estimated HW complexity for these operations (in gates). The total number of gates required for a straightforward implementation of VLCCT (main routines of the encoder part) is shown in Table 5.14.
If pipelining is applied to the error estimation process as shown in Fig. 5.20, a small but fixed latency occurs, but the complexity is reduced (Table 5.15).
Fig. 5.21 An example of {h,v,d} distribution, middle levels, and reconstructed levels
MaxErr1 = MAX{|Δh|, |Δv|, |Δd|} and MaxErr2 = |Δh| + |Δv| + |Δd|.
Subjective testing showed that both error functions provide very similar results in terms of visual quality, with negligible differences under normal viewing conditions. Other simplifications dealt with the selection of the best compression method for a block, compact storage of the quantization tables, and modification of the quantization process by using middle levels as comparison nodes, as shown in Fig. 5.21.
After applying all these optimization techniques, the complexity of the encoding part was reduced to 217 kGates. Table 5.16 sums up the HW complexity in kGates required to implement VLCCT.
Visual quality (when an image or video is intended for humans) refers to human perception of the “goodness” of the visual information being presented. In fact, “goodness” is a very generic term: depending on the situation, it might refer to an absolute scale (usually the Mean Opinion Score (MOS)) or a relative scale (the Differential Mean Opinion Score (DMOS)), and it might concern still or moving images and low-level (human visual system) or high-level (brain) processing. Many materials on this are publicly available (see, e.g. Watson 1993; Ohta and Robertson 2005; and others).
According to the classification provided in several international standards and recommendations on visual quality evaluation, such as BT.500 (2019), BT.1676 (2004), BT.1683 (2004), BT.2095 (2017), P.910 (2008), P.913 (2016), and others, a subjective testing methodology might be based on a double-stimulus setup (where a reference and an impaired image/video are presented in one trial). Alternatively, single-stimulus continuous quality evaluation might be applied. Every setup has its own pros and cons, which are well explained in the literature.
The visual quality provided by VLCCT was analysed under 1×, 2×, and 3× zoom-in; zooming condenses all the spatial frequencies (usually expressed in cycles per degree of visual angle) towards the low part of the spectrum, which leads to the following:
• All possible defects become highly visible.
• After zoom-in, there is not much difference between low and high spatial frequencies, which means all frequencies become equally important.
To test the visual quality (how accurate VLCCT is), a special dataset was prepared, which included several categories of images:
1. Natural images
2. Computer graphics
3. Screen content
4. Synthetic images (including resolution and colour charts and various colour transitions)
5. Combined frames
Fig. 5.22 Examples of test images from the dataset used for VLCCT testing: (a) synthetic colour
transitions (red); (b) synthetic colour transitions (blue); (c) synthetic colour transitions (green); (d)
radial/moiré patterns; (e) natural, complex scene; (f) spatial-frequency chart; (g) circle-like chart; (h)
gradient/transition; (i) gamma-checker pattern; (j) TV test chart; (k) natural, complex scene; (l) colour
transitions; (m) natural, small details; (n) complex colour transition; (o) angular colour transition
Every image was compressed up to three times; this was necessary to verify that
even if an image is passed through a coder-decoder multiple times, no significant
degradation occurs, which is important for real use cases. Examples of some images
from the dataset are shown in Fig. 5.22.
The whole dataset included more than 300 test images (including colour/resolution charts (e.g. ISO 12233:2017), sharp colour transitions, gradients, and natural images). All images were subjected to both objective (PSNR) and subjective (visual testing) analysis. Finally, it was confirmed that even after three consecutive VLCCT compression-decompression loops, no visual distortions were observed; the PSNR was higher than 33–35 dB, which is close to the perceptually lossless level.
Nowadays, there is huge progress in visual quality analysis, and more and more systems perform automatic quality control. Owing to the wide adoption of machine learning methods, several metrics have appeared that are in good agreement with human scores of visual quality. Worth mentioning are VMAF (Video Multi-Method Assessment Fusion) by Li et al. (2016) and a method to recover subjective scores from noisy raw data by Li and Bampis (2017); fully blind metrics such as the Natural Image Quality Evaluator (NIQE) by Mittal et al. (2013) and eMOS (2020); the “perceptual loss” criterion used in technologies such as style transfer (Johnson et al. 2016) and super-resolution; and a recently published NVIDIA paper on image quality evaluation for computer graphics images (Andersson et al. 2020). A very detailed review of visual quality metrics is given in Athar and Wang (2019). Most of these technologies have reached maturity because data have become available, for example:
1. The dataset related to texture perceptual similarity by Zhang et al. (2018).
2. Several large annotated datasets on visual quality collected by experts from
Workgroup Multimedia Signal Processing (https://www.mmsp.uni-konstanz.de/
). The datasets are available at: http://database.mmsp-kn.de/.
3. Other datasets that became popular in recent years:
• CIDIQ by Liu et al. (2014)
• CSIQ by Larson and Chandler (2010)
• IVL by Corchs et al. (2014)
• IVC (accessed on Sept. 16, 2020)
• Image and Video Quality Assessment datasets LIVE (2020)
• TID by Ponomarenko et al. (2015)
5.8 Conclusions
References
Andersson, P., Nilsson, J., Akenine-Möller, T., Oskarsson, M., Åström, K., Fairchild, M.D.: FLIP:
A difference evaluator for alternating images. In: Proceedings of the ACM on Computer
Graphics and Interactive Techniques. 3 (2), Article 15 (2020) Accessed on 16th September
2020. https://www.highperformancegraphics.org/2020/
ASTC Texture Compression (2019). Accessed on 15 September 2020 https://www.khronos.org/
opengl/wiki/ASTC_Texture_Compression https://github.com/ARM-software/astc-encoder
Athar, S., Wang, Z.: A comprehensive performance evaluation of image quality assessment
algorithms. In: IEEE Access, vol. 7, pp. 140030–140070 (2019)
BT. 1676, Methodological framework for specifying accuracy and cross-calibration of video
quality metrics. ITU recommendation (2004)
BT.1683 Objective perceptual video quality measurement techniques for standard definition digital
broadcast television in the presence of a full reference. ITU recommendation (2004)
BT.2095, Subjective assessment of video quality using expert viewing protocol, actual revision
BT.2095-1. ITU recommendation (2017)
BT.500, Methodologies for the subjective assessment of the quality of television images. ITU
recommendation, actual revision: BT.500-14 (2019)
Corchs, S., Gasparini, F., Schettini, R.: No reference image quality classification for JPEG-distorted
images. Digital Signal Processing. 30, 86–100 (2014)
eMOS technology (2020) Accessed on 16 September 2020. https://deelvin.com/machine-learning-
technologies
Image and Video Quality Assessment at LIVE. The University of Texas at Austin, Laboratory of Image and Video Engineering (2020) Accessed on 16 September 2020. http://live.ece.utexas.edu/research/Quality/
Strom, J., Moller, T.A.: iPACKMAN: High-Quality, Low-Complexity Texture Compression for
Mobile Phones, Graphics Hardware, pp. 63–70 (2005)
ISO 12233:2017 Photography–Electronic still picture imaging–Resolution and spatial frequency
responses (2017) Accessed on 16 September 2020. https://www.iso.org/standard/71696.html
IVL dataset. Institut de Recherche en Communications et Cybernétique de Nantes (2014) Accessed
on 16 September 2020. http://ivc.univ-nantes.fr/en/pages/view/44/
Jaspers E.G.T., de With, P.H.N.: Compression for reduction of off-chip video bandwidth. In:
Proceedings of SPIE. 4674 (2002) Accessed on 22 September 2020. https://doi.org/10.1117/
12.451065
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution.
Springer (2016)
Kabal, P.: Quantizers for symmetric gamma distributions. IEEE Transactions on Acoustics, Speech,
and Signal Processing. 32(4), 836–841 (1984)
Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment
and the role of strategy. Journal of Electronic Imaging. 19(1), 011006 (2010)
Lee, S-Jo, Lee, Si-Hwa, Kim, Do-Hyung: Method, medium, and system compressing and/or
reconstructing image information with low complexity. US Patent Application 20080317116
(2008)
Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., Manohara, M.: Toward a practical perceptual
video quality metric (2016). Accessed on 16 September 2020. https://netflixtechblog.com/
toward-a-practical-perceptual-video-quality-metric-653f208b9652. Github repository: https://
github.com/Netflix/vmaf
Li, Z., Bampis, C.: Recover subjective quality scores from noisy measurements.
arXiv:1611.01715v3 (2017) Accessed on 16 September 2020 https://github.com/Netflix/sureal
Liu, X., Pedersen, M., Hardeberg, J.Y.: CID:IQ – A new image quality database. In: Proceedings of
the International Conference on Image and Signal Processing, pp. 193–202 (2014)
Matoba, N., Terada, K., Saito, M., Tanioka, M.: Real-time continuous recording technique using
FBTC in digital still cameras. In: Proceedings of the SPIE. Digital Solid State Cameras: Designs
and Applications, vol. 3302, (1998)
Methods for encoding and decoding ETC1 textures (2019) Accessed on 22 September 2020. https://
developer.android.com/reference/android/opengl/ETC1
Mitchell, O.R., Delp, E.J.: Multilevel graphics representation using block truncation coding. Proc.
IEEE. 68(7), 868–873 (1980)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a completely blind image quality analyzer.
IEEE Signal processing Letters. 22(3), 209–212 (2013). Accessed on 16 September 2020. http://
live.ece.utexas.edu/research/Quality/nrqa.htm
Odagiri, J., Nakano, Y., Yoshida, S.: Video compression technology for in-vehicle image trans-
mission: SmartCODEC. Fujitsu Scientific & Technical Journal. 43(4), 469–474 (2007)
Ohta, N., Robertson, A.: Colorimetry: Fundamentals and Applications. John Wiley & Sons, Ltd
(2005)
P.910: Subjective video quality assessment methods for multimedia applications (2008)
P.913: Methods for the subjective assessment of video quality, audio quality and audiovisual
quality of Internet video and distribution quality television in any environment. ITU recom-
mendation (2016)
Paltashev, T., Perminov, I.: Texture compression techniques (2014) Accessed on 15 September
2020. http://sv-journal.org/2014-1/06/en/index.php?lang¼en
Patane, G., Russo, M.: The enhanced LBG algorithm. Elsevier, Neural Networks. 14(9), 1219–1237
(2001)
Paris, G.: Vulkan SDK update (1995–2020). Accessed on 15 September 2020. https://community.
arm.com/developer/tools-software/graphics/b/blog/posts/vulkan-sdk-update
Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi,
K., Carli, M., Battisti, F., Jay Kuo, C.-C.: Image database TID2013: peculiarities, results and
perspectives. Signal Process. Image Commun. 30, 57–77 (2015)
Rissanen, J.J., Langdon, G.G.: Arithmetic coding. IBM J. Res. Dev. 23(2), 149–162 (1979)
Someya, J., Nagase, A., Okuda, N.: Image processing apparatus and method, and image coding
apparatus and method. US Patent Application 20080019598 (2008)
Sugita, Y.: Data compression apparatus, and data compression program storage medium. US Patent
7,183,950 (2007)
154 M. N. Mishourovsky
Sugita, Y., Watanabe, A.: Development of new image compression algorithm (Xena). In: Pro-
ceedings of SPIE. Real-Time Image Processing. vol. 6496. (2007)
Sweldens, W.: The lifting scheme: a construction of second generation of wavelets. J. Math. Anal.
29(2), 511–546 (1997)
Takahashi, T., Matoba, N., Ohashi, S.: Image coding apparatus for converting image information to
variable length codes of predetermined code size, method of image coding and apparatus for
storing/transmitting image. US Patent 6,052,488 (2000)
Torikai, Y., Tanioka, M., Matoba, N.: Apparatus and method of image compression and decom-
pression not requiring raster block and block raster transformation. US Patent 6, 026,194 (2000)
Watson, A.B.: DCTune: A technique for visual optimization of DCT quantization matrices for
individual images. Society for Information Display Digest of Technical Papers. XXIV,
pp. 946–949 (1993)
Woo, S.H., Won, C.S.: Multiresolution progressive image transmission using a 2x2 DCT. In: 1999
Digest of Technical Papers. International Conference on Consumer Electronics (Cat.
No.99CH36277) (1999) https://doi.org/10.1109/ICCE.1999.785243
Wu, Y., Coll, D.C.: Single bit-map block truncation coding of color images. IEEE J. Select. Areas
Commun. 10(5), 952–959 (1992)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep
features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. arXiv:1801.03924 (2018)
Chapter 6
Automatic Video Editing
Sergey Y. Podlesnyy
6.1.1 Introduction
Portable devices equipped with video cameras, like smartphones and action cameras,
are showing rapid progress in imaging technology. 4K sensor resolution and highly
efficient video codecs such as H.265 provide firm ground for social video capturing and
consumption. However, despite this impressive progress, users seem to prefer photos
as the main medium to capture their daily events and mostly consume video clips
produced by professionals or skilled bloggers. We argue that the main reason is that
the time needed to watch a video is orders of magnitude longer than photo browsing.
Social video footage captured by ordinary gadget users is too long, to say nothing
of its quality. Videos should be edited before presenting them even to
one’s closest friends and/or family members, yet video editing is a lengthy and
complicated process for most video gadget users.
Video editing involves, at a minimum, selecting the most valuable footage in terms
of visual quality and the importance of the action filmed. This is time-consuming
even on its own, but the next step is to cut
the footage into a brief and coherent visual story that will be interesting to watch.
This process is cinematographic in nature and thus requires many artistic and
technical skills, which makes it almost impossible for a broad range of users to use
successfully.
Recently, deep learning has shown huge success in the visual data processing
area. We aim to apply the proven techniques of machine learning, convolutional
neural networks and reinforcement learning to create automatic tools for social video
S. Y. Podlesnyy (*)
Cinema and Photo Research Institute (NIKFI) of Gorky Film Studios, Moscow, Russia
e-mail: s.podlesnyy@nikfi.ru
Tsivian (2009) used detailed film shot length metrics to study the history of cine-
matography editing and individual styles of famous film editors. Specifically, these
measure the film’s average shot length (ASL) tabulated by the shot size (from close-
up to long shot and even their extreme scales: from big close-up to very long shot),
cutting swing (standard deviations of shorter and longer shots from ASL), their
cutting range (difference in seconds between the shortest and the longest shot of the
film) and their dynamic profiles (polynomial trend lines which reflect fluctuations of
shot lengths within the duration of the film). The results of applying dynamic
profiling to separate stories within the film, but not the full-feature film, look the
most promising (see Fig. 6.1). It is argued that the profile pattern may reflect either
some general rule of dramatic rhythm or the editors’ individual way of shaping the
narrative flow of their films.
As timing statistics may be the most distinctive “fingerprint” for attributing a
feature film to a particular author, we argue that it may be
beneficial to measure the statistics of transitions between shot types, e.g. classified
by shot size. Compliance with basic cinematography rules (e.g. famous Russian
cinematographer Kuleshov in the early twentieth century recommended transitions
between shot sizes separated by two steps) could be checked by a simple count of
transitions. The importance of the cinematographic cut is well-known to filmmakers
(Pudovkin 1974).
Examples of basic cinematography editing rules are:
• “avoid jump cuts” rule (transitions between the shots should go in two shot size
steps, shot sizes being “extra-long shot”, “long shot”, “middle long shot”, “middle
shot”, “close-up”, “extra close-up”; the camera position should move at least
30° between the two shots).
• “180° line of action” rule (the camera should never cross the line of action while
capturing dialogue and other actions having a distinct axis).
Of course, we should not take the rules for granted and demand 100% compli-
ance, as video editing is an artistic process and not a subject for mechanical
judgment.
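The two-step rule above can still be checked mechanically as a soft diagnostic. The following is a minimal Python sketch, not from the chapter: the shot-size labels follow the list above, while the example sequence and the flagging convention are invented for illustration.

```python
# Shot-size scale from the "avoid jump cuts" rule above, widest to tightest.
SHOT_SIZES = ["extra-long", "long", "middle-long", "middle", "close-up", "extra-close-up"]
INDEX = {name: i for i, name in enumerate(SHOT_SIZES)}

def transition_counts(shots):
    """Count occurrences of each (from_size, to_size) transition."""
    counts = {}
    for a, b in zip(shots, shots[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def jump_cut_violations(shots, min_step=2):
    """Flag cuts whose shot sizes differ by fewer than `min_step` levels;
    a cut between identical sizes is the classic jump cut."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(shots, shots[1:]))
        if abs(INDEX[a] - INDEX[b]) < min_step
    ]

shots = ["long", "middle", "close-up", "close-up", "middle-long"]
print(jump_cut_violations(shots))
# [(1, 'middle', 'close-up'), (2, 'close-up', 'close-up')]
```

As the text cautions, such counts should inform rather than dictate: a high violation count is a hint about editing style, not a verdict.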
Fig. 6.1 Example metric profiles of four separate stories in D.W. Griffith’s 1916 Intolerance.
(Reproduced with permission from Tsivian 2009)
Fig. 6.2 15 video segments (left); reconstruction error for each video segment (right). (Reproduced
with permission from Zhao and Xing 2014)
consistency, where a dictionary of key frames is selected such that the original video
can be best reconstructed from this representative dictionary. An efficient global
optimisation algorithm is introduced to solve the dictionary selection model.
Zhao and Xing (2014) further develop a dictionary-based approach analysing the
sparse reconstruction error of a new video segment with a dictionary learnt by
observing the previous part of a potentially very long video. Figure 6.2 illustrates
a video reconstruction error.
An important insight of this work is that typical consumer videos do not have any
temporal segmentation characterised by minimal variation and consistency of
objects. Amateur users often shoot videos with a constantly moving camera, chang-
ing the zoom and shaking the device. Conventional shot boundary detection methods
(e.g. based on colour histograms or motion estimation) do not work in this setting.
Zhao and Xing (2014) cope with this problem by simply breaking the raw footage
into fixed-length sequences, e.g. each 50 frames long.
They represent features for video data as spatio-temporal cuboids of interest
points using the method of Dollar et al. (2005) and describe each detected interest
point with a histogram of gradient (HoG) and histogram of optical flow (HoF). The
feature representation for each detected interest point is then obtained by concatenat-
ing the HoG feature vector and HoF feature vector. Finally, each video segment is
represented as a collection of feature vectors, corresponding to the detected interest
points.
They initiate a dictionary by learning from feature vectors obtained from the first
m segments and further scan through the video; segments that cannot be sparsely
reconstructed using the current dictionary, indicating unseen and interesting content,
are incorporated into the summary video. The current dictionary is updated online
when a segment reconstruction error exceeds a given threshold.
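The online dictionary idea can be sketched compactly. The snippet below is a hedged illustration, not the authors' implementation: plain least-squares reconstruction stands in for the sparse coding of Zhao and Xing (2014), and all dimensions, thresholds and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_error(segment, dictionary):
    """Residual norm of the best linear reconstruction of `segment`
    from the columns of `dictionary` (least squares, not sparse coding)."""
    coef, *_ = np.linalg.lstsq(dictionary, segment, rcond=None)
    return np.linalg.norm(dictionary @ coef - segment)

def summarise(segments, m=3, threshold=1.0):
    """Initialise the dictionary from the first m segments, then keep the
    indices of segments whose reconstruction error exceeds the threshold,
    updating the dictionary online as described above."""
    dictionary = np.stack(segments[:m], axis=1)          # features x m
    summary = []
    for i in range(m, len(segments)):
        seg = segments[i]
        if reconstruction_error(seg, dictionary) > threshold:
            summary.append(i)                            # novel content
            dictionary = np.column_stack([dictionary, seg])
    return summary

# Toy data: most segments live in a 2D subspace; one outlier is "novel".
base = rng.normal(size=(16, 2))
segments = [base @ rng.normal(size=2) for _ in range(8)]
segments.insert(5, rng.normal(size=16) * 5.0)            # the novel segment
print(summarise(segments))                               # [5]
```

The design choice mirrors the paper's logic: a segment that the current dictionary cannot explain is, by definition, unseen content and therefore worth keeping.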
They used a test data set consisting of over 12 hours of raw video footage hand
labelled by human experts. The test videos span a wide variety of scenarios: indoor
and outdoor, moving camera and still camera, with and without camera zoom in/out,
with different categories of targets (human, vehicles, planes, animals etc.) and cover
a wide variety of activities and environmental conditions. For each video in the test
data set, three judges selected segments from the original video to compose their
preferred version of the summary video. The final ground truth was then constructed
by pooling together those segments selected by at least two judges. To quantitatively
determine the overlap between the algorithm-generated summary and the ground
truth, both the video segment content and time differences were considered. The
final accuracy was computed as the ratio of segments in the algorithm-generated
summary video that overlap with the ground truth.
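The pooling and scoring procedure can be illustrated in a few lines; the segment identifiers and the two-vote threshold below are assumed for the example, and the time-difference tolerance used in the paper is omitted for brevity.

```python
def pooled_ground_truth(judge_selections, min_votes=2):
    """Keep segments chosen by at least `min_votes` judges."""
    votes = {}
    for selection in judge_selections:
        for seg in selection:
            votes[seg] = votes.get(seg, 0) + 1
    return {seg for seg, v in votes.items() if v >= min_votes}

def summary_accuracy(summary, ground_truth):
    """Ratio of summary segments that appear in the ground truth."""
    if not summary:
        return 0.0
    return sum(seg in ground_truth for seg in summary) / len(summary)

judges = [{1, 4, 7}, {1, 7, 9}, {4, 7, 10}]
gt = pooled_ground_truth(judges)            # {1, 4, 7}
print(summary_accuracy([1, 4, 7, 9], gt))   # 0.75
```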
The average accuracy of their summarising was 72.30%, and the ratio of the total
time spent on generating feature representations, learning the initial dictionary, video
reconstruction and online dictionary updating to the raw video duration ranged from
0.60 to 1.71.
Although video summarising is only loosely related to the topic of this chapter, we
refer to the method of determining a measure of the importance of film shots or
segments shown in Uchihachi et al. (2003), where video shots have been clustered
using a measure of visual similarity, such as colour histograms or transform coeffi-
cients. Consecutive frames belonging to the same cluster are considered as a video
shot, each shot having attributes of length and cluster weight (total number of frames
belonging to the cluster). A shot is important if it is both long and rare. Additional
amplification factors are proposed for preferring specific shot categories, for exam-
ple, close-ups.
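A possible form of the “long and rare” measure is sketched below. The logarithmic rarity term is one common choice, assumed here for illustration rather than taken from Uchihachi et al. (2003), and the amplification factors for preferred categories are omitted.

```python
import math

def shot_importance(shots, total_frames):
    """shots: list of (length_in_frames, cluster_id).
    A shot scores highly when it is long and its cluster is rare."""
    cluster_frames = {}
    for length, cluster in shots:
        cluster_frames[cluster] = cluster_frames.get(cluster, 0) + length
    scores = []
    for length, cluster in shots:
        rarity = total_frames / cluster_frames[cluster]   # rare cluster -> large
        scores.append(length * math.log(rarity))
    return scores

# Cluster "a" dominates the footage; the short shot from rare cluster "b"
# still scores highest because of its rarity.
shots = [(120, "a"), (30, "b"), (150, "a")]
print(shot_importance(shots, total_frames=300))
```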
This approach allows for a concise presentation of long video sequences as
shortened video clips or still-frame storyboards. However, the method of visual
clustering is based on low-level graphical measurements and does not contain
semantic information about the frame content and geometry. In one of our recent
works (Podlesnaya and Podlesnyy 2016), it has been shown that feature vectors
obtained from a frame image by a convolutional neural network trained to recognise
a wide range of classes, as in the ImageNet contest (Russakovsky et al. 2014),
comprise semantic information suitable for visual example-based information
retrieval and for segmenting videos into distinct shots. Here, we will show that the
same feature vector comprises geometry-related semantic information to some
extent and is at least capable of differentiating between cinematography shot sizes
(from close-up to long shots).
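As a toy illustration of using such feature vectors for shot segmentation, consecutive-frame embeddings can be compared by cosine similarity, declaring a cut where the similarity drops. The embeddings and the threshold below are synthetic stand-ins for real CNN features.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def shot_boundaries(embeddings, threshold=0.5):
    """Indices i such that a cut is declared between frames i and i+1."""
    return [i for i in range(len(embeddings) - 1)
            if cosine(embeddings[i], embeddings[i + 1]) < threshold]

rng = np.random.default_rng(1)
shot_a = np.array([1.0, 0.0, 0.0, 0.0])   # stand-in embedding for shot 1
shot_b = np.array([0.0, 1.0, 0.0, 0.0])   # stand-in embedding for shot 2
frames = [shot_a + 0.01 * rng.normal(size=4) for _ in range(3)] + \
         [shot_b + 0.01 * rng.normal(size=4) for _ in range(3)]
print(shot_boundaries(frames))            # a single cut between frames 2 and 3
```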
Arev et al. (2014) of Disney Research describe a system capable of automatic editing
of video footage obtained from multiple cameras, e.g. collected from viewers of a
basketball game. Automatic editing is performed by optimising a path in the trellis
graph constructed of frames of multiple sources, ordered in time. Edges in the graph
Fig. 6.3 Method pipeline: from multiple camera feeds to a single output cut. (Reproduced with
permission from Arev et al. 2014)
represent transitions between different frames, i.e. effectively cuts (or no cuts if the
edge connects two nodes representing the same footage).
The system efficiently produces high-quality narratives by means of constructing
the cost functions for nodes and edges of the trellis graph to closely correspond to the
basic rules of cinematography. For example, in order to enforce the 180° line of
action rule, they estimate the 3D camera position and rotation for every source of
video footage and further estimate the most important action location in a 3D scene
as a joint focus of attention of multiple cameras (see Fig. 6.3). In order to avoid jump
cuts, the system estimates the cameras’ movement in 3D space and constructs the
loss function for the graph edges so that both the transition angle and the distance
between camera positions in a transition are constrained. For example, the optimal
transition angle is reported to be around 30°, and the optimal distance around 6 metres.
By estimating the distance between the camera and the joint attention focus, the
system is capable of evaluating the size of each shot as wide, long, medium, close-up
or extreme close-up. Cost functions for the graph edges penalise transitions between
shots that are more than two sizes apart. Lastly, the system promotes cuts-on-action
transitions by means of estimating actions as local maxima of joint attention focus
acceleration.
Let’s go into greater depth on the details of this wonderful work. An input for the
algorithm is k synchronised video streams captured from numerous positions,
possibly moving in time. The overall processing pipeline is shown in Fig. 6.3.
The first step in the pipeline is 3D camera pose estimation for every video stream.
The standard procedure widely used in computational photography is described in
Snavely et al. (2006). Given k synchronously taken frames, a few thousand interest
points are found in each frame. Classic methods for interest point detection and
description are reviewed in Mikolajczyk et al. (2005), the SIFT method being just
one. Next, for each pair of frames, interest descriptors are matched between the pair,
using the approximate nearest neighbour package of Arya et al. (1998).
More recent approaches suggest that the detection and feature-matching stages
can be avoided and, instead, features can be extracted on a dense grid across the
image. In particular, Neighbourhood Consensus Networks (NCNet) (Rocco et al.
2018) allow for jointly trainable feature extraction, matching, and match-filtering to
directly output a strong set of (mostly) correct correspondences.
Given the pairwise correspondences of interest points, a fundamental matrix for
the pair is estimated using RANSAC (Fischler and Bolles 1987). After numerous
refinements and suppression of outliers, a set of geometrically consistent matches
between each image pair is obtained. Earlier, the efficiency of using the RANSAC
algorithm for outlier suppression was demonstrated, in particular, for image
matching and coordinate transformations in document image processing (Safonov
et al. 2019, Chap. 7).
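The RANSAC loop itself is compact. The sketch below applies it to line fitting rather than fundamental-matrix estimation, to stay self-contained, but the hypothesise-and-verify structure is the same; all data and tolerances are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

def ransac_line(x, y, iters=200, tol=0.1):
    """Fit y = slope*x + intercept robustly: repeatedly fit a minimal
    two-point sample and keep the hypothesis with the most inliers."""
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                               # degenerate sample
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        inliers = np.abs(y - (slope * x + intercept)) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set, as is standard after RANSAC.
    slope, intercept = np.polyfit(x[best_inliers], y[best_inliers], 1)
    return slope, intercept, best_inliers

x = np.linspace(0, 1, 30)
y = 2.0 * x + 1.0
y[::7] += 5.0                                      # gross outliers
slope, intercept, inliers = ransac_line(x, y)
print(round(slope, 2), round(intercept, 2))        # recovers 2.0 and 1.0
```

In the multi-camera pipeline the "model" is a fundamental matrix and the residual is an epipolar distance, but the outlier-suppression logic is exactly this.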
Next, a set of camera parameters and a 3D location for each pair of interest points
are recovered. The recovered parameters should be consistent, in that the
reprojection error, i.e. the sum of distances between the projections of each track
and its corresponding image features, is minimised. This minimisation problem is
formulated as a non-linear least squares problem and solved iteratively, starting with
one pair of frames and adding more frames one by one to eliminate bad local
minima, which are known to affect large-scale Structure from Motion problems.
The pose of each camera is estimated using a perspective-n-point algorithm (Lepetit
et al. 2009).
The second step in the pipeline is to use the gaze clustering algorithm of Park et al.
(2012) to extract 3D points of joint attention (JA-point) in the scene at each time
instant. All gaze concurrences g are calculated in the scene through time, and their
importance rank(g) is counted as the number of camera gazes that intersect at that
point. Thus, this process produces multiple 3D locations of joint interest, and the
algorithm uses them all during its reasoning about producing the cut.
The third step in the pipeline is trellis graph construction, where each node
represents a camera paired with a 3D joint attention point estimated via gaze
concurrences; nodes are laid out in slices, each slice corresponding to a time
instant. The edges in the graph connect all nodes in a slice to the nodes in the next slice.
Both the nodes and edges of the graph are weighted. The node weights combine such
parameters of a camera frame as:
• Stabilisation cost (to limit shaky camera movement, camera shakiness estimated
from camera trajectory across time)
• Camera roll cost (to enforce camera alignment to the horizon line)
• JA-point importance rank
• 2D location of JA-point projection in the frame (to penalise JA-points lying
outside the 10% margin of the frame resulting in centring the frame around the
main point of interest of the narrative, or stabilising shaky footage, or reducing
distortion in wide FOV cameras, or creating more aesthetic compositions)
• 3D distance between the JA-point and the node’s camera centre (to eliminate
cameras that are too far or too close)
The edge weights of the graph combine:
• Absolute angle difference between the two front vectors of adjacent frames of the
same video feed (continuous feed case; no cut is produced).
• Transition angle (ensures small overlap between the frames and different back-
ground arrangements).
• Distance between the two cameras (ensures small angle change between the
transition frames).
• Shot size (the size of each shot is identified as wide, long, medium, close-up or
extreme close-up, according to the distance from the JA-point; transitions
between shots whose sizes are more than two levels apart can be confusing for
the viewer and should be penalised).
• Acceleration of the 3D JA-point as a measure for action change (to promote cut-
on-action transitions).
The final step in the pipeline is optimal path computation in the graph. A path in
the trellis graph starting at the first slice and ending at the last defines an output
movie whose length matches the length of the original footage. Following continu-
ous edges in the path continues the same shot in the movie, while following
transition edges creates a cut. The cost of a path is the sum of all the edge weights
and node costs in the path. Once the graph is constructed, it becomes possible to find
the “best” movie by choosing the lowest cost path. Dijkstra’s algorithm or the
modified dynamic programming algorithm proposed by Arev et al. (2014) can be
used to find the lowest cost path.
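Once node and edge costs are defined, the lowest-cost path can be found slice by slice with standard dynamic programming. The sketch below is only an illustration of the search: its toy edge cost keeps a single shot-size penalty and equates "same size" with "same camera", which is far simpler than the full cost model of Arev et al. (2014).

```python
def edge_cost(size_a, size_b):
    """Toy edge weight: free to continue the same shot, a small cost to cut,
    and a large penalty for cuts more than two shot-size levels apart."""
    if size_a == size_b:
        return 0.0                      # continuous feed in this toy setup
    return 1.0 + (10.0 if abs(size_a - size_b) > 2 else 0.0)

def best_path(node_costs, shot_sizes):
    """node_costs[t][c]: per-node cost; shot_sizes[t][c]: size level per camera."""
    n_slices, n_cams = len(node_costs), len(node_costs[0])
    cost = list(node_costs[0])
    back = []
    for t in range(1, n_slices):
        new_cost, choices = [], []
        for c in range(n_cams):
            prev = min(range(n_cams), key=lambda p: cost[p] +
                       edge_cost(shot_sizes[t - 1][p], shot_sizes[t][c]))
            choices.append(prev)
            new_cost.append(cost[prev] +
                            edge_cost(shot_sizes[t - 1][prev], shot_sizes[t][c]) +
                            node_costs[t][c])
        back.append(choices)
        cost = new_cost
    # Backtrack from the cheapest final node.
    path = [min(range(n_cams), key=lambda c: cost[c])]
    for choices in reversed(back):
        path.append(choices[path[-1]])
    return list(reversed(path))

node_costs = [[0.0, 5.0], [4.0, 0.0], [0.0, 5.0]]
shot_sizes = [[1, 4], [1, 4], [1, 4]]
print(best_path(node_costs, shot_sizes))   # [0, 0, 0]
```

Here the optimiser stays on camera 0 despite its momentarily higher node cost, because cutting to camera 1 and back would incur the large shot-size penalty twice, which is the qualitative behaviour the trellis formulation is designed to produce.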
For algorithm evaluation, the following methods have been proposed: compari-
son of automatically edited clips with manually edited clips and random cuts made of
the same footage. Among the metrics proposed is the time of processing (ranging
from 20 hours on average for manual editing of a few minutes’ clip to 6–17 minutes
of automatic editing including rendering). Another metric is the diversity of time
between the cuts and how well it follows the pace of actions in the scene. Yet another
way to compare automated video editing algorithms is by counting the cinematography
rule violations (e.g. 180° line of action rule violations) and the diversity of the
shot sizes of the transitions.
The proposed method relies heavily on the availability of multiple video sources
taken from different angles to make it possible to derive a joint attention point in 3D
Fig. 6.4 A visualisation of three shots from a coherent video cut of a social event. In this case, eight
social cameras record a basketball game. Their views are shown at three different times. The 3D top
view shows the 3D camera poses, the 3D joint attention estimate (blue dots) and the line-of-action
(green line). Using cinematographic guidelines, the quality of the footage and joint attention
estimation, our algorithm chooses times to cut from one camera to another (from the blue camera
to purple and then to green). (Reproduced with permission from Arev et al. 2014)
space. Cinematography rules are hard-coded into the system. The system is reported
to show the additional benefit of improving the visual quality of the resulting films
by means of applying a crop and stabilising some shots in order to achieve the
transition between shot sizes dictated by cinematography rules (Fig. 6.4).
This is an impressive result, but we would suggest adding an evaluation of the
general aesthetics of a frame to prevent the inclusion of technically defective footage
in the resulting film. Good results in determining the aesthetical score of a photo
image were shown by Jin et al. (2016). They used a crowdsourced collection of rated
photos to train a convolutional neural network to directly regress the aesthetical
score of an image. When we analysed the general patterns of their scoring, we found
that their system penalises basic technical defects like blurred images, skyline slopes,
face occlusions, etc. Bearing in mind that professional video editors may use these
widely as creative effects rather than defects, we are still convinced that, for
social video device users, these issues should be penalised as probable technical defects.
Examples of aesthetical scores are shown in Fig. 6.5 (scores range from 0, “bad”, to
1, “excellent”). Examples (a) to (g) correlate well with common sense, while
(h) assigns a relatively low score of 0.18 to a frame extracted from one of the
greatest masterpieces of the twentieth century.
Leake et al. (2017) presented a system for automatically editing video of dialogue-
driven scenes. Given a script and multiple takes of video footage, this system
performs sentiment analysis of the text spoken and facial key point analysis in
order to determine which video clips are associated with each line of dialogue, and
whether or not the performer speaking the line is visible in the corresponding clip.
Fig. 6.5 Examples of aesthetic scores of video footage: (a, b) GoPro frames, score depends on
horizon alignment; (c) high score for frame from “L.A. Confidential”, 1997 by Curtis Hanson; (d)
high score for frame from “Apocalypse Now”, 1979 by Francis Ford Coppola; (e, f) GoPro frames,
score depends on face lighting; (h) low score assigned to a frame from “Last Tango in Paris”, 1972
by Bernardo Bertolucci
For shot size attribution, they use a face detector to determine the median area of a
face in the frame. Basic film-editing rules are encoded as probability functions for
starting a scene with a particular clip or performing a transition between the takes.
Such cinematography rules as “start wide”, “avoid jump cuts”, “speaker visible” and
“intensify emotion” are encoded using the attributes obtained from the text senti-
ment, face area and aligning dialogue lines with clips. The system provides a
graphical user interface capable of composing several editing rules and controlling
the pace of the final video.
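A minimal version of the shot-size attribution from the median face area might look as follows; the area thresholds are invented for illustration and are not taken from Leake et al. (2017).

```python
def shot_size_from_face(face_area, frame_area):
    """Classify shot size from the fraction of the frame occupied by the
    (median) detected face. Thresholds are illustrative assumptions."""
    ratio = face_area / frame_area
    if ratio > 0.15:
        return "close-up"
    if ratio > 0.05:
        return "medium"
    if ratio > 0.01:
        return "long"
    return "wide"

# A 200x200-pixel face in a Full HD frame occupies about 1.9% of the area.
print(shot_size_from_face(face_area=40_000, frame_area=1920 * 1080))  # long
```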
In order to evaluate the results, automatically edited videos were compared with
manually edited film clips. It is reported that a professional video editor needed
2–3 hours to perform the job. Due to the heavy use of face detection
algorithms, the time taken for automatic editing was comparable, but no human
interference was required during that time. The proposed system is limited to
dialogue-based videos only and requires a script with the performers’ names
labelled.
As shown in Merabti et al. (2015), it is possible to model the editing process using a
Hidden Markov Model (HMM). They hand-annotate existing film scenes to serve as
training data. In particular, they annotate shot sizes and action types as “main
character entering scene”, “both characters leave scene”, etc. They also hand-
annotate the communicative values of utterances in the dialogue following the way
filmmakers construct dialogue: symbolic, moral and narrative communicative
values, as these feature a stronger relationship with the type of shot used to portray
GoPro Inc. discloses a set of patents covering its Quik software for automated video
editing (Médioni 2017; Matias and Phan 2017). According to these publications, the
following use case for automated video editing may be implemented for
non-professional users of action/sports video cameras:
• User highlights favourite moments in raw footage (see Fig. 6.6 for example of
user interface).
• Application applies a pretrained spatio-temporal convolutional neural network to
calculate semantic feature vectors of the highlights.
• Application trains an LSTM recurrent neural network to predict the semantic
feature vector of the next video segment from the previous segment’s feature
vector.
• Application finds similar video segments in the raw footage by applying the
trained LSTM network and analysing the difference between the predicted
semantic feature vectors and the actual feature vectors.
• Application composes a shortened movie, giving priority to video segments
similar to the user-highlighted ones; the movie duration and tempo may be
defined by music selected by the user.
The authors used video segments with a fixed number of frames, e.g. 16 or
24 RGB frames of 112×112 pixels, as input data for learning user highlights. Such
segments preserve the spatio-temporal relations of the video signal while being
relatively stationary, as described in Sect. 6.1.3.
For calculation of the semantic vectors, the pretrained 3D convolutional neural
network (CNN) is used. Figure 6.7a shows the overall CNN structure, and Fig. 6.7b
shows the inception block structure. The network is trained with the Sports-1M data
set (Karpathy et al. 2014) to classify the actions in the video into 487 classes. The
Fig. 6.7 Overall 3D CNN structure (a) and inception block structure (b). (Image is reproduced
from the patent by Médioni 2017)
Softmax classification layer is used only at the pretraining phase. At the inference
time, the output of the final layer with a dimension of 1000 is used as a feature vector
for the video segment.
More information can be found in Chap. 7 (Real-Time Detection of Sports
Broadcasts Using Video Content Analysis).
At the next stage of the video editing pipeline, an LSTM module is trained with
raw video content including user highlights with the goal of predicting the next
and/or previous spatio-temporal feature vectors in video highlights. After the train-
ing, the process determines the presence of one or more highlight moments within
the video content based on a comparison of the spatio-temporal feature vectors with
predicted spatio-temporal feature vectors.
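The comparison step can be sketched as follows. This is a stand-in illustration, not the patented method: the identity "predictor" is a placeholder for the trained LSTM, and the error threshold and feature vectors are invented.

```python
import numpy as np

def predict_next(feature):
    """Placeholder for the trained LSTM: predicts the next segment's
    feature vector from the previous one (here, trivially, a copy)."""
    return feature

def highlight_moments(features, max_error=0.5):
    """Indices of segments whose actual feature vector is close to the
    prediction, i.e. segments that behave like the learnt highlights."""
    hits = []
    for i in range(1, len(features)):
        predicted = predict_next(features[i - 1])
        if np.linalg.norm(features[i] - predicted) < max_error:
            hits.append(i)
    return hits

features = [np.array([1.0, 0.0]), np.array([1.1, 0.1]),
            np.array([5.0, 5.0]), np.array([5.1, 4.9])]
print(highlight_moments(features))   # segments 1 and 3 match their predictions
```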
Having identified the video segments close enough to the user highlights, the
application performs automatic video editing using hard-coded cinematography
rules, and temporal priors such as background music events (peaks and/or troughs).
Magisto Ltd. discloses a set of patents (Rav-Acha and Boiman 2016, 2017; Boiman
and Rav-Acha 2017) covering its software for automatic video editing. The company
(now acquired by Vimeo) claims that it performs numerous kinds of visual, audio
and storytelling analysis:
• Action analysis
• Camera motion analysis
• Video stabilisation
• Face detection, recognition and indexing
• Scene analysis
• Objects detection and tracking
• Speech detection
• Audio classification
• Music analysis
• Topic analysis
• Emotion analysis
The long list of features above is implemented within a unified media content
analysis platform which the authors denote as the media predictability framework.
The predictability framework is a nonparametric probabilistic approach for media
analysis, which is used for all the basic building blocks that require high-level media
analysis: recognition, clustering, classification, salience detection, etc. The predict-
ability measure is defined as follows: given a query media entity d and a reference
media entity C (e.g. portions of images, videos or audio), we say that d is predictable
from C if the likelihood P(d | C) is high and unpredictable if it is low. For instance, if
a query media is unpredictable given the reference media, we might say that this
media entity is interesting or surprising. Yet another example: we can associate a
photo of a face with a specific person if this photo is highly predictable from other
photos of that person.
Descriptor Extraction Daisy descriptors (Tola et al. 2010) are used, which compute
a gradient image and then, for each sample point, produce a log-polar sampling
(of size 200). Video descriptors describe space-time regions (e.g. three frames,
yielding a descriptor of length 200 × 3 = 600 around each sample point). Video
descriptors may be sampled at interest points, and a descriptor-space representation
with reduced dimensionality is obtained by one of the known methods.
Density Estimation Given both the descriptor-space representatives {q1,. .., qL}
and the descriptor set extracted from the reference C ¼ {f1,. .., fK}, the next step is
likelihood estimation. The log likelihood log P(qi) of each representative qi is
estimated using the nonparametric Parzen window probability density estimation
method using a Gaussian kernel.
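The density estimation step can be sketched as follows. The function name, the isotropic Gaussian kernel and the fixed bandwidth are assumptions of this sketch, not details given by the authors:

```python
import numpy as np

def parzen_log_likelihood(query, reference, bandwidth=1.0):
    """Estimate log P(q) for each query descriptor with a Parzen window
    (kernel density estimate) over the reference descriptor set, using an
    isotropic Gaussian kernel of the given bandwidth."""
    query = np.atleast_2d(query)          # (L, D) query representatives
    reference = np.atleast_2d(reference)  # (K, D) reference descriptors
    K, D = reference.shape
    # Squared Euclidean distance between every query and reference point.
    d2 = ((query[:, None, :] - reference[None, :, :]) ** 2).sum(-1)
    # Log of the normalised Gaussian kernel value for each pair.
    log_norm = -0.5 * D * np.log(2 * np.pi * bandwidth ** 2)
    log_kernels = log_norm - d2 / (2 * bandwidth ** 2)
    # log P(q) = log((1/K) * sum_k kernel(q, f_k)), via log-sum-exp for stability.
    m = log_kernels.max(axis=1, keepdims=True)
    return m.squeeze(1) + np.log(np.exp(log_kernels - m).sum(axis=1)) - np.log(K)
```

A query descriptor lying amid the reference descriptors receives a higher log likelihood (is more "predictable") than one far away from them.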
Media Predictability Score Video descriptors with their corresponding probability
density estimations are stored in a database. In order to evaluate the media likelihood
P(d | C), a weighted k-nearest neighbours method is used. The predictability
PredictabilityScore(M1 | M2) of media entity M1 given the media entity M2 as a
reference is computed. Similarly, the predictability PredictabilityScore(M2 | M1) of
media entity M2 given the media entity M1 as a reference is computed. The two
predictability scores are combined to produce a single similarity measure. As a
combination function, one can use the “average” or the “max” operators.
The basic building blocks that require high-level media analysis (recognition, clustering, classification, salience detection, etc.) can now be defined with the presented
framework. For example, the classification block computes the PredictabilityScore
(d | Ci) of the query media entity d for each class Ci. The classification decision may
be the highest scored predictability or the posterior probabilities computed using the
nonparametric Parzen window estimation. The saliency block tries to predict a
portion of a media entity (It) based on the previous media entity portions (I1, ..., It−1)
that precede it. This can also indicate that this point in time is “surprising”, “unusual”
or “interesting”. Let d be some query media entity, and let C denote the reference set
of media entities. The saliency of d with respect to C is defined as the negative log
predictability of d given C (i.e. −log PredictabilityScore(d | C)). Using this notation,
one can say an event is unusual if its saliency measure given other events is high.
The importance measure is used to describe the importance or the amount of
interest of a video clip for some application. This measure is subjective; however, in
many cases it can be estimated with no user intervention using attributes such as the
following:
6 Automatic Video Editing 169
• Camera wandering may indicate that the photographer is changing the focus of
attention; shaky camera movements also indicate that the scene is less important.
• Camera zoom is usually a good indication for high importance because, in many
cases, the photographer zooms in on some object of interest to get a close-up view
of the subject.
• Face close-up, speech recognition and laughter detection are all good indicators
for the higher importance of the corresponding scene.
Given a visual entity d (e.g. a video segment), the attributes above can be used to
compute the Importancy measure as a weighted sum of all the attribute scores:
Importancy(d) = Σi max(αi si, 0),

where αi are the relative weights of each attribute. A video editing score for a video editing selection of clips c1, ..., cn is defined as:

EditingScore(c1, ..., cn) = Σi Importancy(ci).
Finally, the authors pose the problem of automatic video editing as an optimisation of the editing score above, given some constraints (e.g. that the total length of all the selected sub-clips is not longer than a predefined value). This highly noncontinuous function can be approximately optimised using stochastic optimisation techniques (e.g. simulated annealing, genetic algorithms). As an example of a
greedy algorithm, Boiman and Rav-Acha (2017) suggest sorting the editing selection of clips c1, ..., cn in descending order of their editing score and taking the first k clips from the ordered list, given the constraints mentioned above.
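The Importancy measure and the greedy selection above can be sketched as follows. The clip representation (a dict with `score` and `duration` fields) is a hypothetical data structure of this sketch, not the authors':

```python
def importancy(attribute_scores, weights):
    """Importancy(d) = sum_i max(alpha_i * s_i, 0): a weighted sum of
    attribute scores, with negative contributions clipped to zero."""
    return sum(max(a * s, 0.0) for a, s in zip(weights, attribute_scores))

def greedy_edit(clips, max_total_length):
    """Greedy selection: walk the clips in descending order of editing
    score and keep every clip that still fits within the total duration
    constraint.  Each clip is a dict with 'score' and 'duration' keys."""
    selected, total = [], 0.0
    for clip in sorted(clips, key=lambda c: c["score"], reverse=True):
        if total + clip["duration"] <= max_total_length:
            selected.append(clip)
            total += clip["duration"]
    return selected
```

The greedy result is only an approximation of the constrained optimum, which is why the authors also mention stochastic techniques such as simulated annealing.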
An important method for improved video editing is proposed by Boiman and
Rav-Acha (2017). The authors say: “The term ‘cutaway shot’, or simply ‘cutaway’
. . . is the interruption of a continuously filmed action by inserting a view of
something else. It is usually, although not always, followed by a cut back to the
first shot. The term ‘B-roll’. . . is supplemental or alternative footage intercut with the
main shot which is referred to as the ‘A-roll’. . . B-roll is a well-known technique
used in both film and the television industry. In fiction film, B-roll is used to indicate
simultaneous action or flashbacks. In documentary films, B-roll is used in inter-
views, monologs, and usually with an accompanied voiceover, since B-rolls usually
do not have their own audio. . . As may be apparent, manually generating a video
production that involves B-roll is extremely time consuming and requires experience
in video production. It would, therefore, be advantageous to be able to automatically
generate video production that includes this feature”.
Using the same media predictability framework as described above, it is proposed
to automatically insert B-roll in moments where the visual footage is relatively
boring (e.g. a talking person who is not moving). These moments can be identified
as having a low saliency. Speech recognition and video analysis can be used to
understand the topic and add relevant footage (e.g. photos that are related to that
topic) as a B-roll. For example, detecting the words “trip” and “forest” might yield
photos taken from a forest. When detecting objects and locations in the background,
for example, if someone was taking video with the Eiffel Tower in the background, a
B-roll containing close-ups of the Eiffel Tower could be selected.
As can be seen from the review in the previous section, existing methods of
automatic film editing rely on handcrafted rules coded into the software. The rules
require hand-engineered features such as joint attention focus extracted from multiple footage sources or the parsing of dialogue scripts. Some of the systems referred
to need user input to highlight the most prominent scenes or utilise engineering
approaches for the detection of important scenes such as close-up detection or using
image classification to determine a topic of raw video materials and to cut the video
accordingly.
In this section, we explore the possibility of learning the cinematographic editing rules directly from the reference movies and applying them to new video production (Podlesnyy 2020).
Figure 6.8 shows the video footage features extraction pipeline. Frames are sampled
from the video stream of possibly several video files obtained from the user’s
gadgets. After simple preprocessing (downscaling to 227 × 227 pixels, mean colour value subtraction), the frame images are input into a GPU where a combination of three convolutional neural networks resides. A GoogLeNet (Szegedy et al. 2014)-
structured network trained to classify 1000 classes of ImageNet is used to extract a
semantic feature vector of length 1024. A network trained to regress the aesthetical
score on the AVA dataset (Jin et al. 2016) produces a vector of length 2. A network
trained to classify an image into three classes of shot sizes (close-up, medium shot,
long shot) produces a vector of length 3 of the probabilities of an image belonging to
the corresponding shot size. The vectorised attributes of every sampled frame are
stored in the Frames database.
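The preprocessing step can be sketched as follows. Nearest-neighbour resampling and the function name are assumptions of this sketch; the authors do not specify the interpolation method:

```python
import numpy as np

def preprocess_frame(frame, mean_bgr, size=227):
    """Downscale a H x W x 3 frame to size x size with nearest-neighbour
    sampling and subtract the per-channel mean colour, as in the pipeline
    described above."""
    h, w, _ = frame.shape
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    small = frame[rows][:, cols].astype(np.float32)
    return small - np.asarray(mean_bgr, dtype=np.float32)
```

The resulting 227 × 227 × 3 tensor is what the three convolutional networks would consume.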
The next step of the pipeline is segmenting the video footage into coherent shots.
We use the approach described in detail in Podlesnaya and Podlesnyy (2016): to
determine shot boundaries, we analyse the vector distance between the semantic
feature vectors of neighbouring frames. If the vector distance is large enough, we
place the shot boundary there. For every separate shot, we calculate the attributes as
the mean value of the semantic feature vectors and median values of the shot size
vector and aesthetic score. The resulting attributes are stored in the Shot Attributes
database.
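The segmentation and per-shot pooling described above can be sketched as follows; the threshold value and function names are illustrative assumptions:

```python
import numpy as np

def segment_shots(frame_features, threshold):
    """Place a shot boundary wherever the Euclidean distance between the
    semantic feature vectors of neighbouring frames exceeds `threshold`;
    return (start, end) frame index ranges, end exclusive."""
    F = np.asarray(frame_features, dtype=np.float64)
    dists = np.linalg.norm(F[1:] - F[:-1], axis=1)
    cuts = [0] + [i + 1 for i, d in enumerate(dists) if d > threshold] + [len(F)]
    return list(zip(cuts[:-1], cuts[1:]))

def shot_attributes(frame_semantic, frame_shot_size, frame_aesthetic, shots):
    """Per-shot attributes as described above: mean of the semantic vectors,
    median of the shot-size probabilities and of the aesthetic score."""
    out = []
    for a, b in shots:
        out.append({
            "semantic": np.mean(frame_semantic[a:b], axis=0),
            "shot_size": np.median(frame_shot_size[a:b], axis=0),
            "aesthetic": float(np.median(frame_aesthetic[a:b])),
        })
    return out
```

Each dict would then be stored as one row of the Shot Attributes database.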
We perform the same pipeline for reference cinematography samples as well as
for user-generated video content. For reference samples, we used 63 of the 100 best
movies as listed by the American Society of Cinematographers (ASC 2019) and
processed them, excluding 2 minutes of content from the beginning and from the
end, thus eliminating captions, studio logos, etc. as irrelevant material.
The process of automatic film editing starts when the user selects the video footage they want to have edited. According to the user's selection of raw footage, the Features Preparation module reads data from the Shot Attributes database and feeds it into the Editing module comprising the learned controller model for automatic video editing. The Editing module produces a storyboard. Based on that, it is possible to compose the output video clip by cutting from the raw footage.
For semantic features extraction, the GoogLeNet (Szegedy et al. 2014) model trained
by BVLC on the ImageNet dataset (Russakovsky et al. 2014) is used. The final
classification layer is omitted and the output of layer “pool5/7x7_s1” is used as a
semantic feature vector of length 1024. After feature vectors from over 1,670,000
frames of the motion picture masterpieces mentioned in Sect. 6.2.1 have been
extracted, an incremental PCA was performed to reduce the dimensionality of the
feature vectors to 64. The residual error for the last batch on the incremental PCA
was 0.0029.
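The dimensionality reduction step can be sketched with scikit-learn's IncrementalPCA, which fits the projection batch by batch so that the 1,670,000 feature vectors never need to be held in memory at once. The batch layout is an assumption of this sketch:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def reduce_features(feature_batches, n_components=64):
    """Fit an incremental PCA over batches of CNN feature vectors
    (length 1024 in the pipeline above) and return the fitted model,
    which can then project any batch down to n_components dimensions."""
    ipca = IncrementalPCA(n_components=n_components)
    for batch in feature_batches:
        ipca.partial_fit(batch)   # update the projection with one batch
    return ipca
```

After fitting, `ipca.transform(batch)` yields the reduced vectors used later as shot features.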
In order to automatically detect the shot size, a classifier was trained to distinguish images between three classes: close-up, medium shot and long shot. A dataset
using detailed cinematography scripts of 72 episodes of various Russian TV series
was used. The dataset contained 566,000 frames with a nearly even distribution of
frames belonging to the three classes. The GoogLeNet-based network structure was chosen for its compact size and robustness to lighting and colouristic conditions. The network had three outputs, and the Softmax loss function was used for training. No augmentation was used for the training data except horizontal flipping. The top-1 testing
accuracy was 0.938 given the overall noisy nature of the dataset.
A trained network described by Jin et al. (2016) was used for the aesthetic scoring
of the shot. Concretely, the AVA-2 variant was used. As mentioned above, the
crowdsourced ratings of the training data are somewhat biased towards common tastes. For example, they clearly penalise the score of images with a sloped skyline or with faces partially occluded by hair. However, they do a good job in scoring low-quality images with obvious technical defects such as blur, defocus, unclear spots, etc.
For each shot, a state vector comprised the following:
• Semantic features – 64 real values (average pooling over frames in the shot)
• Shot size features – three real values (median pooling over frames in the shot)
• Aesthetic score – one (median pooling)
• Normalised Euclidean distance between the pooled semantic feature vectors of
current and previous shots
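Assembling the per-shot state vector can be sketched as follows; reading "normalised Euclidean distance" as the distance between unit-normalised vectors is an assumption of this sketch:

```python
import numpy as np

def shot_state_vector(semantic_64, shot_size_3, aesthetic, prev_semantic_64):
    """Assemble the 69-dimensional per-shot state described above:
    64 semantic values, 3 shot-size values, 1 aesthetic score and the
    normalised Euclidean distance to the previous shot's semantic vector."""
    a = np.asarray(semantic_64, dtype=np.float64)
    b = np.asarray(prev_semantic_64, dtype=np.float64)
    an = a / (np.linalg.norm(a) + 1e-9)   # unit-normalise both vectors
    bn = b / (np.linalg.norm(b) + 1e-9)
    dist = np.linalg.norm(an - bn)
    return np.concatenate(
        [a, np.asarray(shot_size_3, dtype=np.float64), [aesthetic], [dist]]
    )
```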
The plan was to use motion picture masterpieces as reference samples of good
editing. The video editing was modelled as a process of making control decisions
on whether to include a shot in a final movie or to skip it. Further, the editing rhythm
is modelled by learning to make fine-grained decisions on the duration of a shot that
was selected to be included in the final movie. In particular, the video editing process
was modelled as a sequence learning problem (Ross et al. 2011) with the Hamming
loss function. The following labels were used for shot sequence labelling (Table 6.1).
The training data were prepared as follows. As described in Sect. 6.2.2, a
sequence of shots was collected into a clip having a duration of around 2 minutes.
This could be regarded as the target movie duration. Each shot in the clip sequence
was labelled according to its duration with labels 1–4 (as in Table 6.1), producing a
reference “expert” movie cut. In order to give the model a concept of bad montage,
around 40 augmented clips were produced from each reference clip. This was
performed by randomly inserting shots taken from other movies in the masterpieces
collection, thus disrupting the author’s idea of an edit. Such a shot was assigned a
label 5. Additionally, label 5 was assigned to all shots having an aesthetical score
below some threshold (e.g. 0.1). This gave the model an idea of penalising the shots
having clear technical issues as per common tastes. As a result, a training set having
108,491 sample clips was obtained.
Vowpal Wabbit (Langford et al. 2007) was used to train the sequence learning model using DAGGER (Dataset Aggregation), an iterative algorithm that trains a deterministic policy in an imitation learning setting where expert demonstrations of good behaviour are used to learn a controller.
Given a state s, denote as C(s, a) the expected immediate cost of performing action a in state s, and denote as Cπ(s) = Ea∼π(s)[C(s, a)] the expected immediate cost of policy π in s. In imitation learning, the true costs C(s, a) for the particular task may not necessarily be known or observable. Instead, expert demonstrations are observed, and the aim is to bound the total cost J(π) for any cost function C based on how well π mimics the expert's policy π*. Denote as ℓ the observed surrogate loss function, which is minimised instead of C. The goal is to find a policy π̂ that minimises the observed surrogate loss under its induced distribution of states, i.e.:

π̂ = argminπ Es∼dπ[ℓ(s, π)].
At the first iteration, the DAGGER algorithm uses the expert’s policy to gather a
dataset of trajectories D. Then the algorithm proceeds by collecting a dataset at each
iteration under the current policy and trains the next policy under the aggregate of all
the collected datasets. The intuition behind this algorithm is that, over the iterations,
the set of inputs is built up that the learned policy is likely to encounter during its
execution based on previous experience (training iterations).
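The DAGGER loop can be sketched as follows. The three callables are hypothetical placeholders of this sketch, not APIs from the chapter or from Vowpal Wabbit:

```python
def dagger(expert_policy, train_classifier, env_rollout, n_iters=5):
    """Minimal DAGGER sketch (Ross et al. 2011).  `expert_policy(s)` returns
    the expert action for state s, `train_classifier(dataset)` fits a policy
    on (state, action) pairs, and `env_rollout(policy)` returns the states
    visited while running a policy."""
    dataset = []
    policy = expert_policy   # iteration 0: gather states under the expert
    for _ in range(n_iters):
        states = env_rollout(policy)
        # Label every visited state with the expert's action and aggregate.
        dataset += [(s, expert_policy(s)) for s in states]
        # Train the next policy on the aggregate of all collected data.
        policy = train_classifier(dataset)
    return policy
```

Because each iteration collects states under the current learned policy, the aggregate dataset covers the inputs the policy is actually likely to encounter, which is exactly the intuition stated above.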
The state vector s is constructed as follows: the action labels a of the six preceding shots are used, and the neighbouring semantic feature vectors from the sixth shot before to the third shot after the current one are added to the state vector s. The held-out loss after 32 epochs of training was 4.06, while the average length of a sequence was 21.
In Fig. 6.9a, the test footage is shown in the storyboard format. It contains an
unmodified fragment of “Cool Hand Luke”, a 1967 film edited by Stuart Rosenberg.
Figure 6.9b shows the result of automatic editing. The fact that the algorithm has modified the fragment shows that the model has not simply overfitted, because “Cool Hand Luke” was used for training. One may observe that the algorithm has shortened the footage
and deleted some beautiful shots that look foreign to the overall story, e.g. a truck
near the beginning of the fragment. Note that the transition between the medium
1 shot and the truck (a very long distance shot) is quite abrupt, and since the model
had learned average rules of editing it removed the truck shot due to the unusual
transition. The overall cut of the resulting clip looks smooth.
Figure 6.10a shows test footage constructed from a fragment of the movie not
seen by the model during training. It is “Gagarin the First in Space”, 2013, edited by
Fig. 6.9 A fragment of “Cool Hand Luke”, 1967, editor Stuart Rosenberg: (a) original footage; (b)
result of automatic edit
Pavel Parhomenko. The fragment was augmented by randomly inserting some shots
taken from random places in the same movie. In Fig. 6.10b the result of auto-editing
is presented. The shot with a close-up on the radio has been correctly removed, but a
few other foreign shots have been left. However, the overall cut looks smooth, and
even in colouristic format it shows a nice gradual change of tone throughout the
length of the clip.
In Fig. 6.11a a fragment from “Das Boot”, 1981, edited by Wolfgang Petersen, is
shown augmented by inserting a few shots from “Fanny and Alexander”, 1982,
edited by Ingmar Bergman. Note that the inserted shots have a very similar tone, with close-up faces looking in a direction that breaks the 180-degree rule of action: in Das
Boot an officer speaks to a crew member in front of him, so the correct cut would be
to montage shots with facing directions. The model trained by imitation learning
correctly removed the wrong shots, as shown in Fig. 6.11b.
In order to estimate how well the proposed method learns basic video editing rules
from unlabelled reference samples of motion picture masterpieces, the numbers of
Fig. 6.10 A fragment of “Gagarin the First in Space”, 2013, editor Pavel Parhomenko: (a) original
footage. Random shots inserted at positions 4, 8–10, 14–15 from the top; (b) the result of automatic
editing
transitions between the shot sizes in the reference footage, in raw non-professional
footage and in the automatically edited clips have been manually counted.
According to cinematography editing rules, the following shot sizes are common:
detail, close-up, medium 1, medium 2, long shot, and very long shot. It was advised,
for example, by Kuleshov in the early twentieth century, that transitions between the
shots should occur with two size steps, e.g. between the medium 1 shot and the long
shot. Transitions between the outer shot sizes – detail and very long shot – can be
done in one step, i.e. detail to close-up and long shot to very long shot.
Fig. 6.11 A fragment of “Das Boot”, 1981, edited by Wolfgang Petersen: (a) original footage.
Foreign faces inserted at positions 16, 18, 20; (b) the result of automatic editing
In order to evaluate whether the trained DAGGER model has learned the very
basic principles of video editing from the data features extracted from motion picture
masterpieces, the numbers of transitions between shots of different sizes in three
corpuses of footage: clips sampled from the masterpieces dataset, clips sampled from
non-professional video footage and clips automatically edited by the DAGGER
model, have been calculated. Figure 6.12 shows a normalised representation of the
distribution of transitions between the shots. It is easy to see that the distribution for the automatically edited clips is much closer to that of the motion picture masterpieces than the distribution for the raw non-professional footage.
Fig. 6.12 Normalised representations of distribution of transitions between the shots: (a)
non-professional video; (b) motion picture masterpieces; (c) automatically edited by our algorithm
Along with Médioni (2017) and Matias and Phan (2017), we state the problem of
video editing as finding an optimal path in a graph. Unlike these two works, we do
not construct a bipartite graph of possible transitions between the footage taken by
multiple video cameras.
Consider an acyclic directed weighted graph G(V, E). Its vertices V = {v1, ..., vN}
are video takes, each having approximately homogeneous visual content. The
vertices may be ordered naturally by video takes time codes or reordered by a
user. We add two special vertices to V: v0 and vN+1, which correspond to the film's beginning and end markers. Each vertex is characterised by its video take duration ti.
Graph edges E = {eij} are defined only for i < j and correspond to a possible montage transition from video take i to take j. Graph edges eij are characterised by the cost function wij.
In order for the edited film to include at least one video take, the following
conditions should be met:
∀ j < N + 1 ∃ e0,j,
∀ i > 0 ∃ ei,N+1.
In this section, we discuss one of the most important parts of the cost function:
transition quality. In the spirit of the previous section on imitation learning, we are
going to use motion picture masterpieces as reference samples of good editing.
Based on many publications on semantic indexing including our own (Podlesnaya
and Podlesnyy 2016), we hypothesise that the feature vector extracted from a frame
image contains information on the frame semantics, colour tone and subtle geometric
properties required for video editing decisions mimicking the experts’ actions.
To illustrate the intuition behind montage transition quality evaluation, in Fig. 6.14 the feature vectors fi, fi+1, fi+2, fi+3 for video takes #i, ..., #i+3 are
drawn as bold dots. These videos are taken from a sequence of raw video materials
provided for video editing.
It is possible to find all the feature vectors of the reference cinematographic
materials (motion picture masterpieces) within the circle of radius ri centred at fi.
These closest neighbours, drawn as small dots in Fig. 6.14, will be expert takes
having the closest visual contents to our given raw material take #i.
Consider the circles in the feature vector space, centred at fi+1, fi+2, fi+3. Let us
count all the montage transitions in the reference cinematographic materials from the
neighbours of take #i. As shown in Fig. 6.14, some of the transitions do not fall near
any circles around our raw takes. One transition is found to arrive into a neighbourhood of fi+3, and two transitions arrive into that of fi+1. None has arrived into the neighbourhood of fi+2.
The transition from take #i into take #i + 1, having the maximum count, may be
regarded as the reference in the sense that, in a case of the visual contents of a take #i,
the experts have most often chosen to make the transition into a take that is visually
similar to #i + 1.
In order to score the transition quality numerically, it is possible to normalise the
counts of reference transitions either, for example, by the maximum transitions count
for a given raw materials set or by the average value of the said count for the
reference cinematographic materials corpus. In the former case, the transition quality
may exceed 1.
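The transition counting described above can be sketched as follows. The brute-force neighbourhood test and the fixed radius are simplifications of this sketch; a real implementation would use an approximate nearest-neighbour index over the reference corpus:

```python
import numpy as np

def transition_counts(raw_takes, ref_takes, ref_transitions, radius):
    """For every ordered pair (i, j), i < j, of raw takes, count how many
    reference montage transitions (a -> b) lead from the neighbourhood of
    raw take i into the neighbourhood of raw take j.  Neighbourhoods are
    balls of the given radius in the shared feature-vector space."""
    raw = np.asarray(raw_takes, dtype=np.float64)
    ref = np.asarray(ref_takes, dtype=np.float64)
    # near[i, k] is True when reference take k lies in the ball around raw take i.
    near = np.linalg.norm(raw[:, None, :] - ref[None, :, :], axis=2) <= radius
    n = len(raw)
    counts = np.zeros((n, n), dtype=int)
    for a, b in ref_transitions:          # reference transition: take a -> take b
        for i in range(n):
            if not near[i, a]:
                continue
            for j in range(i + 1, n):
                if near[j, b]:
                    counts[i, j] += 1
    return counts
```

Normalising `counts` (by the maximum count for the raw set, or by the corpus average) then yields the transition quality values discussed above.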
As shown before, the optimal editing of raw video materials is modelled as finding
the shortest path in a graph G(V, E) constructed from video takes as vertices and all
possible montage transitions as edges. Intuitively, an inverse of the image quality
metric and inverse of the montage transition quality metric described above are
possible candidates for the cost function. However, the cost function penalising just
the quality may result in degenerate solutions with zero or single video takes
included in the edited film. Therefore, the cost function wij for the edge weights
should enable scalarised multi-objective optimisation on the graph.
Let us consider the linear scalarisation of the multi-objective cost function wij:
wij = κ/Qij + μ Dij + η Σs=i+1…j−1 Us + λ (T/N − tj)²,
where κ is the weight of the transition quality objective, μ is the weight of the
monotonic content penalty, η is the weight of the video take skipping penalty, λ is
the weight of the total desired film length objective, Qij is the transition quality
metric, Dij is the distance metric between takes i and j, Us is the unique value metric
of the skipped take, tj is a video take j duration, T is the target duration of the edited
film, and N is the number of raw video takes.
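The edge cost and the shortest-path search can be sketched as follows. The dense edge enumeration, the cost callable and the container layout are simplifications of this sketch:

```python
import heapq

def edge_cost(i, j, Q, D, U, t, T, N, kappa, mu, eta, lam):
    """w_ij = kappa/Q_ij + mu*D_ij + eta*sum_{s=i+1}^{j-1} U_s
            + lam*(T/N - t_j)^2, the scalarised cost defined above."""
    skip_penalty = sum(U[s] for s in range(i + 1, j))
    return (kappa / Q[i][j] + mu * D[i][j]
            + eta * skip_penalty + lam * (T / N - t[j]) ** 2)

def shortest_path(n_takes, cost):
    """Dijkstra from the begin marker (vertex 0) to the end marker
    (vertex n_takes + 1); cost(i, j) gives the weight of edge e_ij, i < j.
    Returns the list of selected take indices."""
    end = n_takes + 1
    dist, prev = {0: 0.0}, {}
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == end:
            break
        if d > dist.get(i, float("inf")):
            continue                      # stale heap entry
        for j in range(i + 1, end + 1):
            nd = d + cost(i, j)
            if nd < dist.get(j, float("inf")):
                dist[j], prev[j] = nd, i
                heapq.heappush(heap, (nd, j))
    path, v = [], end
    while v != 0:
        path.append(v)
        v = prev[v]
    return list(reversed(path))[:-1]      # drop the end marker
```

Because all edges go from lower to higher indices, the graph is acyclic and Dijkstra terminates quickly; limiting j to a window around i (as done later with a cap of 30 transitions) reduces the quadratic edge count.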
The transition quality metric is defined as

Qij = αj (ε + |KNNi → KNNj|),

where αj is the image quality metric based on either (Jin et al. 2016) or the nearest neighbour method described in Sect. 6.3.2; ε is a small constant for numerical stability; and |KNNi → KNNj| is the number of montage transitions in reference materials from neighbourhood i into a neighbourhood j.
Thus, the transition quality metric combines an evaluation of both image aesthetics and montage transition aesthetics.
The distance metric between takes is as follows:
Dij = 1 / (ε + ||FVi − FVj||2),
where ε is a small constant for numerical stability and FVk is the semantic feature
vector of video take k (the mean value for the number of frames comprising the take).
The distance metric between the takes constitutes the monotonic content penalty.
It appears in practice, for example, that due to a superior image or transition quality
of the subsample of video takes, the shortest path-finding algorithm aggressively
excludes all vertices except for very similar ones, resulting in a dull monotonic
narration. It is advised that the weight μ be selected interactively by the user.
The unique value metric of the skipped take is based on the ideas of Uchihachi
et al. (2003). We propose the following algorithm for the calculation of Us.
1. Given a tuple of raw materials feature vectors x = {FV1, ..., FVN}.
2. Produce a tuple of Euclidean distances between the consecutive feature vectors: d = {||FV1 − FV2||2, ..., ||FVN−1 − FVN||2}.
3. Clusterise the vectors x by the DBSCAN algorithm using density radius eps = k·STD(d), ranging the coefficient k between 0.3 and 1.3. Select the clustering producing the maximum class variance for x.
4. Produce a tuple of class labels L = {l1, ..., lN}. For x members not assigned to any class (outliers), use label −1.
5. Calculate the unique value metric for every video take.
Uk = β, if lk = −1;
Uk = −ln((1/N) Σi 1(li = lk)), otherwise,
where β is an interactively adjusted coefficient of the unique value metric over the
transition quality and 1(∙) is an indicator function valued as 1 if the argument is the
true condition and 0 if false.
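The algorithm above can be sketched with scikit-learn's DBSCAN. Using a single fixed coefficient k instead of the 0.3–1.3 variance-maximising sweep is a simplification of this sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def unique_value_metric(feature_vectors, beta, k=0.8):
    """U_k per take: beta for DBSCAN outliers (label -1), otherwise the
    negative log of the fraction of takes sharing the take's cluster label."""
    x = np.asarray(feature_vectors, dtype=np.float64)
    # Distances between consecutive feature vectors set the density scale.
    d = np.linalg.norm(np.diff(x, axis=0), axis=1)
    eps = max(k * float(np.std(d)), 1e-9)
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(x)
    n = len(x)
    U = np.empty(n)
    for idx, lbl in enumerate(labels):
        if lbl == -1:
            U[idx] = beta                 # rare, unclustered content
        else:
            U[idx] = -np.log(np.count_nonzero(labels == lbl) / n)
    return U
```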
Thus, the unique value metric of the skipped take prevents degenerate solutions to the shortest path-finding problem by penalising vertex skipping. The amount of the penalty depends on the content's uniqueness metric. Unique content is rare in the raw materials and therefore is not assigned to any cluster; instead, the interactively chosen value β is used. On the other hand, important content is
normally that which occupies the greatest duration of the raw footage, so it is
clustered in larger classes, and the resulting metric reflects the content's importance as well as its uniqueness.
Fig. 6.15 An example storyboard of automatic editing of non-professional video footage: (a) raw
footage; (b) automatically edited clip
To speed up the path-finding algorithm, the maximum number of possible transitions from every shot was limited to 30. Besides the performance boost, this method serves as a regulariser against aggressive optimising by Dijkstra.
The resulting video clip has a duration of 39 seconds. In the first half of the clip,
we may watch a few middle shots having different backgrounds and character poses
selected from a very long take captured by a camera facing the cyclist. Note that the
resulting clip includes a nice shot of the sun and moments when the character was
smiling or made expressive gestures. The rest of the clip is cut from the shots taken
by the camera rotated towards the front of the bike. From countless long shots, the
algorithm has selected the moments showing two different types of scenery, and,
from these, the ones with an almost horizontal skyline were selected, given that the
raw footage contained 90% of frames with a skewed horizon. Note that none of the
above outcomes is hard-coded into the algorithm as an editing rule; the whole
process was completely data-driven.
In order to evaluate the automatic video editing quality, we prepared ten raw footage
collections described in Table 6.2. All footage was taken by non-professional users
and mostly filmed family members in various settings.
Every footage set was automatically edited using the algorithm described in
Sects. 6.3.2 and 6.3.4.
To evaluate the montage quality of the automatically edited videos formally, we
manually counted the numbers of transitions between the takes of different scales
and checked if they met the cinematography editing rules. Table 6.3 summarises the
results of these calculations. For best results it is advised in cinematography that the
transition happens between shot sizes that differ by two steps from each other, while
for the closest and the longest shot sizes, one step is also recommended. In Table 6.3,
the cells corresponding to the recommended transitions are marked with a bold
outline. These cells should contain the majority of transition counts for the test
footage edited by the algorithm. As seen from Table 6.3, two types of transitions
clearly break the “rules”:
• Close-up → long distance
• First middle → second middle
This may be explained by the fact that the non-professional footage used for automatic editing most often contains close-ups and long distance shots, and the software could do nothing more than use the available shots for editing. The lack of representational power of the feature vectors to distinguish between the first and the second middle shots could be the reason for the second “rule-breaking” result.
Next, to evaluate the automatic video editing quality, we manually counted the
ratio of frames with low technical quality (frames subsampling was used to reduce
the amount of manual work). The following technical problems with video images
were taken into account:
• Out of focus or blurry image
• Wrong exposure (too bright or too dark)
• Skewed horizon line
• Messy/unclear content
• Scene occlusion
The overall mean ratio of defective frames in the edited clips was 0.15 (with a
standard deviation of 0.08). However, some clear outliers were apparent. For
example, in the ski resort videos captured by the GoPro camera attached to the
skier’s helmet, horizon skew was the predominant type of technical defect, and the
automatic editing algorithm failed to select the optimal content to cut the clip. In the
park outdoor footage, almost 17% of the frames were classified by an expert as
“messy/unclear”. If obvious outliers are removed, the mean ratio of defective frames
in the edited clips becomes 0.09 (with a standard deviation of 0.05).
It is possible to further reduce the ratio of defective frames in the resulting clips by
means of raising the value of α in Sect. 6.3.3. However, this may result in too
aggressive video shortening and a lack of moments of interest in the edited clip.
One of the well-known video editing rules is to “avoid jump cuts”. The jump cut
effect happens when a subject is filmed, and after the editing, the subject position is
changed horizontally by 1/3 or more of the frame width. We manually counted the
number of jump cuts in the automatically edited clips, and the mean ratio of jump
cuts was 0.01 (standard deviation equal to 0.02).
Sharp changes of colour tone between two consecutive shots in the cut are
regarded as bad editing. A manual calculation of abrupt colour tone changes
obtained a mean ratio value of 0.1 (standard deviation equal to 0.07).
The average shortening of the raw footage into an edited clip, given the default
values of automatic editing parameters, is 89%. However, the exact duration of the
resulting video clip depends on the parameters intended for interactive adjustment;
therefore, it is not practical to formally evaluate an algorithm by this parameter.
Figures 6.16 and 6.17 show a few examples of automatically edited video clips
made from non-professional footage.
Fig. 6.16 An example storyboard of automatic editing of non-professional video footage: from
4 minutes 54 seconds of highly shaky and messy raw footage captured with waterproof action
camera; this 35-second clip was created automatically, featuring scenery overview, underwater
scenes and crucial moment of a character approaching the camera with advanced swimming style
Fig. 6.17 An example storyboard of automatic editing of non-professional video footage: from
16 minutes 38 seconds of raw footage; this 1 minute 38 second clip was created automatically,
featuring scenes of driving toward the Grand Canyon, and the most dramatic scenery views and
character poses at the location
6.4 Conclusion
In this chapter, we have explored the ways of building an automatic video editing
system capable of extracting cinematography editing rules from the reference motion
pictures. Both methods discussed here operate with video shots as atomic units of
montage and rely on averaged semantic feature vectors extracted by a convolutional
neural network trained for ImageNet classification.
The optimal sequence search by the imitation learning algorithm (DAGGER) is in
fact a linear method, implemented as logistic regression trained with SGD. This
limits the generalising capability of the model and tends to over-smooth the results,
insofar as expert evaluation of video clips allows such a judgement. Nevertheless, this method
demonstrated that optimal sequence search by means of classifying pairs of semantic
feature vectors is capable of extracting knowledge from an unstructured corpus
of reference video content. For example, automatic editing was shown to
improve the distribution of transitions between shots of different scales by a
factor of four, compared with the distribution of transitions in raw footage filmed by
non-professional users.
Global path optimisation in the transitions graph, based on nearest-neighbour
search, allows us to perform automatic video editing by mimicking selected
reference content. The algorithm simply chooses from the raw footage the shots and
transitions most similar to the reference style. This approach permits
an arbitrary cost function with possibly non-differentiable parts and/or arbitrary functional
blocks such as smile detection. A multi-objective cost function with linear weight
coefficients is a natural choice for interactively adjusting different aspects of the video
editing.
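The path-optimisation idea can be sketched as a small dynamic-programming routine over a multi-objective, linearly weighted cost; the feature shapes, weights and function name below are illustrative assumptions, not the chapter's actual implementation.

```python
import numpy as np

def edit_path(shot_feats, ref_feats, k, w_sim=1.0, w_jump=0.5):
    """Pick k shots (in their original order) from n candidates by DP.

    shot_feats: (n, d) semantic feature vectors of raw-footage shots
    ref_feats:  (k, d) feature vectors of the reference-style shots
    The cost mixes similarity to the reference with a penalty on large
    semantic jumps; the weights w_sim, w_jump model the linear
    multi-objective coefficients mentioned in the text.
    """
    n = len(shot_feats)
    # unary cost: distance from each candidate shot to each reference slot
    unary = np.linalg.norm(
        shot_feats[None, :, :] - ref_feats[:, None, :], axis=2)  # (k, n)
    best = np.full((k, n), np.inf)
    back = np.zeros((k, n), dtype=int)
    best[0] = w_sim * unary[0]
    for t in range(1, k):
        for j in range(t, n):
            # pairwise cost penalises big semantic jumps between shots;
            # only earlier shots are allowed (linear narration)
            jumps = np.linalg.norm(shot_feats[j] - shot_feats[:j], axis=1)
            cand = best[t - 1, :j] + w_sim * unary[t, j] + w_jump * jumps
            back[t, j] = int(np.argmin(cand))
            best[t, j] = cand[back[t, j]]
    # backtrack the minimum-cost ordered selection
    j = int(np.argmin(best[k - 1]))
    path = [j]
    for t in range(k - 1, 0, -1):
        j = back[t, j]
        path.append(j)
    return path[::-1]
```

Limiting `j - i` in the inner loop would implement the depth limit on the transitions matrix discussed below.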
The dynamic programming technique has drawbacks such as the large memory
footprint for the nearest neighbour index and the quadratic complexity of the
transitions matrix calculation. The latter problem can be efficiently solved by
limiting the depth while constructing the transitions matrix, and the former problem
becomes less relevant as the memory resources of mobile devices grow. One serious
limitation of the proposed method is its linear narration: a transition graph can be
built only for an ordered sequence of footage, and reordering video takes or inserting
B-roll shots is not possible in a general way.
Interactivity is the key advantage of the proposed solution. Gamification of the
video editing process in the form of trial and error in setting various parameters
allows non-professional users to achieve quite pleasing results without requiring
technical and artistic skills. In Table 6.4 we list both the parameters described in
detail above and proposed future work on additional parameters for fine-tuning the
video editing.
References
Arev, I., Park, H.S., Sheikh, Y., Hodgins, J., Shamir, A.: Automatic editing of footage from multiple
social cameras. ACM Trans. Graph. 33(4), 1–11 (2014)
Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for
approximate nearest neighbor searching fixed dimensions. J. ACM. 45(6), 891–923 (1998)
ASC Unveils List of 100 Milestone Films in Cinematography of the 20th Century (2019) Accessed
on 06 October 2020. https://theasc.com/news/asc-unveils-list-of-100-milestone-films-in-cine
matography-of-the-20th-century
Boiman, O., Rav-Acha, A.: System and method for semi-automatic video editing. US Patent
9,570,107 (2017)
Cong, Y., Yuan, J., Luo, J.: Towards scalable summarization of consumer videos via sparse
dictionary selection. IEEE Transactions on Multimedia. 14(1), 66–75 (2012)
Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal
features. In: Proceedings of the IEEE International Workshop on Visual Surveillance and Perfor-
mance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting with applications
to image analysis and automated cartography. In: Readings in Computer Vision: Issues,
Problems, Principles, and Paradigms, pp. 726–740 (1987)
Jin, X., Chi, J., Peng, S., Tian, Y., Ye, C., Li, X.: Deep image aesthetics classification using
inception modules and fine-tuning connected layer. In: Proceedings of the 8th IEEE Interna-
tional Conference on Wireless Communications and Signal Processing, pp. 1–6 (2016)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video
classification with convolutional neural networks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Langford, J., Li, L., Strehl, A.: Vowpal Wabbit Online Learning Project (2007) Accessed on
06 October 2020. http://hunch.net/?p=309
Leake, M., Davis, A., Truong, A., Agrawala, M.: Computational video editing for dialog-driven
scenes. ACM Trans. Graph. 36(4), 130 (2017)
Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: An accurate O(n) solution to the PnP problem.
Int. J. Comput. Vision (2009). Accessed on 06 October 2020. https://doi.org/10.1109/ICCV.
2007.4409116
Matias, J., Phan, H.: System and method of generating video from video clips based on moments of
interest within the video clips. US Patent 10,186,298 (2017)
Médioni, T.: Three-dimensional convolutional neural networks for video highlight detection. US
Patent 9,836,853 (2017)
Merabti, B., Christie, M., Bouatouch, K.: A virtual director using hidden Markov models. In:
Computer Graphics Forum. Wiley (2015). https://doi.org/10.1111/cgf.12775.hal-01244643
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T.,
Van Gool, L.: A comparison of affine region detectors. Int. J. Comput. Vis. 65(1/2), 43–72
(2005)
Park, H.S., Jain, E., Sheikh, Y.: 3D social saliency from head-mounted cameras. In: Proceedings of
the 25th International Conference on Neural Information Processing Systems., vol. 1, pp.
422–430 (2012)
Podlesnaya, A., Podlesnyy, S.: Deep learning based semantic video indexing and retrieval. In:
Proceedings of SAI Intelligent Systems Conference, pp. 359–372 (2016)
Podlesnyy, S.: Towards data-driven automatic video editing. In: Liu, Y., Wang, L., Zhao, L., Yu,
Z. (eds.) Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery.
Advances in Intelligent Systems and Computing, vol. 1074. Springer, Cham (2020)
Pudovkin, V.I.: Model (sitter) instead of actor. In: Collected Works, vol. 1, p. 184, Moscow (1974)
(in Russian)
Rav-Acha, A., Boiman, O.: System and method for semi-automatic video editing. US Patent
9,554,111 (2017)
Rav-Acha, A., Boiman, O.: Method and system for automatic B-roll video production. US Patent
9,524,752 (2016)
Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., Sivic, J.: Neighbourhood consensus
networks. In: Proceedings of the 32nd Conference on Neural Information Processing Systems,
pp. 1658–1669 (2018)
Ross, S., Gordon, G.J., Bagnell, J.A.: A reduction of imitation learning and structured prediction to
no-regret online learning. In: Proceedings of the 14th International Conference on Artificial
Intelligence and Statistics, pp. 627–635 (2011)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A. C., Fei-Fei, L.: ImageNet large scale visual recognition
challenge. CoRR, arXiv: 1409.0575 (2014)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG (2019)
Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM
Trans. Graph. 25(3), 835–846 (2006)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,
Rabinovich, A.: Going deeper with convolutions. CoRR, arXiv:1409.4842 (2014)
Tola, E., Lepetit, V., Fua, P.: DAISY: An efficient dense descriptor applied to wide-baseline stereo.
IEEE Trans. Pattern Anal. Mach. Intellig. 32(5), 815–830 (2010)
Tsivian, Y.: Cinemetrics: part of the humanities’ cyberinfrastructure. In: Ross, M., Grauer, M.,
Freisleben, B. (eds.) Digital Tools in Media Studies, vol. 9, pp. 93–100. Transcript Verlag,
Bielefeld (2009)
Uchihachi, S., Foote, J.T., Wilcox, L.: Automatic video summarization using a measure of shot
importance and a frame-packing method. US Patent 6,535,639 (2003)
Zhao, B., Xing, E.P.: Quasi real-time summarization for consumer videos. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 2513–2520 (2014)
Chapter 7
Real-Time Detection of Sports Broadcasts
Using Video Content Analysis
7.1.1 Introduction
X. Y. Petrova (*)
Samsung R&D Institute Russia (SRR), Moscow, Russia
e-mail: xen@excite.com
V. V. Anisimovsky
Huawei Russian Research Institute, Moscow, Russia
e-mail: vanisimovsky@gmail.com
M. N. Rychagov
National Research University of Electronic Technology (MIET), Moscow, Russia
e-mail: michael.rychagov@gmail.com
Fig. 7.1 Recognition of the video genre in the video processing pipeline of a TV receiver
The main goal of our research, the results of which are presented in this chapter,
was to develop an algorithm for the detection of TV programmes containing sports
games in real-time video sequences, with the aim of automatically adjusting the
image settings in a TV receiver (see Fig. 7.1).
Prior works on video sequence classification can be subdivided into groups based on
the purpose of classification (e.g. genre detection or specific object detection), the
modalities used (such as video, audio or subtitles), the feature selection method and
the type of classifier (such as support vector machine (SVM), classification or
regression tree, neural network, etc.). In terms of purpose, the closest existing
works are the well-known studies on automatic genre detection. Genre detection
can be based on various modalities, for example, subtitles (Brezeale and Cook 2006;
Brezeale and Cook 2008), sound (Roach and Mason 2001; Dinh et al. 2002; Bai
et al. 2006), video (all other sources) or several of these modalities at the same time
(Subashini et al. 2012).
Brezeale and Cook (2006) used subtitles and DCT coefficients from the decom-
position of video frames as features. These authors achieved a high level of detection
accuracy but noted that subtitles are missing from many television broadcasts,
although in some countries such as in the USA, legislation obliges broadcasters to
provide subtitles with TV broadcasts. At the same time, however, subtitles are not a
description of what is shown on the screen and are not generated for scenes in which
there is no dialog. Finally, training and classification based on this feature may
involve great computational complexity, since the feature vector can consist of many
thousands of elements. Subtitles are not synchronised at the frame level and provide
no information on scene transitions and can therefore be used only for offline
processing. In addition, in a situation where the user is switching channels, this
type of algorithm will not be able to provide the timely response required for the
optimal selection of video processing settings.
Video sequences containing sporting events were indexed by Bai et al. (2006)
using SVM classifier based on the features of the audio stream. Approaches based on
the analysis of audio streams or subtitles are not applicable in a television receiver for
the automatic adjustment of video processing coefficients, since these approaches do
not have a sufficiently short response time to the corresponding video sequence. In the
following, we will consider methods based mostly on an analysis of the video stream.
In work by Jiang et al. (2009), the following genres are defined: cartoons,
advertising, news and sports. As a basic algorithm, the support vector method
using an oriented acyclic graph (DAGSVM) was selected. Fifteen visual features
of four types were distinguished (editing, colour, texture and movement). The
editing features captured the frequency of scene changes and its variation throughout
the programme, and the numbers of sharp and smooth transitions between scenes
were estimated. A histogram of the average brightness and saturation and the
percentage of pixels with brightness and saturation above a predetermined threshold
were used as colour features. The textural features were related to the statistical
properties of the halftone adjacency matrix: contrast, uniformity, energy, entropy
and correlation. The features related to movement were the average change in
brightness, the average difference in the RGB colour space between adjacent frames
and the proportion of frames that differed slightly from the previous frames (i.e. the
proportion of slow and/or static scenes).
The classification of movies by genre (such as action, drama or thriller) was
attempted by Huang et al. (2008). Five global features were used: the average length
of the episode, colour variation, movement, lighting (e.g. the presence of a flashlight)
and the statistics of the types of transitions between scenes. A decision tree classifier
was implemented.
The application of the properties of boundaries was described by Yuan and Wan
(2004), and a k-means classifier was used to distinguish between badminton, bas-
ketball, football, ice skating and tennis. In an earlier work by the same authors, the
properties of borders were used to identify frames containing a general view of the
spectator stands and advertising outside of the playing field.
The trajectories of faces and blocks of text were analysed by Wei et al. (2000) in
order to distinguish between advertising, news, comedy series and soap operas.
Classification was carried out by finding the maximum projection of the distribution
of the trajectories of a classified fragment on the set of trajectories of the training set.
In a study by Liu and Kender (2002), frames of video recording of lectures were
subdivided into four classes. To achieve this, a quasi-optimal procedure for selecting
features (from 300 initial features) was proposed.
Using camera movement analysis, Takagi et al. (2003a, b) presented the
categorisation of five types of sporting videos (sumo wrestling, tennis, baseball,
football and American football) in two papers. Each of these sports was
characterised by a particular style of shooting, and classification was carried out
considering the types of camera zooms, episodes with a static or shaking camera, and
transitions from one type of camera movement to another. Since their method was
not based on colour information, it could be very effective in classifying broadcasts
of games in different sports leagues (e.g. at Wimbledon, the tennis courts are green,
while at the French Open, they are red).
A new low-level attribute called the “energy flow of activity” was proposed by
Gillespie and Nguyen (2004). This feature was fed to the input of a network of radial
basis functions to define sports, cinema, landscape shots and news. The energy flow
of activity was measured within a given programme and was based on statistics of
macro-blocks of compressed video, including the number of I-blocks, that is, blocks
with reliably and unreliably predictable motion (e.g. in the case of a uniform
background with a moving camera).
Kittler et al. (2001) proposed the concept of “visual keys” for the identification of
sporting genres (tennis, track sports, swimming, yachting and cycling). These keys
impart semantic meaning to the low-level attributes of the frames of the video
sequence. In total, 16 types of keys were used: athletics tracks, boxing rings, indoor
cycle tracks, the ocean, ski jumps, pools, tennis nets, grass, blue sky, open cycle
tracks, the ocean (medium-range shot), treadmills (long-range shot), treadmills
(medium-range shot), treadmills (close-up), pools (close-up) and tennis courts. The
first nine of these detectors were represented by trained neural networks, and the last
eight used so-called texture codes. The results of the detection of these visual keys
were fed to the input of a k-means classifier to determine which type of sport was
being shown. This set of features was subsequently expanded by the addition of
multimodal elements to detect visual keys (Jaser et al. 2004). In the latter case,
hidden Markov models were used to analyse the time dependencies.
Later, a similar problem was considered which involved identifying seven types
of sporting events (climbing, basketball, motor racing, golf, ski jumping, football
and motorcycle racing) and five types of scenes within sports videos (Choroś and
Pawlaczyk 2010): final titles, commentators in the studio, initial titles, trailers and
tables and standings. A decision was made based on the calculation of colour
coherence vectors (Pass et al. 1996).
Another seven popular video genres were classified by Ionescu et al. (2012):
cartoons, advertising, documentaries, cinema, music videos, news and sports. Three
categories of descriptors were used, which related to colour, dynamics and structure.
Colour properties were characterised globally using statistics of the distribution of
various colours, elementary colours dominant within the image, the properties of
these colours (e.g. brightness, saturation) and the relationships between them. From
the point of view of dynamics, the rhythm of the video, the statistics of the
movements and the percentage of smooth transitions were evaluated. Structural
information was extracted at the frame level by constructing histograms of contours
and identifying the relationships between them.
It is easy to see that the tools used to classify images and videos are very diverse.
In particular, the effectiveness of the decision tree method (Huang et al. 2008),
principal component analysis (PCA) (Vaswani and Chellappa 2006), SVM (Dinh
et al. 2002; Bai et al. 2006), neural networks (Takagi et al. 2003a, b; Subashini et al.
2012), Kohonen networks (Koskela et al. 2009) and hidden Markov models (Truong
et al. 2000) has been demonstrated. An implementation of the nearest neighbour
classifier was reported by Mel (1997), and some researchers have used a random
forest or Bayesian approach (Machajdik and Hanbury 2010).
An interesting attempt was undertaken by Koskela et al. (2009) to establish
correspondence between a semantic concept and a list of synonyms for this concept
from WordNet. This idea can serve as the basis for a quick and effective synthesis of a
visual classifier based on a verbal description made by an expert. A similar idea was
described by Li et al. (2010), in which a bank of classifiers of visual objects utilises a
set of primitive classifiers that can form the inputs for more complex classifiers.
For the problem under consideration, methods of scene change detection are also of
interest; however, we limit ourselves here to the simplest case of a direct transition,
since this is the dominant type used in sports broadcasts.
The following requirements apply to the classification algorithm:
1. To control the settings of the video processing pipeline, a classification result
should be obtained for each individual frame, and there should be no abrupt
jumps except in the case of a scene change. Only one pass through the video
frame should be used. This requirement implies that the use of any kind of
temporal features should be avoided, since these introduce additional frame
delays.
2. The algorithm should be universal, i.e. sporting events should be differentiated
from all other types of scenes, including movies, news, cartoons, computer
graphics, concerts, etc.
3. The algorithm should be insensitive to the quality of the video stream (i.e. should
support different resolutions, both standard and high, and should be insensitive to
various compression levels and robust to progressive/interlaced types of content).
4. The complexity of the algorithm should be extremely low, allowing for a
hardware implementation as part of a video processing chip within a general,
more complex system. It is therefore desirable that only a limited local
neighbourhood should be used in the pixel analysis.
5. The algorithm should have a modular structure to enable interaction with other
video processing subsystems. It should also have a clear, straightforward inter-
pretation of its parameters, allowing simple and fast tuning towards the desired
behaviour in specific corner cases.
In addition, several considerations affect how classification quality should be assessed:
1. The importance of obtaining the correct result is not the same for different frames:
it is more important to achieve a stable detection result for frames containing a
green field with players moving around it than for frames containing spectator
stands or a commentator or even a player shown in close-up.
2. Not all classification errors have the same weight; it is perfectly permissible to
misclassify football as baseball, but it is unacceptable to misclassify football as a
music video.
3. Video data (at the level of individual frames) is inherently strongly biased, since
the distribution depends mostly on the duration of the episodes containing frames
with similar characteristics; i.e. frame frequency is only weakly related to
importance.
To formalise these requirements, an additional subclassification is introduced in
the form of a simple hierarchy and weights for penalisation of a confusion matrix.
The confusion matrix is a popular tool for assessing the quality of classification
algorithms (Godbole 2002). We divided the category of “sports games” into three
subcategories: C1 (football games), C2 (field-based sporting events other than
football) and C3 (non-game sports broadcasts, non-game scenes in game broad-
casts). All other frame types are classified into category C4. The weights for
penalisation of the confusion matrix are shown in Table 7.1.
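The weighted confusion-matrix evaluation can be sketched as follows; the penalty values are placeholders standing in for Table 7.1, which is not reproduced here.

```python
import numpy as np

# Hypothetical penalty weights for confusing true class (row) with
# predicted class (column); classes C1..C4 as defined in the text.
# The real values live in Table 7.1 -- these are placeholders.
PENALTY = np.array([
    [0.0, 0.1, 0.3, 1.0],   # C1 mistaken for C1..C4
    [0.1, 0.0, 0.3, 1.0],   # misclassifying football as another field
    [0.3, 0.3, 0.0, 0.5],   # sport is cheap; sport as music video is not
    [1.0, 1.0, 0.5, 0.0],
])

def weighted_penalty(true_labels, pred_labels, weights=PENALTY):
    """Mean penalty: each confusion counts by its class-specific cost."""
    conf = np.zeros_like(weights)
    for t, p in zip(true_labels, pred_labels):
        conf[t, p] += 1          # accumulate the confusion matrix
    return float((conf * weights).sum() / conf.sum())
```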
To simplify the development of the algorithm, class C1 was divided into three
classes: long-range shots, general shots and close-up shots. A separate classifier was
built for each class. In the first step, the simplest and most obvious case was
considered: in order to distinguish between frames containing long-distance shots
and hence distinguish frames of a football match from those of other genres, the
percentage of green pixels was used as a feature. Further development was carried
out using a truncated database, containing only frames for which the classifier
designed in the first step produced a Type 1 error.
The simplest formula to detect green pixels would be Gr_0 = (G > RB_max), where
RB_max is the maximum of the red and blue channels of the pixel. However, this formula is not
selective enough, because it misclassifies as green the following categories of
pixels: (i) almost all white pixels; (ii) pixels with low saturation; (iii) very dark
(almost black) pixels; (iv) blue-toned pixels; and (v) yellow pixels.
Effects (i)–(iv) were corrected by adding additional terms to the basic formula:

Gr_1 = Gr_0 ∧ (S_RGB > 80) ∧ ((R + B < (3/2)·G) ∨ (R + B < 255 ∧ |R − B| < 35 ∧ Y > 80)),

where S_RGB = R + G + B. Effect (v) was mitigated using an additional classifier Ye for
yellow pixels, defined in terms of RG_MAX, the maximum of the green channel G and the red
channel R; RG_MIN, the minimum of these two values; and the colour saturation S. The final
form of the formula for detecting green pixels then becomes

Gr = Gr_0 ∧ (S_RGB > 80) ∧ ((R + B < (3/2)·G) ∨ (R + B < 255 ∧ |R − B| < 35 ∧ Y > 80)) ∧ ¬Ye.
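As a rough illustration, the green-pixel test might be coded as below; the boolean structure is an approximate reconstruction of the garbled printed formula, and the yellow-pixel classifier is passed in as a flag because its definition is not given in this excerpt.

```python
def is_green(r, g, b, y, is_yellow=False):
    """Approximate green-pixel test per the chapter's empirical formula.

    r, g, b: colour channels in 0..255; y: luma in 0..255.
    is_yellow: output of the (unspecified here) yellow-pixel classifier Ye.
    The exact boolean structure is a reconstruction, not verbatim.
    """
    gr0 = g > max(r, b)            # basic "greener than red and blue" test
    s_rgb = r + g + b              # rejects very dark pixels
    extra = (r + b < 1.5 * g) or (r + b < 255 and abs(r - b) < 35 and y > 80)
    return gr0 and s_rgb > 80 and extra and not is_yellow
```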
Finding the proportion of green pixels is far from sufficient to solve the problem.
Some sporting events contain a fairly small number of green pixels, and even a
human observer may make a mistake in the absence of accompanying text or audio
commentary. A condition is therefore applied in which a frame with zero green
pixels belongs to class C4.
The classification of other types of scenes is based on the following observations.
Sporting scenes, and especially those in football, are characterised by the presence of
green pixels with high saturation. Moreover, the green pixels making up the image of
the pitch have low values for the blue colour channel and relatively high brightness
and saturation. The range of variation in the brightness of green pixels is not wide,
and if high values of the brightness gradient are seen within green areas, it is likely
that these frames correspond to images of natural landscapes rather than a soccer
pitch. In sports games, bright and saturated spots usually correspond to the players’
uniforms, while white areas correspond to the markings on the field and the players’
numbers. In close-up shots of soccer players, a small number of green pixels are
usually present in each frame. These observations can be described by the following
empirical relation:
where S_RGB = R + G + B.
The detection of bright and saturated colours is formalised as follows:

Bs = (max(R, G, B) > 150) ∧ (max(R, G, B) − min(R, G, B) > max(R, G, B)/2).
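A direct transcription of the bright-and-saturated test, under the assumption that the printed inequality compares the channel spread against half the maximum:

```python
def is_bright_saturated(r, g, b):
    """Bright-and-saturated pixel test (reconstruction of the Bs formula).

    Fires when the dominant channel is bright and the spread between the
    strongest and weakest channels exceeds half the maximum.
    """
    mx, mn = max(r, g, b), min(r, g, b)
    return mx > 150 and (mx - mn) > mx / 2
```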
The skin tone detector is borrowed from the literature (Gomez et al. 2002). Ten
per-frame features F1–F10 are then computed; those defined below are the proportion
of green pixels, F1; the proportion of skin tone pixels, F2; the average gradient for
green pixels, F4; the proportion of bright and saturated pixels, F5; the mean
saturation of green pixels, F6; the proportion of white pixels, F7; the average
brightness of green pixels, F8; the mean value of the blue colour channel for green
pixels, F9; and the compactness of the brightness histogram for green pixels, F10.
The proportion of green pixels is calculated as follows:

F1 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(Gr(i, j)),

where w is the frame width, h is the frame height, i, j are pixel coordinates, Gr(i, j) is
the result generated by the green pixel detector, and δ is a function that converts a
logical value to a real one, i.e. δ(x) = 1 if x is TRUE and δ(x) = 0 otherwise.
The proportion of skin tone pixels is calculated as follows:

F2 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(Sk(i, j)),

where Sk(i, j) is the result of the skin tone pixel detector. The average gradient for the
green pixels is calculated as follows:

F4 = (Σ_{i=1..w, j=1..h} |DY(i, j)| · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))).

The proportion of bright and saturated pixels is

F5 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(Bs(i, j)),
where Bs(i, j) is the detection result of bright and saturated pixels. The mean
saturation of the green pixels is calculated using the formula

F6 = (Σ_{i=1..w, j=1..h} S(i, j) · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))),

where S(i, j) is the saturation. The proportion of white pixels is estimated as

F7 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(W(i, j)),

where W(i, j) is the detection result of the white pixels. The average brightness of the
green pixels is derived, in turn, from the formula

F8 = (Σ_{i=1..w, j=1..h} Y(i, j) · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))).

The average value of the blue channel for the green pixels is obtained as follows:

F9 = (Σ_{i=1..w, j=1..h} B(i, j) · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))).

Finally, the compactness of the brightness histogram for the green pixels, F10, is
calculated using the following steps:
• A histogram H_YGr of the brightness values Y of the pixels belonging to green
areas is constructed.
• The width D of the histogram is computed as the distance between its rightmost and
leftmost non-zero elements.
• The feature F10 is calculated as the proportion of the histogram that lies no further
than an eighth of its width from the central value P:

F10 = (Σ_{i=P−D/8}^{P+D/8} H_YGr(i)) / (Σ_{i=0}^{255} H_YGr(i)).
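A sketch of how a few of these per-frame features (analogues of F1, F4, F6 and F10) could be computed with NumPy; the choice of the histogram's central value is an assumption, since the text only calls it the "central value".

```python
import numpy as np

def frame_features(green_mask, luma, sat, grad_y):
    """Compute analogues of features F1, F4, F6 and F10.

    green_mask: boolean HxW array from the green-pixel detector
    luma, sat, grad_y: HxW arrays of brightness, saturation, |gradient|
    Names and the histogram-compactness details are illustrative.
    """
    n_green = int(green_mask.sum())
    f1 = n_green / green_mask.size                 # share of green pixels
    if n_green == 0:                               # zero-green frame -> C4
        return f1, 0.0, 0.0, 0.0
    f4 = float(grad_y[green_mask].mean())          # mean gradient on green
    f6 = float(sat[green_mask].mean())             # mean green saturation
    # F10: compactness of the green-pixel brightness histogram
    hist, _ = np.histogram(luma[green_mask], bins=256, range=(0, 256))
    nz = np.nonzero(hist)[0]
    width = int(nz[-1] - nz[0])
    centre = int((nz[-1] + nz[0]) // 2)            # assumed "central value"
    lo, hi = centre - width // 8, centre + width // 8
    f10 = float(hist[max(lo, 0):hi + 1].sum() / hist.sum())
    return f1, f4, f6, f10
```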
Fig. 7.3 Elementary classifiers: (a) 1D threshold transform with higher threshold; (b) 1D threshold
transform with lower threshold; (c) 2D threshold transform; (d) 2D linear classifier; (e) 3D threshold
transform; (f) elliptic 2D classifier

The one-dimensional threshold transforms are defined as

θ_i^HI(x) = TRUE if x > T_i, FALSE otherwise (Fig. 7.3a),

or

θ_i^LO(x) = TRUE if x < T_i, FALSE otherwise (Fig. 7.3b).

The parameters of the two-dimensional threshold transforms (or tabular classifiers)
are presented in Fig. 7.3c. They are described by two threshold vectors,
V_T^1 = (T_0^1, T_1^1, ..., T_N^1), where T_0^1 < T_1^1 < T_2^1 < ... < T_N^1, and
V_T^2 = (T_0^2, T_1^2, ..., T_M^2), where T_0^2 < T_1^2 < T_2^2 < ... < T_M^2. The
numbers of threshold values in the two dimensions need not match, i.e. M ≠ N is allowed.
The output of the tabular classifier has the form of a two-dimensional logical vector of
dimensions M × N: θ^2D(x, y) = |y_11(x, y), y_12(x, y), ..., y_1N(x, y), ..., y_M1(x, y),
y_M2(x, y), ..., y_MN(x, y)|, where each element is a logical quantity:

y_ij(x, y) = (x ≥ T_{i−1}^1) ∧ (x ≤ T_i^1) ∧ (y ≥ T_{j−1}^2) ∧ (y ≤ T_j^2).

The output of the linear classifier (Fig. 7.3d) is calculated as θ_i^L(x, y) =
(K_1·x + K_2·y + B > 0), where K_1, K_2 and B are predefined constants. The parameters
of the three-dimensional tabular classifiers (Fig. 7.3e) are described by three threshold
vectors: V_T^1 = (T_0^1, ..., T_N^1), where T_0^1 < T_1^1 < ... < T_N^1; V_T^2 =
(T_0^2, ..., T_M^2), where T_0^2 < T_1^2 < ... < T_M^2; and V_T^3 = (T_0^3, ..., T_L^3),
where T_0^3 < T_1^3 < ... < T_L^3. The output of the three-dimensional tabular classifier
is a logical vector of dimensions M × N × L: θ^3D(x, y, z) = |y_111(x, y, z),
y_112(x, y, z), ..., y_MNL(x, y, z)|. We assume that all threshold values are real
numbers defined on the interval [0, 1].
The output of the elliptic classifier (Fig. 7.3f) is calculated as

θ_i^e(x, y) = ((x′ − X_C)²/a² + (y′ − Y_C)²/b² < 1),

where x′ = x·cos α − y·sin α, y′ = x·sin α + y·cos α,
and X_C, Y_C, a, b and α are predefined constants.
In the process of research and modelling, it was found that in many cases, it was
sufficient to limit ourselves to the simplest version of the classifier, including
one-dimensional and two-dimensional threshold transformations and a set of linear
classifiers (see Fig. 7.4).
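The elementary classifiers above are simple enough to sketch directly; the function names are illustrative:

```python
import math

def theta_hi(x, t):
    """1D threshold transform: fires when the feature exceeds t."""
    return x > t

def theta_lo(x, t):
    """1D threshold transform: fires when the feature is below t."""
    return x < t

def theta_linear(x, y, k1, k2, b):
    """2D linear classifier: half-plane test K1*x + K2*y + B > 0."""
    return k1 * x + k2 * y + b > 0

def theta_elliptic(x, y, xc, yc, a, b, alpha):
    """Elliptic 2D classifier: inside a rotated ellipse centred at (xc, yc)."""
    xr = x * math.cos(alpha) - y * math.sin(alpha)
    yr = x * math.sin(alpha) + y * math.cos(alpha)
    return (xr - xc) ** 2 / a ** 2 + (yr - yc) ** 2 / b ** 2 < 1

def theta_2d(x, y, v1, v2):
    """2D tabular classifier: cell membership over two threshold grids.

    v1, v2: ascending threshold vectors; returns an (len(v1)-1) x (len(v2)-1)
    boolean table in which at most one interior cell is True.
    """
    return [[v1[i] <= x <= v1[i + 1] and v2[j] <= y <= v2[j + 1]
             for j in range(len(v2) - 1)]
            for i in range(len(v1) - 1)]
```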
We now formulate some basic rules that can be used to draw a final conclusion on
whether the current frame belongs to a sporting event.
The first decision rule G1 is implemented as a tabular classifier based on the
proportion of green pixels F1 and the mean saturation of the green pixels F6. It
outputs a Boolean vector of 16 (four by four) elements, G1 = |y11, y12, ..., y44|,
and is parameterised by six threshold values T1...T6. The purpose of this rule is to
distinguish between long-range, medium-range and close-up shots and to apply
different decision logic to each. The single non-zero element yij (the vector's ℓ0 norm
is exactly one) indicates the decision that the current frame belongs to a certain shot
class. The result of this rule is used for the further selection of classification
thresholds: if the saturation of the green pixels is low, it makes sense to apply
stricter rules than when it is high or average.
The second tabular decision rule G2 = |y11, y12, y21, y22| relies on features F2
and F5 and is parameterised by two threshold values, T7 and T8. This rule also ensures
the selection of frames with a low proportion of green pixels.
Some rules are regarded as "positive" or "negative", indicating whether they raise or
lower the final classification result when contributing to it; these are denoted in the
following by P and N, respectively. Terms used as part of several other rules are
denoted by Q; their role may differ from rule to rule, sometimes contributing to a
positive overall classification result and, in other rules, to a negative one.
This notation simplifies collecting corner cases from a large video database and the
analysis of certain conclusions drawn by the classifier.
The case with a very low number of green pixels is considered separately and is
implemented as a "negative" threshold rule: N1 = F1 > T9. The opposite case, with a
very high proportion of skin colour pixels, is also controlled by a "negative" rule,
N2 = F2 < T10, as are the textural properties of green pixels, N3 = F4 < T11. Sports
broadcasts also contain a certain (although relatively small) proportion of white
pixels, N4 = F7 > T12.
Since the auxiliary bright and saturated pixel detector is based on fixed thresh-
olds, it is clear that the detection result will depend on the overall image brightness.
After applying a simple gamma correction to the image, the proportion of bright
and saturated pixels detected can change significantly. To solve this problem, a
linear classifier is used: Q1 = K1·F3 + K2·F5 + B > 0, where K1, K2 and B are
predefined constants.
The number of bright and saturated pixels is controlled by the rule Q2 = F5 > T13.
The average brightness of green pixels is divided into two ranges, and different
empirical logic is subsequently used for each, for example, Q3 = F8 > T14 and
P1 = F8 > T15. The average value of the blue component of green pixels is controlled
by the rule P2 = F9 < T16, and the degree of compactness of the histogram of the
brightness of green pixels is evaluated as P3 = F10 < T17. The final
classification result R is calculated as
$$
\begin{pmatrix}
\tilde R^{C_1} & \tilde R^{C_2} & \tilde R^{C_3} & \cdots & \tilde R^{C_{N_K}} \\
\tilde G^{C_1} & \tilde G^{C_2} & \tilde G^{C_3} & \cdots & \tilde G^{C_{N_K}} \\
\tilde B^{C_1} & \tilde B^{C_2} & \tilde B^{C_3} & \cdots & \tilde B^{C_{N_K}}
\end{pmatrix},
$$

where

$$
\tilde R^{C_k} = \frac{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)\, R(i,j)}{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)},
$$

$$
\tilde G^{C_k} = \frac{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)\, G(i,j)}{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)},
$$

$$
\tilde B^{C_k} = \frac{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)\, B(i,j)}{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)}.
$$
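The per-class mean-colour computation above maps directly onto array operations; a sketch assuming K is an integer label map of shape h × w:

```python
import numpy as np

# For each class label k in the map K, average the R, G, B channels over the
# pixels assigned to that class; delta(K(i,j) = k) becomes a boolean mask.

def per_class_means(rgb, K, num_classes):
    """rgb: float array (h, w, 3); K: int array (h, w) of class labels.
    Returns a (num_classes, 3) array of mean colours (NaN for empty classes)."""
    means = np.full((num_classes, 3), np.nan)
    for k in range(num_classes):
        mask = (K == k)
        if mask.any():
            means[k] = rgb[mask].mean(axis=0)
    return means

rgb = np.zeros((4, 4, 3))
rgb[:2] = [255, 0, 0]          # top half red
rgb[2:] = [0, 128, 0]          # bottom half green
K = np.zeros((4, 4), dtype=int)
K[2:] = 1
print(per_class_means(rgb, K, 2))  # row 0 -> [255, 0, 0], row 1 -> [0, 128, 0]
```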
7.3 Results
current frame were shown as a bar graph (Fig. 7.7). The classifier was tested on at
least 20 hours of real-time video (Table 7.2). Errors of the first kind were estimated
by averaging the output of the classifier on the video sequences of class C4. In the
classification process, a 95% accuracy threshold was reached. To evaluate errors of
the second kind, a total number N_total = 220 of random frames were selected from
various sequences of classes C1–C3, and the classification accuracy was calculated
as $\frac{N^{+}_{C1} + N^{+}_{C2}}{N_{total}} \cdot 100\%$, where $N^{+}_{C1}$ is the
number of frames from class C1 classified as "sport", and $N^{+}_{C2}$ is the number
of frames from class C2 also classified as "sport".
The classification accuracy was 96.5%. Calculations using the coefficients from
Table 7.1 for frames from class C3 indicated an acceptable level of accuracy
(above 95%); however, these measurements are not of great value, as experts
disagree on how to classify many of the types of images from class C3.
The performance of this algorithm on a computer with a 2 GHz processor and
2 Mb of memory reached 30 fps. The proposed algorithm can be implemented using
only shift and addition operations, which makes it attractive in terms of hardware
implementation.
Despite the high quality of the classification method described above, further work
was required to eliminate the classification errors observed (see Fig. 7.8). Figure 7.8a
shows a fragment from a nature documentary that was misclassified as a long shot of
a soccer game, and Fig. 7.8b shows a caterpillar that was confused with a close-up of
a player. In a future version of the algorithm, such errors will be avoided through the
use of an advanced skin detector, the addition of a white marking detector for long
shots and the introduction of a cross-domain feature that combines texture and
colour for regions of bright and saturated colours. The classification error in
Fig. 7.8c is caused by an overly broad interpretation of the colour green. To solve
this problem, colour detectors could be applied to the global illumination of the
scene. To correct the error shown in Fig. 7.8d, a silhouette classifier could be
developed. However, this would be quite a complicated solution, with performance
unacceptable for real-time applications.
Many of these problems can be solved, one way or another, using modern
methods of deep learning with neural networks, and a brief review of these is
given in the next section. It must be borne in mind that although this approach
does not require the manual construction of features to describe the image and video,
this is achieved in practice at the cost of high computational complexity and poor
algorithmic interpretability (Kazantsev et al. 2019).
Fig. 7.8 Detection errors: (a) nature image with low texture and a high amount of green
misclassified as soccer; (b) nature image with a high amount of saturated green and bright
colours misclassified as soccer; (c) underwater image with a high amount of green
misclassified as soccer; (d) golf misclassified as soccer
Higher-order spectra features (HOSF) are used to extract both the phase and the amplitude of the given input,
allowing the subsequent SVM classifier to use a rich feature vector for video
classification (Fig. 7.9).
Another work by Hamed et al. (2013) that leveraged classical machine learning
approaches tackled the task of video genre classification via several steps: initially,
shot detection was used to extract the key frames from the input videos, and the
feature vector was then extracted from the video shot using discrete cosine transform
(DCT) coefficients processed by PCA. The extracted features were subsequently
scaled to values of between zero and one, and, finally, weighted kernel logistic
regression (WKLR) was applied to the data prepared for classification with the aim
Fig. 7.9 Workflow of a sports video classification system. (Reproduced with permission from
Mohanan 2017)
Fig. 7.10 Overview of the sport genre classification method via sensor fusion. (Reproduced with
permission from Cricri et al. 2013)
of achieving a high level of accuracy, making WKLR an effective method for video
classification.
The method suggested by Cricri et al. (2013) utilises multi-sensor fusion for sport
genre classification in mobile videos. An overview of the method is shown in
Fig. 7.10. Multimodal data captured by a mobile device (video, audio and data
7 Real-Time Detection of Sports Broadcasts Using Video Content Analysis 211
Fig. 7.11 The processing pipeline of the two-stream CNN method. (Reproduced with permission
from Ye et al. 2015)
Fig. 7.12 Overview of a hybrid deep learning framework for video classification. (Reproduced
with permission from Wu et al. 2015)
features, while the fusion between the spatial and motion features in a regularised
feature fusion network was used to explore feature correlations.
Karpathy et al. (2014) built a large-scale video classification framework by fusing
information over the temporal dimension using only a CNN, without recurrent
networks like LSTM. They explored several approaches to the CNN-based fusion
of temporal information (see Fig. 7.13).
Another idea of theirs, motivated by the human visual system, was a
multiresolution CNN that was split into fovea and context streams, as shown in
Fig. 7.14. Input frames were fed into two separate processing streams: a context
stream, which modelled low-resolution images, and a fovea stream, which processed
a high-resolution centre crop.
This design takes advantage of the camera bias present in many online videos,
since the object of interest often occupies the central region.
Fig. 7.13 Approaches for fusing information over the temporal dimension. (Reproduced with
permission from Karpathy et al. 2014)
Fig. 7.14 Multiresolution CNN architecture split into fovea and context streams. (Reproduced
with permission from Karpathy et al. 2014)
A compound memory network (CMN) was proposed by Zhu and Yang (2018) for
a few-shot video classification task. Their CMN structure followed the key-value
memory network paradigm, in which each key memory involves multiple constitu-
ent keys. These constituent keys work collaboratively in the training process,
allowing the CMN to obtain an optimal video representation in a larger space.
They also introduced a multi-saliency embedding algorithm which encoded a
variable-length video sequence into a fixed-size matrix representation by discover-
ing multiple saliencies of interest. An overview of their method is given in Fig. 7.15.
Finally, there are several methods which combine the advantages of both classical
machine learning and deep learning. One such method (Zha et al. 2015) used a CNN
to extract features from video frame patches, which were subsequently subjected to
spatio-temporal pooling and normalisation to produce video-level CNN features.
Fig. 7.15 Architecture of compound memory network. (Reproduced with permission from Zhu
and Yang 2018)
Fig. 7.16 Video classification pipeline with video-level CNN features. (Reproduced with permis-
sion from Zha et al. 2015)
Fig. 7.17 Learnable pooling with context gating for video classification. (Reproduced with
permission from Miech et al. 2017)
SVM was used to classify video-level CNN features. An overview of this video
classification pipeline is shown in Fig. 7.16.
Another method presented by Miech et al. (2017) and depicted in Fig. 7.17
employed CNNs as feature extractors for both video and audio data and aggregated
the extracted visual and audio features over the temporal dimension using learnable
pooling (e.g. NetVLAD or NetFV). The outputs were subsequently fused using fully
connected and context gating layers.
References
Bai, L., Lao, S.Y., Liao, H.X., Chen, J.Y.: Audio classification and segmentation for sports video
structure extraction using support vector machine. In: International Conference on Machine
Learning and Cybernetics, pp. 3303–3307 (2006)
Brezeale, D., Cook, D.J.: Using closed captions and visual features to classify movies by genre. In:
Poster Session of the 7th International Workshop on Multimedia Data Mining (MDM/KDD)
(2006)
Brezeale, D., Cook, D.J.: Automatic video classification: a survey of the literature. IEEE Trans.
Syst. Man Cybern. Part C Appl. Rev. 38(3), 416–430 (2008)
Cho, S., Kang, J.-S.: Histogram shape-based scene change detection algorithm. IEEE Access. 7,
27662–27667 (2019). https://doi.org/10.1109/ACCESS.2019.2898889
Choroś, K., Pawlaczyk, P.: Content-based scene detection and analysis method for automatic
classification of TV sports news. Rough sets and current trends in computing. Lect. Notes
Comput. Sci. 6086, 120–129 (2010)
Cricri, F., Roininen, M., Mate, S., Leppänen, J., Curcio, I.D., Gabbouj, M.: Multi-sensor fusion for
sport genre classification of user generated mobile videos. In: Proceedings of 2013 IEEE
International Conference on Multimedia and Expo (ICME), pp. 1–6 (2013)
Dinh, P.Q., Dorai, C., Venkatesh, S.: Video genre categorization using audio wavelet coefficients.
In: Proceedings of the 5th Asian Conference on Computer Vision (2002)
Gade, R., Moeslund, T.: Sports type classification using signature heatmaps. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 999–1004
(2013)
Gillespie, W.J., Nguyen, D.T.: Classification of video shots using activity power flow. In: Pro-
ceedings of the First IEEE Consumer Communications and Networking Conference,
pp. 336–340 (2004)
Godbole, S.: Exploiting confusion matrices for automatic generation of topic hierarchies and
scaling up multi-way classifiers. Indian Institute of Technology – Bombay. Annual Progress
Report (2002). http://www.godbole.net/shantanu/pubs/autoconfmat.pdf. Accessed on 04 Oct
2020
Gomez, G., Sanchez, M., Sucar, L.E.: On selecting an appropriate color space for skin detection. In:
Lecture Notes in Artificial Intelligence, vol. 2313, pp. 70–79. Springer-Verlag (2002)
Hamed, A.A., Li, R., Xiaoming, Z., Xu, C.: Video genre classification using weighted kernel
logistic regression. Adv. Multimedia. 2013, 1 (2013)
Huang, H.Y., Shih, W.S., Hsu, W.H.: A film classifier based on low-level visual
features. J. Multimed. 3(3) (2008)
Ionescu, B.E., Rasche, C., Vertan, C., Lambert, P.: A contour-color-action approach to automatic
classification of several common video genres. Adaptive multimedia retrieval. Context, explo-
ration, and fusion. In: Lecture Notes in Computer Science, vol. 6817, pp. 74–88 (2012)
Jaser, E., Kittler, J., Christmas, W.: Hierarchical decision-making scheme for sports video
categorisation with temporal post-processing. In: IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, vol. 2, pp. 908–913 (2004)
Jiang, X., Sun, T., Chen, B.: A novel video content classification algorithm based on combined
visual features model. In: Proceedings of the 2nd International Congress on Image and Signal
Processing, pp. 1–6 (2009)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video
classification with convolutional neural networks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Kazantsev, R., Zvezdakov, S., Vatolin, D.: Application of physical video features in classification
problem. Int. J. Open Inf. Technol. 7(5), 33–38 (2019)
Kittler, J., Messer, K., Christmas, W., Levienaise-Obada, B., Kourbaroulis, D.: Generation of
semantic cues for sports video annotation. In: Proceedings of International Conference on
Image Processing, vol. 3, pp. 26–29 (2001)
Koskela, M., Sjöberg, M., Laaksonen, J.: Improving automatic video retrieval with semantic
concept detection. Lect. Notes Comput. Sci. 5575, 480–489 (2009)
Li, L.-J., Su, H., Fei-Fei, L., Xing, E.P.: Object Bank: a high-level image representation for scene
classification and semantic feature sparsification. In: Proceedings of the Neural Information
Processing Systems (NIPS) (2010)
Liu, Y., Kender, J.R.: Video frame categorization using sort-merge feature selection. In: Pro-
ceedings of the Workshop on Motion and Video Computing, pp. 72–77 (2002)
Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychology
and art theory. In: Proceedings of the International Conference on Multimedia, pp. 83–92 (2010)
Maheswari, S.U., Ramakrishnan, R.: Sports video classification using multi scale framework and
nearest neighbour classifier. Indian J. Sci. Technol. 8(6), 529 (2015)
Mel, B.W.: SEEMORE: combining color, shape, and texture histogramming in a neurally inspired
approach to visual object recognition. Neural Comput. 9(4), 777–804 (1997). http://www.ncbi.
nlm.nih.gov/pubmed/9161022. Accessed on 04 Oct 2020
Miech, A., Laptev, I., Sivic, J.: Learnable pooling with context gating for video classification. arXiv
preprint arXiv:1706.06905 (2017)
Mohanan, S.: Sports video categorization by multiclass SVM using higher order spectra features.
Int. J. Adv. Signal Image Sci. 3(2), 27–33 (2017)
Pass, G., Zabih, R., Miller, J.: Comparing images using color coherence vectors. In: Proceedings of
the 4th ACM International Conference on Multimedia (1996)
Roach, M., Mason, J.: Classification of video genre using audio. In: Proceedings of Eurospeech 2001, pp. 2693–2696 (2001)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algo-
rithms for Printing. Springer Nature Singapore AG, Singapore (2018)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos.
In: Proceedings of 28th Conference on Neural Information Processing Systems, pp. 568–576
(2014)
Subashini, K., Palanivel, S., Ramalingam, V.: Audio-video based classification using SVM and
AANN. Int. J. Comput. Appl. 44(6), 33–39 (2012)
Takagi, S., Hattori, S.M., Yokoyama, K., Kodate, A., Tominaga, H.: Sports video categorizing
method using camera motion parameters. In: Proceedings of International Conference on
Multimedia and Expo, vol. II, pp. 461–464 (2003a)
Takagi, S., Hattori, S.M., Yokoyama, K., Kodate, A., Tominaga, H.: Statistical analyzing method of
camera motion parameters for categorizing sports video. In: Proceedings of the International
Conference on Visual Information Engineering. VIE 2003, pp. 222–225 (2003b)
Truong, B.T., Venkatesh, S., Dorai, C.: Automatic genre identification for content-based video
categorization. In: Proceedings of 15th International Conference on Pattern Recognition, vol.
4, p. 4230 (2000)
Vaswani, N., Chellappa, R.: Principal components null space analysis for image and video
classification. IEEE Trans. Image Process. 15(7), 1816–1830 (2006)
Wei, G., Agnihotri, L., Dimitrova, N.: Tv program classification based on face and text processing.
In: IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1345–1348 (2000)
Wu, Z., Wang, X., Jiang, Y.G., Ye, H., Xue, X.: Modeling spatial-temporal clues in a hybrid deep
learning framework for video classification. In: Proceedings of the 23rd ACM International
Conference on Multimedia, pp. 461–470 (2015)
Ye, H., Wu, Z., Zhao, R.W., Wang, X., Jiang, Y.G., Xue, X.: Evaluating two-stream CNN for video
classification. In: Proceedings of the 5th ACM on International Conference on Multimedia
Retrieval, pp. 435–442 (2015)
Yuan, Y., Wan, C.: The application of edge feature in automatic sports genre classification. In:
IEEE Conference on Cybernetics and Intelligent Systems, vol. 2, pp. 1133–1136 (2004)
Zha, S., Luisier, F., Andrews, W., Srivastava, N., Salakhutdinov, R.: Exploiting image-trained CNN
architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144 (2015)
Zhu, L., Yang, Y.: Compound memory networks for few-shot video classification. In: Proceedings
of the European Conference on Computer Vision (ECCV), pp. 751–766 (2018)
Chapter 8
Natural Effect Generation
and Reproduction
8.1 Introduction
The creation and sharing of multimedia presentations and slideshows has become a
pervasive activity. The development of tools for the automated creation of exciting,
entertaining, and eye-catching photo transitions and animation effects, accompanied
by background music and/or voice comments, has been a trend over the last decade (Chen
example, grass swaying in the wind or raindrop ripples in water.
The development of fast and realistic animation effects is a complex problem.
Special interactive authoring tools, such as Adobe After Effects and VideoStudio,
are used to create animation from an image. In these authoring tools, the effects are
selected and adjusted manually, which may require considerable effort from the user.
The resulting animation is saved as a file, thus requiring a significant amount of
storage space. During playback, such movies are always the same, thus leading to a
feeling of repetitiveness for the viewer.
For multimedia presentations and slideshows, it is preferable to generate ani-
mated effects on the fly with a high frame rate. Very fast and efficient algorithms are
necessary in order to provide the required performance; this is extremely difficult for
low-powered embedded platforms. We have been working on the development and
implementation of automatically generated animated effects for full HD images on
ARM-based CPUs without the use of GPU capabilities. Under such limited conditions,
the creation of realistic and impressive animated effects – especially for users
K. A. Kryzhanovskiy (*)
Align Technology Research and Development, Inc., USA, Moscow Branch, Russia
e-mail: kkryzhanovsky@gmail.com
I. V. Safonov
National Research Nuclear University MEPhI, Moscow, Russia
e-mail: ilia.safonov@gmail.com
Fig. 8.1 Detected beats affect the size of the flashing light
who are experienced at playing computer games on powerful PCs and consoles – is a
challenging task.
We have developed several algorithms for the generation of content-based ani-
mation effects from photos, such as flashing light, soap bubbles, sunlight spot,
magnifier effect, rainbow effect, portrait morphing transition effect, snow, rain,
fog, etc. For these effects, we propose a new approach for automatic audio-aware
animation generation. In this chapter, we demonstrate the adaptation of effect
parameters according to background audio for three effects: flashing light (see the
example in Fig. 8.1), soap bubbles, and sunlight spot. Obviously, the concept can be
extended to other animated effects.
There are several content-adaptive techniques for the generation of animation from
still photos. Sakaino (2005) describes an algorithm for the generation of plausible
motion animation from textures. Safonov and Bucha (2010) describe the animated
thumbnail, which is a looped movie demonstrating salient regions of the scene in
sequence. Animation simulates camera tracking in, tracking out, and panning
between detected visual attention zones and the whole scene. More information
can be found in Chap. 14 (‘An Animated Graphical Abstract for an Image’).
Music plays an important role in multimedia presentations. There are some
methods aimed at aesthetical audiovisual composition in slideshows. ‘Tiling
slideshow’ (Chen et al. 2006) describes two methods for the analysis of background
audio to select timing for photo and frame switching. The first method is beat
detection. The second is energy dynamics calculated using root-mean-square values
of adjacent audio frames.
There are other concepts for the automatic combination of audio and visual
information in multimedia presentations. Dunker et al. (2011) suggest an approach
In general, the procedure to create an animation effect from a single still image
consists of the following major stages: effect initialization and effect execution.
During effect initialization, certain operations that have to be made only once for the
entire effect span are performed. Such operations may include source image format
conversion, pre-processing, analysis, segmentation, and creation of some visual
objects and elements displayed during effect duration, etc. At the execution stage
for each subsequent frame, a background audio analysis is performed, visual objects
and their parameters are modified depending on the time elapsed, audio features are
calculated, and the entire modified scene is visualized. The animation effect
processing flow chart is displayed in Fig. 8.2.
The flashing light effect displays several flashing and rotating coloured light stars
over the bright spots on the image. In this effect, the size, position, and colour of
flashing light stars are defined by the detected position, size, and colour of the bright
areas on the source still image. An algorithm performs the following steps to detect
small bright areas in the image:
• Calculation of the histogram of brightness of the source image.
• Calculation of the segmentation threshold as the grey level corresponding to a
specified fraction of the brightest pixels of the image using the brightness
histogram.
• Segmentation of the source image by thresholding; during thresholding, a
morphological majority filter is used to suppress isolated bright pixels so that
only localized groups of bright pixels are retained.
• Calculation of the following features for each connected region of interest (ROI):
(a) Mean colour Cmean.
(b) Centroid (xc,yc).
(c) Image fraction F (fraction of the image area, occupied by ROI).
(d) Roundness (the ratio of the diameter of the circle with the same area as the
ROI to the maximum dimension of the ROI):

$$K_r = \frac{2\sqrt{S/\pi}}{\max(W, H)},$$

where S is the area of the ROI and W, H are the ROI bounding box dimensions.
(e) Quality (the integral parameter characterizing the possibility of the ROI being
a light source, calculated as follows):
where $Y_{max}$ is the maximum brightness of the ROI, $Y_{mean}$ is the mean brightness
of the ROI, and $K_F$ is the coefficient of ROI size:

$$K_F = \begin{cases} F/F_0, & \text{if } F \le F_0 \\ F_0/F, & \text{if } F > F_0. \end{cases}$$
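The threshold-selection and roundness steps of the detector can be sketched as follows; connected-component labelling between the two steps is omitted, and the brightest-pixel fraction is an illustrative assumption:

```python
import numpy as np

# Sketch of two steps from the bright-area detector: (1) picking the
# segmentation threshold as the grey level above which roughly a specified
# fraction of the brightest pixels lies, using the brightness histogram, and
# (2) the roundness feature K_r = 2*sqrt(S/pi) / max(W, H) of a region.

def brightness_threshold(gray, fraction=0.02):
    """Grey level such that about `fraction` of the pixels are at or above it."""
    hist = np.bincount(gray.ravel(), minlength=256)
    target = fraction * gray.size
    cum = 0
    for level in range(255, -1, -1):
        cum += hist[level]
        if cum >= target:
            return level
    return 0

def roundness(area, width, height):
    # Diameter of the equal-area circle over the ROI's maximum dimension;
    # close to 1 for compact round blobs, small for elongated regions.
    return 2.0 * np.sqrt(area / np.pi) / max(width, height)

gray = np.zeros((100, 100), dtype=np.uint8)
gray[40:50, 40:50] = 255   # 100 brightest pixels
gray[60:70, 60:70] = 200   # next 100
t = brightness_threshold(gray, fraction=0.015)   # target: 150 pixels
print(t)                                  # 200: both bright squares kept
print(round(roundness(100, 10, 10), 4))   # 10x10 square: ~1.1284
```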
This effect displays soap bubbles moving over the image. Each bubble is
composed of a colour map, an alpha map, and highlight maps. A set of highlight
maps with different highlight orientations is calculated in advance for each bubble.
Highlight position depends on the lighting direction in corresponding areas of the
image. Lighting gradient is calculated using a downscaled brightness channel of the
image.
Fig. 8.4 Light star-shape templates: (a) halo template, (b) ray template
The colour map is modulated by the highlight map, selected from the set of
highlight maps in accordance with the average light direction around the bubble, and
then is combined with the source image using alpha blending with a bubble alpha
map. Figure 8.5 illustrates the procedure of soap bubble generation from alpha and
colour maps.
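A minimal sketch of the compositing step described above; treating highlight modulation as element-wise multiplication is an assumption, as are the synthetic map values:

```python
import numpy as np

# The bubble's colour map is modulated by a highlight map and then
# alpha-blended onto the source image using the bubble's alpha map.

def composite_bubble(image, colour, highlight, alpha):
    """image, colour, highlight: float arrays (h, w, 3) in [0, 1];
    alpha: (h, w, 1) in [0, 1]. Returns the blended image."""
    lit = np.clip(colour * highlight, 0.0, 1.0)   # modulate by highlight
    return alpha * lit + (1.0 - alpha) * image    # alpha blending

image = np.full((2, 2, 3), 0.5)
colour = np.full((2, 2, 3), 0.8)
highlight = np.ones((2, 2, 3))
alpha = np.full((2, 2, 1), 0.25)
out = composite_bubble(image, colour, highlight, alpha)
print(out[0, 0])  # 0.25*0.8 + 0.75*0.5 = [0.575, 0.575, 0.575]
```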
During animation, soap bubbles move smoothly over the image from bottom to
top or vice versa while oscillating slightly in a horizontal direction to give the
impression of real soap bubbles floating in the air. Figure 8.6 demonstrates a
frame of animation with soap bubbles.
Fig. 8.5 Soap bubble generation from alpha and colour maps
This effect displays a bright spot moving over the image. Prior to starting the effect,
the image is dimmed according to its initial average brightness. Figure 8.7a shows an
image with the sunlight spot effect. The spotlight trajectory and size are defined by
the attention zones of the photo.
Similar to the authors of many existing publications, we consider human faces
and salient regions to be attention zones. In addition, we regard text inscriptions as
attention zones because these can be the name of a hotel or town in the background
of the photo. In the case of a newspaper, such text can include headlines.
Despite great achievements by deep neural networks in the area of multi-view
face detection (Zhang and Zhang 2014), the classical Viola–Jones face detector
(Viola and Jones 2001) is widely used in embedded systems due to its low power
consumption. The number of false positives can be decreased with additional skin
tone segmentation and processing of the downsampled image (Egorova et al. 2009).
So far, a universal model of human vision does not exist, but the pre-attentive
vision model based on feature integration theory is well known. In this case, because
the observer is at the attentive stage while viewing the photo, a model of human
pre-attentive vision is not strictly required. However, existing approaches for the
detection of regions of interest are based on saliency maps, and these often provide
reasonable outcomes, whereas the use of the attentive vision model requires too
much prior information about the scene, and it is not generally applicable. Classical
saliency map-building algorithms (Itti et al. 1998) have a very high computational
Fig. 8.7 Demonstration of the sunlight spot effect: (a) particular frame, (b) detected attention zones
complexity. That is why researchers have devoted a lot of effort to developing fast
saliency map creation techniques. Cheng et al. (2011) compare several algorithms
for salient region detection. We implemented the histogram-based contrast method
into our embedded platform.
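The core idea of the histogram-based contrast (HC) method can be illustrated in a few lines; this toy version works on a quantized single-channel image, whereas the published method operates in Lab colour space with additional smoothing:

```python
import numpy as np

# HC idea (Cheng et al. 2011): the saliency of a colour is the sum of its
# distances to all other colours, weighted by how often those colours occur.
# Rare, distinct colours come out most salient.

def hc_saliency(quantized):
    """quantized: int array (h, w) with a small number of levels."""
    levels, counts = np.unique(quantized, return_counts=True)
    freq = counts / quantized.size
    # Saliency per level: frequency-weighted distance to every other level.
    sal = np.array([np.sum(freq * np.abs(levels - lv)) for lv in levels])
    sal = sal / sal.max() if sal.max() > 0 else sal
    lut = dict(zip(levels.tolist(), sal.tolist()))
    return np.vectorize(lut.get)(quantized)

img = np.zeros((10, 10), dtype=int)
img[4:6, 4:6] = 8            # small distinct patch -> most salient
sal_map = hc_saliency(img)
print(sal_map[5, 5] > sal_map[0, 0])  # True: the rare colour is more salient
```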
While developing an algorithm for the detection of areas with text, we took into
account the fact that text components are arranged in an orderly way and have
similar texture features. Firstly, we applied a LoG edge detector. Then, we filtered
the resulting connected components based on an analysis of the texture features. We
used the following features (Safonov et al. 2019):
1. Mean brightness of the block:

$$\bar B_i = \frac{\sum_{r=1}^{N}\sum_{c=1}^{N} B_i(r,c)}{N^2}.$$

2. Mean absolute difference between the block's mean brightness and that of its
four neighbouring blocks:

$$dB_i = \frac{\sum_{k=1}^{4} |\bar B_i - \bar B_k|}{4}.$$

3. Mean magnitude of the horizontal and vertical brightness differences within
the block:

$$d_{x,y}B_i = \frac{\sum_{r=1}^{N}\sum_{c=1}^{N-1} dB_{ix}(r,c) + \sum_{r=1}^{N-1}\sum_{c=1}^{N} dB_{iy}(r,c)}{2N(N-1)}.$$

4. Block homogeneity:

$$H = \sum_{i,j} \frac{N_d(i,j)}{1+|i-j|}.$$

5. The percentage of pixels whose gradient magnitude exceeds a threshold T:

$$P_g = \sum_{\forall(r,c)\in B_i} \{1 \mid \nabla B_i(r,c) > T\}\, /\, N^2,$$

where ∇Bi(r, c) is calculated as the square root of the sum of the squares of the
horizontal and vertical derivatives.

6. The percentage of pixel value changes after the morphological operation of
opening $B^o_i$ on a binary image $B^b_i$, obtained by binarization with a threshold
of 128:

$$P_m = \sum_{\forall(r,c)\in B_i} \{1 \mid B^o_i(r,c) \ne B^b_i(r,c)\}\, /\, N^2.$$
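Two of these block features translate directly into code; a sketch in which the 3×3 structuring element and the synthetic block are assumptions:

```python
import numpy as np

# P_g: fraction of pixels with gradient magnitude above a threshold T.
# P_m: fraction of pixels changed by a 3x3 morphological opening (erosion
# followed by dilation) of the binarized block.

def grad_fraction(block, T):
    gy, gx = np.gradient(block.astype(float))
    return np.mean(np.sqrt(gx**2 + gy**2) > T)

def erode(b):
    p = np.pad(b, 1, constant_values=1)
    return np.min([p[i:i + b.shape[0], j:j + b.shape[1]]
                   for i in range(3) for j in range(3)], axis=0)

def dilate(b):
    p = np.pad(b, 1, constant_values=0)
    return np.max([p[i:i + b.shape[0], j:j + b.shape[1]]
                   for i in range(3) for j in range(3)], axis=0)

def opening_change_fraction(block, thresh=128):
    binary = (block >= thresh).astype(np.uint8)
    opened = dilate(erode(binary))
    return np.mean(opened != binary)

block = np.zeros((8, 8), dtype=np.uint8)
block[2, 2] = 255          # isolated bright pixel: removed by opening
block[4:7, 4:7] = 255      # 3x3 blob: survives opening
print(opening_change_fraction(block))  # only the isolated pixel changes: 1/64
```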
order and similar colours and texture features, in groups. After that, we classified the
resulting groups. We formed final text zones on the basis of groups classified as text.
Figure 8.7b shows the detected attention zones. A red rectangle depicts a detected
face; green rectangles denote text regions; yellow is a bounding box of the most
salient area according to the HC (histogram-based contrast) method.
where f is the frequency, and c is the origin on the frequency axis. The angle defines
the hue of the current frequency.
Depending on the value of the current note, we defined the brightness of the
selected hue and drew it on a colour circle. We used three different approaches to
display colour on the colour wheel: painting sectors, painting along the radius, and
using different geometric primitives inscribed into the circle.
In the soap bubble effect, depending on the generated colour circle, we determined
the colour of the bubble texture. Figure 8.11 demonstrates an example of soap
bubbles with colour distribution depending on music. In the sunlight spot effect, the
generated colour circle determines the distribution of colours for the highlighted spot
(Fig. 8.12).
In the second approach, we detected beats or rhythm of the music. There are
numerous techniques for beat detection in the time and frequency domains (Scheirer
1998; Goto 2004; Dixon 2007; McKinney et al. 2007; Kotz et al. 2018; Lartillot and
Grandjean 2019). We faced constraints due to real-time performance limitations, and
we were dissatisfied with the outcomes for some music genres. Finally, we assumed
that a beat is present if there are significant changes of values in several
frequency bands at once. This method meets the performance requirements while finding
beats with acceptable quality. Figure 8.1 illustrates how detected beats affect the size of the flashing light. If
a beat was detected, we instantly maximized the size and brightness of the lights, and
they then gradually returned to their normal state until the next beat happened. Also,
it was possible to change the flashing lights when the beat occurred (by turning on
and off light sources). In the soap bubble effect, we maximized saturation of the soap
bubble colour when the beat occurred. We also changed the direction of moving
soap bubbles as the beat happened. In the sunlight spot effect, if a beat was detected,
we maximized the brightness and size of the spot, and these then gradually returned
to their normal states.
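The band-change heuristic can be sketched as follows; the band edges, energy-ratio threshold, and minimum band count are illustrative assumptions, not the authors' tuned values:

```python
import numpy as np

# A beat is declared when the energy in several frequency bands changes
# significantly at once. Band energies per frame come from an FFT magnitude
# spectrum of the current audio fragment.

def band_energies(frame, sample_rate, edges=(0, 200, 800, 3200, 8000)):
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def is_beat(current, previous, ratio=2.0, min_bands=2):
    # Significant change = the band's energy exceeds the previous frame's
    # energy by `ratio`; require this in at least `min_bands` bands.
    changed = current > ratio * (previous + 1e-12)
    return int(changed.sum()) >= min_bands

sr = 8000
t = np.arange(1024) / sr
quiet = 0.01 * np.sin(2 * np.pi * 100 * t)
loud = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
prev = band_energies(quiet, sr)
cur = band_energies(loud, sr)
print(is_beat(cur, prev))  # True: the low and mid bands both jump
```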
In the third approach, we analysed the presence of low, middle, and high
frequencies in audio signals. This principle is used in colour music installations. In
the soap bubble effect, we assigned a frequency range for each soap bubble and
defined its saturation according to the value of the corresponding frequency range. In
the flashing light effect, we assigned each light star to its own frequency range and
defined its size and brightness depending on the value of the frequency range.
Figure 8.13 shows how the presence of low, middle, and high frequencies affects the
flashing lights.
Another approach is not to divide the spectrum into low, middle, and high
frequencies but rather to assign these frequencies to different tones inside octaves.
Therefore, we used an equalizer containing a large number of bands and where each
octave had enough corresponding bands. We accumulated the values of each
equalizer band to a buffer cell where the corresponding cell number was calculated
using the following statement:
$$num = \frac{\log_2(f/c) \cdot 360 \bmod 360}{360/\mathrm{length}} + 1,$$
where f is the frequency, c is the origin of the frequency axis, and length is the
number of cells.
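The cell-index formula transcribes directly; truncation toward zero in the quantization step is an assumption:

```python
import math

# Map a frequency f onto a position within an octave (log2(f/c) * 360 mod 360,
# an angle on the colour circle), then quantize that angle into `length` cells.

def cell_number(f, c, length):
    angle = (math.log2(f / c) * 360.0) % 360.0
    return int(angle / (360.0 / length)) + 1

# With c = 440 Hz as the origin, 440 Hz and 880 Hz (one octave apart) land in
# the same cell, while 660 Hz lands roughly half way around the circle.
print(cell_number(440.0, 440.0, 12))  # 1
print(cell_number(880.0, 440.0, 12))  # 1
print(cell_number(660.0, 440.0, 12))  # 8
```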
Each cell controls the behaviour of selected objects. In the soap bubble effect, we
assigned each soap bubble to a corresponding cell and defined its saturation
depending on the value of the cell. In the flashing light effect, we assigned each light
to a corresponding cell and defined its size and brightness depending on the value of
the cell.
participants of the survey said that they did not like the photos or background music
used for the demonstration. It is also worth noting that eight of the respondents were
women and, on average, they rated the effects much higher than men.
Therefore, we can claim that the outcomes of the subjective evaluation demonstrate
the observers' satisfaction with this new type of audiovisual presentation: audio-aware
animation behaves uniquely each time it is played back and does not repeat itself
during playback, creating vivid and lively impressions for the observer. Many
observers were excited by the effects and would like to see such features in their
multimedia devices.
Several other audio-aware animation effects can be proposed. Figure 8.15 shows
screenshots of our audio-aware effect prototypes. In the rainbow effect (Fig. 8.15a), the
colour distribution of the rainbow changes according to the background audio
spectra. The movement direction, speed, and colour distribution of confetti and
serpentines are adjusted according to music rhythm in the confetti effect
(Fig. 8.15b). Magnifier glass movement speed and magnification are affected by
background music tempo in the magnifier effect (Fig. 8.15c). In the lightning effect
(Fig. 8.15d), lightning bolt strikes are matched to accents in background audio.
Obviously, other approaches to adapting the behaviour of animation to background
audio are also possible. For example, the left and right audio channels can be analysed
separately, with different behaviours applied to the left and right sides of the screen,
respectively. Other effects amenable to music adaptation can be created.
Fig. 8.15 Audio-aware animation effect prototypes: (a) rainbow effect; (b) confetti effect; (c)
magnifier effect; (d) lightning effect
Chapter 9
Image Classification as a Service
Mikhail Y. Sirotenko
M. Y. Sirotenko (*)
111 8th Ave, New York, NY 10011, USA
e-mail: mihail.sirotenko@gmail.com
9.1 Introduction
Image classification systems could be divided into different kinds depending on the
application and implementation:
1. Binary classifiers separate images into one of two mutually exclusive classes,
while multi-class classifiers can predict one of many classes for a given image.
2. Multi-label classifiers are classifiers that can predict many labels per image
(or none).
3. Based on the chosen classes, there are hierarchical or flat classification systems.
Hierarchical ones assume a certain taxonomy or ontology imposed on classes.
4. Classification systems could be fine-grained or coarse-grained based on the
granularity of chosen classes.
5. If classification is localised, then it means the class is applied to a certain region of
the image. If it is not localised, then the classification happens for the entire
image.
6. Specialist classifiers usually focus on a relatively narrow classification task
(e.g. classifying dog breeds), while generalist classifiers can work with any images.
Figure 9.1 shows an example of the hierarchical fine-grained localised multi-class
classification system (Jia et al. 2020).
Given that the type of image classification system discussed here is a web-based
service, we can deduce a set of constraints and assumptions.
The first assumption is that we are able to run the backend on an arbitrary server
as opposed to running image classification on a specific hardware (e.g. smartphone).
This means that we are free to choose any type of model architecture and any type of
hardware to run the model without very strict limitations on the memory, computa-
tional resources or battery life. On the other hand, transferring data to the web service
may become a bottleneck, especially for users with small bandwidth. This means
that the classification system should work well with compressed images having
relatively low resolution.
Another important consideration is concurrency. Since we are building a web
server system, it should be designed to support concurrent user requests.
The very first question to ask before even starting to design an image classification
system is how this system may end up being used and how to make sure it will do no
harm to people. Recently, we have seen a growing number of examples of unethical
or discriminatory uses of AI, including the use of computer vision for military
purposes, mass surveillance, impersonation, spying and privacy violations.
One of the recent examples is this: certain authorities use facial recognition to
track and control a minority population (Mozur 2019). This is considered by many as
the first known example of intentionally using artificial intelligence for racial
profiling. Some local governments are banning facial recognition in public places
since it is a serious threat to privacy (y Arcas et al. 2017).
While military or authoritarian governments using face recognition technology is
an instance of unethical use of otherwise useful technology, there are other cases
when AI systems are flawed by design and represent pseudoscience that could hurt
some groups of people if attempted to be used in practice. One example of such
pseudoscience is the work titled “Automated Inference on Criminality Using Face
Images” published in November 2016. The authors claimed that they trained a neural
network to classify people’s faces as criminal or non-criminal. The practice of using
people's outer appearance to infer inner character is called physiognomy, a
pseudoscience that could lead to dangerous consequences if put into practice and
represents an instance of broader scientific racism.
Another kind of issue that may lead to unethical use of the image classification
system is algorithmic bias. Algorithmic bias is defined as unjust, unfair or prejudicial
treatment of people related to race, income, sexual orientation, religion, gender and
other characteristics historically associated with discrimination and marginalisation,
when and where it is manifested in algorithmic systems or algorithmically aided
240 M. Y. Sirotenko
Fig. 9.2 The many ways in which human bias can be introduced into a machine learning system
decision-making. Algorithmic biases could amplify human biases while building the
system. Most of the time, the introduction of algorithmic biases happens without
intention. Figure 9.2 shows that human biases could be introduced into machine
learning systems at every stage of the process and even lead to positive feedback
loops.
Such algorithmic biases could be mitigated by properly collecting the training
data and using metrics that measure fairness in addition to standard accuracy metrics.
It is impossible to list all kinds of end-to-end metrics because they are very problem
dependent, but we can list some of the common ones used for image classification
systems:
• Click-through rate measures what percentage of users clicked a certain hyperlink.
This metric could be useful for the image content-based recommendation system.
Consider a user who is looking to buy an apparel item; based on her preferences,
the system shows an image of a recommended item. If the user clicks on the
image, the result is likely relevant.
• Win/loss ratio measures the number of successful applications of the system vs
unsuccessful ones over a certain period of time compared to some other system or
human. For example, if the goal is to classify images of a printed circuit board as
defective vs non-defective, we could compare the performance of the automatic
image classification system with that of the human operator. While comparing,
we count how many times the classifier made a correct prediction while the
operator made a wrong prediction (win) and how many times the classifier
made a wrong prediction while the human operator was correct (loss). Dividing
the count of wins by the count of losses, we can conclude whether deploying an
automated system makes sense.
• Man-hours saved. Consider a system that classifies a medical 3D image and, based
on the classification, highlights areas indicating potential disease. Such a system
would be able to save a certain amount of time for a medical doctor performing
a diagnosis.
There are dozens of classification metrics being used in the field of machine learning.
We will discuss the most commonly used and relevant for image classification.
Accuracy This is the simplest and most fundamental metric for any classification
problem. It is computed as the number of correctly classified samples divided by the
total number of samples. Here and further, by sample we mean an image from the
test set associated with one or more ground-truth labels. A correctly classified
sample is one for which the model prediction matches the ground truth. This metric
can be used in single-label classification problems. Its downsides are that it assumes
a prediction exists for every sample; it ignores the score and order of predictions;
it can be sensitive to incorrect or incomplete ground truth; and it ignores test data
imbalance. It would be fair to say that this metric is good for toy problems but not
sufficient for most real practical tasks.
Top K Accuracy This is a modification of the accuracy metric where the prediction
is considered correct if any of the top K-predicted classes (according to the score)
matches with the ground truth. This change makes the metric less prone to incom-
plete or ambiguous labels and helps to promote models that do a better job at ranking
and scoring classification predictions.
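As a sketch (using NumPy, with assumed array shapes for scores and labels), top K accuracy can be computed as:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    # scores: (n_samples, n_classes) prediction scores
    # labels: (n_samples,) ground-truth class indices
    top_k = np.argsort(scores, axis=1)[:, -k:]     # indices of the k highest scores
    hits = (top_k == labels[:, None]).any(axis=1)  # is the ground truth among them?
    return hits.mean()                             # k=1 reduces to plain accuracy
```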
Precision Let’s define true positives (TP) as the number of all predictions that
match (at least one) ground-truth label for a corresponding sample, false positives
(FP) as the number of predictions that do not match any ground-truth labels for a
corresponding sample and false negatives (FN) as the number of ground-truth labels
for which no matching predictions exist. Then, the precision metric is computed as:
Precision = TP / (TP + FP).
The precision metric is useful for models that may or may not predict a class for
any given input (which is usually achieved by applying a threshold to the prediction
confidence). If predictions exist for all test samples, then this metric is equivalent to
the accuracy.
Recall Using the notation defined above, the recall metric is defined as:
Recall = TP / (TP + FN).
This metric ignores false predictions and only measures how many of the true
labels were correctly predicted by the model.
Note that precision and recall metrics are oftentimes meaningless if used in
isolation. By tuning a confidence threshold, one can trade off precision for recall
and vice versa. Thus, for many models, it is possible to achieve nearly perfect
precision or recall.
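One way to instantiate these definitions for multi-label predictions, representing each sample's predictions and ground truth as sets of labels (a sketch, not the chapter's exact counting scheme):

```python
def precision_recall(predictions, ground_truth):
    """predictions, ground_truth: lists of label sets, one per sample."""
    tp = fp = fn = 0
    for pred, gt in zip(predictions, ground_truth):
        tp += len(pred & gt)   # predictions matching a ground-truth label
        fp += len(pred - gt)   # predictions matching no ground-truth label
        fn += len(gt - pred)   # ground-truth labels with no matching prediction
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```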
Recall at X% Precision and Precision at Y% Recall These are more practical
versions of the precision and recall metrics we defined above. They measure the
maximum recall at a target precision or maximum precision at a given recall. Target
precision and recall are usually derived from the application. Consider an image
search application where the user enters a text query and the system outputs images
that match that query from the database of billions of images. If the precision of an
image classifier used as a part of that system is low, the user experience could be
extremely poor (imagine searching something and seeing only one out of ten results
to be relevant). Thus, a reasonable goal could be to set a precision goal to be 90% and
try to optimise for as high recall as possible. Now consider another example –
classifying whether an MRI scan contains a tumour or not. The cost of missing an
image with a tumour could be literally deadly, while the cost of incorrectly
predicting that an image has a tumour would require a physician to double check
the prediction and discard it. In this case, recall of 99% could be a reasonable
requirement, while the model could be optimised to deliver as high precision as
possible. Further information on MRI can be found in Chaps. 11 and 12.
9 Image Classification as a Service 243
F1 Score Precision at X% recall and recall at Y% precision are useful when there is
a well-defined precision or recall goal. But what if we don’t have a way to fix either
precision or recall and measure the other one? This could happen if, for example,
prediction scores are not available or if the scores are very badly calibrated. In that
case, we can use the F1 score to compare two models. The F1 score is defined as
follows:
F1 = 2 · Precision · Recall / (Precision + Recall).
The F1 score takes the value of 1 in the case of 100% precision and 100% recall
and takes the value of 0 if either precision or recall is 0.
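A direct transcription, assuming precision and recall have already been computed:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall; zero if either is zero
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```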
Precision-Recall Curve and AUC-PR In many cases, model predictions have
associated confidence scores. By varying the score threshold from the minimum to
the maximum, we can calculate precision and recall at each of those thresholds and
generate a plot that will look like the one in Fig. 9.3. This plot is called the PR curve,
and it is useful to compare different models. From the example, we can conclude that
model A provides a better precision in low recall modes, while model B provides a
better recall at low precision mode.
PR curves are useful to better understand model performance in different modes,
but comparing and concluding which model is absolutely better could be hard. For
that purpose, we can calculate an area under the PR curve. This will provide us with
the single metric that captures both precision and recall of the model on the entire
range of threshold.
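A minimal sketch of this computation for a binary problem (NumPy; a step-function approximation of the area, not any specific library's implementation):

```python
import numpy as np

def pr_curve(scores, labels):
    """scores: positive-class confidences; labels: 0/1 ground truth."""
    order = np.argsort(-scores)          # sort by decreasing confidence
    labels = labels[order]
    tp = np.cumsum(labels)               # true positives at each threshold
    fp = np.cumsum(1 - labels)           # false positives at each threshold
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    return precision, recall

def auc_pr(precision, recall):
    # area under the PR curve via a step (rectangular) approximation
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```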
Fig. 9.3 Precision-recall curves for two models, A and B (precision on the vertical axis, recall on the horizontal axis)
Both inference and training time are very important for all practical applications of
image classification models for several reasons. First of all, faster inference means
lower classification latency and therefore better user experience. Fast inference also
means classification systems may be applied to real-time streamed data or to
multidimensional images. Inference speed also correlates with the model complexity
and required computing resources, which means it is potentially less expensive
to run.
Training speed is also an important factor. Some very complex models could take
weeks or months to train. Not only can this be very expensive, but it also slows
down innovation and increases risk, since it takes a long time before it becomes
clear whether the model has succeeded.
One of the ways to measure the computational complexity of a model is by counting
the floating point operations (FLOPs) required to run the inference or the training
step. This metric is a good approximation for comparing the complexity of different
models. However, it can be a bad predictor of real processing time. The
reason is because certain structures in the model may be better utilised by modern
hardware accelerators. For example, it is known that models that require a lot of
memory copies are running slower even if they require fewer FLOPs for operation.
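As one back-of-the-envelope illustration, the FLOP count of a single convolutional layer can be estimated as follows (counting a multiply-add as two operations and ignoring padding, stride and bias details):

```python
def conv2d_flops(out_h, out_w, c_in, c_out, kernel=3):
    # each output element requires c_in * kernel^2 multiply-adds
    return 2 * out_h * out_w * c_out * c_in * kernel * kernel
```

For example, a 3x3 convolution producing a 56x56x64 output from 64 input channels costs roughly 231 MFLOPs.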
For the reasons above, the machine learning community is working towards
standardising benchmarks to measure actual inference and training speed of certain
models on a certain hardware. Training benchmarks measure the wall-clock time
required to train a model on one of the standard datasets to achieve the specified
quality target. Inference benchmarks consist of many different runs with varying
input image resolution, floating point precision and QPS (queries per second) rates.
Another important feature of the model is how it behaves if the input images are
disturbed or from a domain different from the one the model was trained on. The set
of metrics used to measure these features is the same as that used for measuring
classification quality. The difference is in the input data. To measure model robust-
ness, one can add different distortions and noise to the input images or collect a set of
images from a different domain (e.g. if the model was trained on images collected
from the internet, one can collect a test set of images collected from smartphone
cameras).
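For instance, a robustness test set can be generated from clean images by adding synthetic distortions; the sketch below adds Gaussian noise (the noise level and seed are arbitrary choices):

```python
import numpy as np

def add_gaussian_noise(image, sigma=10.0, seed=0):
    """image: uint8 array; returns a noisy copy for robustness testing."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```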
9.4 Data
Data is the key for modern image classification systems. In order to build an image
classification system, one needs training, validation and test data at least. In addition
to that, a calibration dataset is needed if the plan is to provide calibrated confidence
scores, and out-of-domain data is needed to fine-tune model robustness. Building a
practical machine learning system is more about working with the data than actually
training the model. This is why more and more organisations today are establishing
data operations teams (DataOps) whose responsibilities are acquisition, annotation,
quality assurance, storage and analysis of the datasets.
Figure 9.4 shows the main steps of the dataset preparation process. Preparation of
the high-quality dataset is an iterative process. It includes data acquisition, human
annotation or verification, data engineering and data analysis.
Fig. 9.4 Main steps of the dataset preparation process: data acquisition (public data, synthetic data, commercial datasets, crowdsourcing, controlled acquisition), followed by human annotation or verification, data engineering and data analysis, connected by a feedback loop
Data acquisition is the very first step in preparing the dataset. There are several ways
to acquire the data with each way having its own set of pros and cons. The most
straightforward way is to use internal historical data owned by the organisation. An
example of that could be a collection of images of printed circuit boards with quality
assurance labels whether PCB is defective or not. Such data could be used with
minimal additional processing. The downside of using only internal data is that the
volume of samples might be small and the data might be biased in some way. In the
example above, it may happen that all collected images are made with the same
camera having exactly the same settings. Thus, when the model is trained on those
images, it may overfit to a certain feature of that camera and stop working after a
camera upgrade.
Another potential source of data is publicly available data. This may include
public datasets such as ImageNet (Fei-Fei and Russakovsky 2013) or COCO (Lin
et al. 2014) as well as any publicly available images on the Internet. This approach is
the fastest and least expensive way to get started if your organisation does not own
any in-house data or the volumes are not enough. There is a big caveat with this data
though. Most of these datasets are licensed for research or non-commercial use only,
which makes them impossible to use for business applications. Even datasets with less
restrictive licences may contain images with wrong licence information which may
lead to a lawsuit by the copyright owner. The same applies to public images
collected from the Internet. In addition to copyright issues, many countries are
tightening up privacy regulations. For example, General Data Protection Regulation
(GDPR) law in the European Union treats any image that may be used to identify an
individual as personal data, and therefore companies that collect images that may
contain personally identifiable information have to comply with storage and reten-
tion requirements of the law.
The more expensive way of acquiring the data is to buy a commercial dataset if
one exists that fits your requirements. The number of companies that are selling
datasets is growing rapidly these days; so, it is possible to purchase the dataset for
most popular applications.
Data crowdsourcing is the strategy to collect the data (usually including annota-
tions) using a crowdsourcing platform. Such a platform asks users to collect the
required data either for compensation or for free as a way to contribute to the
improvement of a service. An example of a paid data crowdsourcing platform is
Mobeye, and an example of a free data crowdsourcing platform is Google
Crowdsource. Another way of data crowdsourcing implementation is through the
data donation option available in the application or service. Some services provide
an option for users to donate their photos or other useful information that is
otherwise considered private to the owner of the service in order to improve that
service.
Controlled data acquisition is the process of creating data using a specially
designed system. An example of such a system is a set of cameras pointing to a
Depending on the way the images were acquired, they may or may not have ground-
truth labels, or the labels may be not reliable enough and need to be verified by
humans. Data annotation is the process of assigning or verifying ground-truth labels
to each image.
Data annotation is one of the most expensive stages of building image classifi-
cation systems as it requires the human annotator to visually analyse every sample,
which could be time-consuming. The process is typically managed using special
software that handles a dataset of raw images and associated labels, distributes work
among annotators, combines results and implements a user interface to do annota-
tions in the most efficient way. Currently, dozens of free and paid image annotation
platforms exist that offer different kinds of UIs and features. Several strategies in
data annotation exist aimed to reduce the cost and/or increase the ground-truth
quality which we discuss below:
Outsourcing to an Annotation Vendor Currently, there exist dozens of companies
providing data annotation services. These companies specialise in the cost-effective
annotation of data. Besides having the right software tools, they
handle process management, quality assurance, storage and delivery. Many of
such companies provide jobs in areas of the world with very low income or to
incarcerated people who would not have other means to earn money. Besides the
benefits of reduced costs of labour and reduced management overhead, another
Fig. 9.5 The active learning loop: an ML model is trained on labeled data; its predictions, features or gradients on unlabeled data feed a sample selection strategy; the selected samples go to human annotation, and the newly labeled data updates the datasets
would have to spend a lot of money to annotate all of them to get annotations for all
rare fruits. In order to tackle this problem, an active learning approach could be used
(Schröder and Niekler 2020). Active learning attempts to maximise a model’s
performance gain while annotating the fewest samples possible. The general idea
of active learning is shown in Fig. 9.5.
Active learning helps to either significantly reduce costs or improve quality by
only annotating the most valuable samples.
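A common selection strategy picks the unlabeled samples the model is least confident about; a minimal sketch, assuming softmax scores are available:

```python
import numpy as np

def select_least_confident(scores, budget):
    # scores: (n_samples, n_classes) softmax outputs on unlabeled data
    confidence = scores.max(axis=1)         # confidence of the top prediction
    return np.argsort(confidence)[:budget]  # indices of least confident samples
```

The returned indices are then sent to human annotation, and the loop repeats.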
Data engineering is a process of manipulating the raw data in a way to make it useful
for training and/or evaluation. Here are some typical operations that may be required
to prepare the data:
• Finding near-duplicate images
• Ground-truth label smearing and merging
• Applying taxonomy rules and removing contradicting labels
• Offline augmentation
• Image cropping
• Removing low-quality images (low resolution or size) and low confidence labels
(e.g. labels that have low agreement between annotators)
• Removing inappropriate images (porn, violence, etc.)
• Sampling, querying and filtering samples satisfying certain criteria (e.g. we may
want to sample no more than 1000 samples for each apparel category containing a
certain attribute)
• Converting storage formats
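As one illustration of the first operation, near-duplicate images can be found with a simple perceptual hash; the average-hash sketch below (assuming grayscale input whose sides are multiples of the hash size) is a toy version of what dedicated libraries do:

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """gray: 2D array whose sides are multiples of hash_size."""
    h, w = gray.shape
    blocks = gray.reshape(hash_size, h // hash_size,
                          hash_size, w // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()   # boolean fingerprint

def hamming_distance(h1, h2):
    # small distance => likely near-duplicates
    return int(np.count_nonzero(h1 != h2))
```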
Multimedia data takes a lot of storage compared to text and other structured data.
At the same time, images are one of the most abundant types of information, with
billions of images uploaded online every day. This makes data engineering for
image datasets a non-trivial task.
There are several reasons why datasets should be handled by specialised software
rather than just kept as a set of files on a disk. The first reason is backup and
redundancy. As we mentioned above, multimedia data may take a lot of storage,
which increases the chance of data corruption or loss if no backup and redundancy
is used.
The second is dataset versioning. In academia, it is typical for a dataset to be used
without any change for over 10 years. Usually, this is very different in practical
applications. Datasets created for practical use cases are always evolving – new
samples added, bad samples removed, ground-truth labels could be added continu-
ously, datasets could be merged or split, etc. This leads to a situation where
introducing a bug in the dataset is very easy and debugging this bug is extremely
hard. Dataset versioning tools help to treat dataset versions similarly to how code
versions are treated. A number of tools exist for dataset version control, both
commercial and open source. DVC is one of the most popular; it allows users to
manage and version data and integrates with most cloud storage platforms.
The third reason is data retention management. A lot of useful data could contain
some private information or be copyrighted, especially if this data is crawled from
the web. This means that such data must be handled carefully and should have ways
to remove a sample following an owner request. Some regulators also require that
the data should be stored for a limited time frame after which it should be deleted.
The deep learning revolution made most of the classical computer vision approaches
to image classification obsolete. Figure 9.6 shows the progress of image classifica-
tion models over the last 10 years on a popular ImageNet dataset (Deng et al. 2009).
The only classical computer vision approach is SIFT-FV with 50.9% top 1 accuracy.
The best deep learning model is over four times better in terms of classification error
(EfficientNet-L2). For quite some time, deep learning was considered a high-
accuracy but high-cost approach because it required considerable computational
resources to run inference. In recent years however, much new specialised hardware
has been developed to speed up training and inference. Today, most flagship
smartphones have some version of a neural network accelerator. Reducing costs
and time for training and inference is one of the factors of the increasing popularity
of deep learning. Another factor is a change of software engineering paradigm that
some call “Software 2.0”. In this paradigm, developers no longer build a system
piece by piece. Instead, they specify the objective, collect training examples and let
optimisation algorithms build a desired system. This paradigm turns system devel-
opment into a set of experiments that are easier to parallelise and thus speed up
progress.
Fig. 9.6 Evolution of image classification models’ top 1 accuracy on ImageNet dataset
The amount of research produced every year in the area of machine learning and
computer vision is overwhelming. The NeurIPS conference shows a growing trend
in paper submissions (e.g. in 2019 there were 6743 submissions, and 1428 papers
were accepted). Not all published research passes the practice test. In
this chapter, we will briefly discuss the most effective model architectures and
training approaches.
A typical neural network-based image classifier can be divided into backbone and
prediction layers (Fig. 9.7). The backbone consists of the input layer, hidden layers
and feature (or bottleneck) layers. It takes an image as an input and produces the
image representation, or features. The predictor then uses this representation to
predict classes. This division is somewhat virtual, but it conveniently separates a
reusable and more complex part that extracts the image representation from a
predictor, which is usually a simple few-layer fully connected network. In the
following, we focus on choosing the architecture for the backbone part of the
classifier.
There are over a hundred different neural network model architectures that exist
today. However, nearly all successful architectures are based on the concept of the
residual network (ResNet) (He et al. 2016). A residual network consists of a chain of
residual blocks as depicted in Fig. 9.8. The main idea is to have a shortcut connection
between the input and output. Such connection allows for the gradients to flow freely
and avoid a vanishing gradient problem that for many years prevented building very
deep neural networks. There are several modifications to the standard residual block.
One modification proposes to add more levels of shortcut connections (DenseNet).
Fig. 9.7 A typical neural network-based image classifier: an input image passes through the backbone (input layer, hidden layers, feature layer) and then through the predictor (hidden layers) to produce class predictions
Fig. 9.8 A residual block: weight, activation and normalization layers between input and output, with a residual (shortcut) connection, optionally through a 1x1 convolution
Another modification proposes to assign and learn weights for the shortcut connec-
tion (HighwayNet).
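The residual computation can be sketched in a few lines of NumPy (a toy dense version; real residual blocks use convolutions, normalization and learned weights):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """out = relu(x + W2 @ relu(W1 @ x)); the `x +` term is the shortcut
    connection that lets gradients flow past the weight layers."""
    return relu(x + w2 @ relu(w1 @ x))
```

With zero weights the block reduces to the identity on non-negative inputs, which is exactly why residual networks are easy to optimise.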
Deploying an ML model is a trade-off between cost, latency and accuracy. Cost is
mostly managed through the hardware that runs the model inference. More costly
and powerful hardware can either run a model faster (lower latency) or run a larger and
more accurate model with the same latency. If the hardware is fixed, then the trade-
off is between the model size (i.e. latency) and the accuracy.
When choosing or designing the ResNet-based model architecture, the accuracy-
latency trade-off is achieved through varying the following parameters:
1. Resolution: this includes input image resolution as well as the resolution of
intermediate feature maps.
2. Number of layers or blocks of layers (such as residual block).
3. Block width, i.e. the number of channels of the feature maps of the
convolutional layers.
By varying model depth, width and resolution, one can influence model accuracy
and inference speed. However, predicting how accuracy will change as a result of
changing one of those parameters is not possible. The same applies to predicting
model inference speed. Even though it is straightforward to estimate the amount of
computation and memory required for a new architecture, a given piece of hardware
may run certain architectures that require more computation faster than others that
require less.
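The three knobs above are often scaled jointly rather than tuned independently. As a sketch, EfficientNet-style compound scaling (Tan and Le 2019) ties depth, width and resolution to a single coefficient phi (the base configuration below is invented for illustration):

```python
# Compound scaling (Tan and Le 2019): one coefficient phi scales depth,
# width and input resolution together. alpha, beta, gamma are found by a
# small grid search under the constraint alpha * beta**2 * gamma**2 ~ 2,
# so that raising phi by 1 roughly doubles the model's FLOPs.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # values reported for EfficientNet

def scale_model(base_depth, base_width, base_resolution, phi):
    depth = round(base_depth * ALPHA ** phi)            # more layers
    width = round(base_width * BETA ** phi)             # more channels
    resolution = round(base_resolution * GAMMA ** phi)  # larger input
    return depth, width, resolution

print(scale_model(18, 64, 224, phi=0))  # baseline: (18, 64, 224)
print(scale_model(18, 64, 224, phi=2))  # a larger, more accurate variant
```

Each increment of phi trades a predictable increase in computation for accuracy, which makes the accuracy-latency trade-off far easier to navigate than tuning depth, width and resolution by hand.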
Because of the abovementioned problems, designing model architectures used to
be an art as much as science. But recently, more and more successful architectures
are designed by optimisation or search algorithms (Zoph and Le 2016). Examples of
the architectures that were designed by algorithm are EfficientNet (Tan and Le 2019)
and MobileNetV3 (Howard et al. 2019). Both architectures are the result of neural
architecture search.
254 M. Y. Sirotenko
Depending on the amount and kind of training data, privacy requirements,
resources available for model improvement and other constraints and considerations,
different training approaches can be chosen.
The most straightforward approach is fine-tuning a pre-trained model.
The idea is to find a pre-trained visual model and fine-tune it on the collected
dataset. Fine-tuning in this case means training a model whose backbone is
initialised from the pre-trained model and whose predictor is initialised randomly.
Thousands of models pre-trained on various datasets are available online for download.
There are two modes of fine-tuning: full and partial. Full-model fine-tuning trains the
entire model, while partial fine-tuning freezes most of the model and trains only
certain layers. Typically, in the latter mode, the backbone is frozen while the predictor
is trained. This mode is used when the dataset is small or when quick training is
needed. Fine-tuning is also used as a baseline before trying other ways of
improving model accuracy.
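Partial fine-tuning can be sketched as follows (a toy illustration: the "frozen backbone" is a fixed random projection standing in for a real pre-trained network, and only the randomly initialised logistic-regression predictor is trained):

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(16, 4))  # frozen: never updated below

def frozen_backbone(x):
    # Stand-in for a pre-trained backbone mapping raw inputs to features.
    return np.tanh(x @ W_backbone)

# Toy labelled dataset; labels chosen so they are learnable from features.
x = rng.normal(size=(100, 16))
y = (frozen_backbone(x)[:, 0] > 0).astype(float)

feats = frozen_backbone(x)   # computed once: the backbone is frozen

# Randomly initialised predictor (logistic-regression head).
w, b, lr = np.zeros(4), 0.0, 0.5
for _ in range(200):         # gradient descent on the predictor only
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= lr * feats.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean((p > 0.5) == (y == 1))
print(f"training accuracy: {acc:.2f}")
```

Because the backbone is never updated, the features can be pre-computed once, which is why this mode is so fast on small datasets.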
As mentioned above, data labelling is one of the most expensive stages of
building an image classification system. Acquiring unlabelled data, on the other hand,
can be much easier. Thus, it is very common to have a small labelled dataset and a
much larger dataset with no or weak labels. Self-supervised and semi-supervised
approaches aim to utilise the massive amounts of unlabelled data to improve model
performance. The general idea is to pre-train a model on a large unlabelled dataset in
an unsupervised mode and then fine-tune it on a smaller labelled dataset. This idea
is not new and has been known for about 20 years. However, only recent advances in
unsupervised pre-training made it possible for such models to compete with fully
supervised training regimes where all the data is labelled. One of the most successful
approaches to unsupervised pre-training is contrastive learning (Chen et al. 2020a).
The idea of contrastive learning is depicted in Fig. 9.9. An unlabelled input image is
transformed by two random transformation functions. The two transformed
images are fed into an encoder network to produce corresponding representations.
Finally, the two representations are used as inputs to a projection network whose
outputs are used to compute a consistency loss. The consistency loss pushes two
projections of the same image to be close, while projections of different images are
pushed apart.
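A NumPy sketch of a SimCLR-style consistency loss makes this concrete (a simplified stand-in for the loss of Chen et al. 2020a; batch size, dimensions and temperature are arbitrary):

```python
import numpy as np

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def contrastive_loss(z1, z2, temperature=0.5):
    """Projections of two views of the same image are pulled together;
    every other pair in the batch acts as a negative and is pushed apart."""
    z = normalize(np.concatenate([z1, z2], axis=0))  # 2N projections
    sim = z @ z.T / temperature                      # cosine similarities
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                   # exclude self-pairs
    # the positive for row i is the other view of the same image
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
noise = 0.01 * rng.normal(size=(8, 32))
loss_matched = contrastive_loss(z, z + noise)          # two views agree
loss_random = contrastive_loss(z, rng.normal(size=(8, 32)))
print(loss_matched, loss_random)
```

Matching views give a much lower loss than unrelated projections, which is exactly the signal that trains the encoder without labels.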
It was shown that unsupervised contrastive learning works best with large
convolutional neural networks (Chen et al. 2020a, b). One way to make this approach
more practical is knowledge distillation.
Knowledge distillation in a neural network consists of two steps:
1. In the first step, a large model or an ensemble of models is trained using ground-
truth labels. This model is called the teacher model.
2. In the second step, a (typically) smaller network is trained using the predictions of
the teacher network as ground truth. This model is called the student network.
In the context of semi-supervised learning, a larger teacher model is trained
using unlabelled data and then distilled into a smaller student network. It was shown
that such distillation is possible with negligible accuracy loss (Hinton et al.
2015).
Knowledge distillation is not only useful for semi-supervised approaches but also
as a way to control model size and computational requirements while keeping high
accuracy.
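The second step can be sketched as a cross-entropy between temperature-softened teacher and student outputs (toy logits; the temperature value is illustrative):

```python
import numpy as np

def softmax(logits, t=1.0):
    z = logits / t
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, t=4.0):
    """Cross-entropy between softened teacher and student distributions
    (Hinton et al. 2015). A higher temperature t exposes the teacher's
    knowledge about relative class similarities, not just the top class."""
    p_teacher = softmax(teacher_logits, t)
    log_p_student = np.log(softmax(student_logits, t))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

teacher = np.array([[8.0, 2.0, -1.0]])
good_student = np.array([[7.5, 2.2, -0.8]])   # mimics the teacher
bad_student = np.array([[-1.0, 0.0, 6.0]])    # disagrees with it
loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
print(loss_good, loss_bad)
```

Minimising this loss drives the student's full output distribution, not just its argmax, toward the teacher's.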
Another approach related to semi-supervised learning is domain transfer
learning. Domain transfer is the problem of using one dataset to train a model to make
predictions on a dataset from a different domain. Examples of domain transfer
problems include training on synthetic data and predicting on real data, training on
e-commerce images of products and predicting on user images, and so on.
There are two ways of tackling the domain transfer problem depending on
whether there is some training data from the target domain or not:
1. If data from the target domain is available, then contrastive training that
aims to minimise the distance between the source and target domains is one of the
most successful approaches.
2. If the target domain samples are unavailable, then the goal is to train a model
robust to the domain shift. In this case, heavy augmentation, regularisation and
contrastive losses help.
Another common problem when training image classification models is data
imbalance. Almost every real-life dataset has some kind of data imbalance, which
manifests in some classes having orders of magnitude more training samples than
others. Data imbalance may result in a biased classifier. The most obvious way to
solve the problem is to collect more data for the underrepresented classes. This, however,
is not always possible and can be too costly. Another widely used approach is
under- or over-sampling. The idea is to use fewer samples of the overrepresented
class and to duplicate samples of the underrepresented class during training. This
approach is simple to implement, but the accuracy improvement for the underrepresented
classes often comes at the price of reduced accuracy for the overrepresented
ones. There is a vast amount of research aimed at handling data imbalance by
building a better loss function (see Cao et al. 2019, for instance).
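A sketch of over-sampling by index duplication (toy labels with a 9:1 imbalance):

```python
import numpy as np

def oversample_indices(labels, rng):
    """Duplicate samples of underrepresented classes so that every class
    contributes the same number of samples per epoch."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        # sample with replacement up to the size of the largest class
        idx.append(rng.choice(members, size=target, replace=True))
    return rng.permutation(np.concatenate(idx))

rng = np.random.default_rng(0)
labels = [0] * 90 + [1] * 10           # 9:1 imbalance
idx = oversample_indices(labels, rng)
balanced = np.asarray(labels)[idx]
print(np.bincount(balanced))           # both classes now equally frequent
```

Under-sampling is the mirror image: sample each class down to the size of the smallest one instead.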
One more training approach we would like to mention in this chapter is federated
learning (FL) (Yang et al. 2019). Federated learning is a type of privacy-preserving
learning in which no central data store is assumed. As shown in Fig. 9.10, in FL there is
a centralised server that performs the federated learning and a large number of user
devices. Each device downloads a shared model from the server, uses it and
computes gradients using only the data available on that device. According to a
schedule, those gradients are sent to the centralised server, where the gradients from all
users are integrated into the shared model. This approach guarantees that no raw user
data ever leaves the device. It has been gaining popularity recently, since it allows
improving model performance using users' data without compromising their privacy.
[Fig. 9.10: Federated learning: user devices (User 1 ... User N) send gradients to a centralized server and download the updated model]
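A minimal simulation of this loop, assuming a simple least-squares model and three toy devices (the data, model and learning rate are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def client_gradient(weights, x, y):
    """Gradient of a least-squares loss computed locally on one device;
    only this gradient, never the raw (x, y) data, is sent to the server."""
    pred = x @ weights
    return x.T @ (pred - y) / len(y)

# Shared model and three simulated user devices with private data.
weights = np.zeros(3)
true_w = np.array([1.0, -2.0, 0.5])
devices = []
for _ in range(3):
    x = rng.normal(size=(50, 3))
    devices.append((x, x @ true_w))   # each device keeps its data locally

lr = 0.1
for _ in range(100):                  # one federated round per iteration
    grads = [client_gradient(weights, x, y) for x, y in devices]
    weights -= lr * np.mean(grads, axis=0)   # server aggregates updates

print(weights.round(2))
```

The server only ever sees averaged gradients, yet the shared model converges to the weights that fit all devices' data.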
9.6 Deployment
After the model is trained and evaluated, the final step is model deployment.
Deployment in the context of this chapter means running your model in production
to classify images coming from the users of your service. The important factors at
this step are latency, throughput and cost. Latency is how quickly a user gets a
response from your service after sending an input image, and throughput is how many
requests your service can process per unit of time without failures. Other important
factors during the deployment stage are the convenience of updating the model,
proper versioning and the ability to run A/B tests. The latter depend on the software
framework chosen for deployment. TensorFlow Serving is a good example of a
platform that delivers high-performance serving of models with gRPC and REST
client support.
The way to reduce the latency or increase the throughput of the model serving
system is to use more powerful hardware. This can be done either by building
your own server or by using one of many cloud solutions. Most cloud solutions for
serving models offer instances with modern GPU support that provide much better
efficiency than CPU-only solutions. Another alternative to GPUs for accelerating
neural networks is tensor processing units (TPUs), which were specifically designed
for running neural network training and inference.
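As an illustration, the REST request body expected by TensorFlow Serving's predict endpoint can be assembled as follows (the host, default port 8501, and the model name "classifier" are placeholders; only the payload is built here, no request is sent):

```python
import json

def build_predict_request(model_name, images):
    """TensorFlow Serving's REST API accepts a JSON body with an
    "instances" list at /v1/models/<name>:predict."""
    url = f"http://localhost:8501/v1/models/{model_name}:predict"
    body = json.dumps({"instances": list(images)})
    return url, body

url, body = build_predict_request("classifier", [[0.1, 0.2], [0.3, 0.4]])
print(url)
print(body)
```

In production, this body would be POSTed to the serving instance; the gRPC interface offers the same functionality with lower serialization overhead.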
Another aspect of model deployment that is important to keep in mind is
protecting the model from theft. Protecting the confidentiality of ML models is
important for two reasons: (a) a model can be a business advantage to its owner, and
(b) an adversary may use a stolen model to find transferable adversarial examples
that can evade classification by the original model. Several methods have been
proposed recently to detect model stealing attacks, as well as to protect against them
by embedding watermarks into the neural networks (Juuti et al. 2019; Uchida et al. 2017).
References
Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-
distribution-aware margin loss. In: Advances in Neural Information Processing Systems,
pp. 1567–1578 (2019)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of
visual representations. arXiv preprint arXiv:2002.05709 (2020a)
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are
strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020b)
Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation
policies from data. arXiv preprint arXiv:1805.09501 (2018)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical
image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 248–255 (2009)
Fei-Fei, L., Russakovsky, O.: Analysis of large-scale visual recognition. Bay Area Vision Meeting
(2013)
Hastings, R.: Making the most of the cloud: how to choose and implement the best services for your
library. Scarecrow Press (2013)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G., Vinyals, O, Dean, J.: Distilling the knowledge in a neural network, arXiv preprint
arXiv:1503.02531 (2015)
Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R.,
Vasudevan, V., Le, Q.V.: Searching for MobileNetV3. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 1314–1324 (2019)
Jia, M., Shi, M., Sirotenko, M., Cui, Y., Cardie, C., Hariharan, B., Adam, H., Belongie, S.:
Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset. arXiv preprint
arXiv:2004.12276 (2020)
Juuti, M., Szyller, S., Marchal, S., Asokan, N.: PRADA: protecting against DNN model stealing
attacks. In: Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&P),
pp. 512–527 (2019)
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.:
Microsoft COCO: common objects in context. In: Proceedings of the European Conference on
Computer Vision, pp. 740–755. Springer, Cham (2014)
Mozur, P.: One month, 500,000 face scans: How China is using A.I. to profile a minority (2019).
Accessed on September 27 2020. https://www.nytimes.com/2019/04/14/technology/china-
surveillance-artificial-intelligence-racial-profiling.html
Neuberger, A., Alshan, E., Levi, G., Alpert, S., Oks, E.: Learning fashion traits with label
uncertainty. In: Proceedings of KDD Workshop Machine Learning Meets Fashion (2017)
Nikolenko, S. I.: Synthetic data for deep learning, arXiv preprint arXiv:1909.11512 (2019)
Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G., Tran, D.: Measuring calibration in deep
learning. In: Proceedings of the CVPR Workshops, pp. 38–41 (2019)
Schröder, C., Niekler, A.: A survey of active learning for text classification using deep neural
networks, arXiv preprint arXiv:2008.07267 (2020)
Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: Bigbird: A large-scale 3d database of
object instances. In: Proceedings of the IEEE International Conference on Robotics and
Automation, pp. 509–516 (2014)
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks, arXiv
preprint arXiv:1905.11946 (2019)
Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.I.: Embedding watermarks into deep neural networks.
In: Proceedings of the ACM on International Conference on Multimedia Retrieval, pp. 269–277
(2017)
y Arcas, B.A., Mitchell, M., Todorov, A.: Physiognomy’s New Clothes. In: Medium. Artificial
Intelligence (2017) Accessed on September 27 2020 https://medium.com/@blaisea/
physiognomys-new-clothes-f2d4b59fdd6a.
Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: concept and applications. ACM
Trans. Intell. Syst. Technol. 10(2), 1–19 (2019)
Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning, arXiv preprint
arXiv:1611.01578 (2016)
Chapter 10
Mobile User Profiling
10.1 Introduction
Fig. 10.1 Sensors and data sources available in a modern smartphone (part of this image was
designed by Starline/Freepik)
data among others. Most of them can be used directly as features for demographic
prediction, except Web data. Thus, another issue is to develop a way to represent
the massive amount of text data from Web pages as an input feature vector for the
demographic prediction method. To solve this issue, we propose using advanced
natural language processing (NLP) technologies based on a probabilistic topic
modelling algorithm (Blei 2012) to extract meaningful textual information from the
Web data.
In this method, we propose extracting the user's interests from Web data to construct
a compact representation of it. For example, user interests may be expressed by
which books the user reads or buys, which sports interest the user or which purchases
the user could potentially make.
This compact representation is suitable for training the demographic classifier. It
should be noted that the user's interests can also be used directly by the content
provider for better targeting of advertisement services or other interactions with the user.
To achieve flexibility in demographic prediction and to provide language independence,
we propose using common news categories as a model of user interests.
News streams are available in all languages of interest, and news categories are
reasonably universal across languages and cultures. This makes it possible to build
a multi-lingual topic model.
This endeavour requires building and training a topic model with a classifier of
text data for specified categories (interests). A list of desired categories can be
provided by the content provider. The topic model categorizes the text extracted
from Web pages and can be built with the additive regularization of topic models
(ARTM) algorithm. User interests are then extracted using the trained topic model.
ARTM is based on a generalization of two powerful algorithms: probabilistic
latent semantic analysis (PLSA) (Hofmann 1999) and latent Dirichlet allocation
(LDA) (Blei 2012). The additive regularization framework allows imposing additional
necessary constraints on the topic model, such as sparseness or a desired word
distribution. It should be noted that ARTM can be used not only for clustering but also
for classification over a given list of categories.
Next, the demographic model is trained using datasets collected from mobile
users. Features are extracted from the collected data with the help of the topic model
trained in the previous step. The demographic model comprises several (in our case
three) demographic classifiers, which must predict the user's age (one of '0–18',
'19–21', '22–29', '30+'), gender (male or female) or marital status (married/not
married) from a given feature vector. The architecture of the proposed method is
depicted in Fig. 10.2.
It is important to emphasize that the language the user will use cannot be
predicted in advance. This uncertainty means that the model must be multi-lingual:
a speaker of another language should only need to load the data for his/her own language.
ARTM allows the inclusion of various types of modalities (translations into different
languages, tags, categories, authors, etc.) in one topic model. We propose using
cross-lingual features to implement a language-independent (multi-lingual) NLP
procedure. The idea of cross-lingual feature generation involves training one topic
262 A. M. Fartukov et al.
10.2.1 Pre-processing
As mentioned earlier, the major source of our observational data is the Web pages that
the user has browsed. Pre-processing of Web pages includes the following operations:
removing HTML tags, performing stemming or lemmatization of every word,
removing stop words, converting all characters to lowercase and translating the
page content into the target languages. We consider three target languages (Russian,
English and Korean) in the proposed algorithm. We use the 'Yandex.Translate' Web
service for translation.
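A minimal sketch of these operations (using a tiny invented stop-word list; a real pipeline would also lemmatize and call a translation service such as the one above):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "and", "of"}  # tiny illustrative list

def preprocess(html):
    """Pre-processing sketch: strip HTML tags, lowercase the text and
    drop stop words, leaving a bag of content words."""
    text = re.sub(r"<[^>]+>", " ", html)          # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
    return [t for t in tokens if t not in STOP_WORDS]

page = "<html><body><h1>The Rules of Football</h1>A game is played...</body></html>"
print(preprocess(page))  # ['rules', 'football', 'game', 'played']
```

The resulting token lists are what the topic model described below consumes as "documents".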
10 Mobile User Profiling 263
To model the user's interests, one must analyse the documents viewed by the user.
Topic modelling enables us to assign a set of topics T and to use it to estimate the
conditional probability that word w appears inside document d:

\[ p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d). \]
In accordance with PLSA, we follow the assumption that all documents in a
collection inherit one cluster-specific distribution for every cluster of topic-related
words. Our purpose is to assign topics T that maximize the functional L:

\[ L(\Phi, \Theta) = \ln \prod_{d \in D} \prod_{w \in d} p(w \mid d)^{n_{dw}} \to \max_{\Phi, \Theta}, \]

where \(n_{dw}\) denotes the number of times word w is encountered in document d,
\(\Phi = (p(w \mid t))_{W \times T} = (\varphi_{wt})_{W \times T}\) is the matrix of term
probabilities for each topic, and \(\Theta = (p(t \mid d))_{T \times D} = (\theta_{td})_{T \times D}\)
is the matrix of topic probabilities for each document. By using the abovementioned
expressions, we obtain the following:

\[ L(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} p(w \mid t)\, p(t \mid d) \to \max_{\Phi, \Theta} \]

subject to

\[ \sum_{w \in W} p(w \mid t) = 1, \quad p(w \mid t) \ge 0; \qquad \sum_{t \in T} p(t \mid d) = 1, \quad p(t \mid d) \ge 0. \]
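The factorization \(p(w \mid d) = \sum_{t} p(w \mid t)\, p(t \mid d)\) and the log-likelihood can be checked numerically with a toy sketch (random stochastic matrices stand in for the learned \(\Phi\) and \(\Theta\)):

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D = 5, 2, 3   # vocabulary size, number of topics, number of documents

# Column-stochastic Phi (words per topic) and Theta (topics per document).
phi = rng.random((W, T)); phi /= phi.sum(axis=0)
theta = rng.random((T, D)); theta /= theta.sum(axis=0)
n = rng.integers(0, 10, size=(W, D))    # n_dw: word counts per document

p_wd = phi @ theta                      # p(w|d) = sum_t p(w|t) p(t|d)
log_likelihood = (n * np.log(p_wd)).sum()
print(log_likelihood)

# Each column of p_wd is itself a probability distribution over words.
print(p_wd.sum(axis=0))
```

Training amounts to adjusting \(\Phi\) and \(\Theta\) (under the stochasticity constraints reproduced in the code) so that this log-likelihood is maximized.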
Special attention should be paid to the following fact: zero probabilities are not
acceptable under the natural logarithm in the above equation for \(L(\Phi, \Theta)\).
To overcome this issue, we followed the ARTM method proposed by Vorontsov and
Potapenko (2015). A regularization term \(R(\Phi, \Theta)\) is added:

\[ R(\Phi, \Theta) = \sum_{i=1}^{r} \tau_i R_i(\Phi, \Theta), \quad \tau_i \ge 0, \]

where the individual regularizers \(R_i\) are built from the Kullback–Leibler divergence

\[ \mathrm{KL}(p \parallel q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}. \]

Smoothing requires the word distributions \(\varphi_{wt}\) to be close to a fixed
distribution \(\beta_w\) and the topic distributions \(\theta_{td}\) to be close to a
fixed distribution \(\alpha_t\):

\[ \sum_{t \in T} \mathrm{KL}_w(\beta_w \parallel \varphi_{wt}) \to \min_{\Phi}, \qquad \sum_{d \in D} \mathrm{KL}_t(\alpha_t \parallel \theta_{td}) \to \min_{\Theta}. \]

To achieve both minima, we combine the last two expressions into a single
regularizer \(R_s(\Phi, \Theta)\):

\[ R_s(\Phi, \Theta) = \beta_0 \sum_{t \in T} \sum_{w \in W} \beta_w \ln \varphi_{wt} + \alpha_0 \sum_{d \in D} \sum_{t \in T} \alpha_t \ln \theta_{td} \to \max, \]

and the final optimization problem becomes

\[ L(\Phi, \Theta) + R_s(\Phi, \Theta) \to \max_{\Phi, \Theta}. \]
After performing step 3, it is possible to describe each topic t with the set of its words
w using the probabilities p(w| t). It also becomes possible to map each input
document d into a vector of topics T in accordance with probabilities p(t| d ). At
the next step, we need to aggregate the topic information about the documents
\(d_1^u, \ldots, d_{N_d}^u\) viewed by the user u into a single vector. Thus, we average
the obtained topic vectors in the following manner:

\[ p_u(t_i \mid d) = \frac{1}{N_d} \sum_{j=1}^{N_d} p_u\left(t_i \mid d_j^u\right), \]

where \(N_d\) denotes the number of documents viewed by user u and \(d_j^u\) is the
j-th document viewed by the user.
The resulting topic vector (or user interest vector) is used as the feature vector for
the demographic model. Let us consider it in detail.
The demographic model consists of several demographic classifiers. In the
proposed method, the following classifiers are used: age, gender and marital status.
In the present work, such classifiers are built with a deep learning approach, and the
Veles framework (Veles 2015) is used as the deep learning platform.
Each classifier is built with a neural network (NN) and optimized with a genetic
algorithm. The NN architecture is based on the multi-layer perceptron (Collobert and
Bengio 2004). It should be noted that we determined the possible NN architecture of
each classifier and the optimal hyper-parameters using a genetic algorithm.
We used the following hyper-parameters of NN architecture: size of the
minibatch, number of layers, number of neurons in each layer, activation function,
dropout, learning rate, weight decay, gradient moment, standard deviation of
weights, gradient descent step, regularization coefficients, initial ranges of weights
and number of examples per iteration.
Using a genetic algorithm, we can adjust these hyper-parameters (Table 10.1).
We also use the genetic algorithm to select optimal features in the input feature vector
and thereby reduce the size of the input feature vector of the demographic model.
When the genetic algorithm operates, a population P of M = 75 instances of
demographic classifiers with the abovementioned parameters is created. Next, we
use error backpropagation to train these classifiers. Based on the training results, the
classifiers with the highest performance in terms of demographic profile prediction are
chosen. Subsequently, to add new classifiers to the population, a crossover operation
is applied. The crossover randomly substitutes the values of parameters in which the
two original classifiers do not coincide.
Let us illustrate the process with the following example. If classifier C1 contains
n1 = 10 neurons in the first layer and classifier C2 contains n2 = 100 neurons, then
the crossover operation may replace this value with 50. The
newly created classifier C3 ¼ crossover(C1, C2) replaces a classifier that shows the
worst performance in the population P.
To introduce modifications of the best classifiers, a mutation operation is also
applied. Each classifier with new parameters is added to the population of classifiers;
subsequently, all new classifiers are retrained and their performance is measured.
This process is repeated until classification performance stops improving. The
demographic classifier with the best performance in the last population is chosen as
the final one.
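The crossover step can be sketched as follows (a toy illustration; the hyper-parameter names, values and ranges are invented, and mutation and retraining are omitted):

```python
import random

def crossover(params_a, params_b, rng):
    """Crossover sketch: where two parent classifiers agree on a
    hyper-parameter, the child inherits it; where they disagree, the
    child gets a value drawn between the two parents' values."""
    child = {}
    for key in params_a:
        a, b = params_a[key], params_b[key]
        if a == b:
            child[key] = a
        else:
            lo, hi = sorted((a, b))
            child[key] = rng.randint(lo, hi)  # a value in [lo, hi]
    return child

rng = random.Random(0)
c1 = {"layers": 3, "neurons": 10, "dropout_pct": 20}
c2 = {"layers": 3, "neurons": 100, "dropout_pct": 50}
child = crossover(c1, c2, rng)
print(child)
```

In the full loop, each such child replaces the worst-performing member of the population, as described above.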
Let us consider the results of the proposed method. To build robust demographic
prediction models, we collected an extensive dataset with various available types of
features from mobile users. The principles and a detailed description of the dataset
are given in Sect. 10.4.
For demographic prediction, we explored different machine learning approaches
and methods: support vector machines (SVMs) (Cortes and Vapnik 1995), NNs
(Bishop 1995) and logistic regression (Bishop 2006). We performed accuracy tests
(without optimization), the results of which are presented in Table 10.2.
Based on the obtained results, the NN approach was chosen to build the demographic
prediction classifier. Although a myriad of freely available deep learning frameworks
for training NNs exist (Caffe, Torch, Theano, etc.), we decided to use our custom deep
learning framework (Veles 2015) because it was designed as a very flexible tool in
terms of workflow construction, data extraction, pre-processing and visualization.
It also has an additional advantage: the ease of porting the resulting classifier to
mobile devices.
It should be noted that we also had to optimize the number of topic features: we
decreased the initial number of ARTM features from 495 to 170. The demographic
prediction accuracies using topic features generated from ARTM and LDA are
shown in Table 10.3.
behavioural profile corresponds to a legitimate user, the new incoming data are
passed to the implicit authentication step. Otherwise, the user will be locked out, and
the authentication system will ask the user to verify his/her identity using explicit
authentication methods such as a password or biometrics (fingerprint, iris, etc.).
The framework that is elaborated above determines requirements that the implicit
authentication should satisfy. First, methods applied for implicit authentication
should be able to extract representative features that reflect user uniqueness from
noisy data (Hazan and Shabtai 2015). In particular, a user’s interaction with the
smartphone causes high intra-user variability, which should be handled effectively
by methods of implicit authentication.
Second, these methods should process sensor data and profile a user in real time
on the mobile device, without sending data off the device. This implies that implicit
authentication should be done without consuming much power (i.e. should have low
battery consumption). Fortunately, the SoCs used in modern smartphones include
special low-power blocks aimed at real-time management of the sensors without
waking the main processor (Samsung Electronics 2018).
Third, implicit authentication requires protecting both the collected data and their
processing. This protection can be provided by a special secure (trusted) execution
environment, which also imposes additional restrictions on the available computational
resources: a restricted number of available processor cores, reduced frequencies of
the processor core(s), unavailability of extra computational hardware accelerators
(e.g. GPU) and a limited amount of memory (ARM Security Technology 2009).
boosted decision trees and, more importantly, to some extent decreases the level of the
smartphone's security.
A solution to the latter issue is to use multi-modal methods for passive authentication,
which have obvious advantages over methods based on a single modality. Deb et al.
(2019) described an example of a multi-modal method for passive authentication. In
that paper, the authors proposed using a Siamese long short-term memory (LSTM)
architecture (Varior et al. 2016) to extract deep temporal features from the data of a
number of passive smartphone sensors for user authentication (Fig. 10.6). Their
method, based on keystroke dynamics, GPS location, accelerometer, gyroscope,
magnetometer, linear accelerometer, gravity and rotation modalities, can unobtrusively
verify a genuine user with a 96.47% True Accept Rate (TAR) at a 0.1% False Accept
Rate (FAR) within 3 seconds.
Fig. 10.6 Architecture of the model proposed by Deb et al. (2019). (Reproduced with permission
from Deb et al. 2019)
The main tasks of the system include data acquisition, which tracks usual user
interactions with a smartphone; subsequent storage and transmission of the collected
data; customization of the data collection procedure (selection of sources/sensors for
data acquisition) for individual users or groups of users; and controlling and
monitoring the data collection process.
The dataset collection system contains two components: a mobile application for
Android OS (hereafter called the client application) and a server for dataset storage
and controlling dataset collection (Fig. 10.7). Let us consider each component in
detail.
The client application is designed to collect sensor data and user activities in the
background and to send the collected data to the server via an encrypted
communication channel. The client application can operate on smartphones running
Android 4.0 or higher. Because users continue to use their smartphones in a typical
manner during dataset collection, the client application is optimized in terms of
battery usage.
Immediately after installation of the application on a smartphone, it requests the
user to complete a ground-truth profile. This information is needed for further
verification of the developed methods. The client collects the following categories
of information: ‘call + sensors’ data, application-related data and Web data.
‘Call + sensors’ data comprises SMS and call logs, battery and light sensor status,
location information (provided by GPS and cell towers), Wi-Fi connections, etc. All
sensitive information (contacts used for calls and SMS, SMS messages, etc.) is
transformed in a non-invertible manner (hashed) to ensure the user’s privacy. At
the same time, the hashed information can be used to characterize user behaviour.
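A sketch of such a non-invertible transform (the salt value, contact format and function name are hypothetical):

```python
import hashlib

def anonymize(value, salt):
    """Salted SHA-256 digest replacing a raw contact identifier. The same
    contact always maps to the same digest, so behaviour (e.g. call
    frequency per contact) is preserved, but the original value cannot
    be recovered from the hash."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

salt = "per-device-secret"             # hypothetical per-device salt
h1 = anonymize("+1-555-0100", salt)
h2 = anonymize("+1-555-0100", salt)    # same contact, same digest
h3 = anonymize("+1-555-0199", salt)    # different contact, different digest
print(h1 == h2, h1 == h3)
```

A per-device salt also prevents digests from being compared across participants, strengthening the privacy guarantee.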
Physical sensors provide an additional source of data about the user context.
Information about the type and current state of the battery can reveal the battery
charging pattern, and the light sensor provides ambient light detection.
Fig. 10.8 Visualization of history and current distribution of participants by gender and marital
status
Another component of the data collection system is the server, which stores the
collected data and controls the collection process (Fig. 10.7). Let us briefly
describe the main functions of the server:
1. Storing collected data in a common dataset.
2. Monitoring activity of the client application. This includes tracking client
activation events; collecting information about the versions of the client application
currently used for data collection and about the number and frequency of data
transactions between the clients and the server; and gathering other technical
information about the client application's activities.
3. Providing statistical information. It is important to have up-to-date information
about the current status of the dataset collection. The server should be able to
provide, in a convenient way, information about the amount of data already
collected, the number of data collection participants, their distribution by
age/gender and marital status (Fig. 10.8), etc.
4. Control of the dataset collection procedure. The server should be able to set up a
unified configuration for all clients. Such a configuration determines the set of
sensors that should be used for data collection and the duration of data collection.
Participants can be asked different questions to establish additional data labelling
or ground-truth information immediately before the data collection. Another
aspect of data collection control is notifying participants that the client application
should be updated or requires special attention during data collection. This
function is implemented as push notifications received from the server.
5. System health monitoring and audit. The aim of this function is to detect
software issues promptly and provide information that can help to fix them. For
this endeavour, the server should be able to identify potentially problematic user
environments (by automatically collecting and sending to the server client
10 Mobile User Profiling 275
Chapter 11
Automatic View Planning in Magnetic
Resonance Imaging
11.1 Introduction
Magnetic resonance imaging (MRI) is one of the most widely used noninvasive
methods in medical diagnostic imaging. To obtain high-quality MRI slices, their
position and orientation should be chosen in accordance with anatomical
landmarks, i.e. the respective imaging planes (views or slices) should be planned
in advance. In MRI, this procedure is called view planning. The locations of planned
slices and their orientations depend on the human body parts under investigation. For
example, typical cardiac view planning consists of obtaining two-chamber, three-
chamber, four-chamber and short-axis views (see Fig. 11.1).
Typically, view planning is performed manually by a doctor. Such manual
operation has several drawbacks:
• It is time-consuming. Depending on the anatomy and study protocol, it can take
up to 10 minutes, or even more in special cases. The patient must stay in the
scanner during this procedure.
A. B. Danilevich
Samsung R&D Institute Rus (SRR), Moscow, Russia
e-mail: a.danilevich@samsung.com
M. N. Rychagov
National Research University of Electronic Technology (MIET), Moscow, Russia
e-mail: michael.rychagov@gmail.com
M. Y. Sirotenko (*)
111 8th Ave, New York, NY 10011, USA
e-mail: mihail.sirotenko@gmail.com
• It is operator-dependent. Images of the same anatomy produced with the same
protocol by different doctors may differ significantly. This degrades the quality of
diagnosis and the analysis of disease dynamics.
• It requires qualified medical personnel to carry out the whole workflow (including
the view planning) instead of only analysing the images for diagnostics.
To overcome all these disadvantages, an automatic view planning (AVP) system
may be used, which estimates the positions and orientations of the desired view
planes by analysing a scout image, i.e. a preliminary image obtained prior to
performing the major portion of a particular study.
The desired properties of such an AVP system are:
• High acquisition speed of the scout image
• High computational speed
• High accuracy of the view planning
• Robustness to noise and anatomical abnormalities
• Support (suitability) for various anatomies
The goal of our work is an AVP system (and the respective approach) developed in
accordance with the requirements listed above.
We describe a fully automatic view planning framework which is designed to be
able to process four kinds of human anatomies, namely, brain, heart, spine and knee.
Our approach is based on anatomical landmark detection and includes several
anatomy-specific pre-processing and post-processing algorithms. The key features
of our framework are (a) using deep learning methods for robust detection of the
landmarks in rapidly acquired low-resolution scout images, (b) unsupervised
learning to overcome the problem of a small training dataset, (c) redundancy-based
midsagittal plane detection for brain AVP, (d) spine disc position alignment via
3D-clustering and using a statistical model (of the vertebrae periodicity) and
(e) position refinement of detected landmarks based on a statistical model.
The AVP framework workflow consists of the following steps (see Fig. 11.2):
1. 3D scout MRI volume acquisition. It is a low-resolution volume acquired at high
speed.
2. Pre-processing of the scout image, which includes common operations, such as
bounding box estimation and statistical atlas anchoring, as well as anatomy-specific
operations, such as midsagittal plane estimation for the brain.
3. Landmark detection.
4. Post-processing, which consists of common operations for landmark position
refinement and filtering as well as anatomy-specific operations like vertebral disc
position alignment for spine AVP.
5. Estimation of the positions and orientations of the view planes.
A commonly used operation for the pre-processing stage is a bounding box
reconstruction for a body part under investigation. Such a bounding box is helpful
for various working zone estimations, local coordinate origin planning, etc.
The bounding box is a three-dimensional rectangular parallelepiped that bounds
only the essential part of the volume. For example, for brain MRI, the bounding box
simply bounds the head, ignoring the empty space around it. This first rough
estimation already brings information about the body part position, which reduces
the ambiguity of positions of anatomical points within a volume. Such ambiguity
appears due to the variety of body part positions relative to the scanner. This
reduction of ambiguity yields a reduction of the search zone for finding anatomical
landmarks. The bounding box is estimated via integral projections of the whole
volume onto coordinate axes. The bounding box is formed by utmost points of
intersections of the projections with the predefined thresholds. The integral
projection is a one-dimensional function whose value at each point is calculated as
the sum of all voxels with the respective coordinate fixed. Using non-zero thresholds
allows cutting off noise in the side areas of the volume.
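The projection-and-threshold procedure above can be sketched as follows (a minimal NumPy illustration; the function name and threshold handling are illustrative, not taken from the chapter):

```python
import numpy as np

def bounding_box(volume, threshold=0.0):
    """Estimate the 3D bounding box of the essential part of a volume.

    For each axis the volume is summed over the other two axes, giving a
    one-dimensional integral projection; the box spans the outermost
    positions where the projection exceeds the (noise-cutting) threshold.
    """
    box = []
    for axis in range(3):
        other = tuple(a for a in range(3) if a != axis)
        proj = volume.sum(axis=other)           # 1D integral projection
        idx = np.flatnonzero(proj > threshold)  # positions above threshold
        box.append((int(idx[0]), int(idx[-1])))
    return box  # [(z0, z1), (y0, y1), (x0, x1)]

# Toy volume: an 8x8x8 empty volume with a bright block inside.
vol = np.zeros((8, 8, 8))
vol[2:5, 1:5, 3:5] = 1.0
print(bounding_box(vol))  # [(2, 4), (1, 4), (3, 4)]
```

A non-zero threshold plays the role described above: projection bins produced only by background noise fall below it and are excluded from the box.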
In the next step, the search zone is reduced even further by applying the
statistical atlas. The statistical atlas contains information about the statistical
distribution of anatomical landmarks’ positions inside a certain part of the human
body. It is constructed on the basis of annotated volumetric medical images. The
positions of landmarks in a certain volume are transformed into a local coordinate
system related to the bounding box rather than to the whole volume; such a
transformation prevents wide dispersion of the annotations. On the basis of the
landmark positions calculated for several volumes, the landmarks’ spatial
distribution is estimated. In the simplest case, such a distribution can be represented
by the convex hull of all points (for a certain landmark type) in local coordinates.
When the statistical atlas is anchored to the volume, the search zone is defined
(Fig. 11.3).
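The local-coordinate transform and the anchoring of the atlas-derived search zone can be sketched as follows (a hedged illustration: an axis-aligned extent plus a margin stands in for the convex hull described above, and all names are hypothetical):

```python
import numpy as np

def to_local(points, box):
    """Map absolute voxel coordinates into the [0, 1]^3 local frame of a
    bounding box given as [(lo, hi), ...] per axis."""
    lo = np.array([b[0] for b in box], dtype=float)
    hi = np.array([b[1] for b in box], dtype=float)
    return (np.asarray(points, dtype=float) - lo) / (hi - lo)

def search_zone(local_positions, box, margin=0.05):
    """Anchor a landmark's search zone to a new volume's bounding box.

    local_positions are one landmark's bounding-box-local coordinates
    collected from annotated volumes; an axis-aligned extent (plus a
    safety margin) approximates the convex hull of those points."""
    p = np.asarray(local_positions, dtype=float)
    lo = np.clip(p.min(axis=0) - margin, 0.0, 1.0)
    hi = np.clip(p.max(axis=0) + margin, 0.0, 1.0)
    box_lo = np.array([b[0] for b in box], dtype=float)
    box_hi = np.array([b[1] for b in box], dtype=float)
    return box_lo + lo * (box_hi - box_lo), box_lo + hi * (box_hi - box_lo)

box = [(0, 10), (0, 10), (0, 10)]
print(to_local([[5, 5, 5]], box))  # [[0.5 0.5 0.5]]
```

Expressing the annotations in box-local coordinates is what keeps the zone tight: the same landmark seen in differently positioned bodies still maps to a compact region.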
From the landmark processing point of view, the post-processing stage represents
filtering out and clustering of detected points (the landmark candidates).
From the applied MRI task point of view, the post-processing stage contains
procedures which perform the computation of the desired reference lines and planes.
For knee AVP and brain AVP, the post-processing stage implies auxiliary reference
line and plane evaluation on the basis of previously detected landmarks.
During post-processing, at the first stage, all detected point candidates are filtered
by thresholds on the landmark criterion. All candidates of a certain landmark type
whose quality criterion value is less than the threshold are eliminated. Such
thresholds are estimated in advance via a set of annotated MRI volumes. They are
chosen so as to minimize a loss function composed of false-positive and
false-negative errors. Optimal thresholds provide a balanced set of false positives
and false negatives for the whole set of available volumes; this balance can be
adjusted by a trade-off parameter. In the loss function mentioned above, the
number of false negatives is calculated as the sum of all missed ground truths in all
annotated volumes.
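The threshold selection described above can be sketched as a simple sweep (a hedged illustration: the chapter does not give the exact loss, so a weighted sum of per-candidate false positives and false negatives is assumed, with `alpha` as the trade-off parameter):

```python
import numpy as np

def pick_threshold(scores, is_true, alpha=0.5):
    """Sweep candidate thresholds and keep the one minimizing a weighted
    sum of false positives and false negatives.

    scores  : landmark-quality values of all detected candidates
    is_true : True where a candidate matches a ground-truth landmark
    alpha   : trade-off between the two kinds of errors
    """
    scores = np.asarray(scores, dtype=float)
    is_true = np.asarray(is_true, dtype=bool)
    best_t, best_loss = None, np.inf
    for t in np.unique(scores):
        accepted = scores >= t
        fp = np.sum(accepted & ~is_true)   # false detections kept
        fn = np.sum(~accepted & is_true)   # true candidates rejected
        loss = alpha * fp + (1.0 - alpha) * fn
        if loss < best_loss:
            best_t, best_loss = float(t), float(loss)
    return best_t

scores = [0.9, 0.8, 0.7, 0.3, 0.2]
truth  = [True, True, False, False, False]
print(pick_threshold(scores, truth))  # 0.8: keeps both true candidates
```

Raising `alpha` penalizes false positives more heavily and pushes the chosen threshold up; lowering it favours recall, which mirrors the trade-off parameter mentioned in the text.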
Fig. 11.4 Novel idea for automatic selection of working slices to be used for longitudinal fissure
detection
Fig. 11.5 Longitudinal fissure detection: (a) example of continuous and discontinuous fissure
lines; (b) fissure detector positions (shift and rotation) which should be tested
Fig. 11.6 An example of adequately detected and wrongly detected fissures. The idea for filtering
out outliers: (a) an example of the fissure detection in a chosen set of axial slices; (b) groups of
detected directions (shown as vectors) of the fissure in axial and coronal planes, respectively
The midsagittal plane (MSP) may also be constructed directly, as a least-squares
optimization task, on the basis of the points formed by the intersections of the
reference lines with the head contour in slices. The result is practically the same as
that obtained from the averaged vectors and central points.
It should be pointed out that the redundancy in the obtained reference lines plays
an important role in MSP estimation. As each reference line may be detected with
some inaccuracy, the data redundancy reduces the impact of errors on the MSP
position. This data redundancy distinguishes our approach from others (Wang and Li
2008) and makes the procedure more stable. In other approaches, as a rule, only two
slices are used for MSP creation: one axial and one coronal.
Some authors describe an entirely different approach which does not use the MSP
at all. For example, van der Kouwe et al. (2005) create slices and map them to a
statistical atlas by solving an optimization task with a rigid-body 3D transformation;
this spatial transformation relative to the atlas is then used for MRI plane correction.
The estimated MSP is used as one of the planned MRI views. Furthermore, the MSP
helps to reduce the search space for the landmark detector, since the landmarks
(corpus callosum anterior (CCA) and corpus callosum posterior (CCP)) are located
exactly in this plane.
area defined by the statistical atlas anchored to the volume. This search area is
obtained as a union of sets of points from subvolumes that correspond to statistical
distributions of landmarks’ positions. Secondly, inside the search area, a grid of
search points is defined with some prescribed step (i.e. distance between
neighbouring points).
For the classification of a point, its surrounding context is used. The surrounding
context is a portion of voxel data extracted from the neighbourhood of the search point.
In our approach, we pick up a cubic subvolume surrounding the respective search
point and extract three orthogonal slices of this subvolume passing through the
search point (Fig. 11.8).
Thus, the landmark detector scans a selected volume with a 3D sliding window
and performs classification of every point by its surrounding context (Fig. 11.9).
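The context extraction described above can be sketched as follows (a minimal NumPy illustration; the window size and the lack of border handling are assumptions, and the function name is hypothetical):

```python
import numpy as np

def orthogonal_context(volume, point, half=16):
    """Extract the surrounding context of a search point: three orthogonal
    2D slices of the cubic subvolume centred on the point.

    Returns an array of shape (3, 2*half, 2*half): the axial, coronal and
    sagittal planes through the point (boundary points are not handled).
    """
    z, y, x = point
    cube = volume[z - half:z + half, y - half:y + half, x - half:x + half]
    return np.stack([cube[half, :, :],   # fixed z: axial slice
                     cube[:, half, :],   # fixed y: coronal slice
                     cube[:, :, half]])  # fixed x: sagittal slice

vol = np.random.rand(64, 64, 64)
ctx = orthogonal_context(vol, (32, 32, 32), half=16)
print(ctx.shape)  # (3, 32, 32)
```

With `half=16` the context matches the three 32×32 input slices fed to the network described later in the chapter.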
The main part of the landmark detector is a discriminative system used for
classification of the surrounding context extracted around the current search point. In
our approach, we utilize neural networks (Rychagov 2003). In recent years,
convolutional neural networks (LeCun and Bengio 1995; Sirotenko 2006;
LeCun et al. 2010) have been applied to various recognition tasks and have shown
very promising results (Sermanet et al. 2012; Krizhevsky et al. 2012). The network
has several feature-extracting layers, pooling layers, rectification layers and fully
connected layers (Jarrett et al. 2009). The layers of the network contain trainable
weights which prescribe the behaviour of the discriminative system; these weights
are tuned by learning (training). Convolutional layers produce feature maps
obtained by convolving the input maps and applying a nonlinear function to the
result; this nonlinearity also depends on trainable parameters. Pooling layers
alternate with convolutional layers and down-sample the feature maps; we use max
pooling in our approach. Pooling layers provide invariance to small translations and
rotations of features. As rectification layers, we use abs rectification and local
contrast normalization layers (Jarrett et al. 2009). Finally, a fully connected layer is
placed on top of the network to produce the final result. The output of the
convolutional neural network is a vector whose number of elements equals the
number of landmarks plus one. For example, for the multiclass landmark detector
designed to detect the CCA and CCP landmarks, the neural network has three
outputs: two landmarks and background. These output values correspond to
pseudo-probabilities that the respective landmark is located at the current search
point (or that no landmark is located there, in the case of background).
We train our network in two stages. In the first stage, we perform an unsupervised
pre-training using predictive sparse decomposition (PSD) (Kavukcuoglu et al.
2010). Then, we perform a supervised training (refining the pre-trained weights
and learning other weights) using stochastic gradient descent with energy-based
learning (LeCun et al. 2006).
A specially prepared dataset is used for the training. Many medical images are
required to compose this training dataset; we have collected a number of brain,
cardiac, knee and spine scout MRI images for this purpose. These MRI volumes
were manually annotated: using a special programme, we marked the positions of
the landmarks of interest in each volume. These points are used to construct true
samples corresponding to the landmarks. A sample is a combination of a class label
(target output vector) and a portion of the respective voxel data taken, with a
predefined size, from the surrounding context of a certain point. The way such
surrounding context is extracted around the point of interest is explained in
Sect. 11.2.2. The samples are randomly picked from annotated volumes. The class
label is a vector of target probabilities that the investigated landmark
(or background) is located at the respective point. These target values are calculated
using the energy-based approach. For every landmark class, the target value is
calculated on the basis of the distance from the current sample to the closest
ground-truth point of this class. For example, if a sample is picked right at the
ground-truth point of Landmark_1, then the target value for this landmark is “1”. If
the sample is picked far from any ground truth of Landmark_1 (with the distance
exceeding the threshold), then the target value for this landmark is “0”. As the
sample approaches the ground truth, the target value increases monotonically. We
added some extra samples with spatial distortions to train the system to be robust
and invariant to noise.
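The distance-based target values can be sketched as follows (the chapter does not specify the exact fall-off profile, so a linear decay to zero at the threshold radius is assumed; the function name and radius are illustrative):

```python
import numpy as np

def target_value(sample_pos, gt_points, radius=8.0):
    """Soft target for one landmark class: 1 at a ground-truth point,
    decaying monotonically to 0 at distances >= radius.  A linear
    fall-off is assumed here for illustration."""
    gt = np.asarray(gt_points, dtype=float)
    d = np.min(np.linalg.norm(gt - np.asarray(sample_pos, dtype=float), axis=1))
    return float(max(0.0, 1.0 - d / radius))

gts = [(10.0, 10.0, 10.0)]
print(target_value((10, 10, 10), gts))  # 1.0 (right at the ground truth)
print(target_value((10, 10, 30), gts))  # 0.0 (farther than the radius)
```

Any monotone profile (e.g. a Gaussian of the distance) would satisfy the description in the text; only the endpoints (1 at the landmark, 0 beyond the threshold) are fixed by it.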
At the first stage of learning, an unsupervised pre-training procedure initializes
the weights W of all convolutional (feature extraction) layers of the neural network.
Learning in unsupervised mode uses unlabelled data. This learning is performed via
sparse coding and predictive sparse decomposition techniques (Kavukcuoglu et al.
2010). The learning process is done separately for each layer by performing an
optimization procedure:
$$ W^{*} = \arg\min_{W} \sum_{y \in Y} \lVert z - F_W(y) \rVert^{2}, $$
where Y is a training set, y is an input of the layer from the training set, z is a sparse
code of y, and F_W is a predictor (a function depending on W that transforms the
input y into the output of the layer).
This optimization is performed by stochastic gradient descent. Each training
sample is encoded into a sparse code using the dictionary D. The predictor F_W
produces features of the sample which should be close to the sparse codes. The
reconstruction error, calculated on the basis of the features and the sparse code, is
used to calculate the gradient in order to update the weights W. To compute a sparse
code for a certain input, the following optimization problem is solved:
$$ z^{*} = \arg\min_{z} \lVert z \rVert_{0} \quad \text{subject to} \quad y = Dz, $$
where D is the dictionary, y is the input signal, z is the encoded signal (code), and
z* is the optimal sparse code.
In the above equation, the input y is represented as a linear combination of only a
few elements of some dictionary D; this means that the produced code z (the vector
of decomposition coefficients) is sparse. The dictionary D is obtained from the
training set in unsupervised mode (without using annotated labels). An advantage
of the unsupervised approach to finding the optimal dictionary D is that the
dictionary is learned directly from the data, so the found dictionary optimally
represents the hidden structure and specific nature of the data used. An additional
advantage is that the approach does not need a large amount of annotated input data
for the dictionary training. Finding D is equivalent to solving the optimization
problem:
$$ D^{*} = \arg\min_{D} \sum_{y \in Y} \lVert Dz - y \rVert^{2}, $$
where Y is a training set, y is an input of the layer from the training set, z is a sparse
code of y, and D is the dictionary.
This optimization is performed via stochastic gradient descent. The sparse code is
decoded to produce the decoded data. The reconstruction error (discrepancy) is
calculated on the basis of the training sample and the decoded data; the discrepancy
is used to calculate gradients for updating the dictionary D. The adjustment of the
dictionary D alternates with finding the optimal code z for the input y with D fixed.
For all layers except the first, the training set Y is formed as the set of the previous
layer’s outputs.
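The alternating sparse-coding and dictionary-update loop can be sketched as follows (a hedged illustration: an L1 relaxation solved by ISTA replaces the L0 problem above, and all hyperparameters are illustrative, not from the chapter):

```python
import numpy as np

def ista(y, D, lam=0.1, steps=100):
    """Sparse code z ~ argmin ||y - Dz||^2 + lam*||z||_1 via ISTA
    (an L1 relaxation of the L0 problem in the text)."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(steps):
        z = z - D.T @ (D @ z - y) / L    # gradient step on the data term
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return z

def learn_dictionary(Y, n_atoms, lam=0.1, lr=0.05, epochs=5, seed=0):
    """Alternate sparse coding with stochastic gradient updates of D;
    columns are renormalized to unit length after every update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(epochs):
        for y in Y:
            z = ista(y, D, lam)
            D -= lr * np.outer(D @ z - y, z)   # (half-)gradient of ||Dz - y||^2
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)
    return D

Y = np.random.default_rng(1).standard_normal((20, 10))
D = learn_dictionary(Y, n_atoms=5)
print(D.shape)  # (10, 5)
```

This mirrors the alternation described in the text: codes are found with D fixed, then D is updated from the reconstruction discrepancy, here with explicit column renormalization to keep the dictionary well scaled.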
Unsupervised pre-training is useful when only a small amount of labelled data is
available. We demonstrate the superiority of using PSD with few labelled MRI
volumes. For this experiment, unsupervised pre-training of our network was
performed in advance. Then, several annotated volumes were picked for supervised
fine-tuning. After training, the misclassification rate (MCR) on the test dataset
(samples from MRI volumes which were not used during training) was calculated.
The plot in Fig. 11.10 shows the classification performance (the lower the MCR, the
better) as a function of the number of annotated volumes taking part in supervised
learning.
After the unsupervised training is completed, the entire convolutional neural
network is adjusted to produce multi-level sparse codes which are a good hierarchi-
cal feature representation of input data.
In the next step, supervised training is performed to tune the whole neural
network to produce features (an output vector) corresponding to the probabilities of
the appearance of a certain landmark or background at a certain point. This is done
by performing the following optimization:
$$ W^{*} = \arg\min_{W} \sum_{y \in Y} \lVert x - G_W(y) \rVert^{2}, $$
Fig. 11.10 Misclassification rate plot: the MCR is calculated on the test dataset using a CNN
trained with various numbers of annotated MRI volumes. Red line, pure supervised mode; blue
line, supervised mode with PSD initialization
the first one with their weights updating. At the beginning of the procedure, some
weights of feature extraction layers are initialized with the values computed at the
pre-training stage. The final feature extractor is able to produce a feature vector that
could be directly used for discriminative classification of the input or for assigning to
every class a probability of the input belonging to a respective class.
The trained classifier shows good performance on datasets composed of samples
from different types of MRI volumes (such as knee, brain, cardiac, spine). For future
repeatability and comparison, we also trained our classifier on the OASIS dataset of
brain MRI volumes which is available online (Marcus et al. 2007). MCR results
calculated on test datasets are shown in Table 11.2.
For validation of the convolutional neural network approach, we compared it with
the widely used support vector machine (SVM) classifier (Steinwart and Christmann
2008) applied to the same samples. We used training and testing datasets composed
of samples from cardiac MRI volumes. Table 11.3 demonstrates the superiority of
convolutional neural networks.
The spine AVP post-processing stage is used for the filtering of detected landmarks,
their clustering and spinal curve approximation on the basis of these clustered
landmark nodes.
The clustering of the detected points is performed to eliminate outliers among
them (if any are present). This operation finds several clusters – dense groups of
candidates – and all points outside these groups are filtered out. Point quality weight
factors may optionally be taken into account in this operation. After such processing,
a set of cluster centres is obtained, which may be regarded as nodes for the
subsequent creation of the spinal approximation curve. They represent more reliable
data than the original detected points. The clustering operation is illustrated in
Fig. 11.11.
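The outlier-removing clustering can be sketched as follows (a simple greedy stand-in for the clustering step; the radius, minimum size and the unweighted averaging are assumptions, and point quality weights could scale each point's contribution):

```python
import numpy as np

def cluster_centres(points, radius=10.0, min_size=3):
    """Group detected 3D candidates into dense clusters and drop outliers.

    A point joins a cluster if it lies within `radius` of the cluster's
    running centre; clusters smaller than `min_size` are discarded."""
    clusters = []  # each entry: [sum_of_points, count]
    for p in np.asarray(points, dtype=float):
        for c in clusters:
            if np.linalg.norm(c[0] / c[1] - p) <= radius:
                c[0] += p
                c[1] += 1
                break
        else:
            clusters.append([p.copy(), 1])
    return [c[0] / c[1] for c in clusters if c[1] >= min_size]

pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0),   # dense group near the origin
       (50, 50, 50)]                       # lone outlier
print(cluster_centres(pts, radius=5.0, min_size=3))  # one centre survives
```

The surviving cluster centres play the role of the nodes used below for fitting the spinal approximation curve.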
On the basis of the clustered nodes, a refined sagittal plane is created. This is one
of the AVP results.
Another necessary result is a set of local coronal planes adapted to each
discovered vertebra (or intervertebral disc). To create such planes, the respective
intervertebral disc locations should be found, as well as the planes’ normal vector
directions.
Firstly, a spinal curve approximation is created on the basis of the previously
estimated nodes – the clustered landmarks. The approximation is represented by two
independent functions of the height coordinate: x(z) and y(z) in the coronal and
sagittal sections, respectively. The coronal function x(z) is represented as a sloped
straight line. The sagittal function y(z) differs for the C, T and L zones of the spine
(upper, middle and lower spine, respectively). The sagittal approximation for these
zones is represented as a series of components such as a straight line, a parabolic
function and a trigonometric one (for the C and L zones) with adjusted amplitude,
period and starting phase. The approximation is fitted to the obtained nodes via the
least-squares approach. As a rule, such an approximation is quite satisfactory; see
the illustration in Fig. 11.12.
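The least-squares fit can be sketched as follows (a simplified basis: the sinusoid's period is fixed rather than adjusted, and the C/T/L zones are not treated separately, so treat this as an illustration of the fitting step only):

```python
import numpy as np

def fit_spinal_curve(nodes, period=120.0):
    """Least-squares fit of the spinal curve through clustered nodes.

    nodes : (N, 3) array of (x, y, z) cluster centres.
    x(z) is fitted as a sloped straight line; y(z) as a straight line plus
    a parabola plus one sinusoidal component with a fixed period."""
    nodes = np.asarray(nodes, dtype=float)
    x, y, z = nodes[:, 0], nodes[:, 1], nodes[:, 2]
    cx = np.polyfit(z, x, 1)                    # coronal: straight line
    w = 2.0 * np.pi / period
    A = np.column_stack([np.ones_like(z), z, z ** 2,
                         np.sin(w * z), np.cos(w * z)])
    cy, *_ = np.linalg.lstsq(A, y, rcond=None)  # sagittal components
    return cx, cy, w

def eval_curve(cx, cy, w, z):
    """Evaluate the fitted curve at heights z."""
    z = np.asarray(z, dtype=float)
    x = np.polyval(cx, z)
    y = (cy[0] + cy[1] * z + cy[2] * z ** 2
         + cy[3] * np.sin(w * z) + cy[4] * np.cos(w * z))
    return x, y

# Synthetic nodes lying on x = 0.1 z + 2 and y = 3 + 0.05 z:
z = np.linspace(0.0, 200.0, 20)
nodes = np.stack([0.1 * z + 2.0, 3.0 + 0.05 * z, z], axis=1)
cx, cy, w = fit_spinal_curve(nodes)
```

Using sin and cos terms with free coefficients is equivalent to fitting a sinusoid's amplitude and starting phase; only the period is held fixed in this sketch.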
Fig. 11.12 An example of spinal curve approximation (via nodes obtained as the clustered
landmark centres). Average brightness measured along the curve is shown in the illustration, too
Then, the intervertebral disc locations should be determined. On the basis of the
spinal curve approximation, a curved “secant tube” is created, which passes
approximately through the vertebra centres. The averaged brightness of the voxels
in the tube’s axial sections is collected along the tube. Thus, a voxel brightness
function is obtained for the spinal curve: B(z), or B(L), where L is the running length
along the spinal curve.
In a T2 MRI protocol, vertebrae appear as bright structures, whereas
intervertebral discs appear as dark ones. Hence, the brightness function (as well as
its gradients) is used to detect candidate intervertebral disc positions (see
Fig. 11.12). Nevertheless, the disc locations may be poorly expressed, and, on the
other hand, many false-positive locations may be detected. To avoid this problem,
additional filtration, which we call “periodic filtration”, is applied to these candidate
disc positions. A statistical model of intervertebral distances is created, and the
averaged positions of the discs along the spinal curve are represented as a pulse
function. A
convolution of this pulse function with the spinal curve brightness function (or with
the brightness gradient function) permits us to detect the disc locations more
precisely. The parameters of this pulse function – its shift and scale – are adjusted
during the convolution optimization process. The processing of the spinal brightness
function is illustrated in Fig. 11.13.
Fig. 11.13 The vertebra statistical model (the periodic “pulse function”) as additional knowledge
for estimating the supposed vertebra locations
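The periodic matching can be sketched as follows (a hedged illustration: an exhaustive search over shift and period stands in for the convolution optimization, and the profile, pulse count and period range are synthetic):

```python
import numpy as np

def match_pulse_train(profile, period_range, n_pulses=5):
    """Find the shift and period of a pulse train that best matches the
    dark dips (intervertebral discs) of a brightness profile sampled
    along the spinal curve.  The score is the summed darkness (negated
    brightness) at the pulse positions."""
    p = np.asarray(profile, dtype=float)
    darkness = p.max() - p
    best = (None, None, -np.inf)
    for period in period_range:
        for shift in range(int(period)):
            pos = shift + period * np.arange(n_pulses)
            pos = pos[pos < len(p)].astype(int)
            score = darkness[pos].sum()
            if score > best[2]:
                best = (shift, period, float(score))
    return best[:2]

# Synthetic profile: bright vertebrae with dark discs every 20 samples.
profile = np.ones(100)
profile[10::20] = 0.0          # discs at 10, 30, 50, 70, 90
print(match_pulse_train(profile, period_range=range(15, 26)))  # (10, 20)
```

In the chapter's terms, `shift` and `period` are the adjusted parameters of the pulse function, and the brightness-gradient function could be matched in the same way.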
Finally, the supposed locations of the intervertebral discs are determined. The local
coronal planes are computed at these points, and the planes’ directional vectors are
evaluated as the local direction vectors of the spinal approximation curve.
The result of the intervertebral disc secant plane estimation is shown in Fig. 11.14.
As a rule, 2D sagittal slices are used for vertebra (or intervertebral disc) detection
via various techniques (gradient finding, active shape models, etc.). Some authors
use segmentation (Law et al. 2012), and some do not (Alomari et al. 2010). In the
majority of works, a statistical context model is used to make vertebra finding
more robust (Alomari et al. 2011; Neubert et al. 2011). Sometimes, a marginal space
approach is used to speed up the process (Kelm et al. 2011).
A distinctive feature of our approach is that we use mostly 3D techniques. We do
not use any segmentation, neither 3D nor 2D: we obtain the spatial location of the
spinal column directly via the detected 3D points. Our spinal curve is
three-dimensional, as are the clustering and filtering methods.
To make the detection operation more robust, we use a spine statistical model: the
vertebra “pulse function”, which represents the statistically determined
intervertebral distance along the spinal curve. This model is used in combination
with voxel brightness analysis in the spatial tubular structure created near the spinal
curve.
In some complicated cases (poor scout image quality, artefacts, abnormal anatomy,
etc.), the output of the landmark detector could be ambiguous: several landmarks
may be detected with a high landmark (LM) quality measure for a given LM type. In
order to eliminate this ambiguity, a special algorithm is used, based on the statistics
of the landmarks’ relative positions.
The goal of the algorithm is to find the configuration of landmarks with high LM
quality (LMQ) values minimizing the energy:

$$ E(X, M_X, \Sigma_X) = \sum_{x_s \in X,\; x_t \in X} \psi_{st}(x_s, x_t, M_X, \Sigma_X), $$
11.3 Results
For algorithm testing and landmark detector training, a database of real MRI
volumes of various anatomies was collected. It includes 94 brain, 80 cardiac,
31 knee and 99 spine MRI volumes. Relying on the robustness of our landmark
detector, we acquired low-resolution 3D MRI volumes; the advantage of this
approach is a short acquisition time. All data were acquired with our specific
protocol and then annotated by experienced radiologists.
In our implementation, we used a convolutional neural network with an input size of 32 × 32 × 3 slices. Input volumes (at both the learning and processing stages) were resized to a spacing of 2 mm, so 32 pixels correspond to 64 mm. The architecture of the network is the following: the first convolutional layer has 16 kernels of size 8 × 8 with a shrinkage transfer function; it is followed by abs rectification, local contrast normalization and max-pooling with a down-sampling factor of 2; then comes the second convolutional layer with 32 kernels of size 8 × 8, connected to the 16 input feature maps through a partially filled connection matrix; the network is finalized by a fully connected layer with a hyperbolic tangent sigmoid transfer function.
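The layer arithmetic above can be checked with a minimal numpy forward pass. The soft-thresholding form of the shrinkage function, the simplified per-map contrast normalization and the fully filled connection matrix in the second layer are our assumptions for brevity; weights are random, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels):
    """Valid cross-correlation: x (C_in, H, W), kernels (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = kernels.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * kernels[o])
    return out

def shrinkage(x, theta=0.1):
    """Shrinkage transfer function (assumed soft-thresholding form)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lcn(x, eps=1e-6):
    """Crude per-map contrast normalization (simplified stand-in)."""
    m = x.mean(axis=(1, 2), keepdims=True)
    s = x.std(axis=(1, 2), keepdims=True)
    return (x - m) / (s + eps)

def max_pool(x, f):
    C, H, W = x.shape
    H2, W2 = H // f, W // f
    return x[:, :H2 * f, :W2 * f].reshape(C, H2, f, W2, f).max(axis=(2, 4))

x = rng.standard_normal((3, 32, 32))                  # 32 x 32 input, 3 slices
h = shrinkage(conv2d(x, 0.1 * rng.standard_normal((16, 3, 8, 8))))
h = max_pool(lcn(np.abs(h)), 2)                       # abs + LCN + pooling by 2
h = conv2d(h, 0.1 * rng.standard_normal((32, 16, 8, 8)))
y = np.tanh(0.01 * rng.standard_normal((1, h.size)) @ h.reshape(-1, 1))
print(h.shape, y.shape)  # (32, 5, 5) (1, 1)
```

The 32 × 32 input shrinks to 25 × 25 after the first 8 × 8 convolution, to 12 × 12 after pooling, and to 5 × 5 after the second convolution, giving 800 features for the final layer.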
Verification of the AVP framework was performed by comparing automatically built views with views built on the basis of the ground-truth landmark positions. The constructed view planes were compared with the ground-truth ones by the angle and the generalized distance between them. For spine MRI result verification, the number of missed and wrongly determined vertebral discs was counted; only discs of a specified type were considered in this procedure. Examples of the constructed views are shown in Figs. 11.15, 11.16, 11.17, and 11.18.
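A sketch of how constructed planes might be compared by angle and by a point-to-plane distance follows; the point-to-plane form is one plausible reading of "generalized distance", since the exact metric is not specified here, and all names are illustrative.

```python
import numpy as np

def plane_angle_deg(n_auto, n_gt):
    """Angle between two view planes, given their normal vectors."""
    n1, n2 = np.asarray(n_auto, float), np.asarray(n_gt, float)
    c = abs(float(n1 @ n2)) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

def plane_distance(p_auto, n_auto, p_gt):
    """Distance from a ground-truth anchor point to the built plane."""
    n = np.asarray(n_auto, float)
    n = n / np.linalg.norm(n)
    return float(abs((np.asarray(p_gt, float) - np.asarray(p_auto, float)) @ n))

print(round(plane_angle_deg([0, 0, 1], [0, 1, 1]), 6))   # 45.0
print(plane_distance([0, 0, 0], [0, 0, 1], [0, 0, 3]))   # 3.0
```

Taking the absolute value of the dot product makes the angle insensitive to the sign convention of the normals.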
Fig. 11.15 Brain AVP results: (a–d) automatically built midsagittal views. Red lines indicate intersections with the corresponding axial and coronal planes. The axial plane passes through the CCA and CCP correctly
Fig. 11.16 Knee AVP results: (a–c) ground-truth views, (d–f) automatically built views. Red lines indicate intersections with other slices
Fig. 11.17 Spine AVP results: positions and orientations of detected discs (green) and the spinal curve (red) for different spinal zones: (a) cervical, (b) thoracic and (c) lumbar
11.4 Conclusion
Our results demonstrate that we were able to achieve reliable, robust results in much less time than our best competitors. Based on these results, we believe that the proposed AVP framework will help clinicians achieve fast, accurate and repeatable MRI scans and can become a major differentiating aspect of a company's MRI product portfolio.
Fig. 11.18 Cardiac AVP results: (a–d) ground-truth views; (e–h) automatically built views; (a, e) 2-chamber view; (b, f) 3-chamber view; (c, g) 4-chamber view; (d, h) view from the short-axis stack
300 A. B. Danilevich et al.
Table 11.4 Time comparison with competitors: imaging time (IT), processing time (PT) and total time (TT)

MRI type  Competitor  IT (s)  PT (s)  TT (s)   Our IT (s)  Our PT (s)  Our TT (s)
Cardiac   S           20      12.5    32.5     19          5           24
Cardiac   P           –       103     100+
Brain     S           42      2       44       25          1           26
Brain     G           40      2       42
Knee      S           –       30      30+      23          2           25
Knee      P           40      6       46
Spine     S           30      5       45       27          2           29
Spine     P           120     8       128
Spine     G           25      7       32
References
Alomari, R.S., Corso, J., Chaudhary, V., Dhillon, G.: Computer-aided diagnosis of lumbar disc
pathology from clinical lower spine MRI. Int. J. Comput. Assist. Radiol. Surg. 5(3), 287–293
(2010)
Alomari, R.S., Corso, J., Chaudhary, V.: Labeling of lumbar discs using both pixel-and object-level
features with a two-level probabilistic model. IEEE Trans. Med. Imaging. 30(1), 1–10 (2011)
Bauer, S., Ritacco, L.E., Boesch, C., Reyes, M.: Automatic scan planning for magnetic resonance
imaging of the knee joint. Ann. Biomed. Eng. 40(9), 2033–2042 (2012)
Bystrov, D., Pekar, V., Young, S., Dries, S.P.M., Heese, H.S., van Muiswinkel, A.M.: Automated
planning of MRI scans of knee joints. Proc. SPIE Med. Imag. 6509 (2007)
Fenchel, M., Thesen, A., Schilling, A.: Automatic labeling of anatomical structures in MR
FastView images using a statistical atlas. In: Medical Image Computing and Computer-Assisted
Intervention–MICCAI, pp. 576–584. Springer, Berlin, Heidelberg (2008)
Iskurt, A., Becerikly, Y., Mahmutyazicioglu, K.: Automatic identification of landmarks for standard
slice positioning in brain MRI. J. Magn. Reson. Imaging. 34(3), 499–510 (2011)
Jarrett, K., Kavukcuoglu, K., Ranzato, M.A., LeCun, Y.: What is the best multi-stage architecture
for object recognition? In: Proceedings of 12th International Conference on Computer Vision,
vol. 1, pp. 2146–2153 (2009)
Kavukcuoglu, K., Ranzato, M.A., LeCun, Y.: Fast inference in sparse coding algorithms with
applications to object recognition. arXiv preprint arXiv: 1010.3467 (2010)
Kelm, B.M., Zhou, K., Suehling, M., Zheng, Y., Wels, M., Comaniciu, D.: Detection of 3D spinal
geometry using iterated marginal space learning. In: Medical Computer Vision. Recognition
Techniques and Applications in Medical Imaging, pp. 96–105. Springer, Berlin, Heidelberg
(2011)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural
networks. Adv. Neural Inform. Process. Syst. 25(2), 1–9 (2012)
Law, M.W.K., Tay, K.Y., Leung, A., Garvin, G.J., Li, S.: Intervertebral disc segmentation in MR
images using anisotropic oriented flux. Med. Image Anal. 17(1), 43–61 (2012)
Lecouvet, F.E., Claus, J., Schmitz, P., Denolin, V., Bos, C., Vande Berg, B.C.: Clinical evaluation
of automated scan prescription of knee MR images. J. Magn. Reson. Imaging. 29(1), 141–145
(2009)
LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. In: The
Handbook of Brain Theory and Neural Networks, vol. 3361(10) (1995)
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.A., Huang, F.J.: A tutorial on energy-based
learning. In: Bakir, G., Hofman, T., Schölkopf, B., Smola, A., Taskar, B. (eds.) Predicting
Structured Data. MIT Press, Cambridge, USA (2006)
LeCun, Y., Kavukcuoglu, K., Farabet, C.: Convolutional networks and applications in vision. In:
Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS),
pp. 253–256 (2010)
Li, P., Xu, Q., Chen, C., Novak, C.L.: Automated alignment of MRI brain scan by anatomic
landmarks. In: Proceedings of SPIE, Medical Imaging, vol. 7259, (2009)
Lu, X., Jolly, M.-P., Georgescu, B., Hayes, C., Speier, P., Schmidt, M., Bi, X., Kroeker, T.,
Comaniciu, D., Kellman, P., Mueller, E., Guehring, J.: Automatic view planning for cardiac
MRI acquisition. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI
2011, pp. 479–486. Springer, Berlin, Heidelberg (2011)
Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open Access
Series of Imaging Studies (OASIS): cross-sectional MRI data in young, middle aged,
nondemented, and demented older adults. J. Cogn. Neurosci. 19(9), 1498–1507 (2007)
Neubert, A., Fripp, J., Shen, K., Engstrom, C., Schwarz, R., Lauer, L., Salvado, O., Crozier, S.:
Automated segmentation of lumbar vertebral bodies and intervertebral discs from MRI using
statistical shape models. In: Proc. of International Society for Magnetic Resonance in Medicine,
vol. 19, p. 1122 (2011)
Pekar, V., Bystrov, D., Heese, H.S., Dries, S.P.M., Schmidt, S., Grewer, R., den Harder, C.J.,
Bergmans, R.C., Simonetti, A.W., van Muiswinkel, A.M.: Automated planning of scan geom-
etries in spine MRI scans. In: Medical Image Computing and Computer-Assisted Intervention–
MICCAI, pp. 601–608. Springer, Berlin, Heidelberg (2007)
Rychagov, M.: Neural networks: Multilayer perceptron and Hopfield networks. Exponenta Pro.
Appl. Math. 1, 29–37 (2003)
Sermanet, P., Chintala, S., LeCun, Y.: Convolutional neural networks applied to house
numbers digit classification. In: Proceedings of the 21st International Conference on Pattern
Recognition (ICPR), pp. 3288–3291 (2012)
Sirotenko, M.: Applications of convolutional neural networks in mobile robots motion trajectory
planning. In: Proceedings of Scientific Conference and Workshop. Mobile Robots and
Mechatronic Systems, pp. 174–181. MSU Publishing, Moscow (2006)
Steinwart, I., Christmann, A.: Support vector machines. Springer, New York (2008)
van der Kouwe, A.J.W., Benner, T., Fischl, B., Schmitt, F., Salat, D.H., Harder, M., Sorensen, A.G.,
Dale, A.M.: On-line automatic slice positioning for brain MR imaging. Neuroimage. 27(1),
222–230 (2005)
Wang, Y., Li, Z.: Consistent detection of mid-sagittal planes for follow-up MR brain studies. In:
Proceedings of SPIE, Medical Imaging, vol. 6914, (2008)
Young, S., Bystrov, D., Netsch, T., Bergmans, R., van Muiswinkel, A., Visser, F., Sprigorum, R.,
Gieseke, J.: Automated planning of MRI neuro scans. In: Proceedings of SPIE, Medical
Imaging, vol. 6144, (2006)
Zhan, Y., Dewan, M., Harder, M., Krishnan, A., Zhou, X.S.: Robust automatic knee MR slice
positioning through redundant and hierarchical anatomy detection. IEEE Trans. Med. Imaging.
30(12), 2087–2100 (2011)
Zheng, Y., Lu, X., Georgescu, B., Littmann, A., Mueller, E., Comaniciu, D.: Automatic left
ventricle detection in MRI images using marginal space learning and component-based voting.
In: Proceedings of SPIE, vol. 7259, (2009)
Chapter 12
Dictionary-Based Compressed Sensing MRI
12.1.1 Introduction
A. S. Migukin (*)
Huawei Russian Research Institute, Moscow, Russia
e-mail: artem.migukin@huawei.com
D. A. Korobchenko
Nvidia Corporation, Moscow Office, Moscow, Russia
e-mail: dkorobchenko@nvidia.com
K. A. Gavrilyuk
University of Amsterdam, Amsterdam, The Netherlands
e-mail: kgavrilyuk@uva.nl
Fig. 12.2 Aliasing artefacts: the magnitude of the inverse Fourier transform (a) for the fully sampled spectrum and (b) for the undersampled k-space spectrum shown in Fig. 12.1
2007; Ma et al. 2008; Goldstein and Osher 2009). However, reconstruction by leading CS MRI methods with nonadaptive, global sparsifying transforms (finite differences, wavelets, contourlets, etc.) is usually limited to relatively low undersampling rates and still suffers from many undesirable artefacts and loss of features (Ravishankar and Bresler 2010). The images are usually represented by a general predefined basis or frame, which may not provide a sufficiently sparse representation for them. For instance, traditional separable wavelets fail to sparsely represent the geometric regularity along singularities, and conventional total variation (TV) results in staircase artefacts when the acquired k-space data are limited (Goldstein and Osher 2009). Contourlets (Do and Vetterli 2006) can sparsely represent smooth details but not spots in images. All these transforms favour sparse representation only of global image characteristics.
Local sparsifying transforms allow highlighting a broad set of fine details, i.e. they carry local geometric information (Qu et al. 2012). In the so-called patch-based approach, an image is divided into small overlapping blocks (patches), and the vector corresponding to each patch is modelled as a sparse linear combination of candidate vectors, termed atoms, taken from a set called the dictionary. With fixed transforms, one would require a huge arsenal of them, whose perfect fit is extremely hard to find; alternatively, the patch size would have to be decreased constantly.
Consequently, researchers have shown great interest in finding adaptive sparse regularization. Images may be represented in a content-adaptive way by patches via a collaborative sparsifying transform (Dabov et al. 2007) or in terms of dictionary-based image restoration (Aharon et al. 2006). Adaptive transforms (dictionaries) can sparsify images better because they are constructed (learnt) specifically for a particular image instance or class of images. Recent studies have shown the promise of patch-based sparsifying transforms in a variety of applications, such as image denoising (Elad and Aharon 2006), deblurring (Danielyan et al. 2011) or specific tasks such as phase retrieval (Migukin et al. 2013).
In this work, we exploit adaptive patch-based dictionaries to obtain substantially improved reconstruction performance for CS MRI. To the best of our knowledge, and following Ravishankar and Bresler (2010), Caballero et al. (2012) and Song et al. (2014), such sparse regularization with trained patch-based dictionaries
306 A. S. Migukin et al.
y = Fx,
where F is the Fourier transform matrix. The vectors x and y are of the same length.
y_u = F_u x = m ∘ Fx.
The problem is to find the unknown object vector x from the vector of the
undersampled spectral measurements yu, i.e. to solve an underdetermined system
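The forward model and the aliasing caused by zero filling can be reproduced with a few lines of numpy; the image size and mask density below are arbitrary illustration choices, not the masks used later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))

# Normalized 2D Fourier transform, so that F^H F = I (norm="ortho")
y = np.fft.fft2(x, norm="ortho")

# Binary mask m keeps ~25% of k-space (undersampling rate about 4)
m = (rng.random((64, 64)) < 0.25).astype(float)
y_u = m * y                                 # y_u = m ∘ Fx

# Zero filling the missing samples yields an aliased reconstruction
x_zf = np.fft.ifft2(y_u, norm="ortho")
print(x_zf.shape)  # (64, 64)
```

Because three quarters of the spectrum is discarded, `x_zf` differs from `x` by exactly the aliasing artefacts illustrated in Fig. 12.2.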
12 Dictionary-Based Compressed Sensing MRI 307
This sparse coding problem can be solved by greedy algorithms (Elad 2010).
Following Donoho (2006), the CS reconstruction problem can also be simplified by replacing the ℓ0 norm with its convex relaxation, the ℓ1 norm. Since the real
measurements are always noisy, the CS problem is shown (Donoho et al. 2006) to
be efficiently solved using basis pursuit denoising. Thus, the typical formulation of
the CS MRI reconstruction problem has the following Lagrangian setup (Lustig et al.
2007):
where the subscript F denotes the Frobenius norm, the columns of the matrix X = P(x) are vectorized patches extracted by the operator P, the column vectors of the matrix Z are sparse codes, and ν is a positive parameter for the synthesis penalty term. Literally, it is assumed that each vectorized patch X_i can be approximated by the linear combination DZ_i, where each column vector Z_i contains no more than T non-zero components. Note that X is formed as a set of overlapping patches extracted from the object image x. Since the sparse approximation is performed for vectorized patches, no restriction on the patch shape is imposed; in our work, we deal with rectangular patches to harmonize their size with the specifics of k-space sampling.
x^{k+1} = arg min_x ||x − A(D Z^{k+1})||_2^2 + ν ||y_u − F_u x||_2^2.
Here A denotes the operator that assembles the vectorized image from a set of patches (the columns of the input matrix); in particular, the approximation of the image vector is assembled from the sparse approximations of the patches {DZ_i}. The positive parameter ν represents the confidence in the given (noisy) measurements, and the parameter τ controls the accuracy of the object synthesis from sparse codes. Algorithms of this kind are the subject of intensive research in various application areas (Bioucas-Dias and Figueiredo 2007; Afonso et al. 2010).
In the first step, the object estimate is assumed to be fixed, and the sparse codes {Z_i} are found using the given dictionary D in terms of basis pursuit denoising, so that the sparse object approximation is satisfied to a certain tolerance τ. In our work, this is realized with the orthogonal matching pursuit (OMP) algorithm (Elad 2010). In the second step, the sparse representation of the object is assumed to be fixed, and the object estimate is updated targeting data consistency. The last equation in the system above can be resolved from the normal equation:
(F_u^H F_u + (1/ν) I) x = F_u^H y_u + (1/ν) A(D Z^{k+1}).
The superscript {·}^H denotes the Hermitian transpose. Solving this equation directly is tedious due to the inversion of a typically huge matrix. It can be simplified by transforming from the image to the Fourier domain. Let the Fourier transform matrix F from the first equation of this chapter be normalized such that F^H F = I. Then:
(F F_u^H F_u F^H + (1/ν) I) F x = F F_u^H y_u + (1/ν) F A(D Z^{k+1}),

where F F_u^H F_u F^H is a diagonal matrix of ones and zeros; the ones are at those diagonal entries that correspond to a sampled location in k-space. Here ȳ_u = F F_u^H y_u and y^{k+1/2} = F A(D Z^{k+1}); the vector y^{k+1/2} represents the Fourier spectrum of the sparse-approximated object at the k-th iteration. It follows that the resulting Fourier spectrum estimate has the form (Ravishankar and Bresler 2010, cf. Eq. 9):

y^{k+1} = (1 − m) ∘ y^{k+1/2} + m ∘ (y^{k+1/2} + ν ȳ_u) / (1 + ν),
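This closed-form spectrum update translates directly into code; the toy 1D vectors below are purely illustrative.

```python
import numpy as np

def kspace_update(y_half, y_u_bar, m, nu):
    """Per-entry spectrum update: keep the sparse-coded estimate at
    unsampled positions, blend with the measurements at sampled ones."""
    return (1 - m) * y_half + m * (y_half + nu * y_u_bar) / (1 + nu)

m = np.array([1.0, 0.0, 1.0, 0.0])                      # sampling mask
y_half = np.array([1.0, 2.0, 3.0, 4.0], dtype=complex)  # sparse-coded spectrum
y_u_bar = np.array([2.0, 0.0, 1.0, 0.0], dtype=complex) # measurements
print(kspace_update(y_half, y_u_bar, m, nu=1.0))
```

As ν grows, the sampled entries approach the measurements exactly, i.e. the update tends to hard data consistency.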
Dictionary learning aims to solve the following basis pursuit denoising problem
(Elad 2010) with respect to D:
Since dictionary elements are basis atoms used to represent image patches, they should also be learnt on the same type of signal, namely image patches. In the equation above, the columns of the matrices X and Z represent vectorized training patches and the corresponding sparse codes, respectively. Again, one commonly alternates between searching for D with Z fixed (dictionary update step) and seeking Z with D fixed (sparse coding step) (Elad 2006, 2010; Yaghoobi et al. 2009). In our method, we exploit the state-of-the-art dictionary learning algorithm K-SVD (Aharon et al. 2006), successfully applied to image denoising (Mairal et al. 2008;
Fig. 12.3 Complex-valued dictionary with rectangular (8 × 4) atoms: (left) magnitudes and (right) arguments/phases of the atoms. The phase of the atoms is represented in the HSV (hue, saturation, value) colour map
Protter and Elad 2009). When assembling the training set, we recommend taking into consideration the type of data it contains: training patches should be extracted from datasets of images similar to the target object, or from the actual corrupted (aliased) images to be reconstructed. An example of a dictionary learnt on complex-valued data is shown in Fig. 12.3.
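As a hedged sketch of dictionary learning by alternation, the snippet below uses a MOD-style least-squares dictionary update and a crude thresholding coder as simpler stand-ins for K-SVD and OMP; the training patches are random placeholders.

```python
import numpy as np

def sparse_code(D, X, T):
    """Thresholding coder (simplified stand-in for OMP): keep the
    T largest-magnitude coefficients of D^T x for each patch."""
    Z = D.T @ X
    idx = np.argsort(np.abs(Z), axis=0)[:-T, :]
    np.put_along_axis(Z, idx, 0.0, axis=0)
    return Z

def learn_dictionary(X, n_atoms, T, n_iter=20, seed=0):
    """MOD-style alternation (a simpler relative of K-SVD): sparse
    coding step, then a least-squares dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        Z = sparse_code(D, X, T)
        D = X @ np.linalg.pinv(Z)                  # dictionary update (MOD)
        D /= np.linalg.norm(D, axis=0) + 1e-12     # renormalize atoms
    return D

# Train on vectorized 8 x 4 patches (32-dimensional), as in Fig. 12.3
rng = np.random.default_rng(1)
patches = rng.standard_normal((32, 500))
D = learn_dictionary(patches, n_atoms=64, T=4)
print(D.shape)  # (32, 64)
```

K-SVD differs by updating atoms one at a time via rank-one SVDs, which usually converges faster, but the alternation structure is the same.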
In this section, we share some hints found during our long-term research on efficient CS MRI reconstruction with dictionaries precomputed in advance: effective innovations providing fast convergence and imaging enhancement, spatial adaptation to aliasing artefacts and acceleration by parallelization under limited GPU resources (Korobchenko et al. 2016).
While zeros in the empty positions of the Fourier spectrum lead to strong aliasing artefacts, a smarter initial guess is desirable. We found that the result of the split Bregman iterative algorithm (Goldstein and Osher 2009) is an efficient initial guess that essentially suppresses aliasing effects and significantly increases both the initial reconstruction quality and the convergence rate of the main, computationally expensive, dictionary-based CS algorithm.
In accordance with Wang et al. (2007), the ℓ1 and ℓ2 norms in the equation given in Sect. 12.2.2 may be decoupled as follows:
x^{k+1} = arg min_x ||F_u x − y_u||_2^2 + μ ||χ^k − Ψ(x) − b^k||_2^2,

χ^{k+1} = arg min_χ λ ||χ||_1 + μ ||χ − Ψ(x^{k+1}) − b^k||_2^2,

b^{k+1} = b^k + Ψ(x^{k+1}) − χ^{k+1},
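The χ-subproblem above has a well-known closed-form solution: element-wise soft-thresholding with threshold λ/(2μ) applied to Ψ(x^{k+1}) + b^k. A minimal sketch:

```python
import numpy as np

def soft_shrink(v, thresh):
    """Element-wise soft-thresholding, the closed-form chi update."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

v = np.array([-2.0, -0.3, 0.1, 1.5])   # stands for Psi(x^{k+1}) + b^k
print(soft_shrink(v, thresh=0.5))      # small entries shrink to zero
```

The x-subproblem, in turn, is a quadratic solved in the Fourier domain, which is what makes split Bregman iterations cheap.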
12.4.5 DESIRE-MRI
Let us assume that the measurement vector and the sampling mask are split into N_b bands, and that proper multi-band dictionaries {D_b} for all these bands have already been learnt by K-SVD as discussed above. Let the initial preparation of the undersampled spectrum of the object be performed by the iterative split Bregman algorithm (see Sect. 12.4.1); we denote this initial guess by x_SB. Then, considering the resulting Fourier spectrum estimation presented in Sect. 12.3.1, the reconstruction of such a pre-initialized object is performed by the proposed iterative multi-band algorithm, defined in the following form:
Z_b^{k+1} = arg min_{Z_b} Σ_i ||Z_{b,i}||_0 + γ ||P(x_b^k) − D_b Z_b||_F^2,

x = Σ_{b=1}^{N_b} x_b
We name this two-step algorithm, with split Bregman initialization and multi-band dictionaries learnt in advance, Dictionary Express Sparse Image Reconstruction and Enhancement (DESIRE-MRI).
The initial guess x0 is split into N_b bands, forming a set of estimates for the band objects {x_b}. Then, all these estimates are partitioned into patches to be sparse approximated by mini-batch OMP (Step 1); we discuss the mini-batch OMP targeting GPU realization in detail in Sect. 12.5. In Step 2, estimates of the object subspectra are found by successively assembling the sparse codes into the band objects and taking their Fourier transforms. Step 3 updates the resulting band objects by restoring the measured values in the object subspectra (as defined by the last equation in Sect. 12.3.1) and returning to the image domain. We then go back to Step 1 until DESIRE-MRI converges. Note that the output of the algorithm is the sum of the reconstructed band objects. The DESIRE algorithm for the single-coil case was originally published in Migukin et al. (2015).
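A deliberately simplified, single-band 1D skeleton of this three-step loop could read as follows; treating the whole signal as one "patch" and using a thresholding coder in place of mini-batch OMP are our simplifications, and all names are illustrative.

```python
import numpy as np

def shrink_codes(D, X, T):
    """Stand-in sparse coder: keep the T largest coefficients of D^T X."""
    Z = D.T @ X
    idx = np.argsort(np.abs(Z), axis=0)[:-T, :]
    np.put_along_axis(Z, idx, 0.0, axis=0)
    return Z

def desire_1d(x0, y_u, m, D, T=2, nu=1.0, n_iter=10):
    """Single-band 1D skeleton of the DESIRE-MRI loop."""
    x = x0.copy()
    for _ in range(n_iter):
        Z = shrink_codes(D, x[:, None], T)               # Step 1: sparse coding
        y_half = np.fft.fft(D @ Z[:, 0], norm="ortho")   # Step 2: to k-space
        y = (1 - m) * y_half + m * (y_half + nu * y_u) / (1 + nu)  # Step 3
        x = np.fft.ifft(y, norm="ortho")                 # back to image domain
    return x

x_true = np.zeros(16)
x_true[[2, 9]] = [1.0, -1.0]                      # sparse toy "object"
rng = np.random.default_rng(0)
m = (rng.random(16) < 0.6).astype(float)          # sampling mask
y_u = m * np.fft.fft(x_true, norm="ortho")        # undersampled measurements
x_rec = desire_1d(np.fft.ifft(y_u, norm="ortho"), y_u, m, np.eye(16))
print(x_rec.shape)  # (16,)
```

In the full algorithm, Step 1 runs per band over all overlapping patches, and the final output is the sum of the reconstructed band objects.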
There is a clear parallel with the well-studied Gerchberg-Saxton-Fienup algorithms (Fienup 1982), but in contrast with the loss or ambiguity of the object phase, here we face the total loss of some complex-valued observations.
One of the main problems of the classical OMP algorithm is the computationally expensive matrix pseudo-inversion (Rubinstein et al. 2008; Algorithm 1, step 7). It is efficiently resolved in OMP-Cholesky and Batch OMP by progressive Cholesky factorization (Cotter et al. 1999). Batch OMP also uses pre-computation of the Gram matrix of the dictionary atoms (Rubinstein et al. 2008), which allows omitting the iterative recalculation of residuals. We use such an optimized version of OMP because a huge set of patches is encoded with a single dictionary. Moreover, Batch OMP is
The goal of our numerical experiments is to analyse the reconstruction quality and to study the performance of the algorithm. Here, we consider the reconstruction quality for complex-valued 2D and 3D target objects, namely in vivo MR scans with normalized amplitudes. The binary sampling masks used, with an undersampling rate (the ratio of the total number of spectrum components to the sampled ones) equal to 4, are illustrated in Fig. 12.4. In addition to the illustrations, we report the reconstruction accuracy as the peak signal-to-noise ratio (PSNR) and the high-frequency error norm (HFEN). Following Ravishankar and Bresler (2010), the reference (fully sampled) and reconstructed objects are (slice-wise) filtered by a Laplacian of Gaussian filter, and HFEN is computed as the ℓ2 norm of the difference between these filtered objects.
Note that in practice, one has no true signal x with which to calculate the reconstruction accuracy in terms of PSNR or HFEN. DESIRE-MRI is assumed to have converged when the norm ||x^k − x^{k−1}||_2 of the difference between successive iterations (denoted by DIFF) reaches an empirically found threshold.
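The PSNR measure and the DIFF stopping rule can be sketched as follows; HFEN is omitted here since it additionally needs the Laplacian of Gaussian filtering, and the normalization of magnitudes to a unit peak is an assumption.

```python
import numpy as np

def psnr(ref, rec):
    """PSNR in dB, assuming magnitudes normalized to a unit peak."""
    mse = np.mean(np.abs(ref - rec) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def converged(x_k, x_km1, tol):
    """Stopping rule: DIFF = ||x^k - x^{k-1}||_2 below a threshold."""
    return bool(np.linalg.norm(x_k - x_km1) < tol)

a = np.ones((8, 8))
b = a + 0.01                       # uniform 0.01 error
print(round(psnr(a, b), 1))        # 40.0
print(converged(a, b, tol=0.1))    # True
```

Unlike PSNR and HFEN, the DIFF criterion needs no ground truth, which is why it serves as the practical stopping condition.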
We exploit the efficient basis pursuit denoising approach, i.e. for particular objects, we choose proper parameters of the patch-based sparse approximation and the tolerance τ. In general, the parameters for the split Bregman initialization are λ = 100 and μ = 30 for 2D objects and λ = 0.1 and μ = 3 for 3D objects. For easy comparison with recent (at the time of development) MRI algorithms (Lustig et al. 2007; Ravishankar and Bresler 2010), all DESIRE-MRI results are given for ν = 1, i.e. we act on the assumption that our spectral measurements are noise-free.
Fig. 12.5 Influence of initialization on the convergence and accuracy of DESIRE-MRI, split Bregman (SB, solid curve) vs zero filling (ZF, dashed curve) in 2D: (top) stopping condition and reconstruction quality in terms of (middle) PSNR and (bottom) HFEN
Fig. 12.6 Imaging enhancement with the multi-band scheme: fragments of (a) the original axial magnitude and the DESIRE-MRI results with (b) the whole bandwidth and (c) two bands
Let us compare the DESIRE-MRI results with recently applied CS MRI approaches. Again, the Fourier spectrum of the Siemens T1 axial coil image is undersampled by the binary sampling mask shown in Fig. 12.4 (left) and reconstructed by LDP (Lustig et al. 2007) and DLMRI (Ravishankar and Bresler 2010).
In Fig. 12.7, we present a comparison of the normalized magnitudes. DLMRI with online recalculation of dictionaries is unable to remove the large aliasing artefacts (Fig. 12.7c, PSNR = 35.3 dB). LDP suppresses aliasing artefacts, but still not sufficiently (Fig. 12.7b, PSNR = 34.2 dB); its slightly lower PSNR compared with DLMRI is due to oversmoothing of the LDP result. DESIRE-MRI with two-band splitting (see Fig. 12.7d) produces an aliasing-free reconstruction that
Fig. 12.7 Comparison of the reconstructed MR images: (a) the original axial magnitude; (b) its reconstruction obtained by LDP (Lustig et al. 2007), PSNR = 34.2 dB; (c) DLMRI (Ravishankar and Bresler 2010), PSNR = 35.3 dB; (d) our DESIRE-MRI, PSNR = 39 dB
Fig. 12.8 Multi-coil DESIRE-MRI reconstructions for 2D and 3D objects (column-wise, from left to right): Samsung TOF 3D, Siemens TOF 3D, Siemens T1 axial 2D slice and Siemens 2D Phantom. Comparison of SOS (row-wise, from top to bottom): for the original objects, the zero-filling reconstructions and DESIRE-MRI
looks much closer to the reference: a small degree of smoothing is inevitable at high
undersampling rate.
All experimental data are multi-coil; thus, the individual coil images are typically reconstructed independently, and the final result is found by merging these reconstructed coil images with the so-called SOS ("sum of squares", Larsson et al. 2003).
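The SOS combination itself is essentially a one-liner over the per-coil complex images; the toy coil images below are illustrative.

```python
import numpy as np

def sum_of_squares(coil_images):
    """SOS combination of per-coil complex reconstructions."""
    stack = np.stack(coil_images)
    return np.sqrt(np.sum(np.abs(stack) ** 2, axis=0))

c1 = np.full((2, 2), 3.0 + 0.0j)   # toy coil images
c2 = np.full((2, 2), 4.0j)
print(sum_of_squares([c1, c2]))    # every pixel: sqrt(9 + 16) = 5
```

Taking magnitudes before summing makes SOS insensitive to the per-coil phase, which is why it needs no coil sensitivity maps.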
In Fig. 12.8, some DESIRE-MRI results of such multi-coil reconstructions are demonstrated. In the top row, we present the SOS for the original objects (column-wise, from left to right): Samsung TOF (344 × 384 × 30 angiography), Siemens TOF (348 × 384 × 48 angiography), Siemens T1 axial slice (320 × 350) and Siemens Phantom (345 × 384). In the middle row, we demonstrate the SOS for the aliased reconstructions obtained by zero filling. The undersampling of the 2D and 3D object spectra is performed by the corresponding 2D Cartesian and 3D Gaussian masks given in Fig. 12.4. In the bottom row of Fig. 12.8, the SOS for the DESIRE-MRI reconstructions is illustrated. For both the 2D and 3D cases, DESIRE-MRI results in a clean reconstruction with almost no essential degradation. Note that on the Siemens Phantom, with its many high-frequency details (Fig. 12.8, bottom-right image), DESIRE-MRI returns some visible defects on the borders of the bottom black rectangle and between the radial "beams".
12.6 Conclusion
References
Afonso, M.V., Bioucas-Dias, J.M., Figueiredo, M.A.T.: Fast image recovery using variable split-
ting and constrained optimization. IEEE Trans. Image Process. 19(9), 2345–2356 (2010)
Aggarwal, N., Bresler, Y.: Patient-adapted reconstruction and acquisition dynamic imaging method
(PARADIGM) for MRI. Inverse Prob. 24(4), 1–29 (2008)
Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionar-
ies for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)
Amdahl, G.M.: Validity of the single-processor approach to achieving large-scale computing
capabilities. In: Proceedings of AFIPS Conference, vol. 30, pp. 483–485 (1967)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods.
Prentice-Hall, Englewood Cliffs. 735 p (1989)
Bioucas-Dias, J.M., Figueiredo, M.A.T.: A new TwIST: two-step iterative shrinkage/thresholding
algorithms for image restoration. IEEE Trans. Image Process. 16(12), 2980–2991 (2007)
Accessed on 04 October 2020. http://www.lx.it.pt/~bioucas/TwIST/TwIST.htm
Blaimer, M., Breuer, F., Mueller, M., Heidemann, R.M., Griswold, M.A., Jakob, P.M.: SMASH,
SENSE, PILS, GRAPPA: how to choose the optimal method. Top. Magn. Reson. Imaging.
15(4), 223–236 (2004)
Caballero, J., Rueckert, D., Hajnal, J.V.: Dictionary learning and time sparsity in dynamic MRI. In:
Proceedings of International Conference on Medical Image Computing and Computer-Assisted
Intervention (MICCAI), vol. 15, pp. 256–263 (2012)
Candès, E., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inf. Theory. 52(2), 489–509 (2006)
Cotter, S.F., Adler, R., Rao, R.D., Kreutz-Delgado, K.: Forward sequential algorithms for best basis
selection. In: IEE Proceedings - Vision, Image and Signal Processing, vol. 146 (5), pp. 235–244
(1999)
Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-D transform-
domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
Danielyan, A., Katkovnik, V., Egiazarian, K.: BM3D frames and variational image deblurring.
IEEE Trans. Image Process. 21(4), 1715–1728 (2011)
Do, M.N., Vetterli, M.: The contourlet transform: an efficient directional multiresolution image
representation. IEEE Trans. Image Process. 14(12), 2091–2106 (2006)
Donoho, D.: Compressed sensing. IEEE Trans. Inf. Theory. 52(4), 1289–1306 (2006)
Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representations
in the presence of noise. IEEE Trans. Inf. Theory. 52(1), 6–18 (2006)
Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point
algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
Elad, M.: Sparse and Redundant Representations: from Theory to Applications in Signal and Image
Processing. Springer Verlag, New York., 376 p (2010)
Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned
dictionaries. IEEE Trans. Image Process. 15(12), 3736–3745 (2006)
Fienup, J.R.: Phase retrieval algorithms: a comparison. Appl. Opt. 21(15), 2758–2769 (1982)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via
finite-element approximations. Comput. Math. Appl. 2(1), 17–40 (1976)
Goldstein, T., Osher, S.: The Split Bregman method for L1-regularized problems. SIAM J. Imag.
Sci. 2(2), 323–343 (2009) Accessed on 04 October 2020. http://www.ece.rice.edu/~tag7/Tom_
Goldstein/Split_Bregman.html
Griswold, M.A., Jakob, P.M., Heidemann, R.M., Nittka, M., Jellus, V., Wang, J., Kiefer, B., Haase,
A.: Generalized autocalibrating partially parallel acquisitions (GRAPPA). Magn. Reson. Med.
47(6), 1202–1210 (2002)
Korobchenko, D.A., Danilevitch, A.B., Sirotenko, M.Y., Gavrilyuk, K.A., Rychagov, M.N.:
Automatic view planning in magnetic resonance tomography using convolutional neural net-
works. In: Proceedings of Moscow Institute of Electronic Technology. MIET., 176 p, Moscow
(2016)
Korobchenko, D.A., Migukin, A.S., Danilevich, A.B., Varfolomeeva, A.A, Choi, S., Sirotenko, M.
Y., Rychagov, M.N.: Method for restoring magnetic resonance image and magnetic resonance
image processing apparatus, US Patent Application 20180247436 (2018)
Larkman, D.J., Nunes, R.G.: Parallel magnetic resonance imaging. Phys. Med. Biol. 52(7),
R15–R55 (2007)
Chapter 13
Depth Camera Based on Colour-Coded
Aperture
Vladimir P. Paramonov
13.1 Introduction
Scene depth extraction, i.e. the computation of distances to all scene points visible on
a captured image, is an important part of computer vision. There are various
approaches for depth extraction: a stereo camera and a camera array in general, a
plenoptic camera including dual-pixel technology as a special case, and a camera
with a coded aperture to name a few. The camera array is the most reliable solution,
but it implies extra cost and extra space and increases the power consumption for any
given application. Other approaches use a single camera but require multiple images for depth extraction, and thus work only for static scenes, which severely limits the range of possible applications. The coded aperture approach is therefore a promising single-lens, single-frame solution which requires only an insignificant hardware modification
(Bando 2008) and can provide a depth quality sufficient for many applications
(e.g. Bae et al. 2011; Bando et al. 2008). However, a number of technical issues
have to be solved to achieve a level of performance which is acceptable for
applications. We discuss the following issues and their solutions in this chapter:
(1) light-efficiency degradation due to the insertion of colour filters into the camera
aperture, (2) closeness to the diffraction limit for millimetre-size lenses (e.g.,
smartphones, webcams), (3) blindness of disparity estimation algorithms in
low-textured areas, and (4) final depth estimation in millimetres in the whole
image frame, which requires the use of a special disparity with the depth conversion
method for coded apertures generalised for any imaging optical system.
V. P. Paramonov (*)
Samsung R&D Institute Russia (SRR), Moscow, Russia
e-mail: v.paramonov@samsung.com
Depth can be estimated using a camera with a binary coded aperture (Levin et al.
2007; Veeraraghavan et al. 2007). It requires computationally expensive depth
extraction techniques based on multiple deconvolutions and a sparse image gradient
prior. Disparity extraction using a camera with a colour-coded aperture which pro-
duces spatial misalignment between colour channels was first demonstrated by
Amari and Adelson in 1992 and has not changed significantly since that time
(Bando et al. 2008; Lee et al. 2010, 2013). The main advantage of these cameras
over cameras with a binary coded aperture is the lower computational complexity of
the depth extraction techniques, which do not require time-consuming
deconvolutions.
The light efficiency of the systems proposed in Amari and Adelson (1992), Bando et al. (2008), Lee et al. (2010, 2013), Levin et al. (2007), Veeraraghavan et al. (2007), and Zhou et al. (2011) is less than 20% of that of a fully opened aperture, which leads to a
decreased signal-to-noise ratio (SNR) or longer exposure times with motion blur.
That makes them impractical for compact handheld devices and for real-time
performance by design. A possible solution was proposed by Chakrabarti and
Zickler (2012), where each colour channel has an individual effective aperture
size. Therefore, the resulting image has colour channels with different depths of
field. Due to its symmetrical design, this coded aperture cannot discriminate between objects closer than and farther than the in-focus distance. Furthermore, it requires a time-consuming disparity extraction algorithm.
Paramonov et al. (2016a, b, c) proposed a solution to the problems outlined above
by presenting new light-efficient coded aperture designs and corresponding algo-
rithm modifications. All the aperture designs detailed above are analysed and
compared in Sect. 13.7 of this chapter.
Let us consider the coded aperture concept. A simplified imaging system is illustrated schematically in Fig. 13.1. It consists of a single thin lens and an RGB colour sensor.
A coded aperture is placed next to the thin lens. The aperture consists of colour
filters with different passbands, e.g. red and green colour filters (Fig. 13.2a).
Fig. 13.1 Conventional single-lens imaging system image formation: (a) focused scene; (b)
defocused scene
Fig. 13.2 Colour-coded aperture image formation: (a) in-focus foreground; (b) defocused
background
Fig. 13.3 Colour image restoration example. From left to right: image captured with colour-coded
aperture (causing a misalignment in colour channels); extracted disparity map; restored image
Defocused regions of an image captured with this system have different viewpoints
in the red and green colour channels (see Fig. 13.2b). By considering the correspon-
dence between these two channels, the disparity map for the captured scene can be
estimated as in Amari and Adelson (1992).
The original colour image cannot be restored if the blue channel is absent. Bando et al. (2008), Lee et al. (2010, 2013), and Paramonov et al.
(2016a, b, c) changed the aperture design to include all three colour channels, thus
making image restoration possible and enhancing the disparity map quality. The
image is restored by applying colour shifts based on the local disparity map value
(Fig. 13.3).
To get the depth map from an estimated disparity map, one may use the thin lens equation. In practice, most prior works in this area do not distinguish between disparity and depth, treating them as synonyms, since a one-to-one correspondence exists between them. However, a modern imaging system usually consists of a number of different lenses, i.e. an objective, which makes the thin lens formula inapplicable. Furthermore, a planar scene does not produce a planar depth map if a trivial disparity-to-depth conversion equation is applied. A number of researchers have worked on this problem for different optical systems (Dansereau et al. 2013; Johannsen et al. 2013; Trouvé et al. 2013a, b). Depth results for coded aperture cameras (Lee et al. 2013; Panchenko et al. 2016) are valid only in the centre of the captured image.
Fig. 13.4 Aperture designs for image formation numerical simulation: (a) open aperture, (b) binary
coded aperture (Levin et al. 2007); (c) colour-coded aperture (Bando et al. 2008); (d) colour-coded
aperture (Lee et al. 2010); (e) colour-coded aperture (Chakrabarti and Zickler 2012); (f) colour-
coded aperture (Paramonov et al. 2016a, b, c)
Fig. 13.5 PSF numerical simulation for different coded aperture designs at different point source
distances from camera along optical axis. From top to bottom row: conventional open aperture,
binary coded aperture (Levin et al. 2007); colour-coded aperture (Bando et al. 2008); colour-coded
aperture (Lee et al. 2010); colour-coded aperture (Chakrabarti and Zickler 2012); colour-coded
aperture (Paramonov et al. 2016a, b, c)
The first step is to simulate the point spread function (PSF) for a given coded
aperture design and different defocus levels, for which we follow the theory and the
code provided in Goodman (2008), Schmidt (2010), and Voelz (2011). The resulting
PSF images are illustrated in Fig. 13.5.
Given a set of PSFs corresponding to different defocus levels (i.e. different
distances), one can simulate the image formation process for a planar scene via
convolution of the input clear image with the corresponding PSF. In the case of a
complex scene with depth variations, this process requires multiple convolutions
with different PSFs for different depth levels. In order to do this, a continuous depth
map should be represented by a finite number of layers. In our simulation, we
precalculate 256 PSFs for a given aperture design, corresponding to 256 different defocus levels. Once the PSFs are precalculated, they can be reused for any number of simulated images without repeating this step. It should be noted that object boundaries and semi-transparent objects require extra care to make this simulation realistic.
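The layered simulation described above can be sketched as follows (a minimal, single-channel sketch with illustrative names; a real simulation would use the diffraction-based PSFs of Fig. 13.5 rather than the toy kernels below, and layer boundaries would need the extra care mentioned in the text):

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_defocused_image(image, depth_map, psfs):
    """Simulate image formation for a layered scene.

    image:     (H, W) clear input image (one channel for brevity)
    depth_map: (H, W) integer depth-layer index per pixel, in [0, len(psfs))
    psfs:      list of 2-D PSF kernels, one per quantised defocus level
    """
    out = np.zeros_like(image, dtype=np.float64)
    for layer, psf in enumerate(psfs):
        mask = (depth_map == layer).astype(np.float64)
        if not mask.any():
            continue  # skip unused depth layers
        # Blur each depth layer with its own PSF and accumulate.
        out += fftconvolve(image * mask, psf, mode="same")
    return out

# Toy usage: a two-layer scene, in-focus left half and defocused right half.
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
depth = np.zeros((32, 32), dtype=int); depth[:, 16:] = 1
psf_sharp = np.array([[1.0]])         # in-focus layer: delta PSF
psf_blur = np.ones((5, 5)) / 25.0     # defocused layer: box PSF
res = simulate_defocused_image(img, depth, [psf_sharp, psf_blur])
```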
As a sanity check of the optics simulation model, one can numerically simulate the imaging process using a corresponding image and disparity map pair taken from an existing dataset. Here, we use an image from the Middlebury dataset
(Scharstein and Szeliski 2003) to simulate an image captured through a colour-coded
aperture illustrated in Fig. 13.4c (proposed by Bando et al. 2008). Then we use the
disparity estimation algorithm provided by the original authors of Bando et al.
(2008) for their own aperture design (link to the source code: http://web.media.
mit.edu/~bandy/rgb/). Based on the results in Fig. 13.6, we conclude that the model’s
realism is acceptable.
This gives an opportunity to generate new synthetic datasets for depth estimation
with AI algorithms. Namely, one can use existing datasets of images with
Fig. 13.6 Numerical simulation of image formation for colour-coded aperture: (a) original image;
(b) ground truth disparity map; (c) numerically simulated image; (d) raw layered disparity map
extracted by algorithm implemented by Bando et al. (2008)
Fig. 13.7 Light-efficient colour-coded aperture designs with corresponding light efficiency
approximation (based on effective area sizes)
$$w^{i,j}_{\mathrm{CYX}} = M^{-1} w^{i,j}_{\mathrm{RGB}},$$
One approach for disparity map estimation is described by Panchenko et al. (2016); its implementation for depth estimation and control is also given in Chap. 3. The approach utilizes the mutual correlation of shifted colour channels in an exponentially weighted window and uses a bilateral filter approximation for cost volume regularization. We describe the basics below for completeness, so that the current chapter is self-contained.
Let {I_i}_1^n represent a set of n captured colour channels of the same scene from different viewpoints, where I_i is an M × N frame. A conventional correlation matrix C_d is formed for the set {I_i}_1^n and candidate disparity values d:

$$C_d = \begin{pmatrix} 1 & \cdots & \operatorname{corr}(I_1^d, I_n^d) \\ \vdots & \ddots & \vdots \\ \operatorname{corr}(I_n^d, I_1^d) & \cdots & 1 \end{pmatrix},$$
where the superscript (·)^d denotes a parallel shift of d pixels in the corresponding channel; the direction of the shift is dictated by the aperture design. The determinant of the matrix C_d is a good measure of the mutual correlation of {I_i}_1^n. Indeed, when all channels are strongly correlated, all elements of the matrix are equal to one and det(C_d) = 0. On the other hand, when the data is completely uncorrelated, we have det(C_d) = 1. To extract a disparity map using this metric, one should find the disparity value d corresponding to the smallest value of det(C_d) at each pixel of the picture.
Here, we derive another particular implementation of the generalized correlation metric for n = 3, corresponding to the case of an aperture with three channels. The determinant of the correlation matrix is:

$$\det(C_d) = 1 - \operatorname{corr}^2(I_1^d, I_2^d) - \operatorname{corr}^2(I_2^d, I_3^d) - \operatorname{corr}^2(I_3^d, I_1^d) + 2\operatorname{corr}(I_1^d, I_2^d)\operatorname{corr}(I_2^d, I_3^d)\operatorname{corr}(I_3^d, I_1^d),$$

and we have

$$\operatorname*{arg\,min}_d \det(C_d) = \operatorname*{arg\,max}_d \Bigl( \sum \operatorname{corr}^2(I_i^d, I_j^d) - 2 \prod \operatorname{corr}(I_i^d, I_j^d) \Bigr).$$
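A minimal sketch of this metric for n = 3 (assuming windowed Pearson correlations and horizontally opposed channel shifts; all names and the shift geometry are illustrative, since the real shift directions are dictated by the aperture design):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_corr(a, b, win=9, eps=1e-9):
    """Pearson correlation of a and b inside a win x win sliding window."""
    ma, mb = uniform_filter(a, win), uniform_filter(b, win)
    cov = uniform_filter(a * b, win) - ma * mb
    va = uniform_filter(a * a, win) - ma * ma
    vb = uniform_filter(b * b, win) - mb * mb
    return cov / np.sqrt(np.maximum(va * vb, eps))

def disparity_map(c1, c2, c3, disparities, win=9):
    """Per-pixel disparity via the generalized correlation metric, n = 3."""
    best_cost = np.full(c1.shape, -np.inf)
    best_d = np.zeros(c1.shape, dtype=int)
    for d in disparities:
        s1 = np.roll(c1, d, axis=1)    # shift channels toward alignment
        s3 = np.roll(c3, -d, axis=1)
        r12 = local_corr(s1, c2, win)
        r23 = local_corr(c2, s3, win)
        r31 = local_corr(s3, s1, win)
        # argmin det(C_d)  ==  argmax (sum of squared corr - 2 * product)
        cost = r12**2 + r23**2 + r31**2 - 2.0 * r12 * r23 * r31
        upd = cost > best_cost
        best_cost[upd], best_d[upd] = cost[upd], d
    return best_d
```

On synthetic channels produced by shifting one texture in opposite directions, this recovers the true shift; as noted in the text, a pairwise correlation term still contributes when one of the three channels is locally textureless.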
This metric is similar to the colour lines metric (Amari and Adelson 1992) but is more robust in the important case of low texture density in some areas of the image. The extra robustness appears when one of the three channels does not have enough texture in a local window around the point under consideration. In this case, the colour lines metric cannot provide disparity information, even if the other two channels are well defined. The generalized correlation metric avoids this disadvantage and allows the depth sensor to work similarly to a stereo camera in this case.
Usually, passive sensors provide sparse disparity maps. However, dense disparity
maps can be obtained by propagating disparity information to non-textured areas.
The propagation can be efficiently implemented via joint-bilateral filtering
(Panchenko et al. 2016) of the mutual correlation metric cost or by applying
variational methods (e.g. Chambolle and Pock 2010) for global regularization with
classic total variation or other priors. Here, we assume that the depth is smooth in
non-textured areas.
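A crude stand-in for this propagation step (the chapter uses a joint-bilateral filter approximation; here an edge-unaware Gaussian smooths each slice of the correlation cost volume before the per-pixel argmax):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def regularize_cost_volume(cost_volume, sigma=3.0):
    """Smooth each disparity slice of a (D, H, W) cost volume so that
    correlation evidence from textured pixels spreads into neighbouring
    non-textured pixels, then take the per-pixel best disparity index."""
    smoothed = np.stack([gaussian_filter(cost_volume[k], sigma)
                         for k in range(cost_volume.shape[0])])
    return smoothed.argmax(axis=0)

# Sparse evidence only at every 4th pixel, all of it at disparity index 1:
cv = np.zeros((3, 20, 20))
cv[1, ::4, ::4] = 1.0
dense = regularize_cost_volume(cv)   # index 1 propagates everywhere
```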
In contrast to the original work by Panchenko et al. (2016), this algorithm has also been applied not in the sensor colour space but in the colour-coded aperture colour space (Paramonov et al. 2016a, b, c). This increases the texture correlation between the colour channels if they have overlapping passbands and increases the number of distinguishable depth layers compared to the RGB colour space (see Fig. 13.9c, d for a comparison of the number of depth layers sensed with the same aperture design but different colour bases).
Let us derive a disparity-to-depth conversion equation for a single thin-lens
optical system (Fig. 13.1) as was proposed by Paramonov et al. (2016a, b, c). For
a thin lens (Fig. 13.1a), we have:
Fig. 13.9 Depth sensor on-axis calibration results for different colour-coded aperture designs:
(a) three RGB circles processed in RGB colour space; (b) cyan and yellow halves coded aperture
processed in CYX colour space; (c) cyan and yellow coded aperture with open centre processed in
CYX colour space; (d) cyan and yellow coded aperture with open centre processed in conventional
RGB colour space. Please note that there are more depth layers in the same range for case (c) than
for case (d), thanks only to the CYX colour basis (the coded aperture is the same)
$$\frac{1}{z_{of}} + \frac{1}{z_{if}} = \frac{1}{f},$$
where f is the lens focal length, zof the distance between a focused object and the lens,
and zif the distance from the lens to the focused image plane. If we move the image
sensor towards the lens as shown in Fig. 13.1b, the image of the object on the sensor
is convolved with a colour-coded aperture copy, which is the circle of confusion, and
we obtain:
$$\frac{1}{z_{od}} + \frac{1}{z_{id}} = \frac{1}{f},$$

$$\frac{1}{z_{of}} + \frac{1 + c/D}{z_{id}} = \frac{1}{f},$$
where zid is the distance from the lens to the defocused image plane, zod is the
distance from the lens to the defocused object plane corresponding to zid, c is the
circle of confusion diameter, and D is the aperture diameter (Fig. 13.1b). We can
solve this system of equations for the circle of confusion diameter:
$$c = \frac{fD\,(z_{od} - z_{of})}{z_{od}\,(z_{of} - f)},$$
The disparity d (in pixels) between the colour channels is proportional to the circle of confusion, 2μd = βc, where μ is the sensor pixel size, β = r_c/R is the coded aperture coefficient, R = D/2 is the aperture radius, and r_c is the distance between the aperture centre and the single-channel centroid (Fig. 13.8b). Now, we can express the distance between the camera lens and any object only in terms of the internal camera parameters and the disparity value corresponding to that object:
$$z_{od} = \frac{bf\,z_{of}}{bf - 2\mu d\,(z_{of} - f)},$$
where b = βD = 2r_c is the distance between the two centroids (see Fig. 13.8), i.e. the colour-coded aperture baseline, equivalent to the stereo camera baseline. Note that if d = 0, then z_od is naturally equal to z_of, i.e. the object is in the camera focus.
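The disparity-to-depth conversion above translates directly into a function (a sketch; the numbers in the usage lines are illustrative, loosely based on the lens parameters reported later in this chapter):

```python
def disparity_to_depth(d, f, z_of, b, mu):
    """Thin-lens disparity-to-depth conversion:
        z_od = b*f*z_of / (b*f - 2*mu*d*(z_of - f))
    d:    disparity in pixels (signed)
    f:    focal length [mm]
    z_of: in-focus object distance from the lens [mm]
    b:    coded aperture baseline b = 2*r_c [mm]
    mu:   sensor pixel size [mm]
    """
    return b * f * z_of / (b * f - 2.0 * mu * d * (z_of - f))

# f = 51.62 mm, focus at 2 m, baseline 20 mm, 4.5 um pixels:
z0 = disparity_to_depth(0, 51.62, 2000.0, 20.0, 0.0045)  # in-focus object
z1 = disparity_to_depth(1, 51.62, 2000.0, 20.0, 0.0045)  # behind the focus
```

With d = 0 the expression reduces to z_of, matching the remark that a zero-disparity object is in focus.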
To use the last equation with any real complex system (an objective), it was proposed to substitute it with a black box with entrance and exit pupils (see Goodman 2008; Paramonov et al. 2016a, b, c for details) located at the second and the first principal points (H′ and H), respectively (see Fig. 13.10 for an example of the principal plane locations in the case of a double Gauss lens).
The distance between the entrance and exit pupils and the effective focal length
are found through a calibration procedure proposed by Paramonov et al.
(2016a, b, c). Since the pupil position is unknown for a complex lens, we measure
Fig. 13.10 Schematic diagram of the double Gauss lens used in the Canon EF 50 mm f/1.8 II lens and
its principal plane location. Please note that this approach works for any optical imaging system
(Goodman 2008)
the distances to all objects from the camera sensor. Therefore, the disparity-to-depth
conversion equation becomes:
$$\tilde{z}_{od} - \delta = \frac{bf\,(\tilde{z}_{of} - \delta)}{bf - 2\mu d\,(\tilde{z}_{of} - \delta - f)},$$

where z̃_od is the distance between the defocused object and the sensor, z̃_of is the distance between the focused object and the sensor, and δ = z_if + HH′ is the distance between the sensor and the entrance pupil. Thus, for z̃_od we have:

$$\tilde{z}_{od} = \frac{bf\,\tilde{z}_{of} - 2\mu d\,\delta\,(\tilde{z}_{of} - \delta - f)}{bf - 2\mu d\,(\tilde{z}_{of} - \delta - f)}.$$
On the right-hand side of the equation above, there are three independent unknown variables, namely, z̃_of, b, and δ. We discuss their calibration below; the other variables are either known or dependent.
Another issue arises due to the point spread function (PSF) changing across the
image. This causes a variation in the disparity values for objects with the same
distances from the sensor but with different positions in the image. A number of
researchers encountered the same problem in their works (Dansereau et al. 2013;
Johannsen et al. 2013; Trouvé et al. 2013a, b). A specific colour-coded aperture
depth sensor calibration to mitigate this effect is described below.
The first step is the conventional calibration with the pinhole camera model and a
chessboard pattern (Zhang 2000). From this calibration, we acquire the distance
zif between the sensor and the exit pupil.
To find the independent variables z̃_of, b, and HH′, we capture a set of images of a chessboard pattern moved over a certain range along the optical axis and orthogonally to it (see Fig. 13.11). Each time, the object was positioned by hand, which is why small errors are possible (up to 3 mm). The optical system is focused at a certain distance
Fig. 13.11 The depth sensor calibration procedure takes place after conventional camera calibration using the pinhole camera model. A chessboard pattern is moved along the optical axis and captured while the camera focus distance is kept constant
from the sensor. Our experience shows that the error of focusing by hand at close range is high (up to 20 mm for the Canon EF 50 mm f/1.8 lens), so we have to find an accurate value of z̃_of through the calibration as well.
Disparity values are extracted, and the corresponding distances are measured with a ruler on the test scene for all captured images. Now, we can find z̃_of and b such that the above equation for the distance between a defocused object and the sensor holds with minimal error (RMS error over all measurements).
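This fitting step can be sketched with a generic least-squares routine (an illustrative reconstruction, not the authors' code; it assumes the sensor-referenced conversion model of the previous section, and the starting guesses are arbitrary):

```python
import numpy as np
from scipy.optimize import least_squares

def calibrate(d_meas, z_meas, f, mu, delta, x0=(2000.0, 20.0)):
    """Fit the in-focus distance z_of and baseline b so that
        z_od = (b*f*z_of - 2*mu*d*delta*t) / (b*f - 2*mu*d*t),
        t = z_of - delta - f,
    matches the ruler-measured distances in the RMS sense.

    d_meas, z_meas: measured disparities [px] and distances [mm]
    f, mu, delta:   focal length, pixel size, sensor-to-pupil distance [mm]
    """
    def model(x, d):
        z_of, b = x
        t = z_of - delta - f
        return (b * f * z_of - 2 * mu * d * delta * t) / (b * f - 2 * mu * d * t)

    res = least_squares(lambda x: model(x, d_meas) - z_meas, x0)
    return res.x  # fitted (z_of, b)
```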
To account for depth distortion due to the field curvature of the optical system, we perform the calibration for every pixel in the image individually. The resulting colour-coded aperture baseline b(i, j) and in-focus surface z̃_of(i, j) are shown in Fig. 13.12a and b, respectively.
The procedure described here was implemented on a prototype based on the Canon EOS 60D camera and the Canon EF 50 mm f/1.8 II lens. Any other imaging system would also work, but this Canon lens is easy to disassemble (Bando 2008). Thirty-one images were captured (see Fig. 13.11), with the defocused object plane (z_od) moving from 1000 to 4000 mm in 100 mm steps and the camera focused at approximately 2000 mm (z_of).
The results of our calibration for different coded aperture designs are presented in
Fig. 13.9. Based on the calibration, the effective focal length of our Canon EF 50 mm f/1.8 II lens is 51.62 mm (not, in fact, 50 mm), which is in good agreement with the value (51.6 mm) provided to us by the professional opticians who performed an accurate calibration.
Using the calibration data, one can perform accurate depth map estimation: the floor in Fig. 13.13a is flat but appears concave in the extracted depth map due to depth distortion (Fig. 13.13b). After calibration, the floor surface is
corrected and is close to planar (Fig. 13.13c). The accuracy of undistorted depth
maps extracted with a colour-coded aperture depth sensor is sufficient for 3D scene
reconstruction, as discussed in the next section.
Fig. 13.12 Colour-coded aperture depth sensor 3D calibration results: (a) coded aperture equivalent baseline field b(i, j); (b) optical system in-focus surface z̃_of(i, j), where (i, j) are pixel coordinates
Fig. 13.13 Depth map of a rabbit figure standing on the floor: (a) captured image (the colour shift
in the image is visible when looking under magnification); (b) distorted depth map, floor appears to
be concave; (c) undistorted depth map, floor surface is planar
First, let us compare the depth estimation error for the layered and sub-pixel
approaches. Figure 13.14 shows that the sub-pixel estimation with the quadratic
polynomial interpolation significantly improves the depth accuracy. This approach
also allows real-time implementation as we interpolate around global maxima only.
The different aperture designs are compared in Paramonov et al. (2016a, b, c) using the same processing algorithm (except for the aperture of Chakrabarti and Zickler (2012), which utilizes a significantly different approach).
The tests were conducted using the Canon EOS 60D DSLR camera with a Canon
EF 50 mm f/1.8 II lens in the same light conditions and for the same distance to the
object, while the exposure time was adjusted to achieve a meaningful image in each
case. Typical results are shown in Fig. 13.15.
Fig. 13.14 Cyan and yellow coded aperture: depth estimation error comparison for layered and
sub-pixel approaches
Fig. 13.15 Results comparison with prior art. Rows correspond to different coded aperture designs.
From top to bottom: RGB circles (Lee et al. 2010); RGB squares (Bando et al. 2008); CMY squares,
CMY circles, CY halves (all three cases by Paramonov et al. 2016a, b, c); magenta annulus
(Chakrabarti and Zickler 2012); CY with open area (Paramonov et al. 2016a, b, c); open aperture.
Light efficiency increases from top to bottom
To avoid texture dependency and image sensor noise, we use a blurred captured image of a white sheet of paper.
The transparency Ti, j shows the fraction of light which passes through the
imaging system with the coded aperture relative to the same imaging system without
the coded aperture:
$$T_{i,j} = \frac{I^{c}_{i,j}}{I^{nc}_{i,j}}.$$
Fig. 13.16 Transparency maps for different aperture designs (columns) corresponding to different
image sensor colour channels (rows)
Fig. 13.17 SNR loss for different apertures in the central and border image areas (dB)
All photos were taken with identical camera settings. The SNR degradation from the centre to the edges appears to be induced by lens aberrations. Different apertures provide different SNRs, the value depending on the amount of captured light. The loss between apertures 1 and 3 is 2.3 dB. To recover the original SNR value with aperture 3, one should increase the exposure time by 30%.
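The quoted figures are mutually consistent under the 20·log10 (amplitude) dB convention; with that assumption, the exposure compensation factor is:

```python
def exposure_factor(snr_loss_db):
    """Exposure-time multiplier compensating an SNR loss given in dB
    (20*log10 amplitude convention, matching the 2.3 dB -> ~30% figure)."""
    return 10.0 ** (snr_loss_db / 20.0)

factor = exposure_factor(2.3)  # ~1.30, i.e. a ~30% longer exposure
```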
It is important to take into account the light efficiency while evaluating depth
sensor results. We estimated the light efficiency by capturing a white sheet of paper
through different coded apertures in the same illumination conditions and with the
same camera parameters. The light efficiency values are presented for each sensor
colour channel independently (see Fig. 13.18). The aperture designs are sorted based
on their light efficiency.
Apertures V and VI in Fig. 13.15 have almost the same light efficiency, but the depth quality of aperture V appears better. Aperture VII has a higher light efficiency and can be used when depth quality is not critical.
This analysis may be used to find a suitable trade-off between image quality and
depth quality for a given application.
Fig. 13.19 Design of colour-coded apertures: (a) inside a DSLR camera lens; (b) inside a
smartphone camera lens; (c) disassembled smartphone view
Fig. 13.20 Scenes captured with the DSLR-based prototype with their corresponding depth maps
Fig. 13.21 Depth-dependent image effects. From top to bottom rows: refocusing, pixelization,
colourization
Fig. 13.22 Tiny models captured with the smartphone-based prototype and with their
corresponding disparity maps
Fig. 13.23 Disparity map and binary mask for the image captured with the smartphone prototype
Fig. 13.25 Real-time implementation of raw disparity map estimation with web camera-based
prototype
Fig. 13.26 Point Grey Grasshopper 3 camera with inserted colour-coded aperture on the left (a)
and real-time 3D scene reconstruction process on the right (b)
Fig. 13.27 Test scenes and their corresponding 3D reconstruction examples: (a) test scene frontal view and (b) corresponding 3D reconstruction view; (c) test scene side view and (d) corresponding 3D reconstruction view
The test scene and 3D reconstruction results are shown in Fig. 13.27a–d. Note
that a chessboard pattern is not used for tracking but only to provide good texture.
In Fig. 13.28, we show the distance between depth layers corresponding to
disparity values equal to 0 and 1 based on the last formula in Sect. 13.5. The layered
Fig. 13.28 Depth sensor accuracy curves for different aperture baselines b: (a) full-size camera
with f-number 1.8 and pixel size 4.5 μm; (b) compact camera with f-number 1.8 and pixel size
1.2 μm
depth error is by definition two times smaller than this distance. The sub-pixel refinement reduces the depth estimation error by a further factor of two (see Figs. 13.6 and 13.10). That gives a final accuracy better than 15 cm at a distance of 10 m and better than 1 cm at distances below 2.5 m for an equivalent baseline b of 20 mm.
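The spacing between adjacent integer-disparity depth layers, on which these accuracy figures are based, can be sketched from the thin-lens conversion of Sect. 13.5 (parameter values illustrative):

```python
def layer_spacing(d, f, z_of, b, mu):
    """Distance between the depth layers for integer disparities d and d + 1,
    using z_od = b*f*z_of / (b*f - 2*mu*d*(z_of - f))."""
    def depth(dd):
        return b * f * z_of / (b * f - 2.0 * mu * dd * (z_of - f))
    return abs(depth(d + 1) - depth(d))

# Full-frame-like setup: f = 51.62 mm, focus at 2 m, b = 20 mm, 4.5 um pixels.
gap = layer_spacing(0, 51.62, 2000.0, 20.0, 0.0045)
# The layered depth error is half this gap; sub-pixel refinement halves it again.
```

A larger equivalent baseline b tightens the layer spacing, which is why the accuracy curves in Fig. 13.28 are parameterised by b.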
References
Amari, Y., Adelson, E.: Single-eye range estimation by using displaced apertures with color filters.
In: Proceedings of the International Conference on Industrial Electronics, Control, Instrumen-
tation and Automation, pp. 1588–1592 (1992)
Bae, Y., Manohara, H., White, V., Shcheglov, K.V., Shahinian, H.: Stereo imaging miniature
endoscope. Tech Briefs. Physical Sciences (2011)
Bando, Y.: How to disassemble the Canon EF 50mm F/1.8 II lens (2008). Accessed on
15 September 2020. http://web.media.mit.edu/~bandy/rgb/disassembly.pdf
Bando, Y., Chen, B.-Y., Nishita, T.: Extracting depth and matte using a color-filtered aperture.
ACM Trans. Graph. 27(5), 134:1–134:9 (2008)
Chakrabarti, A., Zickler, T.: Depth and deblurring from a spectrally-varying depth-of-field. In:
Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Lecture Notes in Computer
Science, vol. 7576, pp. 648–661. Springer, Berlin, Heidelberg (2012)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications
to imaging. J. Math. Imaging Vis. 40, 120–145 (2010)
Chang, J., Wetzstein, G.: Deep optics for monocular depth estimation and 3d object detection. In:
Proceedings of the IEEE International Conference on Computer Vision, pp. 10193–10202
(2019)
Chen, W., Xie, D., Zhang, Y., Pu, S.: All you need is a few shifts: designing efficient convolutional
neural networks for image classification. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 7234–7243 (2019)
Dansereau, D., Pizarro, O., Williams, S.: Decoding, calibration and rectification for lenselet-based
plenoptic cameras. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1027–1034 (2013)
Goodman, J.: Introduction to Fourier Optics. McGraw-Hill, New York (2008)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
Imatest. The SFRplus chart: features and how to photograph it (2014). Accessed on 15 September
2020. https://www.imatest.com/docs/
Johannsen, O., Heinze, C., Goldluecke, B., Perwaß, C.: On the calibration of focused plenoptic
cameras. In: Grzegorzek, M., Theobalt, C., Koch, R., Kolb, A. (eds.) Time-of-Flight and Depth
Imaging. Sensors, Algorithms, and Applications. Lecture Notes in Computer Science, vol.
8200, pp. 302–317. Springer, Berlin, Heidelberg (2013)
Lee, E., Kang, W., Kim, S., Paik, J.: Color shift model-based image enhancement for digital multi
focusing based on a multiple color-filter aperture camera. IEEE Trans. Consum. Electron. 56(2),
317–323 (2010)
Lee, S., Kim, N., Jung, K., Hayes, M.H., Paik, J.: Single image-based depth estimation using dual
off-axis color filtered aperture camera. In: Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing, pp. 2247–2251 (2013)
Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional camera
with a coded aperture. ACM Trans. Graph. 26(3), 70:1–70:10 (2007)
Mishima, N., Kozakaya, T., Moriya, A., Okada, R., Hiura, S.: Physical cue based depth-sensing by color coding with deaberration network. arXiv:1908.00329 (2019)
Moriuchi, Y., Sasaki, T., Mishima, N., Mita, T.: Depth from asymmetric defocus using color-
filtered aperture. The Society for Information Display. Book 1: Session 23: HDR and Image
Processing (2017). Accessed on 15 September 2020. https://doi.org/10.1002/sdtp.11639
Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: KinectFusion: real-time dense surface mapping and tracking. In: Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136 (2011)
Panchenko, I., Bucha, V.: Hardware accelerator of convolution with exponential function for image
processing applications. In: Proceedings of the 7th International Conference on Graphic and
Image Processing. International Society for Optics and Photonic, pp. 98170A–98170A (2015)
Panchenko, I., Paramonov, V., Bucha, V.: Depth estimation algorithm for color coded aperture
camera. In: Proceedings of the IS&T Symposium on Electronic Imaging. 3D Image Processing,
Measurement, and Applications, pp. 405.1–405.6 (2016)
Paramonov, V., Panchenko, I., Bucha, V.: Method and apparatus for image capturing and simul-
taneous depth extraction. US Patent 9,872,012 (2014)
Paramonov, V., Lavrukhin, V., Cherniavskiy, A.: System and method for shift-invariant artificial
neural network. RU Patent 2,656,990 (2016a)
Paramonov, V., Panchenko, I., Bucha, V., Drogolyub, A., Zagoruyko, S.: Depth camera based on
color-coded aperture. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 1, 910–918 (2016b)
Paramonov, V., Panchenko, I., Bucha, V., Drogolyub, A., Zagoruyko, S.: Color-coded aperture.
Oral presentation in 2nd Christmas Colloquium on Computer Vision, Skolkovo Institute of
Science and Technology (2016c). Accessed on 15 September 2020. http://sites.skoltech.ru/app/
data/uploads/sites/25/2015/12/CodedAperture_CCCV2016.pdf
Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1, 195–202 (2003)
Schmidt, J.D.: Numerical Simulation of Optical Wave Propagation with Examples in MATLAB.
SPIE Press, Bellingham (2010)
Sitzmann, V., Diamond, S., Peng, Y., Dun, X., Boyd, S., Heidrich, W., Heide, F., Wetzstein, G.:
End-to-end optimization of optics and image processing for achromatic extended depth of field
and super-resolution imaging. ACM Trans. Graph. 37(4), 114 (2018)
Trouvé, P., Champagnat, F., Besnerais, G.L., Druart, G., Idier, J.: Design of a chromatic 3d camera
with an end-to-end performance model approach. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops, pp. 953–960 (2013a)
Trouvé, P., Champagnat, F., Besnerais, G.L., Sabater, J., Avignon, T., Idier, J.: Passive depth
estimation using chromatic aberration and a depth from defocus approach. Appl. Opt. 52(29),
7152–7164 (2013b)
Tsuruyama, T., Moriya, A., Mishima, N., Sasaki, T., Yamaguchi, J., Kozakaya, T.: Optical filter,
imaging device and ranging device. US Patent Application US 20200092482 (2020)
Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled photography: mask
enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Trans.
Graph. 26(3), 69:1–69:12 (2007)
Voelz, D.G.: Computational Fourier Optics: A MATLAB Tutorial. SPIE Press, Bellingham (2011)
Wu, B., Wan, A., Yue, X., Jin, P.H., Zhao, S., Golmant, N., Gholaminejad, A., Gonzalez, J.E.,
Keutzer, K.: Shift: a zero FLOP, zero parameter alternative to spatial convolutions. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 9127–9135 (2018)
Zagoruyko, S., Chernov, V.: Fast depth map fusion using OpenCL. In: Proceedings of the
Conference on Low Cost 3D (2014). Accessed on 15 September 2020. http://www.lc3d.net/
programme/LC3D_2014_program.pdf
Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell.
22(11), 1330–1334 (2000)
Zhou, C., Lin, S., Nayar, S.K.: Coded aperture pairs for depth from defocus and defocus deblurring.
Int. J. Comput. Vis. 93(1), 53–72 (2011)
Chapter 14
An Animated Graphical Abstract
for an Image
14.1 Introduction
Modern image capture devices are capable of acquiring thousands of files daily.
Despite tremendous progress in the development of user interfaces for personal
computers and mobile devices, the approach for browsing large collections of
images has hardly changed over the past 20 years. Usually, a user scrolls through
the list of downsampled copies of images to find the one they want. This
downsampled image is called a thumbnail or icon. Figure 14.1 demonstrates a
screenshot of File Explorer in Windows 10 with icons of photos.
Browsing is time-consuming, and search is ineffective given the meaningless names of image files. It is often difficult to recognise the detailed content of the original image from the thumbnail, or to estimate its quality. From a downsampled copy of an image, it is almost impossible to assess blurriness, noisiness, or the presence of compression artefacts. Even when simply viewing photographs, a user is frequently forced to zoom in and scroll. The situation is much harder when browsing images that have a complex layout or are intended for special applications.
Figure 14.2 shows thumbnails of scanned documents. How can a required document be found effectively when optical character recognition is inapplicable? Icons of slices of X-ray computed tomography (CT) images of two different sandstones are shown in Fig. 14.3. Is it possible to identify a given sandstone based on the thumbnail? How can the quality of the slices be estimated?
In the viewing interface, a user needs a fast and handy way to see an abstract of the image. In general, the content of the abstract is application-specific.
Fig. 14.1 Large icons for photos in File Explorer for Win10
Fig. 14.2 Large icons for documents in File Explorer for Win10
Fig. 14.3 Large icons for slices of two X-ray microtomographic images in File Explorer for
Win10: (a) Bentheimer sandstone; (b) Fontainebleau sandstone
Nevertheless, common ideas for the creation of convenient interfaces for viewing of
images can be formulated: a user would like to see clearly the regions of interest and
to estimate the visual quality. In this chapter, we describe the technique for gener-
ating a thumbnail-size animation comprising transitions between the most important
zones of the image. Such an animated graphical abstract looks attractive and provides a
user-friendly way to browse large collections of images.
It is impossible to evaluate the noise level and blurriness because the cropped fragment
is downsized and the resulting thumbnail has a lower resolution in comparison with
the original image.
The latest advances in automatic cropping for thumbnailing relate to the application of deep neural networks. Esmaeili et al. (2017) and Chen et al. (2018) describe end-to-end fully convolutional neural networks for thumbnail generation without building an intermediate saliency map. Apart from their capability of preserving the aspect ratio, these methods have the same drawbacks as other cropping-based methods.
There are completely different approaches to thumbnail creation. To reflect the noisiness (Samadani et al. 2008) or blurriness (Koik and Ibrahim 2014) of the original image, these methods fuse the corresponding defects into the thumbnail. Such algorithms do not modify the image composition and better characterise the quality of the originals. However, it is still hard to recognise relatively small regions of interest because the thumbnail is much smaller than the original.
Far fewer publications are devoted to thumbnails of scanned documents. Berkner et al. (2003) describe the so-called SmartNail for browsing document images. A SmartNail consists of a selection of cropped and scaled document segments that are recomposed to fit the available display space while maintaining the recognisability of document images and the readability of text and keeping the layout close to that of the original document. Nevertheless, the overall initial view is destroyed, especially for small display sizes, and the layout alteration is sometimes judged negatively by observers. Berkner (2006) describes a method for determining the scale factor that preserves text readability and layout recognisability in the downsized image. Safonov et al. (2018) demonstrate the rescaling of images by retargeting. That approach allows the size of a scanned document to be decreased several times, but the preservation of text readability in small thumbnails remains an unsolved problem.
The algorithm for the generation of the animated graphical abstract comprises the following three key stages:
1. Detection of attention zones
2. Selection of a region for quality estimation
3. Generation of video frames, which are transitions between the zones and the
whole image
Obviously, the attention zones differ for various types of images. To demonstrate the advantages of the animated graphical abstract as a concept, we consider the following image types: conventional consumer photographs, images of scanned documents, and slices of X-ray microtomographic images of rock samples. For the most part, human faces are adequate for identifying the content of a photo. For photos that do not contain faces, salient regions can be considered visual attention zones. The title, headers, other emphasised text elements, and pictures are enough for the identification of a document. For the investigation of images acquired by tomography, we need to examine the regions of various substances.
For the visual estimation of blurriness, noise, compression artefacts, and the specific artefacts of CT images (Kornilov et al. 2019), observers should investigate a fragment of an image without any scaling. We propose several simple rules for the selection of an appropriate fragment: the fragment should contain at least one contrasting edge and at least one flat region, and the histogram of the fragment's brightness should be wide but without clipping at the limits of the dynamic range. These rules are employed for region selection in the central part of an image or inside the attention zones.
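These rules can be sketched as a simple scoring routine. The window size, step, and all thresholds below are illustrative assumptions rather than values from the implementation described in this chapter:

```python
import numpy as np

def fragment_score(patch, edge_thresh=30, flat_thresh=5, clip_frac=0.01):
    """Score a candidate fragment for quality estimation.

    Heuristics (illustrative thresholds): the fragment should contain
    at least one contrasting edge and at least one flat region, and its
    brightness histogram should be wide, without clipping at the limits
    of the dynamic range.
    """
    patch = patch.astype(np.float32)
    # Horizontal/vertical first differences approximate local contrast.
    gx = np.abs(np.diff(patch, axis=1))
    gy = np.abs(np.diff(patch, axis=0))
    has_edge = (gx.max() > edge_thresh) or (gy.max() > edge_thresh)
    has_flat = (gx.min() < flat_thresh) and (gy.min() < flat_thresh)
    hist, _ = np.histogram(patch, bins=256, range=(0, 256))
    clipped = (hist[0] + hist[255]) > clip_frac * patch.size
    nonzero = np.nonzero(hist)[0]
    width = nonzero[-1] - nonzero[0] if nonzero.size else 0
    if not (has_edge and has_flat) or clipped:
        return 0.0
    return float(width) / 255.0

def find_quality_fragment(gray, size=64, step=32):
    """Slide a window over the image and return the best-scoring box."""
    best, best_box = -1.0, None
    h, w = gray.shape
    for r in range(0, h - size + 1, step):
        for c in range(0, w - size + 1, step):
            s = fragment_score(gray[r:r + size, c:c + size])
            if s > best:
                best, best_box = s, (r, c, size, size)
    return best_box, best
```

The same scoring can be restricted to windows inside the attention zones instead of the whole image.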
It should be clear that, when selecting approaches for important zone detection, one should take into account the application scenario and the hardware platform limitations of the implementation. Fortunately, panning over an image during animation allows image content to be recognised even when important zones were detected incorrectly. For implementations portable to embedded platforms, we prefer techniques with low computational complexity and power consumption over more comprehensive ones.
Information about the humans in a photo is important for recognising the scene. Thus, it is reasonable to apply a face detection algorithm to detect attention zones in the photo. There are numerous methods for face detection. At present, methods based on deep neural networks demonstrate state-of-the-art performance (Zhang and Zhang 2014; Li et al. 2016; Bai et al. 2018). Nevertheless, we prefer the Viola-Jones face detector (Viola and Jones 2001), which has several effective multi-view implementations for various platforms. The number of false positives of the face detector can be decreased with additional skin tone segmentation and processing of downsampled images (Egorova et al. 2009).
356 I. V. Safonov et al.
We set the upper limit for the number of detected faces to four. If more faces are detected, then we select the largest regions. Figure 14.4 illustrates detected faces as attention zones.
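The logic of keeping at most four of the largest detections can be sketched as follows. The `select_face_zones` helper is illustrative; its (x, y, w, h) box format matches the output of OpenCV's Viola-Jones cascade detector:

```python
def select_face_zones(detections, max_faces=4):
    """Keep at most max_faces detected faces, preferring the largest.

    detections is a list of (x, y, w, h) boxes, e.g. the output of
    OpenCV's CascadeClassifier.detectMultiScale() for a Viola-Jones
    frontal-face cascade.
    """
    by_area = sorted(detections, key=lambda b: b[2] * b[3], reverse=True)
    return by_area[:max_faces]

# Typical use with OpenCV (not executed here):
#   cascade = cv2.CascadeClassifier(
#       cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
#   faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
#   zones = select_face_zones(list(map(tuple, faces)))
```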
Faces may characterise a photo very well, but many photos do not contain faces. In this case, an additional mechanism has to be used to detect attention zones. Thresholding a saliency map is one of the common ways of looking for attention zones. Again, deep neural networks perform well on this problem (Wang et al. 2015; Zhao et al. 2015; Liu and Han 2016), but we use a simpler histogram-based contrast technique (Cheng et al. 2014), which usually provides a reasonable saliency map. Figure 14.5 shows examples of attention zones detected based on the saliency map.
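A greyscale simplification of the histogram-based contrast idea can be sketched as follows. The per-level formulation and the 0.5 threshold are illustrative assumptions; the full method of Cheng et al. (2014) works with colour statistics and region-level contrast:

```python
import numpy as np

def hc_saliency(gray):
    """Histogram-contrast saliency (greyscale simplification of the
    HC method): a grey level is salient if it is far, on average,
    from the grey levels of all other pixels in the image."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    freq = hist.astype(np.float64) / gray.size
    levels = np.arange(256, dtype=np.float64)
    # Per-level saliency: sum_l freq(l) * |level - l|
    sal_per_level = np.abs(levels[:, None] - levels[None, :]) @ freq
    sal = sal_per_level[gray]
    return sal / (sal.max() + 1e-12)

def attention_mask(gray, thresh=0.5):
    """Binary attention mask obtained by thresholding the saliency map."""
    return hc_saliency(gray) >= thresh
```

Attention zones are then the bounding boxes of the connected regions of the thresholded mask.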
The majority of icons for document images look very similar, and it is difficult to distinguish them from one another. To recognise a document, it is important to see the title, emphasised blocks of text, and embedded pictures. There are many document layout analysis methods that allow the segmentation and detection of the various important regions of a document (Safonov et al. 2019). However, we do not need complete document segmentation to detect several attention zones, so we can use simple and computationally inexpensive methods.
We propose a fast algorithm to detect blocks of text with a very large font size, which correspond to the title and headers. The algorithm includes the following steps. First, the initial RGB image is converted to greyscale I. The next step is downsampling the original document image to a size that keeps text of 16–18 pt or greater recognisable. For example, a scanned document image with a resolution of 300 dpi should be downsampled five times; the resulting image of an A4 document has a size of about 700 × 500 pixels. Handling a greyscale downsampled copy of the initial image significantly decreases the processing time.
Downsized text regions look like a texture. These areas contain the bulk of the edges, so edge detection techniques can be applied to reveal text regions. We use Laplacian of Gaussian (LoG) filtering with zero crossing. LoG filtering is a convolution of the downsampled image I with the kernel k:

$$k(x, y) = \frac{\left(x^2 + y^2 - 2\sigma^2\right) k_g(x, y)}{2\pi\sigma^6 \sum_{x=-N/2}^{N/2} \sum_{y=-N/2}^{N/2} k_g(x, y)},$$

where k_g(x, y) is the Gaussian kernel; N is the size of the convolution kernel; σ is the standard deviation; and (x, y) are the coordinates of the Cartesian system with the origin at the centre of the kernel.
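Under the assumption that k_g is the Gaussian kernel, the normalised LoG kernel can be built directly from this formula (N = 9 and σ = 1.4 are illustrative values):

```python
import numpy as np

def log_kernel(N=9, sigma=1.4):
    """Laplacian-of-Gaussian kernel following the chapter's formula:
    k(x, y) = (x^2 + y^2 - 2*sigma^2) * k_g(x, y)
              / (2*pi*sigma^6 * sum of k_g over the N x N support),
    where k_g is assumed to be the Gaussian kernel."""
    half = N // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kg = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    k = (x**2 + y**2 - 2.0 * sigma**2) * kg
    return k / (2.0 * np.pi * sigma**6 * kg.sum())
```

The kernel is symmetric, negative at the centre, and sums to approximately zero, so flat regions filter to near-zero response.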
The zero-crossing approach with a fixed threshold T is preferable for edge segmentation: a pixel (r, c) of the binary image BW is set to 1 if the outcome Ie of LoG filtering changes sign between (r, c) and a neighbouring pixel and the magnitude of this difference exceeds T, and to 0 otherwise.
For segmentation of text regions, we look for the pixels that have a lot of edge pixels in the vicinity:

$$L(r, c) = \begin{cases} 1, & \displaystyle\sum_{i=r-dr/2}^{r+dr/2} \sum_{j=c-dc/2}^{c+dc/2} BW(i, j) > T_t \\ 0, & \text{otherwise,} \end{cases} \qquad \forall\, r, c,$$

where L is the image of segmented text regions; dr and dc are the sizes of blocks; and T_t is a threshold.
In addition to text, regions corresponding to vector graphics such as plots and diagrams are segmented too. Further steps are the labelling of connected regions in L and the calculation of their bounding boxes. Regions with a small height or width are eliminated.
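The window sums in the definition of L(r, c) can be computed in O(1) per pixel with an integral image; a sketch, with illustrative dr, dc, and T_t values:

```python
import numpy as np

def segment_text_regions(bw, dr=16, dc=16, t_t=20):
    """Compute L(r, c): 1 where the number of edge pixels of BW inside
    the dr x dc vicinity of (r, c) exceeds the threshold T_t.
    The default dr, dc, and t_t are illustrative values, not the
    settings of the chapter's implementation."""
    bw = bw.astype(np.int64)
    h, w = bw.shape
    # Integral image with a zero border row and column.
    ii = np.zeros((h + 1, w + 1), np.int64)
    ii[1:, 1:] = bw.cumsum(0).cumsum(1)
    r = np.arange(h)
    c = np.arange(w)
    r0 = np.clip(r - dr // 2, 0, h)
    r1 = np.clip(r + dr // 2 + 1, 0, h)
    c0 = np.clip(c - dc // 2, 0, w)
    c1 = np.clip(c + dc // 2 + 1, 0, w)
    # Window sum via four integral-image lookups per pixel.
    S = (ii[np.ix_(r1, c1)] - ii[np.ix_(r0, c1)]
         - ii[np.ix_(r1, c0)] + ii[np.ix_(r0, c0)])
    return S > t_t
```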
In the next steps, the average character size is calculated for each text region, and several zones with a large character size are selected. Let us consider how to calculate the average size of the characters of a text region, which corresponds to some connected region Z in the image L.
Figure 14.6 illustrates our approach to the detection of text regions. Detected text regions L are marked in green. The image Z consists of all the connected regions. The average size of characters is calculated for the dark connected areas inside the green region. This is a reliable way to detect the region of the title of a paper.
At the final stage of our approach, photographic illustrations are identified because they are also important for document recognition. For the detection of embedded photos, the image I is divided into non-overlapping blocks of size N × M. For each block, the energy E_i of the normalised grey level co-occurrence matrix is calculated:

$$N_I(i, j) = \frac{C_I(i, j)}{\sum_i \sum_j C_I(i, j)},$$

$$E_i = \sum_i \sum_j N_I^2(i, j),$$

where C_I is the grey level co-occurrence matrix of the block.
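The two formulas can be sketched as follows; the quantisation to eight grey levels and the (0, 1) pixel offset used to build C_I are illustrative choices:

```python
import numpy as np

def glcm_energy(block, levels=8, offset=(0, 1)):
    """Energy of the normalised grey level co-occurrence matrix of one
    block: N_I = C_I / sum(C_I), E = sum(N_I**2). Uniform blocks give
    energy close to 1; textured (photographic) blocks give low energy."""
    q = (block.astype(np.int32) * levels) // 256  # quantise grey levels
    dr, dc = offset
    # Pair each pixel with its neighbour at the given offset.
    a = q[max(0, -dr):q.shape[0] - max(0, dr),
          max(0, -dc):q.shape[1] - max(0, dc)]
    b = q[max(0, dr):, max(0, dc):][:a.shape[0], :a.shape[1]]
    C = np.zeros((levels, levels), np.float64)
    np.add.at(C, (a.ravel(), b.ravel()), 1.0)
    N = C / C.sum()
    return float((N**2).sum())
```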
The bounding box of the region with the largest area defines the zone of the embedded photo. Figure 14.7 shows the outcomes of the detection of the blocks related to a photographic illustration inside the document image.
Fig. 14.7 The results of detection of a photographic illustration inside a document image
At the beginning, the order of the zones is selected for animation creation. The first frame always represents the whole downsampled image, that is, the conventional thumbnail. The subsequent zones are ordered to provide the shortest path across the image while moving between attention zones. The animation can be looped; in this case, the final frame is the whole image too. The animation simulates the following camera effects: tracking in, tracking out, and panning between attention zones; slow panning across a large attention zone; and pausing on the zones. Tracking-in, tracking-out, and panning effects between two zones are created by constructing a sequence of N frames. Each frame of the sequence is prepared with the following steps:
1. Calculation of the coordinates of a bounding box for the cropping zone using the line equation in parametric form:

$$x = x_1 + t (x_2 - x_1), \qquad y = y_1 + t (y_2 - y_1),$$

where (x1, y1) are the coordinates of the start zone; (x2, y2) are the coordinates of the end zone; and t is the parameter, which is increased from 0 to 1 with step dt = 1/(N − 1)
2. Cropping the image using the coordinates of the calculated bounding box, preserving the aspect ratio
3. Resizing of the cropped image to the target size
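Step 1 can be sketched as a routine that interpolates the cropping box between two zones. Extending the same linear interpolation to the box size is an assumption here, and the aspect-ratio adjustment of step 2 is left out:

```python
def zone_transition_boxes(start, end, n_frames):
    """Bounding boxes for an n_frames-long transition between two zones.

    start and end are (x, y, w, h) boxes; both the position and the size
    are interpolated linearly with t = i / (n_frames - 1), following the
    parametric line equation of step 1. Requires n_frames >= 2.
    """
    x1, y1, w1, h1 = start
    x2, y2, w2, h2 = end
    boxes = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # step dt = 1 / (N - 1)
        boxes.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1),
                      w1 + t * (w2 - w1), h1 + t * (h2 - h1)))
    return boxes
```

Each returned box is then cropped from the image and resized to the target thumbnail size (steps 2 and 3).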
Figure 14.8 demonstrates an example of the animated graphical abstract for a photo (Safonov and Bucha 2010). Two faces are detected, and the hands of the kids are selected as the region for quality estimation. The animation consists of four transitions between the whole image and these three zones. The first sequence of frames looks like a camera tracking in to a face; the frame is then frozen for a moment to focus on the zoomed face. The second sequence of frames looks like a camera panning between the faces. The third sequence looks like a camera panning and zooming in between the face and the hands; the frame with the hands is then frozen for a moment for visual quality estimation. The final sequence of frames looks like a camera tracking out to the whole scene, and a freeze frame takes place again.
Figure 14.9 demonstrates an example of the animated graphical abstract for the image of a scanned document. The title and an embedded picture are detected in the first stage. The fragment of the image with the title is appropriate for quality estimation. The animation consists of four transitions between the entire image and these two zones, as well as viewing of the relatively large title zone. The first sequence of frames looks like a camera tracking in to the left side of the title zone. The second sequence of frames looks like a slow camera panning across the title zone; the frame is then frozen for a moment for visual quality estimation. The third sequence of frames looks like a camera panning from the right side of the title zone to the picture inside the document, after which the frame is frozen for a moment. The final sequence of frames looks like a camera tracking out to the entire page. Finally, the frame with the entire page is frozen for a moment. This sequence of frames allows the image content to be identified confidently.
For CT images, the animation is created between zones having different characteristics according to visual similarity and an image fragment without scaling, which is used for quality estimation. In contrast to the two previous examples, panning across a tomographic image often does not clearly show the location of a zone within the image. It is preferable to make transitions between zones via the intermediate entire slice.
Fig. 14.10 Conventional large icons in the task detection of photos with a certain person
Fig. 14.11 Frames of the animated graphical abstract in the task detection of photos with a certain
person
answer. It is only a little better than random guessing. Two respondents had better results than the others because they had much experience in photography and understood the shooting conditions that can cause a blurred photo. The animated graphical abstract demonstrates zoomed-in fragments of the photo and allows low-quality photos to be identified. In our survey, 90% of respondents detected the proper photos by viewing the animation frames.
Figure 14.13 shows enlarged fragments of sharp photos in the top row and of blurred photos in the bottom row. The difference is obvious, and the blurriness is detectable. The remaining 10% of errors are probably explained by subjective interpretation of the concept of blurriness. Indeed, sharpness and blurriness are not strictly formalised and depend on viewing conditions.
The third task was the selection of the two scanned images that represent documents related to the Descreening topic. The total number of documents was nine. Figure 14.14 shows the conventional large icons of the scanned images used in that survey. For the icons, the percentage of correct answers was 20%. In general, it is impossible to solve this task properly using conventional thumbnails of small size. On the contrary, animated graphical abstracts provide a high level of correct answers: 80% of respondents selected both pages related to Descreening, thanks to zooming and panning through the titles of the papers, as shown in Fig. 14.15.
The fourth task was the classification of the sandstones whose CT image slice icons are shown in Fig. 14.3. Researchers experienced in material science can probably classify those slices from the icons more or less confidently. However, the participants in our survey had no such skills. They were instructed to make decisions based on the following text description: Bentheimer sandstone has middle-sized, sub-angular, non-uniform grains with 10–15% inclusions of other substances such as spar and clay; Fontainebleau sandstone has large, angular, uniform grains. For the icons of the slices, the percentage of right answers was 50%, which corresponds to random guessing. The enlarged fragments of the images in the animation frames allow the sandstones to be classified properly: 90% of respondents gave the right answers. Figure 14.16 shows examples of frames with enlarged fragments of slices.
Fig. 14.12 Conventional large icons in the task selection of blurred photos
Fig. 14.13 Frames of animated graphical abstract in the task selection of blurred photos: top row
for sharp photos, bottom row for blurred photos
The final task was the identification of the noisiest tomographic image. We scanned the same sample six times with different exposure times and numbers of frames for averaging (Kornilov et al. 2019). A longer exposure time and a greater number of frames for averaging produce a high-quality image, whereas a short exposure time and the absence of averaging correspond to noisy images. A conventional thumbnail does not allow the noise level to be estimated: the icons of the slices for all six images look almost identical. That is why only 20% of respondents could identify the noisiest image correctly. Frames of the animation containing zoomed fragments of slices make it easy to assess the noise level; 80% of respondents identified the noisiest image by viewing the animated graphical abstract.
Table 14.1 contains the results for all tasks of our survey. The animated graphical abstract provides the capability of recognising image content and estimating quality confidently, and it considerably outperforms conventional icons and thumbnails. In addition, such animation is an impressive way to navigate through image collections in software for PCs, mobile applications, and widgets. The idea of the animated graphical abstract can be extended to other types of files, for example, PDF documents.
Fig. 14.14 Conventional large icons in the task selection of documents related to the given topic
Fig. 14.15 Frames of animated graphical abstract with panning through the title of the document
Fig. 14.16 Frames of animated graphical abstracts in the task classification of the type of sandstone
References
Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: Finding tiny faces in the wild with generative adversarial
network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 21–30 (2018)
Berkner, K.: How small should a document thumbnail be? In: Digital Publishing. SPIE. 6076,
60760G (2006)
Berkner, K., Schwartz, E.L., Marle, C.: SmartNails – display and image dependent thumbnails. In:
Document Recognition and Retrieval XI. SPIE. 5296, 54–65 (2003)
Chen, H., Wang, B., Pan, T., Zhou, L., Zeng, H.: Cropnet: real-time thumbnailing. In: Proceedings
of the 26th ACM International Conference on Multimedia, pp. 81–89 (2018)
Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M.: Global contrast based salient region
detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 569–582 (2014)
Egorova, M.A., Murynin, A.B., Safonov, I.V.: An improvement of face detection algorithm for
color photos. Pattern Recognit. Image Anal. 19(4), 634–640 (2009)
Esmaeili, S.A., Singh, B., Davis, L.S.: Fast-at: fast automatic thumbnail generation using deep
neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 4622–4630 (2017)
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis.
IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
Koik, B.T., Ibrahim, H.: Image thumbnail based on fusion for better image browsing. In: Pro-
ceedings of the IEEE International Conference on Control System, Computing and Engineering,
pp. 547–552 (2014)
Kornilov, A., Safonov, I., Yakimchuk, I.: Blind quality assessment for slice of microtomographic
image. In: Proceedings of the 24th Conference of Open Innovations Association (FRUCT),
pp. 170–178 (2019)
Kornilov, A.S., Reimers, I.A., Safonov, I.V., Yakimchuk, I.V.: Visualization of quality of 3D
tomographic images in construction of digital rock model. Sci. Vis. 12(1), 70–82 (2020)
Li, Y., Sun, B., Wu, T., Wang, Y.: Face detection with end-to-end integration of a ConvNet and a 3d
model. In: Proceedings of the European Conference on Computer Vision, pp. 420–436 (2016)
Lie, M.M., Neto, H.V., Borba, G.B., Gamba, H.R.: Automatic image thumbnailing based on fast
visual saliency detection. In: Proceedings of the 22nd Brazilian Symposium on Multimedia and
the Web, pp. 203–206 (2016)
Lin, K.C.: On improvement of the computation speed of Otsu’s image thresholding. J. Electron.
Imaging. 14(2), 023011 (2005)
Liu, N., Han, J.: DHSNet: deep hierarchical saliency network for salient object detection. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 678–686 (2016)
Ramalho, G.L.B., Ferreira, D.S., Rebouças Filho, P.P., de Medeiros, F.N.S.: Rotation-invariant
feature extraction using a structural co-occurrence matrix. Measurement. 94, 406–415 (2016)
Safonov, I.V., Bucha, V.V.: Animated thumbnail for still image. In: Proceedings of the Graphicon
conference, pp. 79–86 (2010)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algo-
rithms for Printing. Springer Nature Singapore AG, Singapore (2018)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Samadani, R., Mauer, T., Berfanger, D., Clark, J., Bausk, B.: Representative image thumbnails:
automatic and manual. In: Human Vision and Electronic Imaging XIII. SPIE. 6806, 68061D
(2008)
Suh, B., Ling, H., Bederson, B.B., Jacobs, D.W.: Automatic thumbnail cropping and its effective-
ness. In: Proceedings of ACM UIST (2003)
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 511–518 (2001)
Wang, L., Lu, H., Ruan, X., Yang, M.H.: Deep networks for saliency detection via local estimation
and global search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3183–3192 (2015)
Zhang, C., Zhang, Z.: Improving multi-view face detection with multi-task deep convolutional
neural networks. In: IEEE Winter Conference on Applications of Computer Vision,
pp. 1036–1041 (2014)
Zhao, R., Ouyang, W., Li, H., Wang, X.: Saliency detection by multi-context deep learning. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1265–1274 (2015)
Chapter 15
Real-Time Video Frame-Rate Conversion
15.1 Introduction
The problem of increasing the frame rate of a video stream began to attract attention long ago, in the mid-1990s, just as TV screens started to get larger and the stroboscopic effect caused by the discretisation of smooth motion became more apparent. The early 100 Hz CRT TVs had a low-resolution image (PAL/NTSC), and the main point was to get rid of the inherent CRT flicker, but with ever larger LCD TV sets, the need for good real-time frame-rate conversion (FRC) algorithms kept increasing.
The problem setup is to analyse the motion of objects in a video stream and to create new frames that follow the same motion. Obviously, with high-resolution content this is a computationally demanding task: frame data must be analysed, and new frames interpolated and composed, in real time. In the TV industry, the problem of computational load was solved by dedicated FRC chips with highly optimised circuitry and without strict limitations on power consumption.
The prevailing customer of Samsung R&D Institute Russia was the mobile division, so we proposed to bring the FRC “magic” to the smartphone segment. The computational performance of mobile SoCs was steadily increasing, even more so on the GPU side. The first use cases for FRC were reasonably chosen to have a limited duration, so that the increased power consumption was not a catastrophic problem. The use cases were Motion Photo playback and Super Slow Motion capture.
We expected that, besides reasonably good quality, we would have to deliver a solution working smoothly in real time on a mobile device, possibly with power consumption limitations. These requirements more or less dictate the use of block-wise motion vectors, which dramatically decreases the complexity of all parts of the FRC algorithm.
The high-level structure of an FRC algorithm is, with slight variations, shared between many variants of FRC (Cordes and de Haan 2009). The main stages of the algorithm are:
1. Motion estimation (ME) – usually analyses two consecutive video frames and returns motion vectors suitable for tracking objects, the so-called true motion (de Haan et al. 1993; Pohl et al. 2018).
2. Occlusion processing (OP) and preparation of data for MCI – analyses several motion vector fields, makes decisions about appearing and disappearing areas, and creates data to guide their interpolation. Often, this stage modifies motion vectors (Bellers et al. 2007).
3. Motion compensated interpolation (MCI) – takes data from OP (occlusion-corrected motion vectors and weights) and produces the interpolated frame.
4. Fallback logic – keyframe repetition is applied instead of FRC processing if the input is complex video content (the scene changes, or highly nonlinear or extremely fast motion appears). In this case, strong interpolation artefacts are replaced by judder, globally or locally, which is visually better.
We had to develop a purely SW (software) FRC algorithm for the fastest possible commercialisation and prospective support of devices already released to the market. In our case, purely SW means an implementation that uses any number of available CPU cores and the GPU via the OpenCL standard. This gave us a few advantages:
• Simple upgrades of the algorithm in case severe artefacts are found in some specific types of video content
• Relatively simple integration with existing smartphone software
• Rapid implementation of possible new scenarios
• Freedom from some hardware limitations, such as a small amount of cached frame memory and one-pass motion estimation
We chose to use 8 × 8 basic blocks, but for higher resolutions, it is possible to increase the block size in the ME stage, with further upscaling of the motion vectors for subsequent stages.
Fig. 15.1 Trees of direct dependencies (green and red) and areas of indirect dependencies (light
green and light red) for a meandering order (left). Top-to-bottom meandering order used for forward
ME (right top). Bottom-to-top meandering order used for backward ME (right bottom)
376 I. M. Kovliga and P. Pohl
improve the computational cost without any noticeable degradation in the quality of
the resulting motion field.
The 3DRS algorithm is based on block matching, using a frame divided into blocks
of pixels, where X = (x, y) are the pixel-wise coordinates of the centre of the block.
Our FRC algorithm requires two motion fields for each pair of consecutive frames
F_{t-1}, F_t. The forward motion field D_{FW,t-1}(X) is the set of motion vectors assigned
to the blocks of F_{t-1}; these motion vectors point to frame F_t. The backward motion
field D_{BW,t}(X) is the set of motion vectors assigned to the blocks of F_t; these motion
vectors point to frame F_{t-1}.
To obtain a motion vector for each block, we try only a few candidates, as
opposed to an exhaustive search that tests all possible motion vectors for each
block. The candidates we try in each block are called a candidate set. The rules for
selecting motion vectors in the candidate set are the same for each block. We use the
following rules (CS – candidate set) to search the current motion field D_cur:
where W and H are the width and height of a block (we use 8×8 blocks); rnd(k) is a
function whose result is a random value from the range {-k, -k+1, ..., k}; MAD is
the mean absolute difference between the window over the current block B(X) of one
frame and the window over the block pointed to by a motion vector in another frame;
and the size of the windows is 16×12. D_cur is the motion vector from the current
motion field, and D_pred is a predictor obtained from the previously found motion
field. If the forward motion field D_{FW,t-1}(X) is searched, then the predictor will be
(-D_{BW,t-1}(X)); if the backward motion field D_{BW,t}(X) is searched, then the
predictor P_{BW,t} will be formed from D_{FW,t-1}(X) by projecting it onto the block
grid of frame F_t with subsequent inversion: P_{BW,t}(X + D_{FW,t-1}(X)) = -D_{FW,t-1}(X).
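The projection-with-inversion step can be illustrated as follows. This is a simplified Python sketch under stated assumptions: one vector per block indexed on the block grid, with the landing position rounded to whole blocks; details such as several vectors landing on the same block are ignored.

```python
def project_and_invert(fwd_field, block_w, block_h, grid_w, grid_h):
    """Build a backward-ME predictor field P_BW,t from D_FW,t-1.

    Each forward vector assigned to block (bx, by) of frame F_{t-1} is
    projected to the block of frame F_t that it points at, then inverted:
        P_BW,t(X + D_FW,t-1(X)) = -D_FW,t-1(X)
    `fwd_field` maps block indices (bx, by) to pixel vectors (dx, dy).
    """
    pred = {}
    for (bx, by), (dx, dy) in fwd_field.items():
        # Landing block on the grid of frame F_t (rounded to block units).
        tx = bx + round(dx / block_w)
        ty = by + round(dy / block_h)
        if 0 <= tx < grid_w and 0 <= ty < grid_h:
            pred[(tx, ty)] = (-dx, -dy)   # inversion
    return pred


# A vector of one block width moves block (0, 0) onto block (1, 0).
pred = project_and_invert({(0, 0): (8, 0)}, 8, 8, 4, 4)
```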
In fact, two ME passes are used for each pair of frames: the first pass is an
estimation of the forward motion field, and the second pass is an estimation of the
15 Real-Time Video Frame-Rate Conversion 377
Fig. 15.2 Sources of spatial and temporal candidates for a block (marked by a red box) in an even
row during forward ME (left); sources of a block in an odd row during forward ME (right)
backward motion field. We use different scanning orders for the first and second
passes. The top-to-bottom meandering scanning order (from top to bottom and from
left to right for odd rows and from right to left for even rows) is used for the first pass
(right-top image of Fig. 15.1). A bottom-to-top meandering scanning order (from
bottom to top and from right to left for odd rows and from left to right for even rows;
rows are numbered from bottom to top in this case) is used for the second pass (right-
bottom image of Fig. 15.1).
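Both meandering orders can be generated as follows. This is an illustrative Python sketch; rows and columns are 0-indexed here, so the "odd rows" of the text correspond to even indices.

```python
def meandering_order(rows, cols, top_to_bottom=True):
    """Yield block coordinates (row, col) in a meandering scan.

    Top-to-bottom order (first row left-to-right) is used for forward ME;
    for backward ME the pattern runs bottom-to-top, with the first
    (bottom) row scanned right-to-left, as described in the text.
    """
    row_range = range(rows) if top_to_bottom else range(rows - 1, -1, -1)
    for i, r in enumerate(row_range):
        # Row parity decides the horizontal direction of recursion.
        left_to_right = (i % 2 == 0) if top_to_bottom else (i % 2 == 1)
        col_range = range(cols) if left_to_right else range(cols - 1, -1, -1)
        for c in col_range:
            yield (r, c)


fwd_order = list(meandering_order(2, 3))                        # first pass
bwd_order = list(meandering_order(2, 3, top_to_bottom=False))   # second pass
```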
The relative positions UScur of the spatial candidate set CSspatial and the relative
positions of the temporal candidate set CStemporal in the above description are valid
only for top-to-bottom and left-to-right directions. If the direction is inverted for some
coordinates, then the corresponding coordinates in UScur and USpred should be
inverted accordingly. Thus, the direction of recursion changes in a meandering
scanning order from row to row. In Fig. 15.2, we show the sources of spatial
candidates (CSspatial, green blocks) and temporal candidates (CStemporal, orange blocks)
for two scan orders. On the left-hand side, this is the top-to-bottom, left-to-right
direction, and on the right-hand side, it is the top-to-bottom, right-to-left direction.
The block erosion process (de Haan et al. 1993) was skipped in our modification
of 3DRS. For additional smoothing of the backward motion field, we applied
additional regularisation after ME. For each block, we compared MADnarrow(X, D)
for the current motion vector D of the block and MADnarrow(X, Dmedian), where
Dmedian is a vector obtained by per-coordinate combining of the median values of
a set consisting of the nine motion vectors from an 8-connected neighbourhood and
from the current block itself. Here, MAD_narrow is different from the MAD used for
3DRS matching: the size of the window for MAD_narrow is decreased to 8×8
(equal to the block size). The original motion vector is overwritten by D_median if
MAD_narrow(X, D_median) is better, or worse by only a small margin.
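The regularisation step can be sketched as follows. This is a simplified Python sketch: `mad_narrow` is a stand-in for the 8×8-window matching cost, the margin value is illustrative, and missing neighbours at frame borders simply fall back to the block's own vector.

```python
from statistics import median

def regularise(field, mad_narrow, margin=1.0):
    """Post-ME smoothing of a motion field: replace each block's vector
    with the per-coordinate median over its 8-connected neighbourhood
    (plus the block itself), unless the narrow-window MAD of the median
    vector is worse by more than a small margin.

    `field` is a dict {(bx, by): (dx, dy)};
    `mad_narrow(block, vec)` returns the matching cost (a stand-in here).
    """
    out = {}
    for (bx, by), vec in field.items():
        neigh = [field.get((bx + i, by + j), vec)
                 for i in (-1, 0, 1) for j in (-1, 0, 1)]
        d_median = (median(v[0] for v in neigh),
                    median(v[1] for v in neigh))
        if mad_narrow((bx, by), d_median) <= mad_narrow((bx, by), vec) + margin:
            out[(bx, by)] = d_median
        else:
            out[(bx, by)] = vec
    return out


field = {(x, y): (0, 0) for x in range(3) for y in range(3)}
field[(1, 1)] = (10, 10)                      # a single outlier vector
smoothed = regularise(field, lambda block, vec: 0.0)
```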
The use of a meandering scanning order in combination with the candidate set
described above prevents the possibility of parallel processing several blocks of a
given motion field. This is illustrated in Fig. 15.1. The blocks marked in green
should be processed before starting the processing of the current block (marked in
white in a red box) due to their direct dependency via the spatial candidate set and the
blocks marked in red, which will be directly affected by the estimated motion vector
in the current block. The light green and light red blocks mark indirect dependencies.
Thus, there are no blocks that can be processed simultaneously with the current
block since all blocks need to be processed either before or after the current block.
In Al-Kadi et al. (2010), the authors propose a parallel processing scheme which
preserves the direction switching of a meandering order. The main drawback is that
the spatial candidate set is not optimal, since the upper blocks for some threads are
not yet processed at the beginning of row processing.
If we change the meandering order to a simple “raster scan” order (always from
left to right in each row), then the dependencies become smaller (see Fig. 15.3a).
We propose changing the scanning order to achieve wave-front parallel processing,
as proposed in the HEVC standard (Chi et al. 2012), or a staggered approach as shown
in Fluegel et al. (2006). A wave-front scanning order is depicted in Fig. 15.3b, together
with the dependencies and sets of blocks which can be processed in parallel
(highlighted in blue). The set of blue blocks is called a wave-front.
In the traditional method of using wave-front processing, each thread works on
one row of blocks. When a block is processed, the thread should be synchronised
with a thread that works on an upper row, in order to preserve the dependency (the
upper thread needs to stay ahead of the lower thread). This often produces stalls due
to the different times needed to process different blocks. Our approach is different;
working threads process all blocks belonging to a wave-front independently. Thus,
synchronisation that produces a stall is performed only when the processing of the
next wave-front starts, and even this stall may not happen since the starting point of
the next wave-front is usually ready for processing, provided that the number of
tasks is at least several times greater than the number of cores. The proposed
approach therefore eliminates the majority of stalls.
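The wave-front schedule and its single synchronisation point per wave-front can be sketched as follows. This is an illustrative Python sketch under an assumption: the spatial dependencies of a block are its left and up-right neighbours, which yields the HEVC-style wave-front index 2·row + col; the real candidate set of the chapter differs in detail.

```python
from concurrent.futures import ThreadPoolExecutor

def wavefront_schedule(rows, cols):
    """Group blocks into wave-fronts. With dependencies on (r, c-1) and
    (r-1, c+1), all blocks sharing the index 2*r + c are independent."""
    fronts = {}
    for r in range(rows):
        for c in range(cols):
            fronts.setdefault(2 * r + c, []).append((r, c))
    return [fronts[k] for k in sorted(fronts)]

def run_wavefronts(rows, cols, process_block, workers=4):
    """Process each wave-front with a thread pool. The only
    synchronisation point is the barrier between consecutive
    wave-fronts, avoiding the per-block stalls of row-paired threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for front in wavefront_schedule(rows, cols):
            list(pool.map(process_block, front))   # barrier per front


order = []                          # record the processing order
run_wavefronts(3, 4, order.append, workers=2)
```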
The wave-front scanning order changes the resulting motion field, because it uses
only “left to right” relative positions UScur and USpred during the estimation of
forward motion and only “right to left” for backward motion. In contrast, the
meandering scanning order switches between “left to right” and “right to left” after
processing every row of blocks.
Fig. 15.3 Parallel processing several blocks of a motion field: (a) trees of dependencies for a raster
(left-right) scan order; (b) wave-front scanning order; (c) slanted wave-front scanning order (two
blocks in one task)
The proposed wave-front scanning order has an inconvenient memory access pattern
and hence uses the cache of the processor ineffectively. For a meandering scanning
order with smooth motion, the memory accesses are serial, and frame data stored in
the cache are reused effectively. The main direction of the wave-front scanning order
is diagonal, which nullifies the advantage of a long cache line and degrades the reuse
of the data in the cache. As a result, the number of memory accesses (cache misses)
increases.
To solve this problem, we propose using several blocks placed in raster order as
one task for parallel processing (see Fig. 15.3c, where each task consists of two
blocks). We call this modified order a slanted wave-front scanning order. This
solution changes only the scanning order, and not the directions of recursion, so
only rnd(k) influences the resulting motion field. If rnd(k) is a function of X (the
spatial position in the frame) and the number of calls in X, then the results will be
exactly the same as for the initial wave-front scanning order. The quantity of blocks
in one task can vary; a greater number is better for the cache but can limit the number
of parallel tasks. Reducing the quantity of tasks limits the maximum number of CPU
cores used effectively but also reduces the overhead for thread management.
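Grouping raster-ordered blocks into tasks can be sketched as follows. This is an illustrative Python sketch; the dependency reasoning in the comments uses the same simplified left/up-right assumption as above, and the chosen slant index is one possible way to keep it valid per task.

```python
def slanted_wavefront_tasks(rows, cols, blocks_per_task=2):
    """Slanted wave-front: each task is a short run of horizontally
    consecutive blocks processed in raster order inside the task, so
    memory access within a task stays serial and cache-friendly.

    With task-level dependencies on the left task and the up-right task,
    tasks sharing the index 2*r + t (r = row, t = task index in the row)
    form one wave-front and can run in parallel.
    """
    tasks_per_row = (cols + blocks_per_task - 1) // blocks_per_task
    fronts = {}
    for r in range(rows):
        for t in range(tasks_per_row):
            run = [(r, c) for c in range(t * blocks_per_task,
                                         min((t + 1) * blocks_per_task, cols))]
            fronts.setdefault(2 * r + t, []).append(run)
    return [fronts[k] for k in sorted(fronts)]


fronts = slanted_wavefront_tasks(2, 4, blocks_per_task=2)
```

A larger `blocks_per_task` improves cache reuse at the price of fewer parallel tasks, which is exactly the trade-off discussed in the text.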
The computational cost for motion estimation can be represented as a sum of the
costs for a calculation of the MAD and the cost of the control programme code
(managing a scanning order, construction of a candidate set, optimisations related to
skipping the calculation of the MAD for the same candidates, and so on). During our
experiments, we found that a significant share of the computation was spent on the
control code. To decrease this overhead, we introduce double-block processing. This
means that one processing unit consists of a horizontal pair of neighbouring blocks
(called a double block) instead of a single block. The use of double-block processing
allows us to reduce almost all control cycles by half. One candidate set is considered
for both blocks of this double block. For example, in forward ME, a candidate set CS
(X) from the left block of a pair is also used for the right block of the pair. However,
calculation of the MAD is performed individually for each block of the pair, and the
best candidate is considered separately for each block of the pair. This point
distinguishes double-block processing from a horizontal enlargement of the block.
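Double-block processing can be sketched as follows. This is a minimal Python sketch: `mad` is a stand-in for the matching cost, and the blocks are identified by opaque labels for brevity.

```python
def double_block_search(block_a, block_b, candidates, mad):
    """Double-block processing: a single candidate set (built for the
    left block A) is evaluated for both blocks of the horizontal pair,
    but the MAD is computed and the winner chosen individually per
    block -- unlike a simple horizontal enlargement of the block.

    `mad(block, vec)` returns the matching cost of vector `vec`.
    """
    best = {}
    for block in (block_a, block_b):
        best[block] = min(candidates, key=lambda v: mad(block, v))
    return best[block_a], best[block_b]


# The shared candidates are (0, 0) and (2, 0); each block picks its own
# best vector according to its individual MAD (a toy cost function here).
mad = lambda block, vec: vec[0] if block == "A" else -vec[0]
best_a, best_b = double_block_search("A", "B", [(0, 0), (2, 0)], mad)
```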
A slanted wave-front scanning order where one task consists of two double-block
units is shown in Fig. 15.4. The centre of each double-block unit is marked by a red
point, and the left and right blocks of the double-block unit are separated by a red
line. The current double-block unit is highlighted by a red rectangle. For this unit, the
sources are shown for the spatial candidate set (green blocks) and for the temporal
candidate set (orange blocks) related to block A. The same sources are used for block
B, which belongs to the same double block as A. Thus, the same motion vector
candidates are tried for both blocks A and B, including the random ones.
However, a true candidate set for block B may be useful when the candidate set of
block A yields results (the best MAD values) for blocks A and B that differ too
much; this is possible on some edges of a moving object or when an object moves
nonlinearly. We therefore propose an additional step for the analysis of MAD
values, which are related to the best motion vectors of blocks A and B of the double-
block unit. Depending on results of this step, we either accept the previously
estimated motion vectors or carry out a motion estimation procedure for block B.
If x_A = (x, y) and x_B = (x + W, y) are the centres of the two blocks of a double-block
pair, we can introduce the following decision rule for the additional analysis of
block B candidates:
The quality of the proposed modifications was checked for various Full HD video
streams (1920×1080) with the help of the FRC algorithm. To get ground truth, we
down-sampled the video streams from 30fps to 15fps and then up-converted them
back to 30fps with the help of motion fields obtained by the tested ME algorithms.
The initial version of our 3DRS-based algorithm was described above in the
section entitled “Baseline 3DRS-based algorithm”. This algorithm was modified
with the proposed modifications. Luminance components of input frames (initially in
YUV 4:2:0) for ME were down-sampled twice per coordinate using an 8-tap filter,
and the resulting motion vectors had a precision of two pixels. The MCI stage
worked with the initial frames at Full HD resolution. To calculate the MAD match
metric, we used the luminance component of the frame and one of the chrominance
components: one chrominance component was used for forward ME and the other
for backward ME. In
backward ME, the wave-front scanning order was also switched to bottom-to-top
and right-to-left.
Table 15.1 presents the quality results of the FRC algorithm based on (a) the
initial version of the 3DRS-based algorithm, (b) a version modified using the wave-
front scanning order, and (c) a version modified using both the wave-front scanning
order and double-block units. The proposed modifications retain the quality of the
FRC output, except for a small drop when double-block processing is enabled.
Table 15.2 presents the overall results for performance in terms of speed. Column
8 of the table contains the mean execution time for the sum of forward and backward
ME for a pair of frames using five test video streams which were also used for quality
testing. Experiment E1 shows the parameters and speed of the initial 3DRS-based
algorithm as described in the section entitled “Baseline 3DRS-based algorithm”.
Experiment E2 shows a 19.7% drop (E2 vs. E1) when the diagonal wave-front
scanning order is applied instead of the meandering order used in E1. The proposed
slanted wave-front used in E3 and E4 (eight and 16 blocks in each task) minimises
the drop to 4% (E4 vs. E1). The proposed double-block processing increases the
speed by 12.6% (E5 vs. E4) relative to the version without double-block processing
and with the same number of blocks per task.
The speed performance of the proposed modifications for the 3DRS-based
algorithm was evaluated using a Samsung Galaxy S8 mobile phone based on the
MSM8998 chipset. The clock frequency was fixed within a narrow range for the
stability of the results. The MSM8998 chipset uses a big.LITTLE configuration with
four power-efficient and four powerful cores in clusters with different performance.
Threads performing ME algorithms were assigned to a powerful cluster of CPU
cores in experiments E1–E11.
Table 15.1 Comparison of initial quality and proposed modifications: baseline (A), wave-front
scan (B), wave-front + double block (C)

Video stream    A PSNR (dB)   B PSNR (dB)   C PSNR (dB)
Climb           42.81         42.81         42.77
Dvintsev12      27.56         27.55         27.44
Turn2           31.98         31.98         31.97
Bosphorus^a     43.21         43.21         43.20
Jockey^a        34.44         34.44         34.34
Average         36.00         36.00         35.94

^a Description of video can be found in Correa et al. (2016)
Occlusions are areas of a frame pair that are visible in only one of the two frames, so
that no matching position can be found in the other frame. For video sequences,
there are two types of occlusions – covered and uncovered areas (Bellers et al. 2007).
We consider only
the interpolation that uses two source frames nearest to the interpolated temporal
position. Normal parts of a frame have a good match – that means that locally the
image patches are very similar, and it is possible to use an appropriate patch from
either image or their mixture (in case of a correct motion vector). The main issue
with occlusion parts is the fact that motion vectors are unreliable in occlusions and
appropriate image patches exist in only one of the neighbouring frames. So two critical
decisions have to be made: which motion vector is right for a given occlusion, and
which frame is appropriate to interpolate from (if a covered area was detected, then
only the previous frame should be used; if an uncovered area was detected, then
only the next frame should be used). Figure 15.5 illustrates the occlusion problem in a
simplified 1D cut. In reality, the vectors are 2D and, in the case of real-time ME,
usually somewhat noisy.
To effectively implement block-wise interpolation (the MCI stage is described in
the next section), one has to estimate all vectors in the block grid of the interpolated
frame. Motion vectors should be corrected in occlusions. In general, there are two
main ways to fill in vectors in occlusions: spatial or temporal “propagation”, or
inpainting. Our solution uses temporal filling because its implementation is very
efficient.
Let the forward motion field D_{FW,N} be the set of motion vectors assigned to the
blocks lying in the block grid of frame F_N. These forward motion vectors point to
frame F_{N+1} from frame F_N. Motion vectors of the backward motion field D_{BW,N}
point to frame F_{N-1} from frame F_N.
For the detection of covered and uncovered areas in frame F_{N+1}, we need the motion
fields D_{FW,N} and D_{BW,N+2} generated from three consecutive frames (F_N, F_{N+1}, F_{N+2})
(Bellers et al. 2007). We should analyse the D_{BW,N+2} motion field to detect covered
areas in frame F_{N+1}: areas of F_{N+1} to which no motion vectors of D_{BW,N+2} point are
covered areas. Hence, motion vectors of D_{FW,N+1} in those covered areas may be
incorrect. Collocated inverted motion vectors from the D_{BW,N+1} motion field may be
used for the correction of incorrect motion in D_{FW,N+1} (this is the temporal filling
mentioned above). Motion field D_{FW,N} should be used to detect uncovered areas in
frame F_{N+1}. Collocated inverted motion vectors from D_{FW,N+1} in detected uncovered
areas may be used instead of potentially incorrect motion vectors of motion field
D_{BW,N+1}.
In our solution, for the interpolation of any frame F_{N+1+α} (α ∈ [0...1] is the phase of an
interpolated frame), we detect covered areas in F_{N+1}, obtaining CoverMap_{N+1}, and
detect uncovered areas in F_{N+2}, obtaining UncoverMap_{N+2}, as described above.
These maps simply indicate whether blocks of the frame belong to an occluded
area or not. We need two motion fields D_{FW,N+1} and D_{BW,N+2} between frames F_{N+1}
and F_{N+2} for the detection of those maps, and two more motion fields D_{BW,N+1} and
D_{FW,N+2}, between frames F_N, F_{N+1} and F_{N+2}, F_{N+3}, for the correction of motion
vectors in the found occlusions (see Fig. 15.5).
Basically, we calculate one of the predicted interpolated blocks with coordinates
(x, y) in frame F_{N+1+α} by using some motion vector (dx, dy) as follows (suppose that
the backward motion vector from F_{N+2} to F_{N+1} is used):
Fig. 15.5 1D visualisation of covered and uncovered areas for an interpolated frame F_{N+1+α}

Here, we mix the previous and next frames with proportion α equal to the phase. But
we need to determine the proper proportion for mixing the previous and next frames
in occlusions, because only the previous frame F_{N+1} should be used in covered areas
and only the next frame F_{N+2} in uncovered areas. So, we need to calculate a weight
α_adj for each block instead of using the phase α everywhere:
Pred(x, y) = F_{N+1}(x + α·dx, y + α·dy) · (1 - α_adj(x, y))
           + F_{N+2}(x - (1-α)·dx, y - (1-α)·dy) · α_adj(x, y).
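The per-pixel blend can be sketched as follows. This is a simplified Python sketch with integer coordinates and no bounds checking; the chapter uses bicubic interpolation for fractional positions, which is omitted here.

```python
def predict_pixel(f_prev, f_next, x, y, dx, dy, alpha, alpha_adj):
    """One motion-compensated sample of the interpolated frame at phase
    alpha, mixed with the occlusion-adjusted weight alpha_adj:
    0 in covered areas (previous frame only), 1 in uncovered areas
    (next frame only), and equal to alpha elsewhere.

    Frames are 2D lists indexed [y][x]; coordinates are rounded to
    integers for simplicity.
    """
    prev_sample = f_prev[y + round(alpha * dy)][x + round(alpha * dx)]
    next_sample = f_next[y - round((1 - alpha) * dy)][x - round((1 - alpha) * dx)]
    return (1 - alpha_adj) * prev_sample + alpha_adj * next_sample


f_prev, f_next = [[10, 20], [30, 40]], [[50, 60], [70, 80]]
normal = predict_pixel(f_prev, f_next, 0, 0, 0, 0, 0.5, 0.5)   # plain blend
covered = predict_pixel(f_prev, f_next, 0, 0, 0, 0, 0.5, 0.0)  # prev only
```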
α_adj should tend to 0 in covered areas and tend to 1 in uncovered areas; in other
areas, it should remain equal to α. In addition, it is necessary to use corrected motion
vectors in the occlusions. To calculate α_adj and obtain the corrected motion field
fixed_αD_{BW,N+2}, we do the following (details of the algorithm below are described
in Chappalli and Kim (2012)):
• Copy D_{BW,N+2} to fixed_αD_{BW,N+2}.
• (See the top part of Fig. 15.6 for details.) Look at where blocks from F_{N+2} move
in the interpolated frame F_{N+1+α} along the motion vectors of D_{BW,N+2}: the moved
position of a block with coordinates (x, y) in keyframe F_{N+2} is
(x + (1-α)·dx(x, y), y + (1-α)·dy(x, y)) in frame F_{N+1+α}. The overlapped area
between each block in the block grid of the interpolated frame and all moved
blocks from the keyframe can be found, and α_adj can be made proportional to it:

α_adj = α · (overlapped area of interpolated block) / (size of interpolated block).

If a collocated area in CoverMap_{N+1} was marked as a covered area for some block
in the interpolated frame, then (a) the block in the interpolated frame is marked as
COVER if the overlapped area of the block equals zero, and (b) the motion vector
from the collocated position of D_{BW,N+1} is copied to the collocated position of
fixed_αD_{BW,N+2} if the block was marked as COVER.
• (See the bottom part of Fig. 15.6 for details.) For D_{FW,N+1}, we look at where
blocks from F_{N+1} are moved in the interpolated frame. For blocks in the
interpolated frame which are collocated with an uncovered area in UncoverMap_{N+1},
we do the following:

(a) Calculate α_adj as:
α_adj = 1 - (1-α) · (overlapped area of interpolated block) / (size of interpolated block);
(b) Mark as UNCOVER blocks in the interpolated frame which have zero
overlapped area;
(c) Copy inverted motion vectors from collocated positions of D_{FW,N+2} to
collocated positions of fixed_αD_{BW,N+2} for all blocks marked as UNCOVER;
(d) Copy inverted motion vectors from collocated positions of D_{FW,N+1} for all
other blocks (which were not marked as UNCOVER).
Fig. 15.6 Example of obtaining fixed_αD_{BW,N+2}, α_adj and COVER/UNCOVER marking. Firstly,
covered occlusions are fixed (top part); secondly, uncovered occlusions are processed (bottom part)
Our main work in the OP stage was mostly to adapt and optimise the implementation
for an ARM-based SoC. The complexity of this stage is not as high as that of the
ME and MCI stages, because the input of the OP stage consists of block-wise
motion fields and all operations can be performed in a block-wise manner. Nevertheless,
a fully fixed-point implementation with NEON SIMD parallelisation was
needed.
In the description of the OP stage, we have focused only on the main ideas. In fact,
we have to use slightly more sophisticated methods to get CoverMap, UncoverMap,
α_adj, and fixed_αD_{BW,N+2}; these are required because of rather noisy input motion
fields or simply complex motion in a scene. On the other hand, additional details
would make the description even more difficult to follow.
Fig. 15.7 1D visualisation of (a) clustering and (b) applying motion vector candidates
Motion compensation is the final stage of the FRC algorithm that directly interpo-
lates pixel information. As already mentioned, we used interpolation based on earlier
patented work by Chappalli and Kim (2012). The algorithm samples two nearest
frames in positions determined by one or two motion vector candidates. Motion
vector candidates come from the motion clustering algorithm mentioned above. It is
possible that only one motion vector candidate will be found for some blocks,
because the collection for a block may contain identical motion vectors. Because all
our previous sub-algorithms working with motion vectors were block-based, the MCI
stage naturally uses motion hypotheses that are constant over an interpolated block.
Firstly, we form two predictors for the interpolated block with coordinates (x, y)
by using each motion vector candidate cd[i] = (dx[i](x, y), dy[i](x, y)), i = 1, 2, from
CD[K]_{α,N+2}:
Pred[i](x, y) = F_{N+1}(x + α·dx[i](x, y), y + α·dy[i](x, y)) · (1 - α_adj(x, y))
             + F_{N+2}(x - (1-α)·dx[i](x, y), y - (1-α)·dy[i](x, y)) · α_adj(x, y),
where we use bilinear interpolation to obtain the pixel-wise version of α_adj(x, y),
which was naturally block-wise in the OP stage.
Although we use the backward motion vector candidates cd[i], which point to
frame F_{N+1} from the collocated block with position (x, y) in frame F_{N+2} (see
Fig. 15.7a), for calculating the predictors we use them as if the motion vector
candidate started at position (x - (1-α)·dx[i](x, y), y - (1-α)·dy[i](x, y)) of frame
F_{N+2} (see Fig. 15.7b). Thus, those motion vector candidates pass through the
interpolated frame F_{N+1+α} exactly at position (x, y). This simplifies both the
clustering process and the MCI stage, because we do not need to keep any
“projected” motion vectors, and each interpolated block can be processed
independently during the MCI stage.
Although we do not use fractional-pixel motion vectors in the ME and OP stages,
they may appear in the MCI stage, for example, due to the operation
(1-α)·dx[i](x, y). In this case, we use bicubic interpolation, which gives a good
balance between speed and quality.
An interesting question is how to mix several predictors Pred[i](x, y). Basically, we
obtain a block of the interpolated frame F_{N+1+α} as a sum of a few predictors with
some weights:

F_{N+1+α}(x, y) = ( Σ_{i=1..p} w[i](x, y) · Pred[i](x, y) ) / ( Σ_{i=1..p} w[i](x, y) ),
where p is the number of motion vector candidates used in the block (two in our
solution, so in the worst case, block patches at four positions from two keyframes
have to be sampled). w[i](x, y) are pixel-wise mixing weights, i.e. reliabilities of the
predictors. As stated in Chappalli and Kim (2012), we can calculate w[i](x, y) as a
function of the difference between the patches which were picked out from the
keyframes when the predictors were calculated:
where l, u are small values from the [0...5] range.
f(e) must be inversely proportional to the argument, for example:
where v_ref is a background motion vector. For blocks in the interpolated frame which
were marked as COVER or UNCOVER, v_ref is the motion vector from the collocated
block of fixed_αD_{BW,N+2}, and ω = 0. For other, normal blocks, ω = 1.
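The normalised mixing itself is straightforward to sketch. This is a minimal Python sketch; the reliability weights w[i] are assumed to be given by the f(e) function of the text, which is not reproduced here.

```python
def mix_predictors(preds, weights):
    """Weighted mixing of p motion-compensated predictors (p = 2 in the
    chapter's solution):
        F(x, y) = sum_i w[i] * Pred[i](x, y) / sum_i w[i]
    `weights` are the per-pixel reliabilities of the predictors.
    """
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total


# A more reliable second predictor pulls the result towards its value.
mixed = mix_predictors([10.0, 20.0], [1.0, 3.0])
```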
The MCI stage has a high computational complexity, comparable to that of the ME
stage, because both algorithms require reading a few (extended) blocks from the
original frames per processed block. But in contrast to ME, each block of the
interpolated frame can be processed independently during MCI. This gives us the
opportunity to implement MCI on the GPU using OpenCL technology.
There are many situations in which the FRC problem is incredibly challenging and
sometimes plainly unsolvable. An important part of our FRC algorithm development
is detecting and correcting such corner cases. It is worth noting that detection and
correction are different tasks. When a corner case is detected, two possibilities
remain: the first is to fix the problems, and the second is to skip interpolation
(fallback). Fallback, in turn, can be done for the entire frame (global fallback) or
locally, only in those areas where problems arose (local fallback; Park et al. 2012).
We apply global fallback only: the nearest keyframe is placed instead of the
interpolated frame.
The following situations are handled in our solution:
1. Periodic textures. The presence of periodic textures in video frames adversely
affects ME reliability. Motion vectors can easily step over the texture period,
which causes severe interpolation artefacts. We have a detector of periodic
textures as well as a corrector for smoothing motion. The detector recognises
periodic areas in each keyframe. The corrector makes motion fields smooth in
detected areas after each 3DRS iteration. Regularisation is not applied in the
detected areas, and no fallback is applied.
2. Strong 1D-only features. Subpixel movements of long flat edges also confuse ME,
owing to different aliasing artefacts in neighbouring keyframes, especially in the
case of a fast and simple ME with one-pixel accuracy developed for real-time
operation. A detector and a corrector work in the same manner as for periodic
textures (detection of 1D features in keyframes and smoothing of the motion field
after each 3DRS iteration, with no regularisation in detected areas), with no fallback.
3. Change of scene brightness (fade in/out). ME’s match metric (MAD) and mixing
weights w in MCI are dependent on the absolute brightness of the image. We
detect global changes between neighbouring keyframes by a linear model using
histograms and correct one of the keyframes for the ME stage only. For relatively
small and global changes, such an adjustment works well. A global fallback strategy
is applied if the calculated parameters of the model are too large or the model is
incorrect.
4. Inconsistent motion/occlusions. When a foreground object is nonlinearly
deformed (a flying bird or quick finger movements), it is impossible to restore the
object correctly. The result of the interpolation looks like a mess of blocks, which
greatly spoils the subjective quality. We analyse CoverMap, UncoverMap, and the
motion fields, and obtain a reliability value for each block in an interpolated
frame. If the gathered reliability falls below some threshold for some set of
neighbouring interpolated blocks, we apply global fallback. A local fallback
strategy may also be applied by using the reliabilities, as in Park et al. (2012),
but the appearance of strong visual artefacts is still highly possible, although the
PSNR metric would be higher for local fallback.
The detectors mostly use pixel-wise operations on preliminarily downscaled
keyframes. All detectors either fit perfectly with SIMD operations or require little
computation. In our solution, the detectors work in a separate stage called the
preprocessing stage, except for inconsistent motion/occlusion detection, which is
performed during the OP stage. The correctors are implemented in the ME stage.
The target devices (Samsung Galaxy S8/S9/S20) use ARM 64-bit eight-core SoCs
with a big.LITTLE scheme and the following execution units: four power-efficient
processor cores (little CPUs), four high-performance processor cores (big CPUs),
and a powerful GPU. The stages of the FRC algorithm should be run on different
execution units simultaneously to achieve the highest number of frame interpolations
per second (IPS). This leads us to the idea of organising a pipeline of calculations
where the stages of the FRC algorithm run in parallel. The stages are connected by
buffers to hide random delays. The pipeline of FRC for 4× upconversion is shown
in Fig. 15.8.
As shown in Fig. 15.8, the following processing steps are performed simultaneously:
• Preprocessing of keyframe F_{N+6}
• Estimation of motion fields D_{FW,N+4} and D_{BW,N+5} using F_{N+4}, F_{N+5} and the
corresponding preprocessed data (periodic and 1D areas)
• Occlusion processing for interpolated frame F_{N+2+2/4}, where the inputs are
D_{BW,N+2}, D_{FW,N+2}, D_{BW,N+3}, and D_{FW,N+3}
• Motion-compensated interpolation of frame F_{N+2+1/4}, where the input is
fixed_{1/4}D_{BW,N+2} for phase α = 1/4
Note that the OP and MCI stages can be performed for multiple interpolated frames
between a pair of consecutive keyframes, and each time at least part of the
calculations will be different. In contrast, the preprocessing and ME stages are
performed only once per pair of consecutive keyframes and do not depend on the
number of interpolated frames.
In practice, our main use case is doubling the frame rate. We optimised each stage
and assigned the execution units (shown in Fig. 15.8) so that their durations became
close for some “quite difficult” scene in the case of doubling the frame rate.
15.8 Results
We were able to develop an FRC algorithm of commercial-level quality that could
work on battery-powered mobile devices. The algorithm uses only standard modules
of the SoC (CPU + GPU), which means upgrades or fixes are quite easy. This
algorithm has been integrated into two modes of the Samsung Galaxy camera:
• Super Slow Motion (SSM) – offline 2× conversion of HD (720p) video from
480 to 960 FPS; target performance: >80 IPS
• Motion Photo – real-time 2× conversion, playback of a 3-second FHD (1080p)
video clip stored within a JPEG file; target performance: >15 IPS
Table 15.4 Performance of the FRC solution on target mobile devices in various use cases, in
interpolations per second (IPS)

Device       Motion Photo (FHD), IPS   Super Slow Motion, IPS
Galaxy S8    81                        104
Galaxy S9    94                        116
The average conversion speed performance for HD and FHD videos can be seen
in Table 15.4. The difference is small because in the case of FHD, the ME stage and
part of the OP stage were performed at reduced resolution.
Detailed measurements of various FRC stages for 2× upconversion are shown in
Table 15.5. Here, Samsung Galaxy Note 10 based on Snapdragon 855 SM8150 was
used. ME and OP stages were performed on two big cores each, the preprocessing
stage (Prep.) uses two little cores, and the MC stage was performed by GPU. Any
fallbacks were disabled, so all frames were interpolated fully. The quality and speed
performance of the described FRC algorithms were checked for various video
streams, for which a detailed description can be found in Correa et al. (2016). We
used only the first 100 frames for time measurements and all frames for quality
measurements. All video sequences were downsized to HD resolution (Super Slow
Motion use case). To get ground truth, we halved the frame rate of the video
streams and then upconverted them back to the initial frame rate with the FRC.
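The evaluation protocol above (drop every second frame, re-create it by interpolation, compare with the dropped original) can be sketched like this; `evaluate_frc` and `interpolate_mid` are hypothetical names, and the standard PSNR definition is used:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Standard peak signal-to-noise ratio in dB."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def evaluate_frc(frames, interpolate_mid):
    """Drop every odd frame, re-create it with interpolate_mid(prev, next),
    and return per-frame PSNR against the dropped originals."""
    scores = []
    for i in range(1, len(frames) - 1, 2):
        recreated = interpolate_mid(frames[i - 1], frames[i + 1])
        scores.append(psnr(frames[i], recreated))
    return scores
```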
It can be seen that as the magnitude and complexity of the motion in a scene
increase, the computational cost of the algorithm grows and the quality of the
interpolation decreases. The ME stage is the most variable and the most
computationally expensive.
For scenes with moderate movement, the algorithm shows satisfactory quality
and attractive speed performance. Even in video with fast movement, the
quality can be good in most regions. In the middle part of Fig. 15.9, an interpolated
frame from the “Jockey” video is depicted. The displacement of the background between
the depicted keyframes is near 50 pixels (see Fig. 15.10 to better understand the position
of occlusion areas and the movement of objects). An attentive reader can see quite a few
Fig. 15.9 Interpolation quality. Top: keyframe #200; middle: interpolated frame #201
(PSNR 35.84 dB for the luma component); bottom: keyframe #202
Fig. 15.10 Visualisation of movement. Identical features of fast-moving background are connected
by green lines in keyframes. Identical features of almost static foreground are connected by red lines
in keyframes
defects in the interpolated frame. The most unpleasant are those which appear
regularly on a foreground object (look at the horse's ears); these are perceived as
unnatural flickering of the foreground object. Artefacts that regularly arise in
occlusions are perceived as a halo around a moving foreground object. Artefacts
that appear only in individual frames (not regularly) hardly spoil the subjective
quality of the video.
References
Al-Kadi, G., Hoogerbrugge, J., Guntur, S., Terechko, A., Duranton, M., Eerenberg, O.: Meandering
based parallel 3DRS algorithm for the multicore era. In: Proceedings of the IEEE International
Conference on Consumer Electronics (2010). https://doi.org/10.1109/ICCE.2010.5418693
Bellers, E.B., van Gurp, J.W., Janssen, J.G.W.M., Braspenning, R., Wittebrood, R.: Solving
occlusion in frame-rate up-conversion. In: Digest of Technical Papers International Conference
on Consumer Electronics, pp. 1–2 (2007)
Chappalli, M.B., Kim, Y.-T.: System and method for motion compensation using a set of candidate
motion vectors obtained from digital video. US Patent 8,175,163 (2012)
Chi, C., Alvarez-Mesa, M., Juurlink, B.: Parallel scalability and efficiency of HEVC parallelization
approaches. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1827–1838 (2012)
Cordes, C.N., de Haan, G.: Invited paper: key requirements for high quality picture-rate conversion.
Dig. Tech. Pap. 40(1), 850–853 (2009)
Correa, G., Assuncao, P., Agostini, L., da Silva Cruz, L.A.: Appendix A: Common test conditions
and video sequences. In: Complexity-Aware High Efficiency Video Coding, pp. 125–158.
Springer International Publishing, Cham (2016)
de Haan, G., Biezen, P., Huijgen, H., Ojo, O.A.: True-motion estimation with 3-D recursive search
block matching. IEEE Trans. Circuits Syst. Video Technol. 3(5), 368–379 (1993)
Fluegel, S., Klussmann, H., Pirsch, P., Schulz, M., Cisse, M., Gehrke, W.: A highly parallel sub-pel
accurate motion estimator for H.264. In: Proceedings of the IEEE 8th Workshop on Multimedia
Signal Processing, pp. 387–390 (2006)
Lertrattanapanich, S., Kim, Y.-T.: System and method for motion vector collection based on
K-means clustering for motion compensated interpolation of digital video. US Patent
8,861,603 (2014)
Park, S.-H., Ahn, T.-G., Park, S.-H., Kim, J.-H.: Advanced local fallback processing for motion-
compensated frame rate up-conversion. In: Proceedings of 2012 IEEE International Conference
on Consumer Electronics (ICCE), pp. 467–468 (2012)
Pohl, P., Anisimovsky, V., Kovliga, I., Gruzdev, A., Arzumanyan, R.: Real-time 3DRS motion
estimation for frame-rate conversion. Electron. Imaging (13), 1–5 (2018). https://doi.org/10.2352/ISSN.2470-1173.2018.13.IPAS-328
Chapter 16
Approaches and Methods to Iris
Recognition for Mobile
16.1 Introduction
The modern smartphone is not a simple phone but a device which has access to,
or contains, a huge amount of personal information. Most smartphones can
perform payment operations through services such as Samsung Pay, Apple Pay, Google
Pay, etc. Thus, phone unlock protection and authentication for payment and for
access to secure folders and files are required. Among all approaches to authenticating
users of mobile devices, the most suitable are knowledge-based and biometric
methods. Knowledge-based methods ask for something the user knows (a PIN,
password, or pattern). Biometric methods refer to the use of distinctive
anatomical and behavioral characteristics (fingerprints, face, iris, voice, etc.) for
automatically recognizing a person. Today, hundreds of millions of smartphone
users around the world praise the convenience and security provided by biometrics
(Das et al. 2018).
The first commercially successful biometric authentication technology for mobile
devices was fingerprint recognition. Although fingerprint-based authentication
shows high distinctiveness, it still has drawbacks (Daugman 2006). The iris has
several important advantages in comparison with other biometric traits (Corcoran
et al. 2014): the iris image capturing procedure is contactless, and iris recognition
can be considered a more secure and convenient authentication method, especially
for mobile devices.
A. M. Fartukov (*)
Samsung R&D Institute Rus (SRR), Moscow, Russia
e-mail: a.fartukov@samsung.com
G. A. Odinokikh · V. S. Gnatyuk
Independent Researcher, Moscow, Russia
e-mail: g.odinokikh@gmail.com; vitgracer@gmail.com
This chapter is dedicated to an iris recognition solution for mobile devices.
Section 16.2 describes the iris as a recognition object, gives a brief review of the
conventional iris recognition pipeline, and formulates the main challenges and
requirements for iris recognition on mobile devices. In Sect. 16.3, the proposed iris
recognition solution for mobile devices is presented; special attention is paid to
interaction with the user and the capturing system, and the iris camera control
algorithm is described. Section 16.4 contains a brief description of the developed
iris feature extraction and matching algorithm; testing results and a comparison with
state-of-the-art iris feature extraction and matching algorithms are also provided.
In Sect. 16.5, limitations of iris recognition are discussed, and several approaches
which allow these limitations to be mitigated are described.
The iris is a highly protected, internal organ of the eye, which allows contactless
capturing (Fig. 16.1). The iris is unique for every person, even for twins. The
uniqueness of the iris is in its texture pattern that is determined by melanocytes
(pigment cells) and circular and radial smooth muscle fibers (Tortora and Nielsen
2010). Although the iris is stable over a lifetime, it is not a constant object, due to
permanent changes of pupil size. The iris consists of muscle tissue comprising a
sphincter muscle that causes the pupil to contract and a group of dilator muscles that
cause it to dilate (Fig. 16.2). This pupil-size variation is one of the main sources of
intra-class variation, which should be considered in the development of an iris
recognition algorithm (Daugman 2004).
The iris is a highly informative biometric trait (Daugman 1993), which is why iris
recognition provides high recognition accuracy and reliability.
A conventional iris recognition system includes the following main steps:
• Iris image acquisition
• Quality checking of the obtained iris image
• Iris area segmentation
• Feature extraction (Fig. 16.3).
Fig. 16.2 Responses of the pupil to light of varying brightness: (a) bright light; (b) normal light; (c)
dim light. (Reproduced with permission from Tortora and Nielsen (2010))
The recognition algorithm should handle input iris images captured under ambient
illumination, which varies over a range from 10⁻⁴ Lux at night to 10⁵ Lux under
direct sunlight. The changing
capturing environment also assumes the randomness of the locations of the light
sources, along with their unique characteristics, which creates a random distribution
of the illuminance in the iris area. The mentioned factors can cause a deformation of
the iris texture due to a change in the pupil size, making users squint and degrading
the overall image quality (Fig. 16.4).
Moreover, various factors related to interaction with the mobile device and the user
should be considered:
• The user could wear glasses or contact lenses.
• The user could try to perform the authentication attempt while walking, or simply
suffer from a hand tremor, causing the device to shake.
• The user could hold the device too far from or too close to themselves, so that the
iris falls outside the camera's depth of field.
• The iris area could be occluded by eyelids and eyelashes if the user's eye is not
opened enough.
All these and many other factors affect the quality of the input iris images, thus
influencing the accuracy of the recognition (Tabassi 2011).
Mobile iris recognition is intended for daily use. Thus, it must provide easy user
interaction and a high recognition speed, which is determined by the computational
complexity. There is a trade-off between computational complexity and power
consumption: recognition should be performed at the best camera frame rate and,
at the same time, should not consume much power. Recognition should be performed
in a special secure (trusted) execution environment, which provides limited
computational resources – a restricted number of available processor cores and
computational hardware accelerators, reduced processor core frequencies, and a
limited amount of memory (ARM Security Technology 2009). These facts should be
taken into account in the early stages of biometric algorithm development.
402 A. M. Fartukov et al.
Each quality assessment measure is performed immediately after the information for its
evaluation becomes available. This allows us not to waste computational resources
(and hence energy) on processing data unsuitable for further processing, and to provide
feedback to the user as early as possible. It should be noted that the structure of the
algorithm depicted in Fig. 16.5 is a modification of the algorithm proposed by
Odinokikh et al. (2018): the special quality buffer was replaced with the straightforward
structure shown in Fig. 16.5. All the other parts of the algorithm (except the feature
extraction and matching stages) and the quality assessment checks were used without
modification.
Besides interacting with a user, the mobile recognition system can communicate
with the iris capturing hardware and additional sensors such as an illuminometer,
rangefinder, etc. The obtained information can be used to control the parameters of
the iris capturing hardware and to adapt the algorithm to constantly changing
environmental conditions. The scheme summarizing the described approach is
presented in Fig. 16.6. Details can be found in Odinokikh et al. (2019a).
Along with the possibility of controlling the iris capturing hardware from the
recognition algorithm itself, a separate algorithm for controlling iris camera
parameters can be applied. The purpose of such an algorithm is to provide fast
correction of the sensor's shutter speed (also known as exposure time), gain, and/or
parameters of the active illuminator to obtain iris images suitable for recognition.
We propose a two-stage algorithm for automatic camera parameter adjustment
that offers fast exposure adjustment on the basis of a single shot, with further
iterative camera parameter refinement.
Many of the existing automatic exposure algorithms have been developed to
obtain optimal image quality in difficult environmental conditions. For the most
complicated scenes, a significant number of these algorithms have drawbacks:
poor exposure estimation, complexity of region-of-interest detection, and others.
The first stage relies on the mean sample value (MSV) of the five-bin image
brightness histogram h:

MSV = ( Σ_{i=0}^{4} (i + 1) hi ) / ( Σ_{i=0}^{4} hi ),

μ = 1 / (1 + e^{−cp}),

where hi is the number of pixels in bin i, c is a constant, and μ relates the
normalized exposure time p to the MSV.
Fig. 16.7 The experimental dependency between an exposure time and mean sample value
The optimal normalized exposure time p can be obtained from the value μ, which
empirically determines the optimal MSV value that allows the quality assessment
checks described below in this section to be passed successfully.
Since the exposure time varies in the (0; Emax] interval, the suboptimal exposure
time E is obtained as:
E = E0 (p + 1) / (p0 + 1),
where E0 is the exposure time of the captured scene, and p and p0 are normalized
exposure times calculated with the optimal MSV μ and the captured scene MSV μ0,
respectively.
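The fast stage can be sketched as follows. This sketch assumes the extracted sigmoid lost a minus sign (μ = 1/(1 + e^(−cp))), that the MSV is normalized to (0, 1), and an illustrative value for the constant c; all function names are hypothetical:

```python
import math
import numpy as np

def msv(image):
    """Mean sample value over a 5-bin brightness histogram,
    normalized here to [0, 1] (an assumption of this sketch)."""
    h, _ = np.histogram(image, bins=5, range=(0, 256))
    raw = sum((i + 1) * h[i] for i in range(5)) / h.sum()  # raw MSV in [1, 5]
    return (raw - 1.0) / 4.0

def normalized_exposure(mu, c=4.0):
    """Invert mu = 1 / (1 + exp(-c * p)) to recover p; mu must be in (0, 1),
    and c = 4.0 is an assumed constant."""
    return -math.log(1.0 / mu - 1.0) / c

def suboptimal_exposure(e0, mu_opt, mu_scene, c=4.0):
    """One-shot exposure guess E = E0 * (p + 1) / (p0 + 1)."""
    p = normalized_exposure(mu_opt, c)
    p0 = normalized_exposure(mu_scene, c)
    return e0 * (p + 1.0) / (p0 + 1.0)
```

If the scene MSV already equals the optimal MSV, the formula leaves the exposure unchanged, as expected for a one-shot correction.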
If the MSV lies outside the confidence zone, μ0 ∈ [μmin; μmax], we should first make a
“blind” guess at a correct exposure time. This is done by subtracting or adding the
predefined coefficient Eδ to the current exposure E0 several times until the MSV
falls between μmin and μmax. The μmin and μmax values are determined from the
experimental dependency between the exposure time and the normalized MSV.
If there is no room for further exposure adjustment, we stop and do not adjust the
scene anymore, because the sensor is probably blinded or covered by something.
Once the initial exposure time guess E is found, we try to further refine the
capture parameters.
Table 16.1 The iris image quality difference between competing auto-exposure approaches

Approach                      Full frame     Eye region
Global exposure adjustment    perfect        underexposed
Proposed                      overexposed    perfect
The key idea of the second stage is the construction of a mask to precisely adjust the
camera parameters to the face region brightness in order to obtain the most accurate
and fast iris recognition. For the recognition task, it is important to pay more attention
to eye regions; it is not enough to provide the optimal full-frame visual quality
delivered by well-known global exposure adjustment algorithms (Battiato et al.
2009). Table 16.1 illustrates this drawback.
In order to obtain a face mask, a database of indoor and outdoor image sequences
for 16 users was collected in the following manner. Each user tries to pass the
enrollment procedure on a mobile device in normal conditions, and the
corresponding image sequence is collected; this sequence is used for enrollment
template creation. Next, the user tries to verify themselves, and the corresponding
image sequence (probe sequence) is also collected. All frames from probe sequences
are used for probe (probe template) creation. After that, dissimilarity scores (Hamming
distances) between the user's enrolled template and each probe are calculated
(Odinokikh et al. 2017); in other words, only genuine comparisons are performed.
Each score is compared with the predefined threshold HDthresh, and the vector of
labels Y ≔ {yi}, i = 1…Nscores is created, where Nscores is the number of verification
attempts. The vector of labels represents the dissimilarity between probes and
enrollment templates:
yi = 0 if HDi > HDthresh,
yi = 1 if HDi < HDthresh.
Each label yi of Y shows if the person was successfully recognized at the frame i.
After the calculation of the vector Y, we downscale and reshape each frame of
probe sequences to get the feature vector xi and construct the matrix of feature
vectors X:
X = (x1; x2; …; xNscores).
Using the feature matrix X and the vector of labels Y, we calculate the logistic
regression coefficients of each feature, where the coefficients represent the signifi-
cance of each image pixel for the successful user verification (Fig. 16.9).
As a result, the most significant pixels emphasize the eye regions and the periocular
area. This method avoids handcrafted mask construction: it automatically finds the
regions that are important for correct recognition. The obtained mask values are
used for weighted MSV estimation, where each input pixel has a significance score
that determines its weight in the image histogram.
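A minimal sketch of the mask-learning procedure, assuming plain gradient-descent logistic regression in place of whatever solver the authors used; `learn_pixel_mask` and `weighted_msv` are hypothetical names:

```python
import numpy as np

def learn_pixel_mask(frames, labels, lr=0.1, steps=500):
    """Fit per-pixel logistic-regression weights: frames is (N, H, W),
    labels[i] = 1 if verification frame i matched (HD_i < threshold).
    Returns |coefficients| reshaped to (H, W) as a significance mask."""
    n, h, w = frames.shape
    x = frames.reshape(n, h * w).astype(np.float64)
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)   # standardize pixels
    y = np.asarray(labels, dtype=np.float64)
    coef = np.zeros(h * w)
    for _ in range(steps):                               # gradient descent
        p = 1.0 / (1.0 + np.exp(-x @ coef))
        coef -= lr * x.T @ (p - y) / n
    mask = np.abs(coef).reshape(h, w)
    return mask / (mask.max() + 1e-8)

def weighted_msv(image, mask):
    """MSV where each pixel contributes with its significance weight."""
    h = np.zeros(5)
    bins = np.minimum((image.astype(int) * 5) // 256, 4)
    np.add.at(h, bins.ravel(), mask.ravel())             # weighted histogram
    return float(sum((i + 1) * h[i] for i in range(5)) / h.sum())
```

On synthetic data where a single pixel determines the match label, the learned mask correctly peaks at that pixel.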
After mask calculation, the main goal is to set camera parameters that make the MSV
fall into a predefined interval which leads to the optimal recognition accuracy. To get
the interval boundaries, each pair (HDi, MSVi) is mapped onto the coordinate plane
(Fig. 16.10a), and points with HD > HDthresh are removed to exclude probes which
corresponded to rejection during verification (Fig. 16.10b). The optimal image
quality interval center is defined as:
p = argmax_x f(x).
Here f(x) is the distribution density function, and p is the optimal MSV value
corresponding to the image with the most appropriate quality for recognition.
Visually, the plotted distribution allows three significant clusters to be distinguished:
noise, points with low pairwise distance density, and points with high pairwise
distance density. To find the optimal image quality interval borders, we partition the
plotted point pairs into three clusters using the K-means algorithm (Fig. 16.10c).
The densest “red” cluster, with the minimal pairwise point distance, is used to
determine the optimal quality interval borders. The calculated interval can be
represented as [p − δ, p + δ], where the δ parameter is determined by the cluster borders.
Fig. 16.10 Visualization of the cluster construction procedure: (a) all (HDi, MSVi) pairs are
mapped onto a coordinate plane; (b) points with HD > HDthresh are excluded and the distribution
density is plotted; (c) the obtained clusters
During the iterative refinement stage, the exposure E and gain G are increased or
decreased in steps EΔ and GΔ:

E = E + EΔ,  G = G + GΔ;

E = E − EΔ,  G = G − GΔ.
This iterative technique allows the optimal camera parameter adjustment to be
performed under different illumination conditions.
The proposed algorithm was tested as a part of the iris recognition system
(Odinokikh et al. 2018). This system operates in a mode where the False Acceptance
Rate (FAR) = 10⁻⁷. Testing was performed using a mobile phone based on the
Exynos 8895 and equipped with NIR iris capturing hardware. Testing involved
10 users (30 verification attempts were made for each user). The enrollment proce-
dure was performed in an indoor environment, while the verification procedure was
done under harsh incandescent lighting in order to prove that the proposed auto-
exposure algorithm improves recognition.
To estimate algorithm performance, we use two parameters: FRR (False Rejec-
tion Rate), a value which shows how many genuine comparisons were rejected
incorrectly, and recognition time, which determines the time interval between the
start of the verification procedure and successful verification. The results of com-
parison are shown in Table 16.2.
In accordance with the obtained results, if the fast parameter adjustment stage of the
algorithm is removed, the algorithm still adjusts the camera parameters in an
optimal way, but the exposure adaptation time increases significantly because
of the absence of suboptimal exposure values. If the iterative adjustment stage is
removed, the adaptation time is small, but the face illumination is estimated in a
non-optimal way, and the number of false rejections increases. Therefore, it is
crucial to use both the fast and the iterative steps to reduce both recognition time
and false rejection rate.
A more detailed description of the proposed method with minor modifications
and comparison with well-known algorithms for camera parameter adjustment can
be found in Gnatyuk et al. (2019).
Table 16.2 Performance comparison for the proposed algorithm of automatic camera parameter
adjustment

Value                 No adjustment   First stage only        Second stage only        Proposed method
                                      (fast exposure adj.)    (iterative refinement)   (two stages)
FRR (%)               82.3            20.0                    1.6                      1.6
Recognition time (s)  7.5             0.15                    1.5                      0.15
As mentioned in Sect. 16.2, the final stage of iris recognition includes the construction
of the iris feature vector. This procedure extracts the iris texture information relevant
to its subsequent comparison. The input to feature vector construction is the
normalized iris image (Daugman 2004), as depicted in Fig. 16.11. Since the iris
region can be occluded by eyelids, eyelashes, reflections, and other artifacts, such
areas contain information irrelevant to subsequent iris texture matching and are not
used for feature extraction. It should be noted that feature extraction and matching
are considered together because they are closely connected to each other.
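The normalized iris image is produced by Daugman's rubber-sheet mapping, which unrolls the ring between the pupil and iris boundaries onto a fixed-size rectangle. A simplified sketch (nearest-neighbour sampling, linear interpolation between the two circle boundaries; the function and parameter names are illustrative):

```python
import numpy as np

def normalize_iris(image, pupil, iris, out_h=32, out_w=256):
    """Daugman-style rubber-sheet mapping: sample the ring between the
    pupil circle (xp, yp, rp) and iris circle (xi, yi, ri) onto a grid."""
    xp, yp, rp = pupil
    xi, yi, ri = iris
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for col in range(out_w):
        theta = 2.0 * np.pi * col / out_w
        # boundary points on the pupil and iris circles at this angle
        x0, y0 = xp + rp * np.cos(theta), yp + rp * np.sin(theta)
        x1, y1 = xi + ri * np.cos(theta), yi + ri * np.sin(theta)
        for row in range(out_h):
            r = (row + 0.5) / out_h          # radial position in [0, 1]
            x = (1 - r) * x0 + r * x1
            y = (1 - r) * y0 + r * y1
            out[row, col] = image[int(round(y)) % image.shape[0],
                                  int(round(x)) % image.shape[1]]
    return out
```

The fixed output size is what makes the representation invariant to pupil dilation and to the eye's distance from the camera, which the network below exploits.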
Many feature extraction methods, considering iris texture patterns at different
levels of detail, have been proposed (Bowyer et al. 2008). A significant leap in
reliability and quality in the field was achieved with the adoption of deep neural
networks (DNNs), and there have since been numerous attempts to apply DNNs to
iris recognition. In particular, Gangwar and Joshi (2016) introduced
DeepIrisNet as a model combining all successful deep learning techniques
known at the time; the authors thoroughly investigated the obtained features and
produced a strong baseline for subsequent work. An approach with two fully
convolutional networks (FCNs) and a modified triplet loss function was recently
proposed in Zhao and Kumar (2017): one of the networks is used for iris template
extraction, whereas the second produces the accompanying mask. Fuzzy image
enhancement combined with simple linear iterative clustering and a self-organizing
map (SOM) neural network was proposed in Abate et al. (2017); despite the
method being designed for iris recognition on a mobile device, real-time performance
was not achieved. Another recent work by Zhang et al. (2018), declared suitable
for the mobile case, proposes a two-headed (iris and periocular) CNN with
fusion of embeddings. Thus, no optimal solution for iris feature extraction
and matching has been presented in published papers.
Fig. 16.12 Proposed model scheme of iris feature extraction and matching (Odinokikh et al.
2019c)
This section briefly describes the iris feature extraction and matching method
presented in Odinokikh et al. (2019c).
The proposed method is a CNN designed to utilize the advantages of the
normalized iris image as an invariant, both low- and high-level representations of
discriminative features, and information about the iris area and pupil dilation. It
contains iris feature extraction and matching parts trained together (Fig. 16.12).
It is known that shallow layers in CNNs are responsible for the extraction of
low-level textural information, while a high-level representation is achieved with
depth. The basic elements of the shallow feature extraction block and their relations
are depicted in Fig. 16.12. High-level (deep) feature representation is performed by
convolution block #2: the feature maps coming from block #1 are concatenated by
channels and passed through it. Concatenation is meaningful at this stage because of
the invariance property of the normalized iris image. The output vector FVdeep
reflects the high-level representation of discriminative features and is assumed to
handle complex nonlinear distortions of the iris texture caused by the changing
environment.
Match score calculation is performed on FVdeep, the shallow feature vector FVsh,
and additional information (FVenv) about the iris area and pupil dilation, using the
variational inference technique. The depth-wise separable convolution block, which
is memory- and computationally efficient, was chosen as the basic structural element
for the entire network architecture. Together with the lightweight CNN architecture,
this allows the model to operate in real time on a device with highly limited
computational power.
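A depthwise separable convolution factors a standard convolution into a per-channel spatial filter followed by a 1×1 channel-mixing step, reducing the parameter count from Cout·C·k² to C·k² + Cout·C. A minimal NumPy sketch (explicit loops for clarity, not speed; this is not the network's actual layer):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """x: (C, H, W); dw_kernels: (C, k, k), one spatial filter per channel;
    pw_weights: (C_out, C), the 1x1 mixing. 'Valid' padding, stride 1."""
    c, h, w = x.shape
    k = dw_kernels.shape[1]
    oh, ow = h - k + 1, w - k + 1
    dw = np.zeros((c, oh, ow))
    for ch in range(c):                 # depthwise: per-channel spatial filter
        for i in range(oh):
            for j in range(ow):
                dw[ch, i, j] = np.sum(x[ch, i:i+k, j:j+k] * dw_kernels[ch])
    # pointwise: 1x1 convolution mixes the channels
    return np.tensordot(pw_weights, dw, axes=([1], [0]))
```

For example, with C = 32, Cout = 64, k = 3, a standard convolution needs 18,432 weights while the separable version needs 2,336, which is the kind of saving that matters inside a trusted execution environment.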
The following methods were selected as the state of the art: FCN + Extended Triplet
Loss (ETL) (Zhao and Kumar 2017) and DeepIrisNet (Gangwar and Joshi 2016). It
should also be noted that the results of the lightweight CNN proposed in Zhang et al.
(2018) were obtained on the same datasets used for testing the proposed method;
for detailed results, please refer to Zhang et al. (2018). Many other methods were
excluded from consideration due to their computational complexity and unsuitability
for mobile applications.
Three different datasets were used for training and evaluation (CASIA 2015):
CASIA-Iris-M1-S2 (CMS2), CASIA-Iris-M1-S3 (CMS3), and one more (Iris-
Mobile, IM) collected privately using a mobile device with an embedded NIR
camera. The latter was collected to simulate real authentication scenarios of a mobile
device user: images were captured under highly changing illumination, both indoors
and outdoors, with and without glasses. More detailed specifications of the datasets
are given in Table 16.3.
The recognition accuracy results are presented in Table 16.4. ROC curves
obtained for comparison with state-of-the-art methods on the CMS2, CMS3, and IM
datasets are depicted in Fig. 16.13.
The proposed method outperforms the chosen state-of-the-art ones on all the
datasets. After the division into subsets, it became impossible to estimate the FNMR
at FMR = 10⁻⁷ for the CMS2 and CMS3 datasets, since the number of comparisons
in the test sets did not exceed ten million. So, yet another experiment was run to
estimate the performance of the proposed model on those datasets without training
on them: the model trained on IM was evaluated on the entire CMS2 and CMS3
datasets in order to
Fig. 16.13 ROC curves obtained for comparison with state-of-the-art methods on different
datasets: (a) CASIA-Iris-M1-S2 (CMS2); (b) CASIA-Iris-M1-S3 (CMS3); (c) Iris-Mobile (IM)
get the FNMR at FMR = 10⁻⁷ (CrossDB). The obtained results demonstrate the high
generalization ability of the model.
A mobile device equipped with the Qualcomm Snapdragon 835 CPU was used
for estimating the overall execution time for these iris feature extraction and
matching methods. It should be noted that a single core of the CPU was used. The
results are summarized in Table 16.4.
Thus, the proposed algorithm showed robustness to the high variability of iris
representation caused by changes in the environment and by physiological features
of the iris itself. The benefit of using shallow textural features, feature fusion, and
variational inference as a regularization technique was also investigated in the
context of the iris recognition task. Despite the fact that the approach is based on
deep learning, it is capable of operating in real time on a mobile device, in a secure
environment with substantially limited computational power.
In conclusion, we would like to shed light on several open issues of iris recognition
technology. The first issue is related to the limitations on the use of iris recognition
in extreme environmental conditions. In near darkness, the pupil dilates and masks
almost all of the iris texture area. Outdoors in direct sunlight, the user cannot open
their eyes wide enough, and the iris texture can be masked by reflections. The second
issue is related to wearing glasses: the use of active illumination leads to glares on
the glasses, which can mask the iris area. In this case, the user has to change the
position of the mobile device for successful recognition, or take off the glasses,
which can be inconvenient, especially in daily usage. Thus, the root of the majority
of issues is in obtaining enough iris texture area for reliable recognition. It has been
observed that at least 40% of the iris area should be visible to achieve the given
accuracy level.
To mitigate the mentioned issues, several approaches, apart from changes to the
iris capturing hardware, have been proposed. One of them is the well-known multi-modal
recognition (e.g., fusion of iris and face (or periocular area) recognition), as described
in Ross et al. (2006). In this section, only approaches related to the eye itself are
considered.
The first approach is based on the idea of multi-instance iris recognition, which
performs the fusion of the two irises and uses their relative spatial information and
several factors that describe the environment. The iris is often significantly occluded
by the eyelids, eyelashes, highlights, etc. This happens mainly because of a complex
environment in which the user cannot open the eyes wide enough (bright
illumination, windy weather, etc.), which makes the multi-instance approach
reasonable when the input image contains both eyes at the same time.
The final dissimilarity score is calculated as a logistic function of the form:
Score = 1 / (1 + exp(−Σ_{i=0}^{6} wi Mi)),

Δdnorm = |dLEFT − dRIGHT| / (dLEFT + dRIGHT);

davg = (dLEFT + dRIGHT) / 2.
AOImin and AOImax are the minimum and maximum values of the area of intersection
between the two binary masks Mprobe and Menroll in each pair:

AOI = Σ Mc / (Mc_height · Mc_width),  Mc = Mprobe · Menroll (element-wise).
ΔNDmin and ΔNDmax are the minimum and maximum values of the normalized
distance ΔND between the centres of the pupil and the iris:

ΔND = sqrt( (NDXprobe − NDXenroll)² + (NDYprobe − NDYenroll)² ),

NDX = (xP − xI) / RI,  NDY = (yP − yI) / RI,
where (xP, yP) are the coordinates of the centre of the pupil, and RP is its radius;
(xI, yI) are the coordinates of the centre of the iris, and RI is its radius (Fig. 16.14).
ΔPIRavg represents the difference in pupil dilation between the enrollment and the
probe, based on the ratio PIR = RP/RI:

ΔPIRavg = ( |PIR_enroll_LEFT − PIR_probe_LEFT| + |PIR_enroll_RIGHT − PIR_probe_RIGHT| ) / 2.
The seven weight coefficients wi were obtained by training the classifier on
genuine and impostor matches on a small subset of the data. In case only
one of the two feature vectors is extracted, all the pairs of values used in the weighted
sum are assumed to be equal.
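The fusion score and the first two factors can be sketched directly from the formulas above, assuming the sign dropped from the extracted exponent is a minus; the function names are illustrative:

```python
import math

def fusion_score(weights, features):
    """Score = 1 / (1 + exp(-sum_i w_i * M_i)) over the seven fusion
    factors (per-eye dissimilarities, AOI, ΔND, ΔPIR, ...)."""
    z = sum(w * m for w, m in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def delta_d_norm(d_left, d_right):
    """Normalized difference of the per-eye dissimilarity scores."""
    return abs(d_left - d_right) / (d_left + d_right)

def d_avg(d_left, d_right):
    """Average of the per-eye dissimilarity scores."""
    return (d_left + d_right) / 2.0
```

A zero-valued weighted sum yields Score = 0.5, the midpoint of the logistic; the trained weights shift genuine pairs below and impostor pairs above the decision threshold.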
The proposed method allowed the threshold for the visible iris area to be decreased
from 40% to 29% during verification/identification without any loss in accuracy or
performance, which leads to a decrease in the overall FRR (in other words, improved
user convenience).
The first step of the proposed procedure (besides a successful pass of verification)
is an additional quality check of the probe feature vector FVprobe, which can be
considered as input for the update procedure. It should be noted that the thresholds
used for the quality check differ between the enrollment and verification modes.
In particular, the normalized eye opening (NEO) value, described below, is set to
0.5 for enrollment and 0.2 for verification; the non-masked area (NMA) of the iris
(not occluded by any noise) is set to 0.4 and 0.29 for the enrollment and the probe,
respectively (in the case of multi-instance iris recognition).
The NEO value reflects the eye opening condition and is calculated as:

$$NEO = \frac{E_l + E_u}{2 R_I}.$$
Here E_l and E_u are the lower and upper eyelid positions, determined as the vertical distances from the pupil centre (P_c) to the eyelids (Fig. 16.15). One of the methods for
eyelid position detection is presented in Odinokikh et al. (2019b). It is based on
applying multi-directional 2D Gabor filtering and is suitable for running on mobile
devices.
Additional checking of the probe feature vector consists of applying enrollment
thresholds for NEO and NMA values associated with FVprobe.
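The NEO formula and the mode-dependent thresholds described above can be sketched as follows (the function names and the `mode` strings are illustrative; the threshold values are the ones given in the text):

```python
def normalized_eye_opening(e_lower, e_upper, iris_radius):
    """NEO: eyelid-to-eyelid opening relative to the iris diameter."""
    return (e_lower + e_upper) / (2.0 * iris_radius)

def passes_quality_check(neo, nma, mode):
    """Apply the NEO/NMA thresholds; 'enroll' is stricter than 'verify'."""
    thresholds = {"enroll": {"neo": 0.5, "nma": 0.4},
                  "verify": {"neo": 0.2, "nma": 0.29}}
    t = thresholds[mode]
    return neo >= t["neo"] and nma >= t["nma"]

# A widely opened eye with little occlusion passes even the stricter
# enrollment thresholds, as required for an update candidate.
neo = normalized_eye_opening(20.0, 24.0, 40.0)  # (20 + 24) / 80 = 0.55
print(passes_quality_check(neo, nma=0.45, mode="enroll"))  # True
```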
The second step consists of checking the possibility to update the enrolled
template. The structure of the enrolled template is depicted in Fig. 16.16.
All FVs in the enrolled template are divided into three groups: initially enrolled
FVs obtained during enrollment of a new user and two groups corresponding to FVs
obtained at high illumination and low illumination conditions respectively. The latter
groups are initially empty and receive new FVs through appending or replacing. It is
important to note that the group of initially enrolled FVs is not updated to prevent
possible degradation of recognition accuracy.
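The three-group template structure can be sketched as a small container (the group size limit is an assumption for illustration; the text only says the sizes are predefined):

```python
from dataclasses import dataclass, field

@dataclass
class EnrolledTemplate:
    """Enrolled template: a fixed initial group plus two adaptive groups."""
    initial: list                              # FVs from enrollment; never updated
    lpir: list = field(default_factory=list)   # FVs captured at high illumination
    hpir: list = field(default_factory=list)   # FVs captured at low illumination
    max_group_size: int = 5                    # assumed limit, not given in the text

template = EnrolledTemplate(initial=["fv1", "fv2", "fv3", "fv4", "fv5"])
print(len(template.initial), len(template.lpir), len(template.hpir))  # 5 0 0
```

Keeping `initial` immutable mirrors the safeguard against accuracy degradation described above.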
Each FV in the enrolled template contains information about the corresponding
PIR value and average mutual dissimilarity score:
$$d_{am}(FV_k) = \frac{1}{N} \sum_{i \in \{1,\dots,N\},\; i \neq k} d(FV_k, FV_i).$$
Here N is the current number of FVs in the enrolled template. The d_am(FV_k) values are updated after each update cycle.
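Given a precomputed matrix of pairwise dissimilarity scores, the average mutual dissimilarity can be sketched as (names are illustrative; normalization by N follows the formula above):

```python
def average_mutual_dissimilarity(k, scores):
    """d_am(FV_k): mean dissimilarity of FV_k to every other FV in the template.

    scores: square matrix with scores[k][i] = d(FV_k, FV_i).
    """
    n = len(scores)
    return sum(scores[k][i] for i in range(n) if i != k) / n

scores = [[0.0, 0.2, 0.4],
          [0.2, 0.0, 0.6],
          [0.4, 0.6, 0.0]]
print(round(average_mutual_dissimilarity(0, scores), 6))  # (0.2 + 0.4) / 3 = 0.2
```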
Let FV^E_1, …, FV^E_M denote the set of M initially enrolled FVs. If

$$PIR(FV_{probe}) < \min\left( PIR(FV^E_1), \dots, PIR(FV^E_M) \right),$$

then FVprobe is considered as a candidate for the update of the group of FVs obtained at high illumination (the lPIR group). Otherwise, if

$$PIR(FV_{probe}) > \max\left( PIR(FV^E_1), \dots, PIR(FV^E_M) \right),$$

then FVprobe is considered as a candidate for the update of the corresponding group of FVs obtained at low illumination (the hPIR group).
The lPIR and hPIR groups have predefined maximum sizes. If the selected group is not full, FVprobe is added to it. Otherwise, the following rules are applied. If PIR(FVprobe) is the minimal value among all FVs inside the lPIR group, then FVprobe replaces the FV with the minimal PIR value in the lPIR group. Similarly, if PIR(FVprobe) is the maximal value among all FVs inside the hPIR group, then FVprobe replaces the FV with the maximal PIR value in the hPIR group.
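The group selection and the full-group replacement rules can be sketched as follows (names are illustrative, and only PIR values are stored here for brevity; real FVs would carry the full feature data):

```python
def update_group(template_pirs, probe_pir, lpir, hpir, max_size):
    """Route FV_probe into the lPIR or hPIR group and apply the full-group rules.

    template_pirs: PIR values of the initially enrolled FVs (never updated).
    lpir / hpir:   PIR values stored in the two adaptive groups (mutated in place).
    Returns the name of the updated group, or None if no rule fired here.
    """
    if probe_pir < min(template_pirs):    # small pupil -> high illumination
        group, name, extreme = lpir, "lPIR", min
    elif probe_pir > max(template_pirs):  # large pupil -> low illumination
        group, name, extreme = hpir, "hPIR", max
    else:
        return None
    if len(group) < max_size:             # group not full: simply append
        group.append(probe_pir)
        return name
    worst = extreme(group)                # group full: replace only if more extreme
    if (extreme is min and probe_pir < worst) or (extreme is max and probe_pir > worst):
        group[group.index(worst)] = probe_pir
        return name
    return None  # here the D / d_am rules described below take over

lpir, hpir = [], []
print(update_group([0.35, 0.40], 0.30, lpir, hpir, max_size=2))  # lPIR
print(lpir)  # [0.3]
```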
Otherwise, the FV closest to FVprobe in terms of PIR value is searched for inside the selected group. Let FVi denote this closest feature vector, and FVi−1 and FVi+1 its neighbors in terms of PIR value.
Then, the following values are calculated:

$$D = \left| PIR(FV_i) - PIR_{avg} \right| - \left| PIR\left(FV_{probe}\right) - PIR_{avg} \right|,$$

$$PIR_{avg} = \frac{1}{2}\left( PIR(FV_{i-1}) + PIR(FV_{i+1}) \right).$$
If D exceeds the predefined threshold, then FVprobe replaces FVi. This simple rule makes it possible to obtain a group of FVs that are distributed uniformly in terms of PIR values. Otherwise, an additional rule is applied: if dam(FVprobe) < dam(FVi), then FVprobe replaces FVi, where dam(FVprobe) and dam(FVi) are the average mutual dissimilarity scores calculated as shown above. This aids in selecting the FV that exhibits maximum similarity with the other FVs in the enrolled template.
16 Approaches and Methods to Iris Recognition for Mobile 419
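The replacement decision for a full group can be sketched as follows (the absolute values in D are assumed from the reconstruction above, and the `threshold` default is illustrative; the text only says it is predefined):

```python
def should_replace(pir_i, pir_prev, pir_next, pir_probe, d_am_i, d_am_probe,
                   threshold=0.02):
    """Decide whether FV_probe should replace FV_i, the group's closest FV by PIR.

    pir_prev / pir_next: PIR values of FV_i's neighbors within the group.
    """
    pir_avg = (pir_prev + pir_next) / 2.0
    d = abs(pir_i - pir_avg) - abs(pir_probe - pir_avg)
    if d > threshold:           # probe sits closer to the neighbors' midpoint
        return True
    return d_am_probe < d_am_i  # fallback: keep the more template-consistent FV

# FV_i drifted from the midpoint of its neighbors; the probe restores uniformity.
print(should_replace(pir_i=0.48, pir_prev=0.40, pir_next=0.50,
                     pir_probe=0.45, d_am_i=0.3, d_am_probe=0.35))  # True
```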
To prove the efficiency of the proposed methods, a dataset that emulates user interaction with a mobile device was collected privately. It is a set of two-second video sequences, each of which is a real enrollment/verification attempt. It should be noted that no such datasets are publicly available.
The dataset was collected using a mobile device with an embedded NIR camera.
It contains videos captured at different distances, in indoor (IN) and outdoor
(OT) environments, with and without glasses. During dataset capture, the following illumination ranges and conditions were set up: (i) three levels for the indoor samples (0–30, 30–300, and 300–1000 Lux) and (ii) a random value in the range 1–100K Lux for the outdoor samples (data were collected on a sunny day with different arrangements of the device relative to the sun). A detailed description of the dataset can be found in Table 16.7. The Iris Mobile (IM) dataset used in Sect. 16.4 was also randomly sampled from this dataset.
The testing procedure for proposed multi-instance iris recognition considers each
video sequence as a single attempt. The procedure contains the following steps:
1. All video sequences captured in indoor conditions and without glasses (IN&NG)
are used to produce the enrollment template. The enrolled template is successfully
created if the following conditions are satisfied:
(a) At least 5 FVs were constructed for each eye.
(b) At least 20 out of 30 frames were processed.
2. All video sequences are used to produce probes. The probe is successfully created
if at least one FV was constructed.
3. Each enrollment template is compared with all probes except the ones generated from the same video. Thus, a pairwise matching table of the dissimilarity scores for the performed comparisons is created.
4. The obtained counts of successfully created enrolled templates and probes, together with the pairwise matching table, are used to calculate FTE, FTA, FNMR, FMR, and EER as described in Dunstone and Yager (2009).
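The success conditions from steps 1 and 2 can be sketched directly (function names are illustrative):

```python
def enrollment_succeeded(fvs_per_eye, frames_processed):
    """Enrollment conditions from the testing procedure:
    at least 5 FVs per eye and at least 20 of the 30 frames processed."""
    return all(n >= 5 for n in fvs_per_eye.values()) and frames_processed >= 20

def probe_succeeded(num_fvs):
    """A probe is valid if at least one FV was constructed."""
    return num_fvs >= 1

print(enrollment_succeeded({"left": 6, "right": 5}, frames_processed=24))  # True
print(enrollment_succeeded({"left": 6, "right": 4}, frames_processed=24))  # False
```

Counting how often these checks fail is what yields the FTE and FTA rates in step 4.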
Fig. 16.17 Verification rate values obtained at each update cycle for Gabor-based feature extrac-
tion and matching (Odinokikh et al. 2017)
The recognition accuracy results are presented in Table 16.6. The proposed
CNN-based feature extraction and matching method described in Sect. 16.4 is
compared with the one described in Odinokikh et al. (2017) as a part of the whole
iris recognition pipeline. The latter method is based on Gabor wavelets with an
adaptive phase quantization technique (denoted as GAQ in Table 16.6).
Both methods were tested in three different verification environments: indoors
without glasses (IN&NG), indoors with glasses (IN&G), and outdoors without
glasses (OT&NG). The enrollment was always carried out only indoors without
glasses, and, for this reason, the value of FTE = 3.15% is the same for all the cases. The target FMR = 10⁻⁷ was set in every experiment.
Applying different matching rules was also investigated. The proposed multi-
instance fusion showed advantages over the other compared rules (Table 16.5).
To simulate template adaptation in a real-life scenario, the following testing
procedure is proposed. The subset containing video sequences captured both in
indoor and outdoor environmental conditions for 28 users is formed from the
whole dataset. For each user, one video sequence captured in indoor conditions
without glasses is randomly selected for generating the initial enrolled template. All
other video sequences (both indoor and outdoor) are used for generating probes.
After generation, the probes are split into two subsets: one for performing genuine attempts during verification (the genuine subset) and another for the enrolled template update (the update subset). It should be noted that all probes are used for performing impostor attempts during verification.
On each update cycle, one probe from the update subset is randomly selected, and
the enrolled template update is started. The updated enrolled template is involved in
performance testing after every update cycle. Figure 16.17 shows the verification rate values obtained at different update cycles for the proposed method with the Gabor-based feature extraction and matching algorithm of Odinokikh et al. (2017). It can be seen that the proposed adaptation scheme increases the verification rate by up to 6% after nine update cycles.
Portions of the research in this chapter use the CASIA-Iris-Mobile-V1.0 dataset
collected by the Chinese Academy of Sciences’ Institute of Automation
(CASIA 2015).
References
Abate, A.F., Barra, S., D’Aniello, F., Narducci, F.: Two-tier image features clustering for iris
recognition on mobile. In: Petrosino, A., Loia, V., Pedrycz, W. (eds.) Fuzzy Logic and Soft
Computing Applications. Lecture Notes in Artificial Intelligence, vol. 10147, pp. 260–269.
Springer International Publishing, Cham (2017)
ARM Security Technology. Building a secure system using TrustZone Technology. ARM Limited
(2009)
Battiato, S., Messina, G., Castorina, A.: Exposure correction for imaging devices: an overview. In: Lukac, R. (ed.) Single-Sensor Imaging: Methods and Applications for Digital Cameras, pp. 323–349. CRC Press, Boca Raton (2009)
Bowyer, K.W., Hollingsworth, K., Flynn, P.J.: Image understanding for iris biometrics: a survey.
Comput. Vis. Image Underst. 110(2), 281–307 (2008)
Chinese Academy of Sciences’ Institute of Automation (CASIA). Casia-iris-mobile-v1.0 (2015).
Accessed on 4 October 2020. http://biometrics.idealtest.org/CASIA-Iris-Mobile-V1.0/CASIA-
Iris-Mobile-V1.0.jsp
Corcoran, P., Bigioi, P., Thavalengal, S.: Feasibility and design considerations for an iris acquisi-
tion system for smartphones. In: Proceedings of the 2014 IEEE Fourth International Conference
on Consumer Electronics, Berlin (ICCE-Berlin), pp. 164–167 (2014)
Das, A., Galdi, C., Han, H., Ramachandra, R., Dugelay, J.-L., Dantcheva, A.: Recent advances in
biometric technology for mobile devices. In: Proceedings of the IEEE 9th International Con-
ference on Biometrics Theory, Applications and Systems (2018)
Daugman, J.: High confidence visual recognition of persons by a test of statistical independence.
IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1148–1161 (1993)
Daugman, J.: Recognising persons by their iris patterns. In: Li, S.Z., Lai, J., Tan, T., Feng, G.,
Wang, Y. (eds.) Advances in Biometric Person Authentication. SINOBIOMETRICS 2004.
Lecture Notes in Computer Science, vol. 3338, pp. 5–25. Springer, Berlin, Heidelberg (2004)
Daugman, J.: Probing the uniqueness and randomness of iris codes: results from 200 billion iris pair
comparisons. Proc. IEEE. 94(11), 1927–1935 (2006)
Daugman, J., Malhas, I.: Iris recognition border-crossing system in the UAE (2004). Accessed on
4 October 2020. https://www.cl.cam.ac.uk/~jgd1000/UAEdeployment.pdf
Dunstone, T., Yager, N.: Biometric System and Data Analysis: Design, Evaluation, and Data
Mining. Springer-Verlag, Boston (2009)
Fujitsu Limited. Fujitsu develops prototype smartphone with iris authentication (2015). Accessed
on 4 October 2020. https://www.fujitsu.com/global/about/resources/news/press-releases/2015/
0302-03.html
Galbally, J., Gomez-Barrero, M.: A review of iris anti-spoofing. In: Proceedings of the 4th
International Conference on Biometrics and Forensics (IWBF), pp. 1–6 (2016)
Gangwar, A.K., Joshi, A.: DeepIrisNet: deep iris representation with applications in iris recognition
and cross sensor iris recognition. In: Proceedings of 2016 IEEE International Conference on
Image Processing (ICIP), pp. 2301–2305 (2016)
Gnatyuk, V., Zavalishin, S., Petrova, X., Odinokikh, G., Fartukov, A., Danilevich, A., Eremeev, V.,
Yoo, J., Lee, K., Lee, H., Shin, D.: Fast automatic exposure adjustment method for iris
recognition system. In: Proceedings of 11th International Conference on Electronics, Computers
and Artificial Intelligence (ECAI), pp. 1–6 (2019)
IrisGuard UK Ltd. EyePay Phone (IG-EP100) specification (2019). Accessed on 4 October 2020.
https://www.irisguard.com/node/57
ISO/IEC 19794-6:2011: Information technology – Biometric data interchange formats – Part 6: Iris
image data (2011), Annex B (2011)
Korobkin, M., Odinokikh, G., Efimov, Y., Solomatin, I., Matveev, I.: Iris segmentation in challenging conditions. Pattern Recognit. Image Anal. 28, 652–657 (2018)
Nourani-Vatani, N., Roberts, J.: Automatic camera exposure control. In: Dunbabin, M., Srinivasan,
M. (eds.) Proceedings of the Australasian Conference on Robotics and Automation, pp. 1–6.
Australian Robotics and Automation Association, Sydney (2007)
Odinokikh, G., Fartukov, A., Korobkin, M., Yoo, J.: Feature vector construction method for iris
recognition. In: International Archives of the Photogrammetry, Remote Sensing and Spatial
Information Science. XLII-2/W4, pp. 233–236 (2017). Accessed on 4 October 2020. https://doi.
org/10.5194/isprs-archives-XLII-2-W4-233-2017
Odinokikh, G.A., Fartukov, A.M., Eremeev, V.A., Gnatyuk, V.S., Korobkin, M.V., Rychagov, M.
N.: High-performance iris recognition for mobile platforms. Pattern Recognit. Image Anal. 28,
516–524 (2018)
Odinokikh, G.A., Gnatyuk, V.S., Fartukov, A.M., Eremeev, V.A., Korobkin, M.V., Danilevich, A.
B., Shin, D., Yoo, J., Lee, K., Lee, H.: Method and apparatus for iris recognition. US Patent
10,445,574 (2019a)
Odinokikh, G., Korobkin, M., Gnatyuk, V., Eremeev, V.: Eyelid position detection method for
mobile iris recognition. In: Strijov, V., Ignatov, D., Vorontsov, K. (eds.) Intelligent Data
Processing. IDP 2016. Communications in Computer and Information Science, vol. 794, pp.
140–150. Springer-Verlag, Cham (2019b)
Odinokikh, G., Korobkin, M., Solomatin, I., Efimov, I., Fartukov, A.: Iris feature extraction and
matching method for mobile biometric applications. In: Proceedings of International Conference
on Biometrics, pp. 1–6 (2019c)
Prabhakar, S., Ivanisov, A., Jain, A.K.: Biometric recognition: sensor characteristics and image
quality. IEEE Instrum. Meas. Soc. Mag. 14(3), 10–16 (2011)
Rathgeb, C., Uhl, A., Wild, P.: Iris segmentation methodologies. In: Iris Biometrics. Advances in
Information Security, vol. 59. Springer-Verlag, New York (2012)
Rattani, A.: Introduction to adaptive biometric systems. In: Rattani, A., Roli, F., Granger, E. (eds.)
Adaptive Biometric Systems. Advances in Computer Vision and Pattern Recognition, pp. 1–8.
Springer, Cham (2015)
Ross, A., Jain, A., Nandakumar, K.: Handbook of Multibiometrics. Springer-Verlag, New York
(2006)
Samsung Electronics. Galaxy tab iris (sm-t116izkrins) specification (2016). Accessed on 4 October
2020. https://www.samsung.com/in/support/model/SM-T116IZKRINS/
Samsung Electronics. How does the iris scanner work on Galaxy S9, Galaxy S9+, and Galaxy
Note9? (2018). Accessed on 4 October 2020. https://www.samsung.com/global/galaxy/what-is/
iris-scanning/
Sun, Z., Tan, T.: Iris anti-spoofing. In: Marcel, S., Nixon, M.S., Li, S.Z. (eds.) Handbook of
Biometric Anti-Spoofing, pp. 103–123. Springer-Verlag, London (2014)
Tabassi, E.: Large scale iris image quality evaluation. In: Proceedings of International Conference
of the Biometrics Special Interest Group, pp. 173–184 (2011)
Tortora, G.J., Nielsen, M.: Principles of Human Anatomy, 12th edn. John Wiley & Sons, Hoboken
(2010)
Zhang, Q., Li, H., Sun, Z., Tan, T.: Deep feature fusion for iris and periocular biometrics on mobile
devices. IEEE Trans. Inf. Forensics Secur. 13(11), 2897–2912 (2018)
Zhao, Z., Kumar, A.: Towards more accurate iris recognition using deeply learned spatially
corresponding features. In: Proceedings of IEEE International Conference on Computer Vision
(ICCV), pp. 3829–3838 (2017)
Index
DIBR, 75
problems during view generation
  disocclusion area, 76
  symmetric vs. asymmetric, 77
  temporal consistency, 77
  toed-in configuration, 77, 78
  virtual view synthesis, 76
Structural similarity (SSIM), 41, 46
Structure tensor, 29, 30
Sub-pixel convolutions, 36, 41, 48
Sunlight spot effect, 225, 226, 230–232
Superpixels, 87
Super-resolution (SR)
  arrays, 4
  Bayer SR, 21–26
  BCCB matrices, 12
  block diagonalization in SR problems, 9, 10, 14
  circulant matrices, 11
  colour filter arrays, 5
  data fidelity, 7
  data-agnostic approach, 7
  filter-bank implementation, 16–18
  HR image, 1
  image formation model, 1–3, 5
  image interpolation applications, 2
  interpolation problem, 2
  LR images, 1
  machine learning, 6
  mature technology, 1
  on mobile device, 45
  modern research, 7
  optimization criteria, 6
  perfect shuffle matrix, 10, 15
  problem conditioning, 7
  problem formulation, fast implementation, 8–9
  reconstruction, 1
  sensor-shift, 3
  single- vs. multi-frame, 2
  single-channel SR, 19
  SISR, 6 (see also Single image super-resolution (SISR))
  symmetry properties, 19–21
  warping and blurring parameters, 35
Super-resolution multiple-degradations (SRMD) network, 53
Support vector machine (SVM), 194, 195, 208, 291
Symmetric stereo view rendering, 77
Synthetic data acquisition, 247
System on chip (SoC), 115

T
Temporal propagation, 105
Texture codes, 196
3D recursive search (3DRS) algorithm, 375
3D TVs
  on active shutter glasses, 62
  cause, eye fatigue, 62, 63, 65, 66
  consumer electronic shows, 59
  interest transformation
    extra cost, 60
    inappropriate moment, 59
    live TV, 61
    multiple user scenario, 61
    picture quality, 61
    uncomfortable glasses, 60
  parallax (see Parallax)
  prospective technologies, 59, 60
  smart TVs, 59
  stereo content (see Stereo content reproduction)
Thumbnail creation, 353, 354
Tiling slideshow, 220
Toed-in camera configuration, 77, 78
Tone-to-colour mapping, 229
Transposed convolution, 36
TV programmes, 194
TV screens, 373
2D-3D semi-automatic video conversion
  advantages and disadvantages, 82
  background inpainting step (see Background inpainting)
  causes, 81
  depth propagation from key frame (see Depth propagation)
  motion vector estimation (see Motion vectors)
  steps, 83
  stereo content quality, 111, 112
  stereo rig, 81, 82
  video analysis and key frame detection, 84–86
  view rendering, 110, 111
  virtual reality headsets, 81

U
Unpaired super-resolution, 55
Unsupervised contrastive learning, 254
User authentication, 267
User data collection, 269, 271–274
User interfaces, 351, 353, 354

W
Walsh-Hadamard (W-H) filters, 88
Warp-blur-down-sample model, 4

Z
Zero filling (ZF), 304, 306
Zero-shot super-resolution, 55